Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generative AI application generations are quantitatively measured with mathematical metrics as well as AI-assisted quality and safety metrics. Metrics are defined as evaluators; built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:
- Evaluators, which generate scores individually or when used together with the evaluate API.
- The evaluate API, which evaluates a dataset or a target application using built-in or custom evaluators.
Source code | Package (PyPI) | API reference documentation | Product documentation | Samples
Install the Azure AI Evaluation SDK for Python with pip:
pip install azure-ai-evaluation
If you want to track results in AI Studio, install the remote extra:
pip install azure-ai-evaluation[remote]
Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.
Built-in evaluators are out-of-the-box evaluators provided by Microsoft:
| Category | Evaluator class |
|---|---|
| Performance and quality (AI-assisted) | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, RetrievalEvaluator |
| Performance and quality (NLP) | F1ScoreEvaluator, RougeScoreEvaluator, GleuScoreEvaluator, BleuScoreEvaluator, MeteorScoreEvaluator |
| Risk and safety (AI-assisted) | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, IndirectAttackEvaluator, ProtectedMaterialEvaluator |
| Composite | QAEvaluator, ContentSafetyEvaluator |
For more in-depth information on each evaluator definition and how it's calculated, see Evaluation and monitoring metrics for generative AI.
import os
from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# AI assisted safety evaluator
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
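Composite evaluators from the table above follow the same pattern. As a minimal sketch, assuming QAEvaluator accepts the same AzureOpenAI model_config and the usual single-turn inputs shown above, a combined quality evaluation might look like:

from azure.ai.evaluation import QAEvaluator

# Composite quality evaluator (sketch): aggregates several AI-assisted and NLP
# quality metrics over a single query/response pair.
qa_evaluator = QAEvaluator(model_config)
result = qa_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital and largest city.",
    ground_truth="The capital of Japan is Tokyo."
)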
Built-in evaluators are great out of the box to start evaluating your application's generations. However, you can also build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class-based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any(word in response for word in self._blocklist)
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(response="The capital of Japan is Tokyo.")
The package provides an evaluate API which can be used to run multiple evaluators together to evaluate generative AI application responses.
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${outputs.response}"
            }
        }
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    azure_ai_project=azure_ai_project,
    # Optionally provide an output path to dump a json of metric summary, row level data and metric and studio URL
    output_path="./evaluation_results.json"
)
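The object returned by evaluate is a dictionary. As a sketch, assuming it exposes the aggregate metrics, the row-level data, and a studio link under keys such as metrics, rows, and studio_url (these key names are an assumption and may vary by SDK version), you could inspect it like this:

import json

# Inspect the evaluation output (key names are assumptions; see note above)
print(json.dumps(result.get("metrics", {}), indent=2))  # aggregated metrics
for row in result.get("rows", [])[:3]:                  # first few row-level results
    print(row)
print(result.get("studio_url"))                         # AI Studio link, if tracking was enabled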
For more details, refer to Evaluate on test dataset using evaluate().
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
The above code snippet refers to the askwiki application in this sample. For more details, refer to Evaluate on a target.
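In general, the target passed to evaluate is any callable that accepts the columns of a data row as keyword arguments and returns a dictionary; the returned keys can then be referenced as ${outputs.<key>} in the column mapping. A minimal sketch of such a target, reusing the same hypothetical call_to_your_application placeholder that appears in the simulator callback below, might look like:

# Hypothetical target: receives dataset columns (here "queries") as keyword
# arguments and returns output columns consumed via ${outputs.<key>}.
def my_app_target(queries: str, **kwargs) -> dict:
    # call_to_your_application is a placeholder; substitute your own application call
    response = call_to_your_application(queries, [], None)
    return {"response": response, "context": "retrieved context goes here"}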
Simulators allow users to generate synthetic data using their application. The simulator expects the user to have a callback method that invokes their AI application; the integration between your AI application and the simulator happens at the callback method. Here's what a sample callback looks like:
from typing import Any, Dict, List, Optional

async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # Get the last message from the user
    latest_message = messages_list[-1]
    query = latest_message["content"]
    # Call your endpoint or AI application here
    # response should be a string
    response = call_to_your_application(query, messages_list, context)
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": "",
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
Simulator initialization and invocation look like this:

import asyncio
import os

from azure.ai.evaluation.simulator import Simulator

model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}

custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=[
        [
            "What should I know about the public gardens in the US?",
        ],
        [
            "How do I simulate data against LLMs",
        ],
    ],
    max_conversation_turns=2,
))
with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
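Because to_eval_qr_json_lines writes query/response pairs, the resulting file can be fed straight back into the evaluate API. A minimal sketch, reusing the relevance_evaluator defined earlier and assuming the default query/response column names line up, might be:

# Sketch: score the simulated conversations with the evaluate API.
# Assumes simulator_output.jsonl contains "query" and "response" columns,
# as written by to_eval_qr_json_lines above.
sim_eval_result = evaluate(
    data="simulator_output.jsonl",
    evaluators={"relevance": relevance_evaluator},
)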
from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=3,
        target=callback
    )
)
print(outputs.to_eval_qr_json_lines())
For more details about the simulator, visit the following links:
In the following section you will find examples; more examples can be found here.
Please refer to troubleshooting for common issues.
This library uses the standard logging library for logging. Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO level. Detailed DEBUG level logging, including request/response bodies and unredacted headers, can be enabled on a client with the logging_enable argument.
See full SDK logging documentation with examples here.
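As a minimal sketch of that setup: the standard-library side is shown below, while passing logging_enable=True happens on the specific client you construct (which classes in this package accept it is an assumption to verify against the logging documentation linked above).

import logging
import sys

# Route Azure SDK logging to stdout at DEBUG level.
azure_logger = logging.getLogger("azure")
azure_logger.setLevel(logging.DEBUG)
azure_logger.addHandler(logging.StreamHandler(stream=sys.stdout))

# Request/response bodies and unredacted headers are only emitted when
# logging_enable=True is passed to the client making the HTTP calls, e.g.
# client = SomeServiceClient(..., logging_enable=True)  # hypothetical client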
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Changelog

- Removed the [remote] extra. This is no longer needed when tracking results in Azure AI Studio.
- Fixed AttributeError: 'NoneType' object has no attribute 'get' while running the simulator with 1000+ results.
- Added azure-ai-inference as a dependency.
- The parallel parameter has been removed from composite evaluators: QAEvaluator, ContentSafetyChatEvaluator, and ContentSafetyMultimodalEvaluator. To control evaluator parallelism, you can now use the _parallel keyword argument, though please note that this private parameter may change in the future.
- query_response_generating_prompty_kwargs and user_simulator_prompty_kwargs have been renamed to query_response_generating_prompty_options and user_simulator_prompty_options in the Simulator's call method.
- Fixed an issue where the output_path parameter in the evaluate API did not support relative paths.
- Outputs of the adversarial simulators are of type JsonLineList, and the helper function to_eval_qr_json_lines now outputs context from both user and assistant turns, along with category if it exists in the conversation.
- Users can now set the environment variable AZURE_TOKEN_REFRESH_INTERVAL to refresh the token more frequently, preventing expiration and ensuring continuous operation of the simulation.
- Fixed an issue in ContentSafetyEvaluator that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the '_parallel' boolean keyword argument during class initialization.
- Fixed the evaluate function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or otherwise difficult to process. Such values are ignored fully, so the aggregated metric of [1, 2, 3, NaN] would be 2, not 1.5.
- Introduced the environment variable AI_EVALS_DISABLE_EXPERIMENTAL_WARNING to disable the warning message for experimental features.
- Changed the randomization pattern for AdversarialSimulator such that there is an almost equal number of adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the AdversarialSimulator outputs. Previously, for 200 max_simulation_results a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, users will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
- For the DirectAttackSimulator, the prompt templates used to generate simulated outputs for each adversarial harm category are no longer in a randomized order by default. To override this behavior, pass randomize_order=True when you call the DirectAttackSimulator, for example:
, for example:adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
outputs = asyncio.run(
adversarial_simulator(
scenario=scenario,
target=callback,
randomize_order=True
)
)
- Added GroundednessProEvaluator, which is a service-based evaluator for determining response groundedness.
- Groundedness detection in the non-adversarial Simulator via query/context pairs, for example:

import importlib.resources as pkg_resources

package = "azure.ai.evaluation.simulator._data_sources"
resource_name = "grounding.json"

custom_simulator = Simulator(model_config=model_config)
conversation_turns = []
with pkg_resources.path(package, resource_name) as grounding_file:
    with open(grounding_file, "r") as file:
        data = json.load(file)
for item in data:
    conversation_turns.append([item])
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=conversation_turns,
    max_conversation_turns=1,
))
- Renamed the environment variable PF_EVALS_BATCH_USE_ASYNC to AI_EVALS_BATCH_USE_ASYNC.
- RetrievalEvaluator now requires a context input in addition to query in single-turn evaluation.
- RelevanceEvaluator no longer takes context as an input. It now only takes query and response in single-turn evaluation.
- FluencyEvaluator no longer takes query as an input. It now only takes response in single-turn evaluation.
- ADVERSARIAL_INDIRECT_JAILBREAK has been removed; invoking IndirectJailbreak or XPIA should be done with IndirectAttackSimulator.
- Simulator and AdversarialSimulator previously had to_eval_qa_json_lines and now have to_eval_qr_json_lines. Where to_eval_qa_json_lines had {"question": <user_message>, "answer": <assistant_message>}, to_eval_qr_json_lines now has {"query": <user_message>, "response": <assistant_message>}.
- The simulator now works with gpt-4o models using the json_schema response format.
- Fixed an issue where the evaluate API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
- Fixed a failure in the evaluate API when trace.destination is set to none.
- Improved error messages for the evaluate API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
- GroundednessEvaluator now supports query as an optional input in single-turn evaluation. If query is provided, a different prompt template will be used for the evaluation.
- To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the gpt_ prefix: CoherenceEvaluator, RelevanceEvaluator, FluencyEvaluator, GroundednessEvaluator, SimilarityEvaluator, RetrievalEvaluator. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.
- The following evaluators will now have a new key in their result output including the LLM reasoning behind the score. The new key follows the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.

| Evaluator | New max_token for generation |
|---|---|
| CoherenceEvaluator | 800 |
| RelevanceEvaluator | 800 |
| FluencyEvaluator | 800 |
| GroundednessEvaluator | 800 |
| RetrievalEvaluator | 1600 |

- Improved the error message for storage access permission issues to provide clearer guidance for users.
- Removed the numpy dependency. All NaN values returned by the SDK have been changed from numpy.nan to math.nan.
- credential is now required to be passed in for all content safety evaluators and ProtectedMaterialsEvaluator; DefaultAzureCredential will no longer be chosen if a credential is not passed.
- Fixed "Forbidden" errors by adding logic to re-fetch the token in the exponential retry logic used to retrieve the RAI Service response.
- Added a type field to AzureOpenAIModelConfiguration and OpenAIModelConfiguration.
- The following evaluators now support conversation as an alternative input to their usual single-turn inputs (a sketch of the conversation shape follows this list): ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, ProtectedMaterialEvaluator, IndirectAttackEvaluator, CoherenceEvaluator, RelevanceEvaluator, FluencyEvaluator, GroundednessEvaluator.
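As a sketch of that conversation input (the message shape mirrors the callback example earlier in this document; the exact accepted fields are an assumption to check against the evaluator documentation):

# Sketch of a multi-turn "conversation" input; field names mirror the callback
# example above and are illustrative assumptions.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of Japan?"},
        {"role": "assistant", "content": "The capital of Japan is Tokyo.", "context": "Tokyo is Japan's capital city."},
    ]
}
result = relevance_evaluator(conversation=conversation)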
- Added RetrievalScoreEvaluator, formerly an internal part of ChatEvaluator, as a standalone conversation-only evaluator.
- Removed ContentSafetyChatEvaluator and ChatEvaluator.
- The evaluator_config parameter of evaluate now maps an evaluator name to a dictionary EvaluatorConfig, which is a TypedDict. The column_mapping between data or target and evaluator field names should now be specified inside this new dictionary.

Before:
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    },
    ...
)
After:
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
            }
        }
    },
    ...
)
- Simulator now requires a model configuration to be passed in instead of a project scope and credential.

Before:

azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RESOURCE_GROUP"),
    "project_name": os.environ.get("PROJECT_NAME"),
}
sim = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
After:

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
}
sim = Simulator(model_config=model_config)

If api_key is not included in the model_config, the prompty runtime in promptflow-core will pick up DefaultAzureCredential.
- AzureOpenAIModelConfiguration
- data and evaluators are now required keywords in evaluate.
- The synthetic namespace has been renamed to simulator, and sub-namespaces under this module have been removed.
- The evaluate and evaluators namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace azure.ai.evaluation.
- The parameter project_scope in content safety evaluators has been renamed to azure_ai_project for consistency with the evaluate API and simulators.
- Model configuration classes are now of type TypedDict and are exposed in the azure.ai.evaluation module instead of coming from promptflow.core.
- Updated the parameter names question and answer in built-in evaluators to the more generic terms query and response.
- This package is a port of promptflow-evals. New features will be added only to this package moving forward.
- Added a TypedDict for AzureAIProject that allows for better IntelliSense and type checking when passing in project information.