
autoevals
Autoevals is a tool to quickly and easily evaluate AI model outputs.
It bundles together a variety of automatic evaluation methods, ranging from simple heuristic scorers to model-graded (LLM-based) evaluators.
Autoevals is developed by the team at Braintrust.
Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.
You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions.
npm install autoevals
Use Autoevals to model-grade an example LLM completion using the Factuality prompt.
By default, Autoevals uses your OPENAI_API_KEY environment variable to authenticate with OpenAI's API.
import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();
When you use Autoevals, it will look for an OPENAI_BASE_URL environment variable to use as the base for requests to an OpenAI-compatible API. If OPENAI_BASE_URL is not set, it will default to the AI proxy. The proxy is free to use, even if you don't have a Braintrust account, and comes with additional features.
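For illustration, here is a minimal sketch of configuring the endpoint explicitly before loading Autoevals. The proxy URL shown is an assumption; substitute whatever OpenAI-compatible endpoint you actually use.

// A minimal sketch: set the endpoint and key explicitly.
// The proxy URL below is an assumption; replace it with your own
// OpenAI-compatible endpoint if it differs.
process.env.OPENAI_BASE_URL = "https://api.braintrust.dev/v1/proxy";
process.env.OPENAI_API_KEY = "sk-..."; // your API key

// Import after the environment is configured.
const { Factuality } = await import("autoevals");

const result = await Factuality({
  input: "Which country has the highest population?",
  output: "People's Republic of China",
  expected: "China",
});
console.log(result.score);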
If you have a Braintrust account, you can optionally set the BRAINTRUST_API_KEY environment variable instead of OPENAI_API_KEY to unlock additional features like logging and monitoring. You can also route requests to supported AI providers and models, or to custom models you have configured in Braintrust.
// NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Run an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
  const result = await Factuality({
    model: "claude-3-5-sonnet-latest",
    output,
    expected,
    input,
  });

  // The evaluator returns a score in [0, 1] and includes the raw outputs from the evaluator
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();
There are two ways to configure a custom client when you need to use a different OpenAI-compatible API:
Set up a client that all your evaluators will use:
import OpenAI from "openai";
import { init, Factuality } from "autoevals";

const client = new OpenAI({
  baseURL: "https://api.openai.com/v1/",
});

init({ client });

(async () => {
  const result = await Factuality({
    input: "What is the speed of light in a vacuum?",
    output: "The speed of light in a vacuum is 299,792,458 meters per second.",
    expected:
      "The speed of light in a vacuum is approximately 300,000 kilometers per second (or precisely 299,792,458 meters per second).",
  });
  console.log("Factuality Score:", result);
})();
Configure a client for a specific evaluator instance:
import OpenAI from "openai";
import { Factuality } from "autoevals";

(async () => {
  const customClient = new OpenAI({
    baseURL: "https://custom-api.example.com/v1/",
  });

  const result = await Factuality({
    client: customClient,
    output: "Paris is the capital of France",
    expected:
      "Paris is the capital of France and has a population of over 2 million",
    input: "Tell me about Paris",
  });

  console.log(result);
})();
Once you grade an output using Autoevals, you can optionally use Braintrust to log and compare your evaluation results. This integration is completely optional and not required for using Autoevals.
Create a file named example.eval.js (it must take the form *.eval.[ts|tsx|js|jsx]):
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Autoevals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});
Then, run:
npx braintrust run example.eval.js
Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:
import { LLMClassifierFromTemplate } from "autoevals";

(async () => {
  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.
I'm going to provide you with the issue description, and two possible titles.
Issue Description: {{input}}
1: {{output}}
2: {{expected}}`;

  const choiceScores = { 1: 1, 2: 0 };

  const evaluator = LLMClassifierFromTemplate<{ input: string }>({
    name: "TitleQuality",
    promptTemplate,
    choiceScores,
    useCoT: true,
  });

  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;
  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
  const expected = `Standardize Error Responses across APIs`;

  const response = await evaluator({ input, output, expected });
  console.log("Score", response.score);
  console.log("Metadata", response.metadata);
})();
You can also create your own scoring functions that do not use LLMs. For example, to test whether the word 'banana' is in the output, you can use the following:
import { Score } from "autoevals";

const bananaScorer = ({
  output,
  expected,
  input,
}: {
  output: string;
  expected: string;
  input: string;
}): Score => {
  return { name: "banana_scorer", score: output.includes("banana") ? 1 : 0 };
};

(async () => {
  const input = "What is 1 banana + 2 bananas?";
  const output = "3";
  const expected = "3 bananas";

  const result = bananaScorer({ output, expected, input });
  console.log(`Banana score: ${result.score}`);
})();
There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult to get right in practice, such as consistently passing input, output, and expected values through a bunch of different evaluation methods. The full docs are available for your reference.
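To illustrate how that shared interface works in practice, the sketch below runs the same example through two scorers with one loop. It assumes Levenshtein is exported from autoevals alongside Factuality; check the exports of your installed version.

import { Factuality, Levenshtein } from "autoevals";

(async () => {
  const example = {
    input: "Which country has the highest population?",
    output: "People's Republic of China",
    expected: "China",
  };

  // Each scorer accepts the same { input, output, expected } shape and
  // returns a { name, score, metadata? } object, so scorers can be
  // swapped or combined freely.
  for (const scorer of [Factuality, Levenshtein]) {
    const result = await scorer(example);
    console.log(`${result.name}: ${result.score}`);
  }
})();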
We welcome contributions!
To install the development dependencies, run make develop, and run source env.sh to activate the environment. Make a .env file from the .env.example file and set the environment variables. Run direnv allow to load the environment variables.
To run the tests, run pytest from the root directory.
Send a PR and we'll review it! We'll take care of versioning and releasing.