Security News
pnpm 10.0.0 Blocks Lifecycle Scripts by Default
pnpm 10 blocks lifecycle scripts by default to improve security, addressing supply chain attack risks but sparking debate over compatibility and workflow changes.
autoevals
Advanced tools
Autoevals is a tool to quickly and easily evaluate AI model outputs.
It bundles together a variety of automatic evaluation methods including:
Autoevals is developed by the team at Braintrust.
Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.
You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions.
Autoevals is distributed as a Python library on PyPI and Node.js library on NPM.
npm install autoevals
Use Autoevals to model-grade an example LLM completion using the factuality prompt.
By default, Autoevals uses your OPENAI_API_KEY
environment variable to authenticate with OpenAI's API.
If you need to use a custom OpenAI client, you can initialize the library with a custom client.
import openai
from autoevals import init
from autoevals.oai import LLMClient
openai_client = openai.OpenAI(base_url="https://api.openai.com/v1/")
class CustomClient(LLMClient):
openai=openai_client # you can also pass in openai module and we will instantiate it for you
embed = openai.embeddings.create
moderation = openai.moderations.create
RateLimitError = openai.RateLimitError
def complete(self, **kwargs):
# make adjustments as needed
return self.openai.chat.completions.create(**kwargs)
# Autoevals will now use your custom client
client = init(client=CustomClient)
If you only need to use a custom client for a specific evaluator, you can pass in the client to the evaluator.
evaluator = Factuality(client=CustomClient)
import { Factuality } from "autoevals";
(async () => {
const input = "Which country has the highest population?";
const output = "People's Republic of China";
const expected = "China";
const result = await Factuality({ output, expected, input });
console.log(`Factuality score: ${result.score}`);
console.log(`Factuality metadata: ${result.metadata.rationale}`);
})();
Once you grade an output using Autoevals, it's convenient to use Braintrust to log and compare your evaluation results.
Create a file named example.eval.js
(it must end with .eval.js
or .eval.js
):
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
Eval("Autoevals", {
data: () => [
{
input: "Which country has the highest population?",
expected: "China",
},
],
task: () => "People's Republic of China",
scores: [Factuality],
});
Then, run
npx braintrust run example.eval.js
Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:
import { LLMClassifierFromTemplate } from "autoevals";
(async () => {
const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.
I'm going to provide you with the issue description, and two possible titles.
Issue Description: {{input}}
1: {{output}}
2: {{expected}}`;
const choiceScores = { 1: 1, 2: 0 };
const evaluator =
LLMClassifierFromTemplate <
{ input: string } >
{
name: "TitleQuality",
promptTemplate,
choiceScores,
useCoT: true,
};
const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;
const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
const expected = `Standardize Error Responses across APIs`;
const response = await evaluator({ input, output, expected });
console.log("Score", response.score);
console.log("Metadata", response.metadata);
})();
You can also create your own scoring functions that do not use LLMs. For example, to test whether the word 'banana'
is in the output, you can use the following:
import { Score } from "autoevals";
const bananaScorer = ({
output,
expected,
input,
}: {
output: string;
expected: string;
input: string;
}): Score => {
return { name: "banana_scorer", score: output.includes("banana") ? 1 : 0 };
};
(async () => {
const input = "What is 1 banana + 2 bananas?";
const output = "3";
const expected = "3 bananas";
const result = bananaScorer({ output, expected, input });
console.log(`Banana score: ${result.score}`);
})();
There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:
input
, output
, and expected
values through a bunch of different evaluation methods.The full docs are available here.
FAQs
Universal library for evaluating AI models
We found that autoevals demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 0 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
pnpm 10 blocks lifecycle scripts by default to improve security, addressing supply chain attack risks but sparking debate over compatibility and workflow changes.
Product
Socket now supports uv.lock files to ensure consistent, secure dependency resolution for Python projects and enhance supply chain security.
Research
Security News
Socket researchers have discovered multiple malicious npm packages targeting Solana private keys, abusing Gmail to exfiltrate the data and drain Solana wallets.