
eval-genius
eval-genius enables evals of arbitrary async code. It is generally intended for making multiple assertions on outputs which are generated nondeterministically. These assertions can be used to score algorithms on their effectiveness.
eval-genius is based heavily on evalite, with some key differences:
Install eval-genius alongside Vitest:
yarn add -D eval-genius vitest
Override the default Vitest config so that Vitest picks up your evals from *.eval.ts files. If you already have Vitest set up, you may want to pass the --config flag so evals run with a configuration distinct from your existing tests.
// vitest.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["./**/*.eval.ts"],
  },
});
// my-test.eval.ts
import { genius } from "eval-genius";
import * as vitest from "vitest";
import { describe } from "vitest";

describe("my-test", () =>
  genius({
    vitest,
    /**
     * Runs tests concurrently according to the vitest
     * maxConcurrency setting. Replaces expect.soft() with
     * expect() because expect.soft() does not work with
     * concurrent tests in Vitest. Defaults to false.
     */
    concurrent: true,
    metadata: {
      /**
       * The name of the functionality under evaluation.
       */
      name: "my-test",
      /**
       * The name of the variation being tested. For example, if you
       * are testing two prompts, you can run the suite with a
       * different label for each prompt.
       */
      label: "my-experiment",
    },
    /**
     * The data to be processed and evaluated. `input` and `expected`
     * can be any type, and can diverge from each other.
     */
    data: {
      values: async () => [
        {
          name: "basic test",
          input: "hello world!",
          expected: "HELLO WORLD!",
        },
      ],
    },
    /**
     * The work done for every entry in data.values.
     */
    task: {
      /**
       * The behavior being evaluated.
       */
      execute: async (input) => input.toUpperCase(),
      /**
       * Makes assertions to be shown in the Vitest output. Not used
       * by the exporters. Use the expect() function provided here;
       * do not use expect() from Vitest directly.
       */
      test: async (expect, { rendered, expected, output }) => {
        /**
         * Use the rendered values to represent the values sent to
         * the exporter.
         */
        expect
          .soft(rendered.capitalizesCorrectly, "capitalizes correctly")
          .toBe(true);
        /**
         * For more complex comparisons, error messages are clearer
         * if the expect() call makes the comparison directly.
         */
        expect.soft(output).toBe(expected);
      },
      /**
       * Renders output to be sent to the exporters.
       */
      renderer: {
        /**
         * The properties which the exporter should consume from
         * the return value of the render function.
         */
        fields: ["capitalizesCorrectly"],
        /**
         * The data that the exporters should consume.
         */
        render: async ({ output, expected }) => ({
          capitalizesCorrectly: output === expected,
        }),
      },
    },
    /**
     * Destinations to send the rendered data.
     */
    exporters: [],
  }));
If you want to compare multiple implementations in an experiment, you can do something like this:
[
  { label: "control", execute: controlImplementation },
  { label: "test", execute: testImplementation },
].forEach(({ label, execute }) =>
  describe(`my-test [${label}]`, () =>
    genius({
      metadata: { name: "my-test", label },
      task: {
        execute,
        // ...task
      },
      // ...config
    })),
);
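To see what this pattern buys you, here is a standalone sketch of the control/test idea, using toy implementations (hypothetical; the real `controlImplementation` and `testImplementation` would come from your codebase). Each labeled variant runs over the same cases, producing comparable results per label:

```typescript
// Toy stand-ins for the control and test implementations.
type Case = { input: string; expected: string };

const cases: Case[] = [{ input: "hello world!", expected: "HELLO WORLD!" }];

const variants = [
  { label: "control", execute: (s: string) => s.toUpperCase() },
  { label: "test", execute: (s: string) => s.toLocaleUpperCase() },
];

// Count how many cases each labeled implementation gets right.
const results = variants.map(({ label, execute }) => ({
  label,
  passed: cases.filter((c) => execute(c.input) === c.expected).length,
}));

console.log(results);
```

Because every variant is evaluated against the same data, any difference in scores is attributable to the implementation rather than the inputs.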
See the google-sheets documentation for how to create your keys.
Create a .env file with:
GOOGLE_SERVICE_ACCOUNT_EMAIL=your-service-account-email
GOOGLE_PRIVATE_KEY=your-private-key
# the email of the Google account the documents should be saved to
MY_GOOGLE_ACCOUNT_EMAIL=your-google-account-email
type NewDocumentInit = { title: string; folderId?: string };
type ExistingDocumentInit = { spreadsheetId: string };
type InitArg = NewDocumentInit | ExistingDocumentInit;
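For illustration, these are the two shapes you can pass to the exporter's init(): create a new spreadsheet by title (optionally inside a Drive folder), or reuse an existing one by its ID. The narrowing helper below is hypothetical, not part of eval-genius:

```typescript
type NewDocumentInit = { title: string; folderId?: string };
type ExistingDocumentInit = { spreadsheetId: string };
type InitArg = NewDocumentInit | ExistingDocumentInit;

// Hypothetical narrowing helper: only existing-document inits
// carry a spreadsheetId.
const isExisting = (arg: InitArg): arg is ExistingDocumentInit =>
  "spreadsheetId" in arg;

// Create a fresh spreadsheet...
const fresh: InitArg = { title: "Evals 2024-01-01" };
// ...or write into an existing one (placeholder ID).
const existing: InitArg = { spreadsheetId: "your-spreadsheet-id" };

console.log(isExisting(fresh), isExisting(existing)); // false true
```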
// vitest.config.ts
import { defineConfig } from "vitest/config";
import { GoogleSheetsExporter } from "eval-genius/GoogleSheetsExporter";
import dotenv from "dotenv";

dotenv.config();

const googleSheetsExporter = GoogleSheetsExporter();
const now = new Date();
await googleSheetsExporter.init({
  title: `Evals [${now.toLocaleDateString()} ${now.toLocaleTimeString()}]`,
});

export default defineConfig({
  test: {
    include: ["./**/*.eval.ts"],
  },
});
// my-test.eval.ts
import { genius } from "eval-genius";
import { GoogleSheetsExporter } from "eval-genius/GoogleSheetsExporter";
import * as vitest from "vitest";

genius({
  // ...config
  exporters: [GoogleSheetsExporter],
});
You will get a table of the output generated by the renderer, with a runId supplied for each run.

Google Sheets is a straightforward way of running aggregate analysis on data. In particular, Pivot Tables make it very easy to compare outputs of different runs. The example below indicates a regression when changing from the control to the experiment.
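For a sense of what such an aggregate looks like, here is a sketch in plain TypeScript, with a hypothetical row shape mirroring the rendered capitalizesCorrectly field, of the per-label pass rate a Pivot Table would compute:

```typescript
// Hypothetical exported rows: one per data entry, tagged with run label.
type Row = { label: string; capitalizesCorrectly: boolean };

const rows: Row[] = [
  { label: "control", capitalizesCorrectly: true },
  { label: "control", capitalizesCorrectly: true },
  { label: "experiment", capitalizesCorrectly: true },
  { label: "experiment", capitalizesCorrectly: false },
];

// Aggregate the pass rate per label, like a Pivot Table over the sheet.
const passRate = (label: string): number => {
  const group = rows.filter((r) => r.label === label);
  return group.filter((r) => r.capitalizesCorrectly).length / group.length;
};

console.log(passRate("control"), passRate("experiment")); // 1 0.5
```

Here the experiment's pass rate drops from 1 to 0.5 relative to the control, which is the kind of regression the comparison is meant to surface.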

Custom exporters can export to any destination. They must comply with this type definition:
type MaybePromise<T> = T | Promise<T>;

type RenderedValue = boolean | number | string | null;
type Rendered<T extends string> = Record<T, RenderedValue>;

export type Reporter<FieldNames extends string> = {
  /**
   * Queues data to be sent to the destination.
   */
  report: (arg: { result: Rendered<FieldNames> }) => MaybePromise<unknown>;
  /**
   * Sends data to the destination.
   */
  flush: () => MaybePromise<unknown>;
};

export type Exporter<InitArgs extends any, InitReturn extends any> = () => {
  /**
   * Any initialization logic for the reporter.
   */
  init: (arg: InitArgs) => InitReturn;
  /**
   * Creates the reporter.
   */
  start: <FieldNames extends string>(arg: {
    title: string;
    fields: Array<FieldNames>;
  }) => MaybePromise<Reporter<FieldNames>>;
};
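As an illustration, here is a minimal in-memory exporter satisfying the shape above. It is a toy, not part of eval-genius: it buffers reported results and "flushes" them into an array instead of a real destination (the `MaybePromise` helper is assumed to be `T | Promise<T>`):

```typescript
type MaybePromise<T> = T | Promise<T>;
type RenderedValue = boolean | number | string | null;
type Rendered<T extends string> = Record<T, RenderedValue>;

type Reporter<FieldNames extends string> = {
  report: (arg: { result: Rendered<FieldNames> }) => MaybePromise<unknown>;
  flush: () => MaybePromise<unknown>;
};

// The toy "destination": flushed rows land here.
const rows: unknown[] = [];

const MemoryExporter = () => ({
  // Nothing to set up for an in-memory exporter.
  init: () => undefined,
  start: <FieldNames extends string>(arg: {
    title: string;
    fields: Array<FieldNames>;
  }): Reporter<FieldNames> => {
    const pending: Array<Rendered<FieldNames>> = [];
    return {
      // Queue a rendered result.
      report: ({ result }) => pending.push(result),
      // Deliver everything queued so far.
      flush: () => rows.push(...pending.splice(0)),
    };
  },
});

const reporter = MemoryExporter().start({
  title: "my-test",
  fields: ["capitalizesCorrectly"],
});
reporter.report({ result: { capitalizesCorrectly: true } });
reporter.flush();
// rows now holds the single flushed result.
```

The split between report() and flush() lets an exporter batch writes: results are queued as each eval completes, then sent to the destination in one operation.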