
It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.
LLMStat
llmstat is a command-line tool for generating and evaluating responses from an LLM. The primary focus is on evaluating prompts with datasets and comparing results across multiple runs.
This project is under active development and not production ready: there are limitations, and some of the interfaces will likely change.
Generate output
# A single prompt generates a single result
npx llmstat generate --prompt prompt.json
# A prompt plus a dataset generates batch results
npx llmstat generate --prompt prompt.json --data dataset.jsonl
# Either mode can write directly to a file (json for single, jsonl for batch)
npx llmstat generate -p prompt.json -d dataset.jsonl --output out.json
# You can also point to a folder containing prompt.json and dataset.jsonl
npx llmstat generate --folder examples/batch
Run evaluations
# A prompt, a dataset, and an eval suite evaluate all generated responses
npx llmstat eval --prompt prompt.json --data dataset.jsonl --eval eval.json
# Pre-generated responses can be evaluated from a run file plus an eval suite
npx llmstat eval --run run.jsonl --eval eval.json
# You can also point to a folder containing the appropriate files
npx llmstat eval --folder examples/rubric
Environment variables
Environment variables can be set in the environment or in a .env file in the directory where the command is run.
OPENAI_API_KEY - An API key for OpenAI. This key is required. Other environment variables supported by openai-node are also supported.
LLMSTAT_CACHE_PATH - Overrides the cache directory (default ~/.llmstat/cache)
LLMSTAT_CONCURRENCY - The number of concurrent requests to run (default 6)
LLMSTAT_REQUEST_INTERVAL_MS - The length, in milliseconds, of the rate-limiting interval (default 60000)
LLMSTAT_REQUEST_CAP - The maximum number of requests within the interval for rate limiting (default 1000)
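For example, a .env file in the directory where you run the command might look like the following (the API key is a placeholder; the other values shown are the documented defaults):
OPENAI_API_KEY=sk-...
LLMSTAT_CACHE_PATH=~/.llmstat/cache
LLMSTAT_CONCURRENCY=6
LLMSTAT_REQUEST_INTERVAL_MS=60000
LLMSTAT_REQUEST_CAP=1000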
Rate limiting
If LLMStat encounters a rate limit error (or another recoverable error from the API), it retries with exponential backoff to recover. However, if you know you have a specific rate limit, it is more efficient to configure LLMStat to run within that limit.
Use the rate limiting environment variables to speed up or slow down requests depending on the request rate you want. Increase LLMSTAT_CONCURRENCY to run more requests in parallel. Adjust LLMSTAT_REQUEST_INTERVAL_MS and LLMSTAT_REQUEST_CAP to slow requests down. For example, to run a maximum of 20 requests per minute, set LLMSTAT_REQUEST_INTERVAL_MS=60000 and LLMSTAT_REQUEST_CAP=20. To spread those requests out through the minute instead, set LLMSTAT_REQUEST_INTERVAL_MS=3000 and LLMSTAT_REQUEST_CAP=1, which allows at most one request every 3 seconds.
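For example, reusing the examples/batch folder from above:
# Allow up to 20 requests within each one-minute window
LLMSTAT_REQUEST_INTERVAL_MS=60000 LLMSTAT_REQUEST_CAP=20 npx llmstat generate --folder examples/batch
# Spread requests out to at most one every 3 seconds
LLMSTAT_REQUEST_INTERVAL_MS=3000 LLMSTAT_REQUEST_CAP=1 npx llmstat generate --folder examples/batch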
Prompt config
A prompt config matches the format of the underlying API request. This is intended to reduce the complexity of reusing the prompt config within a production application and to reduce issues where parameters or options are not carried over to the production application after evaluation. There are two differences from the standard model request format:
A metadata section containing additional information that is not part of the request. It is primarily used for labeling and tracking prompt files.
Template variables such as {{date}} in message content, which are rendered with fields from the dataset.
{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "system", "content": "You are an assistant. Today's date is {{date}}."},
...
],
"metadata": {
"name": "test-prompt",
"variant": "a",
"provider": "openai"
}
}
Possible fields in metadata:
provider (required) - The only currently supported value is openai
headers - Custom headers to pass along with the request. These can contain templates with variables from the dataset. Example: headers: {"x-header": "{{val}}"}
Dataset
The dataset is a JSON lines (.jsonl) file with variables used to render the prompt config. Extra fields that are not used for rendering can be included for reference, and all of these extra fields are included in the output. In the example below, id is included as an identifier for the data entry.
{"id": "today", "date": "Thu Jun 20 2024"}
{"id": "yesterday", "date": "Wed Jun 19 2024"}
{"id": "broken", "date": "0"}
Specific named eval tests can be run against each dataset entry with the evals field. (See the Eval suite section below for how to define these.)
{"evals": ["is-true", "contains-thurs"], "date": "Thu Jun 20 2024"}
{"evals": ["is-true", "contains-wed"], "date": "Wed Jun 19 2024"}
{"evals": ["is-true"], "date": "0"}
Advanced:
To evaluate multi-turn applications, provide additional messages in the dataset. These messages are appended to the messages configured in the prompt config.
{"id": "joke-question", "messages": [{"role": "user", "content": "Please tell me a joke"}]}
{"id": "joke-refusal", "messages": [{"role": "user", "content": "Please tell me a joke"},{"role": "assistant", "content": "No"},{"role": "user", "content": "That's not a good joke."}]}
What happens if I have a variable named evals or messages in my data that is used for rendering a prompt?
You can prefix the LLMStat variable with llmstat-, and the prefixed field will always be used for evals or messages instead of your variable. Example: if your prompt contains something like here are your evals: {{evals}}, you can keep evals in your data as a plain rendering variable and use llmstat-evals to name the eval tests for that line in the dataset.
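For example, a hypothetical dataset line where evals is a plain rendering variable and llmstat-evals names the tests to run:
{"llmstat-evals": ["is-true"], "evals": "one, two, three", "date": "Thu Jun 20 2024"}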
Run results
Run results contain the output of a generation run as well as the inputs that were used for generation (for display or re-run purposes). The output includes the full API response, not just the output text, which allows you to run assertions on things like token usage or perplexity when requested in the prompt config.
Run results are jsonl with each line containing the following sections:
{
  metadata: { ... },
  promptConfig: { ... },
  data: { ... },
  request: { ... },
  result: { ... }
}
Run results are generally an output format for the tool, but you can also create your own run results to run evaluations against. The only required property is result. You can see an example in examples/previous-run-evaluation/run.jsonl.
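As a minimal sketch, a hand-written line for a pre-generated response might look like the following (assuming the OpenAI chat completion response shape for result; see examples/previous-run-evaluation/run.jsonl for a real example):
{"result": {"choices": [{"message": {"role": "assistant", "content": "Thu Jun 20 2024"}}]}}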
Eval suite
An eval suite is a collection of individual tests to run against the run results. Each entry in the dataset can require one or more tests. The evals follow the assertions format from Promptfoo, and Promptfoo is used to run the assertions.
{
  "contains-wed": {
    "assert": "contains",
    "value": "Wed"
  },
  "contains-thurs": {
    "assert": "contains",
    "value": "Thu"
  },
  "not-fri": {
    "assert": "not-icontains",
    "value": "fri"
  }
}
The test named default will run a set of tests against every result.
{
  "default": {
    "type": "default",
    "value": [
      {
        "type": "not-icontains-any",
        "value": ["sorry", "as an ai"]
      }
      ...
    ]
  }
}
Eval results
Eval results contain the output of a generation and eval run. Each line includes all of the data required to display information about the request, or to re-run the request or evaluation if needed. This is the same format as run results with an additional evalResults section.
Eval results are jsonl with each line containing the following sections:
{
  metadata: { ... },
  promptConfig: { ... },
  data: { ... },
  request: { ... },
  result: { ... },
  evalResults: { ... }
}