
An open-source visual environment for battle-testing prompts to LLMs.
ChainForge is a data flow prompt engineering environment for analyzing and evaluating LLM responses. It enables rapid-fire, quick-and-dirty comparison of prompts, models, and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge, you can query multiple LLMs at once, compare response quality across prompt variations and models, and set up evaluation metrics for responses.
Read the docs to learn more. ChainForge comes with a number of example evaluation flows to give you a sense of what's possible, including 188 example flows generated from benchmarks in OpenAI evals.
ChainForge is built on ReactFlow and Flask.
For user-curated resources and learning materials, check out the 🌟Awesome ChainForge repo!
You can install ChainForge locally, or try it out on the web at https://chainforge.ai/play/. The web version of ChainForge has a limited feature set. In a locally installed version you can load API keys automatically from environment variables, write Python code to evaluate LLM responses, or query locally-run models hosted via Ollama.
To install ChainForge on your machine, make sure you have Python 3.8 or higher, then run
pip install chainforge
Once installed, do
chainforge serve
Open localhost:8000 in a Google Chrome, Firefox, Microsoft Edge, or Brave browser.
You can set your API keys by clicking the Settings icon in the top-right corner. If you prefer not to worry about this every time you open ChainForge, we highly recommend that you save your OpenAI, Anthropic, Google, etc. API keys and/or Amazon AWS credentials to your local environment. For more details, see the How to Install guide.
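If you keep provider keys in your shell environment, a quick check like the following can confirm they are visible before you run chainforge serve. This is a hypothetical helper, not part of ChainForge; the variable names OPENAI_API_KEY and ANTHROPIC_API_KEY are common conventions and are assumptions here, so see the How to Install guide for the exact names ChainForge reads.

```python
# Sanity check (hypothetical helper, not part of ChainForge): confirm that the
# API keys you expect ChainForge to pick up are present in your environment.
# The variable names below are assumptions; check the install docs for the
# names ChainForge actually reads.
import os

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    status = "set" if os.environ.get(key) else "missing"
    print(f"{key}: {status}")
```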
You can use our Dockerfile to run ChainForge locally using Docker Desktop:
Build the Dockerfile:
docker build -t chainforge .
Run the image:
docker run -p 8000:8000 chainforge
Now you can open the browser of your choice and navigate to http://127.0.0.1:8000.
We've prepared many example flows to give you a sense of what's possible with ChainForge.
Click the "Example Flows" button on the top-right corner and select one. Here is a basic comparison example, plotting the length of responses across different models and arguments for the prompt parameter {game}
:
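If you want to score responses yourself in a locally installed version, a Python evaluator can compute a similar length metric. The sketch below is a minimal example: the evaluate(response) entry point and the response.text attribute are assumptions about the Python Evaluator node's interface, so check the docs for the exact API; the FakeResponse class is only a stand-in so the snippet runs on its own.

```python
# Minimal sketch of a length-scoring evaluator, similar in spirit to the
# {game} example above. The evaluate(response) entry point and the
# response.text attribute are assumptions about the Python Evaluator node's
# interface; see the ChainForge docs for the exact API.
from dataclasses import dataclass

@dataclass
class FakeResponse:
    """Stand-in for the response object ChainForge passes to an evaluator."""
    text: str

def evaluate(response) -> int:
    """Score each LLM response by its length in characters."""
    return len(response.text)

if __name__ == "__main__":
    print(evaluate(FakeResponse("Chess is a two-player strategy game.")))  # 36
```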
You can also conduct ground truth evaluations using Tabular Data nodes. For instance, we can compare each LLM's ability to answer math problems by comparing each response to the expected answer:
Just import a dataset, hook it up to a template variable in a Prompt Node, and press run.
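Conceptually, the evaluation compares each model's answer against the expected answer from the tabular data. Here is a small, self-contained sketch of that comparison; the rows and column names are invented for illustration and do not reflect ChainForge's internal data format.

```python
# Conceptual sketch of a ground-truth check: each LLM answer is compared
# against the expected answer from a tabular dataset. The rows and column
# names here are hypothetical, for illustration only.
def exact_match(llm_answer: str, expected: str) -> bool:
    """True if the model's answer equals the expected answer, ignoring case and whitespace."""
    return llm_answer.strip().lower() == expected.strip().lower()

rows = [
    {"question": "What is 7 * 8?",   "expected": "56", "llm_answer": "56"},
    {"question": "What is 12 + 30?", "expected": "42", "llm_answer": "42"},
    {"question": "What is 9 - 4?",   "expected": "5",  "llm_answer": "Five"},
]
for row in rows:
    print(row["question"], exact_match(row["llm_answer"], row["expected"]))
# Prints True, True, False -- strict string matching misses "Five" vs "5",
# which is why more forgiving evaluators are often needed for math answers.
```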
Compare across models and prompt variables with an interactive response inspector, including a formatted table and exportable data:
The key power of ChainForge is combinatorial: it takes the cross product of inputs to prompt templates, producing every combination of input values. This is incredibly effective for sending off hundreds of queries at once and verifying model behavior more robustly than one-off prompting. A concrete illustration of this expansion follows.
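To make the cross product concrete, here is a small sketch of the expansion. The template and variable values are invented for the example; this is an illustration of the idea, not ChainForge's internal code.

```python
# Illustration of the cross-product expansion performed on prompt templates:
# every combination of template-variable values becomes a distinct query.
from itertools import product

template = "Write a one-sentence review of {game} in the style of {critic}."
variables = {
    "game": ["chess", "Go", "poker"],
    "critic": ["a film critic", "a sports commentator"],
}

# 3 games x 2 critics = 6 distinct prompts, each sent to every selected model.
prompts = [
    template.format(**dict(zip(variables, combo)))
    for combo in product(*variables.values())
]
for p in prompts:
    print(p)
```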
Here's a tutorial to get started comparing across prompt templates.
The web version of ChainForge (https://chainforge.ai/play/) includes a Share button.
Simply click Share to generate a unique link for your flow and copy it to your clipboard:
For instance, here's an experiment I made that tries to get an LLM to reveal a secret key: https://chainforge.ai/play/?f=28puvwc788bog
Note: To prevent abuse, you can only share up to 10 flows at a time, and each flow must be under 5MB after compression. If you share more than 10 flows, the oldest link will break, so make sure to always Export important flows to cforge files, and use Share only to pass data ephemerally.
For finer details about the features of specific nodes, check out the List of Nodes.
A key goal of ChainForge is facilitating comparison and evaluation of prompts and models. The features that enable this include prompt templating, parametrized querying across multiple models, interactive response inspectors, and evaluation nodes.
Alongside built-in gen AI features 🪄💫 like synthetic data generation, this accelerates prompt engineering: you can often compare prompts and model performance without writing a single line of code, dramatically speeding up iteration and discovery.
We've also found that some users simply want to use ChainForge to make tons of parametrized queries to LLMs (e.g., chaining prompt templates into prompt templates), possibly score them, and then output the results to a spreadsheet (Excel xlsx). To do this, attach an Inspect node to the output of a Prompt node and click Export Data.
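If you post-process the exported spreadsheet, a few lines of pandas are usually enough. The sketch below is only an example under stated assumptions: the filename and the "Model"/"Score" column names are hypothetical and depend on your flow and on what Export Data writes.

```python
# Hedged sketch of post-processing a spreadsheet exported from an Inspect
# node. The filename and column names are hypothetical; adjust them to match
# what your flow actually exports.
import pandas as pd  # requires openpyxl for .xlsx files

df = pd.read_excel("chainforge_export.xlsx")
print(df.head())

# Example: average score per model, assuming the export contains "Model" and
# "Score" columns produced by an evaluator node in the flow.
if {"Model", "Score"}.issubset(df.columns):
    print(df.groupby("Model")["Score"].mean())
```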
For more specific details, see our documentation.
ChainForge was created by Ian Arawjo, a postdoctoral scholar in Harvard HCI's Glassman Lab with support from the Harvard HCI community. Collaborators include PhD students Priyan Vaithilingam and Chelse Swoopes, Harvard undergraduate Sean Yang, and faculty members Elena Glassman and Martin Wattenberg. Additional collaborators include UC Berkeley PhD student Shreya Shankar and Université de Montréal undergraduate Cassandre Hamel.
This work was partially funded by the NSF grants IIS-2107391, IIS-2040880, and IIS-1955699. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
We provide ongoing releases of this tool in the hopes that others find it useful for their projects.
ChainForge is meant to be general-purpose, and is not developed for a specific API or LLM back-end. Our ultimate goal is integration into other tools for the systematic evaluation and auditing of LLMs. We hope to help others who are developing prompt-analysis flows in LLMs, or otherwise auditing LLM outputs. This project was inspired by our own use case, but also shares some camaraderie with two related (closed-source) research projects, both led by Sherry Wu.
Unlike these projects, we are focusing on supporting evaluation across prompts, prompt parameters, and models.
We welcome open-source collaborators. If you want to report a bug or request a feature, open an Issue. We also encourage users to implement the requested feature / bug fix and submit a Pull Request.
If you use ChainForge for research purposes, whether by building upon the source code or investigating LLM behavior using the tool, we ask that you cite our CHI research paper in any related publications. The BibTeX you can use is:
@inproceedings{arawjo2024chainforge,
  title={ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing},
  author={Arawjo, Ian and Swoopes, Chelse and Vaithilingam, Priyan and Wattenberg, Martin and Glassman, Elena L},
  booktitle={Proceedings of the CHI Conference on Human Factors in Computing Systems},
  pages={1--18},
  year={2024}
}
ChainForge is released under the MIT License.