flow-judge
Technical Report | Model Weights | HuggingFace Space | Evaluation Code | Tutorials
flow-judge is a lightweight library for evaluating LLM applications with Flow-Judge-v0.1.
Flow-Judge-v0.1 is an open, small yet powerful language model evaluator trained by Flow AI on a synthetic dataset of LLM system evaluation data. You can learn more about the unique features of our model in the technical report.
Install flow-judge using pip:
pip install -e ".[vllm,hf]"
pip install 'flash_attn>=2.6.3' --no-build-isolation
Extras available:
- dev to install development dependencies
- hf to install Hugging Face Transformers dependencies
- vllm to install vLLM dependencies
- llamafile to install Llamafile dependencies
- baseten to install Baseten dependencies

Here's a simple example to get you started:
from flow_judge import Vllm, Llamafile, Hf, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display
# If you are running on an Ampere GPU or newer, create a model using VLLM
model = Vllm()
# If you have other applications open taking up VRAM, you can use less VRAM by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)
# Or if not running on Ampere GPU or newer, create a model using no flash attn and Hugging Face Transformers
# model = Hf(flash_attn=False)
# Or create a model using Llamafile if you are not running an NVIDIA GPU, for example on Apple Silicon macOS
# model = Llamafile()
# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)
# Sample to evaluate
query = """..."""
context = """..."""
response = """..."""
# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)
# Run the evaluation
result = faithfulness_judge.evaluate(eval_input, save_results=False)
# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
The library supports multiple inference backends to accommodate different hardware configurations and performance needs:
vLLM:
from flow_judge import Vllm
model = Vllm()
Hugging Face Transformers:
If you are running on an Ampere GPU or newer:
from flow_judge import Hf
model = Hf()
If you are not running on an Ampere GPU or newer, disable flash attention:
from flow_judge import Hf
model = Hf(flash_attn=False)
Llamafile:
from flow_judge import Llamafile
model = Llamafile()
Baseten:
from flow_judge import Baseten
model = Baseten()
For detailed information on using Baseten, visit the Baseten readme.
Choose the inference backend that best matches your hardware and performance requirements. The library provides a unified interface for all these options, making it easy to switch between them as needed.
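For instance, here is a minimal sketch using only the classes shown above: the backend choice is a single line, and the judge code downstream stays the same.

from flow_judge import Vllm, Hf, Llamafile, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Pick the backend that matches your hardware; everything downstream is unchanged.
model = Vllm()                  # Ampere GPU or newer
# model = Hf(flash_attn=False)  # older NVIDIA GPUs, without flash attention
# model = Llamafile()           # no NVIDIA GPU, e.g. Apple Silicon

judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=model)
result = judge.evaluate(eval_input, save_results=False)  # eval_input as constructed in the earlier example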
Flow-Judge-v0.1 was trained to handle any custom metric that can be expressed as a combination of evaluation criteria, a rubric, and the required inputs and outputs. For convenience, the flow-judge library comes with pre-defined metrics such as RESPONSE_CORRECTNESS or RESPONSE_FAITHFULNESS. You can check the full list by running:
from flow_judge.metrics import list_all_metrics
list_all_metrics()
For efficient processing of multiple inputs, you can use the batch_evaluate method:
# Read the sample data
import json
from flow_judge import Vllm, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display
# Initialize the model
model = Vllm()
# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)
# Load some sample data
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)
# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": sample["query"]},
        {"context": sample["context"]},
    ]
    for sample in data
]
outputs_batch = [{"response": sample["response"]} for sample in data]
# Create a list of EvalInput
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]
# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)
# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))
[!WARNING] There is currently a reported issue with Phi-3 models that produces gibberish outputs with contexts longer than 4096 tokens (including input and output). The issue has recently been fixed in the transformers library, so we recommend using the Hf() model configuration for longer contexts at the moment. For more details, refer to: #33129 and #6135
Create your own evaluation metrics:
from flow_judge.metrics import CustomMetric, RubricItem
custom_metric = CustomMetric(
    name="My Custom Metric",
    criteria="Evaluate based on X, Y, and Z.",
    rubric=[
        RubricItem(score=0, description="Poor performance"),
        RubricItem(score=1, description="Good performance"),
    ],
    required_inputs=["query"],
    required_output="response"
)
judge = FlowJudge(metric=custom_metric, config="Flow-Judge-v0.1-AWQ")
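The custom judge is then used exactly like the built-in ones. Here is a minimal sketch reusing the EvalInput API from earlier; the query and response strings are placeholder examples:

# The EvalInput must match required_inputs=["query"] and required_output="response"
eval_input = EvalInput(
    inputs=[{"query": "How do I reset my password?"}],  # placeholder example
    output={"response": "Use the 'Forgot password' link on the login page."},
)
result = judge.evaluate(eval_input, save_results=False)
print(result.score, result.feedback)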
We support integrations with the Llama Index evaluation module and Haystack. Note that we are working on adding more integrations with other frameworks in the near future.
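As a rough sketch of what such an integration can look like (the import path and class name below are assumptions for illustration, not confirmed API; check the integration tutorials for the exact names):

# NOTE: the import path and class name are hypothetical, for illustration only
from flow_judge import Vllm
from flow_judge.integrations.llama_index import LlamaIndexFlowJudge  # hypothetical

model = Vllm()
evaluator = LlamaIndexFlowJudge(model=model)  # hypothetical signature
# The evaluator could then be passed wherever Llama Index expects an evaluator object.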
Clone the repository:
git clone https://github.com/flowaicom/flow-judge.git
cd flow-judge
Create a virtual environment:
virtualenv ./.venv
or
python -m venv ./.venv
Activate the virtual environment:
# windows
.venv\Scripts\activate
# macos / linux
source ./.venv/bin/activate
Install the package in editable mode with development dependencies:
pip install -e ".[dev]"
or
pip install -e ".[dev,vllm]"
for vLLM support.
Set up pre-commit hooks:
pre-commit install
Make sure you have trufflehog installed:
# make trufflehog available in your path
# macos
brew install trufflehog
# linux
curl -sSfL https://raw.githubusercontent.com/trufflesecurity/trufflehog/main/scripts/install.sh | sh -s -- -b /usr/local/bin
# nix
nix profile install nixpkgs#trufflehog
Run pre-commit on all files:
pre-commit run --all-files
You're now ready to start developing! You can run the main script with:
python -m flow_judge
Remember to always activate your virtual environment when working on the project. To deactivate the virtual environment when you're done, simply run:
deactivate
To run the tests for Flow-Judge, follow these steps:
Navigate to the root directory of the project in your terminal.
Run the tests using pytest:
pytest tests/
This will discover and run all the tests in the tests/
directory.
If you want to run a specific test file, you can do so by specifying the file path:
pytest tests/test_flow_judge.py
For more verbose output, you can use the -v flag:
pytest -v tests/
Contributions to flow-judge are welcome! Please follow these steps:
1. Create your feature branch (git checkout -b feature/AmazingFeature)
2. Commit your changes (git commit -m 'Add some AmazingFeature')
3. Push to the branch (git push origin feature/AmazingFeature)
4. Open a pull request
Please ensure that your code adheres to the project's coding standards and passes all tests.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Flow-Judge is developed and maintained by the Flow AI team. We appreciate the contributions and feedback from the AI community in making this tool more robust and versatile.