🍌 Open source AI Agent evaluations for web tasks 🍌

🔗 Main site • 🐦 Twitter • 📢 Discord

# Banana-lyzer

## Introduction
Banana-lyzer is an open source AI Agent evaluation framework and dataset for web tasks with Playwright (and it has a banana theme because why not).
We've created our own evals repo because:
- Websites change over time, are affected by latency, and may have anti-bot protections.
- We need a system that can reliably save and deploy historic/static snapshots of websites.
- Standard web practices are loose, and there are countless underlying ways to represent a single website. For an agent to generalize well, we need to build a diverse dataset of websites across industries and use cases.
- We have specific evaluation criteria and agent use cases that focus on structured, direct information retrieval across websites.
- There exist valuable web task datasets and evaluations that we'd like to unify in a single repo (Mind2Web, WebArena, etc.).
[Demo video](https://github.com/reworkd/bananalyzer/assets/50181239/4587615c-a5b4-472d-bca9-334594130af1)
## How does it work?
⚠️ Note that this repo is a work in progress. ⚠️
Banana-lyzer is a CLI tool that runs a set of evaluations against a set of example websites. The examples are defined in examples.json using a schema similar to Mind2Web and WebArena. Each example stores metadata like the agent goal and the expected agent output, in addition to MHTML snapshots of the URLs to ensure the pages do not change over time. Note that all examples today expect structured JSON output built from data directly extracted from the page.
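For illustration, an entry in examples.json might look roughly like the following sketch (shown as a Python dict so it can be annotated). The field names and values here are hypothetical; schemas.py is the source of truth:

```python
# Hypothetical examples.json entry -- field names are illustrative only;
# see schemas.py for the real schema definitions.
example_entry = {
    "id": "hypothetical-example-id",
    "url": "https://example.com/doctors/jane-doe",
    "category": "healthcare",  # e.g. healthcare, manufacturing, software
    "type": "detail",          # one of the ExampleType values
    "goal": "Extract the provider's name, specialty, and phone number",
    "evals": [
        {
            "expected": {
                "name": "Dr. Jane Doe",
                "specialty": "Cardiology",
                "phone": "(555) 555-5555",
            }
        }
    ],
}
```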
The CLI tool will sequentially run examples against a user-defined agent by dynamically constructing a pytest test suite and executing it. As a user, you simply create a file that implements the AgentRunner interface and defines an instance of your AgentRunner in a variable called `agent`. AgentRunner exposes the example and a Playwright browser context to use (see the example file under Getting Started below).
In the future we will support more complex evaluation methods and examples that require multiple steps to complete. The
plan is to translate existing datasets like Mind2Web and WebArena into this format.
## Test intents

We have defined a set of page types and test intents an agent can be evaluated on. These types are defined in the `ExampleType` enum in schemas.py and are illustrated with sample outputs after the list below.
- `listing`: The example starts on a listing page, and the agent must scrape all detail page links as well as the information on those detail pages. Note that currently, we only test that all of the detail page URLs were captured.
- `detail`: The example starts on a detail page, and the agent must retrieve specific JSON information from that page. This is the most common test type.
- `listing_detail`: The example starts on a listing page, and the agent must scrape all information from the current page. All of the required information is available there, so the agent need not visit any detail pages.

Separately, there are specific tags that can be used to further filter test intents:

- `pagination`: The agent must fetch data across multiple pages (applies to either links or fetch tests for now).
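To make these types concrete, here is a rough sketch of the output shape each one expects. The values and field names below are hypothetical:

```python
# Hypothetical expected outputs per test type (shapes only; real examples
# define their own fields in examples.json).

# listing: only the detail page URLs are currently checked
listing_expected = [
    "https://example.com/items/1",
    "https://example.com/items/2",
]

# detail: structured JSON extracted from a single detail page
detail_expected = {"name": "Widget", "price": "$10.00"}

# listing_detail: structured JSON for every entry on the listing page itself
listing_detail_expected = [
    {"name": "Widget", "price": "$10.00"},
    {"name": "Gadget", "price": "$20.00"},
]
```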
## Getting Started

### Local testing installation

```bash
pip install bananalyzer
```
- Implement the agent_runner.py interface and create a banalyzer.py test file (the name doesn't matter). Below is an example file:
```python
import asyncio

from playwright.async_api import BrowserContext

from bananalyzer import AgentResult, AgentRunner, Example


class NullAgentRunner(AgentRunner):
    """
    A test agent that simply returns the example's expected result
    """

    async def run(
        self,
        context: BrowserContext,
        example: Example,
    ) -> AgentResult:
        page = await context.new_page()
        await page.goto(example.get_static_url())
        await asyncio.sleep(0.5)
        return example.evals[0].expected


# The test suite looks for an AgentRunner instance in a variable named "agent"
agent = NullAgentRunner()
```
- Run `bananalyze ./tests/banalyzer.py` to run the test suite.
- You can also run `bananalyze .` to run all tests in the current directory.
- To run local examples (from the repo's static folder) on macOS, run `unix2dos static/*/*.mhtml` to convert the CRLF formatting in the MHTML files.
### Arguments

- `-h` or `--help`: Show help
- `--headless`: Run with Playwright headless mode
- `-id` or `--id`: Run a specific test by ID
- `-i` or `--intent`: Only run tests of a particular intent (fetch, links, etc.)
- `-c` or `--category`: Only run tests of a particular category (healthcare, manufacturing, software, etc.)
- `-n` or `--n`: Number of test workers to use (the default is 1)
- `-skip` or `--skip`: A comma-separated list of test IDs to skip
- `-t` or `--type`: Only run tests of a particular type (links, fetch, etc.)
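For example, assuming the flags compose as listed above, running all healthcare fetch tests headlessly across four workers might look like:

```bash
bananalyze . --headless -i fetch -c healthcare -n 4
```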
## Contributing

### Running the server

The project has a basic FastAPI server to expose example data. You can run it with the following commands:

```bash
cd server
poetry run uvicorn server:app --reload
```
Then navigate to http://127.0.0.1:8000/api/docs in your browser to see the API docs.
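Once the server is running, you can also query it from Python. The route below is purely hypothetical; check the docs page above for the actual endpoints:

```python
import requests

# Hypothetical endpoint -- the real routes are listed at /api/docs
response = requests.get("http://127.0.0.1:8000/api/examples")
print(response.json())
```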
### Adding examples

All current examples have been manually added by running the `fetch.ipynb` notebook at the root of this project. This notebook loads a site with Playwright and uses the Chrome DevTools Protocol to save the page as an MHTML file.
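For reference, below is a minimal sketch of that capture flow, assuming the DevTools `Page.captureSnapshot` method is used; it illustrates the approach rather than the notebook's exact code:

```python
import asyncio

from playwright.async_api import async_playwright


async def save_mhtml(url: str, path: str) -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        # Page.captureSnapshot is the Chrome DevTools Protocol method that
        # serializes the full page (including subresources) as MHTML
        cdp = await page.context.new_cdp_session(page)
        snapshot = await cdp.send("Page.captureSnapshot", {"format": "mhtml"})
        # Preserve the file's CRLF line endings when writing to disk
        with open(path, "w", newline="") as f:
            f.write(snapshot["data"])
        await browser.close()


asyncio.run(save_mhtml("https://example.com", "example.mhtml"))
```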
## Roadmap

### Launch

### Features

### Dataset updates
## Citations

```bibtex
@misc{reworkd2023bananalyzer,
  title        = {Bananalyzer},
  author       = {Asim Shrestha and Adam Watkins and Rohan Pandey and Srijan Subedi and Sunshine},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/bananalyzer}
}
```