VisionAgent
VisionAgent is the Visual AI Pilot from LandingAI. Submit a prompt and image to VisionAgent, and the app selects the best models for your tasks. VisionAgent then generates code so that you can build vision-enabled apps in minutes.
How to Use This VisionAgent Library
If you are a seasoned developer who wants to build locally with this library and have more control, we recommend this setup. Otherwise, you can use the VisionAgent web app.
Get Your VisionAgent API Key
The most important step is to sign up and obtain your API key.
Other Prerequisites
Why do I need Anthropic and Google API Keys?
VisionAgent uses models from Anthropic and Google to respond to prompts and generate code.
When you run the web-based version of VisionAgent, the app uses the LandingAI API keys to access these models.
When you run VisionAgent programmatically, the app needs your API keys to access the Anthropic and Google models. This ensures that the projects you run with VisionAgent aren’t constrained by the rate limits on LandingAI’s accounts, and it prevents many users from overloading those accounts.
Anthropic and Google each have their own rate limits and paid tiers. Refer to their documentation and pricing to learn more.
NOTE: In VisionAgent v1.0.2 and earlier, VisionAgent was powered by Anthropic Claude-3.5 and OpenAI o1. If you use one of these VisionAgent versions, you must get an OpenAI API key and set it as an environment variable.
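For those older versions, you would typically set the key the same way as the other variables shown below; OPENAI_API_KEY is the conventional variable name used by OpenAI clients, so confirm it against the documentation for your installed version:
export OPENAI_API_KEY="your-api-key"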
Get an Anthropic API Key
Get a Gemini API Key
Installation
pip install vision-agent
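If you want to keep the dependencies isolated, you can optionally install the library into a virtual environment first (the commands below assume a Unix-like shell):
python -m venv .venv
source .venv/bin/activate
pip install vision-agent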
Quickstart: Prompt VisionAgent
Follow this quickstart to learn how to prompt VisionAgent. After learning the basics, customize your prompt and workflow to meet your needs.
Set API Keys as Environment Variables
Before running VisionAgent code, you must set the Anthropic, Gemini, and VisionAgent API keys as environment variables. Each operating system offers different ways to do this.
Here is the code for setting the variables:
export VISION_AGENT_API_KEY="your-api-key"
export ANTHROPIC_API_KEY="your-api-key"
export GEMINI_API_KEY="your-api-key"
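If you prefer to keep everything in one script, you can also set the same variables from Python before running the agent. This is a minimal sketch using the standard os module, with placeholder values you would replace with your real keys:
import os

# Set the API keys as environment variables from Python (placeholders shown)
os.environ["VISION_AGENT_API_KEY"] = "your-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
os.environ["GEMINI_API_KEY"] = "your-api-key"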
Sample Script: Prompt VisionAgent
To use VisionAgent to generate code, use the following script as a starting point:
from vision_agent.agent import VisionAgentCoderV2
from vision_agent.models import AgentMessage

# Create the coder agent; verbose=True prints the plan and progress
agent = VisionAgentCoderV2(verbose=True)

# Ask the agent to generate code for the prompt, using a local image
code_context = agent.generate_code(
    [
        AgentMessage(
            role="user",
            content="Describe the image",
            media=["friends.jpg"]
        )
    ]
)

# Save the generated code and its test case to a file
with open("generated_code.py", "w") as f:
    f.write(code_context.code + "\n" + code_context.test)
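Because the script writes both the generated code and its test case to generated_code.py, you can then run that file directly to try them out:
python generated_code.py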
What to Expect When You Prompt VisionAgent
When you submit a prompt, VisionAgent performs the following tasks:
- Generates a plan for the code generation task. If verbose output is on, VisionAgent displays the numbered steps of this plan.
- Generates code and a test case based on the plan.
- Tests the generated code with the test case. If the test case fails, VisionAgent iterates on the code generation process until the test case passes.
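In code, the outcome of this loop is the context object returned by generate_code. Continuing the quickstart script above, you can inspect the generated pieces directly (other attributes may exist depending on the library version):
# Inspect what VisionAgent produced in the quickstart script
print(code_context.code)  # the generated vision code
print(code_context.test)  # the test case used to validate the code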
Example: Count Cans in an Image
Check out this Jupyter Notebook to learn how to use VisionAgent to count the number of cans in an image:
Count Cans in an Image
Use Specific Tools from VisionAgent
The VisionAgent library includes a set of tools, which are standalone models or functions that complete specific tasks. When you prompt VisionAgent, VisionAgent selects one or more of these tools to complete the tasks outlined in your prompt.
For example, if you prompt VisionAgent to “count the number of dogs in an image”, VisionAgent might use the florence2_object_detection tool to detect all the dogs, and then the countgd_object_detection tool to count the number of detected dogs.
After installing the VisionAgent library, you can also use the tools in your own scripts. For example, if you’re writing a script to track objects in videos, you can call the owlv2_sam2_video_tracking function. In other words, you can use the VisionAgent tools outside of simply prompting VisionAgent.
The tools are in the vision_agent.tools API.
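As a rough sketch of the dog-counting example above, you could call one of those tools yourself. The call below mirrors the countgd_object_detection usage shown in the next section and uses a hypothetical dogs.jpg image; check the vision_agent.tools API for the definitive signatures and return types:
import vision_agent.tools as T

# A minimal sketch: detect dogs and report how many entries were returned
image = T.load_image("dogs.jpg")  # hypothetical image file
dets = T.countgd_object_detection("dog", image)
print(f"Counted {len(dets)} dogs")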
Sample Script: Use Specific Tools for Images
You can call the countgd_object_detection function to count the number of objects in an image. To do this, you could run this script:
import vision_agent.tools as T
import matplotlib.pyplot as plt

# Load the image and detect every person in it
image = T.load_image("people.png")
dets = T.countgd_object_detection("person", image)

# Draw the detections on the image, save it, and display the result
viz = T.overlay_bounding_boxes(image, dets)
T.save_image(viz, "people_detected.png")
plt.imshow(viz)
plt.show()
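A quick follow-up is to report how many people were found. This assumes the detections come back as a list with one entry per detected object; printing a single entry shows the exact fields it contains:
# Report the number of detected people and inspect one detection
print(f"Detected {len(dets)} people")
if dets:
    print(dets[0])  # shows the fields of a single detection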
Sample Script: Use Specific Tools for Videos
You can call the countgd_sam2_video_tracking function to track people in a video and pair it with the extract_frames_and_timestamps function to return the frames and timestamps in which those people appear. To do this, you could run this script:
import vision_agent.tools as T

# Extract the individual frames (and their timestamps) from the video
frames_and_ts = T.extract_frames_and_timestamps("people.mp4")
frames = [f["frame"] for f in frames_and_ts]

# Track the people across the frames, overlay the masks, and save the result
tracks = T.countgd_sam2_video_tracking("person", frames)
viz = T.overlay_segmentation_masks(frames, tracks)
T.save_video(viz, "people_detected.mp4")
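Since each extracted entry carries both the frame and its timestamp, you can also report when people appear. This sketch assumes each entry exposes a "timestamp" field alongside "frame" and that the tracking results contain one (possibly empty) list per frame, so verify both against the vision_agent.tools API:
# Print the timestamps of frames where at least one person was tracked
for entry, frame_tracks in zip(frames_and_ts, tracks):
    if frame_tracks:
        print(f"Person visible at {entry['timestamp']} seconds")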
Use Other LLM Providers
VisionAgent uses Anthropic Claude 3.7 Sonnet and Gemini Flash 2.0 Experimental (gemini-2.0-flash-exp) to respond to prompts and generate code. We’ve found that these models provide the best performance for VisionAgent and are available on the free tiers (with rate limits) from their providers.
If you prefer to use only one of these models or a different set of models, you can change the selected LLM provider in this file: vision_agent/configs/config.py. You must also add the provider’s API key as an environment variable.
For example, if you want to use only the Anthropic model, run this command:
cp vision_agent/configs/anthropic_config.py vision_agent/configs/config.py
Or, you can manually enter the model details in the config.py file. For example, if you want to change the planner model from Anthropic to OpenAI, you would replace this code:
planner: Type[LMM] = Field(default=AnthropicLMM)
planner_kwargs: dict = Field(
    default_factory=lambda: {
        "model_name": "claude-3-7-sonnet-20250219",
        "temperature": 0.0,
        "image_size": 768,
    }
)
with this code:
planner: Type[LMM] = Field(default=OpenAILMM)
planner_kwargs: dict = Field(
    default_factory=lambda: {
        "model_name": "gpt-4o-2024-11-20",
        "temperature": 0.0,
        "image_size": 768,
        "image_detail": "low",
    }
)
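If you switch a role to OpenAILMM as shown above, remember to also set the matching API key as an environment variable; OPENAI_API_KEY is the conventional variable name, so confirm it against the config you are using:
export OPENAI_API_KEY="your-api-key"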
Resources