GPU Pod Manager
Quickly deploy LLMs on GPU pods from Prime Intellect, Vast.ai, DataCrunch, AWS, etc., for local coding agents and AI assistants.
Installation
npm install -g @badlogic/pi
Or run directly with npx:
npx @badlogic/pi
What This Is
A simple CLI tool that automatically sets up and manages vLLM deployments on GPU pods. Start from a clean Ubuntu pod and have multiple models running in minutes. A GPU pod is defined as an Ubuntu machine with root access, one or more GPUs, and CUDA drivers installed. It is aimed at individuals who are limited by local hardware and want to experiment with large open-weight LLMs in their coding assistant workflows.
Key Features:
- Zero to LLM in minutes - Automatically installs vLLM and all dependencies on clean pods
- Multi-model management - Run multiple models concurrently on a single pod
- Smart GPU allocation - Models are assigned round robin to available GPUs on multi-GPU pods
- Tensor parallelism - Run large models across multiple GPUs with --all-gpus
- OpenAI-compatible API - Drop-in replacement for OpenAI API clients with automatic tool/function calling support
- No complex setup - Just SSH access, no Kubernetes or Docker required
- Privacy first - vLLM telemetry disabled by default
Limitations:
- OpenAI-compatible endpoints are exposed to the public internet without authentication (yolo)
- Requires manual pod creation via Prime Intellect, Vast.ai, AWS, etc.
- Assumes Ubuntu 22 image when creating pods
What this is not
- A provisioning manager for pods. You need to provision pods with the respective provider yourself.
- Highly optimized deployment infrastructure tuned for maximum performance. This is for individuals who want to quickly spin up large open-weight models for local LLM workloads.
Requirements
- Node.js 14+ - To run the CLI tool on your machine
- HuggingFace Token - Required for downloading models (get one at https://huggingface.co/settings/tokens)
- Prime Intellect Account - Sign up at https://app.primeintellect.ai
- GPU Pod - At least one running pod with:
- Ubuntu 22+ image (selected when creating pod)
- SSH access enabled
- Clean state (no manual vLLM installation needed)
- Note: B200 GPUs require PyTorch nightly with CUDA 12.8+ (automatically installed if detected). However, vLLM may need to be built from source for full compatibility.
Quick Start
export HF_TOKEN=your_huggingface_token
pi setup my-pod-name "ssh root@135.181.71.41 -p 22"
pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 20%
pi prompt phi3 "What is 2+2?"
pi start Qwen/Qwen2.5-7B-Instruct --name qwen --memory 30%
pi list
export OPENAI_BASE_URL='http://135.181.71.41:8001/v1'
export OPENAI_API_KEY='dummy'
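Once the exports above are in place, any OpenAI-compatible client can talk to the pod. A minimal stdlib-only Python sketch (host, port, and model name are the Quick Start values and are assumptions for your setup; depending on how vLLM registers the model, the model field may need the full HuggingFace name rather than the short name):

```python
import json
from urllib import request

# Endpoint of the vLLM instance started above (adjust host/port to your pod).
BASE_URL = "http://135.181.71.41:8001/v1"

payload = {
    "model": "phi3",  # the short name given via --name (may need the full HF name)
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 64,
}

req = request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer dummy"},  # any key works, vLLM ignores it
)
# Uncomment once your pod is reachable:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```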
How It Works
- Automatic Setup: When you run pi setup, it:
- Connects to your clean Ubuntu pod
- Installs Python, CUDA drivers, and vLLM
- Configures HuggingFace tokens
- Sets up the model manager
- Model Management: Each pi start command:
- Automatically finds an available GPU (on multi-GPU systems)
- Allocates the specified memory fraction
- Starts a separate vLLM instance on a unique port accessible via the OpenAI API protocol
- Manages logs and process lifecycle
- Multi-GPU Support: On pods with multiple GPUs:
- Individual models are automatically distributed across available GPUs
- Large models can use tensor parallelism with --all-gpus
- View GPU assignments with pi list
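The port and memory bookkeeping behind pi start can be pictured with a small sketch (hypothetical code, not the tool's actual implementation; the 8001 port base matches the Architecture Notes further down):

```python
# Hypothetical sketch: each model gets the next free port starting at 8001,
# plus its declared GPU-memory fraction.
def allocate(models, name, memory_fraction):
    used_ports = {m["port"] for m in models.values()}
    port = 8001
    while port in used_ports:
        port += 1
    models[name] = {"port": port, "memory": memory_fraction}
    return port

models = {}
allocate(models, "phi3", 0.20)   # gets port 8001
allocate(models, "qwen", 0.30)   # gets port 8002
```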
Commands
Pod Management
The tool supports managing multiple Prime Intellect pods from a single machine. Each pod is identified by a name you choose (e.g., "prod", "dev", "h200"). While all your pods continue running independently, the tool operates on one "active" pod at a time - all model commands (start, stop, list, etc.) are directed to this active pod. You can easily switch which pod is active to manage models on different machines.
pi setup <pod-name> "<ssh_command>"
pi pods
pi pod <pod-name>
pi pod remove <pod-name>
pi shell
Model Management
Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with --name).
pi list
pi search <query>
pi start <model> [options]
--name <name>
--context <size>
--memory <percent>
--all-gpus
--vllm-args
pi stop [name]
pi logs <name>
pi prompt <name> "message"
Examples
Search for models
pi search codellama
pi search deepseek
pi search qwen
Note: vLLM does not support quantization formats such as GGUF; check the vLLM documentation for supported model formats.
A100 80GB scenarios
pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 30%
pi start meta-llama/Llama-3.1-8B-Instruct --name llama8b --memory 50%
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --memory 90%
pi start Qwen/Qwen2.5-Coder-1.5B --name coder1 --memory 15%
pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 15%
Understanding Context and Memory
Context Window vs Output Tokens
Models are loaded with their default context length. You can use the context parameter to specify a lower or higher context length. The context parameter sets the total token budget for input + output combined:
- Starting a model with context=8k means 8,192 tokens total
- If your prompt uses 6,000 tokens, you have 2,192 tokens left for the response
- Each OpenAI API request to the model can specify max_output_tokens to control output length within this budget
Example:
pi start meta-llama/Llama-3.1-8B --name llama --context 32k --memory 50%
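The budget arithmetic from the bullets above, spelled out:

```python
# Total context window is shared between prompt and response.
context_window = 8 * 1024      # --context 8k -> 8,192 tokens
prompt_tokens = 6000           # tokens consumed by the request

# Room left for the model's answer; max_output_tokens must fit here.
available_for_output = context_window - prompt_tokens
print(available_for_output)    # 2192
```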
GPU Memory and Concurrency
vLLM pre-allocates GPU memory controlled by gpu_fraction. This matters for coding agents that spawn sub-agents, as each connection needs memory.
Example: On an A100 80GB with a 7B model (FP16, ~14GB weights):
gpu_fraction=0.3 (24GB): ~10GB for KV cache → ~30-50 concurrent requests
gpu_fraction=0.5 (40GB): ~26GB for KV cache → ~50-80 concurrent requests
gpu_fraction=0.9 (72GB): ~58GB for KV cache → ~100+ concurrent requests
Models load in their native precision from HuggingFace (usually FP16/BF16). Check the model card's "Files and versions" tab and look at file sizes: 7B models are ~14GB, 13B ~26GB, 70B ~140GB. Models with a quantization format (AWQ, GPTQ) in their name use less memory but may trade off quality.
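The back-of-envelope math behind the table above (runtime overheads and PagedAttention waste are ignored, so treat the results as rough upper bounds):

```python
# Rough KV-cache budget on an 80 GB card: whatever is left of the
# pre-allocated fraction after the model weights are loaded.
TOTAL_GB = 80
WEIGHTS_GB = 14  # ~2 bytes/parameter for a 7B model in FP16

def kv_cache_gb(gpu_fraction):
    return gpu_fraction * TOTAL_GB - WEIGHTS_GB

for frac in (0.3, 0.5, 0.9):
    print(f"gpu_fraction={frac}: ~{kv_cache_gb(frac):.0f} GB for KV cache")
```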
Multi-GPU Support
For pods with multiple GPUs, the tool automatically manages GPU assignment:
Automatic GPU assignment for multiple models
pi start microsoft/Phi-3-mini-128k-instruct --memory 20%
pi start Qwen/Qwen2.5-7B-Instruct --memory 20%
pi start meta-llama/Llama-3.1-8B --memory 20%
pi list
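A sketch of what automatic GPU assignment means here (hypothetical code assuming a 4-GPU pod; the actual tool may balance differently):

```python
# Hypothetical round-robin GPU assignment: each new model goes to the GPU
# currently hosting the fewest models.
NUM_GPUS = 4

def assign_gpu(existing_assignments):
    counts = [0] * NUM_GPUS
    for gpu in existing_assignments.values():
        counts[gpu] += 1
    return counts.index(min(counts))

assignments = {}
for model in ("phi3", "qwen", "llama8b"):
    assignments[model] = assign_gpu(assignments)

print(assignments)  # {'phi3': 0, 'qwen': 1, 'llama8b': 2}
```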
Run large models across all GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --all-gpus
pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
Advanced: Custom vLLM arguments
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
--data-parallel-size 8 --enable-expert-parallel \
--tool-call-parser qwen3_coder --enable-auto-tool-choice --max-model-len 200000
pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
--tensor-parallel-size 4 --quantization fp8 --trust-remote-code
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
--tensor-parallel-size 8 --pipeline-parallel-size 2
Check GPU usage
pi ssh "nvidia-smi"
Architecture Notes
- Multi-Pod Support: The tool stores multiple pod configurations in ~/.pi_config, with one active pod at a time.
- Port Allocation: Each model runs on a separate port (8001, 8002, etc.) allowing multiple models on one GPU.
- Memory Management: vLLM uses PagedAttention for efficient memory use with less than 4% waste.
- Model Caching: Models are downloaded once and cached on the pod.
- Tool Parser Auto-Detection: The tool automatically selects the appropriate tool parser based on the model:
- Qwen models: hermes (Qwen3-Coder: qwen3_coder if available)
- Mistral models: mistral with optimized chat template
- Llama models: llama3_json or llama4_pythonic based on version
- InternLM models: internlm
- Phi models: Tool calling disabled by default (no compatible tokens)
- Override with --vllm-args --tool-call-parser <parser> --enable-auto-tool-choice
Tool Calling (Function Calling)
Tool calling allows LLMs to request the use of external functions/APIs, but it's a complex feature with many caveats:
The Reality of Tool Calling
- Model Compatibility: Not all models support tool calling, even if they claim to. Many models lack the special tokens or training needed for reliable tool parsing.
- Parser Mismatches: Different models use different tool calling formats:
- Hermes format (XML-like)
- Mistral format (specific JSON structure)
- Llama format (JSON-based or pythonic)
- Custom formats for each model family
- Common Issues:
- "Could not locate tool call start/end tokens" - Model doesn't have required special tokens
- Malformed JSON/XML output - Model wasn't trained for the parser format
- Tool calls when you don't want them - Model overeager to use tools
- No tool calls when you need them - Model doesn't understand when to use tools
How We Handle It
The tool automatically detects the model and tries to use an appropriate parser:
- Qwen models: hermes parser (Qwen3-Coder uses qwen3_coder)
- Mistral models: mistral parser with custom template
- Llama models: llama3_json or llama4_pythonic based on version
- Phi models: Tool calling disabled (no compatible tokens)
Your Options
- Let auto-detection handle it (default):
pi start meta-llama/Llama-3.1-8B-Instruct --name llama
- Force a specific parser (if you know better):
pi start model/name --name mymodel --vllm-args \
--tool-call-parser mistral --enable-auto-tool-choice
- Disable tool calling entirely (most reliable):
pi start model/name --name mymodel --vllm-args \
--disable-tool-call-parser
- Handle tools in your application (recommended for production):
- Send regular prompts asking the model to output JSON
- Parse the response in your code
- More control, more reliable
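Handling tools in your application can be as simple as asking for JSON and parsing it defensively. A sketch (the prompt convention and tool name are made up for illustration):

```python
import json

def extract_tool_call(response_text):
    """Try to pull a JSON tool call out of a model response.

    Returns the parsed dict, or None if the model replied in plain prose.
    """
    # Models often wrap JSON in surrounding prose; grab the outermost braces.
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Typical output after prompting: 'Reply only with JSON: {"tool": ..., "args": ...}'
reply = 'Sure! {"tool": "search", "args": {"query": "vLLM docs"}}'
call = extract_tool_call(reply)
```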
Best Practices
- Test first: Try a simple tool call to see if it works with your model
- Have a fallback: Be prepared for tool calling to fail
- Consider alternatives: Sometimes a well-crafted prompt works better than tool calling
- Read the docs: Check the model card for tool calling examples
- Monitor logs: Check ~/.vllm_logs/ for parser errors
Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
Troubleshooting
- OOM Errors: Reduce the GPU memory fraction (--memory) or use a smaller model
- Slow Inference: Likely too many concurrent requests; try increasing the GPU memory fraction
- Connection Refused: Check pod is running and port is correct
- HF Token Issues: Ensure HF_TOKEN is set before running setup
- Access Denied: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
- Tool Calling Errors: See the Tool Calling section above - consider disabling it or using a different model