
vlllm
vlllm is a Python utility package designed to simplify and accelerate text generation tasks using the powerful vLLM library. It offers a convenient interface for batch processing, chat templating, multiple sampling strategies, and multi-GPU inference with tensor and pipeline parallelism, all wrapped in an easy-to-use generate function with multiprocessing support.
Input formats: Each prompt can be a plain string or a list of chat messages (e.g. [{"role": "user", "content": "Hello!"}]).
Multiple samples (n): Generate multiple completions per prompt.
  use_sample=False: Duplicates input prompts n times for generation.
  use_sample=True: Uses vLLM's internal sampling parameter (SamplingParams(n=n)) to generate n completions.
Multiprocessing (worker_num): Distribute generation tasks across multiple CPU worker processes, each potentially managing its own vLLM instance and GPU(s).
Tensor parallelism (tp or gpu_assignments): Configure tensor parallelism for vLLM instances within each worker.
Pipeline parallelism (pp): Supports vLLM's pipeline parallelism (requires pp > 1 and uses distributed_executor_backend="ray").
Chunking (chunk_size): Control the maximum number of prompts processed by a vLLM engine in a single call, useful for managing memory and very large datasets.

pip install vlllm

from vlllm import generate
# Example data with string prompts
data = [
    {"prompt": "Write a story about a dragon"},
    {"prompt": "Explain quantum computing"},
] * 1000
# Basic usage
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=2,  # Use 2 worker processes
    tp=1,  # 1 GPU per worker
)
# Each item in results will have a new 'results' field with the generated text
print(results[0]["results"])

model_id (str): Model identifier or path to load
data (List[Dict]): List of dictionaries containing prompts/messages
message_key (str, default: "prompt"): Key in each dictionary containing the prompt or messages
system (str, optional): Global system prompt to prepend to all messages
result_key (str, default: "results"): Key name for storing generation results

The package handles two input formats:
If data[i][message_key] is a string, it's automatically converted to a chat message format.
If data[i][message_key] is a list, it's treated as a chat conversation with roles and content.
When a system prompt is provided, it is prepended to all messages.
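
The input handling described above can be pictured with a small sketch. `to_messages` is a hypothetical helper and not part of vlllm's API, and placing the system prompt as a leading system-role message is an assumption about what "prepend" means here.

```python
from typing import Dict, List, Optional, Union

Message = Dict[str, str]


def to_messages(
    value: Union[str, List[Message]],
    system: Optional[str] = None,
) -> List[Message]:
    """Normalize a prompt string or a chat message list into a chat message list."""
    if isinstance(value, str):
        # A bare string becomes a single user turn.
        messages = [{"role": "user", "content": value}]
    else:
        # Copy so the caller's list is not mutated.
        messages = list(value)
    if system is not None:
        # Assumption: the global system prompt goes first as a system-role message.
        messages = [{"role": "system", "content": system}] + messages
    return messages


print(to_messages("Hello!", system="You are a helpful assistant."))
```

Both input shapes end up as the same structure, so the rest of a pipeline only has to deal with message lists.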

n (int, default: 1): Number of samples to generate per prompt
use_sample (bool, default: False):
  False: Duplicates each prompt n times in the generation list
  True: Uses vLLM's native SamplingParams(n=n) for efficient sampling
temperature (float, default: 0.7): Sampling temperature
max_output_len (int, default: 1024): Maximum tokens to generate per sample
n=1: The result_key field contains a single string
n>1: The result_key field contains a list of strings

worker_num (int, default: 1): Number of worker processes
tp (int, default: 1): Tensor parallel size per worker
pp (int, default: 1): Pipeline parallel size (pp > 1 requires worker_num=1)
gpu_assignments (List[List[int]], optional): Custom GPU assignments per worker
chunk_size (int, optional): Maximum items per generation batch

max_model_len (int, default: 4096): Maximum model sequence length
gpu_memory_utilization (float, default: 0.90): Target GPU memory usage
dtype (str, default: "auto"): Model data type
trust_remote_code (bool, default: True): Whether to trust remote code

# Data with chat message format
data = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }
] * 100
# Generate 3 different responses per prompt
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    message_key="messages",  # Specify the key containing messages
    system="You are a helpful assistant.",  # Global system prompt
    n=3,  # Generate 3 samples
    use_sample=True,  # Use vLLM's native sampling
    temperature=0.8,
    worker_num=2,
    tp=2,  # Use 2 GPUs per worker
)
# results[0]["results"] will be a list of 3 different responses
for i, response in enumerate(results[0]["results"]):
    print(f"Response {i+1}: {response}")
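
For the other mode, use_sample=False, each prompt is duplicated n times and the outputs have to be folded back into one entry per prompt. The sketch below illustrates that idea with stand-in strings instead of a model. `duplicate_prompts` and `regroup` are hypothetical names, and keeping copies of the same prompt adjacent is an assumption, not a description of vlllm's internals.

```python
from typing import List, Union


def duplicate_prompts(prompts: List[str], n: int) -> List[str]:
    """Repeat each prompt n times, keeping copies of the same prompt adjacent."""
    return [p for p in prompts for _ in range(n)]


def regroup(outputs: List[str], n: int) -> List[Union[str, List[str]]]:
    """Fold a flat list of outputs back into one entry per original prompt."""
    if n == 1:
        # n=1: each result is a single string.
        return list(outputs)
    # n>1: each result is a list of n strings.
    return [outputs[i:i + n] for i in range(0, len(outputs), n)]


prompts = ["A", "B"]
flat = duplicate_prompts(prompts, 3)
# Stand-in for generation: tag each copy with its position in the flat list.
fake_outputs = [f"{p}-{i}" for i, p in enumerate(flat)]
print(regroup(fake_outputs, 3))
# → [['A-0', 'A-1', 'A-2'], ['B-3', 'B-4', 'B-5']]
```

The n=1 branch mirrors the documented result shape: a single string per item rather than a one-element list.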
# Large dataset
data = [{"prompt": f"Question {i}"} for i in range(10000)]
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=4,
    chunk_size=100,  # Process in chunks of 100 items
    tp=1,
    max_output_len=512,
)
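
To make the effect of chunk_size concrete, here is a minimal sketch of splitting a dataset into fixed-size batches. `iter_chunks` is a hypothetical helper written for illustration; how vlllm distributes chunks across workers is not shown and not assumed.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


data = [{"prompt": f"Question {i}"} for i in range(250)]
sizes = [len(chunk) for chunk in iter_chunks(data, 100)]
print(sizes)  # → [100, 100, 50]
```

Only the last chunk can be shorter than chunk_size, so peak memory per engine call is bounded by the chunk size rather than by the dataset size.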
# Assign specific GPUs to each worker
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=2,
    gpu_assignments=[[0, 1], [2, 3]],  # Worker 0 uses GPU 0,1; Worker 1 uses GPU 2,3
)
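
When gpu_assignments is omitted, each worker still needs tp GPUs of its own. The sketch below shows one plausible default layout, consecutive GPU ids per worker, together with the matching CUDA_VISIBLE_DEVICES strings. Both helper names are hypothetical, and the contiguous layout and the use of CUDA_VISIBLE_DEVICES are assumptions rather than documented vlllm behavior.

```python
from typing import List


def default_gpu_assignments(worker_num: int, tp: int) -> List[List[int]]:
    """Give each worker tp consecutive GPU ids (an assumed default layout)."""
    return [list(range(w * tp, (w + 1) * tp)) for w in range(worker_num)]


def visible_devices(gpu_ids: List[int]) -> str:
    """Format GPU ids as a CUDA_VISIBLE_DEVICES value."""
    return ",".join(str(g) for g in gpu_ids)


assignments = default_gpu_assignments(worker_num=2, tp=2)
print(assignments)  # → [[0, 1], [2, 3]]
print([visible_devices(ids) for ids in assignments])  # → ['0,1', '2,3']
```

With worker_num=2 and tp=2 this reproduces the explicit assignment from the example above and occupies worker_num * tp = 4 GPUs.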
# Use pipeline parallelism (requires worker_num=1)
results = generate(
    model_id="meta-llama/Llama-2-70b-chat-hf",
    data=data,
    worker_num=1,
    pp=4,  # 4-way pipeline parallelism
    tp=2,  # 2-way tensor parallelism
)
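
Because pipeline parallelism requires worker_num=1, it can be useful to validate a configuration before loading a large model. The following is a minimal sketch of such a check. `check_parallel_config` is a hypothetical helper, and the error messages are illustrative rather than what vlllm itself raises.

```python
def check_parallel_config(worker_num: int, tp: int = 1, pp: int = 1) -> None:
    """Raise ValueError if the combination of settings is not allowed."""
    if worker_num < 1 or tp < 1 or pp < 1:
        raise ValueError("worker_num, tp and pp must all be >= 1")
    if pp > 1 and worker_num != 1:
        raise ValueError("pp > 1 requires worker_num=1")


# The pipeline example above passes the check.
check_parallel_config(worker_num=1, tp=2, pp=4)

# Combining several workers with pipeline parallelism is rejected.
try:
    check_parallel_config(worker_num=2, tp=2, pp=2)
except ValueError as exc:
    print(exc)  # → pp > 1 requires worker_num=1
```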

When pp > 1, worker_num must be 1.
Total GPUs used: worker_num * tp (when not using custom assignments).

from vlllm import generate
import json
# Load your dataset
with open("questions.jsonl", "r") as f:
    data = [json.loads(line) for line in f]
# Configure generation
results = generate(
    model_id="meta-llama/Llama-2-13b-chat-hf",
    data=data,
    message_key="question",  # Your data has questions in 'question' field
    system="Answer concisely and accurately.",
    n=1,
    temperature=0.1,  # Low temperature for consistency
    worker_num=4,  # 4 parallel workers
    tp=2,  # 2 GPUs per worker
    chunk_size=50,  # Process 50 items at a time
    max_output_len=256,
    result_key="answer",  # Store results in 'answer' field
)
# Save results
with open("answers.jsonl", "w") as f:
    for item in results:
        f.write(json.dumps(item) + "\n")

Apache-2.0 License