
OpenLLM allows developers to run any open-source LLM (Llama 3.3, Qwen2.5, Phi3, and more) or custom model as an OpenAI-compatible API with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployments with Docker, Kubernetes, and BentoCloud.
Understand the design philosophy of OpenLLM.
Run the following commands to install OpenLLM and explore it interactively.
pip install openllm # or pip3 install openllm
openllm hello
OpenLLM supports a wide range of state-of-the-art open-source LLMs. You can also add a model repository to run custom models with OpenLLM.
Model | Parameters | Required GPU (memory x count) | Start a Server |
---|---|---|---|
deepseek | r1 | 80Gx16 | openllm serve deepseek:r1 |
gemma2 | 2b | 12G | openllm serve gemma2:2b |
hermes-3 | deep-llama3-8b-91e3 | 80G | openllm serve hermes-3:deep-llama3-8b-91e3 |
jamba1.5 | large-8b32 | 80Gx8 | openllm serve jamba1.5:large-8b32 |
llama3.1 | 8b | 24G | openllm serve llama3.1:8b |
llama3.2 | 1b | 24G | openllm serve llama3.2:1b |
llama3.3 | 70b | 80Gx2 | openllm serve llama3.3:70b |
mistral | 8b | 24G | openllm serve mistral:8b |
mistral-large | 123b | 80Gx4 | openllm serve mistral-large:123b |
phi4 | 14b | 80G | openllm serve phi4:14b |
pixtral | 12b-2409 | 80G | openllm serve pixtral:12b-2409 |
qwen2.5 | 7b | 24G | openllm serve qwen2.5:7b |
qwen2.5-coder | 3b | 24G | openllm serve qwen2.5-coder:3b |
qwq | 32b | 80G | openllm serve qwq:32b |
For the full model list, see the OpenLLM models repository.
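The tags and GPU entries in the table can also be read programmatically. The sketch below is illustrative only. It assumes every tag has the form name:version, and it assumes a GPU entry such as 80Gx2 means two GPUs with 80 GB of memory each, which the table itself does not spell out. The helpers split_tag and total_gpu_memory_gb are hypothetical names, not part of OpenLLM.

```python
import re


def split_tag(tag: str) -> tuple[str, str]:
    """Split a model tag such as 'llama3.2:1b' into (name, version)."""
    name, sep, version = tag.partition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name:version', got {tag!r}")
    return name, version


def total_gpu_memory_gb(spec: str) -> int:
    """Total GPU memory in GB for a table entry.

    Assumed reading (not confirmed by the table): '24G' is one 24 GB GPU,
    and '80Gx2' is two 80 GB GPUs, i.e. 160 GB in total.
    """
    match = re.fullmatch(r"(\d+)G(?:x(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognized GPU spec: {spec!r}")
    per_gpu_gb = int(match.group(1))
    gpu_count = int(match.group(2) or 1)
    return per_gpu_gb * gpu_count
```

Under that reading, the llama3.3:70b row (80Gx2) needs 160 GB of GPU memory in total, while the 24G rows fit on a single 24 GB card.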
To start an LLM server locally, use the openllm serve command and specify the model version.
[!NOTE] OpenLLM does not store model weights. A Hugging Face token (HF_TOKEN) is required for gated models.
- Create your Hugging Face token here.
- Request access to the gated model, such as meta-llama/Llama-3.2-1B-Instruct.
- Set your token as an environment variable by running:
export HF_TOKEN=<your token>
openllm serve llama3.2:1b
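When the server is launched from a script, it can help to fail early if the token is missing. The following is a minimal sketch under that assumption: only the openllm serve command and the HF_TOKEN variable come from the steps above, while serve_command and launch are hypothetical helper names.

```python
import os
import subprocess
from typing import Mapping


def serve_command(model_tag: str, env: Mapping[str, str], gated: bool = True) -> list[str]:
    """Return the CLI invocation for serving model_tag.

    Raises RuntimeError when the model is treated as gated and
    HF_TOKEN is missing or empty in env.
    """
    if gated and not env.get("HF_TOKEN"):
        raise RuntimeError("HF_TOKEN is not set; gated models need a Hugging Face token")
    return ["openllm", "serve", model_tag]


def launch(model_tag: str) -> None:
    """Start the server; blocks while it runs and requires openllm to be installed."""
    subprocess.run(serve_command(model_tag, os.environ), check=True)


# Example (not executed here): launch("llama3.2:1b")
```

Keeping the command construction separate from the subprocess call makes the check easy to test without starting a server.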
The server will be accessible at http://localhost:3000, providing OpenAI-compatible APIs for interaction. You can call the endpoints with any framework or tool that supports OpenAI-compatible APIs. Typically, you need to specify the API host address, the model name, and an API key.
Here are some examples:
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following call to list the available models
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
from llama_index.llms.openai import OpenAI
llm = OpenAI(api_base="http://localhost:3000/v1", model="meta-llama/Llama-3.2-1B-Instruct", api_key="dummy")
...
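Because the endpoints follow the OpenAI chat completions format, they can also be called without a client library. The sketch below uses only the Python standard library. It assumes the server from the steps above is running at http://localhost:3000 and does not validate the API key (the examples above pass placeholder values). The helper names build_chat_request, extract_reply, and ask are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000/v1"  # default address from the serve step


def build_chat_request(prompt: str, model: str,
                       base_url: str = BASE_URL,
                       api_key: str = "na") -> urllib.request.Request:
    """Build a non-streaming chat completion request in the OpenAI format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of a chat completion response body."""
    return body["choices"][0]["message"]["content"]


def ask(prompt: str, model: str) -> str:
    """Send the request; this needs a running server, so it is not called here."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as response:
        return extract_reply(json.load(response))
```

With a server running, ask("Explain superconductors like I'm five years old", "meta-llama/Llama-3.2-1B-Instruct") would return the full reply as one string instead of streaming it.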
OpenLLM provides a chat UI at the /chat endpoint of the launched LLM server, at http://localhost:3000/chat.
To start a chat conversation in the CLI, use the openllm run command and specify the model version.
openllm run llama3:8b
A model repository in OpenLLM represents a catalog of available LLMs that you can run. OpenLLM provides a default model repository that includes the latest open-source LLMs like Llama 3, Mistral, and Qwen2, hosted at this GitHub repository. To see all available models from the default and any added repository, use:
openllm model list
To ensure your local list of models is synchronized with the latest updates from all connected repositories, run:
openllm repo update
To review a model’s information, run:
openllm model get llama3.2:1b
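The update and list commands above can be chained in a small script, for example before a scheduled job picks a model. This is a sketch, not part of OpenLLM: refresh_and_list and its run parameter are hypothetical, and the output of the commands is not parsed because its format is not described here.

```python
import subprocess

# Commands taken from the text above, in the order they should run.
COMMANDS = (
    ("openllm", "repo", "update"),
    ("openllm", "model", "list"),
)


def refresh_and_list(run=None) -> list[list[str]]:
    """Run each command in order and return the commands that were executed.

    By default each command goes through subprocess.run with check=True,
    so a failing step raises and the later steps are skipped. Pass a
    different callable as run to dry-run or log the commands instead.
    """
    if run is None:
        def run(cmd):
            subprocess.run(cmd, check=True)
    executed = []
    for cmd in COMMANDS:
        run(list(cmd))
        executed.append(list(cmd))
    return executed
```

Calling refresh_and_list(run=print) prints the two commands without executing them, which is a quick way to check the sequence.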
You can contribute to the default model repository by adding new models that others can use. This involves creating and submitting a Bento of the LLM. For more information, check out this example pull request.
You can add your own repository to OpenLLM with custom models. To do so, follow the format in the default OpenLLM model repository with a bentos directory to store custom LLMs. You need to build your Bentos with BentoML and submit them to your model repository.
First, prepare your custom models in a bentos directory following the guidelines provided by BentoML to build Bentos. Check out the default model repository for an example and read the Developer Guide for details.
Then, register your custom model repository with OpenLLM:
openllm repo add <repo-name> <repo-url>
Note: Currently, OpenLLM only supports adding public repositories.
OpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework, and BentoCloud, an AI inference platform for enterprise AI teams. BentoCloud provides fully managed infrastructure optimized for LLM inference, with autoscaling, model orchestration, observability, and more, allowing you to run any AI model in the cloud.
Sign up for BentoCloud for free and log in. Then, run openllm deploy to deploy a model to BentoCloud:
openllm deploy llama3.2:1b --env HF_TOKEN
[!NOTE] If you are deploying a gated model, make sure to set HF_TOKEN as an environment variable.
Once the deployment is complete, you can run model inference on the BentoCloud console.
OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 Join our Slack community!
As an open-source project, we welcome contributions of all kinds, such as new features, bug fixes, and documentation. Here are some of the ways to contribute:
This project uses the following open-source projects:
We are grateful to the developers and contributors of these projects for their hard work and dedication.