MLX Omni Server


MLX Omni Server is a local inference server powered by Apple's MLX framework, specifically designed for Apple Silicon (M-series) chips. It implements
OpenAI-compatible API endpoints, enabling seamless integration with existing OpenAI SDK clients while leveraging the power of local ML inference.
Features
- 🚀 Apple Silicon Optimized: Built on MLX framework, optimized for M1/M2/M3/M4 series chips
- 🔌 OpenAI API Compatible: Drop-in replacement for OpenAI API endpoints
- 🎯 Multiple AI Capabilities:
  - Audio Processing (TTS & STT)
  - Chat Completion
  - Image Generation
  - Embeddings
- ⚡ High Performance: Local inference with hardware acceleration
- 🔐 Privacy-First: All processing happens locally on your machine
- 🛠 SDK Support: Works with official OpenAI SDK and other compatible clients
Supported API Endpoints
The server implements OpenAI-compatible endpoints:
- Chat completions: `/v1/chat/completions`
  - ✅ Chat
  - ✅ Tools, Function Calling (see the sketch after this list)
  - ✅ Structured Output
  - ✅ LogProbs
  - 🚧 Vision
- Audio
  - ✅ `/v1/audio/speech` - Text-to-Speech
  - ✅ `/v1/audio/transcriptions` - Speech-to-Text
- Models
  - ✅ `/v1/models` - List models
  - ✅ `/v1/models/{model}` - Retrieve or delete a model
- Images
  - ✅ `/v1/images/generations` - Image generation
- Embeddings
  - ✅ `/v1/embeddings` - Create embeddings for text
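For example, a tool-calling request goes through the standard OpenAI SDK. The sketch below is illustrative rather than definitive: the `get_current_weather` tool and its schema are hypothetical, and whether a given model actually emits tool calls depends on the model you load.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server (default port 10240).
client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# Hypothetical tool definition; the schema follows OpenAI's function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
)

# If the model chose to call the tool, the call appears on the message.
for tool_call in response.choices[0].message.tool_calls or []:
    print(tool_call.function.name, tool_call.function.arguments)
```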
Quick Start
Follow these simple steps to get started with MLX Omni Server:

- Install the package:

```bash
pip install mlx-omni-server
```

- Start the server:

```bash
mlx-omni-server
```

- Run a simple chat example using curl:

```bash
curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3-1b-it-4bit-DWQ",
    "messages": [
      {
        "role": "user",
        "content": "What can you do?"
      }
    ]
  }'
```
That's it! You're now running AI locally on your Mac. See Advanced Usage for more examples.
Server Options
```bash
# Start with default settings (port 10240)
mlx-omni-server

# Start on a custom port
mlx-omni-server --port 8000

# View all available options
mlx-omni-server --help
```
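If you start the server on a non-default port, point the client's `base_url` at that port. A minimal sketch, assuming the server was launched with `--port 8000` as above:

```python
from openai import OpenAI

# Match base_url to whatever port the server was started with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
```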
Basic Client Setup
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",  # point the client at the local server
    api_key="not-needed"                   # the local server does not require a real API key
)

response = client.chat.completions.create(
    model="mlx-community/gemma-3-1b-it-4bit-DWQ",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

print(response.choices[0].message.content)
```
Advanced Usage
MLX Omni Server supports multiple ways of interaction and various AI capabilities. Here's how to use each:
API Usage Options
MLX Omni Server provides flexible ways to interact with AI capabilities:
REST API
Access the server directly using HTTP requests:
```bash
# Chat completion
curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3-1b-it-4bit-DWQ",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List available models
curl http://localhost:10240/v1/models
```
OpenAI SDK
Use the official OpenAI Python SDK for seamless integration:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)
```
See the FAQ section for information on using TestClient for development.
API Examples
Chat Completion
```python
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=0,
    stream=True
)

for chunk in response:
    print(chunk)
    print(chunk.choices[0].delta.content)
    print("****************")
```
Curl Example
```bash
curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "stream": true,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```
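The chat endpoint also lists Structured Output and LogProbs support. Below is a hedged sketch of a structured-output request using the OpenAI SDK's `response_format` parameter; the `city_info` schema is made up for illustration, and it assumes the server accepts OpenAI-style `json_schema` definitions. LogProbs can be requested on the same endpoint by passing `logprobs=True`.

```python
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Name a city and its country."}],
    # Hypothetical schema; assumes the server honors OpenAI-style json_schema output.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
)
print(response.choices[0].message.content)  # should be a JSON string matching the schema
```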
Text-to-Speech
```python
speech_file_path = "mlx_example.wav"
response = client.audio.speech.create(
    model="lucasnewman/f5-tts-mlx",
    voice="alloy",
    input="MLX project is awesome.",
)
response.stream_to_file(speech_file_path)
```
Curl Example
```bash
curl -X POST "http://localhost:10240/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lucasnewman/f5-tts-mlx",
    "input": "MLX project is awesome",
    "voice": "alloy"
  }' \
  --output ~/Desktop/mlx.wav
```
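Recent versions of the OpenAI Python SDK prefer the streaming-response helper over calling `stream_to_file` on a plain response. A sketch using the same model and voice as above, assuming you are on a recent SDK release:

```python
speech_file_path = "mlx_example.wav"

# Stream the generated audio straight to disk.
with client.audio.speech.with_streaming_response.create(
    model="lucasnewman/f5-tts-mlx",
    voice="alloy",
    input="MLX project is awesome.",
) as response:
    response.stream_to_file(speech_file_path)
```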
Speech-to-Text
```python
audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="mlx-community/whisper-large-v3-turbo",
    file=audio_file
)
print(transcript.text)
```
Curl Example
```bash
curl -X POST "http://localhost:10240/v1/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@mlx_example.wav" \
  -F "model=mlx-community/whisper-large-v3-turbo"
```

Response:

```json
{
  "text": " MLX Project is awesome!"
}
```
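The transcription endpoint accepts the same optional parameters as OpenAI's, such as a language hint. A minimal sketch that also closes the file with a context manager; whether the server forwards the `language` option to the underlying Whisper model is an assumption:

```python
# Open the file with a context manager so it is closed after the request.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="mlx-community/whisper-large-v3-turbo",
        file=audio_file,
        language="en",  # optional hint; forwarding to the model is an assumption
    )

print(transcript.text)
```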
Image Generation
```python
image_response = client.images.generate(
    model="argmaxinc/mlx-FLUX.1-schnell",
    prompt="A serene landscape with mountains and a lake",
    n=1,
    size="512x512"
)
```
Curl Example
```bash
curl http://localhost:10240/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "argmaxinc/mlx-FLUX.1-schnell",
    "prompt": "A cute baby sea otter",
    "n": 1,
    "size": "1024x1024"
  }'
```
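To save the generated image to disk, you can ask for base64 output. The sketch below assumes the server honors OpenAI's `response_format` parameter; the `landscape.png` filename is arbitrary:

```python
import base64

# Assumes the server returns base64 data when response_format="b64_json" is requested.
image_response = client.images.generate(
    model="argmaxinc/mlx-FLUX.1-schnell",
    prompt="A serene landscape with mountains and a lake",
    n=1,
    size="512x512",
    response_format="b64_json",
)

with open("landscape.png", "wb") as f:
    f.write(base64.b64decode(image_response.data[0].b64_json))
```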
Embeddings
```python
response = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input="I like reading"
)

print(f"Response type: {type(response)}")
print(f"Model used: {response.model}")
print(f"Embedding dimension: {len(response.data[0].embedding)}")
```
Curl Example
```bash
curl http://localhost:10240/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/all-MiniLM-L6-v2-4bit",
    "input": ["Hello world!", "Embeddings are useful for semantic search."]
  }'
```
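Embeddings are typically compared with cosine similarity, which is what makes them useful for semantic search. A small sketch (requires `numpy`; the example sentences are arbitrary):

```python
import numpy as np

response = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["I like reading", "Reading books is my hobby", "The weather is sunny"],
)

vectors = np.array([item.embedding for item in response.data])

# Cosine similarity of the first sentence against the other two.
query = vectors[0]
for text, vec in zip(["Reading books is my hobby", "The weather is sunny"], vectors[1:]):
    score = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
    print(f"{text!r}: similarity {score:.3f}")
```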
For more detailed examples, check out the examples directory.
FAQ
How are models managed?
MLX Omni Server uses Hugging Face for model downloading and management. When you specify a model ID that hasn't been downloaded yet, the framework will automatically download it. However, since download times can vary significantly:
- It's recommended to pre-download models through Hugging Face before using them in your service
- To use a locally downloaded model, simply set the `model` parameter to the local model path
```python
# Use a model hosted on the Hugging Face Hub (downloaded automatically if missing)
response = client.chat.completions.create(
    model="mlx-community/gemma-3-1b-it-4bit-DWQ",
    messages=[{"role": "user", "content": "Hello"}]
)

# Use a model that has already been downloaded to a local path
response = client.chat.completions.create(
    model="/path/to/your/local/model",
    messages=[{"role": "user", "content": "Hello"}]
)
```
You can also list the models currently available on the machine:

```bash
curl http://localhost:10240/v1/models
```
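The same listing is available through the OpenAI SDK and should return the same result as the curl call above:

```python
# Equivalent to the curl call above, using the OpenAI SDK.
for model in client.models.list():
    print(model.id)
```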
How do I specify which model to use?
Use the `model` parameter when creating a request:

```python
response = client.chat.completions.create(
    model="mlx-community/gemma-3-1b-it-4bit-DWQ",
    messages=[{"role": "user", "content": "Hello"}]
)
```
Can I use TestClient for development?
Yes, TestClient allows you to use the OpenAI client without starting a local server. This is particularly useful for development and testing scenarios:
```python
from openai import OpenAI
from fastapi.testclient import TestClient
from mlx_omni_server.main import app

client = OpenAI(
    http_client=TestClient(app)
)

response = client.chat.completions.create(
    model="mlx-community/gemma-3-1b-it-4bit-DWQ",
    messages=[{"role": "user", "content": "Hello"}]
)
```
This approach bypasses the HTTP server entirely, making it ideal for unit testing and quick development iterations.
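For instance, the pattern fits naturally into a pytest test. The sketch below is hypothetical (the test name and prompt are made up), and note that the request still loads a real model, so it behaves more like an integration test than a unit test:

```python
# test_chat.py -- hypothetical pytest example built on the TestClient pattern above.
from fastapi.testclient import TestClient
from openai import OpenAI

from mlx_omni_server.main import app


def test_chat_completion_returns_text():
    client = OpenAI(http_client=TestClient(app), api_key="not-needed")
    response = client.chat.completions.create(
        model="mlx-community/gemma-3-1b-it-4bit-DWQ",
        messages=[{"role": "user", "content": "Say hello"}],
    )
    assert response.choices[0].message.content
```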
What if I get errors when starting the server?
- Confirm you're using an Apple Silicon Mac (M1/M2/M3/M4)
- Check that your Python version is 3.9 or higher
- Verify you have the latest version of mlx-omni-server installed
- Check the log output for more detailed error information
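A quick way to check the first two points from Python (just a convenience sketch; `arm64` is what Apple Silicon Macs report):

```python
import platform
import sys

# Quick environment sanity check for the troubleshooting list above.
print("machine:", platform.machine())      # expect "arm64" on Apple Silicon
print("python :", sys.version.split()[0])  # expect 3.9 or newer
```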
Contributing
We welcome contributions! If you're interested in contributing to MLX Omni Server, please check out our Development Guide
for detailed information about:
- Setting up the development environment
- Running the server in development mode
- Contributing guidelines
- Testing and documentation
For major changes, please open an issue first to discuss what you would like to change.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Disclaimer
This project is not affiliated with or endorsed by OpenAI or Apple. It's an independent implementation that provides OpenAI-compatible APIs using
Apple's MLX framework.
