The easiest way to get started is to install the mlx-lm package:
With pip:
pip install mlx-lm
With conda:
conda install -c conda-forge mlx-lm
The mlx-lm package also has:
To generate text with an LLM use:
mlx_lm.generate --prompt "Hi!"
To chat with an LLM use:
mlx_lm.chat
This will give you a chat REPL that you can use to interact with the LLM. The chat context is preserved during the lifetime of the REPL.
Commands in mlx-lm typically take command line options which let you specify the model, sampling parameters, and more. Use -h to see a list of available options for a command, e.g.:
mlx_lm.generate -h
You can use mlx-lm as a module:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
To see a description of all the arguments, run:
>>> help(generate)
Check out the generation example to see how to use the API in more detail.
The mlx-lm package also comes with functionality to quantize and optionally upload models to the Hugging Face Hub.
You can convert models using the Python API:
from mlx_lm import convert
repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
convert(repo, quantize=True, upload_repo=upload_repo)
This will generate a 4-bit quantized Mistral 7B and upload it to the repo mlx-community/My-Mistral-7B-Instruct-v0.3-4bit. It will also save the converted model in the path mlx_model by default.
To see a description of all the arguments, run:
>>> help(convert)
For streaming generation, use the stream_generate function. This yields a generation response object. For example:
from mlx_lm import load, stream_generate
repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
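If you want the full text as well as the live output, the streaming loop above can be wrapped in a small helper. The sketch below is illustrative only: `FakeResponse`, `fake_stream`, and `collect_text` are hypothetical names, and the fake stream stands in for `stream_generate` so the snippet runs without a model. The only thing it assumes about real responses is the `.text` attribute used in the example above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a generation response; only .text is modeled."""
    text: str


def fake_stream(chunks: Iterable[str]) -> Iterator[FakeResponse]:
    """Yield one response per chunk, mimicking a streaming generator."""
    for chunk in chunks:
        yield FakeResponse(text=chunk)


def collect_text(responses: Iterable[FakeResponse]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    pieces = []
    for response in responses:
        print(response.text, end="", flush=True)
        pieces.append(response.text)
    print()
    return "".join(pieces)


story = collect_text(fake_stream(["Once ", "upon ", "a time."]))
```

With a real model, the same helper should accept the generator returned by `stream_generate(model, tokenizer, prompt, max_tokens=512)` in place of `fake_stream(...)`.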
You can also use mlx-lm from the command line with:
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.
For a full list of options run:
mlx_lm.generate --help
To quantize a model from the command line run:
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
For more options run:
mlx_lm.convert --help
You can upload new models to Hugging Face by specifying --upload-repo to convert. For example, to upload a quantized Mistral-7B model to the MLX Hugging Face community you can do:
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
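When conversions are scripted, it can help to assemble the command as an argument list rather than a shell string. The following is a minimal sketch, not part of mlx-lm: `build_convert_command` is a hypothetical helper, and it only uses the flags shown above (`--hf-path`, `-q`, `--upload-repo`).

```python
import shlex
from typing import List, Optional


def build_convert_command(
    hf_path: str, quantize: bool = False, upload_repo: Optional[str] = None
) -> List[str]:
    """Build an argv list for mlx_lm.convert from the options shown above."""
    args = ["mlx_lm.convert", "--hf-path", hf_path]
    if quantize:
        args.append("-q")
    if upload_repo is not None:
        args += ["--upload-repo", upload_repo]
    return args


command = build_convert_command(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantize=True,
    upload_repo="mlx-community/my-4bit-mistral",
)
# Show the equivalent shell command for inspection before running it.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where mlx-lm is installed.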
Models can also be converted and quantized directly in the [mlx-my-repo](https://huggingface.co/spaces/mlx-community/mlx-my-repo) Hugging Face Space.
mlx-lm has some tools to scale efficiently to long prompts and generations:
To use the rotating key-value cache pass the argument --max-kv-size n where n can be any integer. Smaller values like 512 will use very little RAM but result in worse quality. Larger values like 4096 or higher will use more RAM but have better quality.
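The memory and quality trade-off of a fixed-size cache can be illustrated with a toy example. This is a conceptual sketch only and not how mlx-lm implements its cache: `RotatingCache` is a hypothetical class that stores plain strings rather than per-layer tensors, and it assumes the simplest possible eviction policy (drop the oldest entry).

```python
from collections import deque


class RotatingCache:
    """Toy fixed-size cache that keeps only the most recent max_size entries."""

    def __init__(self, max_size: int):
        if max_size <= 0:
            raise ValueError("max_size must be a positive integer")
        # A deque with maxlen discards the oldest item once it is full.
        self._entries = deque(maxlen=max_size)

    def append(self, key: str, value: str) -> None:
        self._entries.append((key, value))

    def keys(self) -> list:
        return [key for key, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


cache = RotatingCache(max_size=4)
for position in range(10):
    cache.append(f"k{position}", f"v{position}")

print(len(cache))    # 4
print(cache.keys())  # ['k6', 'k7', 'k8', 'k9']
```

Memory stays bounded by `max_size` no matter how many tokens are processed, but entries for earlier positions are no longer available, which is one way to see why a small `n` can hurt quality on long contexts.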
Caching prompts can substantially speed up reusing the same long context with different queries. To cache a prompt use mlx_lm.cache_prompt. For example:
cat prompt.txt | mlx_lm.cache_prompt \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --prompt - \
    --prompt-cache-file mistral_prompt.safetensors
Then use the cached prompt with mlx_lm.generate:
mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
The cached prompt is treated as a prefix to the supplied prompt. Note that when using a cached prompt, the model is read from the cache and need not be supplied explicitly.
Prompt caching can also be used in the Python API to avoid recomputing the prompt. This is useful in multi-turn dialogues or across requests that use the same context. See the example for more usage details.
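A multi-turn dialogue that reuses one cache might look roughly like the sketch below. It rests on assumptions not stated in this document: that `make_prompt_cache` is importable from `mlx_lm.models.cache` and that `generate` accepts `prompt_cache` and `max_tokens` keyword arguments in your installed version. Check `help(generate)` before relying on it. `run_dialogue` is a hypothetical helper name, and the imports sit inside the function so the module can be loaded on machines without mlx-lm.

```python
from typing import List


def run_dialogue(repo: str, questions: List[str], max_tokens: int = 256) -> List[str]:
    """Answer questions in order while sharing one prompt cache across turns."""
    # Assumed API: verify these names against your installed mlx-lm version.
    from mlx_lm import generate, load
    from mlx_lm.models.cache import make_prompt_cache

    model, tokenizer = load(repo)
    prompt_cache = make_prompt_cache(model)

    answers = []
    for question in questions:
        # Only the new turn is templated; earlier turns live in the cache.
        messages = [{"role": "user", "content": question}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        answers.append(
            generate(
                model,
                tokenizer,
                prompt=prompt,
                max_tokens=max_tokens,
                prompt_cache=prompt_cache,
            )
        )
    return answers
```

For example, `run_dialogue("mlx-community/Mistral-7B-Instruct-v0.3-4bit", ["Who was Einstein?", "What did he win the Nobel Prize for?"])` would let the second question build on the first without reprocessing it, under the assumptions above.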
mlx-lm supports thousands of Hugging Face format LLMs. If the model you want to run is not supported, file an issue or better yet, submit a pull request.
Here are a few examples of Hugging Face models that work with this example:
Most Mistral, Llama, Phi-2, and Mixtral style models should work out of the box.
For some models (such as Qwen and plamo) the tokenizer requires you to enable the trust_remote_code option. You can do this by passing --trust-remote-code in the command line. If you don't specify the flag explicitly, you will be prompted to trust remote code in the terminal when running the model.
For Qwen models you must also specify the eos_token. You can do this by passing --eos-token "<|endoftext|>" in the command line.
These options can also be set in the Python API. For example:
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
Note: This requires macOS 15.0 or higher to work.
Models which are large relative to the total RAM available on the machine can be slow. mlx-lm will attempt to make them faster by wiring the memory occupied by the model and cache. This requires macOS 15 or higher to work.
If you see the following warning message:
[WARNING] Generating with a model that requires ...
then the model will likely be slow on the given machine. If the model fits in RAM then it can often be sped up by increasing the system wired memory limit. To increase the limit, set the following sysctl:
sudo sysctl iogpu.wired_limit_mb=N
The value N should be larger than the size of the model in megabytes but smaller than the memory size of the machine.
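The bounds on N can be checked with a few lines of arithmetic before running the sysctl. This is a minimal sketch: `is_valid_wired_limit` is a hypothetical helper, and the sizes below are made-up illustrative values, not measurements of any particular model or machine.

```python
def is_valid_wired_limit(
    limit_mb: int, model_size_mb: int, machine_memory_mb: int
) -> bool:
    """Return True if the limit is above the model size and below total memory."""
    return model_size_mb < limit_mb < machine_memory_mb


# Illustrative numbers only: a 4096 MB model on a machine with 16384 MB of memory.
model_size_mb = 4096
machine_memory_mb = 16384

print(is_valid_wired_limit(8192, model_size_mb, machine_memory_mb))   # True
print(is_valid_wired_limit(2048, model_size_mb, machine_memory_mb))   # False
print(is_valid_wired_limit(20000, model_size_mb, machine_memory_mb))  # False
```

A value that passes this check only satisfies the stated bounds; how much headroom to leave for the rest of the system is a separate judgment call.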