llama-cpp-cffi
Python binding for llama.cpp using cffi. Supports CPU, Vulkan 1.x, and CUDA 12.6 runtimes on x86_64 and aarch64 platforms.
NOTE: The currently supported operating system is Linux (manylinux_2_28 and musllinux_1_2), but we are working on both Windows and macOS versions.
News
- Dec 9 2024, v0.2.0: Support for low-level and high-level APIs: llama, llava, clip, and ggml.
- Nov 27 2024, v0.1.22: Support for multimodal models such as llava and minicpmv.
Install
Basic library install:
pip install llama-cpp-cffi
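A quick way to confirm the wheel installed correctly is to import the high-level class used in the examples below (a minimal sanity check, not a required step):

# If this import succeeds, the basic library install is working.
from llama import Model
print(Model)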
IMPORTANT: If you want to take advantage of NVIDIA GPU acceleration, make sure you have CUDA 12 installed. If you don't have CUDA 12.x installed, follow the instructions here: https://developer.nvidia.com/cuda-downloads.
Supported GPU Compute Capabilities: compute_61, compute_70, compute_75, compute_80, compute_86, and compute_89, covering most GPUs from the GeForce GTX 1050 to the NVIDIA H100. See NVIDIA's GPU Compute Capability documentation to find the capability of a specific GPU.
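Before relying on GPU acceleration, it can help to confirm that the NVIDIA driver tools are visible on your system. The sketch below is a generic check using only the Python standard library and the standard nvidia-smi tool; it is not part of llama-cpp-cffi:

import shutil
import subprocess

# Generic sanity check (not part of llama-cpp-cffi): is the NVIDIA driver visible?
if shutil.which('nvidia-smi'):
    subprocess.run(['nvidia-smi', '--query-gpu=name,driver_version', '--format=csv'], check=False)
else:
    print('nvidia-smi not found; CUDA acceleration will not be available.')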
LLM Example
from llama import Model

model = Model(
    creator_hf_repo='HuggingFaceTB/SmolLM2-1.7B-Instruct',
    hf_repo='bartowski/SmolLM2-1.7B-Instruct-GGUF',
    hf_file='SmolLM2-1.7B-Instruct-Q4_K_M.gguf',
)

model.init(ctx_size=8192, predict=1024, gpu_layers=99)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},
]

# chat-style completion: stream the response chunk by chunk
for chunk in model.completions(messages=messages, temp=0.7, top_p=0.8, top_k=100):
    print(chunk, flush=True, end='')

# plain text prompt completion
for chunk in model.completions(prompt='Evaluate 1 + 2 in Python. Result in Python is', temp=0.7, top_p=0.8, top_k=100):
    print(chunk, flush=True, end='')
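If you want the whole response as a single string instead of streaming it, you can join the yielded chunks (a small variation on the call above; it assumes each chunk is a plain string, as in the streaming example):

# Collect the streamed chunks into one string.
output = ''.join(model.completions(messages=messages, temp=0.7, top_p=0.8, top_k=100))
print(output)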
VLM Example
from llama import Model

model = Model(
    creator_hf_repo='vikhyatk/moondream2',
    hf_repo='vikhyatk/moondream2',
    hf_file='moondream2-text-model-f16.gguf',
    mmproj_hf_file='moondream2-mmproj-f16.gguf',
)

model.init(ctx_size=8192, predict=1024, gpu_layers=99)

# describe an image: the mmproj projector file enables vision input
for chunk in model.completions(prompt='Describe this image.', image='examples/llama-1.png'):
    print(chunk, flush=True, end='')
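The same model instance can be reused for several images by looping over the paths (a minimal sketch; the second image path below is a hypothetical placeholder):

# Reuse the loaded model for multiple images.
for path in ['examples/llama-1.png', 'examples/llama-2.png']:
    print(f'== {path} ==')
    for chunk in model.completions(prompt='Describe this image.', image=path):
        print(chunk, flush=True, end='')
    print()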
References
- examples/llm.py
- examples/vlm.py