# LlamaIndex Llms Integration: Llama Cpp

## Installation

To get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is available in the llama-cpp-python documentation, which also includes dedicated macOS instructions.

In general (example build commands are sketched after this list):

- Use `CuBLAS` if you have CUDA and an NVidia GPU
- Use `METAL` if you are running on an M1/M2 MacBook
- Use `CLBLAST` if you are running on an AMD/Intel GPU

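As a rough sketch, a GPU-enabled build might look like the following. Note that the exact CMake flag names have changed across llama-cpp-python releases (newer versions use `GGML_CUDA`/`GGML_METAL` instead of the older `LLAMA_CUBLAS`/`LLAMA_METAL`), so treat these as illustrative and defer to the official install guide:

```bash
# NVIDIA GPU with CUDA (older flag spelling shown; newer releases: -DGGML_CUDA=on)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Apple Silicon with Metal (newer releases: -DGGML_METAL=on)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
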
Then, install the required llama-index packages:

```bash
pip install llama-index-embeddings-huggingface
pip install llama-index-llms-llama-cpp
```

## Basic Usage

### Initialize LlamaCPP

Set up the model URL and initialize the LlamaCPP LLM:
```python
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer

model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"

# use the model's own chat template so prompts match its expected format
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


llm = LlamaCPP(
    # you can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__(); n_gpu_layers=-1 offloads all layers to the GPU
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model-specific prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```

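Since `messages_to_prompt` is only exercised through the chat API, a quick sanity check is a one-off `chat()` call (a minimal sketch; the message content is arbitrary):

```python
from llama_index.core.llms import ChatMessage

# chat() routes messages through the messages_to_prompt function defined above
response = llm.chat([ChatMessage(role="user", content="Hello! Who are you?")])
print(response.message.content)
```
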
### Generate Completions

Use the `complete` method to generate a response:

```python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```

### Stream Completions

You can also stream completions for a prompt:
```python
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
```

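The chat API streams the same way; a minimal sketch reusing the `llm` configured above:

```python
from llama_index.core.llms import ChatMessage

# stream_chat yields ChatResponse chunks with incremental deltas
messages = [ChatMessage(role="user", content="Write me a haiku about the sea.")]
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
```
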
## Set Up Query Engine with LlamaCPP

Change the global tokenizer to match the LLM:
```python
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
```

### Use Hugging Face Embeddings

Set up the embedding model and load documents:
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()
```

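To confirm the embedding model loaded correctly, a quick optional check (illustrative; `bge-small-en-v1.5` produces 384-dimensional vectors):

```python
# embeddings are plain Python lists of floats
vector = embed_model.get_text_embedding("Hello, world!")
print(len(vector))  # 384 for BAAI/bge-small-en-v1.5
```
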
### Create Vector Store Index

Create a vector store index from the loaded documents:
```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
```

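To avoid re-embedding the documents on every run, the index can be persisted and reloaded; a sketch using LlamaIndex's default local storage (the `./storage` path is an arbitrary choice):

```python
from llama_index.core import StorageContext, load_index_from_storage

# save the index to disk
index.storage_context.persist(persist_dir="./storage")

# load it back later, reusing the same embedding model
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, embed_model=embed_model)
```
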
### Set Up Query Engine

Set up the query engine with the LlamaCPP LLM:
```python
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)
```

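The query engine can also stream the synthesized answer token by token; a short sketch reusing the index above:

```python
# streaming variant of the same query
query_engine = index.as_query_engine(llm=llm, streaming=True)
response = query_engine.query("What did the author do growing up?")
response.print_response_stream()
```
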
## LLM Implementation example

https://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/