# Configure which llama-cpp-python precompiled binary to install (⚠️ not every combination is available):
LLAMA_CPP_PYTHON_VERSION=0.3.9
PYTHON_VERSION=310|311|312
ACCELERATOR=metal|cu121|cu122|cu123|cu124
PLATFORM=macosx_11_0_arm64|linux_x86_64|win_amd64
# Install llama-cpp-python:
pip install "https://github.com/abetlen/llama-cpp-python/releases/download/v$LLAMA_CPP_PYTHON_VERSION-$ACCELERATOR/llama_cpp_python-$LLAMA_CPP_PYTHON_VERSION-cp$PYTHON_VERSION-cp$PYTHON_VERSION-$PLATFORM.whl"
Install RAGLite with:
pip install raglite
To add support for a customizable ChatGPT-like frontend, use the chainlit extra:
pip install raglite[chainlit]
To add support for filetypes other than PDF, use the pandoc extra:
pip install raglite[pandoc]
To add support for evaluation, use the ragas extra:
pip install raglite[ragas]
> [!TIP]
> RAGLite extends LiteLLM with support for llama.cpp models using llama-cpp-python. To select a llama.cpp model (e.g., from Unsloth's collection), use a model identifier of the form "llama-cpp-python/<hugging_face_repo_id>/<filename>@<n_ctx>", where n_ctx is an optional parameter that specifies the context size of the model.
> [!TIP]
> 💾 You can create a PostgreSQL database in a few clicks at neon.tech.
from raglite import RAGLiteConfig
# Example 'remote' config with a PostgreSQL database and an OpenAI LLM:
my_config = RAGLiteConfig(
db_url="postgresql://my_username:my_password@my_host:5432/my_database",
llm="gpt-4o-mini", # Or any LLM supported by LiteLLM
embedder="text-embedding-3-large", # Or any embedder supported by LiteLLM
)
# Example 'local' config with a DuckDB database and a llama.cpp LLM:
my_config = RAGLiteConfig(
db_url="duckdb:///raglite.db",
llm="llama-cpp-python/unsloth/Qwen3-8B-GGUF/*Q4_K_M.gguf@8192",
embedder="llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@512", # More than 512 tokens degrades bge-m3's performance
)
from rerankers import Reranker
# Example remote API-based reranker:
my_config = RAGLiteConfig(
db_url="postgresql://my_username:my_password@my_host:5432/my_database"
reranker=Reranker("rerank-v3.5", model_type="cohere", api_key=COHERE_API_KEY, verbose=0) # Multilingual
)
# Example local cross-encoder reranker per language (this is the default):
my_config = RAGLiteConfig(
db_url="duckdb:///raglite.db",
reranker={
"en": Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank", verbose=0), # English"other": Reranker("ms-marco-MultiBERT-L-12", model_type="flashrank", verbose=0), # Other languages
}
)
2. Inserting documents
> [!TIP]
> To insert documents other than PDF, install the pandoc extra with pip install raglite[pandoc].
# Insert documents given their file path
from pathlib import Path
from raglite import Document, insert_documents
documents = [
Document.from_path(Path("On the Measure of Intelligence.pdf")),
Document.from_path(Path("Special Relativity.pdf")),
]
insert_documents(documents, config=my_config)
# Insert documents given their text/plain or text/markdown content
content = """
# ON THE ELECTRODYNAMICS OF MOVING BODIES
## By A. EINSTEIN June 30, 1905
It is known that Maxwell...
"""
documents = [
Document.from_text(content)
]
insert_documents(documents, config=my_config)
3. Retrieval-Augmented Generation (RAG)
3.1 Adaptive RAG
Now you can run an adaptive RAG pipeline that consists of adding the user prompt to the message history and streaming the LLM response:
from raglite import rag
# Create a user message
messages = [] # Or start with an existing message history
messages.append({
"role": "user",
"content": "How is intelligence measured?"
})
# Adaptively decide whether to retrieve and then stream the response
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
print(update, end="")
# Access the documents referenced in the RAG context
documents = [chunk_span.document for chunk_span in chunk_spans]
The LLM will adaptively decide whether to retrieve information based on the complexity of the user prompt. If retrieval is necessary, the LLM generates the search query and RAGLite applies hybrid search and reranking to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks). The retrieval results are sent to the on_retrieval callback and are appended to the message history as a tool output. Finally, the assistant response is streamed and appended to the message history.
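Because both the retrieval results and the assistant response end up in `messages`, you can continue the same conversation by appending another user turn and calling `rag` again. A minimal sketch continuing the snippet above (the follow-up prompt is just an illustration):

```python
# Continue the conversation: the previous tool output and assistant reply are already
# part of `messages`, so append a new user turn and stream the next response.
messages.append({"role": "user", "content": "Can you summarize that in one sentence?"})
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")
```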
3.2 Programmable RAG
If you need manual control over the RAG pipeline, you can run a basic but powerful pipeline that consists of retrieving the most relevant chunk spans with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:
from raglite import add_context, rag, retrieve_context, vector_search
# Choose a search method
from dataclasses import replace
my_config = replace(my_config, search_method=vector_search) # Or `hybrid_search`, `search_and_rerank_chunks`, ...
# Retrieve relevant chunk spans with the configured search method
user_prompt = "How is intelligence measured?"
chunk_spans = retrieve_context(query=user_prompt, num_chunks=5, config=my_config)
# Append a RAG instruction based on the user prompt and context to the message history
messages = [] # Or start with an existing message history
messages.append(add_context(user_prompt=user_prompt, context=chunk_spans))
# Stream the RAG response and append it to the message history
stream = rag(messages, config=my_config)
for update in stream:
print(update, end="")
# Access the documents referenced in the RAG context
documents = [chunk_span.document for chunk_span in chunk_spans]
> [!TIP]
> 🥇 Reranking can significantly improve the output quality of a RAG application. To add reranking to your application: first search for a larger set of 20 relevant chunks, then rerank them with a rerankers reranker, and finally keep the top 5 chunks.
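As a compact illustration of this tip, here is a sketch using the same search, retrieval, and reranking helpers that the full pipeline below uses (the example query is illustrative):

```python
from raglite import hybrid_search, rerank_chunks, retrieve_chunks

# 1. Search for a larger candidate set of 20 chunks with hybrid search
chunk_ids, _ = hybrid_search("How is intelligence measured?", num_results=20, config=my_config)
# 2. Retrieve the candidate chunks and rerank them
chunks = retrieve_chunks(chunk_ids, config=my_config)
chunks_reranked = rerank_chunks("How is intelligence measured?", chunks, config=my_config)
# 3. Keep only the top 5 chunks
top_chunks = chunks_reranked[:5]
```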
RAGLite also offers more advanced control over the individual steps of a full RAG pipeline:
1. Searching for relevant chunks with keyword, vector, or hybrid search
2. Retrieving the chunks from the database
3. Reranking the chunks and selecting the top 5 results
4. Extending the chunks with their neighbors and grouping them into chunk spans
5. Converting the user prompt to a RAG instruction and appending it to the message history
6. Streaming an LLM response to the message history
7. Accessing the cited documents from the chunk spans
A full RAG pipeline is straightforward to implement with RAGLite:
# Search for chunks
from raglite import hybrid_search, keyword_search, vector_search
user_prompt = "How is intelligence measured?"
chunk_ids_vector, _ = vector_search(user_prompt, num_results=20, config=my_config)
chunk_ids_keyword, _ = keyword_search(user_prompt, num_results=20, config=my_config)
chunk_ids_hybrid, _ = hybrid_search(user_prompt, num_results=20, config=my_config)
# Retrieve chunks
from raglite import retrieve_chunks
chunks_hybrid = retrieve_chunks(chunk_ids_hybrid, config=my_config)
# Rerank chunks and keep the top 5 (optional, but recommended)
from raglite import rerank_chunks
chunks_reranked = rerank_chunks(user_prompt, chunks_hybrid, config=my_config)
chunks_reranked = chunks_reranked[:5]
# Extend chunks with their neighbors and group them into chunk spans
from raglite import retrieve_chunk_spans
chunk_spans = retrieve_chunk_spans(chunks_reranked, config=my_config)
# Append a RAG instruction based on the user prompt and context to the message history
from raglite import add_context
messages = [] # Or start with an existing message history
messages.append(add_context(user_prompt=user_prompt, context=chunk_spans))
# Stream the RAG response and append it to the message history
from raglite import rag
stream = rag(messages, config=my_config)
for update in stream:
print(update, end="")
# Access the documents referenced in the RAG context
documents = [chunk_span.document for chunk_span in chunk_spans]
4. Computing and using an optimal query adapter
RAGLite can compute and apply an optimal closed-form query adapter to the prompt embedding to improve the output quality of RAG. To benefit from this, first generate a set of evals with insert_evals and then compute and store the optimal query adapter with update_query_adapter:
# Improve RAG with an optimal query adapter
from raglite import insert_evals, update_query_adapter
insert_evals(num_evals=100, config=my_config)
update_query_adapter(config=my_config) # From here, every vector search will use the query adapter
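Once the query adapter is stored, no further code changes are needed; for example, a subsequent vector search (the same call used in the pipeline above) will transparently benefit from it:

```python
from raglite import vector_search

# The stored query adapter is now applied to the query embedding automatically.
chunk_ids, _ = vector_search("How is intelligence measured?", num_results=20, config=my_config)
```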
5. Evaluation of retrieval and generation
If you installed the ragas extra, you can use RAGLite to answer the evals and then evaluate the quality of both the retrieval and generation steps of RAG using Ragas:
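A sketch of this step, assuming RAGLite exposes `answer_evals` and `evaluate` helpers alongside `insert_evals` (treat these names as assumptions and check the RAGLite API for the exact functions):

```python
# Sketch: answer the evals with RAG and evaluate retrieval and generation with Ragas.
# `answer_evals` and `evaluate` are assumed helper names; verify them against the RAGLite API.
from raglite import answer_evals, evaluate, insert_evals

insert_evals(num_evals=100, config=my_config)              # Generate evals from the inserted documents
answers = answer_evals(num_answers=10, config=my_config)   # Answer the evals with the RAG pipeline
evaluation = evaluate(answers, config=my_config)           # Score retrieval and generation with Ragas
```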
6. Serving a Model Context Protocol (MCP) server
Once the MCP server is configured, starting Claude desktop should show a 🔨 icon at the bottom right of your prompt, indicating that Claude has successfully connected to the MCP server.
When relevant, Claude will suggest using the search_knowledge_base tool that the MCP server provides. You can also explicitly ask Claude to search the knowledge base if you want to be certain that it does.
7. Serving a customizable ChatGPT-like frontend
If you installed the chainlit extra, you can serve a customizable ChatGPT-like frontend with:
raglite chainlit
The application is also deployable to the web, Slack, and Teams.
You can specify the database URL, LLM, and embedder directly in the Chainlit frontend, or by passing them to the raglite chainlit command as CLI options.
VS Code Dev Container (with container volume): click on Open in Dev Containers to clone this repository in a container volume and create a Dev Container with VS Code.
uv: clone this repository and run the following from the root of the repository:
# Create and install a virtual environment
uv sync --python 3.10 --all-extras
# Activate the virtual environment
source .venv/bin/activate
# Install the pre-commit hooks
pre-commit install --install-hooks
VS Code Dev Container: clone this repository, open it with VS Code, and run Ctrl/⌘ + ⇧ + P → Dev Containers: Reopen in Container.
Run poe from within the development environment to print a list of Poe the Poet tasks available to run on this project.
Run uv add {package} from within the development environment to install a run-time dependency and add it to pyproject.toml and uv.lock. Add --dev to install a development dependency.
Run uv sync --upgrade from within the development environment to upgrade all dependencies to the latest versions allowed by pyproject.toml. Add --only-dev to upgrade the development dependencies only.
Run cz bump to bump the package's version, update the CHANGELOG.md, and create a git tag. Then push the changes and the git tag with git push origin main --tags.