LlamaParse
LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents).
It is really good at the following:
- ✅ Broad file type support: Parsing a variety of unstructured file types (.pdf, .pptx, .docx, .xlsx, .html) with text, tables, visual elements, weird layouts, and more.
- ✅ Table recognition: Parsing embedded tables accurately into text and semi-structured representations.
- ✅ Multimodal parsing and chunking: Extracting visual elements (images/diagrams) into structured formats and returning image chunks using the latest multimodal models.
- ✅ Custom parsing: Supplying custom prompt instructions to tailor the output to your needs.
LlamaParse directly integrates with LlamaIndex.
The free plan covers up to 1000 pages a day. The paid plan includes 7k free pages per week, plus 0.3c per additional page by default. A sandbox is available for testing the API: https://cloud.llamaindex.ai/parse.
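As a rough illustration of the paid-plan arithmetic above, the sketch below estimates a weekly charge. The constants are copied from the text and may not match current pricing; the function name is hypothetical, and reading "0.3c" as 0.3 US cents (0.003 dollars) is an assumption.

```python
from decimal import Decimal

# Assumed from the pricing text above; verify against current pricing before relying on it.
FREE_PAGES_PER_WEEK = 7_000
PRICE_PER_EXTRA_PAGE = Decimal("0.003")  # 0.3 cents, expressed in dollars (assumption)


def estimate_weekly_cost(pages_parsed: int) -> Decimal:
    """Estimate the paid-plan charge in dollars for one week of parsing."""
    if pages_parsed < 0:
        raise ValueError("pages_parsed must be non-negative")
    # Only pages beyond the weekly free allowance are billed.
    billable_pages = max(0, pages_parsed - FREE_PAGES_PER_WEEK)
    return billable_pages * PRICE_PER_EXTRA_PAGE
```

Under these assumptions, a week with 10,000 parsed pages bills 3,000 pages. Decimal is used instead of float so that small per-page amounts do not pick up rounding noise.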
Read below for some quickstart information, or see the full documentation.
If you're a company interested in enterprise RAG solutions, and/or high volume/on-prem usage of LlamaParse, come talk to us.
Getting Started
First, log in and get an API key from https://cloud.llamaindex.ai/api-key.
Then, make sure you have the latest LlamaIndex version installed.
NOTE: If you are upgrading from v0.9.X, we recommend following our migration guide, as well as uninstalling your previous version first.
pip uninstall llama-index # run this if upgrading from v0.9.x or older
pip install -U llama-index --no-cache-dir --force-reinstall
Lastly, install the package:
pip install llama-parse
Now you can parse your first PDF file using the command line interface. Use the command llama-parse [file_paths]. See the help text with llama-parse --help.
export LLAMA_CLOUD_API_KEY='llx-...'
# output as text
llama-parse my_file.pdf --result-type text --output-file output.txt
# output as markdown
llama-parse my_file.pdf --result-type markdown --output-file output.md
# output as raw JSON
llama-parse my_file.pdf --output-raw-json --output-file output.json
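To run the CLI over several files from Python, one option is to build the same command shown above and hand it to subprocess. This is a minimal sketch: the helper names are hypothetical, it uses only the --result-type and --output-file flags shown above, and writing the output next to the input file is an assumption.

```python
import os
import subprocess
from pathlib import Path

# Output suffix per result type; mirrors the CLI examples above.
_SUFFIXES = {"text": ".txt", "markdown": ".md"}


def build_llama_parse_command(file_path: str, result_type: str = "markdown") -> list[str]:
    """Build the llama-parse argument list for one input file (hypothetical helper)."""
    if result_type not in _SUFFIXES:
        raise ValueError(f"unsupported result type: {result_type!r}")
    # Assumption: write the output next to the input, swapping the extension.
    output_path = str(Path(file_path).with_suffix(_SUFFIXES[result_type]))
    return [
        "llama-parse", file_path,
        "--result-type", result_type,
        "--output-file", output_path,
    ]


def parse_files(file_paths, result_type: str = "markdown") -> None:
    """Run the CLI once per file; requires llama-parse on PATH and an API key."""
    if "LLAMA_CLOUD_API_KEY" not in os.environ:
        raise RuntimeError("set LLAMA_CLOUD_API_KEY before calling the CLI")
    for file_path in file_paths:
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(build_llama_parse_command(file_path, result_type), check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.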
You can also create simple scripts:
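import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files are passed, split into `num_workers` API calls
    verbose=True,
    language="en",  # optional, defaults to "en"
)

# sync
documents = parser.load_data("./my_file.pdf")
# sync batch
documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])
# async (use inside an async function or a notebook cell)
documents = await parser.aload_data("./my_file.pdf")
# async batch
documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])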
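The await calls above work as written in a notebook, but a plain script has no running event loop. A minimal sketch of one way to drive aload_data from a script follows; the wrapper names are hypothetical, and it assumes the API key is read from the LLAMA_CLOUD_API_KEY environment variable.

```python
import asyncio


async def parse_files(parser, file_paths):
    """Await the parser's async batch method and return the resulting documents."""
    return await parser.aload_data(file_paths)


def run_from_script(file_paths):
    """Entry point for a plain Python script (hypothetical wrapper)."""
    # Imported here so the sketch can be read without the package installed.
    from llama_parse import LlamaParse

    # Assumption: the API key is picked up from LLAMA_CLOUD_API_KEY.
    parser = LlamaParse(result_type="markdown")
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(parse_files(parser, file_paths))
```

Calling run_from_script(["./my_file1.pdf", "./my_file2.pdf"]) would then return the parsed documents, given a valid key and network access.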
Using with file object
You can parse a file object directly:
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",
    num_workers=4,
    verbose=True,
    language="en",
)

file_name = "my_file1.pdf"
extra_info = {"file_name": file_name}

# pass an open binary file object; extra_info must include the file_name key
with open(f"./{file_name}", "rb") as f:
    documents = parser.load_data(f, extra_info=extra_info)

# or pass the file bytes directly; extra_info must again include file_name
with open(f"./{file_name}", "rb") as f:
    file_bytes = f.read()
    documents = parser.load_data(file_bytes, extra_info=extra_info)
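When the bytes come from somewhere other than a local file (an upload or a download, for example), it can help to wrap the call so the file name is never forgotten. The helpers below are a hypothetical sketch built on the load_data call shown above, not part of the library.

```python
from pathlib import Path


def load_from_bytes(parser, data: bytes, file_name: str):
    """Parse in-memory bytes, always passing the file name in extra_info."""
    if not file_name:
        raise ValueError("file_name is required when parsing raw bytes")
    # Same call as in the example above: bytes plus extra_info with a file_name key.
    return parser.load_data(data, extra_info={"file_name": file_name})


def load_from_path_as_bytes(parser, path):
    """Read a local file fully into memory, then parse it as bytes."""
    file_path = Path(path)
    return load_from_bytes(parser, file_path.read_bytes(), file_path.name)
```

Here parser is any configured LlamaParse instance, such as the one created above.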
Using with SimpleDirectoryReader
You can also integrate the parser as the default PDF loader in SimpleDirectoryReader:
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",
    verbose=True,
)

# route every .pdf file in ./data through LlamaParse
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
Full documentation for SimpleDirectoryReader can be found in the LlamaIndex documentation.
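The file_extractor mapping above is keyed by file extension, so the same parser can be registered for several of the file types LlamaParse supports. A minimal sketch follows; the helper name and its default extension list are assumptions, and lowercase keys are assumed to be what SimpleDirectoryReader looks up.

```python
def build_file_extractor(parser, extensions=(".pdf", ".docx", ".pptx")):
    """Map each extension to the same parser instance (hypothetical helper)."""
    extractor = {}
    for ext in extensions:
        # Normalize to a lowercase extension with a leading dot, e.g. "PDF" -> ".pdf".
        normalized = ext.lower() if ext.startswith(".") else "." + ext.lower()
        extractor[normalized] = parser
    return extractor


# Usage with the parser defined above (needs llama-index and an API key):
# documents = SimpleDirectoryReader(
#     "./data", file_extractor=build_file_extractor(parser)
# ).load_data()
```

Files whose extension is not in the mapping fall back to whatever reader SimpleDirectoryReader would normally use for them.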
Examples
Several end-to-end indexing examples can be found in the examples folder.
Documentation
https://docs.cloud.llamaindex.ai/
Terms of Service
See the Terms of Service here.
Get in Touch (LlamaCloud)
LlamaParse is part of LlamaCloud, our end-to-end enterprise RAG platform that provides out-of-the-box, production-ready connectors, indexing, and retrieval over your complex data sources. We offer SaaS and VPC options.
LlamaCloud is currently available via waitlist (join by creating an account). If you're interested in state-of-the-art quality and in centralizing your RAG efforts, come get in touch with us.