
A Python wrapper for the Doc2X API that also comes with native text processing (to improve text recall in RAG).
Handle PDFs more easily, using Doc2X's document conversion capabilities for format-preserving file conversion and RAG enhancement.
Doc2X is a new universal document OCR tool that can convert images or PDF files into Markdown/LaTeX text with formulas and text formatting. It performs better than similar tools in most scenarios. pdfdeal provides wrapper classes for sending requests to Doc2X.
Use various OCR or PDF recognition tools to recognize images and add the recognized text to the original document. You can set the output format to PDF, which ensures that the recognized text keeps the same page numbers as the original in the new PDF. pdfdeal also offers various practical file processing tools.
After converting and pre-processing PDFs with Doc2X, you can achieve better recognition rates in knowledge base applications such as graphrag, Dify, and FastGPT.
pdfdeal also provides a series of tools for handling Markdown documents. For a detailed introduction to these features and their usage, please refer to the documentation.
See how to use it with graphrag. graphrag cannot read PDFs directly, but you can use the CLI tool doc2x to convert them to txt documents first.
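As a rough illustration of that last step, the sketch below copies Markdown files that have already been converted into a folder of .txt files, which is the kind of plain-text input graphrag works with. It does not call Doc2X or the doc2x CLI. The helper name `markdown_to_txt` and both folder arguments are hypothetical and not part of pdfdeal.

```python
from pathlib import Path


def markdown_to_txt(md_dir, txt_dir):
    """Copy each .md file in md_dir to txt_dir with a .txt suffix.

    Hypothetical helper for illustration only; not part of pdfdeal.
    Only the top level of md_dir is scanned, so file stems cannot collide.
    """
    src = Path(md_dir)
    dst = Path(txt_dir)
    dst.mkdir(parents=True, exist_ok=True)

    written = []
    for md_file in sorted(src.glob("*.md")):
        target = dst / (md_file.stem + ".txt")
        # The content is copied unchanged; only the file extension differs.
        target.write_text(md_file.read_text(encoding="utf-8"), encoding="utf-8")
        written.append(str(target))
    return written
```

The returned list of paths makes it easy to check which files ended up in the input folder before indexing.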
For knowledge base applications, you can instead use pdfdeal's built-in document enhancements, such as uploading images to remote storage services or adding breaks between paragraphs. See Integration with RAG applications.
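To make the idea of adding breaks between paragraphs concrete, here is a minimal sketch in plain Python. It is not pdfdeal's implementation: the function name `add_paragraph_breaks` and the `<break>` marker are assumptions chosen for illustration, and the library's own tools may split and mark text differently.

```python
import re


def add_paragraph_breaks(text, marker="<break>"):
    """Insert a marker line between paragraphs.

    A paragraph here is any block of text separated from the next one by at
    least one blank line. Hypothetical helper; the marker is an assumption.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return f"\n\n{marker}\n\n".join(paragraphs)


if __name__ == "__main__":
    print(add_paragraph_breaks("First paragraph.\n\nSecond paragraph."))
```

A knowledge base application that splits on a custom separator can then be configured to split on the same marker.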
For details, please refer to the documentation, or check out the documentation repository pdfdeal-docs.
Install using pip:
pip install --upgrade pdfdeal
If you need document processing tools:
pip install --upgrade "pdfdeal[rag]"
from pdfdeal import Doc2X

# Convert the PDFs under tests/pdf to docx and write the results to ./Output
client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf",
    output_path="./Output",
    output_format="docx",
)
print(success)
print(failed)
print(flag)

from pdfdeal import Doc2X

# Convert a single PDF to the md_dollar format and name the exported file sample1.zip
client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/single/pdf2file",
    output_names=["sample1.zip"],
    output_format="md_dollar",
)
print(success)
print(failed)
print(flag)
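Instead of printing the three return values directly, it can help to fold them into one summary. The sketch below assumes shapes this README does not confirm: `success` is a list of output paths with an empty string wherever a file failed, `failed` is a list of dicts with "error" and "path" keys, and `flag` is true when at least one file failed. The helper `summarize_results` is hypothetical and not part of pdfdeal, so check the online documentation before relying on these shapes.

```python
def summarize_results(success, failed, flag):
    """Summarize a (success, failed, flag) result tuple.

    Assumed shapes (not confirmed by this README):
      success: list of output paths, "" where a file failed
      failed:  list of dicts with "error" and "path" keys
      flag:    truthy if at least one file failed
    """
    # Keep only the non-empty output paths.
    converted = [path for path in success if path]
    # Keep only the entries that actually carry an error message.
    errors = [
        (item.get("path", ""), item.get("error", ""))
        for item in failed
        if item.get("error")
    ]
    return {"converted": converted, "errors": errors, "any_failed": bool(flag)}
```

With the variables from the examples above, `summarize_results(success, failed, flag)` would return a dict that can be logged or used to retry only the failed paths.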
See the online documentation for details.
We found that pdfdeal demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.