This package provides document loaders that use different libraries to process and chunk documents. It loads documents in a variety of formats into a structured form suitable for use with LangChain vector databases.
The package includes the following loaders:
PyMUPdf4LLMLoader: uses the pymupdf4llm library.
MarkitdownLoader: uses the MarkItDown library.
LlamaparseLoader: uses the LlamaParse library and processes different file types.
To install this package, simply run:
pip install docitup
from docitup import PyMUPdf4LLMLoader
loader = PyMUPdf4LLMLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
from docitup import MarkitdownLoader
loader = MarkitdownLoader(file_path='path/to/your/file.md')
documents = loader.load()
from docitup import LlamaparseLoader
from llama_parse.utils import ResultType
loader = LlamaparseLoader(file_path='path/to/your/directory', result_type=ResultType.MD, api_key='your_api_key')
documents = loader.load()
from docitup import DoclingLoader
loader = DoclingLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
from docitup import FitzPyMUPDFLoader
loader = FitzPyMUPDFLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
from docitup import PyPdfLoader
loader = PyPdfLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
from docitup import PyPdf2Loader
loader = PyPdf2Loader(file_path='path/to/your/file.pdf')
documents = loader.load()
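Since each loader targets a different library and set of file types, one way to choose between them is by file extension. The sketch below is only an illustration: the mapping, the `pick_loader_name` helper, and the choice of `LlamaparseLoader` as the fallback are assumptions, not part of docitup. It returns a class name as a string, so it runs without the package installed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a docitup loader class name.
# The pairing is an assumption made for illustration only.
LOADER_BY_EXTENSION = {
    ".pdf": "PyMUPdf4LLMLoader",
    ".md": "MarkitdownLoader",
}


def pick_loader_name(file_path: str, default: str = "LlamaparseLoader") -> str:
    """Return the name of the loader class to use for file_path."""
    suffix = Path(file_path).suffix.lower()
    return LOADER_BY_EXTENSION.get(suffix, default)


print(pick_loader_name("reports/summary.PDF"))
print(pick_loader_name("notes/readme.md"))
print(pick_loader_name("slides/deck.pptx"))
```

With docitup installed, the returned name could then be resolved with `getattr(docitup, name)` and instantiated with `file_path=...` as in the examples above.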
Each loader can be configured with the following optional parameters:
splitter_type: The type of text splitter to use ("recursive" or other).
chunk_size: The size of each chunk (default is 1000).
chunk_overlap: The number of overlapping characters between chunks (default is 100).
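To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not docitup's implementation: the package's splitters (for example the "recursive" one) may also split on separators, and the `chunk_text` function is a hypothetical name used only for this illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.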
from docitup import LlamaparseLoader

# Initialize the loader
loader = LlamaparseLoader(
    file_path="example.pdf",
    api_key="your_api_key",
    splitter_type="recursive",
    chunk_size=500,
    chunk_overlap=50,
    extra_metadata={"category": "example"},
)

# Load the documents and iterate over the resulting chunks
for document in loader.load():
    print("Text Chunk:", document.text)
    print("Metadata:", document.metadata)
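The example passes extra_metadata and then reads document.metadata on each chunk. As a rough sketch of how such metadata could end up on every chunk, the code below merges per-chunk fields with the user-supplied dictionary. The `Chunk` class, the `attach_metadata` helper, and the `source` and `chunk_index` keys are all hypothetical; docitup's actual document type and metadata keys may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a loaded document chunk."""
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(texts: list[str], source: str,
                    extra_metadata: Optional[dict] = None) -> list[Chunk]:
    """Wrap each text chunk with its source, its position, and any extra metadata."""
    extra = extra_metadata or {}
    return [
        # Keys in extra override the generated ones if they collide.
        Chunk(text=t, metadata={"source": source, "chunk_index": i, **extra})
        for i, t in enumerate(texts)
    ]


for chunk in attach_metadata(["first part", "second part"], "example.pdf",
                             {"category": "example"}):
    print("Text Chunk:", chunk.text)
    print("Metadata:", chunk.metadata)
```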
Contributions are welcome! Please feel free to submit issues or pull requests for improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for more information.
This package is made possible by the following libraries:
We found that docitup demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.