
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
Install the package with pip:
pip install cv-doc-chunker
This package requires the user to provide certain data externally:
input/: Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
models/: Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
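Because both the PDF and the model weights are supplied by the user, it can help to verify the paths before starting a run. The following is a minimal sketch using only the standard library; check_inputs is a hypothetical helper and not part of cv-doc-chunker, and the assumption that model files end in .pt is taken from the example file name above.

```python
from pathlib import Path


def check_inputs(pdf_path: str, model_path: str) -> list[str]:
    """Return a list of problems found; an empty list means both paths look usable.

    Hypothetical helper, not part of cv-doc-chunker.
    """
    problems = []
    pdf = Path(pdf_path)
    model = Path(model_path)

    if not pdf.is_file():
        problems.append(f"PDF not found: {pdf}")
    elif pdf.suffix.lower() != ".pdf":
        problems.append(f"Not a .pdf file: {pdf}")

    # The model may be given as a single file or as a directory of weights.
    # Assumption: weight files use the .pt extension, as in the example name.
    if model.is_dir():
        if not any(model.glob("*.pt")):
            problems.append(f"No .pt model files in: {model}")
    elif not model.is_file():
        problems.append(f"Model not found: {model}")

    return problems
```

Calling check_inputs("input/document.pdf", "models/") before creating the processor gives a readable list of missing files rather than a failure partway through processing.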
Example (Conceptual Python Usage):
from cv_doc_chunker import PDFProcessor
# --- User Configuration ---
input_pdf_path = "path/to/your/input/document.pdf" # Path to user's PDF
model_path = "path/to/your/models/doclayout_yolo.pt" # Path to user's model
output_dir = "path/to/your/output/" # Directory to save results
# --- Initialize and Run ---
processor = PDFProcessor(model_path=model_path, output_dir=output_dir)
# Process the document (layout detection, chunking, etc.)
results = processor.process_document(pdf_path=input_pdf_path)
print(f"Processing complete. Results saved in {output_dir}")
After running the parser, the following outputs will typically be available in the specified output_dir:
{your-document}_parsed.json: JSON file containing the detected document structure (element labels, coordinates, confidence).
{your-document}_annotations/: Directory containing annotated images showing the detected elements for each page (if generate_annotations=True).
{your-document}_boxes/: Directory containing individual images for each detected element, organized by page number (if save_bounding_boxes=True). This is required for OCR.
{your-document}_sorted_text.json: (Only if ocr=True) JSON file containing the extracted text for each element, sorted according to the structure defined in _parsed.json.
If debug mode is enabled (debug_mode=True), additional debug images might be saved, typically in a debug/ subdirectory within the output_dir, showing intermediate steps of the parsing process.
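The naming scheme above can be turned into a small lookup for finding the results of a run. This is a sketch under one assumption: that {your-document} is the PDF file name without its extension. Both expected_outputs and present_outputs are hypothetical helpers, not functions provided by cv-doc-chunker.

```python
from pathlib import Path


def expected_outputs(pdf_path: str, output_dir: str) -> dict[str, Path]:
    """Map each documented output to the path where it would be written.

    Assumption: the {your-document} prefix is the PDF file name without
    its extension.
    """
    stem = Path(pdf_path).stem
    out = Path(output_dir)
    return {
        "parsed": out / f"{stem}_parsed.json",            # always written
        "annotations": out / f"{stem}_annotations",       # generate_annotations=True
        "boxes": out / f"{stem}_boxes",                   # save_bounding_boxes=True
        "sorted_text": out / f"{stem}_sorted_text.json",  # ocr=True
        "debug": out / "debug",                           # debug_mode=True
    }


def present_outputs(pdf_path: str, output_dir: str) -> dict[str, Path]:
    """Return only the outputs that actually exist on disk."""
    return {
        name: path
        for name, path in expected_outputs(pdf_path, output_dir).items()
        if path.exists()
    }
```

Since most outputs depend on optional flags, present_outputs is a quick way to see which ones a given run produced before trying to open them.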
This project is licensed under the MIT License - see the LICENSE file for details.