
pdf-processing-system
Comprehensive PDF content extraction and intelligent splitting system
A comprehensive PDF content extraction and intelligent splitting system that can process large PDF documents into manageable sections.
The project includes sample PDF files for testing and demonstration:
- samples/sample-pdf-with-images.pdf - Multi-page PDF with images (3.9MB, 10 pages) - Default test file
- samples/file-example_PDF_1MB.pdf - Standard PDF for basic testing (1MB)
- samples/image-based-pdf-sample.pdf - Image-heavy PDF for image extraction testing
- samples/dummy.pdf - Simple PDF for quick validation tests
All examples in this documentation use the sample files, so you can run them immediately after installation.
pip install PyMuPDF pdf2image
- PyMuPDF (fitz): PDF processing
- pdf2image: PDF to image conversion
- os, json, datetime, math, re: Built-in Python libraries
- difflib: Fuzzy string matching
During installation, you may see dependency conflict warnings like:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
streamlit 1.32.0 requires packaging<24,>=16.8, but you have packaging 25.0 which is incompatible.
streamlit 1.32.0 requires pillow<11,>=7.1.0, but you have pillow 11.2.1 which is incompatible.
This is normal and does not affect functionality. These conflicts occur when other packages (like Streamlit) have stricter version requirements than our package. The PDF Processing System works correctly with the newer versions of these dependencies.
# Install from wheel (recommended)
pip install pdf_processing_system-1.0.0-py3-none-any.whl
# Or install from source
pip install pdf_processing_system-1.0.0.tar.gz
# Test installation
pdf-extractor --help
# Default: Automatic timestamped organization (NEW DEFAULT BEHAVIOR)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results
# Creates: results/extraction_20250527_143022/
# Process without embedded images
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --no-images
# Creates: results/extraction_20250527_143055/
# When you need exact output control (legacy behavior)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --no-timestamps
# Creates: results/ (directly in the folder)
# Split into 3 equal parts with timestamped organization
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --parts 3
# Creates: results/extraction_20250527_143122/equal_parts/
# Validate PDF file
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --validate
# Combine page images into PDF
python pdf_cli.py --combine-images ./page_images --output ./results
# View help and all options
python pdf_cli.py --help
The PDF extractor supports selective extraction for improved performance and specific use cases:
# Extract only text content (fastest - ~0.4 seconds vs ~20+ seconds full processing)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --text-only
# Extract only embedded images (~1-2 seconds)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --images-only
# Convert only pages to images (~4-6 seconds for 10 pages)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --page-images-only
# Combine page images back into a single PDF
python pdf_cli.py --combine-images ./page_images --output results
# Skip page-to-image conversion (saves ~50% processing time)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --no-page-images
# Skip all PDF splitting operations
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --no-splitting
# Skip only equal-parts splitting (keep section-based splitting)
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --no-equal-parts
# Combine multiple skip options
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output results --no-page-images --no-equal-parts
Convert page images back into a single PDF with full fidelity:
# Combine page images from a directory into a single PDF
python pdf_cli.py --combine-images ./page_images --output ./results
# Creates: results/combined_pages.pdf
# Complete round-trip workflow: PDF → Images → PDF
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --page-images-only --output ./step1
python pdf_cli.py --combine-images ./step1/extraction_*/page_images --output ./step2
# Result: Original PDF reconstructed with high fidelity
# Combine images with custom output directory
python pdf_cli.py --combine-images "processed_data/page_images" --output "final_pdfs"
# Creates: final_pdfs/combined_pages.pdf
Process multiple PDF files efficiently:
# Create a batch file listing PDFs (one per line)
echo "samples/sample-pdf-with-images.pdf" > batch_files.txt
echo "samples/file-example_PDF_1MB.pdf" >> batch_files.txt
# Process all files in batch
python pdf_cli.py --batch batch_files.txt --output batch_results
# Batch processing with selective extraction
python pdf_cli.py --batch batch_files.txt --output batch_results --text-only
# Include detailed error information for debugging
python pdf_cli.py --batch batch_files.txt --output batch_results --verbose-errors
| Mode | Processing Time | Output | Use Case |
|---|---|---|---|
| Full Processing | ~20-25 seconds | All features | Complete analysis |
| Text Only | ~0.4 seconds | Text file only | Quick content review |
| Images Only | ~1-2 seconds | Embedded images only | Image extraction |
| Page Images Only | ~4-6 seconds | Page PNGs only | Visual conversion |
| Combine Images | ~25-30 seconds | Single PDF from images | Reconstruct PDF from pages |
| No Page Images | ~12-15 seconds | All except page images | Skip heavy conversion |
| No Splitting | ~8-12 seconds | No PDF splits | Keep original structure |
Default Timestamped Behavior:
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output analysis
Result:
analysis/
└── extraction_20250527_143022/
    ├── extracted_text.txt
    ├── page_images/
    ├── equal_parts/
    └── ... (all outputs organized here)
Exact Directory Control:
python pdf_cli.py "samples/sample-pdf-with-images.pdf" --output analysis --no-timestamps
Result:
analysis/
├── extracted_text.txt
├── page_images/
├── equal_parts/
└── ... (outputs directly in analysis/)
Why Use Timestamped (Default)?
When to Use --no-timestamps?
your_output_directory/
└── extraction_YYYYMMDD_HHMMSS/
    ├── extracted_text.txt
    ├── extracted_images/
    │   ├── image_001.png
    │   └── ...
    ├── page_images/
    │   ├── page_001.png
    │   ├── page_002.png
    │   └── ...
    ├── equal_parts/
    │   ├── part_1_pages_1-N.pdf
    │   ├── part_2_pages_N-M.pdf
    │   └── ...
    ├── extracted_content.json
    ├── section_info.json
    ├── processing_summary.json
    ├── 01_Section_Name_pages_X-Y.pdf
    ├── 02_Next_Section_pages_A-B.pdf
    └── ...
With the --no-timestamps flag:
your_output_directory/
├── extracted_text.txt
├── extracted_images/
│   ├── image_001.png
│   └── ...
├── page_images/
│   ├── page_001.png
│   ├── page_002.png
│   └── ...
├── equal_parts/
│   ├── part_1_pages_1-N.pdf
│   ├── part_2_pages_N-M.pdf
│   └── ...
├── extracted_content.json
├── section_info.json
├── processing_summary.json
├── 01_Section_Name_pages_X-Y.pdf
├── 02_Next_Section_pages_A-B.pdf
└── ...
Key improvements:
- Equal-parts output now goes into a dedicated equal_parts/ subdirectory
The system can be configured via the config.json file or by passing parameters directly to functions.
{
"processing": {
"enable_page_conversion": true,
"page_image_dpi": 300,
"page_image_format": "png",
"white_text_threshold": 15000000,
"default_equal_parts": 4
},
"output": {
"images_dirname": "extracted_images",
"page_images_dirname": "page_images"
}
}
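As a rough illustration of how such a file might be consumed, here is a sketch of loading config.json with fallback defaults. The function name and the merge behavior are our own assumptions, not the package's actual loader; the key names mirror the sample above.

```python
import json

def load_config(path="config.json"):
    # Built-in defaults matching the documented sample (illustrative only)
    defaults = {
        "processing": {
            "enable_page_conversion": True,
            "page_image_dpi": 300,
            "page_image_format": "png",
            "default_equal_parts": 4,
        },
        "output": {
            "images_dirname": "extracted_images",
            "page_images_dirname": "page_images",
        },
    }
    try:
        with open(path) as f:
            user = json.load(f)
    except FileNotFoundError:
        return defaults  # no config file: run with defaults
    # Shallow-merge user settings over the defaults, section by section
    for section, values in user.items():
        defaults.setdefault(section, {}).update(values)
    return defaults

cfg = load_config("missing_config.json")
print(cfg["processing"]["page_image_dpi"])  # 300
```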
The system uses predefined section definitions that can be customized:
predefined_sections = {
"Message From Founders": {"start": 3, "end": 4},
"General Information": {"start": 5, "end": 31},
"Sales": {"start": 32, "end": 78},
"Business Location A": {"start": 79, "end": 92},
"Business Location B": {"start": 93, "end": 96},
"Miscellaneous": {"start": 97, "end": 112}
}
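When customizing these definitions, it is easy to introduce gaps or overlaps between page ranges. A small sanity check (our own helper, not part of the package) that section ranges are contiguous:

```python
def validate_sections(sections):
    # Sort sections by starting page, then verify each section
    # begins on the page right after the previous one ends
    ordered = sorted(sections.items(), key=lambda kv: kv[1]["start"])
    for (_, a), (_, b) in zip(ordered, ordered[1:]):
        if b["start"] != a["end"] + 1:
            return False
    return True

sections = {
    "Message From Founders": {"start": 3, "end": 4},
    "General Information": {"start": 5, "end": 31},
}
print(validate_sections(sections))  # True: 5 follows 4 directly
```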
Adjust the confidence threshold for section matching:
# In fuzzy_match_section_titles function
similarity = SequenceMatcher(None, section_title.lower(), extracted_title.lower()).ratio()
if similarity > 0.6: # 60% minimum threshold
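For reference, this is how difflib's SequenceMatcher scores two titles in a self-contained example (the titles_match wrapper is ours, for illustration; the package's own matching lives in fuzzy_match_section_titles):

```python
from difflib import SequenceMatcher

def titles_match(section_title, extracted_title, threshold=0.6):
    # ratio() returns a similarity score in [0, 1];
    # comparing lowercased strings makes the match case-insensitive
    similarity = SequenceMatcher(
        None, section_title.lower(), extracted_title.lower()
    ).ratio()
    return similarity >= threshold

# A one-character OCR error still clears the 60% threshold
print(titles_match("General Information", "General lnformation"))  # True
# Unrelated titles fall well below it
print(titles_match("Sales", "Appendix"))  # False
```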
Adjust the white text filtering threshold:
if color > 15000000: # Adjust threshold as needed
continue # Skip white/very light text
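The 15000000 cutoff makes sense once you see how a color packs into a single integer. Assuming the color value is a packed 0xRRGGBB integer (as PyMuPDF reports span colors), pure white is 16777215, so anything above 15000000 is near-white. The helper below is illustrative, not part of the package:

```python
def rgb_to_int(r, g, b):
    # Pack an (r, g, b) triple into a single 0xRRGGBB integer
    return (r << 16) | (g << 8) | b

print(rgb_to_int(255, 255, 255))  # 16777215 - pure white, filtered out
print(rgb_to_int(230, 230, 230))  # 15132390 - light grey, above the cutoff
print(rgb_to_int(0, 0, 0))        # 0 - black, always kept
```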
Configure image conversion quality and format:
# In convert_pages_to_images function
dpi = 300 # High-quality output (300 DPI)
fmt = 'PNG' # Output format
Or via configuration file:
{
"processing": {
"enable_page_conversion": true,
"page_image_dpi": 300,
"page_image_format": "png"
}
}
extract_text(pdf_path)
Extracts text from PDF with page separators and filters out white text.
Parameters:
- pdf_path (str): Path to the PDF file
Returns:
- str: Extracted text with page markers

extract_images(pdf_path, output_dir)
Extracts all images from PDF and saves them with metadata.
Parameters:
- pdf_path (str): Path to the PDF file
- output_dir (str): Directory to save images
Returns:
- list: Image metadata with filenames and properties

convert_pages_to_images(pdf_path, output_dir)
Converts each PDF page to a high-quality PNG image at 300 DPI.
Parameters:
- pdf_path (str): Path to the PDF file
- output_dir (str): Directory to save page images
Returns:
- dict: Page images metadata including count, DPI, format, and file list

split_pdf_into_equal_parts(pdf_path, output_dir, num_parts=4)
Splits PDF into equal-sized parts.
Parameters:
- pdf_path (str): Path to the PDF file
- output_dir (str): Directory to save split files
- num_parts (int): Number of parts to split into (default: 4)
Returns:
- list: Paths to created PDF parts

split_pdf_by_sections(pdf_path, output_dir, toc_structure)
Splits PDF based on Table of Contents structure.
Parameters:
- pdf_path (str): Path to the PDF file
- output_dir (str): Directory to save section files
- toc_structure (dict): Section definitions with page ranges
Returns:
- list: Paths to created section files

# Modify the main function
if __name__ == "__main__":
pdf_path = "path/to/your/document.pdf"
# Extract text
text = extract_text(pdf_path)
# Create output directory
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"processed_data/extraction_{timestamp}"
# Save text
save_text(text, output_dir)
# Extract images
image_metadata = extract_images(pdf_path, output_dir)
# Convert pages to images
page_images_metadata = convert_pages_to_images(pdf_path, output_dir)
# Split into equal parts
equal_parts_dir = f"processed_data/equal_parts_{timestamp}"
split_pdf_into_equal_parts(pdf_path, equal_parts_dir, num_parts=4)
custom_sections = {
"Introduction": {"start": 1, "end": 10},
"Chapter 1": {"start": 11, "end": 25},
"Chapter 2": {"start": 26, "end": 40},
"Appendix": {"start": 41, "end": 50}
}
toc_structure = parse_toc_structure(pdf_path, custom_sections)
split_pdf_by_sections(pdf_path, output_dir, toc_structure)
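To make the equal-parts split concrete, here is a sketch of how contiguous page ranges can be computed for num_parts. This is our own illustration of the arithmetic, not the package's internal implementation:

```python
import math

def equal_part_ranges(total_pages, num_parts=4):
    # Split pages 1..total_pages into up to num_parts contiguous ranges,
    # giving earlier parts the extra pages when division is uneven
    size = math.ceil(total_pages / num_parts)
    ranges = []
    start = 1
    for _ in range(num_parts):
        end = min(start + size - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
        if start > total_pages:
            break
    return ranges

print(equal_part_ranges(10, 3))  # [(1, 4), (5, 8), (9, 10)]
```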
- extract_text()
- convert_pages_to_images()
The system includes basic error handling for:
To contribute improvements:
The project includes comprehensive unit tests for all major features:
python test_pdf_extractor.py
Test coverage includes:
This project is for internal use and processing of business documents.