parsekit

0.1.0.pre.1

Rubygems

Version published: 2 weeks ago

Maintainers: 1

Created: 2 weeks ago

ParseKit

Native Ruby bindings for the parser-core Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.

Features

📄 Document Parsing: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
🖼️ OCR Support: Extract text from images using Tesseract OCR
🚀 High Performance: Native Rust performance with Ruby convenience
🔧 Unified API: Single interface for multiple document formats
📦 Cross-Platform: Works on Linux, macOS, and Windows
🧪 Well Tested: Comprehensive test suite with RSpec

Installation

Add this line to your application's Gemfile:

gem 'parsekit'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install parsekit

Requirements

Ruby >= 3.0.0
Rust toolchain (stable)
C compiler (for linking)
System libraries for document parsing:
- macOS: brew install leptonica tesseract poppler
- Ubuntu/Debian: sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev
- Fedora/RHEL: sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel
- Windows: See DEPENDENCIES.md for MSYS2 instructions

For detailed installation instructions and troubleshooting, see DEPENDENCIES.md.

Usage

Basic Usage

require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text  # Extracted text from the PDF

# Parse an Office document
text = ParseKit.parse_file("presentation.pptx")
puts text  # Extracted text from all slides

# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text  # Extracted text from all sheets

# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text

Module-Level Convenience Methods

# Parse files directly
content = ParseKit.parse_file('document.pdf')

# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)

# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]

# Check if a file is supported
ParseKit.supports_file?('document.pdf')  # => true
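
The two calls above combine naturally into a batch helper. The following is a minimal sketch, not part of the gem: `extract_supported` is a hypothetical name, and because this README does not list the gem's error classes, the sketch rescues `StandardError` as an assumption. The `parser:` argument defaults to the `ParseKit` module, so `require 'parsekit'` must have run before calling it without that argument.

```ruby
# Hypothetical helper (not provided by parsekit): extract text from every
# supported file in a list and return a Hash of path => extracted text.
# `parser` is any object responding to `supports_file?` and `parse_file`;
# the default, ParseKit, is only resolved when no parser is passed in.
def extract_supported(paths, parser: ParseKit)
  paths.each_with_object({}) do |path, results|
    # Skip anything the parser does not recognise.
    next unless parser.supports_file?(path)

    begin
      results[path] = parser.parse_file(path)
    rescue StandardError => e
      # Assumption: parse failures surface as StandardError subclasses.
      warn "Skipping #{path}: #{e.message}"
    end
  end
end
```

With the gem loaded, `extract_supported(Dir.glob('docs/*'))` would return only the files that parsed successfully; the `docs/*` pattern is illustrative.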

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
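
How the parser reacts when a file exceeds `max_size` is not described here, so a caller may prefer to check the size before parsing. The sketch below uses only Ruby's standard `File` API; `within_size_limit?` and `MAX_BYTES` are hypothetical names, and the 50MB figure simply mirrors the example above.

```ruby
# Hypothetical pre-flight check (not part of parsekit) that mirrors the
# `max_size` option: reject files that are missing or too large before
# handing them to the parser.
MAX_BYTES = 50 * 1024 * 1024 # same 50MB figure as the example above

def within_size_limit?(path, limit: MAX_BYTES)
  # File.file? guards against missing paths and directories.
  File.file?(path) && File.size(path) <= limit
end
```

A caller could then write `parser.parse_file(path) if within_size_limit?(path)` and handle the oversized case explicitly.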

Format-Specific Parsing

parser = ParseKit::Parser.new

# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)

Supported Formats

Format	Extensions	Method	Notes
PDF	.pdf	`parse_pdf`	Text extraction via MuPDF
Word	.docx	`parse_docx`	Office Open XML format
Excel	.xlsx, .xls	`parse_xlsx`	Both modern and legacy formats
Images	.png, .jpg, .jpeg, .tiff, .bmp	`ocr_image`	OCR via embedded Tesseract
JSON	.json	`parse_json`	Pretty-printed output
XML/HTML	.xml, .html	`parse_xml`	Extracts text content
Text	.txt, .csv, .md	`parse_text`	With encoding detection
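
The table above can be expressed as a lookup from file extension to method name. This is a sketch of one way to dispatch manually, not the gem's own format detection: `FORMAT_METHODS`, `format_method_for`, and `parse_by_extension` are hypothetical names. It also assumes every format-specific method accepts a byte array, which this README only demonstrates for `parse_pdf`, `ocr_image`, and `parse_xlsx`.

```ruby
# Extension-to-method mapping copied from the Supported Formats table.
FORMAT_METHODS = {
  '.pdf'  => :parse_pdf,
  '.docx' => :parse_docx,
  '.xlsx' => :parse_xlsx, '.xls' => :parse_xlsx,
  '.png'  => :ocr_image, '.jpg' => :ocr_image, '.jpeg' => :ocr_image,
  '.tiff' => :ocr_image, '.bmp' => :ocr_image,
  '.json' => :parse_json,
  '.xml'  => :parse_xml, '.html' => :parse_xml,
  '.txt'  => :parse_text, '.csv' => :parse_text, '.md' => :parse_text
}.freeze

# Returns the method name for a path, or nil if the extension is unknown.
def format_method_for(path)
  FORMAT_METHODS[File.extname(path).downcase]
end

# Reads the file as an array of bytes and calls the matching method on
# `parser` (for example, a ParseKit::Parser instance).
def parse_by_extension(parser, path)
  method_name = format_method_for(path)
  raise ArgumentError, "Unsupported extension: #{path}" unless method_name

  # Assumption: all format-specific methods take a byte array.
  bytes = File.binread(path).bytes
  parser.public_send(method_name, bytes)
end
```

In most cases `parse_file` is the simpler choice; explicit dispatch like this is mainly useful when the caller wants to control which format-specific method runs.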

Performance

ParseKit is built with performance in mind:

Native Rust implementation for speed
Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
Efficient memory usage with streaming where possible
Configurable size limits to prevent memory issues

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests.

To compile the Rust extension:

rake compile

To run tests with coverage:

rake dev:coverage

Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

Ruby Layer: Provides convenient API and format detection
Rust Layer: Implements high-performance parsing using:
- MuPDF for PDF text extraction (statically linked)
- rusty-tesseract for OCR (with embedded Tesseract)
- Pure Rust libraries for DOCX/XLSX parsing
- Magnus for Ruby-Rust FFI bindings

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/parsekit.

License

The gem is available as open source under the terms of the MIT License.

Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.

Package last updated on 21 Aug 2025
