ParseKit

Native Ruby bindings for the parser-core Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
Features
- 📄 Document Parsing: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
- 🖼️ OCR Support: Extract text from images using Tesseract OCR
- 🚀 High Performance: Native Rust performance with Ruby convenience
- 🔧 Unified API: Single interface for multiple document formats
- 📦 Cross-Platform: Works on Linux, macOS, and Windows
- 🧪 Well Tested: Comprehensive test suite with RSpec
Installation
Add this line to your application's Gemfile:
gem 'parsekit'
And then execute:
$ bundle install
Or install it yourself as:
gem install parsekit
Requirements
- Ruby >= 3.0.0
- Rust toolchain (stable)
- C compiler (for linking)
- System libraries for document parsing:
  - macOS:
    brew install leptonica tesseract poppler
  - Ubuntu/Debian:
    sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev
  - Fedora/RHEL:
    sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel
  - Windows: See DEPENDENCIES.md for MSYS2 instructions
For detailed installation instructions and troubleshooting, see DEPENDENCIES.md.
Usage
Basic Usage
require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text

# Parse a PowerPoint presentation
text = ParseKit.parse_file("presentation.pptx")
puts text

# Parse an Excel spreadsheet
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text

# Parse from bytes already in memory
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Use a parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
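The calls above assume the file exists and parses cleanly. A minimal sketch of a defensive wrapper follows. The helper name `safe_parse_file` and its `parser:` keyword are hypothetical and not part of ParseKit, and the sketch assumes parse failures surface as `StandardError` subclasses, which this README does not specify.

```ruby
# Load the gem if it is installed. The helper below works with any object
# that responds to #parse_file, so it can also be exercised with a stub.
begin
  require 'parsekit'
rescue LoadError
  # parsekit is not installed; pass your own parser object via `parser:`.
end

# Hypothetical helper (not part of ParseKit).
# Returns the extracted text, or nil if the file is missing or parsing fails.
# Assumption: parse failures are raised as StandardError subclasses.
def safe_parse_file(path, parser: ParseKit)
  return nil unless File.file?(path)

  parser.parse_file(path)
rescue StandardError => e
  warn "Could not parse #{path}: #{e.message}"
  nil
end
```

With the gem installed, `safe_parse_file("document.pdf")` would delegate to `ParseKit.parse_file`; a `ParseKit::Parser` instance can be passed as `parser:` instead.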
Module-Level Convenience Methods
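# Parse a file directly
content = ParseKit.parse_file('document.pdf')
# Parse raw bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)
# List the supported formats
formats = ParseKit.supported_formats
# Check whether a file's format is supported
ParseKit.supports_file?('document.pdf')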
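The two methods `supports_file?` and `parse_file` combine naturally when working through a mixed list of files. The sketch below is illustrative only: `parse_supported` and its `kit:` keyword are hypothetical names, and it assumes `supports_file?` returns a truthy value for parseable paths.

```ruby
# Hypothetical helper (not part of ParseKit).
# Parses every path the kit reports as supported and returns a Hash of
# { path => extracted text }. Unsupported paths are skipped silently.
# `kit` is any object responding to #supports_file? and #parse_file;
# it defaults to the ParseKit module when the gem is loaded.
def parse_supported(paths, kit: ParseKit)
  paths.select { |path| kit.supports_file?(path) }
       .to_h { |path| [path, kit.parse_file(path)] }
end
```

For example, `parse_supported(Dir.glob("docs/*"))` would return text for each file in `docs/` that ParseKit recognizes.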
Configuration Options
# Create a parser with explicit options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024, # 50 MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience constructor
parser = ParseKit::Parser.strict
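This README does not say what the parser does when a file exceeds `max_size`. If you would rather skip oversized files before handing them to the parser, a pre-check along these lines is one option. `within_size_limit?` is a hypothetical helper, and treating the limit as inclusive is an assumption.

```ruby
# Same 50 MB value used for max_size in the example above.
MAX_SIZE = 50 * 1024 * 1024

# Hypothetical helper (not part of ParseKit).
# True when the file on disk is no larger than max_size bytes.
# Assumption: the limit is inclusive.
def within_size_limit?(path, max_size = MAX_SIZE)
  File.size(path) <= max_size
end
```

A caller could then guard parsing with `parser.parse_file(path) if within_size_limit?(path)`.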
Format-Specific Parsing
parser = ParseKit::Parser.new

# PDF: pass the file contents as a byte array
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

# Image: extract text with OCR
image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

# Excel: extract text from a spreadsheet
excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)
Supported Formats
Format | Extensions | Method | Notes |
PDF | .pdf | parse_pdf | Text extraction via MuPDF |
Word | .docx | parse_docx | Office Open XML format |
Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via embedded Tesseract |
JSON | .json | parse_json | Pretty-printed output |
XML/HTML | .xml, .html | parse_xml | Extracts text content |
Text | .txt, .csv, .md | parse_text | With encoding detection |
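The table can be expressed as a lookup from file extension to parser method name. This is only a sketch of how the mapping might be used from application code, not ParseKit's own detection logic; `FORMAT_METHODS` and `parser_method_for` are hypothetical names.

```ruby
# Extension-to-method lookup built from the table above.
# Hypothetical constant (not part of ParseKit).
FORMAT_METHODS = {
  %w[.pdf]                       => :parse_pdf,
  %w[.docx]                      => :parse_docx,
  %w[.xlsx .xls]                 => :parse_xlsx,
  %w[.png .jpg .jpeg .tiff .bmp] => :ocr_image,
  %w[.json]                      => :parse_json,
  %w[.xml .html]                 => :parse_xml,
  %w[.txt .csv .md]              => :parse_text
}.flat_map { |exts, meth| exts.map { |ext| [ext, meth] } }.to_h.freeze

# Hypothetical helper (not part of ParseKit).
# Returns the method name for a path, or nil if the extension is not listed.
# The comparison is case-insensitive.
def parser_method_for(path)
  FORMAT_METHODS[File.extname(path).downcase]
end
```

For instance, `parser_method_for("scan.JPG")` yields `:ocr_image`, which could then be passed to `parser.public_send` together with the file's bytes, as in the format-specific examples.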
Performance
ParseKit is built with performance in mind:
- Native Rust implementation for speed
- Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
- Efficient memory usage with streaming where possible
- Configurable size limits to prevent memory issues
Development
After checking out the repo, run bin/setup to install dependencies. Then run rake spec to run the tests.
To compile the Rust extension:
rake compile
To run tests with coverage:
rake dev:coverage
Architecture
ParseKit uses a hybrid Ruby/Rust architecture:
- Ruby Layer: Provides convenient API and format detection
- Rust Layer: Implements high-performance parsing using:
  - MuPDF for PDF text extraction (statically linked)
  - rusty-tesseract for OCR (with embedded Tesseract)
  - Pure Rust libraries for DOCX/XLSX parsing
  - Magnus for Ruby-Rust FFI bindings
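Format detection can be based on file extensions or on file contents. How ParseKit's Ruby layer actually detects formats is not described here, so the following is only a generic sketch of content-based detection using well-known file signatures. `MAGIC_NUMBERS` and `sniff_format` are hypothetical names.

```ruby
# Leading byte signatures for a few common formats.
# Hypothetical constant (not part of ParseKit).
MAGIC_NUMBERS = {
  pdf:  "%PDF-".b,
  png:  "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  zip:  "PK\x03\x04".b # DOCX, XLSX and PPTX are ZIP containers
}.freeze

# Hypothetical helper (not part of ParseKit).
# Returns a format symbol for the given String of bytes, or nil if no
# signature matches. Comparison is done on binary-encoded copies.
def sniff_format(bytes)
  head = bytes.b
  MAGIC_NUMBERS.each do |name, magic|
    return name if head.start_with?(magic)
  end
  nil
end
```

Reading only the first few bytes is enough, for example `sniff_format(File.binread("document.pdf", 8))`. A `:zip` result still needs a look inside the archive to tell DOCX, XLSX, and PPTX apart.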
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/parsekit.
License
The gem is available as open source under the terms of the MIT License.
Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.