parsekit

0.1.0.pre.1
ParseKit


Native Ruby bindings for the parser-core Rust crate, built with Magnus. The gem wraps parser-core to provide high-performance text extraction from PDFs, Office documents (DOCX, XLSX, PPTX), images (via OCR), and more. Part of the ruby-nlp ecosystem.

Features

  • 📄 Document Parsing: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
  • 🖼️ OCR Support: Extract text from images using Tesseract OCR
  • 🚀 High Performance: Parsing runs in native Rust code, exposed through a Ruby API
  • 🔧 Unified API: Single interface for multiple document formats
  • 📦 Cross-Platform: Works on Linux, macOS, and Windows
  • 🧪 Well Tested: Comprehensive test suite with RSpec

Installation

Add this line to your application's Gemfile:

gem 'parsekit'

Then run:

bundle install

Or install the gem directly:

gem install parsekit

Requirements

  • Ruby >= 3.0.0
  • Rust toolchain (stable)
  • C compiler (for linking)
  • System libraries for document parsing:
    • macOS: brew install leptonica tesseract poppler
    • Ubuntu/Debian: sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev
    • Fedora/RHEL: sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel
    • Windows: See DEPENDENCIES.md for MSYS2 instructions

For detailed installation instructions and troubleshooting, see DEPENDENCIES.md.

Usage

Basic Usage

require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text  # Extracted text from the PDF

# Parse an Office document
text = ParseKit.parse_file("presentation.pptx")
puts text  # Extracted text from all slides

# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text  # Extracted text from all sheets

# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
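
When several files are parsed in one run, it can help to collect a result per path so that one bad file does not stop the rest. The following is a minimal sketch, assuming only that the parser object responds to `parse_file` as shown above. `parse_all` and `FakeParser` are hypothetical names used for illustration and are not part of ParseKit; the stand-in parser lets the sketch run without the native extension installed.

```ruby
# Hypothetical helper, not part of ParseKit: parse several files with one
# parser object and record either the extracted text or the error per path.
def parse_all(parser, paths)
  paths.each_with_object({}) do |path, results|
    results[path] = parser.parse_file(path)
  rescue StandardError => e
    # The gem's specific error classes are not documented here, so this
    # sketch rescues StandardError broadly.
    results[path] = e
  end
end

# Stand-in for ParseKit::Parser so the example is self-contained.
class FakeParser
  def parse_file(path)
    raise ArgumentError, "unsupported: #{path}" if File.extname(path) == ".zip"

    "text of #{path}"
  end
end

results = parse_all(FakeParser.new, ["report.docx", "bundle.zip"])
results.each do |path, outcome|
  status = outcome.is_a?(Exception) ? "failed (#{outcome.message})" : "ok"
  puts "#{path}: #{status}"
end
```

With the gem installed, `ParseKit::Parser.new` would be passed in place of `FakeParser.new`.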

Module-Level Convenience Methods

# Parse files directly
content = ParseKit.parse_file('document.pdf')

# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)

# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]

# Check if a file is supported
ParseKit.supports_file?('document.pdf')  # => true
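
The format list can also be used to filter a batch of paths before parsing. This is a small sketch, assuming the list holds lowercase extensions without a leading dot, as in the output shown above. `parseable_paths` is a hypothetical helper, and the literal list below stands in for the value returned by `ParseKit.supported_formats`.

```ruby
# Hypothetical helper: keep only the paths whose extension appears in
# `formats` (lowercase extensions without a leading dot).
def parseable_paths(paths, formats)
  allowed = formats.map(&:downcase)
  paths.select do |path|
    ext = File.extname(path).delete_prefix(".").downcase
    allowed.include?(ext)
  end
end

# Copied from the example output above; with the gem loaded this would be
# ParseKit.supported_formats.
formats = %w[txt json xml html docx xlsx xls csv pdf png jpg jpeg tiff bmp]

candidates = ["report.PDF", "notes.txt", "archive.zip", "scan.jpeg", "README"]
selected = parseable_paths(candidates, formats)
puts selected.inspect
```

The comparison is case-insensitive on the extension, which is a choice made in this sketch; whether `supports_file?` treats `report.PDF` the same way is not stated here.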

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
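
The `max_size` value above is given in bytes. As a rough illustration of what such a limit means, the sketch below checks a file's size on disk before it would be handed to a parser. `within_size_limit?` and `MAX_SIZE` are hypothetical names, and how ParseKit itself enforces `max_size` is not described here.

```ruby
require "tempfile"

# Same limit as in the configuration example: 50 MB expressed in bytes.
MAX_SIZE = 50 * 1024 * 1024

# Hypothetical pre-check: true when the file on disk is no larger than `limit`.
def within_size_limit?(path, limit = MAX_SIZE)
  File.size(path) <= limit
end

Tempfile.create("sample") do |file|
  file.write("hello")
  file.flush
  puts within_size_limit?(file.path)     # 5 bytes against the 50 MB limit
  puts within_size_limit?(file.path, 4)  # 5 bytes against a 4-byte limit
end
```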

Format-Specific Parsing

parser = ParseKit::Parser.new

# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)

Supported Formats

| Format | Extensions | Method | Notes |
|--------|------------|--------|-------|
| PDF | .pdf | parse_pdf | Text extraction via MuPDF |
| Word | .docx | parse_docx | Office Open XML format |
| Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via embedded Tesseract |
| JSON | .json | parse_json | Pretty-printed output |
| XML/HTML | .xml, .html | parse_xml | Extracts text content |
| Text | .txt, .csv, .md | parse_text | With encoding detection |
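
The table above can be read as a lookup from file extension to format-specific method. The sketch below encodes that mapping directly. `PARSER_METHODS` and `parser_method_for` are hypothetical names, and the gem's own format detection may work differently.

```ruby
# Extension-to-method mapping transcribed from the Supported Formats table.
PARSER_METHODS = {
  "pdf"  => :parse_pdf,
  "docx" => :parse_docx,
  "xlsx" => :parse_xlsx, "xls" => :parse_xlsx,
  "png"  => :ocr_image, "jpg" => :ocr_image, "jpeg" => :ocr_image,
  "tiff" => :ocr_image, "bmp" => :ocr_image,
  "json" => :parse_json,
  "xml"  => :parse_xml, "html" => :parse_xml,
  "txt"  => :parse_text, "csv" => :parse_text, "md" => :parse_text
}.freeze

# Hypothetical helper: returns the method name for a path, or nil when the
# extension is not in the table.
def parser_method_for(path)
  ext = File.extname(path).delete_prefix(".").downcase
  PARSER_METHODS[ext]
end

%w[invoice.pdf scan.TIFF data.xls notes.md archive.zip].each do |path|
  puts "#{path}: #{parser_method_for(path).inspect}"
end

# With the gem installed, dispatch could then look like this (assumption,
# following the Format-Specific Parsing examples):
#   parser = ParseKit::Parser.new
#   bytes  = File.binread(path).bytes
#   text   = parser.public_send(parser_method_for(path), bytes)
```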

Performance

ParseKit is built with performance in mind:

  • Native Rust implementation for speed
  • Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
  • Efficient memory usage with streaming where possible
  • Configurable size limits to prevent memory issues
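
To see how these points play out on your own documents, parse times can be measured with a monotonic clock. This is a minimal sketch: `timed` is a hypothetical helper, and the block below uses a stand-in string operation so that it runs without the gem. No timings are claimed here.

```ruby
# Hypothetical helper: runs the block, prints the elapsed wall-clock time,
# and returns both the block's result and the elapsed seconds.
def timed(label)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format("%-10s %.4fs", label, elapsed)
  [result, elapsed]
end

# Stand-in workload; with the gem installed the block could instead be
#   ParseKit.parse_file("document.pdf")
text, seconds = timed("stand-in") { "x" * 1_000 }
puts "#{text.length} characters in #{seconds.round(4)}s"
```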

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests.

To compile the Rust extension:

rake compile

To run tests with coverage:

rake dev:coverage

Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

  • Ruby Layer: Provides convenient API and format detection
  • Rust Layer: Implements high-performance parsing using:
    • MuPDF for PDF text extraction (statically linked)
    • rusty-tesseract for OCR (with embedded Tesseract)
    • Pure Rust libraries for DOCX/XLSX parsing
    • Magnus for Ruby-Rust FFI bindings

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/parsekit.

License

The gem is available as open source under the terms of the MIT License.

Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.

Package last updated on 21 Aug 2025
