document_to_rich_html

  • 0.2.0
  • Rubygems

DocumentToRichHtml

DocumentToRichHtml is a Ruby gem that converts PDF, Word, Excel, and image files to rich HTML compatible with the Trix editor. Where the source format allows, it preserves formatting, styles, and embedded images, which makes it useful for applications that need to import and display formatted content.

Features

  • Converts PDF files to rich HTML, preserving text content
  • Converts Word documents (.docx, .doc) to rich HTML, maintaining formatting and embedded images
  • Converts Excel spreadsheets (.xlsx, .xls) to HTML tables
  • Converts images (.jpg, .jpeg, .png, .gif, .svg) to embedded base64 data in HTML
  • Formats output HTML to be compatible with Trix editor
  • Implements security measures to prevent processing of malicious files

Supported Formats and Capabilities

PDF (.pdf)

  • Extracts text content from all pages
  • Preserves line breaks and basic structure
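
As a rough illustration of what preserving line breaks and basic structure can look like, the sketch below turns already-extracted plain text into paragraphs and `<br>` tags. The helper name `text_to_html` is hypothetical and is not part of the gem's API; how the gem extracts and structures PDF text internally is not documented here.

```ruby
require 'erb'

# Hypothetical helper (not part of DocumentToRichHtml): converts plain text
# into simple HTML. Blank lines separate paragraphs, single newlines become <br>.
def text_to_html(text)
  normalized = text.gsub(/\r\n?/, "\n").strip
  normalized.split(/\n{2,}/).map do |paragraph|
    # Escape each line so characters like < and & cannot be read as markup.
    lines = paragraph.split("\n").map { |line| ERB::Util.html_escape(line.strip) }
    "<p>#{lines.join('<br>')}</p>"
  end.join("\n")
end

puts text_to_html("Title line\nSubtitle\n\nBody with <angle> brackets")
```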

Word (.docx, .doc)

  • Preserves text formatting (bold, italic, underline, etc.)
  • Maintains document structure (headings, paragraphs, lists)
  • Retains embedded images
  • Converts tables to HTML tables

Excel (.xlsx, .xls)

  • Converts spreadsheets to HTML tables
  • Preserves cell values and basic formatting
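
To make the spreadsheet-to-table idea concrete, here is a minimal sketch that renders rows of cell values as an HTML table. It assumes the rows have already been read into nested arrays and that the first row is a header; `rows_to_html_table` is a hypothetical name, not the gem's own method.

```ruby
require 'erb'

# Hypothetical helper (not part of DocumentToRichHtml): renders an array of
# rows as an HTML table. Treating the first row as the header is an assumption.
def rows_to_html_table(rows)
  return '<table></table>' if rows.empty?

  header, *body = rows
  # Escape every cell value so spreadsheet content cannot inject markup.
  cell = ->(tag, value) { "<#{tag}>#{ERB::Util.html_escape(value.to_s)}</#{tag}>" }
  head_html = "<tr>#{header.map { |v| cell.call('th', v) }.join}</tr>"
  body_html = body.map { |row| "<tr>#{row.map { |v| cell.call('td', v) }.join}</tr>" }
  "<table>#{([head_html] + body_html).join}</table>"
end

puts rows_to_html_table([%w[Item Qty], ['A & B', 2]])
```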

Images (.jpg, .jpeg, .png, .gif, .svg)

  • Embeds images as base64-encoded data within the HTML
  • Preserves image quality and dimensions
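
Embedding an image as base64 usually means building a `data:` URI. The sketch below shows one way to do that with core Ruby only; the extension-to-MIME mapping and the helper name `image_to_img_tag` are assumptions for illustration and may differ from what the gem does.

```ruby
# Assumed extension-to-MIME mapping for the image formats listed above.
IMAGE_MIME_TYPES = {
  '.jpg' => 'image/jpeg', '.jpeg' => 'image/jpeg', '.png' => 'image/png',
  '.gif' => 'image/gif', '.svg' => 'image/svg+xml'
}.freeze

# Hypothetical helper (not part of DocumentToRichHtml): returns an <img> tag
# whose src is a base64 data URI holding the file's bytes unchanged.
def image_to_img_tag(path)
  mime = IMAGE_MIME_TYPES.fetch(File.extname(path).downcase) do
    raise ArgumentError, "unsupported image type: #{path}"
  end
  encoded = [File.binread(path)].pack('m0') # 'm0' is base64 without line breaks
  %(<img src="data:#{mime};base64,#{encoded}">)
end
```

Because the bytes are encoded as-is, quality and dimensions are untouched; the trade-off is that base64 output is roughly a third larger than the original file.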

Security Features

  • File type validation using MIME type checking
  • File size limits to prevent processing of extremely large files
  • Secure temporary file handling
  • Input sanitization to prevent XSS attacks

Configuration

You can configure the maximum file size limit by setting an environment variable:

export MAX_FILE_SIZE=10000000
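
For illustration, a size check driven by this variable might look like the sketch below. It assumes the value is a byte count and falls back to 10,000,000 when the variable is unset; the gem's real default and the helper names `max_file_size` and `within_size_limit?` are not taken from the gem.

```ruby
# Assumed fallback when MAX_FILE_SIZE is unset; the gem's real default may differ.
ASSUMED_DEFAULT_MAX_BYTES = 10_000_000

# Hypothetical helper: reads the limit from the environment as a base-10 integer.
# Raises ArgumentError if the variable holds something that is not a number.
def max_file_size(env = ENV)
  Integer(env.fetch('MAX_FILE_SIZE', ASSUMED_DEFAULT_MAX_BYTES.to_s), 10)
end

# Hypothetical helper: true when the file is no larger than the configured limit.
def within_size_limit?(path, env = ENV)
  File.size(path) <= max_file_size(env)
end
```

Passing `env` as a parameter keeps the check easy to test without changing the real process environment.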

Installation

Add this line to your application's Gemfile:

gem 'document_to_rich_html'

And then execute:

bundle install

Or install it yourself as:

gem install document_to_rich_html

Usage

require 'document_to_rich_html'

html = DocumentToRichHtml.convert('path/to/your/document.pdf')
puts html

# Convert a PDF file
rich_html = DocumentToRichHtml.convert('path/to/your/document.pdf')

# Convert a Word document
rich_html = DocumentToRichHtml.convert('path/to/your/document.docx')

# Convert an Excel spreadsheet
rich_html = DocumentToRichHtml.convert('path/to/your/spreadsheet.xlsx')

# Convert an image
rich_html = DocumentToRichHtml.convert('path/to/your/image.jpg')

The convert method returns a string containing the rich HTML representation of the document, which can be used directly with the Trix editor or other rich text editors.
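
One common way to load an HTML string into Trix is to put it in a hidden input that the `<trix-editor>` element references through its `input` attribute. The sketch below builds that markup from any HTML string, such as the value returned by `convert`. The helper name `trix_editor_markup`, the default `input_id`, and the `name="content"` field are illustrative assumptions.

```ruby
require 'erb'

# Hypothetical helper (not part of DocumentToRichHtml): wraps an HTML string
# in the hidden-input plus <trix-editor> pairing that Trix reads content from.
def trix_editor_markup(html, input_id: 'document_content')
  # The HTML must be attribute-escaped because it is placed inside value="...".
  escaped = ERB::Util.html_escape(html)
  id = ERB::Util.html_escape(input_id)
  %(<input id="#{id}" type="hidden" name="content" value="#{escaped}">\n) +
    %(<trix-editor input="#{id}"></trix-editor>)
end

puts trix_editor_markup('<div><strong>Hello</strong> world</div>')
```

In a Rails application with Action Text, the form helpers typically produce this pairing for you, so a helper like this is mainly useful in plain templates.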

Limitations

  • PDF conversion is limited to text content; complex layouts or embedded images in PDFs are not preserved
  • Some advanced formatting in Word documents may not be perfectly converted
  • Excel conversion is basic and doesn't support advanced features like formulas or charts

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/yourusername/document_to_rich_html. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the DocumentToRichHtml project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

Package last updated on 24 Sep 2024
