text-extractor

  • 0.1.0
  • Rubygems

Maintainers
3

Text-Extractor

This gem wraps command line tools to extract plain text from common file types, such as:

  • PDF
  • RTF
  • MS Office
    • Word (doc, docx)
    • Excel (xls, xlsx)
    • PowerPoint (ppt, pptx)
  • OpenOffice / LibreOffice
    • Presentation
    • Text
    • Spreadsheet
  • Image files (png, jpeg, tiff), such as screenshots and scanned documents, through character recognition (OCR)
  • Plaintext (txt)
  • Comma-separated values (csv)

Acknowledgements

This gem is based on work by Jens Krämer / Planio, who originally provided it as a patch for Redmine. It is now a collaborative effort of the project management software providers Planio and OpenProject, as both systems face the same challenge of extracting plain text from attachment files.

Installation

Add this line to your application's Gemfile:

gem 'text-extractor'

And then execute:

$ bundle

Or install it yourself as:

$ gem install text-extractor

Rails

In a Rails application, save text-extractor.yml.example as config/text_extractor.yml (the file name the initializer below reads) and adjust the settings to your needs.

Then load that configuration file in an initializer. Add the following lines to config/initializers/text_extractor.rb:

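# Load the text-extractor settings only if the configuration file exists.
file_name = Rails.root.join('config', 'text_extractor.yml')
if File.file?(file_name)
  # Configuration.load receives the file's contents, not its path.
  config_file = File.read(file_name)
  TextExtractor::Configuration.load(config_file)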
end

Plain Ruby

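Outside of Rails, pass the contents of your configuration file to TextExtractor::Configuration.load yourself, or override that method to fit your setup.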

Linux

On Linux, the default configuration should work. However, make sure that the following packages are installed:

$ apt-get install catdoc unrtf poppler-utils tesseract-ocr
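
To confirm that the tools are actually available before relying on the default configuration, a small check along the following lines may help. This is only a sketch: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names that are not part of text-extractor, and the list assumes the usual executable names shipped by the packages above (`pdftotext` comes from poppler-utils, `tesseract` from tesseract-ocr).

```ruby
# Hypothetical helper, not part of text-extractor: reports which command
# line tools cannot be found as executables on the given search path.
REQUIRED_TOOLS = %w[catdoc unrtf pdftotext tesseract].freeze

def missing_tools(tools = REQUIRED_TOOLS, search_path = ENV.fetch('PATH', ''))
  dirs = search_path.split(File::PATH_SEPARATOR).reject(&:empty?)
  # Keep only the tools for which no directory holds a matching executable.
  tools.reject do |tool|
    dirs.any? do |dir|
      candidate = File.join(dir, tool)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

missing = missing_tools
if missing.empty?
  puts 'All extraction tools were found.'
else
  puts "Missing tools: #{missing.join(', ')}"
end
```

The search path is a parameter so that the check can also be pointed at a non-standard installation directory.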

Mac OS X

Support on Mac is still incomplete: right now we cannot extract text from presentations and spreadsheets. Please help us reach the same capabilities as on Linux.

Please use Homebrew to install the missing command line tools.

$ brew install unrtf poppler tesseract

The configuration file (config/text_extractor.yml in a Rails application) should look like this:

pdftotext:
  - /usr/local/bin/pdftotext
  - -enc
  - UTF-8
  - __FILE__
  - '-'

unrtf:
  - /usr/local/bin/unrtf
  - --text
  - __FILE__

tesseract:
  - /usr/local/bin/tesseract
  - __FILE__
  - stdout

catdoc:
  - /usr/bin/textutil
  - -convert
  - txt
  - -stdout
  - __FILE__
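
Each entry above reads as an argument list in which `__FILE__` marks where the path of the file to convert goes. The following sketch illustrates that reading with one of the entries; it is an assumption about how the placeholder is meant to be used, not the gem's actual implementation, and `expand_command` is a hypothetical helper.

```ruby
require 'yaml'

# Assumption: each top-level key maps to an argument list in which the
# literal string __FILE__ stands for the path of the file to convert.
CONFIG = YAML.safe_load(<<~YAML)
  pdftotext:
    - /usr/local/bin/pdftotext
    - -enc
    - UTF-8
    - __FILE__
    - '-'
YAML

# Hypothetical helper: returns a copy of the argument list with the
# placeholder replaced by the real path.
def expand_command(template, path)
  template.map { |arg| arg == '__FILE__' ? path : arg }
end

command = expand_command(CONFIG.fetch('pdftotext'), '/tmp/report.pdf')
p command
# prints ["/usr/local/bin/pdftotext", "-enc", "UTF-8", "/tmp/report.pdf", "-"]
```

Keeping the command as a list of separate arguments means it can be handed to something like `IO.popen(command)` without going through a shell, so file names with spaces or special characters need no extra quoting.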

Usage

# `file` is of type File.
# `content_type` is a String.
fulltext = TextExtractor::Resolver.new(file, content_type).text
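
If the content type is not already known (for example from an upload), it has to be derived somehow before calling the resolver. The lookup below is a minimal sketch based on the file extension. `CONTENT_TYPES` and `guess_content_type` are hypothetical names, and which content types the gem actually recognizes is an assumption here; the table simply lists standard MIME types for some of the formats named at the top.

```ruby
# Illustrative lookup table, not part of text-extractor. Which content types
# the gem actually recognizes is an assumption.
CONTENT_TYPES = {
  '.pdf'  => 'application/pdf',
  '.rtf'  => 'application/rtf',
  '.doc'  => 'application/msword',
  '.docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.xls'  => 'application/vnd.ms-excel',
  '.xlsx' => 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  '.png'  => 'image/png',
  '.jpeg' => 'image/jpeg',
  '.jpg'  => 'image/jpeg',
  '.tiff' => 'image/tiff',
  '.txt'  => 'text/plain',
  '.csv'  => 'text/csv'
}.freeze

# Returns the content type for a path, or nil for unknown extensions.
def guess_content_type(path)
  CONTENT_TYPES[File.extname(path).downcase]
end

path = 'report.PDF'
content_type = guess_content_type(path)
puts content_type.inspect

# With the gem loaded, the result could then be passed on as shown above:
#   File.open(path) { |file| TextExtractor::Resolver.new(file, content_type).text }
```

Extension-based guessing is only a convenience; a content type reported by the upload or by a file inspection tool is usually more reliable.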

License

The text-extractor gem is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with the plugin. If not, see www.gnu.org/licenses.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/planio-gmbh/text-extractor.

Package last updated on 13 Feb 2018