
rika
Rika is a JRuby wrapper for the Apache Tika Java library, which extracts text and metadata from files and resources of many different formats.
Caution: This gem only works with JRuby.
Rika currently supports some basic and commonly used functions of Tika. Future development may add Ruby support for more Tika functionality, and perhaps a command line interface as well. See the Other Tika Resources section for alternatives to Rika that may suit more demanding needs.
For a quick start with the simplest use cases, the following convenience functions return what you need in a single call:
require 'rika'
content = Rika.parse_content('x.pdf') # string containing all content text
metadata = Rika.parse_metadata('x.pdf') # hash containing the document metadata
content, metadata = Rika.parse_content_and_metadata('x.pdf') # both of the above
A URL can be used instead of a filespec wherever a data source is specified:
content, metadata = Rika.parse_content_and_metadata('https://github.com/keithrbennett/rika')
For other use cases and finer control, you can work directly with the Rika::Parser object:
require 'rika'
parser = Rika::Parser.new('x.pdf')
# Return the content of the document:
parser.content
# Return the metadata of the document:
parser.metadata
# Return the media type for the document, e.g. "application/pdf":
parser.media_type
# Return only the first 10000 chars of the content:
parser = Rika::Parser.new('x.pdf', 10000)
parser.content # first 10000 chars returned
# Return the first 200 chars of the content from a URL:
parser = Rika::Parser.new('http://example.com/x.pdf', 200)
parser.content
# Return the language for the content
parser = Rika::Parser.new('german-document.pdf')
parser.language # => "de"
# Check whether the language identification is certain enough to be trusted
parser.language_is_reasonably_certain?
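The two language calls above combine naturally into a guard. The following is a minimal sketch, not part of Rika: `trusted_language` and its `fallback:` parameter are hypothetical names, and `FakeParser` is a stand-in for `Rika::Parser` so that the example runs without JRuby or Tika.

```ruby
# Hypothetical helper, not part of Rika: returns the detected language code
# only when the parser reports reasonable certainty, otherwise the fallback.
def trusted_language(parser, fallback: nil)
  parser.language_is_reasonably_certain? ? parser.language : fallback
end

# Stand-in for Rika::Parser so this sketch runs without JRuby or Tika.
FakeParser = Struct.new(:language, :certain) do
  def language_is_reasonably_certain?
    certain
  end
end

puts trusted_language(FakeParser.new('de', true)).inspect
puts trusted_language(FakeParser.new('de', false), fallback: 'unknown').inspect
```

With the real gem, the call would presumably look like `trusted_language(Rika::Parser.new('german-document.pdf'))`.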
Since Ruby supports the -r option to require a library, and the -e option to evaluate a string of code, you can easily do simple parsing on the command line, such as:
ruby -r rika -e 'puts Rika.parse_content("x.pdf")'
You could also parse the metadata and output it as JSON as follows:
ruby -r rika -r json -e 'puts Rika.parse_metadata("x.pdf").to_json'
If you want to get both content and metadata in JSON format, this would do that:
ruby -r rika -r json -e 'c,m = Rika.parse_content_and_metadata("x.pdf"); puts({ c: c, m: m }.to_json)'
Using the rexe gem, that can be made much more concise:
rexe -r rika -oj 'c,m = Rika.parse_content_and_metadata("x.pdf"); { c: c, m: m }'
Changing the -oj option selects other output formats, such as "Pretty JSON", YAML, and AwesomePrint (a very human-readable format).
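If the one-liners above outgrow the command line, the same output can be produced from a script. This is a rough sketch: `rika_result_to_json` is a hypothetical helper, and the sample content and metadata stand in for the values that `Rika.parse_content_and_metadata` would return under JRuby.

```ruby
require 'json'

# Hypothetical helper: packs content and metadata into one JSON document,
# using the same "c" and "m" keys as the one-liners above.
def rika_result_to_json(content, metadata, pretty: false)
  result = { c: content, m: metadata }
  pretty ? JSON.pretty_generate(result) : result.to_json
end

# Sample values standing in for Rika.parse_content_and_metadata('x.pdf');
# the metadata keys of a real document depend on its format.
content  = 'Hello from a PDF'
metadata = { 'Content-Type' => 'application/pdf' }

puts rika_result_to_json(content, metadata)
puts rika_result_to_json(content, metadata, pretty: true)
```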
Add this line to your application's Gemfile:
gem 'rika'
And then execute:
$ bundle
Or install it yourself as:
$ gem install rika # or: jgem install rika
For more sophisticated use of Tika, you can use the Tika jar file directly in your JRuby code. After installing the rika gem, the Tika jar file will be located in $GEM_HOME/gems/rika-[rika-version]-java/target/dependency/tika-core-[tika-version].jar.
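To find the jar without hard-coding version numbers, a glob over the layout described above is one option. This is a sketch only: `tika_core_jars` is a hypothetical helper, and it assumes the gem is installed under the directory that `GEM_HOME` points to.

```ruby
# Hypothetical helper: lists tika-core jars under a gem home, following the
# directory layout described above. Returns an empty array if none are found.
def tika_core_jars(gem_home = ENV['GEM_HOME'])
  return [] if gem_home.nil? || gem_home.empty?

  pattern = File.join(gem_home, 'gems', 'rika-*-java',
                      'target', 'dependency', 'tika-core-*.jar')
  Dir.glob(pattern).sort
end

puts tika_core_jars
```

Under JRuby, a jar located this way can then be loaded with `require`.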
Tika also provides another jar file containing a RESTful server that you can run on the command line. You can download this server jar from http://tika.apache.org/download.html. See the "Running the Tika Server as a Jar file" section of https://cwiki.apache.org/confluence/display/TIKA/TikaServer for more information.
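Once such a server is running, any Ruby (not only JRuby) can talk to it over HTTP. The sketch below rests on assumptions not stated in this README: that the server listens on `http://localhost:9998` and that a `PUT` to `/tika` with `Accept: text/plain` returns the extracted text. Check the TikaServer wiki page for the endpoints of your version. Both function names are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a text-extraction request for a Tika server.
# Assumed, not taken from this README: the /tika endpoint and port 9998.
def build_tika_text_request(file_bytes, base_url: 'http://localhost:9998')
  uri = URI.join(base_url, '/tika')
  request = Net::HTTP::Put.new(uri)
  request['Accept'] = 'text/plain'
  request.body = file_bytes
  [uri, request]
end

# Sends the request and returns the response body; needs a running server.
def extract_text_via_server(path, base_url: 'http://localhost:9998')
  uri, request = build_tika_text_request(File.binread(path), base_url: base_url)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }.body
end
```

For example, `puts extract_text_via_server('x.pdf')` would print the document text if a server is reachable at the assumed address.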
@chrismattman and others have provided a Python library and CLI that interface with the Tika server.
A general Tika wiki is at https://cwiki.apache.org/confluence/display/tika.
Richard Nyström (@ricn) is the original author of Rika, but has not been able to maintain it since 2015. In July 2020, Richard transferred the project to Keith Bennett (@keithrbennett), who had made some contributions back in 2013.
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)