textminer
textminer helps you text mine through Crossref's TDM (Text & Data Mining) services.
Changes
For changes see the CHANGELOG
gem API
- Textminer.search: search by DOI, query string, filters, etc. to get Crossref metadata, which you can use downstream to get full text links. This method essentially wraps Serrano.works(), but exposes only a subset of its params. This interface may change depending on feedback.
- Textminer.fetch: fetch full text given a URL. Supports Crossref's Text and Data Mining service.
- Textminer.extract: extract text from a PDF.
Install
Release version
gem install textminer
Development version
git clone git@github.com:sckott/textminer.git
cd textminer
rake install
Examples
Within Ruby
Search
Search by DOI
require 'textminer'
Textminer.search(doi: '10.7554/elife.06430')
Textminer.search(doi: "10.1371/journal.pone.0000308")
Many DOIs at once
require 'serrano'
dois = Serrano.random_dois(sample: 6)
Textminer.search(doi: dois)
Search with filters
Textminer.search(filter: {has_full_text: true})
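For orientation, the Crossref REST API takes filters as comma-separated name:value pairs with hyphenated names, for example has-full-text:true. The sketch below shows one way a Ruby filter hash could map onto that string. The filter_param helper is hypothetical and assumes underscores in keys are simply translated to hyphens; it is not the code textminer or Serrano actually runs.

```ruby
# Hypothetical helper, not part of textminer or Serrano: turn a Ruby filter
# hash into a Crossref-style "name:value,name:value" filter string.
# Assumption: underscores in key names map to hyphens.
def filter_param(filters)
  filters.map { |name, value| "#{name.to_s.tr('_', '-')}:#{value}" }.join(',')
end

single   = filter_param(has_full_text: true)
combined = filter_param(has_full_text: true, type: 'journal-article')
```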
Get full text links
The object returned from Textminer.search has methods for pulling out all links, XML only, PDF only, or plain text only.
x = Textminer.search(filter: {has_full_text: true})
x.links_xml
x.links_pdf
x.links_plain
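To make the idea behind these helpers concrete, here is a minimal sketch of filtering links by content type. It assumes each Crossref work record may carry a link array whose entries have URL and content-type keys. The links_by_type helper and the sample records are made up for illustration and are not textminer's own implementation.

```ruby
# Hypothetical helper, not part of textminer: collect the full-text URLs of
# one content type from Crossref-style work records.
def links_by_type(items, content_type)
  items.flat_map do |item|
    (item['link'] || [])                                 # records may have no links
      .select { |l| l['content-type'] == content_type }  # keep the requested type
      .map { |l| l['URL'] }                              # return only the URL
  end
end

# Made-up sample data shaped like Crossref work metadata.
items = [
  { 'DOI' => '10.5555/example.1',
    'link' => [
      { 'URL' => 'https://example.org/a.xml', 'content-type' => 'application/xml' },
      { 'URL' => 'https://example.org/a.pdf', 'content-type' => 'application/pdf' }
    ] },
  { 'DOI' => '10.5555/example.2',
    'link' => [
      { 'URL' => 'https://example.org/b.txt', 'content-type' => 'text/plain' }
    ] },
  { 'DOI' => '10.5555/example.3' } # no full-text links at all
]

xml_links   = links_by_type(items, 'application/xml')
pdf_links   = links_by_type(items, 'application/pdf')
plain_links = links_by_type(items, 'text/plain')
```

Real records may label XML as text/xml or leave the type unspecified, so an actual implementation would likely need to match more than one value.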
Fetch full text
Textminer.fetch() gets full text from a URL. How the content is downloaded and parsed depends on its content type.
res = Textminer.search(member: 2258, filter: {has_full_text: true});
links = res.links_xml(true);
res = Textminer.fetch(url: links[0]);
res.url
res.path
res.type
res.parse
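Dispatching on content type can be sketched as follows. This is only an illustration of the idea: the strategy_for helper and the symbols it returns are hypothetical, and textminer's real logic may differ.

```ruby
# Hypothetical dispatch, not textminer's actual code: map a Content-Type
# header value to a parsing strategy.
def strategy_for(content_type)
  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  case media_type
  when 'application/xml', 'text/xml' then :xml
  when 'application/pdf'             then :pdf
  when 'text/plain'                  then :plain
  else :unknown
  end
end

xml_strategy = strategy_for('text/xml; charset=UTF-8')
pdf_strategy = strategy_for('application/pdf')
```

Returning :unknown rather than raising leaves it to the caller to decide whether to skip the link or save the raw bytes.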
Textminer.extract() extracts text from a PDF, given the path to the file.
res = Textminer.search(member: 2258, filter: {has_full_text: true});
links = res.links_pdf(true);
res = Textminer.fetch(url: links[0]);
Textminer.extract(res.path)
On the CLI
Coming soon...
To do
- CLI executable
- better test suite
- better documentation