Anystyle-Parser
Anystyle-Parser is a very fast and smart parser for academic references. It
is inspired by ParsCit and
FreeCite; Anystyle-Parser uses machine
learning algorithms and is designed
for raw speed (it uses wapiti based
conditional random fields and Kyoto Cabinet
or Redis as a key-value store), flexibility (it is easy to
train the model with data that is relevant to your parsing needs), and
compatibility (Anystyle-Parser exports to Ruby Hashes, BibTeX, or the
CSL/CiteProc JSON format).
Web Application and Web Service
Anystyle-Parser is available as a web application and a web service at
http://anystyle.io. For example Ruby code using
the anystyle.io API, see this prototype
for a style predictor.
Installation
$ [sudo] gem install anystyle-parser
During the statistical analysis of reference strings, Anystyle-Parser relies
on a large feature dictionary; by default, Anystyle-Parser creates a
Kyoto Cabinet file-based hash database
from the dictionary file that ships with the parser. If Kyoto Cabinet is
not installed on your system, Anystyle-Parser uses a simple Ruby Hash as a
fall-back; this Hash has to be re-created every time you load the parser
and takes up a lot of memory in your Ruby process; it is therefore strongly
recommended to install Kyoto Cabinet and the kyotocabinet-ruby
gem.
$ [sudo] gem install kyotocabinet-ruby
The database file will be created the first time you access the dictionary;
note that you will need write permissions in the directory where the file
is to be created. You can change the Dictionary's default path in the
Dictionary's options:
Anystyle::Parser::Dictionary.instance.options[:cabinet]
Starting with version 0.1.0, Anystyle-Parser also supports
Redis; to use Redis as the data store you need to install
the redis
and redis-namespace
gems (optionally, the hiredis
gem).
$ [sudo] gem install redis redis-namespace
To see which data store modes are available in you current environment,
check the output of Dictionary.modes
:
> Anystyle::Parser::Dictionary.modes
=> [:kyoto, :redis, :hash]
To select one of the available modes, use the dictionary instance options:
> Anystyle.dictionary.options[:mode]
=> :kyoto
To use Redis you also need to set the host or unix socket
where your redis server is available. For example:
Anystyle.dictionary.options[:mode] = :redis
Anystyle.dictionary.options[:host] = 'localhost'
When the data store is opened using redis-mode and the data store is empty,
the feature dictionary will be imported automatically. If you want to import
the data explicitly you can use Dictionary#create
after setting the
required options.
Usage
Parsing
You can access the main Anystyle-Parser instance at Anystyle.parser
;
the #parse
method is also available via Anystyle.parse
. For more complex
requirements (e.g., if you need multiple Parser instances simultaneously) you
can create your own instances from the Anystyle::Parser::Parser
class.
The two fundamental methods you need to know about in order to use
Anystyle-Parser are #parse
and #train
that both accept two arguments.
Parser#parse(input, format = :hash)
Parser#train(input = options[:training_data], truncate = true)
#parse
parses the passed-in input (either a filename, your reference strings,
or an array of your reference strings; files are only opened if the string is
not tainted) and returns the parsed data in the
format specified as the second argument (supported formats include: :hash,
:bibtex, :citeproc, :tags, and :raw).
#train
allows you to easily train the Parser's CRF model. The first argument
is either a filename (if the string is not tainted) or your data as a string;
the format of training data
follows the XML-like syntax of the
CORA dataset; the
optional boolean argument lets you decide whether to train the existing
model or to create an entirely new one.
The following irb sessions illustrates some parser goodness:
> require 'anystyle/parser'
> Anystyle.parse 'Poe, Edgar A. Essays and Reviews. New York: Library of America, 1984.'
=> [{:author=>"Poe, Edgar A.", :title=>"Essays and Reviews", :location=>"New York", :publisher=>"Library of America", :year=>1984, :type=>:book}]
> b = Anystyle.parse 'Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528.', :bibtex
> b[0].author[1].given
=> "Jorge"
> b[0].author.to_s
=> "Liu, Dong C. and Nocedal, Jorge"
> puts Anystyle.parse('Auster, Paul. The Art of Hunger. Expanded. New York: Penguin, 1997.', :bibtex).to_s
@book{auster1997a,
author = {Auster, Paul},
title = {The Art of Hunger},
location = {New York},
publisher = {Penguin},
edition = {Expanded},
year = {1997}
}
=> nil
Unhappy with the results?
Citation references come in many forms, so, inevitably, you will find data
where Anystyle-Parser does not produce satisfying parsing results.
> Anystyle.parse 'John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.'
=> [{:author=>"John Lafferty and Andrew McCallum and Fernando Pereira. 2001", :title=>"Conditional random fields: probabilistic models for segmenting and labeling sequence data", :booktitle=>"Proceedings of the International Conference on Machine Learning", :pages=>"282--289", :publisher=>"Morgan Kaufmann", :location=>"San Francisco, CA", :type=>:inproceedings}]
This result is not bad, but notice how the year was not picked up as a date
but interpreted as part of the author name. If you have such a problem
(particularly, if the problem applies to a range of your input data, e.g.,
data that follows a style that Anystyle-Parser was not trained to recognize),
you can teach Anystyle-Parser to recognize your format. The easiest way to
go about this is to create new file (e.g., 'training.txt'), copy and paste a
few references, and tag them for training. For example, a tagged version of
the input from the example above would look like this:
<author> John Lafferty, Andrew McCallum, and Fernando Pereira. </author> <date> 2001. </date> <title> Conditional random fields: probabilistic models for segmenting and labeling sequence data. </title> <booktitle> In Proceedings of the International Conference on Machine Learning, </booktitle> <pages> pages 282–289. </pages> <publisher> Morgan Kaufmann, </publisher> <location> San Francisco, CA. </location>
Note that you can pick any tag names, but when working with Anystyle's model
you should use the same names used to to train the model. You can always ask
the Parser's model what names (labels) it knows about:
> Anystyle.parser.model.labels
=> ["author", "booktitle", "container", "date", "doi", "edition", "editor", "institution", "isbn", "journal", "location", "note", "pages", "publisher", "retrieved", "tech", "title", "translator", "unknown", "url", "volume"]
Once you have tagged a few references that you want Anystyle-Parser to learn,
you can train the model as follows:
> Anystyle.parser.train 'training.txt', false
By passing true
as the second argument, you will discard Anystyle's default
model; the resulting model will be based entirely on your own data. By default
the new or altered model will not be saved, but you can do so at any time
by calling Anystyle.parser.model.save
to save the model to the default file.
If you want to save the model to a different file, set the
Anystyle.parser.model.path
attribute accordingly.
After teaching Anystyle-Parser with the tagged references, try to parse your
data again:
> Anystyle.parse 'John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.'
=> [{:author=>"John Lafferty and Andrew McCallum and Fernando Pereira", :title=>"Conditional random fields: probabilistic models for segmenting and labeling sequence data", :booktitle=>"Proceedings of the International Conference on Machine Learning", :pages=>"282--289", :publisher=>"Morgan Kaufmann", :location=>"San Francisco, CA", :year=>2001, :type=>:inproceedings}]
If you want to make Anystyle-Parser smarter, please consider sending us your
tagged references (see below).
Contributing
The Anystyle-Parser source code is
hosted on GitHub.
You can check out a copy of the latest code using Git:
$ git clone https://github.com/inukshuk/anystyle-parser.git
If you've found a bug or have a question, please open an issue on the
Anystyle-Parser issue tracker.
Or, for extra credit, clone the Anystyle-Parser repository, write a failing
example, fix the bug and submit a pull request.
If you want to contribute tagged references, please either add them to
resources/train.txt
or create a new file in the resources
directory
and open a pull request on GitHub.
License
Copyright 2011-2014 Sylvester Keil. All rights reserved.
Some of the code in Anystyle-Parser's post processing (normalizing) routines
was originally based on the source code of FreeCite and
Copyright 2008 Public Display Inc.
The CRF template is a modified version of ParsCit's original template
Copyright 2008, 2009, 2010, 2011 Min-Yen Kan,
Isaac G. Councill, C. Lee Giles, Minh-Thang Luong and Huy Nhat Hoang
Do.
Anystyle-Parser is distributed under a BSD-style license. See LICENSE for details.