stanford-core-nlp-abstractor
About
This gem provides high-level Ruby bindings to the Stanford Core NLP package, a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.
This gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.
Installing
First, install the gem: gem install stanford-core-nlp. Then, download the Stanford Core NLP JAR and model files. Two packages are available:
Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/).
Configuration
You may want to set some optional configuration options. Here are some examples:
# Set an alternative path to look for the JAR files
# Default is gem's bin folder.
StanfordCoreNLP.jar_path = '/path_to_jars/'
# Set an alternative path to look for the model files
# Default is gem's bin folder.
StanfordCoreNLP.model_path = '/path_to_models/'
# Pass some alternative arguments to the Java VM.
# Default is ['-Xms512M', '-Xmx1024M'] (be prepared
# to take a coffee break).
StanfordCoreNLP.jvm_args = ['-option1', '-option2']
# Redirect VM output to log.txt
StanfordCoreNLP.log_file = 'log.txt'
# Change a specific model file.
StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
Using the gem
# Use the model files for a different language than English.
StanfordCoreNLP.use :french # or :german
text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
'Berlin to discuss a new austerity package. Sarkozy ' +
'looked pleased, but Merkel was dismayed.'
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
text.get(:sentences).each do |sentence|
  # Syntactical dependencies
  puts sentence.get(:basic_dependencies).to_s
  sentence.get(:tokens).each do |token|
    # Default annotations for all tokens
    puts token.get(:value).to_s
    puts token.get(:original_text).to_s
    puts token.get(:character_offset_begin).to_s
    puts token.get(:character_offset_end).to_s
    # POS returned by the tagger
    puts token.get(:part_of_speech).to_s
    # Lemma (base form of the token)
    puts token.get(:lemma).to_s
    # Named entity tag
    puts token.get(:named_entity_tag).to_s
    # Coreference
    puts token.get(:coref_cluster_id).to_s
    # Also of interest: coref, coref_chain,
    # coref_cluster, coref_dest, coref_graph.
  end
end
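When the annotations are needed as data rather than printed output, the same `get` calls can be collected into a Hash. The following is a minimal sketch: `token_to_h` and `TOKEN_KEYS` are hypothetical names, not part of the gem, and the helper only assumes that a token responds to `get(key)` as in the example above.

```ruby
# Annotation keys taken from the usage example above.
TOKEN_KEYS = %i[value original_text part_of_speech lemma named_entity_tag].freeze

# Hypothetical helper (not part of the gem): reads each key from a token
# through `get` and returns the results as a Hash of strings.
# A missing annotation (nil) becomes an empty string via `to_s`.
def token_to_h(token, keys = TOKEN_KEYS)
  keys.each_with_object({}) do |key, result|
    result[key] = token.get(key).to_s
  end
end
```

Inside the loop above, `token_to_h(token)` could then replace the individual `puts` calls, for example to build an array of token hashes per sentence.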
Important: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Annotation class.
The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class is the snake_case of the class name, with 'Annotation' at the end removed. For example, NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
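The naming rule can be illustrated with a few lines of plain Ruby. This is only a sketch of the rule as described here: `annotation_symbol` is a hypothetical helper, and the gem's own conversion may handle edge cases (such as runs of capital letters) differently.

```ruby
# Hypothetical helper (not part of the gem): derives the Ruby symbol for a
# Java annotation class name by dropping the trailing 'Annotation' and
# converting the remaining CamelCase name to snake_case.
def annotation_symbol(java_class_name)
  java_class_name
    .sub(/Annotation\z/, '')                 # strip the suffix
    .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')   # split acronym runs (assumed behaviour)
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')       # split lower-to-upper boundaries
    .downcase
    .to_sym
end

annotation_symbol('NamedEntityTagAnnotation') # => :named_entity_tag
annotation_symbol('PartOfSpeechAnnotation')   # => :part_of_speech
```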
Good references for annotation names are the Stanford Javadocs for CoreAnnotations, CoreCorefAnnotations, and TreeCoreAnnotations. For a full list of all possible annotations, see the config.rb file inside the gem.
Loading specific classes
You may want to load additional Java classes (including any class from the Stanford NLP packages). The gem provides an API for this:
# Default base class is edu.stanford.nlp.pipeline.
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
# Here, we specify another base class.
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
puts StanfordCoreNLP::MaxentTagger.inspect
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
List of annotator classes
Here is a full list of annotator classes provided by the Stanford Core NLP package. You can load these classes individually using StanfordCoreNLP.load_class (see above). Once this is done, you can use them like you would from a Java program. Refer to the Java documentation for a list of functions provided by each of these classes.
List of model files
Here is a full list of the default models for the Stanford Core NLP pipeline. You can change these models individually using StanfordCoreNLP.set_model (see above).
Testing
To run the specs for each language (after copying the JARs into the bin folder):
rake spec[english]
rake spec[german]
rake spec[french]
Using the latest version of the Stanford CoreNLP
Using the latest version of the Stanford CoreNLP (version 3.5.0 as of 31/10/2014) requires some additional manual steps:
StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]
Or configure your setup (for French) as follows:
StanfordCoreNLP.use :french
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.set_model('pos.model', 'french.tagger')
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]
Or configure your setup (for German) as follows:
StanfordCoreNLP.use :german
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.set_model('pos.model', 'german-fast.tagger')
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]
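The three configurations above differ only in the language and the POS tagger model, so the shared part can be factored out. The sketch below is one possible way to do that: `configure_corenlp`, `CORENLP_350_JARS` and `TAGGER_MODELS` are hypothetical names, not part of the gem, and the helper only calls the setters already shown above on whatever object it is given.

```ruby
# Jar list copied from the configuration blocks above.
CORENLP_350_JARS = %w[
  joda-time.jar
  xom.jar
  stanford-corenlp-3.5.0.jar
  stanford-corenlp-3.5.0-models.jar
  jollyday.jar
  bridge.jar
].freeze

# POS tagger overrides per language, as given above (English sets none).
TAGGER_MODELS = {
  english: nil,
  french: 'french.tagger',
  german: 'german-fast.tagger'
}.freeze

# Hypothetical helper (not part of the gem): applies the settings shown above
# to `nlp`, which is expected to be the StanfordCoreNLP module.
def configure_corenlp(nlp, language)
  unless TAGGER_MODELS.key?(language)
    raise ArgumentError, "unsupported language: #{language}"
  end

  nlp.use(language)
  nlp.model_files = {}
  tagger = TAGGER_MODELS[language]
  nlp.set_model('pos.model', tagger) if tagger
  nlp.default_jars = CORENLP_350_JARS.dup
  nlp
end
```

With the gem loaded, the French setup would then be `configure_corenlp(StanfordCoreNLP, :french)`. Passing the module in as an argument keeps the helper easy to exercise without a JVM.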
Contributing
Simple.