retext-keywords
Keyword extraction with Retext.
Installation
npm:
$ npm install retext-keywords
Usage
var Retext = require('retext'),
keywords = require('retext-keywords'),
retext;
retext = new Retext().use(keywords);
retext.parse(
'Terminology mining, term extraction, term recognition, or ' +
'glossary extraction, is a subtask of information extraction. ' +
'The goal of terminology extraction is to automatically extract ' +
'relevant terms from a given corpus.' +
'\n\n' +
'In the semantic web era, a growing number of communities and ' +
'networked enterprises started to access and interoperate through ' +
'the internet. Modeling these communities and their information ' +
'needs is important for several web applications, like ' +
'topic-driven web crawlers, web services, recommender systems, ' +
'etc. The development of terminology extraction is essential to ' +
'the language industry.' +
'\n\n' +
'One of the first steps to model the knowledge domain of a ' +
'virtual community is to collect a vocabulary of domain-relevant ' +
'terms, constituting the linguistic surface manifestation of ' +
'domain concepts. Several methods to automatically extract ' +
'technical terms from domain-specific document warehouses have ' +
'been described in the literature.' +
'\n\n' +
'Typically, approaches to automatic term extraction make use of ' +
'linguistic processors (part of speech tagging, phrase chunking) ' +
'to extract terminological candidates, i.e. syntactically ' +
'plausible terminological noun phrases, NPs (e.g. compounds ' +
'"credit card", adjective-NPs "local tourist information office", ' +
'and prepositional-NPs "board of directors" - in English, the ' +
'first two constructs are the most frequent). Terminological ' +
'entries are then filtered from the candidate list using ' +
'statistical and machine learning methods. Once filtered, ' +
'because of their low ambiguity and high specificity, these terms ' +
'are particularly useful for conceptualizing a knowledge domain ' +
'or for supporting the creation of a domain ontology. Furthermore, ' +
'terminology extraction is a very useful starting point for ' +
'semantic similarity, knowledge management, human translation ' +
'and machine translation, etc.',
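function (err, tree) {
    // Stop on parse errors instead of calling methods on an undefined tree.
    if (err) throw err;
    // Returns an array of keyword matches (see API).
    tree.keywords();
}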
);
API
Parent#keywords({minimum=5}?)
Extract keywords (nouns), based on the number of times they occur in the text.
tree.keywords({'minimum' : Infinity});
Options:
- minimum (default: 5): Return at least this many keywords, when possible.
Results: An array, containing match-objects:
- stem: The stem of the word (see retext-porter-stemmer);
- score: A value between 0 and 1 (including 1); the first match has a score of 1;
- nodes: An array containing all matched word nodes.
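As a rough illustration of working with the result shape described above, the following sketch reduces each match to a small summary object. The helper name summarizeKeywords and the sample data are hypothetical and not part of retext-keywords; real matches come from tree.keywords(), and their nodes are word nodes rather than the plain strings used here.

```javascript
// Hypothetical helper (not part of retext-keywords): reduce each match
// object to its stem, its score, and the number of matched word nodes.
function summarizeKeywords(matches) {
    return matches.map(function (match) {
        return {
            stem: match.stem,
            score: match.score,
            occurrences: match.nodes.length
        };
    });
}

// Made-up sample data in the documented shape.
// Plain strings stand in for word nodes here.
var sample = [
    {stem: 'term', score: 1, nodes: ['terms', 'term', 'terms', 'term']},
    {stem: 'extract', score: 0.5, nodes: ['extraction', 'extract']}
];

var summary = summarizeKeywords(sample);

console.log(summary);
```

Because the helper only reads stem, score, and nodes.length, it should work on any array that follows the match-object layout listed above.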
Parent#keyphrases({minimum=5}?)
Extract keyphrases (sequences of multiple nouns), based on the number of times they occur in the text.
tree.keyphrases();
tree.keyphrases({'minimum' : Infinity});
Options:
- minimum (default: 5): Return at least this many phrases, when possible.
Results: An array, containing match-objects:
- stems: An array containing the stems of all matched word nodes inside the phrase(s);
- score: A value between 0 and 1 (including 1); the first match has a score of 1;
- nodes: An array containing array-phrases, each containing word nodes.
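Since nodes is nested one level deeper for keyphrases than for keywords, a small sketch may help show how a match could be turned into readable strings. The helper phraseTexts, the word() stand-in, and the sample match are all hypothetical; the sketch also assumes that converting a word node with String() yields the word's text.

```javascript
// Hypothetical helper (not part of retext-keywords): turn each
// array-phrase of a keyphrase match into a single readable string.
// Assumes String(node) yields the text of a word node.
function phraseTexts(match) {
    return match.nodes.map(function (phrase) {
        return phrase.map(String).join(' ');
    });
}

// Stand-in for a word node: an object that stringifies to its text.
function word(value) {
    return {
        toString: function () {
            return value;
        }
    };
}

// Made-up sample match in the documented shape.
var sampleMatch = {
    stems: ['terminolog', 'extract'],
    score: 1,
    nodes: [
        [word('terminology'), word('extraction')],
        [word('Terminology'), word('extraction')]
    ]
};

var texts = phraseTexts(sampleMatch);

console.log(texts);
```

Each entry of the returned array corresponds to one occurrence of the phrase in the text, so its length gives the number of times the phrase was matched.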
Benchmark
Run the benchmark yourself:
$ npm run benchmark
On a MacBook Air, keywords() runs at about 3,784 op/s on a big section / small article.
A big section (10 paragraphs)
3,784 op/s » Finding keywords
788 op/s » Finding keyphrases
A big article (100 paragraphs)
401 op/s » Finding keywords
48 op/s » Finding keyphrases
License
MIT © Titus Wormer