retext-keywords
Keyword extraction with retext.
Installation
npm:
$ npm install retext-keywords
Component.js:
$ component install wooorm/retext-keywords
Bower:
$ bower install retext-keywords
Duo:
var keywords = require('wooorm/retext-keywords');
Usage
var Retext = require('retext');
var keywords = require('retext-keywords');
var retext = new Retext().use(keywords);
retext.parse(
'Terminology mining, term extraction, term recognition, or ' +
'glossary extraction, is a subtask of information extraction. ' +
'The goal of terminology extraction is to automatically extract ' +
'relevant terms from a given corpus.' +
'\n\n' +
'In the semantic web era, a growing number of communities and ' +
'networked enterprises started to access and interoperate through ' +
'the internet. Modeling these communities and their information ' +
'needs is important for several web applications, like ' +
'topic-driven web crawlers, web services, recommender systems, ' +
'etc. The development of terminology extraction is essential to ' +
'the language industry.' +
'\n\n' +
'One of the first steps to model the knowledge domain of a ' +
'virtual community is to collect a vocabulary of domain-relevant ' +
'terms, constituting the linguistic surface manifestation of ' +
'domain concepts. Several methods to automatically extract ' +
'technical terms from domain-specific document warehouses have ' +
'been described in the literature.' +
'\n\n' +
'Typically, approaches to automatic term extraction make use of ' +
'linguistic processors (part of speech tagging, phrase chunking) ' +
'to extract terminological candidates, i.e. syntactically ' +
'plausible terminological noun phrases, NPs (e.g. compounds ' +
'"credit card", adjective-NPs "local tourist information office", ' +
'and prepositional-NPs "board of directors" - in English, the ' +
'first two constructs are the most frequent). Terminological ' +
'entries are then filtered from the candidate list using ' +
'statistical and machine learning methods. Once filtered, ' +
'because of their low ambiguity and high specificity, these terms ' +
'are particularly useful for conceptualizing a knowledge domain ' +
'or for supporting the creation of a domain ontology. Furthermore, ' +
'terminology extraction is a very useful starting point for ' +
'semantic similarity, knowledge management, human translation ' +
'and machine translation, etc.',
function (err, tree) {
if (err) throw err;
// An array of match-objects, as described under API.
var matches = tree.keywords();
}
);
API
Extract keywords, based on the number of times they (nouns) occur in text.
tree.keywords({'minimum' : Infinity});
Options:
- minimum (non-negative integer number): Return at least (when possible) minimum keywords.
Results: An array, containing match-objects:
- stem: The stem of the word (see retext-porter-stemmer);
- score: A value between 0 and (including) 1. The first match has a score of 1;
- nodes: An array containing all matched WordNodes.
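The scoring described above can be illustrated with a small, self-contained sketch. It does not use retext: it assumes the input is already a list of noun stems, and that a score is a stem's count divided by the highest count, which is one way to arrive at a top score of 1. The names `scoreStems` and `takeAtLeast`, and the reading of "at least" as "keep ties at the cut-off", are assumptions made for illustration, not the library's implementation.

```javascript
// Hypothetical helper: count stems and score each relative to the most frequent.
function scoreStems(stems) {
  var counts = {};

  stems.forEach(function (stem) {
    counts[stem] = (counts[stem] || 0) + 1;
  });

  var results = Object.keys(counts).map(function (stem) {
    return {stem: stem, count: counts[stem]};
  });

  // Most frequent first; ties are ordered alphabetically to stay deterministic.
  results.sort(function (a, b) {
    if (b.count !== a.count) return b.count - a.count;
    return a.stem < b.stem ? -1 : a.stem > b.stem ? 1 : 0;
  });

  var top = results.length ? results[0].count : 0;

  return results.map(function (result) {
    return {stem: result.stem, score: result.count / top};
  });
}

// Hypothetical helper: keep the first `minimum` results, plus any results
// that tie with the last one kept (an assumed reading of "at least").
function takeAtLeast(results, minimum) {
  if (results.length <= minimum) return results.slice();
  if (minimum <= 0) return [];

  var cutoff = results[minimum - 1].score;

  return results.filter(function (result) {
    return result.score >= cutoff;
  });
}

var scored = scoreStems([
  'term', 'extract', 'term', 'domain', 'extract', 'term', 'web'
]);

console.log(takeAtLeast(scored, 2));
console.log(takeAtLeast(scored, 3));
```

With a minimum of 2, this keeps `term` (score 1) and `extract` (score 2/3). With a minimum of 3, both `domain` and `web` are kept, because they tie at 1/3.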
Extract keyphrases, based on the number of times they (one or more nouns) occur in text.
tree.keyphrases();
tree.keyphrases({'minimum' : Infinity});
Options:
- minimum (non-negative integer number): Return at least (when possible) minimum phrases.
Results: An array, containing match-objects:
- stems: An array containing the stems of all matched word nodes inside the phrase(s);
- score: A value between 0 and (including) 1. The first match has a score of 1;
- nodes: An array containing arrays of WordNodes.
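As a sketch of how the documented result shape might be consumed, the helper below turns an array of keyphrase match-objects into printable lines. It relies only on the `stems` and `score` fields listed above. `describePhrases` is a hypothetical name, and the sample `matches` array is hand-written illustration data, not real output of the library.

```javascript
// Hypothetical helper: format keyphrase match-objects as "stems (percentage)".
function describePhrases(matches) {
  return matches.map(function (match) {
    var percentage = Math.round(match.score * 100);

    return match.stems.join(' ') + ' (' + percentage + '%)';
  });
}

// Hand-written sample data in the documented shape; `nodes` is left empty
// here because this sketch does not need the word nodes themselves.
var matches = [
  {stems: ['terminolog', 'extract'], score: 1, nodes: []},
  {stems: ['term'], score: 0.5, nodes: []}
];

console.log(describePhrases(matches).join('\n'));
```

In real use, the array returned by `tree.keyphrases()` would take the place of the sample data.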
Benchmark
On a MacBook Air, keywords() runs about 3,784 op/s on a big section / small article.
A big section (10 paragraphs)
4,026 op/s » Finding keywords
625 op/s » Finding keyphrases
A big article (100 paragraphs)
438 op/s » Finding keywords
59 op/s » Finding keyphrases
License
MIT © Titus Wormer