compromise
modest natural language processing
npm install compromise
isn't it weird how we can
write text, but not parse it?
it's like we've agreed that
text is a dead-end.
and the knowledge in it
should not really be used.
.match():
interpret and match text:
let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
if (doc.has('simon says #Verb') === false) {
return null
}
.verbs():
conjugate and negate verbs in any tense:
let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
.nouns():
play between plural, singular and possessive forms:
let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
.numbers():
interpret plain-text numbers
nlp.extend(require('compromise-numbers'))
let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(2)
doc.text()
.topics():
names/places/orgs, tldr:
let doc = nlp(buddyHolly)
doc.people().if('mary').json()
let doc = nlp(freshPrince)
doc.places().first().text()
doc = nlp('the opera about richard nixon visiting china')
doc.topics().json()
.contractions():
handle implicit terms:
let doc = nlp("we're not gonna take it, no we ain't gonna take it.")
doc.has('going')
doc.contractions().expand()
dox.text()
Use it on the client-side:
<script src="https://unpkg.com/compromise"></script>
<script src="https://unpkg.com/compromise-numbers"></script>
<script>
nlp.extend(compromiseNumbers)
var doc = nlp('two bottles of beer')
doc.numbers().minus(1)
document.body.innerHTML = doc.text()
</script>
as an es-module:
import nlp from 'compromise'
var doc = nlp('London is calling')
doc.verbs().toNegative()
compromise is 180kb (minified):
it's pretty fast. It can run on keypress:
it works mainly by conjugating all forms of a basic word list.
The final lexicon is ~14,000 words:
you can read more about how it works, here. it's weird.
.extend():
decide how words get interpreted:
let myWords = {
kermit: 'FirstName',
fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)
or make heavier changes with a compromise-plugin.
const nlp = require('compromise')
nlp.extend((Doc, world) => {
world.addTags({
Character: {
isA: 'Person',
notA: 'Adjective',
},
})
world.addWords({
kermit: 'Character',
gonzo: 'Character',
})
world.postProcess(doc => {
doc.match('light the lights').tag('#Verb . #Plural')
})
Doc.prototype.kermitVoice = function () {
this.sentences().prepend('well,')
this.match('i [(am|was)]').prepend('um,')
return this
}
})
Docs:
gentle introduction:
Documentation:
| Concepts | API | Plugins |
| ------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------: | -------------------------------------------------------------------------------------: | --- |
| Accuracy | Accessors | Adjectives |
| Caching | Constructor-methods | Dates |
| Case | Contractions | Export |
| Filesize | Insert | Hash |
| Internals | Json | Html |
| Justification | Lists | Keypress |
| Lexicon | Loops | Ngrams |
| Match-syntax | Match | Numbers |
| Performance | Nouns | Paragraphs |
| Plugins | Output | Scan |
| Projects | Selections | Sentences |
| Tagger | Sorting | Syllables |
| Tags | Split | Pronounce | |
| Tokenization | Text | Strict |
| Named-Entities | Utils | Penn-tags |
| Whitespace | Verbs | Typeahead |
| World data | Normalization | |
| Fuzzy-matching | Typescript | |
Talks:
Articles:
Some fun Applications:
API:
Constructor
(these methods are on the nlp
object)
Utils
- .all() - return the whole original document ('zoom out')
- .found [getter] - is this document empty?
- .parent() - return the previous result
- .parents() - return all of the previous results
- .tagger() - (re-)run the part-of-speech tagger on this document
- .wordCount() - count the # of terms in the document
- .length [getter] - count the # of characters in the document (string length)
- .clone() - deep-copy the document, so that no references remain
- .cache({}) - freeze the current state of the document, for speed-purposes
- .uncache() - un-freezes the current state of the document, so it may be transformed
Accessors
Match
(all match methods use the match-syntax.)
- .match('') - return a new Doc, with this one as a parent
- .not('') - return all results except for this
- .matchOne('') - return only the first match
- .if('') - return each current phrase, only if it contains this match ('only')
- .ifNo('') - Filter-out any current phrases that have this match ('notIf')
- .has('') - Return a boolean if this match exists
- .lookBehind('') - search through earlier terms, in the sentence
- .lookAhead('') - search through following terms, in the sentence
- .before('') - return all terms before a match, in each phrase
- .after('') - return all terms after a match, in each phrase
- .lookup([]) - quick find for an array of string matches
Case
Whitespace
- .pre('') - add this punctuation or whitespace before each match
- .post('') - add this punctuation or whitespace after each match
- .trim() - remove start and end whitespace
- .hyphenate() - connect words with hyphen, and remove whitespace
- .dehyphenate() - remove hyphens between words, and set whitespace
- .toQuotations() - add quotation marks around these matches
- .toParentheses() - add brackets around these matches
Tag
- .tag('') - Give all terms the given tag
- .tagSafe('') - Only apply tag to terms if it is consistent with current tags
- .unTag('') - Remove this term from the given terms
- .canBe('') - return only the terms that can be this tag
Loops
- .map(fn) - run each phrase through a function, and create a new document
- .forEach(fn) - run a function on each phrase, as an individual document
- .filter(fn) - return only the phrases that return true
- .find(fn) - return a document with only the first phrase that matches
- .some(fn) - return true or false if there is one matching phrase
- .random(fn) - sample a subset of the results
Insert
Transform
Output
Selections
Subsets
Plugins:
These are some helpful extensions:
Adjectives
npm install compromise-adjectives
Dates
npm install compromise-dates
Numbers
npm install compromise-numbers
Export
npm install compromise-export
- .export() - store a parsed document for later use
- nlp.load() - re-generate a Doc object from .export() results
Html
npm install compromise-html
- .html({}) - generate sanitized html from the document
Hash
npm install compromise-hash
- .hash() - generate an md5 hash from the document+tags
- .isEqual(doc) - compare the hash of two documents for semantic-equality
Keypress
npm install compromise-keypress
Ngrams
npm install compromise-ngrams
Paragraphs
npm install compromise-paragraphs
this plugin creates a wrapper around the default sentence objects.
Sentences
npm install compromise-sentences
- .sentences() - return a sentence class with additional methods
Strict-match
npm install compromise-strict
Syllables
npm install compromise-syllables
- .syllables() - split each term by its typical pronounciation
Penn-tags
npm install compromise-penn-tags
Typescript
we're committed to typescript/deno support, both in main and in the official-plugins:
import nlp from 'compromise'
import ngrams from 'compromise-ngrams'
import numbers from 'compromise-numbers'
const nlpEx = nlp.extend(ngrams).extend(numbers)
nlpEx('This is type safe!').ngrams({ min: 1 })
nlpEx('This is type safe!').numbers()
Partial-builds
or if you don't care about POS-tagging, you can use the tokenize-only build: (90kb!)
<script src="https://unpkg.com/compromise/builds/compromise-tokenize.js"></script>
<script>
var doc = nlp('No, my son is also named Bort.')
console.log(doc.has('#Noun'))
console.log(doc.has('my .* is .? named /^b[oa]rt/'))
</script>
Limitations:
-
slash-support:
We currently split slashes up as different words, like we do for hyphens. so things like this don't work:
nlp('the koala eats/shoots/leaves').has('koala leaves') //false
-
inter-sentence match:
By default, sentences are the top-level abstraction.
Inter-sentence, or multi-sentence matches aren't supported without a plugin:
nlp("that's it. Back to Winnipeg!").has('it back')//false
-
nested match syntax:
the danger beauty of regex is that you can recurse indefinitely.
Our match syntax is much weaker. Things like this are not (yet) possible:
doc.match('(modern (major|minor))? general')
complex matches must be achieved with successive .match() statements.
-
dependency parsing:
Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do.
We should! Help wanted with this.
FAQ
☂️ Isn't javascript too...
💃 Can it run on my arduino-watch?
🌎 Compromise in other Languages?
✨ Partial builds?
See Also:
MIT