Pelias analysis libraries

This repository contains prebuilt textual analysis functions (analyzers) which are composed of smaller modules (tokenizers). Each tokenizer performs actions such as transforming, filtering and enriching word tokens.
Using Analyzers
Analyzers are available as functions and can be called like any regular function; the input is a single string and the output is also a single string:
var street = require('./analyzer/street')
var analyzer = street()
analyzer('main str s')
Analyzers also accept a 'context' object which is available throughout the analysis pipeline:
var analyzer = street({ locale: 'de' })
analyzer('main str s')
Using Tokenizers
Tokenizers are intended to be used as part of an analyzer, but can also be used independently by calling Array.reduce on an array of tokens:
var tokenizer = require('./tokenizer/diacritic');
[ 'žůžo', 'Cinématte' ].reduce( tokenizer, [] )
Writing Tokenizers
Tokenizers are functions with the interface expected by Array.reduce.
In its simplest form, a tokenizer is written as:
var tokenizer = function( res, word, pos, arr ){
  return res
}
For a tokenizer to have no effect on the token stream, it must res.push() each word it receives on to the response array:
var tokenizer = function( res, word, pos, arr ){
  res.push( word )
  return res
}
A tokenizer can choose which words are pushed downstream; it can also modify words and push more than one word on to the response array:
var tokenizer = function( res, word, pos, arr ){
  var parts = word.split(/\b/g)
  parts.forEach( function( part ){
    res.push( part )
  })
  return res
}
Using these techniques, you can write tokenizers which delete, modify or create new words.
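For instance, a tokenizer which deletes words simply avoids pushing them downstream. The sketch below (not a tokenizer shipped with this repository) drops empty tokens:
var tokenizer = function( res, word, pos, arr ){
  // only push non-empty words; anything else is deleted from the stream
  if( word.length > 0 ){
    res.push( word )
  }
  return res
}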
Writing Tokenizers (advanced)
More advanced tokenizers require information about the context in which they are run; for example, knowing the locale of the input tokens might allow a tokenizer to vary its behaviour accordingly.
Context is provided to tokenizers by using Function.bind to bind the context to the tokenizer. This information will then be available inside the tokenizer using the this keyword:
var tokenizer = function( res, word, pos, arr ){
  var locale = this.locale || 'en'
  if( 'str.' === word ){
    switch( locale ){
      case 'de':
        res.push( 'strasse' )
        return res
      case 'en':
        res.push( 'street' )
        return res
    }
  }
  res.push( word )
  return res
}
You can then control the runtime context of the analyzer using Function.bind:
var english = tokenizer.bind({ locale: 'en' });
[ 'str.' ].reduce( english, [] )
var german = tokenizer.bind({ locale: 'de' });
[ 'str.' ].reduce( german, [] )
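Putting these pieces together, an analyzer is essentially a pipeline which binds its context to each tokenizer and applies them in sequence. The sketch below is illustrative only; the factory name is made up and the real analyzers in this repository compose many more tokenizers:
var diacritic = require('./tokenizer/diacritic')

// illustrative analyzer factory, not the library's actual implementation
var myAnalyzer = function( context ){
  context = context || {}
  var tokenizers = [ diacritic ] // compose further tokenizers here
  return function( input ){
    var tokens = input.split(' ')
    tokenizers.forEach( function( tokenizer ){
      // bind the context so each tokenizer can read it via `this`
      tokens = tokens.reduce( tokenizer.bind( context ), [] )
    })
    return tokens.join(' ')
  }
}

var analyzer = myAnalyzer({ locale: 'de' })
analyzer('žůžo strasse')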
Command line interface
There is an included CLI script which makes it easy to pipe files through an analyzer for testing:
$ node cli.js en street <<< "n foo st w"
North Foo Street West
$ echo -e "n foo st w\nw 16th st" | node cli.js en street
North Foo Street West
West 16 Street
$ node cli.js en street < nyc.names
100 Avenue
100 Drive
100 Road
... etc
$ cut -d',' -f4 /data/oa/de/berlin.csv | sort | uniq | node cli.js de street
Aachener Strasse
Aalemannufer
Aalesunder Strasse
... etc
Using the diff command you can view a side-by-side comparison of the data before and after analysis:
$ diff \
  --side-by-side \
  --ignore-blank-lines \
  --suppress-common-lines \
  --width=100 \
  --expand-tabs \
  nyc.names \
  <(node cli.js en street < nyc.names)
ZEBRA PL | Zebra Place
ZECK CT | Zeck Court
ZEPHYR AVE | Zephyr Avenue
... etc
Running tests
Unit tests are run with:
$ npm test
Functional tests are run with:
$ npm run funcs