prose
prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.
You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.
Installation
$ go get gopkg.in/jdkato/prose.v2
Usage
Overview
package main

import (
	"fmt"
	"log"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag, tok.Label)
	}

	// Iterate over the doc's named entities:
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
	}

	// Iterate over the doc's sentences:
	for _, sent := range doc.Sentences() {
		fmt.Println(sent.Text)
	}
}
The document-creation process adheres to the following sequence of steps:
tokenization -> POS tagging -> NE extraction
            \
             segmentation
Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:
doc, err := prose.NewDocument(
"Go is an open-source programming language created at Google.",
prose.WithExtraction(false))
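For readers unfamiliar with functional options, the pattern can be sketched with the standard library alone. This is only an illustration of the general technique, not prose's actual implementation: the `pipeline` type, the lowercase `withExtraction` and `withTagging` helpers, and `newPipeline` are all hypothetical names introduced here.

```go
package main

import "fmt"

// pipeline is a hypothetical stand-in for the settings a document
// constructor might track; it is not prose's internal type.
type pipeline struct {
	tokenize bool
	segment  bool
	tag      bool
	extract  bool
}

// option mutates a pipeline; each with* helper returns one.
type option func(*pipeline)

// withExtraction toggles the (hypothetical) named-entity step.
func withExtraction(on bool) option {
	return func(p *pipeline) { p.extract = on }
}

// withTagging toggles POS tagging. Extraction needs tags, so turning
// tagging off also turns extraction off in this sketch.
func withTagging(on bool) option {
	return func(p *pipeline) {
		p.tag = on
		if !on {
			p.extract = false
		}
	}
}

// newPipeline starts with every step enabled and applies opts in order.
func newPipeline(opts ...option) pipeline {
	p := pipeline{tokenize: true, segment: true, tag: true, extract: true}
	for _, opt := range opts {
		opt(&p)
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", newPipeline(withExtraction(false)))
	fmt.Printf("%+v\n", newPipeline(withTagging(false)))
}
```

The variadic `opts` parameter is what lets callers omit options entirely and get the default configuration, which matches how `prose.NewDocument` is called with and without `prose.WithExtraction(false)` in the examples here.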
Tokenizing
prose includes a tokenizer capable of handling modern text, including the non-word character spans shown below.
Type | Example |
---|---|
Email addresses | Jane.Doe@example.com |
Hashtags | #trending |
Mentions | @jdkato |
URLs | https://github.com/jdkato/prose |
Emoticons | :-) , >:( , o_0 , etc. |
package main

import (
	"fmt"
	"log"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag)
	}
}
Segmenting
prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.
Name | Language | License | GRS (English) | GRS (Other) | Speed† |
---|---|---|---|---|---|
Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
Scalpel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |
† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.
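The GRS (English) column reads as the share of the 52 English Golden Rules that each segmenter passes. A minimal sketch of that arithmetic follows; the `grs` helper is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// grs returns the percentage of Golden Rules passed, formatted with
// two decimal places as in the table above.
func grs(passed, total int) string {
	return fmt.Sprintf("%.2f%%", 100*float64(passed)/float64(total))
}

func main() {
	fmt.Println(grs(51, 52)) // 98.08%
	fmt.Println(grs(39, 52)) // 75.00%
}
```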
package main

import (
	"fmt"
	"log"
	"strings"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	// Join two sentences that contain abbreviations ("Mt.", "St.", "st.").
	doc, err := prose.NewDocument(strings.Join([]string{
		"I can see Mt. Fuji from here.",
		"St. Michael's Church is on 5th st. near the light."}, " "))
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's sentences:
	sents := doc.Sentences()
	fmt.Println(len(sents))
	for _, sent := range sents {
		fmt.Println(sent.Text)
	}
}
Tagging
prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:
Library | Accuracy | 5-Run Average (sec) |
---|---|---|
NLTK | 0.893 | 7.224 |
prose | 0.961 | 2.538 |
(See scripts/test_model.py for more information.)
The full list of supported POS tags is given below.
TAG | DESCRIPTION |
---|---|
( | left round bracket |
) | right round bracket |
, | comma |
: | colon |
. | period |
'' | closing quotation mark |
`` | opening quotation mark |
# | number sign |
$ | currency |
CC | conjunction, coordinating |
CD | cardinal number |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | conjunction, subordinating or preposition |
JJ | adjective |
JJR | adjective, comparative |
JJS | adjective, superlative |
LS | list item marker |
MD | verb, modal auxiliary |
NN | noun, singular or mass |
NNP | noun, proper singular |
NNPS | noun, proper plural |
NNS | noun, plural |
PDT | predeterminer |
POS | possessive ending |
PRP | pronoun, personal |
PRP$ | pronoun, possessive |
RB | adverb |
RBR | adverb, comparative |
RBS | adverb, superlative |
RP | adverb, particle |
SYM | symbol |
TO | infinitival to |
UH | interjection |
VB | verb, base form |
VBD | verb, past tense |
VBG | verb, gerund or present participle |
VBN | verb, past participle |
VBP | verb, non-3rd person singular present |
VBZ | verb, 3rd person singular present |
WDT | wh-determiner |
WP | wh-pronoun, personal |
WP$ | wh-pronoun, possessive |
WRB | wh-adverb |
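Because related tags in the table above share a prefix (`NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives, `RB*` for adverbs), a coarse word class can be derived with a prefix check. The sketch below is an illustration only: the `token` type is a local stand-in with the same `Text` and `Tag` fields used in the earlier examples, `coarseClass` and `nouns` are hypothetical helpers, and the sample tags are written by hand rather than produced by the tagger.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a local stand-in so the sketch runs without the library.
type token struct {
	Text string
	Tag  string
}

// coarseClass maps a tag from the table to a broad category by prefix.
// Tags such as MD, RP, and WRB fall through to "other" in this sketch.
func coarseClass(tag string) string {
	switch {
	case strings.HasPrefix(tag, "NN"):
		return "noun"
	case strings.HasPrefix(tag, "VB"):
		return "verb"
	case strings.HasPrefix(tag, "JJ"):
		return "adjective"
	case strings.HasPrefix(tag, "RB"):
		return "adverb"
	default:
		return "other"
	}
}

// nouns returns the text of every token whose tag starts with "NN".
func nouns(toks []token) []string {
	var out []string
	for _, t := range toks {
		if coarseClass(t.Tag) == "noun" {
			out = append(out, t.Text)
		}
	}
	return out
}

func main() {
	// Hand-assigned tags, for illustration only.
	toks := []token{
		{"Go", "NNP"}, {"is", "VBZ"}, {"an", "DT"},
		{"open-source", "JJ"}, {"language", "NN"},
	}
	fmt.Println(nouns(toks))
}
```

With the real library, the same filter could be applied to the values returned by `doc.Tokens()`, since the earlier examples read `tok.Text` and `tok.Tag` from them.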
NER
prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.
package main

import (
	"fmt"
	"log"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	doc, err := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
	if err != nil {
		log.Fatal(err)
	}
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
	}
}
However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.