
github.com/jdkato/prose/v2
`prose` is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.
You can find a more detailed summary of the library's performance here: Introducing `prose` v2.0.0: Bringing NLP to Go.
$ go get gopkg.in/jdkato/prose.v2
package main

import (
	"fmt"
	"log"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag, tok.Label)
		// Go NNP B-GPE
		// is VBZ O
		// an DT O
		// ...
	}

	// Iterate over the doc's named-entities:
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
		// Go GPE
		// Google GPE
	}

	// Iterate over the doc's sentences:
	for _, sent := range doc.Sentences() {
		fmt.Println(sent.Text)
		// Go is an open-source programming language created at Google.
	}
}
The document-creation process adheres to the following sequence of steps:
tokenization -> POS tagging -> NE extraction
                \
                 segmentation
Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:
doc, err := prose.NewDocument(
	"Go is an open-source programming language created at Google.",
	prose.WithExtraction(false))
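For readers unfamiliar with the functional-options pattern, the following is a minimal, standard-library-only sketch of how such options can be structured. It is not `prose`'s implementation: the `pipeline` type and the `withExtraction`, `withSegmentation`, and `newPipeline` names are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// pipeline records which (hypothetical) processing steps are enabled.
type pipeline struct {
	tokenize, tag, extract, segment bool
}

// option mutates a pipeline; each with* function returns one.
type option func(*pipeline)

// withExtraction toggles the named-entity extraction step.
func withExtraction(on bool) option {
	return func(p *pipeline) { p.extract = on }
}

// withSegmentation toggles the sentence segmentation step.
func withSegmentation(on bool) option {
	return func(p *pipeline) { p.segment = on }
}

// newPipeline starts with every step enabled and applies opts in order.
func newPipeline(opts ...option) pipeline {
	p := pipeline{tokenize: true, tag: true, extract: true, segment: true}
	for _, opt := range opts {
		opt(&p)
	}
	return p
}

func main() {
	p := newPipeline(withExtraction(false))
	fmt.Printf("%+v\n", p)
	// {tokenize:true tag:true extract:false segment:true}
}
```

The appeal of this design is that the default call stays short, while callers opt out of individual steps without a growing list of positional parameters.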
`prose` includes a tokenizer capable of handling modern text, including the non-word character spans shown below.
Type | Example |
---|---|
Email addresses | Jane.Doe@example.com |
Hashtags | #trending |
Mentions | @jdkato |
URLs | https://github.com/jdkato/prose |
Emoticons | :-) , >:( , o_0 , etc. |
package main

import (
	"fmt"
	"log"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag)
		// @jdkato NN
		// , ,
		// go VB
		// to TO
		// http://example.com NN
		// thanks NNS
		// :) SYM
		// . .
	}
}
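To illustrate why these spans need special handling, here is a rough regex-based sketch that keeps URLs, email addresses, mentions, hashtags, and a few emoticons intact. It uses only the standard library and is not how `prose` tokenizes text: the `tokenRE` pattern and `tokenize` function are hypothetical, and the emoticon pattern covers only a handful of shapes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRE tries the alternatives in order, so the special spans are
// matched before the generic word and punctuation fallbacks.
var tokenRE = regexp.MustCompile(strings.Join([]string{
	`https?://[^\s,]+[^\s,.!?]`,    // URLs, without trailing punctuation
	`[\w.+-]+@[\w-]+(?:\.[\w-]+)+`, // email addresses
	`[@#]\w+`,                      // mentions and hashtags
	`>?[:;=]-?[()DPp]`,             // a few emoticons, e.g. :) :-) >:(
	`\w+(?:'\w+)?`,                 // words, including simple contractions
	`[^\s\w]`,                      // any other single non-space character
}, "|"))

// tokenize returns every non-overlapping match of tokenRE in text.
func tokenize(text string) []string {
	return tokenRE.FindAllString(text, -1)
}

func main() {
	fmt.Printf("%q\n", tokenize("@jdkato, go to http://example.com thanks :)."))
	// ["@jdkato" "," "go" "to" "http://example.com" "thanks" ":)" "."]
}
```

A plain whitespace-and-punctuation split would instead break `http://example.com` and `:)` into several meaningless pieces.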
`prose` includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the `pragmatic_segmenter`.
Name | Language | License | GRS (English) | GRS (Other) | Speed† |
---|---|---|---|---|---|
Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
Scapel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |
† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while `prose` was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.
package main

import (
	"fmt"
	"log"
	"strings"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument(strings.Join([]string{
		"I can see Mt. Fuji from here.",
		"St. Michael's Church is on 5th st. near the light."}, " "))
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's sentences:
	sents := doc.Sentences()
	fmt.Println(len(sents)) // 2
	for _, sent := range sents {
		fmt.Println(sent.Text)
		// I can see Mt. Fuji from here.
		// St. Michael's Church is on 5th st. near the light.
	}
}
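The example is tricky because `Mt.`, `St.`, and `st.` end in a period without ending a sentence. The sketch below shows the simplest way to cope with that, an abbreviation list. It is a toy illustration using only the standard library, not the algorithm `prose` uses; the `abbreviations` map and `splitSentences` function are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// abbreviations is a tiny illustrative list, not the one prose uses.
var abbreviations = map[string]bool{
	"mt": true, "st": true, "dr": true, "mr": true, "mrs": true,
}

// splitSentences ends a sentence after any word that finishes with
// '.', '!' or '?', unless that word is a known abbreviation.
func splitSentences(text string) []string {
	var sents, cur []string
	for _, w := range strings.Fields(text) {
		cur = append(cur, w)
		if !strings.ContainsAny(w[len(w)-1:], ".!?") {
			continue // no terminal punctuation
		}
		stem := strings.ToLower(strings.TrimRight(w, ".!?"))
		if abbreviations[stem] {
			continue // e.g. "Mt." does not end the sentence
		}
		sents = append(sents, strings.Join(cur, " "))
		cur = nil
	}
	if len(cur) > 0 {
		sents = append(sents, strings.Join(cur, " "))
	}
	return sents
}

func main() {
	text := "I can see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light."
	for _, s := range splitSentences(text) {
		fmt.Println(s)
	}
	// I can see Mt. Fuji from here.
	// St. Michael's Church is on 5th st. near the light.
}
```

A fixed list like this fails as soon as an abbreviation really does end a sentence, which is one reason benchmark suites such as the Golden Rules exist.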
`prose` includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:
Library | Accuracy | 5-Run Average (sec) |
---|---|---|
NLTK | 0.893 | 7.224 |
prose | 0.961 | 2.538 |
(See `scripts/test_model.py` for more information.)
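The accuracy column is presumably the share of tokens whose predicted tag matches the reference tag. That is an assumption, since the evaluation script is not reproduced here. Under that assumption, the metric can be sketched as follows; the `accuracy` function is a hypothetical helper and the tags in `main` are made-up toy data.

```go
package main

import "fmt"

// accuracy returns the fraction of positions where pred matches gold.
// It returns 0 if the slices are empty or differ in length.
func accuracy(gold, pred []string) float64 {
	if len(gold) == 0 || len(gold) != len(pred) {
		return 0
	}
	correct := 0
	for i := range gold {
		if gold[i] == pred[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(gold))
}

func main() {
	// Toy data for illustration only; not taken from the Treebank corpus.
	gold := []string{"NNP", "VBZ", "DT", "JJ"}
	pred := []string{"NNP", "VBZ", "DT", "NN"}
	fmt.Printf("%.3f\n", accuracy(gold, pred)) // 0.750
}
```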
The full list of supported POS tags is given below.
TAG | DESCRIPTION |
---|---|
( | left round bracket |
) | right round bracket |
, | comma |
: | colon |
. | period |
'' | closing quotation mark |
`` | opening quotation mark |
# | number sign |
$ | currency |
CC | conjunction, coordinating |
CD | cardinal number |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | conjunction, subordinating or preposition |
JJ | adjective |
JJR | adjective, comparative |
JJS | adjective, superlative |
LS | list item marker |
MD | verb, modal auxiliary |
NN | noun, singular or mass |
NNP | noun, proper singular |
NNPS | noun, proper plural |
NNS | noun, plural |
PDT | predeterminer |
POS | possessive ending |
PRP | pronoun, personal |
PRP$ | pronoun, possessive |
RB | adverb |
RBR | adverb, comparative |
RBS | adverb, superlative |
RP | adverb, particle |
SYM | symbol |
TO | infinitival to |
UH | interjection |
VB | verb, base form |
VBD | verb, past tense |
VBG | verb, gerund or present participle |
VBN | verb, past participle |
VBP | verb, non-3rd person singular present |
VBZ | verb, 3rd person singular present |
WDT | wh-determiner |
WP | wh-pronoun, personal |
WP$ | wh-pronoun, possessive |
WRB | wh-adverb |
`prose` v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (`PERSON`) and geographical/political entities (`GPE`) by default.
package main

import (
	"fmt"
	"log"

	"gopkg.in/jdkato/prose.v2"
)

func main() {
	doc, err := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
	if err != nil {
		log.Fatal(err)
	}
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
		// Lebron James PERSON
		// Los Angeles GPE
	}
}
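The token-level output shown earlier (`B-GPE`, `O`) suggests that entities are assembled from per-token labels. The sketch below shows one common way to do that, assuming a BIO-style scheme in which `B-` starts an entity and `I-` continues it. The `I-` prefix is an assumption (only `B-GPE` and `O` appear above), and the `token`, `entity`, and `groupEntities` names are hypothetical rather than part of `prose`.

```go
package main

import (
	"fmt"
	"strings"
)

type token struct{ Text, Label string }
type entity struct{ Text, Label string }

// groupEntities merges consecutive B-/I- labelled tokens into entities.
func groupEntities(toks []token) []entity {
	var ents []entity
	var parts []string
	var label string
	// flush emits the entity collected so far, if any.
	flush := func() {
		if len(parts) > 0 {
			ents = append(ents, entity{strings.Join(parts, " "), label})
			parts = nil
		}
	}
	for _, t := range toks {
		switch {
		case strings.HasPrefix(t.Label, "B-"):
			flush() // a new entity begins
			label = t.Label[2:]
			parts = []string{t.Text}
		case strings.HasPrefix(t.Label, "I-") && len(parts) > 0 && t.Label[2:] == label:
			parts = append(parts, t.Text) // continue the current entity
		default:
			flush() // "O" or an inconsistent label closes the entity
		}
	}
	flush()
	return ents
}

func main() {
	// Labels are hand-written for illustration, not produced by a model.
	toks := []token{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"}, {"plays", "O"},
		{"basketball", "O"}, {"in", "O"}, {"Los", "B-GPE"},
		{"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, e := range groupEntities(toks) {
		fmt.Println(e.Text, e.Label)
	}
	// Lebron James PERSON
	// Los Angeles GPE
}
```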
However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + `prose`: Radically efficient machine teaching in Go for a tutorial.