github.com/jdkato/prose/v3
prose is a natural language processing library (English only, at the moment) written in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.
You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.
$ go get github.com/jdkato/prose/v2

package main

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag, tok.Label)
		// Go NNP B-GPE
		// is VBZ O
		// an DT O
		// ...
	}

	// Iterate over the doc's named-entities:
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
		// Go GPE
		// Google GPE
	}

	// Iterate over the doc's sentences:
	for _, sent := range doc.Sentences() {
		fmt.Println(sent.Text)
		// Go is an open-source programming language created at Google.
	}
}

The document-creation process adheres to the following sequence of steps:

    tokenization -> POS tagging -> NE extraction
                \
                 segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
	"Go is an open-source programming language created at Google.",
	prose.WithExtraction(false))

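The option-passing style shown above is commonly known as the functional options pattern. The following is a minimal standard-library sketch of how such options could work; it is not prose's source code. The names `pipeline`, `option`, `withTagging`, `withExtraction`, `newPipeline`, and `steps` are hypothetical, and the rule that disabling tagging also disables extraction is an assumption drawn from the step sequence above.

```go
package main

import (
	"fmt"
	"strings"
)

// pipeline records which processing steps are enabled.
// Hypothetical type for illustration; not part of prose's API.
type pipeline struct {
	segment bool
	tag     bool
	extract bool
}

// option mutates a pipeline while it is being constructed.
type option func(*pipeline)

// withTagging toggles POS tagging (hypothetical helper).
func withTagging(on bool) option { return func(p *pipeline) { p.tag = on } }

// withExtraction toggles named-entity extraction (hypothetical helper).
func withExtraction(on bool) option { return func(p *pipeline) { p.extract = on } }

// newPipeline enables every step, then applies the caller's options.
func newPipeline(opts ...option) pipeline {
	p := pipeline{segment: true, tag: true, extract: true}
	for _, opt := range opts {
		opt(&p)
	}
	// Assumed dependency: extraction relies on POS tags.
	if !p.tag {
		p.extract = false
	}
	return p
}

// steps lists the enabled steps; tokenization always runs in this sketch.
func (p pipeline) steps() []string {
	out := []string{"tokenization"}
	if p.segment {
		out = append(out, "segmentation")
	}
	if p.tag {
		out = append(out, "POS tagging")
	}
	if p.extract {
		out = append(out, "NE extraction")
	}
	return out
}

func main() {
	fmt.Println(strings.Join(newPipeline().steps(), ", "))
	fmt.Println(strings.Join(newPipeline(withExtraction(false)).steps(), ", "))
	fmt.Println(strings.Join(newPipeline(withTagging(false)).steps(), ", "))
}
```

Defaults live in one place (`newPipeline`), so callers only mention the steps they want to change.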
prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.
Type | Example |
---|---|
Email addresses | Jane.Doe@example.com |
Hashtags | #trending |
Mentions | @jdkato |
URLs | https://github.com/jdkato/prose |
Emoticons | :-) , >:( , o_0 , etc. |

package main

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag)
		// @jdkato NN
		// , ,
		// go VB
		// to TO
		// http://example.com NN
		// thanks NNS
		// :) SYM
		// . .
	}
}

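
One way to recognize spans like those in the table is to try a few anchored regular expressions before falling back to ordinary word handling. The sketch below illustrates that idea only; it is not how prose's tokenizer is implemented. The `classify` function and the patterns are hypothetical and deliberately simplified.

```go
package main

import (
	"fmt"
	"regexp"
)

// kinds pairs a label with an anchored pattern. The patterns are simplified
// assumptions for illustration and will miss many real-world cases.
var kinds = []struct {
	label string
	re    *regexp.Regexp
}{
	{"url", regexp.MustCompile(`^https?://\S+$`)},
	{"email", regexp.MustCompile(`^[\w.+-]+@[\w-]+(\.[\w-]+)+$`)},
	{"hashtag", regexp.MustCompile(`^#\w+$`)},
	{"mention", regexp.MustCompile(`^@\w+$`)},
	{"emoticon", regexp.MustCompile(`^(:-?\)|>:\(|o_0)$`)},
}

// classify returns the label of the first matching pattern, or "word".
func classify(s string) string {
	for _, k := range kinds {
		if k.re.MatchString(s) {
			return k.label
		}
	}
	return "word"
}

func main() {
	samples := []string{
		"Jane.Doe@example.com", "#trending", "@jdkato",
		"https://github.com/jdkato/prose", ":-)", ">:(", "o_0", "prose",
	}
	for _, s := range samples {
		fmt.Printf("%-32s %s\n", s, classify(s))
	}
}
```

Pattern order matters here: the first match wins, so more specific patterns should come before more general ones.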
prose includes one of the most accurate sentence segmenters available, as measured by the Golden Rules created by the developers of pragmatic_segmenter.
Name | Language | License | GRS (English) | GRS (Other) | Speed† |
---|---|---|---|---|---|
Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
Scalpel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |
† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running OS X 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running macOS 10.13.3.
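
The GRS (English) column appears to be the number of Golden Rules passed divided by the 52 English rules, expressed as a percentage. Under that assumption, the short sketch below reproduces a few of the table's figures; the `grs` helper is hypothetical.

```go
package main

import "fmt"

// grs returns the share of Golden Rules passed, as a percentage.
// Hypothetical helper; assumes the score is simply passed/total.
func grs(passed, total int) float64 {
	return 100 * float64(passed) / float64(total)
}

func main() {
	fmt.Printf("%.2f%%\n", grs(51, 52)) // 98.08%
	fmt.Printf("%.2f%%\n", grs(39, 52)) // 75.00%
	fmt.Printf("%.2f%%\n", grs(15, 52)) // 28.85%
}
```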

package main

import (
	"fmt"
	"strings"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, _ := prose.NewDocument(strings.Join([]string{
		"I can see Mt. Fuji from here.",
		"St. Michael's Church is on 5th st. near the light."}, " "))

	// Iterate over the doc's sentences:
	sents := doc.Sentences()
	fmt.Println(len(sents)) // 2
	for _, sent := range sents {
		fmt.Println(sent.Text)
		// I can see Mt. Fuji from here.
		// St. Michael's Church is on 5th st. near the light.
	}
}

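
To see why the example above is hard, consider a naive rule-based splitter. The sketch below is a toy illustration under stated assumptions, not prose's segmenter: it breaks after a period only when the next word starts with an uppercase letter and the word before the period is not in a small, hypothetical abbreviation list.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// abbreviations is a tiny, hypothetical list; a real segmenter needs far more.
var abbreviations = map[string]bool{"mt": true, "st": true, "dr": true, "mr": true}

// split breaks text after a word ending in a period when it is the last word,
// or when the next word starts with an uppercase letter and the word itself
// is not a known abbreviation.
func split(text string) []string {
	words := strings.Fields(text)
	var sents, cur []string
	for i, w := range words {
		cur = append(cur, w)
		if !strings.HasSuffix(w, ".") {
			continue
		}
		last := i == len(words)-1
		stem := strings.ToLower(strings.TrimSuffix(w, "."))
		nextUpper := !last && unicode.IsUpper([]rune(words[i+1])[0])
		if last || (nextUpper && !abbreviations[stem]) {
			sents = append(sents, strings.Join(cur, " "))
			cur = nil
		}
	}
	// Keep any trailing words that did not end with a period.
	if len(cur) > 0 {
		sents = append(sents, strings.Join(cur, " "))
	}
	return sents
}

func main() {
	text := "I can see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light."
	for _, s := range split(text) {
		fmt.Println(s)
	}
}
```

This heuristic handles the two sentences above but fails easily, for example when an abbreviation ends a sentence and the next one starts with a capital letter. Cases like that are what the Golden Rules are designed to probe.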
prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:
Library | Accuracy | 5-Run Average (sec) |
---|---|---|
NLTK | 0.893 | 7.224 |
prose | 0.961 | 2.538 |
(See scripts/test_model.py for more information.)
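
The Accuracy column is presumably token-level accuracy: the share of tokens whose predicted tag matches the gold-standard tag. That reading is an assumption, and the referenced script may compute it differently. A minimal sketch, with a hypothetical `accuracy` helper and made-up sample tags:

```go
package main

import "fmt"

// accuracy returns the fraction of predicted tags equal to the gold tags.
// It returns 0 when the slices are empty or differ in length.
func accuracy(gold, pred []string) float64 {
	if len(gold) == 0 || len(gold) != len(pred) {
		return 0
	}
	correct := 0
	for i := range gold {
		if gold[i] == pred[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(gold))
}

func main() {
	// Illustrative tags only; not taken from the Treebank corpus.
	gold := []string{"NNP", "VBZ", "DT", "JJ"}
	pred := []string{"NNP", "VBZ", "DT", "NN"}
	fmt.Printf("%.3f\n", accuracy(gold, pred)) // 0.750
}
```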
The full list of supported POS tags is given below.
TAG | DESCRIPTION |
---|---|
( | left round bracket |
) | right round bracket |
, | comma |
: | colon |
. | period |
'' | closing quotation mark |
`` | opening quotation mark |
# | number sign |
$ | currency |
CC | conjunction, coordinating |
CD | cardinal number |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | conjunction, subordinating or preposition |
JJ | adjective |
JJR | adjective, comparative |
JJS | adjective, superlative |
LS | list item marker |
MD | verb, modal auxiliary |
NN | noun, singular or mass |
NNP | noun, proper singular |
NNPS | noun, proper plural |
NNS | noun, plural |
PDT | predeterminer |
POS | possessive ending |
PRP | pronoun, personal |
PRP$ | pronoun, possessive |
RB | adverb |
RBR | adverb, comparative |
RBS | adverb, superlative |
RP | adverb, particle |
SYM | symbol |
TO | infinitival to |
UH | interjection |
VB | verb, base form |
VBD | verb, past tense |
VBG | verb, gerund or present participle |
VBN | verb, past participle |
VBP | verb, non-3rd person singular present |
VBZ | verb, 3rd person singular present |
WDT | wh-determiner |
WP | wh-pronoun, personal |
WP$ | wh-pronoun, possessive |
WRB | wh-adverb |
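
Downstream code often only needs a coarse word class rather than the full tag. Because related tags in the table share prefixes (NN, VB, JJ, RB), a prefix check is one simple way to group them. The `coarse` helper below is a hypothetical sketch, not part of prose.

```go
package main

import (
	"fmt"
	"strings"
)

// coarse maps a tag from the table above to a broad word class.
// Tags outside the four prefixes, including MD and RP, fall under "other".
func coarse(tag string) string {
	switch {
	case strings.HasPrefix(tag, "NN"):
		return "noun"
	case strings.HasPrefix(tag, "VB"):
		return "verb"
	case strings.HasPrefix(tag, "JJ"):
		return "adjective"
	case strings.HasPrefix(tag, "RB"):
		return "adverb"
	default:
		return "other"
	}
}

func main() {
	for _, tag := range []string{"NNP", "VBZ", "JJR", "RBS", "DT"} {
		fmt.Println(tag, coarse(tag))
	}
}
```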
prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.

package main

import (
	"fmt"

	"github.com/jdkato/prose/v2"
)

func main() {
	doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
		// Lebron James PERSON
		// Los Angeles GPE
	}
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.
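
Token labels such as B-GPE, seen in the first example, follow the common BIO convention, in which a B- label opens an entity and I- labels continue it. The sketch below shows one way to merge such labels into multi-word entities. It assumes the standard BIO scheme, including I- continuation labels, and uses hypothetical `token`, `entity`, and `merge` names; prose's internal logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// token and entity are hypothetical stand-ins for illustration.
type token struct{ text, label string }
type entity struct{ text, label string }

// merge groups a B-X token and any directly following I-X tokens
// into a single entity of kind X.
func merge(toks []token) []entity {
	var ents []entity
	var parts []string
	kind := ""
	// flush emits the entity collected so far, if any, and resets state.
	flush := func() {
		if len(parts) > 0 {
			ents = append(ents, entity{strings.Join(parts, " "), kind})
		}
		parts, kind = nil, ""
	}
	for _, t := range toks {
		switch {
		case strings.HasPrefix(t.label, "B-"):
			flush()
			parts, kind = []string{t.text}, t.label[2:]
		case strings.HasPrefix(t.label, "I-") && len(parts) > 0 && t.label[2:] == kind:
			parts = append(parts, t.text)
		default:
			flush()
		}
	}
	flush()
	return ents
}

func main() {
	toks := []token{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"}, {"plays", "O"},
		{"basketball", "O"}, {"in", "O"}, {"Los", "B-GPE"},
		{"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, e := range merge(toks) {
		fmt.Println(e.text, e.label)
	}
}
```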