github.com/elliott5/sentences

v1.0.2
Source
Go

Version published: 9 years ago

Created: 9 years ago

Source

MIT

Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Install

go get gopkg.in/neurosnap/sentences.v1
go install gopkg.in/neurosnap/sentences.v1/cmd/sentences

Binaries

Linux

Mac

Windows

Command

Command line

Get it

go get gopkg.in/neurosnap/sentences.v1

Use it

package main

import (
    "fmt"

    "gopkg.in/neurosnap/sentences.v1"
    "gopkg.in/neurosnap/sentences.v1/data"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Compiling language specific data into a binary file can be accomplished
    // by using `make <lang>` and then loading the `json` data:
    b, err := data.Asset("data/english.json")
    if err != nil {
        panic(err)
    }

    // load the training data from the raw JSON bytes
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }

    // create the default sentence tokenizer
    tokenizer := sentences.NewSentenceTokenizer(training)

    // named sents so the variable does not shadow the sentences package
    sents := tokenizer.Tokenize(text)

    for _, s := range sents {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some problems I noticed with English text.

package main

import (
    "fmt"

    "gopkg.in/neurosnap/sentences.v1/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}

Customizable

Sentences was built around composability; most major components of this package can be extended.

Eager to make ad hoc changes but don't know where to start? Have a look at github.com/neurosnap/sentences/english for a solid example.

Notice

I have not tested this tokenizer in any language besides English. By default the command line utility loads English. I welcome anyone willing to test the other languages to submit updates as needed.

A primary goal for this package is to be multilingual so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.

Many problems arise when tokenizing text into sentences, the primary one being abbreviations. The Punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.

Unsupervised multilingual sentence boundary detection

Performance

Using the Brown Corpus, which is annotated American English text, we compare this package with other libraries across multiple programming languages.

Library	Avg Speed (s, 10 runs)	Accuracy (%)
Sentences	1.96	98.95
NLTK	5.22	99.21

Package last updated on 15 Apr 2016
