sentence-splitter
Split {Japanese, English} text into sentences.
Installation
npm install sentence-splitter
Requirements:
CLI
$ npm install -g sentence-splitter
$ echo "This is a pen. But, this is not pen" | sentence-splitter
This is a pen.
But, this is not pen
Usage
export declare function split(text: string): (TxtParentNode | TxtNode)[];
export declare function splitAST(paragraphNode: TxtParentNode): TxtParentNode;
TxtParentNode and TxtNode are defined in TxtAST.
Example
import {split, Syntax} from "sentence-splitter";
let sentences = split(`There it is! I found it.
Hello World. My name is Jonas.`);
console.log(JSON.stringify(sentences, null, 4));
Node
This node is based on TxtAST.
Node's type
- Str: Str node has value
- Sentence: Sentence node has Str, WhiteSpace, or Punctuation nodes as children
- WhiteSpace: WhiteSpace node has \n.
- Punctuation: Punctuation node has . or 。
Get the values of these Syntax constants from the module:
import {Syntax} from "sentence-splitter";
console.log(Syntax.Sentence);
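As a rough sketch of how these constants might be used, the snippet below keeps only the Sentence nodes of a result and joins each sentence's children back into text. To stay self-contained it does not import the library: SyntaxLike, the Sketch* types, and the hand-built nodes array are hypothetical stand-ins for Syntax and for the return value of split, and the constant values are assumed to equal the type strings listed under "Node's interface".

```typescript
// Hypothetical stand-in for the library's Syntax constants (values are assumed).
const SyntaxLike = {
    Str: "Str",
    Sentence: "Sentence",
    WhiteSpace: "WhiteSpace",
    Punctuation: "Punctuation",
} as const;

// Simplified node shapes for this sketch; the real types come from TxtAST.
interface SketchTextNode {
    type: "Str" | "WhiteSpace" | "Punctuation";
    value: string;
}
interface SketchSentenceNode {
    type: "Sentence";
    children: SketchTextNode[];
}
type SketchNode = SketchTextNode | SketchSentenceNode;

// Hand-built data imitating the shape of a split() result (not real library output).
const nodes: SketchNode[] = [
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "There it is" },
            { type: "Punctuation", value: "!" },
        ],
    },
    { type: "WhiteSpace", value: "\n" },
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "I found it" },
            { type: "Punctuation", value: "." },
        ],
    },
];

// Keep only Sentence nodes, then rebuild each sentence from its children's values.
const sentenceTexts: string[] = nodes
    .filter((node): node is SketchSentenceNode => node.type === SyntaxLike.Sentence)
    .map((sentence) => sentence.children.map((child) => child.value).join(""));

console.log(sentenceTexts); // [ 'There it is!', 'I found it.' ]
```

Comparing against the constants rather than string literals keeps the filter in step with the library if the type names ever change.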
Node's interface
export interface WhiteSpaceNode extends TxtTextNode {
    readonly type: "WhiteSpace";
}
export interface PunctuationNode extends TxtTextNode {
    readonly type: "Punctuation";
}
export interface StrNode extends TxtTextNode {
    readonly type: "Str";
}
export interface SentenceNode extends TxtParentNode {
    readonly type: "Sentence";
}
For more details, please see TxtAST.
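Because each interface carries a distinct string literal in its type field, TypeScript can narrow a union of these nodes with a plain switch. The sketch below uses simplified local copies of the interfaces: the Mini* names and the describeNode helper are hypothetical, and the real TxtTextNode and TxtParentNode carry more fields than shown here.

```typescript
// Simplified local copies of the node interfaces, for illustration only.
interface MiniTextNode {
    readonly value: string;
}
interface MiniWhiteSpace extends MiniTextNode {
    readonly type: "WhiteSpace";
}
interface MiniPunctuation extends MiniTextNode {
    readonly type: "Punctuation";
}
interface MiniStr extends MiniTextNode {
    readonly type: "Str";
}
interface MiniSentence {
    readonly type: "Sentence";
    readonly children: ReadonlyArray<MiniStr | MiniWhiteSpace | MiniPunctuation>;
}
type MiniNode = MiniWhiteSpace | MiniPunctuation | MiniStr | MiniSentence;

// The `type` literal is the discriminant, so each case sees the narrowed node type.
function describeNode(node: MiniNode): string {
    switch (node.type) {
        case "Sentence":
            return `Sentence with ${node.children.length} children`;
        case "Str":
            return `Str: ${node.value}`;
        case "Punctuation":
            return `Punctuation: ${node.value}`;
        case "WhiteSpace":
            // JSON.stringify makes the newline visible as \n.
            return `WhiteSpace: ${JSON.stringify(node.value)}`;
    }
}

const strNode: MiniStr = { type: "Str", value: "Hello World" };
const dotNode: MiniPunctuation = { type: "Punctuation", value: "." };
const sentenceNode: MiniSentence = { type: "Sentence", children: [strNode, dotNode] };

console.log(describeNode(sentenceNode)); // Sentence with 2 children
console.log(describeNode(strNode)); // Str: Hello World
```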
Node layout
Node layout image.
<WhiteSpace />
<Sentence>
    <Str />
    <Punctuation />
    <Str />
    <Punctuation />
</Sentence>
<WhiteSpace />
Note: This library does not split Str into Str and WhiteSpace nodes (tokenization), because tokenization requires language-specific context.
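To illustrate why this is left to the caller, here is a minimal sketch of a whitespace-based tokenizer that could be applied to a Str value. naiveTokenize is a hypothetical helper, not part of this library. It separates space-delimited English reasonably well but returns Japanese text as a single chunk, since Japanese does not put spaces between words.

```typescript
// Hypothetical helper: split a Str value into alternating word and whitespace chunks.
function naiveTokenize(value: string): string[] {
    // The capture group keeps the whitespace runs in the result;
    // the filter drops empty strings produced at the edges.
    return value.split(/(\s+)/).filter((chunk) => chunk.length > 0);
}

// English: words and spaces come apart.
console.log(naiveTokenize("This is a pen"));
// [ 'This', ' ', 'is', ' ', 'a', ' ', 'pen' ]

// Japanese: no spaces, so the whole value stays in one chunk.
console.log(naiveTokenize("これはペンです"));
// [ 'これはペンです' ]
```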
Reference
This library uses the "Golden Rule" tests of pragmatic_segmenter.
Tests
npm test
Contributing
- Fork it!
- Create your feature branch: git checkout -b my-new-feature
- Commit your changes: git commit -am 'Add some feature'
- Push to the branch: git push origin my-new-feature
- Submit a pull request :D
License
MIT