Comparing version 1.0.0 to 1.0.1
{ | ||
"name": "pat-tree", | ||
"version": "1.0.0", | ||
"version": "1.0.1", | ||
"description": "PAT tree construction for Chinese documents, keyword extraction and text segmentation", | ||
@@ -16,5 +16,10 @@ "main": "index.js", | ||
"pat-tree", | ||
"trie", | ||
"patricia tree", | ||
"pat tree", | ||
"PAT", | ||
"tree", | ||
"information retrieval", | ||
"Chinese", | ||
"ckip", | ||
"keyword extraction", | ||
@@ -21,0 +26,0 @@ "text segmentation" |
pat-tree | ||
======== | ||
PAT tree construction for Chinese document. | ||
In information retrieval, text segmentation of Chinese-like documents
has long been a difficult task, because Chinese words are written
continuously with no white space between them. Yet finding the basic
elements of a document is critical for all applications in information retrieval.
A PAT tree is a Patricia tree, a kind of trie, that is used particularly for
text segmentation and word retrieval. This module can be used for
PAT tree construction for Chinese documents.
Provide functionality to add documents and construct PAT tree in memory, | ||
@@ -9,2 +16,6 @@ convert to JSON for storing to database, | ||
You can collect a corpus, add all of its documents to construct a PAT tree,
then extract significant lexical patterns and use them to segment
other documents.
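To make the last step of that workflow concrete, here is a minimal sketch of what segmenting a document against a list of extracted patterns can look like. It uses plain greedy longest-match and is only an illustration: `segmentWithPatterns` and the sample patterns are hypothetical names made up for this example, and this is not claimed to be how `tree.segmentDoc()` works internally.

```javascript
// Illustrative only: greedy longest-match segmentation against a list of
// lexical patterns. This is NOT pat-tree's segmentDoc() implementation.
function segmentWithPatterns(doc, patterns) {
  // Try longer patterns first so that "資訊檢索" wins over "資訊".
  const sorted = [...patterns].sort((a, b) => b.length - a.length);
  const segments = [];
  let i = 0;
  while (i < doc.length) {
    const match = sorted.find((p) => p.length > 0 && doc.startsWith(p, i));
    // Fall back to a single character (one code point) when nothing matches.
    const piece = match || String.fromCodePoint(doc.codePointAt(i));
    segments.push(piece);
    i += piece.length;
  }
  return segments;
}

// Hypothetical patterns, standing in for the output of an extraction step.
const patterns = ["資訊", "資訊檢索", "文件"];
console.log(segmentWithPatterns("資訊檢索的文件", patterns));
// → [ '資訊檢索', '的', '文件' ]
```

Characters not covered by any pattern come out as single-character segments, which is why the quality of the extracted patterns largely determines the quality of the segmentation.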
example of result: | ||
@@ -39,2 +50,4 @@ | ||
`doc` is the document you want to add to the tree. data type: string | ||
### Extract Significant Lexical Patterns | ||
@@ -142,7 +155,7 @@ | ||
```javascript | ||
tree.printTreeContent(printExternalNodes, printDocuments); | ||
tree.printTreeContent(printExternalNode, printDocuments); | ||
``` | ||
Print the content of the tree on console. | ||
If `printExternalNodes` is set to true, print out all external nodes for each internal node. | ||
If `printExternalNode` is set to true, print out one external node for each internal node. | ||
If `printDocuments` is set to true, print out the whole collection of the tree. | ||
@@ -181,3 +194,3 @@ | ||
id: 3, // the id of this node, data type: integer, auto generated. | ||
parent: 1, // the parent id of this node, data type: integer | ||
parent: parentNode, // the parent of this node, data type: Node | ||
left: leftChildNode, // data type: Node | ||
@@ -251,4 +264,19 @@ right: rightChildNode, // data type: Node | ||
# Performance | ||
All operations are fast, but they require a considerable amount of memory and disk space to complete successfully.
The following figures were measured on a MacBook Pro Retina connected to a local MongoDB, with 8 GB of memory
made available to V8 by specifying the option `--max_old_space_size=8000`.
* Adding 32,769 Facebook-like posts with `tree.addDocument()` takes about 5 minutes.
* After the above operation, extracting SLPs with `tree.extractSLP()` takes about 5 minutes.
* After the above operation, converting to JSON with `tree.toJSON()` and storing three collections in the database takes about 1 minute
and 5 GB of disk space, and produces about 1,000,000 tree node records.
* After the above operation, finding all collections in the database and rebuilding the tree with `tree.reborn()` takes about 1 minute.
* After the above operation, segmenting the 32,769 posts with `tree.segmentDoc()`, given the SLPs extracted above,
takes about 5 minutes.
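Because these figures depend on the enlarged V8 heap, it can help to confirm that the option actually took effect before starting a long run. The sketch below uses Node's built-in `v8` module; the helper name `heapLimitMB` and the script name `check-heap.js` are made up for this example.

```javascript
// Report the V8 heap limit the current Node.js process is running with.
const v8 = require("v8");

function heapLimitMB() {
  // heap_size_limit is reported in bytes.
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`V8 heap limit: about ${heapLimitMB()} MB`);
```

Run it as `node --max_old_space_size=8000 check-heap.js`. The reported limit should be in the neighbourhood of 8,000 MB; a much smaller number suggests the flag was not passed to the process.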
# Release History | ||
* 1.0.1 Modify README file | ||
* 1.0.0 Stable release | ||
@@ -255,0 +283,0 @@ * 0.2.8 Improve algorithm of `segmentDoc()` |