pat-tree - npm Package Compare versions

Comparing version 1.0.0 to 1.0.1


package.json
{
"name": "pat-tree",
-"version": "1.0.0",
+"version": "1.0.1",
"description": "PAT tree construction for Chinese documents, keyword extraction and text segmentation",

@@ -16,5 +16,10 @@ "main": "index.js",

"pat-tree",
"trie",
"patricia tree",
"pat tree",
"PAT",
"tree",
"information retrieval",
"Chinese",
"ckip",
"keyword extraction",

@@ -21,0 +26,0 @@ "text segmentation"

pat-tree
========
PAT tree construction for Chinese documents.
In information retrieval, text segmentation of Chinese-like
documents has been a difficult task, since Chinese words are
written continuously, with no white space between them. But finding the basic
elements of a document is critical for all applications in information retrieval.
A PAT tree is a Patricia tree, a kind of compressed trie, that is used particularly for
text segmentation and word retrieval. This module can be used for
PAT tree construction for Chinese documents.
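
To make the idea concrete, here is a minimal sketch of the indexing principle, assuming (as is usual for PAT trees) that every suffix of the text is indexed so that any substring can be counted quickly. It uses a plain, uncompressed character trie rather than the compressed Patricia structure this module builds, and the names `buildSuffixTrie` and `frequency` are hypothetical helpers, not part of the pat-tree API.

```javascript
// Illustration only: a plain suffix trie, NOT the module's PAT tree.
// buildSuffixTrie and frequency are hypothetical names.

// Insert every suffix of `text` (truncated to maxDepth characters)
// into a character trie, counting how often each prefix is visited.
function buildSuffixTrie(text, maxDepth) {
  const root = { count: 0, children: new Map() };
  const chars = Array.from(text); // split by code point, safe for CJK
  for (let start = 0; start < chars.length; start++) {
    let node = root;
    const end = Math.min(chars.length, start + maxDepth);
    for (let i = start; i < end; i++) {
      const ch = chars[i];
      if (!node.children.has(ch)) {
        node.children.set(ch, { count: 0, children: new Map() });
      }
      node = node.children.get(ch);
      node.count += 1;
    }
  }
  return root;
}

// Number of occurrences of `pattern` in the indexed text
// (valid for patterns no longer than maxDepth characters).
function frequency(root, pattern) {
  let node = root;
  for (const ch of Array.from(pattern)) {
    node = node.children.get(ch);
    if (!node) return 0;
  }
  return node.count;
}

const trie = buildSuffixTrie("台北市長台北市民", 4);
console.log(frequency(trie, "台北市")); // 2
console.log(frequency(trie, "市長"));   // 1
console.log(frequency(trie, "高雄"));   // 0
```

Substring frequencies of this kind are the raw material for deciding which character sequences behave like words; a Patricia tree stores the same information far more compactly.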
It provides functionality to add documents and construct a PAT tree in memory,

@@ -9,2 +16,6 @@ convert to JSON for storing to database,

You can collect a corpus, add all of its documents to construct a PAT tree,
then extract significant lexical patterns and do text segmentation
on other documents.
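
As a rough illustration of the last step, the sketch below segments a document with a list of already-extracted patterns using forward maximum matching. This is a common textbook technique swapped in for illustration; it is not claimed to be the algorithm behind `segmentDoc()`, and the function name `segment` and the sample patterns are assumptions.

```javascript
// Illustration only: forward maximum matching, not the module's algorithm.
// `segment` is a hypothetical helper; `patterns` stands in for extracted SLPs.
function segment(doc, patterns) {
  const dict = new Set(patterns);
  // Longest pattern length in characters (at least 1).
  const maxLen = Math.max(1, ...patterns.map((p) => Array.from(p).length));
  const chars = Array.from(doc);
  const out = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i]; // fall back to a single character
    // Try the longest candidate first, then shrink.
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dict.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    out.push(matched);
    i += Array.from(matched).length;
  }
  return out;
}

console.log(segment("台北市長參加會議", ["台北", "市長", "參加", "會議"]));
// [ '台北', '市長', '參加', '會議' ]
console.log(segment("我愛台北", ["台北"]));
// [ '我', '愛', '台北' ]
```

Characters that match no pattern are emitted one at a time, which keeps the output lossless: joining the pieces always reproduces the input.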
example of result:

@@ -39,2 +50,4 @@

`doc` is the document you want to add to the tree. data type: string
### Extract Significant Lexical Patterns

@@ -142,7 +155,7 @@

```javascript
-tree.printTreeContent(printExternalNodes, printDocuments);
+tree.printTreeContent(printExternalNode, printDocuments);
```
Print the content of the tree on console.
-If `printExternalNodes` is set to true, print out all external nodes for each internal node.
+If `printExternalNode` is set to true, print out one external node for each internal node.
If `printDocuments` is set to true, print out the whole collection of the tree.

@@ -181,3 +194,3 @@

id: 3, // the id of this node, data type: integer, auto generated.
-parent: 1, // the parent id of this node, data type: integer
+parent: parentNode, // the parent of this node, data type: Node
left: leftChildNode, // data type: Node

@@ -251,4 +264,19 @@ right: rightChildNode, // data type: Node

# Performance
All operations are fast, but they require a large amount of memory and disk space to complete successfully.
Running on a MacBook Pro Retina connected to a local MongoDB, with 8 GB of memory
allotted through the V8 option `--max_old_space_size=8000`, the module performs as follows.
* Adding 32,769 Facebook-like posts with `tree.addDocument()` takes about 5 minutes.
* After the above operation, extracting SLPs with `tree.extractSLP()` takes about 5 minutes.
* After the above operation, converting to JSON with `tree.toJSON()` and storing three collections to the database takes about 1 minute
and 5 GB of disk space, producing about 1,000,000 tree node records.
* After the above operation, finding all collections in the database and rebuilding the tree with `tree.reborn()` takes about 1 minute.
* After the above operation, text segmentation of 32,769 posts with `tree.segmentDoc()`, given the SLPs extracted above,
takes about 5 minutes.
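
To check that the memory option took effect and to time a step yourself, a small sketch like the following may help. It relies only on Node's built-in `v8` module and `process.hrtime.bigint()`; the helper names `heapLimitMB` and `timed` are hypothetical, and the workload passed to `timed` is a placeholder rather than a real pat-tree call.

```javascript
// Illustration only: helpers for inspecting the heap limit and timing a step.
// heapLimitMB and timed are hypothetical names, not part of pat-tree.
const v8 = require("v8");

// Current V8 heap size limit, rounded to whole megabytes.
function heapLimitMB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// Run a synchronous function, log how long it took, and return its result.
function timed(label, fn) {
  const start = process.hrtime.bigint();
  const result = fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${elapsedMs.toFixed(1)} ms`);
  return result;
}

console.log(`V8 heap limit: ${heapLimitMB()} MB`);

// Placeholder workload; replace the callback with the operation to measure.
const total = timed("placeholder workload", () => {
  let sum = 0;
  for (let i = 0; i < 1000; i++) sum += i;
  return sum;
});
```

Saved under a hypothetical name such as `check-heap.js`, the script can be started with `node --max_old_space_size=8000 check-heap.js`; the reported limit should then be noticeably higher than the default.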
# Release History
* 1.0.1 Modify README file
* 1.0.0 Stable release

@@ -255,0 +283,0 @@ * 0.2.8 Improve algorithm of `segmentDoc()`
