pat-tree

PAT tree construction for Chinese documents, keyword extraction and text segmentation

PAT tree construction for Chinese documents. This module is still in development. It lets you add documents, construct a PAT tree in memory, store the tree in a database, extract keywords, and segment documents.

Example of a segmentation result:

有時 喜歡   有時候 不喜歡
為什麼 會 這樣   … ?
20 點 求 解 哈哈

WARNING

This project is still in development and is used for academic purposes. DO NOT use this module in production until this WARNING statement is removed. //TODO: improve document splitting algorithm

Installation

npm install pat-tree --save

Usage

Instantiate

var PATtree = require("pat-tree");
var tree = new PATtree();

Add document

tree.addDocument(doc);

Extract Significant Lexical Patterns

var SLPs = tree.extractSLP(TFThreshold, SEThreshold, verbose); 
// SLPs: array of strings, which are significant lexical patterns.

A pattern appears in the result array only if its frequency exceeds TFThreshold and its SE value exceeds SEThreshold.

verbose: optional. If set to true, progress is printed to the console.

TFThreshold should be an integer; SEThreshold should be a float between 0 and 1.
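The two-threshold rule can be illustrated without the library. The sketch below is not how extractSLP() is implemented; the candidate records and their field names (pattern, frequency, se) are hypothetical and exist only to show which patterns would pass a given pair of thresholds.

```javascript
// Hypothetical candidate records; the field names are illustrative only
// and are not part of the pat-tree API.
var candidates = [
	{ pattern: "測試", frequency: 12, se: 0.9 },
	{ pattern: "試文", frequency: 12, se: 0.3 },
	{ pattern: "你好", frequency: 2, se: 0.95 }
];

// Keep a pattern only when BOTH values exceed their thresholds.
function filterByThresholds(list, TFThreshold, SEThreshold) {
	return list.filter(function(c) {
		return c.frequency > TFThreshold && c.se > SEThreshold;
	}).map(function(c) {
		return c.pattern;
	});
}

var SLPs = filterByThresholds(candidates, 3, 0.5);
console.log(SLPs); // [ '測試' ]
```

Raising either threshold can only shrink the result, so a low TFThreshold with a high SEThreshold is one way to keep rare but well-formed patterns.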

Text segmentation

var result = tree.segmentDoc(doc, SLPs); 

doc is the document to be segmented, data type: string.

SLPs is an array of SLPs extracted by tree.extractSLP(), or an array of keywords obtained in any other way, data type: array of strings.

result is the result of document segmentation, data type: string.
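To give a feel for how a keyword list can drive segmentation, here is a minimal sketch using forward longest matching. This is a swapped-in technique for illustration and is not claimed to be the algorithm inside segmentDoc(); the function name segmentByKeywords is hypothetical.

```javascript
// Forward longest-match segmentation (illustrative; NOT pat-tree's algorithm).
function segmentByKeywords(doc, keywords) {
	// Try longer keywords first so that "有時候" wins over "有時".
	var sorted = keywords.slice().sort(function(a, b) {
		return b.length - a.length;
	});
	var parts = [];
	var i = 0;
	while (i < doc.length) {
		var match = null;
		for (var k = 0; k < sorted.length; k++) {
			if (sorted[k].length > 0 && doc.startsWith(sorted[k], i)) {
				match = sorted[k];
				break;
			}
		}
		// Characters not covered by any keyword are emitted one at a time.
		var piece = match || doc[i];
		parts.push(piece);
		i += piece.length;
	}
	return parts.join(" ");
}

var segmented = segmentByKeywords(
	"有時喜歡有時候不喜歡",
	["有時", "有時候", "喜歡", "不喜歡"]
);
console.log(segmented); // 有時 喜歡 有時候 不喜歡
```

As in segmentDoc(), the input is a string plus an array of keyword strings, and the output is a single string with the segments separated by spaces.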

Convert to JSON

var json = tree.toJSON(); 

The resulting json object has three parts:

  • json.header: JSON object,
  • json.documents: array,
  • json.tree: array

You can store them in a database and later use tree.reborn() to rebuild the tree. In a NoSQL database, you can store the three items in separate collections: the header collection would contain exactly one document, while the documents and tree collections would contain many documents.

For example, using the MongoDB native driver:

	var json = tree.toJSON();

	// One header object would be stored to database
	db.collection("header").insert(json.header, function(err, result) {
		if(err) throw err;
	});

	// All documents would be stored to database
	db.collection("documents").insert(json.documents, function(err, result) {
		if(err) throw err;
	});	

	// All nodes of the tree would be stored to database
	db.collection("tree").insert(json.tree, function(err, result) {
		if(err) throw err;				
	});	

Reborn

tree.reborn(json);

If you used tree.toJSON() to generate the JSON object and stored its three parts in different collections, you can reassemble them into the original JSON object and call tree.reborn(json) to rebuild the tree.

For example, using the MongoDB native driver:

	db.collection("header").find().toArray(function(err, headers) {
		if(err) throw err;
		db.collection("documents").find().toArray(function(err, documents) {
			if(err) throw err;
			db.collection("tree").find().toArray(function(err, tree) {
				if(err) throw err;
				var json = {};
				json.header = headers[0];  // there should be only one header.
				json.documents = documents;
				json.tree = tree;

				// PATtree is the constructor returned by require("pat-tree")
				var patTree = new PATtree();
				patTree.reborn(json);
			});
		});
	});

The patTree object is now in the same state as when it was stored, and you can perform any operation on it, such as patTree.addDocument(doc).
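The reassembly step can be kept separate from the database calls. The helper below is a sketch, not part of pat-tree; its name, assembleTreeJSON, is hypothetical. It takes the three arrays returned by the queries and checks the "exactly one header" expectation before building the object that tree.reborn() expects.

```javascript
// Hypothetical helper (not part of pat-tree): rebuild the object produced by
// tree.toJSON() from the three arrays read back from storage.
function assembleTreeJSON(headers, documents, treeNodes) {
	if (!Array.isArray(headers) || headers.length !== 1) {
		throw new Error("expected exactly one header, got " +
			(headers ? headers.length : 0));
	}
	return {
		header: headers[0],
		documents: documents,
		tree: treeNodes
	};
}

// Placeholder data standing in for query results; the real contents of
// header, documents and tree are whatever tree.toJSON() produced.
var json = assembleTreeJSON([{ placeholder: true }], [], []);
console.log(Object.keys(json)); // [ 'header', 'documents', 'tree' ]
```

Failing early when zero or several headers come back makes it easier to notice a collection that was not dropped before the tree was stored again.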

CAUTION If you rebuild the tree by the above method, perform operations such as patTree.addDocument(doc), and then want to store the tree back to the database as illustrated in Convert to JSON, you MUST first drop the collections (header, documents, tree) in the database so that no previously stored records remain.

Additional functions

Print tree content

tree.printTreeContent(printExternalNodes, printDocuments);

Print the content of the tree to the console. If printExternalNodes is set to true, all external nodes are printed for each internal node. If printDocuments is set to true, the whole collection of the tree is printed.

Traversal

tree.traverse(preCallback, inCallback, postCallback);

For convenience, there is a function for each traversal order:

tree.preOrderTraverse(callback);
tree.inOrderTraverse(callback);
tree.postOrderTraverse(callback);

For example:

tree.preOrderTraverse(function(node) {
	console.log("node id: " + node.id);
})
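The three callbacks correspond to the three points at which a binary-tree walk can visit a node. The sketch below shows that ordering on plain objects with left and right children; it is a generic traversal written for illustration, not the library's own code, and the small root tree is made up.

```javascript
// Generic binary-tree walk (illustrative; not pat-tree's implementation).
// Each callback is optional and receives the node being visited.
function traverse(node, preCallback, inCallback, postCallback) {
	if (!node) return;
	if (preCallback) preCallback(node);   // before both subtrees
	traverse(node.left, preCallback, inCallback, postCallback);
	if (inCallback) inCallback(node);     // between left and right subtree
	traverse(node.right, preCallback, inCallback, postCallback);
	if (postCallback) postCallback(node); // after both subtrees
}

// A made-up three-node tree.
var root = { id: 1, left: { id: 2 }, right: { id: 3 } };

var pre = [], ino = [], post = [];
traverse(
	root,
	function(n) { pre.push(n.id); },
	function(n) { ino.push(n.id); },
	function(n) { post.push(n.id); }
);
console.log(pre, ino, post); // [ 1, 2, 3 ] [ 2, 1, 3 ] [ 2, 3, 1 ]
```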

Data type

Node

Every node carries some common information. A node has the following structure:

	node = {
		id: 3,        // the id of this node, data type: integer, auto generated.
		parent: 1,    // the parent id of this node, data type: integer
		left: leftChildNode,      // data type: Node 
		right: rightChildNode,    // data type: Node
	}

The remaining attributes differ between internal nodes and external nodes.

Internal nodes

Internal nodes have the following structure:

	internalNode = {
		// ...

		type: "internal",
		// indicates this is an internal node
		position: 13,
		// the branch position of external nodes, data type: integer
		prefix: "00101",
		// the shared prefix of external nodes, data type: string of 0s and 1s
		externalNodeNum: 87,
		// number of external nodes contained in the subtree of this node,
		// data type: integer
		totalFrequency: 89,
		// total frequency of the external nodes in the collection,
		// data type: integer
		sistringRepres: node
		// one of the external nodes in the subtree of this internal node,
		// data type: Node
	}

External nodes

External nodes have the following structure:

	externalNode = {
		// ...

		type: "external",
		// indicates this is an external node
		sistring: "00101100110101",
		// binary representation of the character, data type: string
		indexes: ["0.1.3", "1.2.5"]
		// the positions where the sistring appears in the collection,
		// data type: array
	}
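The prefix attribute of an internal node is described as the prefix shared by the external nodes beneath it. A minimal sketch of computing such a shared prefix from a list of binary sistrings follows; the helper name sharedPrefix and the sample strings are assumptions made for illustration, and the library may compute the value differently.

```javascript
// Hypothetical helper: longest common prefix of a list of binary strings.
function sharedPrefix(sistrings) {
	if (sistrings.length === 0) return "";
	var prefix = sistrings[0];
	for (var i = 1; i < sistrings.length; i++) {
		var j = 0;
		while (j < prefix.length && j < sistrings[i].length &&
				prefix[j] === sistrings[i][j]) {
			j++;
		}
		// Shrink the running prefix to the part that still matches.
		prefix = prefix.slice(0, j);
	}
	return prefix;
}

// Made-up sistrings that agree on their first five bits.
var prefix = sharedPrefix(["00101100110101", "00101011", "0010100"]);
console.log(prefix); // 00101
```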

Collection

The whole collection consists of documents, which consist of sentences, which consist of words. For example:

	[ [ '嗨你好',
	    '這是測試文件' ],
	  [ '你好',
	    '這是另外一個測試文件' ] ]

An index has the following structure:

DocumentPosition.SentencePosition.wordPosition

For example, "0.1.2" is the index of the character "測".
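An index can be resolved against the collection by splitting it on the dots. The helper below is a sketch for illustration; lookupIndex is a hypothetical name and not a pat-tree function. It uses the example collection shown above.

```javascript
// The example collection: documents -> sentences -> characters.
var collection = [
	["嗨你好", "這是測試文件"],
	["你好", "這是另外一個測試文件"]
];

// Hypothetical helper: resolve "document.sentence.word" to a character.
function lookupIndex(collection, index) {
	var parts = index.split(".").map(Number);
	if (parts.length !== 3 || parts.some(isNaN)) {
		throw new Error("malformed index: " + index);
	}
	var sentence = collection[parts[0]][parts[1]];
	// Array.from splits by code point, so characters outside the BMP
	// still count as one position.
	return Array.from(sentence)[parts[2]];
}

console.log(lookupIndex(collection, "0.1.2")); // 測
```

All three positions are zero-based: "0.1.2" means the first document, its second sentence, and the third character of that sentence.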

Release History

  • 0.2.7 Fix bug in reborn()
  • 0.2.6 Greatly improve performance of extractSLP()
  • 0.2.5 Greatly improve performance of addDocument()
  • 0.2.4 Fix bug in reborn()
  • 0.2.3 Add functions toJSON() and reborn()
  • 0.2.2 Change function name of splitDoc to segmentDoc
  • 0.2.1 Modify README file
  • 0.2.0 Add text segmentation functionality
  • 0.1.8 Alter algorithm, improve simplicity
  • 0.1.7 Improve performance
  • 0.1.6 Improve performance
  • 0.1.5 Add functionality of SLP extraction
  • 0.1.4 Add external node number and term frequency to internal nodes
  • 0.1.3 Able to restore Chinese characters
  • 0.1.2 Construction complete
  • 0.1.1 First release

Package last updated on 13 Dec 2014
