pat-tree

PAT tree construction for Chinese documents

PAT tree construction for Chinese documents, currently in development. It provides functionality to add documents and construct a PAT tree in memory, extract keywords, and split documents.

Example of a split result:

有時 喜歡   有時候 不喜歡
為什麼 會 這樣   … ?
20 點 求 解 哈哈

WARNING

This project is still in development and is used for academic purposes. DO NOT use this module until this WARNING statement is removed.

TODO: improve the document splitting algorithm

Installation

npm install pat-tree --save

Usage

Init

var PATtree = require("pat-tree");
var tree = new PATtree();

Add document

tree.addDocument(input);

Extract Significant Lexical Patterns

var SLPs = tree.extractSLP(TFThreshold, SEThreshold); // SLPs: array of significant lexical patterns.

If the frequency of a pattern exceeds TFThreshold and its SE value exceeds SEThreshold, the pattern appears in the result array.

TFThreshold should be an integer; SEThreshold should be between 0 and 1.
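
As a rough illustration of how the two thresholds combine, the filter below keeps only candidate patterns whose frequency and SE value both exceed their thresholds. The `filterPatterns` helper, the `candidates` array, and its `frequency` and `se` fields are hypothetical and not part of pat-tree, which computes these statistics internally; the numbers are made up.

```javascript
// Hypothetical helper, not part of pat-tree: applies the two thresholds
// to a list of candidate patterns whose statistics are already known.
function filterPatterns(candidates, TFThreshold, SEThreshold) {
	if (!Number.isInteger(TFThreshold)) {
		throw new TypeError("TFThreshold should be an integer");
	}
	if (SEThreshold < 0 || SEThreshold > 1) {
		throw new RangeError("SEThreshold should be between 0 and 1");
	}
	return candidates
		.filter(function (c) {
			// Both conditions must hold for a pattern to be kept.
			return c.frequency > TFThreshold && c.se > SEThreshold;
		})
		.map(function (c) {
			return c.pattern;
		});
}

// Made-up statistics, for illustration only.
var candidates = [
	{ pattern: "測試文件", frequency: 5, se: 0.9 },
	{ pattern: "試文", frequency: 5, se: 0.2 },
	{ pattern: "你好", frequency: 1, se: 0.8 }
];
var kept = filterPatterns(candidates, 2, 0.5);
```

Here only the first candidate passes both checks: the second fails the SE threshold and the third fails the frequency threshold.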

Split document

var result = tree.splitDoc(doc, SLPs); 

doc is the document to be split, data type: string.

SLPs is an array of SLPs extracted by tree.extractSLP(), or an array of keywords obtained in any other way, data type: array of strings.

result is the split document, data type: string.
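
To give a feel for keyword-driven splitting, here is a minimal standalone sketch that uses greedy longest-match segmentation. It is not pat-tree's algorithm (which the README does not describe); `splitByKeywords` is a hypothetical name, and the choice to fall back to single characters and to join segments with spaces is an assumption.

```javascript
// Hypothetical sketch, not pat-tree's implementation: greedy longest-match
// segmentation of a string against a list of keywords.
function splitByKeywords(doc, keywords) {
	// Try longer keywords first so "有時候" wins over "有時".
	var sorted = keywords.slice().sort(function (a, b) {
		return b.length - a.length;
	});
	var pieces = [];
	var i = 0;
	while (i < doc.length) {
		var match = null;
		for (var k = 0; k < sorted.length; k++) {
			if (sorted[k].length > 0 && doc.startsWith(sorted[k], i)) {
				match = sorted[k];
				break;
			}
		}
		// Fall back to a single character when no keyword matches here.
		var piece = match !== null ? match : doc[i];
		pieces.push(piece);
		i += piece.length;
	}
	return pieces.join(" ");
}

var result = splitByKeywords("有時候不喜歡", ["有時", "有時候", "喜歡", "不喜歡"]);
// result is "有時候 不喜歡"
```

The single-character fallback works on UTF-16 code units, so characters outside the Basic Multilingual Plane would need extra handling.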

Additional functions

Print tree content

tree.printTreeContent(printExternalNodes, printDocuments);

Prints the content of the tree to the console. If printExternalNodes is set to true, all external nodes are printed for each internal node. If printDocuments is set to true, the whole collection stored in the tree is printed.

Traversal

tree.traverse(preCallback, inCallback, postCallback);

For convenience, there is also a function for each traversal order:

tree.preOrderTraverse(callback);
tree.inOrderTraverse(callback);
tree.postOrderTraverse(callback);

For example:

tree.preOrderTraverse(function(node) {
	console.log("node id: " + node.id);
})
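
To show how three callbacks can map onto a single pass over a binary tree, here is a standalone sketch on plain objects. The `traverse` function and the sample tree below are hypothetical; pat-tree's own traversal may differ in details such as the arguments passed to each callback.

```javascript
// Standalone sketch, not pat-tree's code: one recursive pass that calls
// each callback at the point matching its traversal order.
function traverse(node, preCallback, inCallback, postCallback) {
	if (!node) {
		return;
	}
	if (preCallback) preCallback(node);   // before both subtrees
	traverse(node.left, preCallback, inCallback, postCallback);
	if (inCallback) inCallback(node);     // between the subtrees
	traverse(node.right, preCallback, inCallback, postCallback);
	if (postCallback) postCallback(node); // after both subtrees
}

// A tiny hand-built tree: node 1 has children 2 (left) and 3 (right).
var root = {
	id: 1,
	left: { id: 2, parent: 1 },
	right: { id: 3, parent: 1 }
};

var pre = [], inOrder = [], post = [];
traverse(
	root,
	function (n) { pre.push(n.id); },
	function (n) { inOrder.push(n.id); },
	function (n) { post.push(n.id); }
);
// pre is [1, 2, 3], inOrder is [2, 1, 3], post is [2, 3, 1]
```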

Data type

Node

All nodes share some common information. A node has the following structure:

node = {
	id: 3, // the id of this node, data type: integer, auto generated.
	parent: 1, // the parent id of this node, data type: integer
	left: leftChildNode, // data type: Node 
	right: rightChildNode, // data type: Node
	data: {} // payload for this node, data type : JSON
}

The data payload differs between internal nodes and external nodes.

Internal nodes

Internal nodes have the following structure:

internalNode.data = {
	type: "internal",  // indicates this is an internal node
	position: 13, // the branch position of external nodes, data type: integer
	prefix: "00101", // the sharing prefix of external nodes, data type: string of 0s and 1s
	externalNodeNum: 87, // number of external nodes contained in subtree of this node, data type: integer
	totalFrequency: 89, // total frequency of the external nodes in the collection, data type: integer
	sistringRepres: node // one of the external nodes in the subtree of this internal node, data type: Node
}

External nodes

External nodes have the following structure:

externalNode.data = {
	type: "external", // indicates this is an external node
	sistring: "00101100110101", // binary representation of the sistring, data type: string of 0s and 1s
	indexes: ["0.1.3", "1.2.5"] // the positions where the sistring appears in the collection, data type: array of strings
}
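
The README does not spell out how characters are turned into the binary string stored in sistring. One plausible scheme, assumed here purely for illustration, is to write each UTF-16 code unit as 16 bits. `toBinary` and `fromBinary` are hypothetical names, and pat-tree's real encoding may differ.

```javascript
// Assumption: 16 bits per UTF-16 code unit. pat-tree's real encoding may differ.
function toBinary(text) {
	var bits = "";
	for (var i = 0; i < text.length; i++) {
		// Pad every code unit to a fixed width so it can be decoded again.
		bits += text.charCodeAt(i).toString(2).padStart(16, "0");
	}
	return bits;
}

function fromBinary(bits) {
	var text = "";
	// Read the string back in 16-bit chunks; a trailing partial chunk is ignored.
	for (var i = 0; i + 16 <= bits.length; i += 16) {
		text += String.fromCharCode(parseInt(bits.slice(i, i + 16), 2));
	}
	return text;
}

var encoded = toBinary("你好");    // 32 characters of 0s and 1s
var decoded = fromBinary(encoded); // "你好"
```

A fixed width per character keeps the mapping reversible, which is what makes it possible to restore the original Chinese text from a binary prefix.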

Collection

The whole collection consists of documents, each of which consists of sentences, each of which consists of words. For example:

[ [ '嗨你好',
	'這是測試文件' ],
  [ '你好',
	'這是另外一個測試文件' ] ]

An index has the following structure:

DocumentPosition.SentencePosition.WordPosition

For example, "0.1.2" is the index of the character "測".
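
The lookup can be sketched in a few lines against the example collection. The `lookup` helper is hypothetical and not part of pat-tree; it only shows how the three positions in an index address a character.

```javascript
// The example collection from above: documents -> sentences -> characters.
var collection = [
	["嗨你好", "這是測試文件"],
	["你好", "這是另外一個測試文件"]
];

// Hypothetical helper, not part of pat-tree: resolve an index string
// of the form "document.sentence.word" to the character it points at.
function lookup(collection, index) {
	var parts = index.split(".").map(Number);
	if (parts.length !== 3 || parts.some(Number.isNaN)) {
		throw new Error("Malformed index: " + index);
	}
	var sentence = (collection[parts[0]] || [])[parts[1]];
	return sentence === undefined ? undefined : sentence[parts[2]];
}

var ch = lookup(collection, "0.1.2"); // "測"
```

Positions that fall outside the collection resolve to undefined rather than throwing, while an index without exactly three numeric parts is rejected.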

Release History

  • 0.2.0 Add document splitting functionality
  • 0.1.8 Alter algorithm, improve simplicity
  • 0.1.7 Improve performance
  • 0.1.6 Improve performance
  • 0.1.5 Add functionality of SLP extraction
  • 0.1.4 Add external node number and term frequency to internal nodes
  • 0.1.3 Able to restore Chinese characters
  • 0.1.2 Construction complete
  • 0.1.1 First release

Package last updated on 05 Dec 2014
