PAT tree construction for Chinese documents, keyword extraction and text segmentation
PAT tree construction for Chinese documents, currently in development. The module provides functions to add documents and construct a PAT tree in memory, store the tree in a database, extract keywords, and segment documents.
Example of result:
有時 喜歡 有時候 不喜歡
為什麼 會 這樣 … ?
20 點 求 解 哈哈
This project is still in development and is used for academic purposes. DO NOT use this module in production until the WARNING statement is removed. //TODO: improve document splitting algorithm
npm install pat-tree --save
var PATtree = require("pat-tree");
var tree = new PATtree();
tree.addDocument(doc);
var SLPs = tree.extractSLP(TFThreshold, SEThreshold, verbose);
// SLPs: array of strings, which are significant lexical patterns.
If the frequency of a pattern exceeds TFThreshold and its SE value exceeds SEThreshold, the pattern appears in the result array.
verbose: optional; if set to true, progress is printed to the console.
TFThreshold should be an integer; SEThreshold should be a float between 0 and 1.
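The thresholding rule above can be illustrated in isolation. This is only a sketch, not pat-tree's implementation: the candidate objects, their fields (pattern, frequency, se), and the function filterByThresholds are hypothetical names, and "exceeds" is assumed to mean strictly greater than.

```javascript
// Hypothetical helper: keep only patterns that exceed both thresholds.
// pat-tree computes frequency and SE values internally; here they are given.
function filterByThresholds(candidates, TFThreshold, SEThreshold) {
  return candidates
    .filter(function (c) {
      // Assumption: "exceeds" means strictly greater than.
      return c.frequency > TFThreshold && c.se > SEThreshold;
    })
    .map(function (c) { return c.pattern; });
}

// Made-up candidates for illustration only.
var candidates = [
  { pattern: "喜歡", frequency: 12, se: 0.9 },  // passes both thresholds
  { pattern: "有時", frequency: 3, se: 0.95 },  // frequency too low
  { pattern: "歡有", frequency: 15, se: 0.2 }   // SE value too low
];

var kept = filterByThresholds(candidates, 5, 0.5);
console.log(kept);
```

Raising either threshold makes the result stricter, which is why TFThreshold and SEThreshold are usually tuned together.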
var result = tree.segmentDoc(doc, SLPs);
doc is the document to be segmented, data type: string.
SLPs is an array of SLPs extracted by tree.extractSLP(), or an array of keywords retrieved in any other way, data type: array of strings.
result is the result of document segmentation, data type: string.
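This README does not describe how segmentDoc works internally. To give a feel for keyword-driven segmentation, the sketch below uses a plain greedy longest-match strategy, which is a stand-in technique and may differ from what pat-tree does. The function name greedySegment is hypothetical.

```javascript
// Greedy longest-match segmentation (illustrative stand-in, not pat-tree's code).
function greedySegment(doc, keywords) {
  // Try longer keywords first so the longest match wins at each position.
  var sorted = keywords.slice().sort(function (a, b) {
    return b.length - a.length;
  });
  var pieces = [];
  var i = 0;
  while (i < doc.length) {
    var match = null;
    for (var k = 0; k < sorted.length; k++) {
      if (sorted[k].length > 0 && doc.startsWith(sorted[k], i)) {
        match = sorted[k];
        break;
      }
    }
    // No keyword starts here: fall back to a single character.
    if (match === null) match = doc[i];
    pieces.push(match);
    i += match.length;
  }
  return pieces.join(" ");
}

var segmented = greedySegment(
  "有時喜歡有時候不喜歡",
  ["有時", "喜歡", "有時候", "不喜歡"]
);
console.log(segmented);
```

With these keywords the sketch yields the same spacing as the first line of the example result near the top of this README.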
var json = tree.toJSON();
The resulting json has the following three properties:
json.header: JSON object
json.documents: array
json.tree: array
You can store them in a database and use tree.reborn() to generate the tree again.
In a NoSQL database, you can store the three items in separate collections. The header collection would contain exactly one document, while the documents and tree collections would contain many documents.
For example, if using the MongoDB native driver:
var json = tree.toJSON();
// One header object would be stored to database
db.collection("header").insert(json.header, function(err, result) {
if(err) throw err;
});
// All documents would be stored to database
db.collection("documents").insert(json.documents, function(err, result) {
if(err) throw err;
});
// All nodes of the tree would be stored to database
db.collection("tree").insert(json.tree, function(err, result) {
if(err) throw err;
});
tree.reborn(json);
If you used tree.toJSON() to generate the JSON object and stored its three parts in different collections, you can reassemble them into the original JSON object and use tree.reborn(json) to reborn the tree.
For example, if using MongoDB native driver:
db.collection("header").find().toArray(function(err, headers) {
  if(err) throw err;
  db.collection("documents").find().toArray(function(err, documents) {
    if(err) throw err;
    db.collection("tree").find().toArray(function(err, tree) {
      if(err) throw err;
      var json = {};
      json.header = headers[0]; // there should be only one header.
      json.documents = documents;
      json.tree = tree;
      var patTree = new PATtree(); // same constructor name as in require("pat-tree") above
      patTree.reborn(json);
    });
  });
});
The patTree object is now in the same state as when it was stored, and you can perform any operation on it, such as patTree.addDocument(doc).
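The reassembly step can also be pulled out of the nested callbacks. The sketch below is a hypothetical helper (assembleJson is not part of pat-tree) that rebuilds the object passed to reborn() from the three arrays read back from storage, and fails loudly if the header collection does not hold exactly one record. The sample records are placeholders; the real shapes are whatever tree.toJSON() produced.

```javascript
// Hypothetical helper: rebuild the { header, documents, tree } object
// from the three arrays read back from storage.
function assembleJson(headers, documents, treeNodes) {
  if (headers.length !== 1) {
    // More than one header usually means stale records were left behind.
    throw new Error("expected exactly one header, found " + headers.length);
  }
  return { header: headers[0], documents: documents, tree: treeNodes };
}

// Placeholder records for illustration only.
var json = assembleJson(
  [{ placeholder: "header" }],
  [{ placeholder: "document" }],
  [{ placeholder: "node" }]
);
console.log(Object.keys(json));
```

The result would then be handed to patTree.reborn(json) as in the example above.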
CAUTION: If you reborn the tree by the above method, perform operations such as patTree.addDocument(doc), and then want to store the tree back to the database as illustrated in Convert to JSON, you MUST first drop the collections (header, documents, tree) in the database so that no previously stored records remain.
tree.printTreeContent(printExternalNodes, printDocuments);
Prints the content of the tree to the console.
If printExternalNodes is set to true, all external nodes are printed for each internal node.
If printDocuments is set to true, the whole collection of the tree is printed.
tree.traverse(preCallback, inCallback, postCallback);
For convenience, there is a function for each order of traversal:
tree.preOrderTraverse(callback);
tree.inOrderTraverse(callback);
tree.postOrderTraverse(callback);
For example:
tree.preOrderTraverse(function(node) {
  console.log("node id: " + node.id);
});
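To show how the three callbacks relate to the three traversal orders, here is a stand-alone sketch over plain objects shaped like { id, left, right }. It is not pat-tree's source code; the free function traverse and the sample tree are assumptions made for illustration.

```javascript
// Stand-alone sketch of a three-callback binary tree traversal.
function traverse(node, preCallback, inCallback, postCallback) {
  if (!node) return;
  if (preCallback) preCallback(node);   // before visiting children
  traverse(node.left, preCallback, inCallback, postCallback);
  if (inCallback) inCallback(node);     // between left and right child
  traverse(node.right, preCallback, inCallback, postCallback);
  if (postCallback) postCallback(node); // after visiting children
}

// Sample tree: node 0 with children 1 (left) and 2 (right).
var root = {
  id: 0,
  left: { id: 1, left: null, right: null },
  right: { id: 2, left: null, right: null }
};

var order = { preOrder: [], inOrder: [], postOrder: [] };
traverse(
  root,
  function (n) { order.preOrder.push(n.id); },
  function (n) { order.inOrder.push(n.id); },
  function (n) { order.postOrder.push(n.id); }
);
console.log(order);
```

Passing only one of the callbacks gives the behavior of the corresponding single-order function.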
Every node has some common information. A node has the following structure:
node = {
id: 3, // the id of this node, data type: integer, auto generated.
parent: 1, // the parent id of this node, data type: integer
left: leftChildNode, // data type: Node
right: rightChildNode, // data type: Node
}
Other attributes differ between internal nodes and external nodes. Internal nodes have the following structure:
internalNode = {
// ...
type: "internal",
// indicates this is an internal node
position: 13,
// the branch position of external nodes, data type: integer
prefix: "00101",
// the sharing prefix of external nodes, data type: string of 0s and 1s
externalNodeNum: 87,
// number of external nodes contained in subtree of this node,
// data type: integer
totalFrequency: 89,
// total frequency of the external nodes in the collection,
// data type: integer
sistringRepres: node
// one of the external nodes in the subtree of this internal node,
// data type: Node
}
External nodes have the following structure:
externalNode = {
// ...
type: "external",
// indicates this is an external node,
sistring: "00101100110101",
// binary representation of the character, data type: string
indexes: ["0.1.3", "1.2.5"]
// the positions where the sistring appears in the collection,
// data type: array
}
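The aggregate fields of an internal node can be related to the external nodes below it. The sketch below recomputes externalNodeNum and totalFrequency for a subtree of plain objects. It rests on an assumption that this README does not confirm: that the frequency of an external node equals the number of entries in its indexes array. The function summarize and the sample subtree are hypothetical.

```javascript
// Sketch: count external nodes and sum their frequencies in a subtree.
// Assumption: an external node's frequency is indexes.length.
function summarize(node) {
  if (!node) return { externalNodeNum: 0, totalFrequency: 0 };
  if (node.type === "external") {
    return { externalNodeNum: 1, totalFrequency: node.indexes.length };
  }
  var left = summarize(node.left);
  var right = summarize(node.right);
  return {
    externalNodeNum: left.externalNodeNum + right.externalNodeNum,
    totalFrequency: left.totalFrequency + right.totalFrequency
  };
}

// Made-up subtree: one internal node with two external children.
var subtree = {
  type: "internal",
  left: { type: "external", indexes: ["0.1.3", "1.2.5"] },
  right: { type: "external", indexes: ["0.0.1"] }
};

var summary = summarize(subtree);
console.log(summary);
```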
The whole collection consists of documents, which consist of sentences, which consist of words. For example:
[ [ '嗨你好',
'這是測試文件' ],
[ '你好',
'這是另外一個測試文件' ] ]
An index has the following structure:
DocumentPosition.SentencePosition.WordPosition
For example, "0.1.2" is the index of the character "測".
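The index format can be checked against the example collection with a few lines of code. The helper lookupIndex below is hypothetical and not part of pat-tree; it simply splits an index string into its three positions and follows them into a nested array like the one shown above.

```javascript
// The example collection from above: documents -> sentences -> characters.
var collection = [
  ["嗨你好", "這是測試文件"],
  ["你好", "這是另外一個測試文件"]
];

// Hypothetical helper: resolve "document.sentence.word" to a character.
function lookupIndex(collection, index) {
  var parts = index.split(".").map(Number);
  if (parts.length !== 3 || parts.some(isNaN)) {
    throw new Error("malformed index: " + index);
  }
  var documentPosition = parts[0];
  var sentencePosition = parts[1];
  var wordPosition = parts[2];
  return collection[documentPosition][sentencePosition][wordPosition];
}

var found = lookupIndex(collection, "0.1.2");
console.log(found);
```

All three positions are zero-based, so "0.1.2" reads as: first document, second sentence, third character.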
reborn()
extractSLP()
addDocument()
reborn()
toJSON()
and reborn()