HAC
HAC stands for Hierarchical Agglomerative Clustering, a common technique for unsupervised document clustering.
Installation
npm install hac --save
Usage
Instantiate
var HAC = require("hac");
var hac = new HAC();
Add documents
hac.addDocument(doc, id, class);
Arguments:
- doc (String or Array): the document to add, either a string of text or an array of terms
- id (String/int, optional): the id of the document. If omitted, a UUID is generated automatically
- class (String/int, optional): the class (or label) of this document. You probably won't need this, but if it is specified, you can call getMeasure() to get the F measure or Rand Index and see how well the clustering performed.
Clustering
hac.cluster(clusterMethod);
Arguments:
- clusterMethod (Class Method): the clustering algorithm to use. The available options are:
  - HAC.GA: group-average agglomerative clustering
  - HAC.SingleLink: single-link clustering
  - HAC.CompleteLink: complete-link clustering
  - HAC.Centroid: centroid clustering (to be implemented)
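To make the difference between these options concrete, here is a minimal sketch of how each linkage criterion scores a candidate merge, followed by a naive merge loop. It does not use the hac package and is not a description of its internals; the similarity matrix and the names singleLink, completeLink, groupAverage, and naiveHac are all hypothetical. Group average here follows one common textbook definition: the mean similarity over all distinct document pairs in the merged cluster, including pairs that were already in the same cluster.

```javascript
// Pairwise similarities for four toy documents (made-up values).
var sim = [
  [1.0, 0.8, 0.1, 0.3],
  [0.8, 1.0, 0.2, 0.4],
  [0.1, 0.2, 1.0, 0.6],
  [0.3, 0.4, 0.6, 1.0]
];

// Similarities of all cross-cluster pairs (one doc from a, one from b).
function crossPairs(a, b) {
  var values = [];
  a.forEach(function (i) {
    b.forEach(function (j) { values.push(sim[i][j]); });
  });
  return values;
}

// Single link: similarity of the two MOST similar members.
function singleLink(a, b) {
  return Math.max.apply(null, crossPairs(a, b));
}

// Complete link: similarity of the two LEAST similar members.
function completeLink(a, b) {
  return Math.min.apply(null, crossPairs(a, b));
}

// Group average: mean over all distinct pairs in the merged cluster.
function groupAverage(a, b) {
  var merged = a.concat(b);
  var sum = 0;
  var count = 0;
  for (var i = 0; i < merged.length; i++) {
    for (var j = i + 1; j < merged.length; j++) {
      sum += sim[merged[i]][merged[j]];
      count++;
    }
  }
  return sum / count;
}

// Naive agglomerative loop: start with singletons and repeatedly merge
// the best-scoring pair of clusters until only k clusters remain.
// Assumes 1 <= k <= n.
function naiveHac(n, k, linkage) {
  var clusters = [];
  for (var d = 0; d < n; d++) clusters.push([d]);
  while (clusters.length > k) {
    var best = { score: -Infinity, i: -1, j: -1 };
    for (var i = 0; i < clusters.length; i++) {
      for (var j = i + 1; j < clusters.length; j++) {
        var score = linkage(clusters[i], clusters[j]);
        if (score > best.score) best = { score: score, i: i, j: j };
      }
    }
    var merged = clusters[best.i].concat(clusters[best.j]);
    clusters.splice(best.j, 1);         // remove the higher index first
    clusters.splice(best.i, 1, merged); // replace the lower index with the merge
  }
  return clusters;
}

console.log(singleLink([0, 1], [2, 3]));   // 0.4
console.log(completeLink([0, 1], [2, 3])); // 0.1
console.log(JSON.stringify(naiveHac(4, 2, groupAverage))); // [[0,1],[2,3]]
```

The loop rescans every cluster pair on each merge, which is fine for a toy example but far too slow for large collections; a real implementation would cache similarities.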
Get clustering result
var clusters = hac.getClusters(k, fields);
Arguments:
- k (int): the number of clusters
- fields (Array): the document fields to include in the clustering result. The available fields are:
  - "id": the id of the document
  - "class": the class (label) of the document, if specified when calling addDocument()
  - "content": the document content as a string
  - "terms": the document content as an array of terms
  - "tfs": array of term frequencies for this document
  - "vector": vector representation of this document
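Judging from the complete example in this README, each returned cluster carries an id and a docs array whose entries hold only the requested fields. The sketch below uses a hand-written stand-in for such a result (not real package output) and a hypothetical helper, docIdsByCluster, to show one way of traversing that structure.

```javascript
// Hand-written stand-in for what getClusters(2, ["id", "content"]) might
// return. The shape is inferred from the README's example, not verified.
var sampleClusters = [
  { id: 7, docs: [{ id: 0, content: "first doc" }, { id: 1, content: "second doc" }] },
  { id: 6, docs: [{ id: 2, content: "third doc" }] }
];

// Hypothetical helper: map each cluster id to the ids of its documents.
function docIdsByCluster(clusters) {
  var result = {};
  clusters.forEach(function (cluster) {
    result[cluster.id] = cluster.docs.map(function (doc) { return doc.id; });
  });
  return result;
}

console.log(JSON.stringify(docIdsByCluster(sampleClusters))); // {"6":[2],"7":[0,1]}
```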
Alternatively, use the following method to get clusters together with cluster labels:
var clusters = hac.getClustersWithLabels(k, fields, featureCount, featureMethod);
Cluster labeling relies on feature selection, which is provided by a module called FeatureSelector.
Arguments:
- k (int): the number of clusters
- fields (Array): array of fields; see the description of getClusters() above
- featureCount (int): the number of feature terms to return for each cluster
- featureMethod (Class Method): the feature selection algorithm to use. The available options are:
  - FeatureSelector.MI: expected mutual information feature selection
  - FeatureSelector.LLR: likelihood ratio feature selection
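As an illustration of what mutual-information-based labeling does, the following sketch scores terms from a 2x2 table of document counts and keeps the best ones. It is a generic formulation written for this README, not FeatureSelector's actual code; the function names mutualInformation and topTerms and all of the counts are hypothetical.

```javascript
// Expected mutual information between "term occurs in document" and
// "document belongs to cluster", computed from a 2x2 table of counts:
//   n11: in cluster, contains term     n10: not in cluster, contains term
//   n01: in cluster, lacks term        n00: not in cluster, lacks term
function mutualInformation(n11, n10, n01, n00) {
  var n = n11 + n10 + n01 + n00;
  // One summand of the MI formula; an empty cell contributes 0.
  function part(cell, rowTotal, colTotal) {
    if (cell === 0) return 0;
    return (cell / n) * Math.log2((n * cell) / (rowTotal * colTotal));
  }
  var withTerm = n11 + n10;
  var withoutTerm = n01 + n00;
  var inCluster = n11 + n01;
  var outCluster = n10 + n00;
  return part(n11, withTerm, inCluster) +
         part(n10, withTerm, outCluster) +
         part(n01, withoutTerm, inCluster) +
         part(n00, withoutTerm, outCluster);
}

// Hypothetical helper: keep the `count` highest-scoring terms as labels.
function topTerms(scores, count) {
  return Object.keys(scores)
    .sort(function (a, b) { return scores[b] - scores[a]; })
    .slice(0, count);
}

var scores = {
  apple: mutualInformation(5, 0, 0, 5),   // term perfectly tracks the cluster
  the: mutualInformation(25, 25, 25, 25), // term is independent of the cluster
  pie: mutualInformation(4, 1, 1, 4)      // term mostly tracks the cluster
};
console.log(topTerms(scores, 2)); // [ 'apple', 'pie' ]
```

A term that appears in exactly the cluster's documents scores 1 bit, while a term spread evenly across all documents scores 0, which is why stop words rarely end up as labels under this criterion.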
Get performance measurement
You can get the F measure or Rand Index for the clustering result.
NOTE: to get performance measurements, you must specify the class argument when calling addDocument(). Also, when calling getClusters() or getClustersWithLabels(), you must include the field "class" in the fields argument.
var measure = getMeasure(clusters, method, beta, showRawScore);
Arguments:
- clusters (Array): the clustering result returned by getClusters() or getClustersWithLabels()
- method (Class Method): the measure to compute. The available options are:
  - HAC.F: F measure
  - HAC.RI: Rand Index
- beta (int, optional): if you use HAC.F, supply a beta value, which should be an integer greater than or equal to 1
- showRawScore (boolean, optional): if set to true, prints the tp, fp, fn, tn, total negative, and total positive counts to the console
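The tp, fp, fn, and tn counts mentioned above are usually defined over pairs of documents: a pair is a true positive when both documents share a cluster and a class, a false positive when they share a cluster but not a class, and so on. The sketch below shows that pair-counting approach and the standard Rand Index and F-beta formulas. It is a standalone illustration under that assumption, not the package's own code; countPairs, randIndex, fMeasure, makeCluster, and the toy data are hypothetical.

```javascript
// Count document pairs by whether they share a cluster and share a class.
// `clusters` mimics the shape of a getClusters() result that includes "class".
function countPairs(clusters) {
  var docs = [];
  clusters.forEach(function (cluster, index) {
    cluster.docs.forEach(function (doc) {
      docs.push({ cluster: index, label: doc["class"] });
    });
  });
  var c = { tp: 0, fp: 0, fn: 0, tn: 0 };
  for (var i = 0; i < docs.length; i++) {
    for (var j = i + 1; j < docs.length; j++) {
      var sameCluster = docs[i].cluster === docs[j].cluster;
      var sameClass = docs[i].label === docs[j].label;
      if (sameCluster && sameClass) c.tp++;
      else if (sameCluster) c.fp++;
      else if (sameClass) c.fn++;
      else c.tn++;
    }
  }
  return c;
}

// Rand Index: fraction of pair decisions that are correct.
function randIndex(c) {
  return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn);
}

// F-beta: weighted harmonic mean of pairwise precision and recall.
// beta > 1 weights recall more heavily. Zero denominators are not handled.
function fMeasure(c, beta) {
  var precision = c.tp / (c.tp + c.fp);
  var recall = c.tp / (c.tp + c.fn);
  var b2 = beta * beta;
  return ((1 + b2) * precision * recall) / (b2 * precision + recall);
}

// Toy data: 17 documents of classes "x", "o", "d" spread over 3 clusters.
function makeCluster(id, labels) {
  return {
    id: id,
    docs: labels.split("").map(function (l) { return { "class": l }; })
  };
}
var toy = [makeCluster(1, "xxxxxo"), makeCluster(2, "xooood"), makeCluster(3, "xxddd")];

var counts = countPairs(toy);
console.log(counts);                         // { tp: 20, fp: 20, fn: 24, tn: 72 }
console.log(randIndex(counts).toFixed(2));   // 0.68
console.log(fMeasure(counts, 1).toFixed(2)); // 0.48
```

This also shows why the "class" field has to be present in the clustering result: without labels on each document there is nothing to compare the cluster assignments against.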
Complete example
var _ = require("lodash"); // assumed: `_` is lodash (underscore also provides _.forEach)
var HAC = require("hac");

var hac = new HAC();
var docs = [];
// A document can be an array of terms...
docs.push(["嗨", "你好"]);
docs.push(["嗨", "很", "高興", "認識", "你"]);
// ...or a plain string of text.
docs.push("hello, how's everything today? is everything ok today?");
docs.push("let's test one more document!");
docs.push("documents are always not large enough");

// Use the array index as the document id.
for (var i = 0; i < docs.length; i++) {
  hac.addDocument(docs[i], i);
}

hac.cluster(HAC.GA);
var clusters = hac.getClusters(2, ["id", "content"]);

_.forEach(clusters, function (cluster) {
  console.log("cluster id: " + cluster.id);
  _.forEach(cluster.docs, function (doc) {
    console.log("doc id: " + doc.id);
    console.log("doc content: " + doc.content);
  });
  console.log(); // blank line between clusters
});
The result would be:
cluster id: 7
doc id: 0
doc content: 嗨,你好
doc id: 1
doc content: 嗨,很,高興,認識,你
doc id: 2
doc content: hello, how's everything today? is everything ok today?
cluster id: 6
doc id: 3
doc content: let's test one more document!
doc id: 4
doc content: documents are always not large enough
Release Notes
- 1.0.4: require es6-shim to support older Node.js engines
- 1.0.3: change arrow functions to anonymous functions for backward compatibility
- 1.0.2: minor changes to README
- 1.0.1: first release