Matilda is a webscale inference toolkit.
This JavaScript library makes it easy to perform inference on statistical topic models anywhere. At present, Matilda performs Latent Dirichlet Allocation (LDA) by way of Markov Chain Monte Carlo (MCMC).
This project is alpha.
This is not a production library. Do not rely on it.
The interface and functionality will change drastically from week to week.
Proceed at your own risk.
This is a standalone library.
Matilda is a fluently structured toolkit styled after d3.js.
The basic unit of Matilda is a Matilda model.
var mM = new Matilda.Model();
To train a Matilda model, you must first add documents with addDocument. The addDocument method takes an array of words, but the words need not be strings: a Matilda model is representation agnostic, so any values will work as long as they are in an array. The items in the array are the features that the model will be trained on.
mM.addDocument(['Matilda','said','Never','do','anything','by','halves',
'if','you','want','to','get','away','with','it',
'Be','outrageous','Go','the','whole','hog','Make',
'sure','everything','you','do','is','so','completely',
'crazy','it','s','unbelievable']);
mM.addDocument(['When','the','earlier','Infantry','Tank','Mark','I',
'which','was','also','known','as','Matilda','was',
'removed','from','service','the','Infantry',
'Tank','Mk','II','became','known','simply','as','the','Matilda' ]);
mM.addDocument(['When','war','was','recognised','as','imminent','production','of','the',
'Matilda','II','was','ordered','and','that','of','the','Matilda',
'I','curtailed','The','first','order','was','placed','shortly','after',
'trials','were','completed','with','140','ordered','from','Vulcan',
'Foundry','in','mid','1938' ]);
mM.addDocument([ 'So','Matilda','s','strong','young','mind','continued','to',
'grow','nurtured','by','the','voices',
'of','all','those','authors','who',
'had','sent','their','books','out',
'into','the','world','like','ships','on','the','sea',
'These','books','gave','Matilda','a','hopeful',
'and','comforting','message','You','are','not','alone' ]);
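Because the model is representation agnostic, features can be numbers or other values as well as strings. One way such features might be mapped to integer ids is sketched below. The helper `buildVocabulary` is hypothetical and not part of Matilda's API; it only illustrates the idea.

```javascript
// Hypothetical helper, not part of Matilda: assigns an integer id to each
// distinct feature. A Map accepts any value as a key, so strings, numbers,
// and object references can all serve as features.
function buildVocabulary(documents) {
  const ids = new Map();
  const encoded = documents.map((doc) =>
    doc.map((feature) => {
      if (!ids.has(feature)) ids.set(feature, ids.size);
      return ids.get(feature);
    })
  );
  return { ids, encoded };
}

const vocabDemo = buildVocabulary([
  ['Matilda', 'said', 'Matilda'],
  [1938, 'Matilda', 140],
]);
console.log(vocabDemo.encoded); // [ [ 0, 1, 0 ], [ 2, 0, 3 ] ]
```

Each document becomes an array of ids, and repeated features share the same id regardless of their type.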
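A model can also take an array of arrays, with one inner array per document. In the example below, document1 and document2 are assumed to be word arrays defined in the same way as document3.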
var document3 = ['Matilda','said','Never','do','anything','by','halves',
'if','you','want','to','get','away','with','it',
'Be','outrageous','Go','the','whole','hog','Make',
'sure','everything','you','do','is','so','completely',
'crazy','it','s','unbelievable'];
mM.addDocument([document1, document2, document3]);
Callbacks are also supported.
When sending multiple documents via addDocument, the callback runs after each individual document is inserted into the model. The callback receives an object containing the model data, and the current document.
var arrayOfArrays = [document1, document2, document3];
mM.addDocument(arrayOfArrays,
function(dataObject, curDoc) {
console.log(dataObject.vocab);
console.log(dataObject.topics);
console.log(dataObject.documents);
});
For natural-language pre-processing of words, NaturalNode is highly recommended.
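The word arrays used in the examples can be produced from raw text with a simple tokenizer. The sketch below uses plain JavaScript and assumes that splitting on anything other than letters and digits is good enough. The `tokenize` helper is hypothetical and belongs to neither Matilda nor NaturalNode; a dedicated library will handle more cases.

```javascript
// Hypothetical helper, not part of Matilda or NaturalNode: splits raw text
// into word tokens on any run of characters that is not a letter or digit.
function tokenize(text) {
  return text.split(/[^A-Za-z0-9]+/).filter((token) => token.length > 0);
}

const tokens = tokenize("So crazy it's unbelievable.");
console.log(tokens); // [ 'So', 'crazy', 'it', 's', 'unbelievable' ]
```

The result can be passed straight to addDocument.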
Now that the documents have been added, you can train your model. By default, models are set to five topics. You can override this default with the setNumberOfTopics method.
mM.setNumberOfTopics(3);
Everything is now ready for training. The train method takes the number of iterations to run. At least 50 iterations are recommended, but for this simple example 5 will do.
mM.train(5);
WARNING: Do not call setNumberOfTopics after training. Changing the model's topic count after training erases all training.
The train method also takes a callback which is called after every iteration of the training. The callback receives an object containing the topics, vocabulary, and document data of the model.
mM.train(5, function(modelData){
console.log(modelData.vocab);
console.log(modelData.topics);
console.log(modelData.documents);
});
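To give a feel for what MCMC training of an LDA model can involve, here is a minimal collapsed Gibbs sampler with a per-iteration callback. This is an illustrative sketch and not Matilda's source: the names `createSampler` and `train`, the fixed `alpha` and `beta` smoothing values, and the data layout are all assumptions.

```javascript
// Illustrative collapsed Gibbs sampler for LDA. NOT Matilda's implementation;
// names, fixed alpha/beta values, and data layout are assumptions.
function createSampler(docs, numTopics, alpha = 0.1, beta = 0.01) {
  const vocab = [...new Set(docs.flat())];
  const wordId = new Map(vocab.map((w, i) => [w, i]));
  const zeros = (n) => new Array(n).fill(0);
  const docTopic = docs.map(() => zeros(numTopics));                  // tokens per document per topic
  const topicWord = zeros(numTopics).map(() => zeros(vocab.length));  // tokens per topic per word
  const topicTotal = zeros(numTopics);                                // tokens per topic

  // Start from a random topic assignment for every token.
  const assignments = docs.map((doc, d) =>
    doc.map((word) => {
      const k = Math.floor(Math.random() * numTopics);
      docTopic[d][k]++;
      topicWord[k][wordId.get(word)]++;
      topicTotal[k]++;
      return k;
    })
  );

  // One sweep: resample the topic of every token in every document.
  function iterate() {
    docs.forEach((doc, d) => {
      doc.forEach((word, i) => {
        const w = wordId.get(word);
        let k = assignments[d][i];
        // Remove the token's current assignment from the counts.
        docTopic[d][k]--;
        topicWord[k][w]--;
        topicTotal[k]--;
        // Unnormalized weight of each topic given all other assignments.
        const weights = zeros(numTopics).map((_, t) =>
          ((docTopic[d][t] + alpha) * (topicWord[t][w] + beta)) /
          (topicTotal[t] + beta * vocab.length));
        // Draw a new topic in proportion to the weights.
        let r = Math.random() * weights.reduce((a, b) => a + b, 0);
        k = 0;
        while (k < numTopics - 1 && r >= weights[k]) {
          r -= weights[k];
          k++;
        }
        assignments[d][i] = k;
        docTopic[d][k]++;
        topicWord[k][w]++;
        topicTotal[k]++;
      });
    });
  }

  return { iterate, docTopic, topicWord, topicTotal, vocab };
}

// Runs the requested number of sweeps, calling back after each one.
function train(sampler, iterations, callback) {
  for (let n = 0; n < iterations; n++) {
    sampler.iterate();
    if (callback) callback(sampler);
  }
}

const sampler = createSampler([['tank', 'war', 'tank'], ['book', 'mind', 'book']], 2);
let calls = 0;
train(sampler, 5, () => { calls++; });
console.log(calls, sampler.topicTotal);
```

Because the sampler is random, the topic counts differ from run to run, which is one reason more iterations are recommended for real data.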
There are a number of features that can be drawn from the training. A Topic by Topic matrix of correlations may be obtained by calling the topicCorrelations method.
mM.topicCorrelations();
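One common way to build a topic-by-topic correlation matrix is to take the Pearson correlation of the topics' proportions across documents. The sketch below shows that approach on made-up proportions. It is an assumption about how such a matrix could be computed, not a description of what topicCorrelations does internally, and the helper names are hypothetical.

```javascript
// Hypothetical sketch, not Matilda's implementation: Pearson correlation
// between every pair of topics, computed over per-document topic proportions.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - meanX) * (ys[i] - meanY);
    varX += (xs[i] - meanX) ** 2;
    varY += (ys[i] - meanY) ** 2;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom; // a constant column is reported as 0
}

function correlationMatrix(proportions) {
  const numTopics = proportions[0].length;
  const column = (t) => proportions.map((row) => row[t]);
  return Array.from({ length: numTopics }, (_, a) =>
    Array.from({ length: numTopics }, (_, b) => pearson(column(a), column(b))));
}

// Rows are documents, columns are topics (made-up proportions).
const matrix = correlationMatrix([
  [0.8, 0.2],
  [0.6, 0.4],
  [0.1, 0.9],
]);
console.log(matrix);
```

With only two topics whose proportions sum to one, the off-diagonal entries come out at roughly -1, since one topic rises exactly as the other falls.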
You can get back the documents containing their respective features, and their topic proportions.
mM.getDocuments();
You can get an object containing all the Topics, and their words.
mM.getTopics();
You can even look over the words themselves and their topic memberships.
mM.getVocabulary();
But maybe you want your data more structured. You can get back your words organized by topic, and sorted by frequency.
mM.getWordsByTopics();
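As an illustration of "organized by topic, and sorted by frequency", the sketch below sorts per-topic word counts from most to least frequent. The input shape (one plain object of counts per topic) and the `wordsByTopic` helper are assumptions for the example, not the documented return format of getWordsByTopics.

```javascript
// Hypothetical sketch, not Matilda's implementation: for each topic, list
// the words assigned to it, most frequent first.
function wordsByTopic(topicWordCounts) {
  return topicWordCounts.map((counts) =>
    Object.entries(counts)
      .filter(([, count]) => count > 0)  // skip words never assigned to the topic
      .sort(([, a], [, b]) => b - a)     // descending by count
      .map(([word, count]) => ({ word, count })));
}

// Made-up counts for two topics.
const sortedTopics = wordsByTopic([
  { war: 2, tank: 4, book: 0 },
  { mind: 1, book: 3 },
]);
console.log(sortedTopics[0].map((entry) => entry.word)); // [ 'tank', 'war' ]
console.log(sortedTopics[1].map((entry) => entry.word)); // [ 'book', 'mind' ]
```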
Or maybe you need to organize your documents by similarity. Just call getSimilarDocuments and pass in the index of one of the documents you've already added to the collection (docIndex below). In return you'll get the documents similar to it.
mM.getSimilarDocuments(docIndex);
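Document similarity in a topic model is often measured by comparing topic proportions. The sketch below ranks documents by cosine similarity to a chosen one, using made-up proportions. The helpers `cosine` and `rankSimilar` are hypothetical, and Matilda may use a different measure internally.

```javascript
// Hypothetical sketch, not Matilda's implementation: rank documents by the
// cosine similarity of their topic proportions to a chosen document.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function rankSimilar(proportions, docIndex) {
  return proportions
    .map((row, index) => ({ index, score: cosine(proportions[docIndex], row) }))
    .filter((entry) => entry.index !== docIndex) // leave out the query document
    .sort((x, y) => y.score - x.score);          // most similar first
}

// Rows are documents, columns are topics (made-up proportions).
const ranked = rankSimilar([
  [0.9, 0.1], // document 0
  [0.2, 0.8], // document 1
  [0.8, 0.2], // document 2
], 0);
console.log(ranked.map((entry) => entry.index)); // [ 2, 1 ]
```

Document 2 ranks first because its topic mix is closest to that of document 0.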
Matilda has been made as modular and unopinionated as possible, and works well with Node libraries and client-side libraries alike.
Combine Matilda with MongoDB and maintain an index of entries sorted by topical similarity. Mix mM.topicCorrelations() with a static blogging engine and compose a topical map of your blog every regeneration. Plug in the Google Analytics API and cluster your customers by behavioral traits. Match it with an email service and fight spam in a whole new way, or just organize your inbox by subject. Feed your forum into a Matilda Model and find out what your community is talking about.
And that's just the beginning.
There are big plans.
The smoothing factors of LDA are at present automated.
Matilda.js is based on LDAjs, Gensim, and Mallet, and inspired by the works of David Mimno, Ted Underwood, David Blei, Roald Dahl, and Sir John Carden.