
Security News
Researcher Exposes Zero-Day Clickjacking Vulnerabilities in Major Password Managers
Hacker Demonstrates How Easy It Is To Steal Data From Popular Password Managers
Author:: Paul Dix (mailto:paul@pauldix.net)
=Summary This is Daniel DeLeo's fork of Paul Dix's basset[http://github.com/pauldix/basset/], a library for machine learning.
Basset includes a generic document representation class, a feature selector, a feature extractor, naive bayes and SVM classifiers, and a classification evaluator for running tests. The goal is to create a general framework that is easy to modify for specific problems. It is designed to be extensible so it should be easy to add more classification and clustering algorithms.
==Additions I have added lots of tests (TATFT![http://smartic.us/2008/8/15/tatft-i-feel-a-revolution-coming-on]), a document class for URIs, a high level classifier class, SVM support using libsvm-ruby-swig[http://github.com/tomz/libsvm-ruby-swig] and an anomaly detector (one-class classifier). I have also modified the naive bayes classifier so that it does not require you to use the feature selector and extractor. This should help you to get encouraging results when you first start. Once you have a large training dataset, you can use the feature selector and extractor for better performance.
=Usage The most popular task is spam identification, though there are many others, such as document retrieval, security/intrusion detection, and applications in Biology and Astrophysics.
To build a classifier, you'll first need a set of training documents. For a spam/non-spam classifier, this would consist of a number of documents which you have labeled as either spam or not. With training sets, bigger is better. You should probably have at least 100 of each type (spam and not spam). Really 1,000 of each type would be better and 10,000 of each would be super sweet.
==Simple Example The Classifier class takes care of all of the messy details for you. It defaults to a naive bayes classifier and using the Document (best for western natural language) class to represent documents.
# This example based on the song "Losing My Edge" by LCD Soundsystem
classifier = Basset::Classifier.new #default options
classifier.train(:hip, "turntables", "techno music", "DJs with turntables", "techno DJs")
classifier.train(:unhip, "rock music", "guitar bass drums", "guitar rock", "guitar players")
classifier.classify("guitar music")
=> :unhip
# now everyone likes rock music again! retrain the classifier fast!
classifier.train_iterative(:hip, "guitars") # takes 3 iterations
classifier.classify("guitars")
=> :hip
==Full Control For more control over the various stages of the training and classification process, you can create document, feature selector, feature extractor, and document classifier objects directly. The process is as follows:
That represents the process of selecting features and training the classifier. Once you've done that you can predict if a new previously unseen document is spam or not by just doing the following:
Something that you'll probably want to do before doing real classification is to test things. Use the ClassificationEvaluator for this. Using the evaluator you can pass your training documents in and have it run through a series of tests to give you an estimate of how successful the classifier will be at predicting unseen documents. Easy classification tasks will generally be > 90% accurate while others can be much harder. Each classification task is different and most of the time you won't know until you actually test it out.
=Contact I love machine learning and classification so if you have a problem that is giving you trouble don't hesitate to get a hold of me. The same applies for anyone who wants to write additional classifiers, better document representations, or just to tell my my code is amateur.
Author:: Paul Dix (mailto:paul@pauldix.net) Site:: http://www.pauldix.net Freenode:: pauldix in #nyc.rb
FAQs
Unknown package
We found that danielsdeleo-basset demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 3 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Hacker Demonstrates How Easy It Is To Steal Data From Popular Password Managers
Security News
Oxlint’s new preview brings type-aware linting powered by typescript-go, combining advanced TypeScript rules with native-speed performance.
Security News
A new site reviews software projects to reveal if they’re truly FOSS, making complex licensing and distribution models easy to understand.