Natural Language Processing
An implementation of selected machine learning algorithms for basic natural language processing in golang. The initial focus for this project is Latent Semantic Analysis to allow retrieval/searching, clustering and classification of text documents based upon semantic content.
Built upon the Gonum library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn.
Check out the companion blog post or the go documentation page for full usage and examples.
Features
Planned
- Ability to persist trained vectorisers
- LDA (Latent Dirichlet Allocation) implementation for topic extraction
- Stemming to treat words with common root as the same e.g. "go" and "going"
- Clustering algorithms e.g. Heirachical, K-means, etc.
- Classification algorithms e.g. SVM, random forest, etc.
References
- Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
- Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
- Thomo, Alex. Latent Semantic Analysis (Tutorial).
- Latent Semantic Indexing. Standford NLP Course
- Charikar, Moses S. Similarity Estimation Techniques from Rounding Algorithms
- Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
- Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
- Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature