github.com/vench/nlp

v0.0.2
Source
Go

Version published: 5 years ago

Created: 5 years ago

Natural Language Processing

An implementation of selected machine learning algorithms for basic natural language processing in Go. The initial focus of this project is Latent Semantic Analysis, which supports retrieval/searching, clustering and classification of text documents based on their semantic content.

Built upon the Gonum library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn.

Check out the companion blog post or the Go documentation page for full usage and examples.

Features

Sparse matrix implementations for more effective memory usage
Convert plain text strings into numerical feature vectors for analysis
Stop word removal to remove frequently occurring English words e.g. "the", "and"
Feature hashing ("the hashing trick") implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
TF-IDF weighting to account for frequently occurring words
LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
PCA (Principal Component Analysis)
SimHash implementation of LSH (Locality Sensitive Hashing) using sign random projection to support approximate cosine similarity using significantly less memory and processing time.
Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) with semantic vector space models.
Cosine, Angular and Hamming similarity/distance measures to calculate the similarity/distance between feature vectors.
Persistence for trained models (persistence for Vectorisers coming soon)
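
To make a few of the features above concrete, here is a minimal standard-library sketch of stop word removal, count vectorisation, TF-IDF weighting and cosine similarity. It does not use this package's API: the names tokenise, countVectors, tfidf and cosine, the tiny stop word list, and the smoothed IDF formula ln((1+n)/(1+df)) are all assumptions chosen for illustration, and dense slices stand in for the sparse matrices the package provides.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list, not the list shipped with the package.
var stopWords = map[string]bool{"the": true, "and": true, "a": true, "on": true}

// tokenise lower-cases text, splits on non-letter runes and drops stop words.
func tokenise(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	var tokens []string
	for _, f := range fields {
		if !stopWords[f] {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

// countVectors builds a vocabulary over docs and returns one term-count
// vector per document (dense here for brevity) plus the vocabulary.
func countVectors(docs []string) ([][]float64, map[string]int) {
	vocab := map[string]int{}
	tokenised := make([][]string, len(docs))
	for i, d := range docs {
		tokenised[i] = tokenise(d)
		for _, t := range tokenised[i] {
			if _, ok := vocab[t]; !ok {
				vocab[t] = len(vocab)
			}
		}
	}
	vectors := make([][]float64, len(docs))
	for i, toks := range tokenised {
		vectors[i] = make([]float64, len(vocab))
		for _, t := range toks {
			vectors[i][vocab[t]]++
		}
	}
	return vectors, vocab
}

// tfidf reweights counts in place using a smoothed IDF, ln((1+n)/(1+df)).
// This particular smoothing is an assumption, not necessarily the package's.
func tfidf(vectors [][]float64) {
	if len(vectors) == 0 {
		return
	}
	n := float64(len(vectors))
	for j := range vectors[0] {
		df := 0.0
		for i := range vectors {
			if vectors[i][j] > 0 {
				df++
			}
		}
		idf := math.Log((1 + n) / (1 + df))
		for i := range vectors {
			vectors[i][j] *= idf
		}
	}
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if either vector is all zeros.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / math.Sqrt(na*nb)
}

func main() {
	docs := []string{
		"the quick brown fox",
		"the quick brown dog",
		"a lazy dog on the mat",
	}
	vectors, _ := countVectors(docs)
	tfidf(vectors)
	fmt.Printf("doc0 vs doc1: %.3f\n", cosine(vectors[0], vectors[1]))
	fmt.Printf("doc0 vs doc2: %.3f\n", cosine(vectors[0], vectors[2]))
}
```

In this sketch the first two documents share terms and so score above zero, while the first and third share no non-stop-word terms and score zero. LSA, as described above, would add a truncated SVD step after the TF-IDF weighting so that documents can match on related terms as well as identical ones.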

Planned

Ability to persist trained vectorisers
LDA (Latent Dirichlet Allocation) implementation for topic extraction
Stemming to treat words with a common root as the same e.g. "go" and "going"
Clustering algorithms e.g. Hierarchical, K-means, etc.
Classification algorithms e.g. SVM, random forest, etc.

Package last updated on 21 May 2020
