github.com/james-bowman/nlp

Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The primary focus is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents. The package makes use of the Gonum (http://www.gonum.org/) library for linear algebra and scientific computing, with some inspiration taken from Python's scikit-learn (http://scikit-learn.org/stable/) and Gensim (https://radimrehurek.com/gensim/).

The primary intended use case is to support document input as text strings encoded as a matrix of numerical feature vectors called a `term document matrix`. Each column in the matrix corresponds to a document in the corpus and each row corresponds to a unique term occurring in the corpus. The individual elements within the matrix contain the frequency with which each term occurs within each document (referred to as `term frequency`).

Whilst textual data from document corpora are the primary intended use case, the algorithms can be used with other types of data from other sources once encoded (vectorised) into a suitable matrix, e.g. image data, sound data, users/products, etc.

These matrices can be processed and manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions. Typically the algorithms in this package implement one of three primary interfaces.

One of the implementations of Vectoriser is Pipeline, which can be used to wire together pipelines composed of a Vectoriser and one or more Transformers arranged in serial so that the output from each stage forms the input of the next. This can be used to construct a classic LSI (Latent Semantic Indexing) pipeline (vectoriser -> TF.IDF weighting -> Truncated SVD). Whilst they take different inputs, both Vectorisers and Transformers have 3 primary methods.
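To make the term document matrix described above concrete, the following is a minimal sketch using only the Go standard library. It does not use the nlp package's own API: the function name `termDocumentMatrix`, the whitespace tokenisation and the dense `[][]float64` representation are all assumptions made for illustration (the package itself builds on Gonum matrices).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// termDocumentMatrix is a hypothetical helper (not part of the nlp package).
// It builds a dense term-document matrix: rows are unique terms in
// alphabetical order, columns are documents, and each element is the number
// of times the term occurs in that document (its term frequency).
func termDocumentMatrix(docs []string) (vocab []string, matrix [][]float64) {
	index := make(map[string]int)

	// First pass: collect the unique, lower-cased terms in the corpus.
	for _, doc := range docs {
		for _, term := range strings.Fields(strings.ToLower(doc)) {
			if _, seen := index[term]; !seen {
				index[term] = 0
				vocab = append(vocab, term)
			}
		}
	}

	// Give every term a stable row number.
	sort.Strings(vocab)
	for i, term := range vocab {
		index[term] = i
	}

	// Second pass: count occurrences of each term in each document.
	matrix = make([][]float64, len(vocab))
	for i := range matrix {
		matrix[i] = make([]float64, len(docs))
	}
	for j, doc := range docs {
		for _, term := range strings.Fields(strings.ToLower(doc)) {
			matrix[index[term]][j]++
		}
	}
	return vocab, matrix
}

func main() {
	docs := []string{
		"the quick brown fox",
		"the lazy dog",
		"the fox and the dog",
	}
	vocab, matrix := termDocumentMatrix(docs)
	for i, term := range vocab {
		fmt.Printf("%-6s %v\n", term, matrix[i])
	}
}
```

Each printed row is one term and each value in the row is that term's frequency in the corresponding document, so a document is read as a column of the matrix.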

v0.0.0-20210511120306-26d441fa0ded

Version published: 3 years ago

Readme

Natural Language Processing

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.

Features

LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
Fast comparison and retrieval of semantically similar documents using the SimHash (random hyperplanes/sign random projection) algorithm with multi-index and Forest schemes for LSH (Locality Sensitive Hashing) to support fast, approximate cosine similarity/angular distance comparisons and approximate nearest neighbour search using significantly less memory and processing time.
Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) over large, web-scale corpora.
Latent Dirichlet Allocation (LDA) using a parallelised implementation of the fast SCVB0 (Stochastic Collapsed Variational Bayesian inference) algorithm for unsupervised topic extraction.
PCA (Principal Component Analysis)
TF-IDF weighting to account for frequently occurring words
Sparse matrix implementations used for more efficient memory usage and processing over large document corpora.
Stop word removal to remove frequently occurring English words e.g. "the", "and"
Feature hashing ('the hashing trick') implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
Similarity/distance measures to calculate the similarity/distance between feature vectors.
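Two of the features listed above, TF-IDF weighting and similarity measures, can be illustrated with a short standard-library sketch. This is not the package's implementation: the helper names `tfidf`, `column` and `cosineSimilarity` are hypothetical, and the smoothed weighting `tf * ln((1+N)/(1+df))` is one common variant chosen here as an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf is a hypothetical helper. It reweights a term-document matrix
// (rows = terms, columns = documents) so that each count is multiplied by
// ln((1+N)/(1+df)), where N is the number of documents and df is the number
// of documents containing the term. Terms present in every document get a
// weight of zero.
func tfidf(tf [][]float64) [][]float64 {
	out := make([][]float64, len(tf))
	for i, row := range tf {
		df := 0.0
		for _, count := range row {
			if count > 0 {
				df++
			}
		}
		idf := math.Log((1 + float64(len(row))) / (1 + df))
		out[i] = make([]float64, len(row))
		for j, count := range row {
			out[i][j] = count * idf
		}
	}
	return out
}

// column extracts document j from the matrix as a feature vector.
func column(m [][]float64, j int) []float64 {
	v := make([]float64, len(m))
	for i := range m {
		v[i] = m[i][j]
	}
	return v
}

// cosineSimilarity returns the cosine of the angle between a and b, which
// are assumed to have the same length. It returns 0 if either vector is
// all zeros.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Rows: "the", "fox", "dog"; columns: three example documents.
	tf := [][]float64{
		{1, 1, 2},
		{1, 0, 1},
		{0, 1, 1},
	}
	weighted := tfidf(tf)
	fmt.Println(weighted)
	fmt.Println(cosineSimilarity(column(weighted, 0), column(weighted, 2)))
}
```

In this example the term that appears in every document contributes nothing after weighting, so similarity between documents is driven by the rarer terms they share.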

Planned

Expanded persistence support
Stemming to treat words with common root as the same e.g. "go" and "going"
Clustering algorithms e.g. Hierarchical, K-means, etc.
Classification algorithms e.g. SVM, KNN, random forest, etc.

Last updated on 11 May 2021