Socket
Socket
Sign inDemoInstall

github.com/james-bowman/nlp

Package Overview
Dependencies
0
Alerts
File Explorer

Install Socket

Detect and block malicious and high-risk dependencies

Install

    github.com/james-bowman/nlp

Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The primary focus is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents. The package makes use of the Gonum (http://http//www.gonum.org/) library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn (http://scikit-learn.org/stable/) and Gensim(https://radimrehurek.com/gensim/) The primary intended use case is to support document input as text strings encoded as a matrix of numerical feature vectors called a `term document matrix`. Each column in the matrix corresponds to a document in the corpus and each row corresponds to a unique term occurring in the corpus. The individual elements within the matrix contain the frequency with which each term occurs within each document (referred to as `term frequency`). Whilst textual data from document corpora are the primary intended use case, the algorithms can be used with other types of data from other sources once encoded (vectorised) into a suitable matrix e.g. image data, sound data, users/products, etc. These matrices can be processed and manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions. Typically the algorithms in this package implement one of three primary interfaces: One of the implementations of Vectoriser is Pipeline which can be used to wire together pipelines composed of a Vectoriser and one or more Transformers arranged in serial so that the output from each stage forms the input of the next. This can be used to construct a classic LSI (Latent Semantic Indexing) pipeline (vectoriser -> TF.IDF weighting -> Truncated SVD): Whilst they take different inputs, both Vectorisers and Transformers have 3 primary methods:


Version published

Readme

Source

Natural Language Processing

License: MIT GoDoc Build Status Go Report Card codecov Mentioned in Awesome Go Sourcegraph

nlp

Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for the package is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

Planned

  • Expanded persistence support
  • Stemming to treat words with common root as the same e.g. "go" and "going"
  • Clustering algorithms e.g. Heirachical, K-means, etc.
  • Classification algorithms e.g. SVM, KNN, random forest, etc.

References

  1. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  2. Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  3. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  4. Latent Semantic Indexing. Standford NLP Course
  5. Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thiry-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380.
  6. M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005.
  7. A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999.
  8. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
  9. Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
  10. Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature
  11. QasemiZadeh, Behrang and Handschuh, Siegfried. Random Indexing Explained with High Probability
  12. Foulds, James; Boyles, Levi; Dubois, Christopher; Smyth, Padhraic; Welling, Max (2013). Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation

FAQs

Last updated on 11 May 2021

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc