Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The primary focus is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents. The package makes use of the Gonum (http://http//www.gonum.org/) library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn (http://scikit-learn.org/stable/) and Gensim(https://radimrehurek.com/gensim/) The primary intended use case is to support document input as text strings encoded as a matrix of numerical feature vectors called a `term document matrix`. Each column in the matrix corresponds to a document in the corpus and each row corresponds to a unique term occurring in the corpus. The individual elements within the matrix contain the frequency with which each term occurs within each document (referred to as `term frequency`). Whilst textual data from document corpora are the primary intended use case, the algorithms can be used with other types of data from other sources once encoded (vectorised) into a suitable matrix e.g. image data, sound data, users/products, etc. These matrices can be processed and manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions. Typically the algorithms in this package implement one of three primary interfaces: One of the implementations of Vectoriser is Pipeline which can be used to wire together pipelines composed of a Vectoriser and one or more Transformers arranged in serial so that the output from each stage forms the input of the next. This can be used to construct a classic LSI (Latent Semantic Indexing) pipeline (vectoriser -> TF.IDF weighting -> Truncated SVD): Whilst they take different inputs, both Vectorisers and Transformers have 3 primary methods:
Package nlp provides general purpose Natural Language Processing.