Package doc
Package jsonnlp provides the data structures to read and generate JSON-NLP. See https://github.com/SemiringInc/JSON-NLP for the JSON Schema specification of the JSON-NLP exchange format. JSON-NLP encapsulates different Natural Language Processing (NLP) annotations and analyses in one uniform JSON format. Basic Structure Every JSON-NLP...
Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The initial focus is on algorithms supporting LSA (Latent Semantic Analysis), often referred to as Latent Semantic Indexing in the context of information retrieval.

The algorithms in the package typically accept documents as text strings, which are then encoded as a matrix of numerical feature vectors called a `term document matrix`. Columns in this matrix represent the documents in the corpus and rows represent the terms occurring in the documents. Each element of the matrix contains the number of occurrences of that term in the associated document.

This matrix can be manipulated through additional transformations for weighting features, identifying relationships, or optimising the data for analysis, information retrieval and/or prediction. A common transformation weights features to remove natural biases that would skew results, e.g. commonly occurring words like `the`, `of` and `and` should carry lower weight than unusual words.

Term document matrices typically have a very large number of dimensions, so transformations are often applied to reduce the dimensionality using techniques such as Locality Sensitive Hashing or Latent Semantic Analysis. The latter is typically performed using matrix SVD (`Singular Value Decomposition`), which approximates the original term document matrix with a new matrix of much lower rank (typically around 100 rather than 1000s).

Truncated SVD is a fundamental part of LSA and serves a number of purposes:

1. The reduced dimensionality of the data theoretically requires less memory.
2. As less significant dimensions are removed, there is less `noise` in the data which could have artificially skewed results.
3. Perhaps most importantly, the SVD effectively encodes the co-occurrence of terms within the documents to capture semantic meaning rather than simply the presence (or absence) of words. This combats the problem of synonymy (a common challenge in NLP), where different words can be used to mean the same thing. In LSA, documents can have a high degree of semantic similarity with very few words in common.

The columns of the post-SVD matrix (each a feature vector representing a document within the corpus) can be compared for similarity with each other (for clustering) or with a query (also represented as a feature vector projected into the same dimensional space). Similarity is measured by the angle between the two feature vectors being considered.
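
The term document matrix and the angle-based similarity described above can be illustrated with a minimal sketch that uses only the Go standard library. This is not the package's API: the names `termDocumentMatrix`, `column` and `cosineSimilarity` are hypothetical helpers introduced for illustration, tokenisation is a naive whitespace split, and no weighting or SVD step is applied.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// termDocumentMatrix (hypothetical helper) builds a dense term document
// matrix from raw text. Rows are terms (sorted alphabetically), columns are
// documents, and each element counts occurrences of the term in the document.
func termDocumentMatrix(docs []string) (terms []string, matrix [][]float64) {
	index := map[string]int{}
	// First pass: collect the distinct terms across the corpus.
	for _, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			if _, seen := index[tok]; !seen {
				index[tok] = 0
				terms = append(terms, tok)
			}
		}
	}
	sort.Strings(terms)
	for i, t := range terms {
		index[t] = i
	}
	// Second pass: count term occurrences per document.
	matrix = make([][]float64, len(terms))
	for i := range matrix {
		matrix[i] = make([]float64, len(docs))
	}
	for j, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			matrix[index[tok]][j]++
		}
	}
	return terms, matrix
}

// column extracts the feature vector of document j from the matrix.
func column(matrix [][]float64, j int) []float64 {
	v := make([]float64, len(matrix))
	for i := range matrix {
		v[i] = matrix[i][j]
	}
	return v
}

// cosineSimilarity returns the cosine of the angle between two vectors of
// equal length, or 0 if either vector has zero magnitude.
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	docs := []string{
		"the quick brown fox",
		"the quick brown dog",
		"a slow green turtle",
	}
	terms, m := termDocumentMatrix(docs)
	fmt.Println(len(terms), "terms x", len(docs), "documents")
	fmt.Printf("doc0 vs doc1: %.2f\n", cosineSimilarity(column(m, 0), column(m, 1)))
	fmt.Printf("doc0 vs doc2: %.2f\n", cosineSimilarity(column(m, 0), column(m, 2)))
}
```

On raw counts like these, two documents that share no words always have a similarity of zero, however close they are in meaning. That limitation is what the weighting and truncated SVD steps described above are intended to address.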
Package nlp provides general purpose Natural Language Processing.