LeNLP
Natural Language Processing toolbox for Python with Rust
LeNLP is a toolkit dedicated to natural language processing (NLP). It provides optimized and parallelized functions in Rust for use in Python, offering high performance and ease of integration.
Installation
Install LeNLP with pip:
pip install lenlp
Sections
Quick Start
Sparse Module
The sparse module offers a variety of vectorizers and transformers for text data. They return scipy.sparse.csr_matrix objects, which are optimized for memory usage and speed, and can be used as drop-in replacements for scikit-learn vectorizers.
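To make the csr_matrix layout concrete, here is a minimal pure-Python sketch of how per-document token counts map to the three CSR arrays (data, indices, indptr). The helper name to_csr and the whitespace tokenizer are illustrative assumptions, not LeNLP's implementation.

```python
from collections import Counter


def to_csr(texts):
    """Build CSR arrays (data, indices, indptr) from whitespace-token counts."""
    vocabulary = {}  # token -> column index, assigned on first sight (hypothetical scheme)
    data, indices, indptr = [], [], [0]
    for text in texts:
        counts = Counter(text.lower().split())
        for token, count in counts.items():
            column = vocabulary.setdefault(token, len(vocabulary))
            indices.append(column)  # column of this non-zero entry
            data.append(count)      # value of this non-zero entry
        indptr.append(len(indices))  # row i spans indptr[i]:indptr[i + 1]
    return data, indices, indptr, vocabulary


data, indices, indptr, vocabulary = to_csr(["hello world hello", "rust world"])
print(data, indices, indptr)
```

Only non-zero counts are stored, which is why the format stays compact for large vocabularies. If SciPy is available, the three lists can be passed as scipy.sparse.csr_matrix((data, indices, indptr)).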
CountVectorizer
The CountVectorizer converts a list of texts into a sparse matrix of token counts. This is a Rust implementation of the CountVectorizer from scikit-learn.
from lenlp import sparse
vectorizer = sparse.CountVectorizer(
    ngram_range=(3, 5),
    analyzer="char_wb",
    normalize=True,
    stop_words=["based"],
)
You can fit the vectorizer and transform a list of texts into a sparse matrix of token counts:
X = [
    "Hello World",
    "Rust based vectorizer",
]
matrix = vectorizer.fit_transform(X)
Or use separate calls:
vectorizer.fit(X)
matrix = vectorizer.transform(X)
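The analyzer="char_wb" and ngram_range=(3, 5) options above control which tokens are counted. The sketch below shows what character n-grams inside word boundaries look like, assuming LeNLP follows scikit-learn's char_wb semantics (each word is padded with a space on both sides, and n-grams never span two words). The function name char_wb_ngrams is hypothetical.

```python
def char_wb_ngrams(text, ngram_range=(3, 5)):
    """Character n-grams taken inside word boundaries, scikit-learn style."""
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "  # padding marks the start and end of the word
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # word too short for this size: keep the padded word once and stop
                ngrams.append(padded)
                break
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


print(char_wb_ngrams("Hi"))
# → [' hi', 'hi ', ' hi ']
```

Each distinct n-gram becomes one column of the resulting matrix, so a wider ngram_range increases the vocabulary size.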
Benchmark: LeNLP CountVectorizer versus scikit-learn CountVectorizer, fit_transform with the char analyzer.
TfidfVectorizer
The TfidfVectorizer converts a list of texts into a sparse matrix of tf-idf weights, implemented in Rust.
from lenlp import sparse
vectorizer = sparse.TfidfVectorizer(
    ngram_range=(3, 5),
    analyzer="char_wb",
    normalize=True,
    stop_words=["based"],
)
Fit the vectorizer and transform texts:
X = [
    "Hello World",
    "Rust based vectorizer",
]
matrix = vectorizer.fit_transform(X)
Or use separate calls:
vectorizer.fit(X)
matrix = vectorizer.transform(X)
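For readers who want to see what a tf-idf weight is, here is a small pure-Python sketch. It assumes the smoothed idf used by scikit-learn by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. Whether LeNLP uses exactly this variant is not stated here, and the function name tfidf and the whitespace tokenizer are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of documents that contain each token
    df = Counter(token for tokens in tokenized for token in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = {}
        for token, count in tf.items():
            idf = math.log((1 + n_docs) / (1 + df[token])) + 1.0
            row[token] = count * idf
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({token: w / norm for token, w in row.items()})
    return rows


weights = tfidf(["rust is fast", "python is easy"])
print(weights[0])
```

Tokens that appear in every document (here "is") receive a lower weight than tokens specific to one document (here "rust" and "fast").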
Benchmark: LeNLP TfidfVectorizer versus scikit-learn TfidfVectorizer, fit_transform with the char analyzer.
BM25Vectorizer
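The BM25Vectorizer converts a list of texts into a sparse matrix of BM25 weights. BM25 adds term-frequency saturation and document-length normalization, which often makes it rank documents better than tf-idf or raw count weights in retrieval tasks.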
from lenlp import sparse
vectorizer = sparse.BM25Vectorizer(
    ngram_range=(3, 5),
    analyzer="char_wb",
    normalize=True,
    stop_words=["based"],
)
Fit the vectorizer and transform texts:
X = [
    "Hello World",
    "Rust based vectorizer",
]
matrix = vectorizer.fit_transform(X)
Or use separate calls:
vectorizer.fit(X)
matrix = vectorizer.transform(X)
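The following sketch illustrates how a BM25 weight can be computed, to show how it differs from tf-idf. It assumes the common Okapi BM25 form with a non-negative idf, ln(1 + (N - df + 0.5) / (df + 0.5)), and the conventional defaults k1=1.5 and b=0.75. These parameter values, the function name bm25, and the whitespace tokenizer are assumptions for illustration, not LeNLP's documented behavior.

```python
import math
from collections import Counter


def bm25(docs, k1=1.5, b=0.75):
    """Return one {token: BM25 weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    df = Counter(token for tokens in tokenized for token in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1.0 - b + b * len(tokens) / avg_len
        row = {}
        for token, count in tf.items():
            idf = math.log(1.0 + (n_docs - df[token] + 0.5) / (df[token] + 0.5))
            # term frequency saturates: repeated tokens add less and less weight
            row[token] = idf * count * (k1 + 1.0) / (count + k1 * length_norm)
        rows.append(row)
    return rows


rows = bm25(["rust rust rust rust fast", "python is easy to read"])
print(rows[0])
```

In this example "rust" occurs four times but its weight is well below four times the weight of "fast", which is the saturation effect that raw counts and plain tf-idf lack.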
Benchmark: LeNLP BM25Vectorizer versus LeNLP TfidfVectorizer, fit_transform with the char analyzer. scikit-learn does not provide a BM25Vectorizer counterpart.
FlashText
The flash module allows for efficient keyword extraction from texts. It implements the FlashText algorithm as described in the paper Replace or Retrieve Keywords In Documents At Scale.
from lenlp import flash
flash_text = flash.FlashText(
    normalize=True,
)
flash_text.add(["paris", "bordeaux", "toulouse"])
Extract keywords and their positions from sentences:
sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]
flash_text.extract(sentences)
Output:
[[('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)],
 [('paris', 0, 5), ('bordeaux', 62, 70), ('toulouse', 76, 84)]]
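The FlashText algorithm is highly efficient: its running time depends on the length of the text rather than on the number of keywords, which makes it significantly faster than regular expressions when the keyword list is large. LeNLP's implementation normalizes input documents by removing accents and converting to lowercase, so matching is insensitive to case and accents.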
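The idea behind this kind of keyword extraction can be sketched with a character trie. The code below is a simplified pure-Python illustration in the spirit of FlashText, not LeNLP's Rust implementation: the names build_trie and extract are hypothetical, only lowercasing is applied (accent removal is omitted), and ASCII input is assumed so that lowercasing does not shift positions.

```python
def build_trie(keywords):
    """Store keywords character by character; '_end' marks a complete keyword."""
    trie = {}
    for keyword in keywords:
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node["_end"] = keyword
    return trie


def extract(trie, text):
    """Return (keyword, start, end) for keywords that match whole words."""
    text = text.lower()
    matches, i, n = [], 0, len(text)
    while i < n:
        # only start matching at the beginning of a word
        if i > 0 and text[i - 1].isalnum():
            i += 1
            continue
        node, j, found = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # accept a keyword only if it ends at a word boundary
            if "_end" in node and (j == n or not text[j].isalnum()):
                found = (node["_end"], i, j)
        if found:
            matches.append(found)
            i = found[2]  # resume scanning after the matched keyword
        else:
            i += 1
    return matches


trie = build_trie(["paris", "bordeaux", "toulouse"])
print(extract(trie, "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse"))
# → [('paris', 0, 5), ('bordeaux', 62, 70), ('toulouse', 76, 84)]
```

Because each position in the text walks the trie at most as deep as the longest keyword, adding more keywords does not add more passes over the text.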
Benchmark:
LeNLP FlashText is benchmarked versus the official implementation of FlashText.
Counter
The counter module converts a list of texts into a list of dictionaries of token counts, one dictionary per text.
from lenlp import counter
sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]
counter.count(
    sentences,
    ngram_range=(1, 1),
    analyzer="word",
    normalize=True,
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"],
)
Output:
[{'compared': 1,
  'south': 1,
  'city': 1,
  'toulouse': 1,
  'bordeaux': 2,
  'france': 1},
 {'toulouse': 1,
  'france': 1,
  'capital': 1,
  'paris': 1,
  'north': 1,
  'compared': 1,
  'bordeaux': 1}]
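As a rough pure-Python equivalent of the call above, the sketch below counts word n-grams with collections.Counter. It takes already-normalized text as input (lowercase, no punctuation) and assumes stop words are removed before n-grams are built. The function name count_ngrams is hypothetical, and LeNLP's actual processing order is not confirmed here.

```python
from collections import Counter


def count_ngrams(texts, ngram_range=(1, 1), stop_words=()):
    """Return one {ngram: count} dict per text, using word n-grams."""
    min_n, max_n = ngram_range
    stop = set(stop_words)
    results = []
    for text in texts:
        tokens = [token for token in text.split() if token not in stop]
        counts = Counter(
            " ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)
        )
        results.append(dict(counts))
    return results


normalized = [
    "toulouse is a city in france its in the south compared to bordeaux and bordeaux",
    "paris is the capital of france its in the north compared to bordeaux and toulouse",
]
result = count_ngrams(
    normalized,
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"],
)
print(result)
```

Note that "its" appears in the stop word list because normalization turns "it's" into "its" before counting.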
Normalizer
The normalizer module normalizes a list of texts by removing accents and punctuation and converting to lowercase.
from lenlp import normalizer
sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]
normalizer.normalize(sentences)
Output:
[
    'toulouse is a city in france its in the south compared to bordeaux and bordeaux',
    'paris is the capital of france its in the north compared to bordeaux and toulouse',
]
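The same kind of normalization can be approximated in pure Python with the standard unicodedata module. This is a sketch of the technique (Unicode decomposition, then dropping combining marks and punctuation), not LeNLP's Rust code. The exact set of characters LeNLP removes and the whitespace collapsing shown here are assumptions.

```python
import unicodedata


def normalize(texts):
    """Lowercase, strip accents, and drop punctuation from each text."""
    normalized = []
    for text in texts:
        # decompose accented characters into base character + combining mark
        decomposed = unicodedata.normalize("NFKD", text.lower())
        no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
        # keep letters, digits, and whitespace only
        kept = "".join(c for c in no_accents if c.isalnum() or c.isspace())
        normalized.append(" ".join(kept.split()))  # collapse repeated whitespace
    return normalized


print(normalize(["Élève à Orléans, it's here"]))
# → ['eleve a orleans its here']
```

Applying the same normalization to both keywords and documents is what lets accented and unaccented spellings match each other.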
References