pre-processing package for text strings
Core libraries for natural language processing
Phrase Tree from Natural Language Toolkit
Omnilingual ASR Modeling Library
Breame is a lightweight Python package with a number of tools to aid in the detection of words that have dual spellings and meanings in British and American English.
A high-level NLP toolkit built on top of modern LLMs.
Effortless LLM extraction from documents
HuSpaCy: industrial strength Hungarian natural language processing
Analiticcl is an approximate string matching or fuzzy-matching system that can be used to find variants for spelling correction or text normalisation
an extensible tool to process legal citations in text
Sculpt: Structuring unstructured data with LLMs
Text2Text Language Modeling Toolkit
A Python wrapper for the Doc2X API that comes with native text processing (to improve text recall in RAG).
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
TNH Scholar is an AI-driven project designed to explore, query, and translate the teachings of Thich Nhat Hanh and the Plum Village community.
Advanced Language Model with Centering Theory for Coherent Text Generation
Python ctypes bindings for reliq
A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package)
A Python module implementing the Rapid Automatic Keyword Extraction algorithm.
Text processing with pandas DataFrames.
The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages. This fork is specialized for IndicTrans2.
A library for augmenting text for natural language processing applications.
Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.
A simple and efficient Python SDK for DeepSeek-OCR API
An augmentation library based on SpaCy for joint augmentation of text and labels.
A Python package for determining a piece of text's point of view (first, second, third, or unknown).
A high-performance Python tokenizer using Byte-Pair Encoding (BPE) with 100k vocabulary, supporting text encoding, decoding, and normalization for NLP applications.
SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batching, and more. Supports datasets from Huggingface, torchdata iterables, or simple lists of dictionaries.
A text-to-intent parsing framework.
high quality multi-lingual speech to text
Text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction.
fenic is a Python DataFrame library for processing text data with APIs inspired by PySpark.
Extracts the Machine Readable Zone (MRZ) data from document images
Aspose.PSD for Python via .NET is a standalone API to read, write, process, and convert Adobe Photoshop PSD and PSB files without installing Adobe Photoshop®, and AI files without Adobe Illustrator®
A minimalist collection of text processing tools for Python 3
GATE NLP implementation in Python.
ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.
A forum scraper library
A neural network intent parser
A utility for normalizing Persian, Arabic and English texts
Utilities for automatic document image processing
Simplifying Persian NLP for Modern Applications