Natural Language Toolkit
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
An accurate natural language detection library, suitable for short text and mixed-language text
Extensive Language Pack for Tree-Sitter
Thai Natural Language Processing library
Microsoft Azure Text Analytics Client Library for Python
Textile processing for Python.
Natural language processing augmentation library for deep neural networks
Python package for Korean natural language processing.
Generalist model for NER (Extract any entity types from texts)
Pyap is an MIT-licensed text processing library, written in Python, for detecting and parsing addresses. It currently supports US, Canadian, and British addresses.
Extract quantities from unstructured text.
Functions to preprocess and normalize text.
Module for automatic summarization of text documents and HTML pages.
NeMo text processing for ASR and TTS
Python library for processing Chinese text
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense chunking library
A text summarization and keyword extraction package based on TextRank
NLP, before and after spaCy
Nonsense String Evaluator
The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages.
A fast Voice Activity Detection and Transcription System
Wrappers for several pre-processing scripts from the Moses toolkit.
Natural Language Processing (NLP) library for the Urdu language.
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
A base class for wrapping text-processing tools
uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.
Convert HTML to markdown
A library for extracting abbreviations from text.
A Python library for a _FULL_ Zalgo experience
A command to manage a header section for a source code tree
Identification and conversion functions for Chinese text processing
pre-processing package for text strings
STAM is a library for dealing with standoff annotations on text; this is the Python binding.
ONNX Text Recognition (OnnxTR): docTR ONNX wrapper for high-performance OCR on documents.
Blazing-fast Thai text processing library powered by Rust
Python ctypes bindings for reliq
Process-Sanskrit is a Python library for automatic Sanskrit text annotation and inflected dictionary search
A text extraction library supporting PDFs, images, office documents and more
Unsupervised Korean Natural Language Processing Toolkits
Open-source tool for exploring, labeling, and monitoring data for NLP projects.
The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. This fork is specialized for IndicTrans2.
SMASHED is a toolkit designed to apply transformations to samples in datasets, such as field extraction, tokenization, prompting, batching, and more. Supports datasets from Hugging Face, torchdata iterables, or simple lists of dictionaries.
An augmentation library based on spaCy for joint augmentation of text and labels.