Natural Language Toolkit
Comprehensive collection of 160+ tree-sitter language parsers
A library for Unicode normalization (NFC, NFD, NFKC, NFKD) independent of Python's core Unicode database.
Python and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
An accurate natural language detection library, suitable for short text and mixed-language text
Thai Natural Language Processing library
Textile processing for Python.
Extract quantities from unstructured text.
Microsoft Azure Text Analytics Client Library for Python
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense chunking library
Generalist model for NER (extracts any entity type from text)
uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.
High-performance HTML to Markdown converter powered by Rust with a clean Python API
Natural language processing augmentation library for deep neural networks
Python package for Korean natural language processing.
Pyap is an MIT-licensed text processing library, written in Python, for detecting and parsing addresses. It currently supports US, Canadian, and British addresses.
Functions to preprocess and normalize text.
A powerful MCP server for comprehensive PDF processing with OCR and diagram detection
Module for automatic summarization of text documents and HTML pages.
Nonsense String Evaluator
Python library for processing Chinese text
NLP, before and after spaCy
NeMo text processing for ASR and TTS
Onnx Text Recognition (OnnxTR): docTR Onnx-Wrapper for high-performance OCR on documents.
Identification and conversion functions for Chinese text processing
Onnx Text Recognition (OnnxTR) OCR plugin for docling
The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages.
A Python library for a _FULL_ Zalgo experience
Wrappers for several pre-processing scripts from the Moses toolkit.
A fast, compact pure-Python tokenizer for Icelandic text with sentence segmentation
Natural Language Processing (NLP) library for the Urdu language.
A base class for wrapping text-processing tools
Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats (v3 LTS)
A text summarization and keyword extraction package based on TextRank
A library for calculating a variety of features from text using spaCy
A command to manage a header section for a source code tree
Blazing-fast Thai text processing library powered by Rust
Unsupervised Korean Natural Language Processing Toolkits
A fast Voice Activity Detection and Transcription System
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
A modern, high-performance image processing library for Python, powered by Rust.
An extensible tool to process legal citations in text
A library for extracting abbreviations from text.