Natural Language Toolkit
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
Thai Natural Language Processing library
An accurate natural language detection library, suitable for short text and mixed-language text
Comprehensive collection of 160+ tree-sitter language parsers
Microsoft Azure Text Analytics Client Library for Python
Textile processing for Python.
Generalist model for NER (Extract any entity types from texts)
Python package for Korean natural language processing.
Natural language processing augmentation library for deep neural networks
Extract quantities from unstructured text.
Pyap is an MIT-licensed text processing library, written in Python, for detecting and parsing addresses. It currently supports US, Canadian, and British addresses.
Functions to preprocess and normalize text.
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense chunking library
Module for automatic summarization of text documents and HTML pages.
Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.
The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages.
A text summarization and keyword extraction package based on TextRank
A Python library for a _FULL_ Zalgo experience
NeMo text processing for ASR and TTS
Nonsense String Evaluator
Python library for processing Chinese text
STAM is a library for dealing with standoff annotations on text; this is the Python binding.
NLP, before and after spaCy
Identification and conversion functions for Chinese text processing
Wrappers for several pre-processing scripts from the Moses toolkit.
A base class for wrapping text-processing tools
A fast Voice Activity Detection and Transcription System
Onnx Text Recognition (OnnxTR): docTR Onnx-Wrapper for high-performance OCR on documents.
Onnx Text Recognition (OnnxTR) OCR plugin for docling
Privacy-first text anonymization tool with enterprise-grade accuracy for removing PII from documents
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
A command to manage a header section for a source code tree
Unsupervised Korean Natural Language Processing Toolkits
Natural Language Processing (NLP) library for the Urdu language.
A library for extracting abbreviations from text.
Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract
Pre-processing package for text strings
Python ctypes bindings for reliq
A Python wrapper for the Doc2X API that comes with native text processing (to improve text recall in RAG).
Core libraries for natural language processing