pypi

Text Processing

nltk

Natural Language Toolkit

natural language processing

computational linguistics

tree-sitter-language-pack

Comprehensive collection of 160+ tree-sitter language parsers

text-processing

pyunormalize

A library for Unicode normalization (NFC, NFD, NFKC, NFKD) independent of Python's core Unicode database.

normalization forms

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.

natural-language-processing

textblob

Simple, Pythonic text processing. Sentiment analysis, part-of-speech tagging, noun phrase parsing, and more.

lingua-language-detector

An accurate natural language detection library, suitable for short text and mixed-language text

language-processing

language-detection

language-recognition

pythainlp

Thai Natural Language Processing library

natural language processing

text processing

textile

Textile processing for Python.

demoji

Accurately remove and replace emojis in text strings

natural language processing

quantulum3

Extract quantities from unstructured text.

information extraction

natural language processing

azure-ai-textanalytics

Microsoft Azure Text Analytics Client Library for Python

cognitive services

natural language processing

zhon

Zhon provides constants used in Chinese text processing.

chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense chunking library

retrieval-augmented-generation

natural-language-processing

text-processing

gliner

Generalist model for NER (Extract any entity types from texts)

named-entity-recognition

natural-language-processing

artificial-intelligence

uroman

uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.

computational linguistics

machine translation

natural language processing

string similarity

html-to-markdown

High-performance HTML to Markdown converter powered by Rust with a clean Python API

nlpaug

Natural language processing augmentation library for deep neural networks

machine learning

natural language processing

konlpy

Python package for Korean natural language processing.

natural language processing

computational linguistics

pyap

Pyap is an MIT-licensed text processing library, written in Python, for detecting and parsing addresses. It currently supports US, Canadian, and British addresses.

clean-text

Functions to preprocess and normalize text.

natural-language-processing

text-preprocessing

text-normalization

user-generated-content

pdf-reader-mcp

A powerful MCP server for comprehensive PDF processing with OCR and diagram detection

text-extraction

razdel

Splits Russian text into tokens, sentences, and sections. Rule-based.

natural language processing

sumy

Module for automatic summarization of text documents and HTML pages.

automatic summarization

web-data extraction

natural language processing

nostril-detector

Nonsense String Evaluator

program-analysis text-processing gibberish-detection identifiers

snownlp

Python library for processing Chinese text

textacy

NLP, before and after spaCy

text processing

rs-document

High-performance Rust implementation of LangChain's Document model and Unstructured.io's text cleaners for RAG applications

nemo-text-processing

NeMo text processing for ASR and TTS

text processing

text normalization

onnxtr

Onnx Text Recognition (OnnxTR): docTR Onnx-Wrapper for high-performance OCR on documents.

computer vision

text recognition

dragonmapper

Identification and conversion functions for Chinese text processing

docling-ocr-onnxtr

Onnx Text Recognition (OnnxTR) OCR plugin for docling

computer vision

text recognition

ripgrep

ripgrep is a line-oriented search tool that recursively searches the current directory for a regex pattern while respecting gitignore rules. ripgrep has first class support on Windows, macOS and Linux.

command-line-utilities

indic-nlp-library

The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages.

zalgolib

A Python library for a _FULL_ Zalgo experience

language encoding

text processing

mosestokenizer

Wrappers for several pre-processing scripts from the Moses toolkit.

text tokenization pre-processing

tokenizer

A fast, compact pure-Python tokenizer for Icelandic text with sentence segmentation

natural-language-processing

text-processing

urduhack

Natural Language Processing (NLP) library for Urdu language.

urdu machine learning text pre-processing tensorflow nlp

toolwrapper

A base class for wrapping text-processing tools

subprocess text tool wrapper

kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats (v3 LTS)

document-analysis

document-classification

document-intelligence

document-processing

summa

A text summarization and keyword extraction package based on TextRank

natural language processing

automatic summarization

textdescriptives

A library for calculating a variety of features from text using spaCy

natural language processing

text statistics

pystempel

Polish stemmer.

natural language processing

computational linguistics

addheader

A command to manage a header section for a source code tree

software engineering

text processing

thongna

Blazing-fast Thai text processing library powered by Rust

word-segmentation

soynlp

Unsupervised Korean Natural Language Processing Toolkits

korean-text-processing

word-extraction

realtimestt

A fast Voice Activity Detection and Transcription System

voice-activity-detection

stream2sentence

Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.

sentence detection

sentence generation

imgrs

A modern, high-performance image processing library for Python, powered by Rust.

citeurl

An extensible tool to process legal citations in text

abbreviation-extractor

A library for extracting abbreviations from text.

text-processing

Made with ⚡️ by Socket Inc

U.S. Patent No. 12,346,443 & 12,314,394. Other pending.