pypi

Text Processing

preprocessing

pre-processing package for text strings

text pre-processing

nlup

Core libraries for natural language processing

natural language processing

text processing

artificial intelligence

phrasetree

Phrase Tree from Natural Language Toolkit

natural language processing

computational linguistics

omnilingual-asr

Omnilingual ASR Modeling Library

speech recognition

automatic speech recognition

breame

Breame is a lightweight Python package with a number of tools to aid in the detection of words that have dual spellings and meanings in British and American English.

text processing

natural language processing

utility library

hamtaa-texttools

A high-level NLP toolkit built on top of modern LLMs.

text-processing

contextgem

Effortless LLM extraction from documents

artificial-intelligence

aspect-extraction

automated-prompting

concept-extraction

content-extraction

huspacy

HuSpaCy: industrial strength Hungarian natural language processing

text processing

language processing

analiticcl

Analiticcl is an approximate string matching or fuzzy-matching system that can be used to find variants for spelling correction or text normalisation

text-processing

spelling-correction

citeurl

an extensible tool to process legal citations in text

sculpt

Sculpt: Structuring unstructured data with LLMs

large language model

unstructured data

structured data

data extraction

text2text

Text2Text Language Modeling Toolkit

multilingual gpt chatgpt bert natural language processing nlp nlg text generation gpt question answer answering information retrieval tfidf tf-idf bm25 search index summary summarizer summarization tokenizer tokenization translation backtranslation data augmentation science machine learning colab embedding levenshtein sub-word edit distance conversational dialog chatbot llama rag

livekit-plugins-nltk

Agent Framework plugin for NLTK-based text processing.

aspose-tasks

Aspose.Tasks for Python via .NET is a native library that enables developers to add MS Project file processing capabilities to their applications

pdfdeal

A Python wrapper for the Doc2X API that comes with native text processing (to improve text recall in RAG).

rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

byte-pair-encoding

subword-tokenization

tnh-scholar

TNH Scholar is an AI-driven project designed to explore, query, and translate the teachings of Thich Nhat Hanh and the Plum Village community.

doc2mark

Unified document processing with AI-powered OCR

document-processing

centering-lgram

Advanced Language Model with Centering Theory for Coherent Text Generation

natural language processing

text generation

centering theory

ll-core

LivingLogic base package: ansistyle, color, make, sisyphus, xpit, url, xml_codec

escape sequence

reliq

Python ctypes bindings for reliq

text-processing

flashtext2

A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package).

text-processing

extracting-keywords

keyword-extraction

python-rake

A Python module implementing the Rapid Automatic Keyword Extraction algorithm.

text mining data mining natural language processing

tidytext

Text processing with pandas DataFrames.

indic-nlp-library-itt

The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. This fork is specialized for IndicTrans2.

textaugment

A library for augmenting text for natural language processing applications.

text augmentation

natural language processing

docstrange

Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

document-processing

document-conversion

image-processing

deepseek-ocr

A simple and efficient Python SDK for DeepSeek-OCR API

document-processing

augmenty

An augmentation library based on SpaCy for joint augmentation of text and labels.

natural language processing

pointofview

A Python package for determining a piece of text's point of view (first, second, third, or unknown).

natural language processing

bp-tokenizer

A high-performance Python tokenizer using Byte-Pair Encoding (BPE) with 100k vocabulary, supporting text encoding, decoding, and normalization for NLP applications.

text-processing

smashed

SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batching, and more. Supports datasets from Huggingface, torchdata iterables, or simple lists of dictionaries.

adapt-parser

A text-to-intent parsing framework.

natural language processing

verbatim

High-quality multilingual speech-to-text

audio processing

aiwand

A simple AI toolkit for text processing using OpenAI and Gemini APIs

ekphrasis

Text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction.

fenic

fenic is a Python DataFrame library for processing text data with APIs inspired by PySpark.

fastmrz

Extracts the Machine Readable Zone (MRZ) data from document images

image processing

image recognition

computer vision

aspose-psd

Aspose.PSD for Python via .NET is a standalone API to read, write, process, and convert Adobe Photoshop PSD and PSB files without installing Adobe Photoshop®, and AI files without Adobe Illustrator®

uniqseq

Stream-based deduplication for repeating sequences

huspacy-nightly

HuSpaCy: industrial strength Hungarian natural language processing

text processing

language processing

chirptext

A minimalist collection of text processing tools for Python 3

gatenlp

GATE NLP implementation in Python.

text processing

natural language processing

architxt

ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.

forumscraper

A forum scraper library

text-processing

massedit

Edit multiple files using Python text processing modules

ovos-padatious

A neural network intent parser

intent-parser parser text text-processing

piraye

A utility for normalizing Persian, Arabic, and English texts

Natural Language Processing

dedoc-utils

Utilities for automatic document image processing

text recognition

computer vision

shekar

Simplifying Persian NLP for Modern Applications

Machine Learning

Natural Language Processing

Made with ⚡️ by Socket Inc

U.S. Patent No. 12,346,443 & 12,314,394. Other pending.