
Research
/Security News
Critical Vulnerability in NestJS Devtools: Localhost RCE via Sandbox Escape
A flawed sandbox in @nestjs/devtools-integration lets attackers run code on your machine via CSRF, leading to full Remote Code Execution (RCE).
🚀 The fastest and most comprehensive NLP preprocessing solution that solves all tokenization and text cleaning problems in one place
If you've worked with NLP preprocessing, you've probably faced these frustrating issues:
import nltk import spacy import re import string from bs4 import BeautifulSoup from textblob import TextBlob
Current libraries struggle with modern text patterns:
$20
, 20Rs
, support@company.com
properlyawesome😊text
user@domain.com
, #hashtag
, @mentions
as single tokensNo single library handles all these tasks efficiently:
$20
, ₹100
, 20USD
)#hashtags
, @mentions
)def preprocess_text(text):
from bs4 import BeautifulSoup text = BeautifulSoup(text, "html.parser").get_text()
import re text = re.sub(r'https?://\S+', '', text)
text = text.lower()
import emoji text = emoji.replace_emoji(text, replace='')
import nltk tokens = nltk.word_tokenize(text)
import string tokens = [t for t in tokens if t not in string.punctuation]
from textblob import TextBlob corrected = [str(TextBlob(word).correct()) for word in tokens]
return corrected
UltraNLP is designed to solve all these problems with a single, ultra-fast library:
Function | Syntax | Description | Returns |
---|---|---|---|
preprocess() | ultranlp.preprocess(text, options) | Quick text preprocessing with default settings | dict with tokens, cleaned_text, etc. |
batch_preprocess() | ultranlp.batch_preprocess(texts, options, max_workers) | Process multiple texts in parallel | list of processed results |
Method | Syntax | Parameters | Description | Returns |
---|---|---|---|---|
__init__() | processor = UltraNLPProcessor() | None | Initialize the main processor | UltraNLPProcessor object |
process() | processor.process(text, options) | text (str), options (dict, optional) | Process single text with custom options | dict with processing results |
batch_process() | processor.batch_process(texts, options, max_workers) | texts (list), options (dict), max_workers (int) | Process multiple texts efficiently | list of results |
get_performance_stats() | processor.get_performance_stats() | None | Get processing statistics | dict with performance metrics |
Method | Syntax | Parameters | Description | Returns |
---|---|---|---|---|
__init__() | tokenizer = UltraFastTokenizer() | None | Initialize advanced tokenizer | UltraFastTokenizer object |
tokenize() | tokenizer.tokenize(text) | text (str) | Tokenize text with advanced patterns | list of Token objects |
Method | Syntax | Parameters | Description | Returns |
---|---|---|---|---|
__init__() | cleaner = HyperSpeedCleaner() | None | Initialize text cleaner | HyperSpeedCleaner object |
clean() | cleaner.clean(text, options) | text (str), options (dict, optional) | Clean text with specified options | str cleaned text |
Method | Syntax | Parameters | Description | Returns |
---|---|---|---|---|
__init__() | corrector = LightningSpellCorrector() | None | Initialize spell corrector | LightningSpellCorrector object |
correct() | corrector.correct(word) | word (str) | Correct spelling of a single word | str corrected word |
train() | corrector.train(text) | text (str) | Train corrector on custom corpus | None |
Option | Type | Default | Description | Example |
---|---|---|---|---|
lowercase | bool | True | Convert text to lowercase | {'lowercase': True} |
remove_html | bool | True | Remove HTML tags | {'remove_html': True} |
remove_urls | bool | True | Remove URLs | {'remove_urls': False} |
remove_emails | bool | False | Remove email addresses | {'remove_emails': True} |
remove_phones | bool | False | Remove phone numbers | {'remove_phones': True} |
remove_emojis | bool | True | Remove emojis | {'remove_emojis': False} |
normalize_whitespace | bool | True | Normalize whitespace | {'normalize_whitespace': True} |
remove_special_chars | bool | False | Remove special characters | {'remove_special_chars': True} |
Option | Type | Default | Description | Example |
---|---|---|---|---|
clean | bool | True | Enable text cleaning | {'clean': True} |
tokenize | bool | True | Enable tokenization | {'tokenize': True} |
spell_correct | bool | False | Enable spell correction | {'spell_correct': True} |
clean_options | dict | Default config | Custom cleaning options | See Clean Options above |
max_workers | int | 4 | Number of parallel workers for batch processing | {'max_workers': 8} |
Use Case | Code Example | Output |
---|---|---|
Simple Text | ultranlp.preprocess("Hello World!") | {'tokens': ['hello', 'world'], 'cleaned_text': 'hello world'} |
With Emojis | ultranlp.preprocess("Hello 😊 World!") | {'tokens': ['hello', 'world'], 'cleaned_text': 'hello world'} |
Keep Emojis | ultranlp.preprocess("Hello 😊", {'clean_options': {'remove_emojis': False}}) | {'tokens': ['hello', '😊'], 'cleaned_text': 'hello 😊'} |
Use Case | Code Example | Expected Tokens |
---|---|---|
Hashtags & Mentions | ultranlp.preprocess("Follow @user #hashtag") | ['follow', '@user', '#hashtag'] |
Currency & Prices | ultranlp.preprocess("Price: $29.99 or ₹2000") | ['price', '$29.99', 'or', '₹2000'] |
Social Media URLs | ultranlp.preprocess("Check https://twitter.com/user") | ['check', 'twitter.com/user'] (URL simplified) |
Use Case | Code Example | Expected Tokens |
---|---|---|
Product Reviews | ultranlp.preprocess("Great product! Costs $99.99") | ['great', 'product', 'costs', '$99.99'] |
Contact Information | ultranlp.preprocess("Email: support@company.com", {'clean_options': {'remove_emails': False}}) | ['email', 'support@company.com'] |
Phone Numbers | ultranlp.preprocess("Call +1-555-123-4567", {'clean_options': {'remove_phones': False}}) | ['call', '+1-555-123-4567'] |
Use Case | Code Example | Expected Tokens |
---|---|---|
Code & URLs | ultranlp.preprocess("Visit https://api.example.com/v1", {'clean_options': {'remove_urls': False}}) | ['visit', 'https://api.example.com/v1'] |
Mixed Content | ultranlp.preprocess("API costs $0.01/request") | ['api', 'costs', '$0.01/request'] |
Date/Time | ultranlp.preprocess("Meeting at 2:30PM on 12/25/2024") | ['meeting', 'at', '2:30PM', 'on', '12/25/2024'] |
Use Case | Code Example | Description |
---|---|---|
Small Batch | ultranlp.batch_preprocess(["Text 1", "Text 2", "Text 3"]) | Process few documents sequentially |
Large Batch | ultranlp.batch_preprocess(documents, max_workers=8) | Process many documents in parallel |
Custom Options | ultranlp.batch_preprocess(texts, {'spell_correct': True}) | Batch process with spell correction |
Use Case | Code Example | Description |
---|---|---|
Custom Processor | processor = UltraNLPProcessor(); result = processor.process(text) | Create reusable processor instance |
Only Tokenization | tokenizer = UltraFastTokenizer(); tokens = tokenizer.tokenize(text) | Use tokenizer independently |
Only Cleaning | cleaner = HyperSpeedCleaner(); clean_text = cleaner.clean(text) | Use cleaner independently |
Spell Correction | corrector = LightningSpellCorrector(); word = corrector.correct("helo") | Correct individual words |
Key | Type | Description | Example |
---|---|---|---|
original_text | str | Input text unchanged | "Hello World!" |
cleaned_text | str | Processed/cleaned text | "hello world" |
tokens | list | List of token strings | ["hello", "world"] |
token_objects | list | List of Token objects with metadata | [Token(text="hello", start=0, end=5, type=WORD)] |
token_count | int | Number of tokens found | 2 |
processing_stats | dict | Performance statistics | {"documents_processed": 1, "total_tokens": 2} |
Property | Type | Description | Example |
---|---|---|---|
text | str | The token text | "$29.99" |
start | int | Start position in original text | 15 |
end | int | End position in original text | 21 |
token_type | TokenType | Type of token | TokenType.CURRENCY |
Token Type | Description | Examples |
---|---|---|
WORD | Regular words | hello , world , amazing |
NUMBER | Numeric values | 123 , 45.67 , 1.23e-4 |
EMAIL | Email addresses | user@domain.com , support@company.co.uk |
URL | Web addresses | https://example.com , www.site.com |
CURRENCY | Currency amounts | $29.99 , ₹1000 , €50.00 |
PHONE | Phone numbers | +1-555-123-4567 , (555) 123-4567 |
HASHTAG | Social media hashtags | #python , #nlp , #machinelearning |
MENTION | Social media mentions | @username , @company |
EMOJI | Emojis and emoticons | 😊 , 💰 , 🎉 |
PUNCTUATION | Punctuation marks | ! , ? , . , , |
DATETIME | Date and time | 12/25/2024 , 2:30PM , 2024-01-01 |
CONTRACTION | Contractions | don't , won't , it's |
HYPHENATED | Hyphenated words | state-of-the-art , multi-level |
Tip | Code Example | Benefit |
---|---|---|
Reuse Processor | processor = UltraNLPProcessor() then call processor.process() multiple times | Faster for multiple calls |
Batch Processing | Use batch_preprocess() for >20 documents | Parallel processing speedup |
Disable Spell Correction | {'spell_correct': False} (default) | Much faster processing |
Customize Workers | batch_preprocess(texts, max_workers=8) | Optimize for your CPU cores |
Cache Results | Store results for repeated texts | Avoid reprocessing same content |
Error Type | Cause | Solution |
---|---|---|
ImportError: bs4 | BeautifulSoup4 not installed | pip install beautifulsoup4 |
TypeError: 'NoneType' | Passing None as text | Check input text is not None |
AttributeError | Wrong method name | Check spelling of method names |
MemoryError | Processing very large texts | Use batch processing with smaller chunks |
Function | Purpose | Example |
---|---|---|
get_performance_stats() | Monitor processing performance | processor.get_performance_stats() |
token.to_dict() | Convert token to dictionary for inspection | token.to_dict() |
len(result['tokens']) | Check number of tokens | Quick validation |
result['token_objects'] | Inspect detailed token information | Debug tokenization issues |
What makes our tokenization special:
$20
, ₹100
, 20USD
, 100Rs
user@domain.com
, support@company.co.uk
#hashtag
, @mention
+1-555-123-4567
, (555) 123-4567
https://example.com
, www.site.com
12/25/2024
, 2:30PM
😊
, 💰
, 🎉
(handles attached to text)don't
, won't
, it's
state-of-the-art
, multi-threaded
Library | Speed (1M documents) | Memory Usage |
---|---|---|
NLTK | 45 minutes | 2.1 GB |
spaCy | 12 minutes | 1.8 GB |
TextBlob | 38 minutes | 2.5 GB |
UltraNLP | 3 minutes | 0.8 GB |
Performance features:
Feature | NLTK | spaCy | TextBlob | UltraNLP |
---|---|---|---|---|
Currency tokens ($20 , ₹100 ) | ❌ | ❌ | ❌ | ✅ |
Email detection | ❌ | ❌ | ❌ | ✅ |
Social media (# , @ ) | ❌ | ❌ | ❌ | ✅ |
Emoji handling | ❌ | ❌ | ❌ | ✅ |
HTML cleaning | ❌ | ❌ | ❌ | ✅ |
URL removal | ❌ | ❌ | ❌ | ✅ |
Spell correction | ❌ | ❌ | ✅ | ✅ |
Batch processing | ❌ | ✅ | ❌ | ✅ |
Memory efficient | ❌ | ❌ | ❌ | ✅ |
One-line setup | ❌ | ❌ | ❌ | ✅ |
FAQs
Ultra-fast, comprehensive NLP preprocessing library with advanced tokenization
We found that ultranlp demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
/Security News
A flawed sandbox in @nestjs/devtools-integration lets attackers run code on your machine via CSRF, leading to full Remote Code Execution (RCE).
Product
Customize license detection with Socket’s new license overlays: gain control, reduce noise, and handle edge cases with precision.
Product
Socket now supports Rust and Cargo, offering package search for all users and experimental SBOM generation for enterprise projects.