# TextPrettifier
TextPrettifier is a Python library for cleaning text data: it removes HTML tags, URLs, numbers, special characters, contractions, and stopwords. It also offers asynchronous processing and multithreading for handling large texts efficiently.
## Key Features

### Text Cleaning Features
1. **Removing Emojis**: The `remove_emojis` method removes emojis from the text.
2. **Removing Internet Words**: The `remove_internet_words` method removes internet-specific words from the text.
3. **Removing HTML Tags**: The `remove_html_tags` method removes HTML tags from the text.
4. **Removing URLs**: The `remove_urls` method removes URLs from the text.
5. **Removing Numbers**: The `remove_numbers` method removes numbers from the text.
6. **Removing Special Characters**: The `remove_special_chars` method removes special characters from the text.
7. **Expanding Contractions**: The `remove_contractions` method expands contractions in the text.
8. **Removing Stopwords**: The `remove_stopwords` method removes stopwords from the text.
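To illustrate what these cleaning steps do conceptually, here is a minimal regex-based sketch using only Python's standard library. This is an illustration of the technique, not TextPrettifier's actual implementation, and the tiny stopword set is a placeholder:

```python
import re

# A deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def strip_html_tags(text: str) -> str:
    # Drop anything between < and >
    return re.sub(r"<[^>]+>", "", text)

def strip_urls(text: str) -> str:
    return re.sub(r"https?://\S+", "", text)

def strip_numbers(text: str) -> str:
    return re.sub(r"\d+", "", text)

def strip_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

raw = "<p>Visit https://example.com to see the 123 results</p>"
clean = strip_stopwords(strip_numbers(strip_urls(strip_html_tags(raw))))
print(clean)  # → Visit see results
```

Chaining the steps in this order (tags first, then URLs, numbers, stopwords) mirrors the kind of pipeline the methods above provide individually.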
### Advanced Processing Features
9. **Asynchronous Processing**: All methods have async counterparts prefixed with `a` (e.g., `aremove_emojis`) for non-blocking operations.
10. **Batch Processing**: Process multiple texts in parallel with `process_batch` and `aprocess_batch`.
11. **Chunked Processing for Large Texts**: Efficiently process large texts with `chunk_and_process` and `achunk_and_process`.
12. **Lemmatization and Stemming**: Apply lemmatization or stemming to the text with dedicated methods.
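The parallel batch pattern can be sketched with Python's standard `concurrent.futures`. This is a sketch of the general technique, not TextPrettifier's internals; `clean` is a hypothetical stand-in for a real cleaning pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def clean(text: str) -> str:
    # Hypothetical stand-in for a real cleaning step.
    return text.strip().lower()

def process_batch(texts, max_workers=4):
    # Run the cleaning step on each text in a worker thread.
    # pool.map preserves input order, so results line up with texts.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean, texts))

print(process_batch(["  Hello ", "WORLD"]))  # → ['hello', 'world']
```

Because `map` yields results in input order, callers can safely `zip` the originals with the cleaned outputs, as the examples below do.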
## Installation

You can install TextPrettifier using pip:

```bash
pip install text-prettifier
```
## Quick Start

### Basic Usage

```python
from text_prettifier import TextPrettifier

text_prettifier = TextPrettifier()

# Remove only emojis
html_text = "Hi, Pythonogist! I ❤️ Python."
cleaned_html = text_prettifier.remove_emojis(html_text)
print(cleaned_html)

# Apply all cleaning steps at once
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text, is_lower=True)
print(all_cleaned)

# Get the cleaned text back as a list of tokens
tokens = text_prettifier.sigma_cleaner(all_text, is_token=True, is_lower=True)
print(tokens)
```
### Asynchronous Processing

```python
import asyncio
from text_prettifier import TextPrettifier

async def process_text():
    text_prettifier = TextPrettifier()
    text = "Hello, @world! 123 I can't believe it. 😊"
    result = await text_prettifier.asigma_cleaner(text, is_lower=True)
    print(result)

asyncio.run(process_text())
```
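Because the `a`-prefixed methods are coroutines, several texts can be cleaned concurrently with `asyncio.gather`. A sketch of that pattern, using a hypothetical coroutine in place of the real cleaner:

```python
import asyncio

async def aclean(text: str) -> str:
    # Hypothetical async cleaning step standing in for asigma_cleaner.
    await asyncio.sleep(0)  # yield control, as real async work would
    return text.lower()

async def main():
    texts = ["ALPHA", "BETA", "GAMMA"]
    # gather schedules all coroutines and returns results in input order
    results = await asyncio.gather(*(aclean(t) for t in texts))
    print(results)  # → ['alpha', 'beta', 'gamma']

asyncio.run(main())
```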
### Batch Processing

```python
import asyncio
from text_prettifier import TextPrettifier

# max_workers controls how many texts are processed in parallel
text_prettifier = TextPrettifier(max_workers=4)

texts = [
    "Hello, how are you? 😊",
    "<p>This is HTML</p> content",
    "Visit https://example.com for more info",
    "I can't believe it's not butter!",
]

# Synchronous batch processing
results = text_prettifier.process_batch(texts, is_lower=True)
for text, result in zip(texts, results):
    print(f"Original: {text}")
    print(f"Cleaned: {result}")
    print()

# Asynchronous batch processing
async def process_async():
    results = await text_prettifier.aprocess_batch(texts, is_lower=True)
    for text, result in zip(texts, results):
        print(f"Original: {text}")
        print(f"Cleaned: {result}")
        print()

asyncio.run(process_async())
```
### Processing Large Texts

```python
import asyncio
from text_prettifier import TextPrettifier

text_prettifier = TextPrettifier()

large_text = "Hello, this is a sample text with some HTML <p>tags</p> and URLs https://example.com and emojis 😊" * 1000

# Process the text in fixed-size chunks
result = text_prettifier.chunk_and_process(
    large_text,
    chunk_size=5000,
    is_lower=True,
    keep_numbers=True,
)
print(f"Original length: {len(large_text)}")
print(f"Processed length: {len(result)}")

# Asynchronous variant
async def process_large_async():
    result = await text_prettifier.achunk_and_process(
        large_text,
        chunk_size=5000,
        is_lower=True,
        keep_numbers=True,
    )
    print(f"Original length: {len(large_text)}")
    print(f"Processed length: {len(result)}")

asyncio.run(process_large_async())
```
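Conceptually, chunked processing splits the text into fixed-size pieces, cleans each piece, and rejoins them in order. A minimal sketch of that idea (not the library's actual implementation, which may also split on word boundaries):

```python
def chunk_and_process_sketch(text, chunk_size, clean=str.lower):
    # Slice the text into fixed-size chunks, clean each one,
    # and concatenate the results in their original order.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return "".join(clean(c) for c in chunks)

print(chunk_and_process_sketch("HELLO WORLD", chunk_size=4))  # → hello world
```

One design caveat worth noting: naive fixed-size slicing can cut a word (or an HTML tag or URL) across a chunk boundary, so per-chunk cleaning is most reliable when chunk boundaries respect token boundaries.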
### Lemmatization and Stemming

```python
from text_prettifier import TextPrettifier

text_prettifier = TextPrettifier()
text = "I am running in the park with friends"

# Lemmatization
lemmatized = text_prettifier.sigma_cleaner(text, is_lemmatize=True)
print(lemmatized)

# Stemming
stemmed = text_prettifier.sigma_cleaner(text, is_stemming=True)
print(stemmed)
```
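To make the difference concrete: stemming strips suffixes mechanically, while lemmatization maps words to dictionary forms using a vocabulary. Here is a toy suffix-stripping stemmer, an illustration only; real stemmers such as the Porter algorithm handle far more cases:

```python
def naive_stem(word: str) -> str:
    # Strip the first matching suffix, keeping at least 3 characters.
    # Note the over-trimming ("running" → "runn", not "run"), which is
    # why production stemmers use more elaborate rule sets.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in "running jumped cats".split()])
# → ['runn', 'jump', 'cat']
```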
## Advanced Configuration

TextPrettifier supports several configuration options:

```python
from text_prettifier import TextPrettifier

# Configure the size of the thread pool used for parallel processing
text_prettifier = TextPrettifier(max_workers=8)

# Combine multiple options in a single call
text = "I am running in the park with friends"
result = text_prettifier.sigma_cleaner(
    text,
    is_token=True,       # return a list of tokens instead of a string
    is_lower=True,       # lowercase the text
    is_lemmatize=True,   # apply lemmatization
    is_stemming=False,   # skip stemming
    keep_numbers=True,   # keep numbers instead of removing them
)
```
## Contact Information
Feel free to reach out to me on social media:

## License
This project is licensed under the MIT License - see the LICENSE file for details.