
A Python library for cleaning and preprocessing text data by removing emojis, internet words, special characters, digits, HTML tags, URLs, and stopwords.
TextPrettifier is a Python library for cleaning text data by removing HTML tags, URLs, numbers, special characters, emojis, internet words, and stopwords, and by expanding contractions.
The remove_emojis method removes emojis from the text.
The remove_internet_words method removes internet-specific words from the text.
The remove_html_tags method removes HTML tags from the text.
The remove_urls method removes URLs from the text.
The remove_numbers method removes numbers from the text.
The remove_special_chars method removes special characters from the text.
The remove_contractions method expands contractions in the text.
The remove_stopwords method removes stopwords from the text.
If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens. If only is_lower is True, the text is returned in lowercase. If only is_token is True, the text is returned as a list of tokens. If neither is_lower nor is_token is True, the text is returned as is.
You can install TextPrettifier using pip:
pip install text-prettifier
from text_prettifier import TextPrettifier
text_prettifier = TextPrettifier()
emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)
Output Hi,Pythonogist! I Python.
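The remove_internet_words method listed above has no example of its own; here is a minimal sketch, assuming it takes a plain string and strips common internet slang (the exact word list is defined by the library):
slang_text = "brb lol I will finish the Python script"
cleaned_slang = text_prettifier.remove_internet_words(slang_text)
print(cleaned_slang)
# Words such as "brb" and "lol" are expected to be dropped; the exact result depends on the library's word list.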
html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)
Output Hello,world!
url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)
Output Visit our website at
number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)
Output There are apples
special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)
Output Hello world
contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)
Output I cannot do it
stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)
Output This test
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)
Output Hello world 123 apples cannot test
If you are interested in tokenizing and lowercasing the cleaned text, write the following code:
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text,is_token=True,is_lower=True)
print(all_cleaned)
Output ['hello', 'world', '123', 'apples', 'cannot', 'test']
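The remaining flag combinations described above can be exercised the same way; a minimal sketch, assuming each flag can also be passed on its own:
lower_only = text_prettifier.sigma_cleaner(all_text, is_lower=True)
print(lower_only)   # cleaned text as a lowercase string
tokens_only = text_prettifier.sigma_cleaner(all_text, is_token=True)
print(tokens_only)  # cleaned text as a list of tokens, casing preserved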
Note: I didn't include remove_numbers in sigma_cleaner because numbers sometimes carry useful information in terms of NLP. If you want to remove numbers, you can apply remove_numbers separately to the output of sigma_cleaner.
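For instance, a minimal sketch of that two-step approach, assuming sigma_cleaner is called without is_token so its string output can be passed straight to remove_numbers:
all_cleaned = text_prettifier.sigma_cleaner(all_text)
no_numbers = text_prettifier.remove_numbers(all_cleaned)
print(no_numbers)
# "123" is expected to be dropped from the cleaned text.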
Feel free to reach out to me on social media.
This project is licensed under the MIT License - see the LICENSE file for details.