Toolkit for basic Natural Language Processing tasks
This package is a toolkit (a collection of functions) for performing basic tasks in the initial steps of Natural Language Processing. It is aimed at Brazilian Portuguese (pt-BR) usage.
Portuguese version available at:
Functionalities
- Cleaning text;
- Text analysis;
- Text pre-processing for subsequent insertion into natural language training models;
- Easy integration with other Python programs by importing the desired module(s) or function.
Installation
This package can be installed with pip:
pip install pre-processing-text-basic-tools
Usage/Examples
Removing simple special characters
from pre_processing_text_basic_tools import removeSpecialCharacters
text = "Is this an exa@mple of $text? with special# character.s. I want to clean it!!!"
cleaned_text = removeSpecialCharacters(text)
print(cleaned_text)
>>>"This is an example of text with special characters I want to clean it"
Important note about hyphenated words (click to expand)
It is important to highlight that these functions were designed for direct application to the Portuguese language. Therefore, hyphenated words such as "sexta-feira" do not have their special character "-" removed by default, but you can choose to remove the hyphens from such words by setting the remove_hyphen_from_words parameter to True. Furthermore, if you do not want such characters to be replaced by a blank space " ", you can set the personalized_treatment parameter to False; this treatment is what replaces characters like "/" and "\" with " " (a sketch follows the example below).
from pre_processing_text_basic_tools import removeSpecialCharacters
text = "Today is sexta-feira and 03/09/2024! Or even 03-09-2024."
cleaned_text = removeSpecialCharacters(text, remove_hyphen_from_words=True)
print(cleaned_text)
>>>"Today is sexta feira and 03 09 2024 Or even 03 09 2024"
Full text formatting and standardization
from pre_processing_text_basic_tools import formatText
text = "This is an example, of $text? I want/ t.o# format and&*. standardize!?"
formatted_text = formatText(text_string=text,
standardize_lower_case=True,
remove_special_characters=True,
remove_morethanspecial_characters=True,
remove_extra_blank_spaces=True,
standardize_canonic_form=True)
print(formatted_text)
>>>"this is an example of text I want to format and standardize"
Standardization of diverse elements
from pre_processing_text_basic_tools import formatText
text = '''If I have a text with an email like esteehumemail@gmail.com or
noreply@hotmail.com or even emaildeteste@yahoo.com.br.
In addition, I will also have several telephone numbers such as +55 48 911223344 or
4890011-2233 and why not a landline like 48 0011-2233?
You can also have dates such as 12/12/2024 or 2023-06-12 in different formats,
like 1/2/24.
What if the text has a lot of money involved? We are talking about R$200,000.00 or
R$200.00 or even with
the wrong formatting like R$2500!
Furthermore we can simply standardize numbers like 123123 or 24 or
129381233 or even 1,200,234!'''
formatted_text = formatText(text_string=text,
standardize_canonic_form=True,
standardize_dates=True,
standard_date='_data_',
standardize_money=True,
standard_money='$',
standardize_emails=True,
standard_email='_email_',
standardize_celphones=True,
standard_celphone='_tel_',
standardize_numbers=True,
standard_number='0',
standardize_lower_case=True)
print(formatted_text)
>>>"""if i have a text with an email like _email_ or
_email_ or even _email_
in addition i will also have several telephone numbers such as _tel_ or
_tel_ and why not a landline like _tel_
you can also have dates such as _data_ or _data_ in different formats
like _data_
what if the text has a lot of money involved we are talking about $ or
$ or even with
the wrong formatting like $
furthermore we can simply standardize numbers like 0 or 0 or
0 or even 0"""
Text tokenization
Basic tokenization
from pre_processing_text_basic_tools import tokenizeText
text = '''This is another example text for tokenization!!! Let's use characters,
specials# too @igorc.s and $follow there?!'''
tokenization = tokenizeText(text)
print(tokenization)
>>>['this', 'is', 'another', 'example', 'text', 'for', 'tokenization', 'lets',
'use', 'characters', 'specials', 'too', 'igorcs', 'and', 'follow', 'there']
Tokenization removing stopwords (click to expand)
Stopwords are words that carry little meaning in a sentence, so some applications remove them from the text corpus to reduce processing and training time. Common examples of stopwords are articles and prepositions.
from pre_processing_text_basic_tools import tokenizeText
text = '''O menino gosta de comer frutas e verduras!'''  # "The boy likes to eat fruits and vegetables!"
tokenization = tokenizeText(text,remove_stopwords=True)
print(tokenization)
>>>['menino', 'gosta', 'comer', 'frutas', 'verduras']
Tokenization removing stopwords with custom stopwords list (click to expand)
We can also use a personalized list of stopwords, adding to or removing from the default list standard_list_with_stopwords_for_tokenization, or even creating a completely unique list (see the sketch after the example below).
from pre_processing_text_basic_tools import tokenizeText
from pre_processing_text_basic_tools import standard_list_with_stopwords_for_tokenization
text = '''This is an example of usage! That is cool for some people, but not for others.'''
custom_stopwords_list = standard_list_with_stopwords_for_tokenization + ['the','a','an','for','this','that','of','is']
tokenization = tokenizeText(text_string=text,
remove_stopwords=True,
list_of_stopwords=custom_stopwords_list)
print(tokenization)
>>>['example', 'usage', 'cool', 'some', 'people', 'but', 'not', 'others']
More complete tokenization (click to expand)
You can also apply formatting before the tokenization process. In the example below, the text is converted to canonical form before tokenizing it; in other words, words like "coração" become "coracao", losing accents, the cedilla "ç", etc.
from pre_processing_text_basic_tools import tokenizeText,formatText
text = "Este é um exemplo para a ficção científica. Vôo alto! Açaí é bom demais!"
formatted_text = formatText(text_string=text,standardize_canonic_form=True)
tokenization = tokenizeText(text_string=formatted_text,
remove_stopwords=True)
print(tokenization)
>>>['este', 'um', 'exemplo', 'para', 'ficcao', 'cientifica', 'voo', 'alto',
'acai', 'bom', 'demais']
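Putting the pieces together, a typical pipeline normalizes the text with formatText and then tokenizes the result. A minimal sketch using only parameters shown above (the sample text is illustrative, and the printed tokens depend on the default stopword list):

from pre_processing_text_basic_tools import formatText, tokenizeText

text = '''Contact us at example@mail.com! It costs R$250.00, see you sexta-feira 12/12/2024.'''

# Step 1: normalize the raw text
formatted_text = formatText(text_string=text,
                            standardize_lower_case=True,
                            standardize_canonic_form=True,
                            standardize_emails=True,
                            standard_email='_email_',
                            standardize_money=True,
                            standard_money='$',
                            standardize_dates=True,
                            standard_date='_data_')

# Step 2: tokenize the normalized text, removing stopwords
tokens = tokenizeText(text_string=formatted_text, remove_stopwords=True)
print(tokens)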
Authors
Used by
This project is used in the text pre-processing stage of the WOKE project of the Grupo de Estudos e Pesquisa em IA e História ("Study and Research Group on AI and History") at UFSC ("Federal University of Santa Catarina"):