Data Preprocessors
An easy-to-use tool for data preprocessing, especially text preprocessing.

Installation
Install the latest stable release
For Windows
pip install -U data-preprocessors
For Linux/WSL2
pip3 install -U data-preprocessors
Quick Start
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
>> bla bla bla bla
Features
Split Textfile
This function splits your text file into three separate text files: train, validation, and test. By changing the shuffle and seed values, you can randomly shuffle the lines of your text file.
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)
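To illustrate how the size, shuffle, and seed parameters interact, here is a minimal standalone sketch. It does not reproduce the library's implementation; `split_lines` is a hypothetical helper that works on an in-memory list, and the rounding behavior is an assumption.

```python
import random


def split_lines(lines, train_size=0.6, val_size=0.2, shuffle=True, seed=42):
    """Split a list of lines into train, validation, and test portions.

    Hypothetical helper for illustration only; not part of data_preprocessors.
    """
    lines = list(lines)  # copy so the caller's list is not mutated
    if shuffle:
        # A dedicated Random instance makes the order reproducible per seed.
        random.Random(seed).shuffle(lines)
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


lines = [f"line {i}" for i in range(10)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))
```

Running the sketch twice with the same seed yields the same split, which is the usual reason for exposing a seed parameter.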
Separate Parallel Corpus
This function separates src_tgt_file into two files, src_file and tgt_file.
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
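The idea behind separating a parallel corpus can be sketched on in-memory lines as follows. This is an illustration, not the library's code: `separate_pairs` is a hypothetical helper, and skipping lines without the separator and stripping surrounding whitespace are assumptions.

```python
def separate_pairs(lines, separator="|||"):
    """Split 'source ||| target' lines into two parallel lists.

    Hypothetical helper for illustration only; not part of data_preprocessors.
    """
    src, tgt = [], []
    for line in lines:
        line = line.rstrip("\n")
        if separator not in line:
            continue  # assumption: lines without the separator are skipped
        left, right = line.split(separator, 1)  # split on the first separator only
        src.append(left.strip())
        tgt.append(right.strip())
    return src, tgt


pairs = ["hello ||| bonjour", "thank you ||| merci"]
src_lines, tgt_lines = separate_pairs(pairs)
print(src_lines)
print(tgt_lines)
```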
Decontracting Words from Sentence
tp.decontracting_words(sentence)
Remove Punctuation
This function removes the punctuation from a single line of text.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
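For readers who want to see one way punctuation removal can work, here is a minimal sketch using only the standard library. It is not necessarily how remove_punc is implemented; `strip_punctuation` is a hypothetical name, and it assumes "punctuation" means the ASCII characters in `string.punctuation`.

```python
import string


def strip_punctuation(sentence):
    """Delete every ASCII punctuation character from the sentence.

    Hypothetical helper for illustration only; not part of data_preprocessors.
    """
    # Translation table that maps each punctuation character to None (deleted).
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table)


print(strip_punctuation("bla! bla- ?bla ?bla."))  # prints: bla bla bla bla
```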
Space Punctuation
This function adds one space on both sides of each punctuation mark, which makes the sentence easier to tokenize. It operates on a single line of text, but it can also be applied to a full text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)
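The padding step can be sketched with a regular expression. This is an assumption about the general technique, not the library's exact behavior: `pad_punctuation` is a hypothetical name, and collapsing repeated spaces afterwards is a design choice of this sketch.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNC = re.compile("([" + re.escape(string.punctuation) + "])")


def pad_punctuation(sentence):
    """Surround each punctuation character with spaces, then tidy whitespace.

    Hypothetical helper for illustration only; not part of data_preprocessors.
    """
    padded = _PUNC.sub(r" \1 ", sentence)
    # Collapse runs of whitespace so tokens are separated by exactly one space.
    return " ".join(padded.split())


print(pad_punctuation("bla! bla- ?bla ?bla."))  # prints: bla ! bla - ? bla ? bla .
```

After padding, a plain `split()` is enough to tokenize the sentence into words and punctuation marks.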
Text File to List
Convert any text file into a list.
mylist = tp.text2list(myfile_path="myfile.txt")
List to Text File
Convert any list into a text file (filename.txt).
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
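The two conversions above are inverses of each other. A minimal sketch of such a round trip, assuming one list item per line and UTF-8 encoding, might look like this; `read_lines` and `write_lines` are hypothetical stand-ins, not the library's functions.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file into a list, one item per line, without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(lines, path):
    """Write each list item to the file on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


# Round trip through a temporary file so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    write_lines(["first line", "second line"], path)
    round_trip = read_lines(path)

print(round_trip)
```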
Count Characters of a Sentence
This function counts the total number of characters in a sentence.
tp.count_chars(myfile="file.txt")
Convert Excel to Multiple Text Files
This function converts an Excel file's columns into multiple text files.
tp.excel2multitext(
    excel_file_path="",
    column_names=None,
    src_file="",
    tgt_file="",
    aligns_file="",
    separator="|||",
    src_tgt_file="",
)
Apply a function in whole text file
In place of function_name, you can pass any function, and that function will be applied to the whole text file.
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name,
    myfile_path="myfile.txt",
    modified_file_path="modified_file.txt"
)
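Conceptually, applying a function to a whole file means calling it on each line and writing the results to a new file. The sketch below illustrates that pattern under the assumption of line-by-line processing; `apply_to_file` is a hypothetical helper, not the library's apply_whole.

```python
import os
import tempfile


def apply_to_file(func, in_path, out_path):
    """Apply func to every line of in_path and write the results to out_path.

    Hypothetical helper for illustration only; not part of data_preprocessors.
    """
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip the newline before calling func, then add it back.
            dst.write(func(line.rstrip("\n")) + "\n")


# Demonstrate with a temporary file and the built-in str.upper.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "myfile.txt")
    out_path = os.path.join(tmp, "modified_file.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("hello\nworld\n")
    apply_to_file(str.upper, in_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Any function that takes a string and returns a string fits this pattern, which is why the sentence-level functions in this document can be reused on full files.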