data-preprocessors

An easy-to-use tool for data preprocessing, especially text preprocessing

  • 0.58.0
  • PyPI

Installation

Install the latest stable release
For Windows

pip install -U data-preprocessors

For Linux/WSL2

pip3 install -U data-preprocessors

Quick Start

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

>> bla bla bla bla

Features

Split Textfile

This function splits a text file into three separate files: train, validation, and test. Set shuffle to True to randomize the order of the lines before splitting, and change seed to get a different (but reproducible) shuffle.

from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)

# Total lines:  500
# Train set size:  300
# Validation set size:  100
# Test set size:  100
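For intuition, a proportional split with an optional seeded shuffle can be sketched in plain Python. This is not the package's implementation: `split_lines` is a hypothetical helper that works on an in-memory list rather than on files, and the rounding of uneven sizes shown here is an assumption.

```python
import random

def split_lines(lines, train_size=0.6, val_size=0.2, test_size=0.2,
                shuffle=True, seed=42):
    """Hypothetical helper: split a list of lines into train/val/test parts."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size and test_size must sum to 1")
    lines = list(lines)  # copy so the caller's list is not reordered
    if shuffle:
        random.Random(seed).shuffle(lines)  # same seed gives the same order
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

lines = [f"line {i}" for i in range(500)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))  # prints: 300 100 100
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.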

Separate Parallel Corpus

This function separates a combined parallel corpus (src_tgt_file), in which each line holds a source and a target sentence joined by separator, into a source file (src_file) and a target file (tgt_file).

from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
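The per-line logic behind such a separation might look like the sketch below. It is only an illustration under assumptions: `split_parallel_lines` is a hypothetical helper operating on a list of strings, and skipping lines that lack the separator is a choice made here, not documented behavior of the package.

```python
def split_parallel_lines(lines, separator="|||"):
    """Hypothetical helper: split 'source ||| target' lines into two lists."""
    src_lines, tgt_lines = [], []
    for line in lines:
        line = line.rstrip("\n")
        if separator not in line:
            continue  # assumption: silently skip lines without the separator
        src, tgt = line.split(separator, 1)  # split on the first separator only
        src_lines.append(src.strip())
        tgt_lines.append(tgt.strip())
    return src_lines, tgt_lines

corpus = ["Hello world ||| Bonjour le monde\n", "Thank you ||| Merci\n"]
src, tgt = split_parallel_lines(corpus)
print(src)  # ['Hello world', 'Thank you']
print(tgt)  # ['Bonjour le monde', 'Merci']
```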

Decontracting Words from Sentence

tp.decontracting_words(sentence)
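The README does not describe how contractions are expanded. A common approach is a small list of regular-expression rules, sketched below. The rule set and the name `expand_contractions` are assumptions for illustration and may not match what `decontracting_words` actually does.

```python
import re

# Illustrative rules only; the package's actual rule set may differ.
# Specific cases come first so the generic "n't" rule does not mangle them.
CONTRACTION_RULES = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(sentence):
    """Hypothetical helper: expand common English contractions."""
    for pattern, replacement in CONTRACTION_RULES:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

print(expand_contractions("I can't go, but they're sure we'll manage."))
# I cannot go, but they are sure we will manage.
```

The ambiguous `'s` ending (possessive or "is") is deliberately left out of this sketch.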

Remove Punctuation

This function removes the punctuation from a single line of text.

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
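As a rough idea of how punctuation removal can be done with the standard library alone, here is a minimal sketch. `strip_punctuation` is a hypothetical name, and treating only the ASCII marks in `string.punctuation` as punctuation is an assumption; the package may handle a different set.

```python
import string

def strip_punctuation(sentence):
    """Hypothetical helper: delete every ASCII punctuation character."""
    # A translation table whose third argument lists characters to delete
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table)

print(strip_punctuation("bla! bla- ?bla ?bla."))  # bla bla bla bla
```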

Space Punctuation

This function adds one space on both sides of each punctuation mark so that the sentence is easier to tokenize. It operates on a single line of text, but it can also be applied to a full text file.

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)

# The punctuation marks are kept, each padded with spaces (exact spacing depends on the implementation)
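To make the idea of padding punctuation concrete, the following sketch uses a regular expression. It is an assumption about the general technique, not the package's code: `pad_punctuation` is a hypothetical name, and collapsing repeated spaces afterwards is a choice made here.

```python
import re
import string

# Character class matching any single ASCII punctuation mark
PUNCT_PATTERN = re.compile("([" + re.escape(string.punctuation) + "])")

def pad_punctuation(sentence):
    """Hypothetical helper: surround each punctuation mark with spaces."""
    padded = PUNCT_PATTERN.sub(r" \1 ", sentence)
    return " ".join(padded.split())  # collapse runs of whitespace

print(pad_punctuation("bla! bla- ?bla ?bla."))  # bla ! bla - ? bla ? bla .
```

After padding, a plain `split()` yields words and punctuation marks as separate tokens.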

Text File to List

Convert any text file into list.

mylist = tp.text2list(myfile_path="myfile.txt")

List to Text File

Convert any list into a text file (filename.txt)

tp.list2text(mylist=mylist, myfile_path="myfile.txt")
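The two conversions are inverses of each other. A standard-library sketch of the round trip is shown below; `read_lines` and `write_lines` are hypothetical stand-ins, and the one-item-per-line layout with stripped newlines is an assumption about how `text2list` and `list2text` behave.

```python
import os
import tempfile

def read_lines(path):
    """Hypothetical stand-in for text2list: one list item per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def write_lines(items, path):
    """Hypothetical stand-in for list2text: one line per list item."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(f"{item}\n")

# Write a list to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    write_lines(["first line", "second line"], path)
    round_trip = read_lines(path)

print(round_trip)  # ['first line', 'second line']
```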

Count Characters of a Sentence

This function counts the total number of characters. Note that the call shown here takes a file path (the myfile argument) rather than a sentence string.

tp.count_chars(myfile="file.txt")

Convert Excel to Multiple Text Files

This function converts the columns of an Excel file into multiple text files.

tp.excel2multitext(excel_file_path="",
                    column_names=None,
                    src_file="",
                    tgt_file="",
                    aligns_file="",
                    separator="|||",
                    src_tgt_file="",
                    )

Apply a Function to a Whole Text File

Pass any function in place of function_name, and it will be applied to the entire text file.

from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name, 
    myfile_path="myfile.txt", 
    modified_file_path="modified_file.txt"
)
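One plausible reading is that the function is applied line by line and the results are written to the modified file. The sketch below shows that pattern with the standard library; `apply_to_file` is a hypothetical helper, and the line-by-line behavior is an assumption rather than documented behavior of `apply_whole`.

```python
import os
import tempfile

def apply_to_file(func, in_path, out_path):
    """Hypothetical helper: apply func to each line and write the results."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(func(line.rstrip("\n")) + "\n")

# Demonstrate on a temporary file, using str.upper as the per-line function
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "myfile.txt")
    out_path = os.path.join(tmp, "modified_file.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("hello\nworld\n")
    apply_to_file(str.upper, in_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Any function that maps one string to another, such as a punctuation remover, could be passed in the same way.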
