pybangla

pybangla is a Bangla text normalizer. It is used for text normalization tasks such as word-to-number conversion and date formatting.

2.10.0
PyPI

Citation Paper: BnVITS: Voice Cloning in Bangla with Minimal Audio Samples

PyBangla:

PyBangla is a Python 3 package for Bangla number, date-time, and text normalization, and for date extraction. It can normalize numbers and dates in text (for example, number to text and vice versa). It can also be used within Django, Flask, FastAPI, and other frameworks. PyBangla supports Linux/Unix, macOS, and Windows.

Features available in PyBangla:

  • Text Normalization
  • Number Conversion
  • Date Format
  • Emoji Removal
  • Months, Weekdays, Seasons

[N.B.: Every feature listed here is implemented as part of Text Normalization and can also be used on its own.]

Installation

The easiest way to install pybangla is to use pip:

pip install pybangla
#or
pip install git+https://github.com/saiful9379/pybangla.git
#or
git clone https://github.com/saiful9379/pybangla.git
cd pybangla
pip install -e .

Evaluation

For the evaluation, we selected 200 sentences. The dataset contains numerical values and has been normalized using PyBangla. We generated AI-based ground truth (GT) text and had it corrected by human annotators. The performance of our tool is evaluated using three key metrics: Word Error Rate (WER), Character Error Rate (CER), and Match Error Rate (MER).
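
As a rough illustration of how the first two metrics are usually computed, the sketch below derives WER and CER from a Levenshtein edit distance. It is not the evaluation script used for PyBangla; the function names are hypothetical, the reference text is assumed to be non-empty, and MER is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong word out of four gives a WER of 0.25
print(wer("a b c d", "a x c d"))
```

Here the ground-truth sentence plays the role of `reference` and the normalizer output plays the role of `hypothesis`; lower values mean the output is closer to the ground truth.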

PyBangla Evaluation

The performance of PyBangla was evaluated using 200 sentences. No evaluation report is available for versions earlier than V2.0.9. For PyBangla V2.0.9, we present the conversion accuracy as well as the processing time performance.

Conversion Accuracy

| Module Version | No. of Sentences | WER (Word Error Rate) | CER (Character Error Rate) | MER (Match Error Rate) |
|---|---|---|---|---|
| <= V2.0.8 | 200 | No evaluation report | No evaluation report | No evaluation report |
| V2.0.9 | 200 | 0.1291 | 0.0319 | 0.0975 |

N.B.: For more details and the full list of processing categories, please check: link

Processing Time Performance

| Module Version | Total Sentences | Raw Character Count | Normalized Character Count | Per Character Processing Time (sec) | Total Processing Time (sec) |
|---|---|---|---|---|---|
| 2.0.9 | 200 | 9,217 | 12,584 | 0.0001167 | 1.076 |

Interpretation

  • The text normalization process increased the character count from 9,217 to 12,584. Growth is expected when numbers, symbols, and abbreviations are expanded into words, alongside transformations such as Unicode normalization, diacritic removal, and standardization.
  • The average processing time per character was 0.0001167 seconds, resulting in a total processing time of 1.076 seconds for 200 sentences.
  • These metrics demonstrate the efficiency of PyBangla in handling Bangla text normalization.
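
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the table and assumes the per-character time is measured against the raw (pre-normalization) character count, which is what the reported values imply.

```python
# Figures reported for PyBangla 2.0.9
raw_chars = 9_217
normalized_chars = 12_584
total_seconds = 1.076
sentences = 200

per_char_seconds = total_seconds / raw_chars        # time per raw character
per_sentence_seconds = total_seconds / sentences    # average time per sentence
expansion_ratio = normalized_chars / raw_chars      # growth after normalization

print(f"{per_char_seconds:.7f} s per raw character")        # 0.0001167
print(f"{per_sentence_seconds * 1000:.2f} ms per sentence")  # 5.38
print(f"{expansion_ratio:.3f}x character growth")           # 1.365
```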

Usage

1. Text Normalization

It supports converting Bangla abbreviations, symbols, and currencies to Bangla textual format.

The text_normalizer method processes a given text by applying the normalization steps selected through boolean parameters.

Parameters:

  • text (str): The input text to be normalized.
  • all_operation (bool): Set to True to apply every operation, or to False to apply only the operations enabled individually below.
  • number_plate (bool, default=False): Converts or normalizes vehicle number plates if present in the text.
  • abbreviations (bool, default=False): Expands common abbreviations into their full forms.
  • year (bool, default=False): Handles and formats years correctly.
  • punctuation (bool, default=False): Removes or standardizes unwanted punctuation marks.
  • phone_number (bool, default=False): Extracts and normalizes phone numbers.
  • symbols (bool, default=False): Expands common symbols into their textual representation.
  • ordinals (bool, default=False): Converts ordinal numbers.
  • currency (bool, default=False): Converts currency values into words.
  • date (bool, default=False): Standardizes and normalizes date formats.
  • nid (bool, default=False): Converts national identification numbers (NID) into a textual format.
  • passport (bool, default=False): Normalizes passport numbers.
  • number (bool, default=False): Processes and converts numeric values into textual form.
  • emoji (bool, default=False): Removes emojis from text.

Returns:

  • str: The normalized text after applying the selected transformations.
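
Since every operation is a separate boolean flag, it can be convenient to build the keyword arguments programmatically. The helper below is a hypothetical convenience wrapper, not part of PyBangla; it only assumes that text_normalizer accepts the flag names listed above as keyword arguments.

```python
# Flag names copied from the parameter list above.
OPERATIONS = (
    "number_plate", "abbreviations", "year", "punctuation", "phone_number",
    "symbols", "ordinals", "currency", "date", "nid", "passport", "number",
    "emoji",
)


def build_flags(*enabled):
    """Return keyword arguments that switch on only the named operations."""
    unknown = set(enabled) - set(OPERATIONS)
    if unknown:
        raise ValueError(f"unknown operation(s): {sorted(unknown)}")
    flags = {name: name in enabled for name in OPERATIONS}
    flags["all_operation"] = False  # individual flags decide what runs
    return flags


# Usage with the library (requires pybangla to be installed):
#   nrml = pybangla.Normalizer()
#   nrml.text_normalizer(text, **build_flags("year", "currency"))
print(build_flags("year", "currency"))
```

Validating the names up front turns a misspelled flag into an immediate error instead of a silently skipped operation.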

Example:

We can enable all conversions with a single boolean parameter.

import pybangla
nrml = pybangla.Normalizer()
text = "āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇"
# Apply every normalization step in one call
output = nrml.text_normalizer(text, all_operation=True)
print(f"Input: {text}\nOutput: {output}")

# output:
'āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ āĻĒā§āϰāĻĨāĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ āϤ⧇āĻ¤ā§āϰāĻŋāĻļāϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āĻĻāĻļāĻŽāĻŋāĻ• āĻāĻ• āĻĻ⧁āχ āϤāĻŋāύ āϚāĻžāϰ āχāϝāĻŧ⧇āύ āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇'

The normalizer can also apply a single operation.

For example, if only year conversion is needed:

import pybangla
nrml = pybangla.Normalizer()
text = "āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇"
# Only year conversion is enabled
output = nrml.text_normalizer(text, all_operation=False, year=True)
print(f"Input: {text}\nOutput: {output}")

# output:
'āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇'

If only ordinal conversion is needed:

import pybangla
nrml = pybangla.Normalizer()
text = "āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇"
# Only ordinal conversion is enabled
output = nrml.text_normalizer(text, all_operation=False, ordinals=True)
print(f"Input: {text}\nOutput: {output}")

# output:
'āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ āĻĒā§āϰāĻĨāĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ āϤ⧇āĻ¤ā§āϰāĻŋāĻļāϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇'

If only currency conversion is needed:

import pybangla
nrml = pybangla.Normalizer()
text = "āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇"
# Only currency conversion is enabled
output = nrml.text_normalizer(text, all_operation=False, currency=True)
print(f"Input: {text}\nOutput: {output}")

# output:
'āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āĻĻāĻļāĻŽāĻŋāĻ• āĻāĻ• āĻĻ⧁āχ āϤāĻŋāύ āϚāĻžāϰ āχāϝāĻŧ⧇āύ āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇'

We can also apply multiple conversions at once.

import pybangla
nrml = pybangla.Normalizer()
text = "āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ ⧍ā§Ļā§Šā§Ļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ ÂĨ⧍ā§Ļā§Šā§Ļ.ā§§ā§¨ā§Šā§Ē āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇"
# Year and currency conversion are enabled together
output = nrml.text_normalizer(text, all_operation=False, year=True, currency=True)
print(f"Input: {text}\nOutput: {output}")

# output:
'āϰāĻžāĻšāĻŋāĻŽ āĻ•ā§āϞāĻžāϏ āĻ“ā§ŸāĻžāύ āĻ ā§§āĻŽ, āĻāĻ¨ā§āĻĄ āĻŦāĻžāϏāĻžāϰ āĻ•ā§āϞāĻžāϏ āĻ ā§Šā§Š āϤāĻŽ, āϏ⧇ āϜāĻ¨ā§āϝ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āĻļāϤāĻžāĻŦā§āĻĻā§€āϤ⧇ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āĻĻāĻļāĻŽāĻŋāĻ• āĻāĻ• āĻĻ⧁āχ āϤāĻŋāύ āϚāĻžāϰ āχāϝāĻŧ⧇āύ āĻĻāĻŋāϤ⧇ āĻšā§Ÿā§‡āϛ⧇'

For more information and examples of the Normalizer, check the link.

2. Number Conversion

Example:

import pybangla
nrml = pybangla.Normalizer()
text = "āφāĻŽāĻžāϕ⧇ āĻāĻ• āϞāĻ•ā§āώ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻāĻ• āϟāĻžāĻ•āĻž āĻĻā§‡ā§Ÿ āĻāĻ¨ā§āĻĄ āϤ⧁āĻŽāĻŋ āĻŦāĻŋāĻļ āĻšāĻžāϜāĻžāϰ āϟāĻžāĻ•āĻž āύāĻŋāĻ“ āĻāĻ¨ā§āĻĄ āĻāĻ• āϞāĻ•ā§āώ āϚāĻžāϰ āĻšāĻžāϜāĻžāϰ āĻĻ⧁āχāĻļ āĻāĻ• āϟāĻžāĻ•āĻž āĻāĻ• āĻĄāĻŦāϞ āĻĻ⧁āχ"
text = nrml.word2number(text)
print(text)
#output:
'āφāĻŽāĻžāϕ⧇ 102001 āϟāĻžāĻ•āĻž āĻĻā§‡ā§Ÿ āĻāĻ¨ā§āĻĄ āϤ⧁āĻŽāĻŋ 20000 āϟāĻžāĻ•āĻž āύāĻŋāĻ“ āĻāĻ¨ā§āĻĄ 104201 āϟāĻžāĻ•āĻž 122 '
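
To give a feel for how number words compose into a value, here is a toy sketch. It is not PyBangla's implementation: the vocabulary is a tiny hypothetical subset, and the function only handles a phrase made up entirely of number words.

```python
# Tiny, hypothetical vocabulary; a real normalizer needs far more entries.
UNITS = {"এক": 1, "দুই": 2, "তিন": 3, "চার": 4, "পাঁচ": 5, "বিশ": 20, "দুইশ": 200}
MULTIPLIERS = {"হাজার": 1_000, "লক্ষ": 100_000}


def words_to_int(phrase):
    """Convert a phrase of Bangla number words into an integer."""
    total, current = 0, 0
    for word in phrase.split():
        if word in UNITS:
            current += UNITS[word]  # accumulate until a multiplier arrives
        elif word in MULTIPLIERS:
            # A bare multiplier such as "হাজার" counts as one thousand
            total += max(current, 1) * MULTIPLIERS[word]
            current = 0
        else:
            raise ValueError(f"not a known number word: {word}")
    return total + current


print(words_to_int("এক লক্ষ দুই হাজার এক"))  # 102001
```

The library's word2number additionally has to locate number phrases inside running text and leave the surrounding words untouched, which this sketch does not attempt.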

For more information and examples of number conversion, check the link.

3. Date Format

Example:

import pybangla
nrml = pybangla.Normalizer()
date = "ā§Ļā§§-āĻāĻĒā§āϰāĻŋāϞ/⧍ā§Ļā§¨ā§Š"
date = nrml.date_format(date, language="bn")
print(date)
#output:


{'date': 'ā§Ļā§§', 'month': 'ā§Ē', 'year': '⧍ā§Ļā§¨ā§Š', 'txt_date': 'āĻāĻ•', 'txt_month': 'āĻāĻĒā§āϰāĻŋāϞ', 'txt_year': 'āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āϤ⧇āχāĻļ', 'weekday': 'āĻļāύāĻŋāĻŦāĻžāϰ', 'ls_month': 'āĻļā§āϰāĻžāĻŦāĻŖ', 'seasons': 'āĻŦāĻ°ā§āώāĻž'}

For more information and examples of date formatting, check the link.

4. Today, Months, Weekdays, Seasons

import pybangla
nrml = pybangla.Normalizer()
today = nrml.today()
print(today)

Output: 
{'date': 'ā§Šā§Ļ', 'month': 'āĻāĻĒā§āϰāĻŋāϞ', 'year': '⧍ā§Ļ⧍ā§Ē', 'txt_date': 'āĻ¤ā§āϰāĻŋāĻļ', 'txt_year': 'āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āϚāĻŦā§āĻŦāĻŋāĻļ', 'weekday': 'āĻŽāĻ™ā§āĻ—āϞāĻŦāĻžāϰ', 'ls_month': 'āĻļā§āϰāĻžāĻŦāĻŖ', 'seasons': 'āĻŦāĻ°ā§āώāĻž'}

For more information and examples of Today, Months, Weekdays, and Seasons, check the link.

New Feature

(Text normalization update) It now supports year conversions such as:

  • "ā§§ā§¯ā§Žā§­-āϰ" to "āωāύāĻŋāĻļāĻļā§‹ āϏāĻžāϤāĻžāĻļāĻŋ āĻāϰ"
  • "⧧⧝⧝ā§Ģ āϏāĻžāϞ⧇" to "āωāύāĻŋāĻļāĻļā§‹ āĻĒāρāϚāĻžāύāĻŦā§āĻŦāχ āϏāĻžāϞ⧇"
  • "⧍ā§Ļ⧍ā§Ŧ-⧍⧭" to "āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ›āĻžāĻŦā§āĻŦāĻŋāĻļ āϏāĻžāϤāĻžāĻļ"

It now also expands abbreviations for units of temperature:

  • "ā§Ēā§Ē°F" to "āϚ⧁⧟āĻžāĻ˛ā§āϞāĻŋāĻļ āĻĄāĻŋāĻ—ā§āϰ⧀ āĻĢāĻžāϰ⧇āύāĻšāĻžāχāϟ"
  • "ā§Ēā§Ē°C" to "āϚ⧁⧟āĻžāĻ˛ā§āϞāĻŋāĻļ āĻĄāĻŋāĻ—ā§āϰ⧀ āϏ⧇āϞāϏāĻŋ⧟āĻžāϏ"

Phone Number Processing

  • "01790-540211" to "āϜāĻŋāϰ⧋ āĻ“ā§ŸāĻžāύ āϏ⧇āϭ⧇āύ āύāĻžāχāύ āϜāĻŋāϰ⧋ āĻĢāĻžāχāĻ­ āĻĢā§‹āϰ āϜāĻŋāϰ⧋ āϟ⧁ āĻĄāĻžāĻŦāϞ āĻ“ā§ŸāĻžāύ"

import pybangla
nrml = pybangla.Normalizer()
number_string = nrml.process_phone_number("01790-540211")
print(number_string)
# Output:
āϜāĻŋāϰ⧋ āĻ“ā§ŸāĻžāύ āϏ⧇āϭ⧇āύ āύāĻžāχāύ āϜāĻŋāϰ⧋ āĻĢāĻžāχāĻ­ āĻĢā§‹āϰ āϜāĻŋāϰ⧋ āϟ⧁ āĻĄāĻžāĻŦāϞ āĻ“ā§ŸāĻžāύ
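
The output above reads the number digit by digit and collapses a repeated pair into "double". A minimal sketch of that reading rule follows, using English digit names for clarity. The function name is hypothetical and the rule is inferred from this one example, so the library may treat other cases (such as three repeated digits) differently.

```python
import re

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_phone_number(number):
    """Read a phone number digit by digit, saying 'double' for a repeated pair."""
    digits = re.sub(r"\D", "", number)  # drop hyphens, spaces, and other separators
    words = []
    i = 0
    while i < len(digits):
        name = DIGIT_NAMES[int(digits[i])]
        if i + 1 < len(digits) and digits[i + 1] == digits[i]:
            words.append(f"double {name}")
            i += 2  # consume both digits of the pair
        else:
            words.append(name)
            i += 1
    return " ".join(words)


print(spell_phone_number("01790-540211"))
# zero one seven nine zero five four zero two double one
```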

Compare Changes Between Two Strings

import pybangla
nrml = pybangla.Normalizer()

input1 = "⧧⧝⧝ā§ŦāϏāĻžāϞ⧇āϰ ā§Ŧ āϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰāϰāĻŖ āĻ­ā§āϰāĻŽāĻŖ āĻĒāϰāĻŋāĻ•āĻ˛ā§āĻĒāύāĻž āĻ•āϰāĻ›āĻŋ ⧍ā§Ļā§Šā§ĻāϏāĻžāϞ⧇āϰ ā§ŦāϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰ"

input2 = "āωāύāĻŋāĻļāĻļā§‹ āĻ›āĻŋ⧟āĻžāύāĻŦā§āĻŦāχ āϏāĻžāϞ⧇āϰ āĻ›ā§Ÿ āϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰ āϰāĻŖ āĻ­ā§āϰāĻŽāĻŖ āĻĒāϰāĻŋāĻ•āĻ˛ā§āĻĒāύāĻž āĻ•āϰāĻ›āĻŋ āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āϏāĻžāϞ⧇āϰ āĻ›ā§Ÿ āϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰ"

print(nrml.text_diff(input1, input2))

#Output: 

(
    ['⧧⧝⧝ā§ŦāϏāĻžāϞ⧇āϰ ā§Ŧ', 'āϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰāϰāĻŖ', '⧍ā§Ļā§Šā§ĻāϏāĻžāϞ⧇āϰ', 'ā§ŦāϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰ'], 
    ['āωāύāĻŋāĻļāĻļā§‹ āĻ›āĻŋ⧟āĻžāύāĻŦā§āĻŦāχ āϏāĻžāϞ⧇āϰ āĻ›ā§Ÿ', 'āϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰ āϰāĻŖ', 'āĻĻ⧁āχ āĻšāĻžāϜāĻžāϰ āĻ¤ā§āϰāĻŋāĻļ āϏāĻžāϞ⧇āϰ āĻ›ā§Ÿ', 'āϏ⧇āĻĒā§āĻŸā§‡āĻŽā§āĻŦāϰ']
)
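
A similar pair of "before" and "after" lists can be produced with Python's standard difflib module. The sketch below is an approximation built on word-level opcodes, not the library's own text_diff implementation, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def changed_segments(before, after):
    """Return (removed, added): the word runs that differ between two strings."""
    a, b = before.split(), after.split()
    removed, added = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words are skipped
        # For a pure insertion or deletion, one side is an empty string
        removed.append(" ".join(a[i1:i2]))
        added.append(" ".join(b[j1:j2]))
    return removed, added


print(changed_segments("the 2 cats", "the two cats"))  # (['2'], ['two'])
```

Pairing the two lists by index shows which raw segment turned into which normalized segment, which is handy when reviewing normalizer output.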

Upcoming Features

  • Bangla lemmatization and stemming algorithm
  • Bangla Tokenizer

Contact

If you have any suggestions, email: saifulbrur79@gmail.com

Contributor

@misc{pybangla,
  title={PYBANGLA module used for normalize textual format like text to number and number to text},
  author={Islam, Md Saiful and Emon, Hassan Ali and  HM-badhon and Sarker, Sagor and Das, Udoy},
  howpublished={},
  year={2024}
}
