
pybangla is a Bangla text normalizer tool, used for text normalization tasks such as word-to-number conversion and date formatting.
Citation Paper: BnVITS: Voice Cloning in Bangla with Minimal Audio Samples
PyBangla:
PyBangla is a Python 3 package for Bangla number, date-time, and text normalization and for date extraction. It can be used to normalize numbers and dates in text (for example, number to text and vice versa). It can also be used within Django, Flask, FastAPI, and other frameworks. PyBangla supports Linux/Unix, macOS, and Windows.
Features available in PyBangla:
The easiest way to install pybangla is to use pip:
pip install pybangla
#or
pip install git+https://github.com/saiful9379/pybangla.git
#or
git clone https://github.com/saiful9379/pybangla.git
cd pybangla
pip install -e .
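The evaluation figures in this README apply to V2.0.9, so it can be useful to confirm which version pip actually installed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `meets_minimum` are illustrative and not part of PyBangla, and the version comparison assumes plain dotted numeric versions.

```python
from importlib import metadata


def installed_version(package="pybangla"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version, minimum="2.0.9"):
    """Compare dotted numeric versions (assumption: no pre-release suffixes)."""
    def parse(v):
        return tuple(int(part) for part in v.split(".") if part.isdigit())
    return parse(version) >= parse(minimum)


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pybangla is not installed")
    else:
        print(f"pybangla {found}, evaluated release or newer: {meets_minimum(found)}")
```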
For the evaluation, we selected 200 sentences. The dataset contains numerical values and has been normalized using PyBangla. We generated AI-based ground truth (GT) text and had it corrected by human annotators. The performance of our tool is evaluated using three key metrics: Word Error Rate (WER), Character Error Rate (CER), and Match Error Rate (MER).
No evaluation report is available for versions earlier than V2.0.9. The tables below present the conversion accuracy and the processing time of PyBangla V2.0.9.
Module Version | No. of Sentences | WER (Word Error Rate) | CER (Character Error Rate) | MER (Match Error Rate) |
---|---|---|---|---|
<= V2.0.8 | 200 | No evaluation report | No evaluation report | No evaluation report |
V2.0.9 | 200 | 0.1291 | 0.0319 | 0.0975 |
N.B.: For more details and the full list of processing categories, please check: link
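For readers unfamiliar with the three metrics, here is a minimal sketch of how WER, CER, and MER can be derived from a Levenshtein alignment. This is not the evaluation script used for the table above (the README does not say which tool was used); the function names are illustrative, and when several optimal alignments exist the hit count, and therefore MER, can differ slightly between tools.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one optimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Walk back from the end to count each kind of operation
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits, i, j = hits + 1, i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs, i, j = subs + 1, i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return hits, subs, dels, ins


def wer(ref_text, hyp_text):
    """Word Error Rate: (S + D + I) / number of reference words."""
    h, s, d, i = align_counts(ref_text.split(), hyp_text.split())
    return (s + d + i) / max(h + s + d, 1)


def mer(ref_text, hyp_text):
    """Match Error Rate: (S + D + I) / (H + S + D + I)."""
    h, s, d, i = align_counts(ref_text.split(), hyp_text.split())
    return (s + d + i) / max(h + s + d + i, 1)


def cer(ref_text, hyp_text):
    """Character Error Rate: the WER formula applied to characters."""
    h, s, d, i = align_counts(list(ref_text), list(hyp_text))
    return (s + d + i) / max(h + s + d, 1)


if __name__ == "__main__":
    reference = "one two three four"
    hypothesis = "one zero two three"   # one inserted word, one deleted word
    print(wer(reference, hypothesis))   # 0.5
    print(mer(reference, hypothesis))   # 0.4
```

WER divides by the reference length only, so insertions can push it above 1, while MER also counts insertions in the denominator and stays between 0 and 1.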
Module Version | Total Sentences | Raw Character Count | Normalized Character Count | Per Character Processing Time (sec) | Total Processing Time (sec) |
---|---|---|---|---|---|
2.0.9 | 200 | 9,217 | 12,584 | 0.0001167 | 1.076 |
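The per-character figure in the table is consistent with dividing the total processing time by the raw (pre-normalization) character count. A short check of that arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
raw_chars = 9217            # Raw Character Count from the table
normalized_chars = 12584    # Normalized Character Count from the table
total_seconds = 1.076       # Total Processing Time (sec) from the table

per_char_seconds = total_seconds / raw_chars
print(f"{per_char_seconds:.7f}")   # 0.0001167

# Derived figures, not stated in the table itself
chars_per_second = raw_chars / total_seconds
expansion_ratio = normalized_chars / raw_chars
print(f"{chars_per_second:.0f} raw characters per second")
print(f"normalized text is {expansion_ratio:.2f}x the raw length")
```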
It supports converting Bangla abbreviations, symbols, and currencies to Bangla textual format.
Processes a given text by applying various normalization techniques based on specified boolean parameters.
Parameters:
text (str): The input text to be normalized.
all_operation (bool): Make this True if you need all operations to take place or False
number_plate (bool, default=False): Converts or normalizes vehicle number plates if present in the text.
abbreviations (bool, default=False): Expands common abbreviations into their full forms.
year (bool, default=False): Handles and formats years correctly.
punctuation (bool, default=False): Removes or standardizes unwanted punctuation marks.
phone_number (bool, default=False): Extracts and normalizes phone numbers.
symbols (bool, default=False): Expands common symbols into their textual representation.
ordinals (bool, default=False): Converts ordinal numbers.
currency (bool, default=False): Converts currency values into words.
date (bool, default=False): Standardizes and normalizes date formats.
nid (bool, default=False): Converts national identification numbers (NID) into a textual format.
passport (bool, default=False): Normalizes passport numbers.
number (bool, default=False): Processes and converts numeric values into textual form.
emoji (bool, default=False): Removes emojis from text.
Returns:
Example:
import pybangla
nrml = pybangla.Normalizer()
text = "রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য ২০৩০ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে"
# Apply every normalization operation
output = nrml.text_normalizer(text, all_operation=True)
print(f"Input: {text}\nOutput: {output}")
# output:
'রাহিম ক্লাস ওয়ান এ প্রথম, এন্ড বাসার ক্লাস এ তেত্রিশতম, সে জন্য দুই হাজার ত্রিশ শতাব্দীতে দুই হাজার ত্রিশ দশমিক এক দুই তিন চার ইয়েন দিতে হয়েছে'
For example, if only year conversion is needed -
import pybangla
nrml = pybangla.Normalizer()
text = "রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য ২০৩০ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে"
# Enable only the year operation
output = nrml.text_normalizer(text, all_operation=False, year=True)
print(f"Input: {text}\nOutput: {output}")
# output:
'রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য দুই হাজার ত্রিশ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে'
If only ordinal conversion is needed -
import pybangla
nrml = pybangla.Normalizer()
text = "রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য ২০৩০ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে"
# Enable only the ordinals operation
output = nrml.text_normalizer(text, all_operation=False, ordinals=True)
print(f"Input: {text}\nOutput: {output}")
# output:
'রাহিম ক্লাস ওয়ান এ প্রথম, এন্ড বাসার ক্লাস এ তেত্রিশতম, সে জন্য ২০৩০ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে'
If only currency conversion is needed -
import pybangla
nrml = pybangla.Normalizer()
text = "রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য ২০৩০ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে"
# Enable only the currency operation
output = nrml.text_normalizer(text, all_operation=False, currency=True)
print(f"Input: {text}\nOutput: {output}")
# output:
'রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য ২০৩০ শতাব্দীতে দুই হাজার ত্রিশ দশমিক এক দুই তিন চার ইয়েন দিতে হয়েছে'
If both year and currency conversion are needed -
import pybangla
nrml = pybangla.Normalizer()
text = "রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য ২০৩০ শতাব্দীতে ¥২০৩০.১২৩৪ দিতে হয়েছে"
# Enable the year and currency operations together (inferred from the output below)
output = nrml.text_normalizer(text, all_operation=False, year=True, currency=True)
print(f"Input: {text}\nOutput: {output}")
# output:
'রাহিম ক্লাস ওয়ান এ ১ম, এন্ড বাসার ক্লাস এ ৩৩ তম, সে জন্য দুই হাজার ত্রিশ শতাব্দীতে দুই হাজার ত্রিশ দশমিক এক দুই তিন চার ইয়েন দিতে হয়েছে'
For more information and examples on the Normalizer, check the link
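When several operations are toggled together, the keyword arguments can be built from a list of names. The sketch below uses only the parameter names documented above; the helper `build_flags` is hypothetical and not part of PyBangla, and it guards against misspelled flag names before the call is made.

```python
# Flag names as listed in the Parameters section of this README
DOCUMENTED_FLAGS = (
    "number_plate", "abbreviations", "year", "punctuation", "phone_number",
    "symbols", "ordinals", "currency", "date", "nid", "passport", "number",
    "emoji",
)


def build_flags(*enabled):
    """Return keyword arguments that switch on only the named operations."""
    unknown = sorted(set(enabled) - set(DOCUMENTED_FLAGS))
    if unknown:
        raise ValueError(f"Unknown operation(s): {', '.join(unknown)}")
    flags = {name: name in enabled for name in DOCUMENTED_FLAGS}
    flags["all_operation"] = False
    return flags


if __name__ == "__main__":
    kwargs = build_flags("year", "currency")
    print([name for name, on in kwargs.items() if on])   # ['year', 'currency']
    # With pybangla installed, the dictionary could be passed on like this:
    # import pybangla
    # nrml = pybangla.Normalizer()
    # print(nrml.text_normalizer(text, **kwargs))
```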
Example:
import pybangla
nrml = pybangla.Normalizer()
text = "আমাকে এক লক্ষ দুই হাজার এক টাকা দেয় এন্ড তুমি বিশ হাজার টাকা নিও এন্ড এক লক্ষ চার হাজার দুইশ এক টাকা এক ডবল দুই"
# Convert Bangla number words to digits
text = nrml.word2number(text)
print(text)
#output:
'আমাকে 102001 টাকা দেয় এন্ড তুমি 20000 টাকা নিও এন্ড 104201 টাকা 122 '
For more information and examples on number conversion, check the link
Example:
import pybangla
nrml = pybangla.Normalizer()
date = "০১-এপ্রিল/২০২৩"
date = nrml.date_format(date, language="bn")
print(date)
#output:
{'date': '০১', 'month': '৪', 'year': '২০২৩', 'txt_date': 'এক', 'txt_month': 'এপ্রিল', 'txt_year': 'দুই হাজার তেইশ', 'weekday': 'শনিবার', 'ls_month': 'শ্রাবণ', 'seasons': 'বর্ষা'}
For more information and examples on date formats, check the link
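The numeric fields in the result above are strings of Bangla digits. Python's built-in `int()` accepts any Unicode decimal digits, so they can be turned into a standard `datetime.date` without a lookup table. The sketch below hard-codes the three numeric fields from the example output rather than calling PyBangla; `to_date` is an illustrative helper name.

```python
from datetime import date

# Numeric fields copied from the example output above
parsed = {"date": "০১", "month": "৪", "year": "২০২৩"}

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def to_date(fields):
    """Build a datetime.date from Bangla-digit strings."""
    # int() understands Bangla digits because they are Unicode decimal digits
    return date(int(fields["year"]), int(fields["month"]), int(fields["date"]))


if __name__ == "__main__":
    d = to_date(parsed)
    print(d.isoformat())           # 2023-04-01
    print(WEEKDAYS[d.weekday()])   # Saturday
```

The weekday agrees with the 'weekday' field in the example output, which reads Saturday in Bangla.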
import pybangla
nrml = pybangla.Normalizer()
today = nrml.today()
print(today)
Output:
{'date': '৩০', 'month': 'এপ্রিল', 'year': '২০২৪', 'txt_date': 'ত্রিশ', 'txt_year': 'দুই হাজার চব্বিশ', 'weekday': 'মঙ্গলবার', 'ls_month': 'শ্রাবণ', 'seasons': 'বর্ষা'}
For more information and examples on today, months, weekdays, and seasons, check the link
import pybangla
nrml = pybangla.Normalizer()
number_string = nrml.process_phone_number("01790-540211")
print(number_string)
Output:
জিরো ওয়ান সেভেন নাইন জিরো ফাইভ ফোর জিরো টু ডাবল ওয়ান
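The output above reads the number digit by digit and collapses the repeated final digit into "double one". A rough sketch of that behavior follows; it is not PyBangla's implementation. The spoken forms for 3, 6, and 8 do not appear in the example and are assumptions, as is the rule of pairing repeated digits from left to right after stripping separators.

```python
# Spoken forms; 3, 6 and 8 are assumed, the rest appear in the example output
SPOKEN = {
    "0": "জিরো", "1": "ওয়ান", "2": "টু", "3": "থ্রি", "4": "ফোর",
    "5": "ফাইভ", "6": "সিক্স", "7": "সেভেন", "8": "এইট", "9": "নাইন",
}
DOUBLE = "ডাবল"


def spell_phone(number):
    """Spell out ASCII digits, reading an adjacent repeated pair as 'double'."""
    digits = [ch for ch in number if ch in SPOKEN]   # drops '-' and spaces
    words = []
    i = 0
    while i < len(digits):
        if i + 1 < len(digits) and digits[i] == digits[i + 1]:
            words.extend([DOUBLE, SPOKEN[digits[i]]])
            i += 2
        else:
            words.append(SPOKEN[digits[i]])
            i += 1
    return " ".join(words)


if __name__ == "__main__":
    print(spell_phone("01790-540211"))
```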
import pybangla
nrml = pybangla.Normalizer()
input1 = "১৯৯৬সালের ৬ সেপ্টেম্বররণ ভ্রমণ পরিকল্পনা করছি ২০৩০সালের ৬সেপ্টেম্বর"
input2 = "উনিশশো ছিয়ানব্বই সালের ছয় সেপ্টেম্বর রণ ভ্রমণ পরিকল্পনা করছি দুই হাজার ত্রিশ সালের ছয় সেপ্টেম্বর"
print(nrml.text_diff(input1, input2))
#Output:
(
['১৯৯৬সালের ৬', 'সেপ্টেম্বররণ', '২০৩০সালের', '৬সেপ্টেম্বর'],
['উনিশশো ছিয়ানব্বই সালের ছয়', 'সেপ্টেম্বর রণ', 'দুই হাজার ত্রিশ সালের ছয়', 'সেপ্টেম্বর']
)
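For comparison, a word-level diff can also be sketched with the standard library's `difflib`. This is not how `text_diff` is implemented (the README does not describe its algorithm), and it groups changed runs more coarsely than the output shown above: each run of differing words between two matching stretches becomes one pair. The function name `word_diff` is illustrative.

```python
from difflib import SequenceMatcher


def word_diff(before, after):
    """Return (changed runs in before, changed runs in after), paired by position."""
    a_words, b_words = before.split(), after.split()
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    only_a, only_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # For pure insertions or deletions one side is an empty string
        only_a.append(" ".join(a_words[i1:i2]))
        only_b.append(" ".join(b_words[j1:j2]))
    return only_a, only_b


if __name__ == "__main__":
    raw = "১৯৯৬সালের ৬ সেপ্টেম্বররণ ভ্রমণ পরিকল্পনা করছি ২০৩০সালের ৬সেপ্টেম্বর"
    normalized = ("উনিশশো ছিয়ানব্বই সালের ছয় সেপ্টেম্বর রণ ভ্রমণ পরিকল্পনা করছি "
                  "দুই হাজার ত্রিশ সালের ছয় সেপ্টেম্বর")
    for old, new in zip(*word_diff(raw, normalized)):
        print(f"{old!r} -> {new!r}")
```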
If you have any suggestions, email: saifulbrur79@gmail.com
@misc{pybangla,
title={PYBANGLA module used for normalize textual format like text to number and number to text},
author={Islam, Md Saiful and Emon, Hassan Ali and HM-badhon and Sarker, Sagor and Das, Udoy},
howpublished={},
year={2024}
}