bangla-text-processing-kit

A Python toolkit for processing Bangla text.

How to use

Installing

There are three installation options for the bkit package:

  • bkit: The most basic version of bkit, with normalization, cleaning, and tokenization capabilities.
pip install bkit
  • bkit[lemma]: Everything in the basic version, plus lemmatization capability.
pip install bkit[lemma]
  • bkit[all]: Everything available in bkit, including normalization, cleaning, tokenization, lemmatization, NER, PoS tagging, and shallow parsing.
pip install bkit[all]
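
Note: on some shells (e.g. zsh), square brackets are interpreted as glob characters, so quote the extras when installing, e.g.:

pip install 'bkit[all]'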

Checking text

  • bkit.utils.is_bangla(text) -> bool: Checks whether text contains only Bangla characters, digits, spaces, punctuation, and some symbols. Returns True if so, otherwise False.
  • bkit.utils.is_digit(text) -> bool: Checks whether text contains only Bangla digit characters. Returns True if so, otherwise False.
  • bkit.utils.contains_digit(text, check_english_digits) -> bool: Checks whether text contains any digits. By default, it checks only Bangla digits. Returns True if so, otherwise False.
  • bkit.utils.contains_bangla(text) -> bool: Checks whether text contains any Bangla character. Returns True if so, otherwise False.
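
A minimal usage sketch of these checks (the return values follow the descriptions above; the sample strings are illustrative):

import bkit

print(bkit.utils.is_bangla('āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ'))
# >>> True

print(bkit.utils.is_digit('⧭⧝'))
# >>> True

print(bkit.utils.contains_digit('āϤāĻžāϰ āĻŦāĻžāϏāĻž ⧭⧝'))
# >>> True

print(bkit.utils.contains_bangla('bkit 0.0.9'))
# >>> False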

Transforming text

Text transformation includes the normalization and cleaning procedures. To transform text, use the bkit.transform module. Supported functionalities are:

Normalizer

This module normalizes Bangla text using the following steps:

import bkit

text = 'āĻ…āĻžāĻžāĻŽāĻžāĻŦāĻŧ āĨ¤ '
print(list(text))
# >>> ['āĻ…', 'āĻž', 'āĻž', 'āĻŽ', 'āĻž', 'āĻŦ', 'āĻŧ', ' ', 'āĨ¤', ' ']

normalizer = bkit.transform.Normalizer(
    normalize_characters=True,
    normalize_zw_characters=True,
    normalize_halant=True,
    normalize_vowel_kar=True,
    normalize_punctuation_spaces=True
)

clean_text = normalizer(text)
print(clean_text, list(clean_text))
# >>> āφāĻŽāĻžāϰāĨ¤ ['āφ', 'āĻŽ', 'āĻž', 'āϰ', 'āĨ¤']

Character Normalization

This module performs character normalization in Bangla text. It sequentially performs nukta normalization, Assamese character normalization, kar normalization, legacy character normalization, and punctuation normalization.

import bkit

text = 'āφāĻŽāĻžāĻŦāĻŧ'
print(list(text))
# >>> ['āφ', 'āĻŽ', 'āĻž', 'āĻŦ', 'āĻŧ']

text = bkit.transform.normalize_characters(text)

print(list(text))
# >>> ['āφ', 'āĻŽ', 'āĻž', 'āϰ']

Punctuation space normalization

Normalizes spaces around punctuation, i.e., adds necessary spaces before or after specific punctuation marks and removes unnecessary ones.

import bkit

text = 'āϰāĻšāĻŋāĻŽ(ā§¨ā§Š)āĻ āĻ•āĻĨāĻž āĻŦāϞ⧇āύ   āĨ¤āϤāĻŋāύāĻŋ (    āϰāĻšāĻŋāĻŽ ) āφāϰāĻ“ āϜāĻžāύāĻžāύ, ā§§,⧍ā§Ē,ā§Šā§Ģ,ā§Ŧā§Ģā§Ē.ā§Šā§¨ā§Š āϕ⧋āϟāĻŋ āϟāĻžāĻ•āĻž āĻŦā§āϝāĻžā§Ÿā§‡...'

clean_text = bkit.transform.normalize_punctuation_spaces(text)
print(clean_text)
# >>> āϰāĻšāĻŋāĻŽ (ā§¨ā§Š) āĻ āĻ•āĻĨāĻž āĻŦāϞ⧇āύāĨ¤ āϤāĻŋāύāĻŋ (āϰāĻšāĻŋāĻŽ) āφāϰāĻ“ āϜāĻžāύāĻžāύ, ā§§,⧍ā§Ē,ā§Šā§Ģ,ā§Ŧā§Ģā§Ē.ā§Šā§¨ā§Š āϕ⧋āϟāĻŋ āϟāĻžāĻ•āĻž āĻŦā§āϝāĻžā§Ÿā§‡...

Zero width characters normalization

There are two zero-width characters: the Zero Width Joiner (ZWJ) and the Zero Width Non-Joiner (ZWNJ). Generally, ZWNJ is not used in Bangla text, and ZWJ is used only with āϰ. These characters are normalized based on these rules.

import bkit

text = 'āĻ°â€ā§āĻ¯â€ŒāĻžāϕ⧇āϟ'
print(f"text: {text} \t Characters: {list(text)}")
# >>> text: āĻ°â€ā§āĻ¯â€ŒāĻžāϕ⧇āϟ     Characters: ['āϰ', '\u200d', 'ā§', 'āϝ', '\u200c', 'āĻž', 'āĻ•', '⧇', 'āϟ']

clean_text = bkit.transform.normalize_zero_width_chars(text)
print(f"text: {clean_text} \t Characters: {list(clean_text)}")
# >>> text: āĻ°â€ā§āϝāĻžāϕ⧇āϟ     Characters: ['āϰ', '\u200d', 'ā§', 'āϝ', 'āĻž', 'āĻ•', '⧇', 'āϟ']

Halant (āĻšāϏāĻ¨ā§āϤ) normalization

This function normalizes halant (āĻšāϏāĻ¨ā§āϤ) [0x09CD] in Bangla text. Before using this function, it is recommended to normalize the zero-width characters first, e.g. using the bkit.transform.normalize_zero_width_chars() function.

During normalization it also handles the āĻ¤ā§ -> ā§Ž conversion. In a valid conjunct letter (āϝ⧁āĻ•ā§āϤāĻŦāĻ°ā§āĻŖ) where 'āϤ' is the first character, the next character can only be one of 'āϤ', 'āĻĨ', 'āύ', 'āĻŦ', 'āĻŽ', 'āϝ', or 'āϰ'. The conversion is performed based on this rule.

During halant normalization, the following cases are handled:

  • Remove any leading and trailing halant in a word and/or text.
  • Replace two or more consecutive halants with a single halant.
  • Remove any halant between characters that cannot form a conjunct; for example, a halant that follows or precedes a vowel, a kar, ⧟, etc. is removed.
  • Remove multiple folas (multiple ref, ro-fola, and jo-fola).
import bkit

text = 'āφāϏāĻ¨ā§ā§ā§āύ āφāϏāĻĢāĻžāϕ⧁āĻ˛ā§āϞāĻžāĻšā§â€Œ āφāϞāĻŦāĻ¤ā§â€ āφāϞāĻŦāĻ¤ā§ āĻ°â€ā§āϝāĻžāĻŦ āĻ‡ā§āϏāĻŋ'
print(list(text))
# >>> ['āφ', 'āϏ', 'āύ', 'ā§', 'ā§', 'ā§', 'āύ', ' ', 'āφ', 'āϏ', 'āĻĢ', 'āĻž', 'āĻ•', '⧁', 'āϞ', 'ā§', 'āϞ', 'āĻž', 'āĻš', 'ā§', '\u200c', ' ', 'āφ', 'āϞ', 'āĻŦ', 'āϤ', 'ā§', '\u200d', ' ', 'āφ', 'āϞ', 'āĻŦ', 'āϤ', 'ā§', ' ', 'āϰ', '\u200d', 'ā§', 'āϝ', 'āĻž', 'āĻŦ', ' ', 'āχ', 'ā§', 'āϏ', 'āĻŋ']

clean_text = bkit.transform.normalize_zero_width_chars(text)
clean_text = bkit.transform.normalize_halant(clean_text)
print(clean_text, list(clean_text))
# >>> āφāϏāĻ¨ā§āύ āφāϏāĻĢāĻžāϕ⧁āĻ˛ā§āϞāĻžāĻš āφāϞāĻŦā§Ž āφāϞāĻŦā§Ž āĻ°â€ā§āϝāĻžāĻŦ āχāϏāĻŋ ['āφ', 'āϏ', 'āύ', 'ā§', 'āύ', ' ', 'āφ', 'āϏ', 'āĻĢ', 'āĻž', 'āĻ•', '⧁', 'āϞ', 'ā§', 'āϞ', 'āĻž', 'āĻš', ' ', 'āφ', 'āϞ', 'āĻŦ', 'ā§Ž', ' ', 'āφ', 'āϞ', 'āĻŦ', 'ā§Ž', ' ', 'āϰ', '\u200d', 'ā§', 'āϝ', 'āĻž', 'āĻŦ', ' ', 'āχ', 'āϏ', 'āĻŋ']

Kar ambiguity

Normalizes kar ambiguities involving vowels and āρ, āĻ‚, and āσ. Any kar preceded by a vowel or a consonant diacritic is removed; for example, āφāĻž is normalized to āφ. For consecutive kars such as āĻ•āĻžāĻžāĻžā§€, only the first kar is kept, e.g. āĻ•āĻž.

import bkit

text = 'āĻ…āĻ‚āĻļāχ⧇ āĻ…āĻ‚āĻļāĻ—ā§āϰāĻšāĻŖāχ⧇ āφāĻžāĻžāϰ⧋ āĻāĻ–āύāϓ⧋ āφāϞāĻŦāĻžāĻ°ā§āϤ⧋⧇ āϏāĻžāϧ⧁⧁ āĻ•āĻžāĻžāĻžā§€'
print(list(text))
# >>> ['āĻ…', 'āĻ‚', 'āĻļ', 'āχ', '⧇', ' ', 'āĻ…', 'āĻ‚', 'āĻļ', 'āĻ—', 'ā§', 'āϰ', 'āĻš', 'āĻŖ', 'āχ', '⧇', ' ', 'āφ', 'āĻž', 'āĻž', 'āϰ', 'ā§‹', ' ', 'āĻ', 'āĻ–', 'āύ', 'āĻ“', 'ā§‹', ' ', 'āφ', 'āϞ', 'āĻŦ', 'āĻž', 'āϰ', 'ā§', 'āϤ', 'ā§‹', '⧇', ' ', 'āϏ', 'āĻž', 'āϧ', '⧁', '⧁', ' ', 'āĻ•', 'āĻž', 'āĻž', 'āĻž', 'ā§€']

clean_text = bkit.transform.normalize_kar_ambiguity(text)
print(clean_text, list(clean_text))
# >>> āĻ…āĻ‚āĻļāχ āĻ…āĻ‚āĻļāĻ—ā§āϰāĻšāĻŖāχ āφāϰ⧋ āĻāĻ–āύāĻ“ āφāϞāĻŦāĻžāĻ°ā§āϤ⧋ āϏāĻžāϧ⧁ āĻ•āĻž ['āĻ…', 'āĻ‚', 'āĻļ', 'āχ', ' ', 'āĻ…', 'āĻ‚', 'āĻļ', 'āĻ—', 'ā§', 'āϰ', 'āĻš', 'āĻŖ', 'āχ', ' ', 'āφ', 'āϰ', 'ā§‹', ' ', 'āĻ', 'āĻ–', 'āύ', 'āĻ“', ' ', 'āφ', 'āϞ', 'āĻŦ', 'āĻž', 'āϰ', 'ā§', 'āϤ', 'ā§‹', ' ', 'āϏ', 'āĻž', 'āϧ', '⧁', ' ', 'āĻ•', 'āĻž']

Clean text

Clean text by sequentially applying the cleaning steps described in the following subsections:

import bkit

text = '<a href=some_URL>āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ</a>\nāĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ   āĻ†ā§ŸāϤāύ ā§§.ā§Ēā§­ āϞāĻ•ā§āώ āĻ•āĻŋāϞ⧋āĻŽāĻŋāϟāĻžāϰ!!!'

clean_text = bkit.transform.clean_text(text)
print(clean_text)
# >>> āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āĻ†ā§ŸāϤāύ āϞāĻ•ā§āώ āĻ•āĻŋāϞ⧋āĻŽāĻŋāϟāĻžāϰ

Clean punctuations

Remove punctuation marks, replacing them with the given replace_with character/string.

import bkit

text = 'āφāĻŽāϰāĻž āĻŽāĻžāϠ⧇ āĻĢ⧁āϟāĻŦāϞ āϖ⧇āϞāϤ⧇ āĻĒāĻ›āĻ¨ā§āĻĻ āĻ•āϰāĻŋ!'

clean_text = bkit.transform.clean_punctuations(text)
print(clean_text)
# >>> āφāĻŽāϰāĻž āĻŽāĻžāϠ⧇ āĻĢ⧁āϟāĻŦāϞ āϖ⧇āϞāϤ⧇ āĻĒāĻ›āĻ¨ā§āĻĻ āĻ•āϰāĻŋ

clean_text = bkit.transform.clean_punctuations(text, replace_with=' PUNC ')
print(clean_text)
# >>> āφāĻŽāϰāĻž āĻŽāĻžāϠ⧇ āĻĢ⧁āϟāĻŦāϞ āϖ⧇āϞāϤ⧇ āĻĒāĻ›āĻ¨ā§āĻĻ āĻ•āϰāĻŋ PUNC

Clean digits

Remove any Bangla digits from text, replacing them with the given replace_with character/string.

import bkit

text = 'āϤāĻžāϰ āĻŦāĻžāϏāĻž ⧭⧝ āύāĻžāĻŽā§āĻŦāĻžāϰ āϰ⧋āĻĄā§‡āĨ¤'

clean_text = bkit.transform.clean_digits(text)
print(clean_text)
# >>> āϤāĻžāϰ āĻŦāĻžāϏāĻž    āύāĻžāĻŽā§āĻŦāĻžāϰ āϰ⧋āĻĄā§‡āĨ¤

clean_text = bkit.transform.clean_digits(text, replace_with='#')
print(clean_text)
# >>> āϤāĻžāϰ āĻŦāĻžāϏāĻž ## āύāĻžāĻŽā§āĻŦāĻžāϰ āϰ⧋āĻĄā§‡āĨ¤

Multiple spaces

Clean multiple consecutive whitespace characters, including spaces, newlines, tabs, and vertical tabs. It also removes leading and trailing whitespace characters.

import bkit

text = 'āϤāĻžāϰ āĻŦāĻžāϏāĻž ⧭⧝   \t\t āύāĻžāĻŽā§āĻŦāĻžāϰ   āϰ⧋āĻĄā§‡āĨ¤\nāϏ⧇ āϖ⧁āĻŦ \v āĻ­āĻžāϞ⧋ āϛ⧇āϞ⧇āĨ¤'

clean_text = bkit.transform.clean_multiple_spaces(text)
print(clean_text)
# >>> āϤāĻžāϰ āĻŦāĻžāϏāĻž ⧭⧝ āύāĻžāĻŽā§āĻŦāĻžāϰ āϰ⧋āĻĄā§‡āĨ¤ āϏ⧇ āϖ⧁āĻŦ āĻ­āĻžāϞ⧋ āϛ⧇āϞ⧇āĨ¤

clean_text = bkit.transform.clean_multiple_spaces(text, keep_new_line=True)
print(clean_text)
# >>> āϤāĻžāϰ āĻŦāĻžāϏāĻž ⧭⧝ āύāĻžāĻŽā§āĻŦāĻžāϰ āϰ⧋āĻĄā§‡āĨ¤\nāϏ⧇ āϖ⧁āĻŦ \n āĻ­āĻžāϞ⧋ āϛ⧇āϞ⧇āĨ¤

URLs

Clean URLs from text, replacing them with any given string.

import bkit

text = 'āφāĻŽāĻŋ https://xyz.abc āϏāĻžāχāĻŸā§‡ āĻŦā§āϞāĻ— āϞāĻŋāĻ–āĻŋāĨ¤ āĻāχ ftp://10.17.5.23/books āϏāĻžāĻ°ā§āĻ­āĻžāϰ āĻĨ⧇āϕ⧇ āφāĻŽāĻžāϰ āĻŦāχāϗ⧁āϞ⧋ āĻĒāĻžāĻŦ⧇āĨ¤ āĻāχ https://bn.wikipedia.org/wiki/%E0%A6%A7%E0%A6%BE%E0%A6%A4%E0%A7%81_(%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE_%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%95%E0%A6%B0%E0%A6%A3) āϞāĻŋāĻ™ā§āĻ•āϟāĻŋāϤ⧇ āĻ­āĻžāϞ⧋ āϤāĻĨā§āϝ āφāϛ⧇āĨ¤'

clean_text = bkit.transform.clean_urls(text)
print(clean_text)
# >>> āφāĻŽāĻŋ   āϏāĻžāχāĻŸā§‡ āĻŦā§āϞāĻ— āϞāĻŋāĻ–āĻŋāĨ¤ āĻāχ   āϏāĻžāĻ°ā§āĻ­āĻžāϰ āĻĨ⧇āϕ⧇ āφāĻŽāĻžāϰ āĻŦāχāϗ⧁āϞ⧋ āĻĒāĻžāĻŦ⧇āĨ¤ āĻāχ   āϞāĻŋāĻ™ā§āĻ•āϟāĻŋāϤ⧇ āĻ­āĻžāϞ⧋ āϤāĻĨā§āϝ āφāϛ⧇āĨ¤

clean_text = bkit.transform.clean_urls(text, replace_with='URL')
print(clean_text)
# >>> āφāĻŽāĻŋ URL āϏāĻžāχāĻŸā§‡ āĻŦā§āϞāĻ— āϞāĻŋāĻ–āĻŋāĨ¤ āĻāχ URL āϏāĻžāĻ°ā§āĻ­āĻžāϰ āĻĨ⧇āϕ⧇ āφāĻŽāĻžāϰ āĻŦāχāϗ⧁āϞ⧋ āĻĒāĻžāĻŦ⧇āĨ¤ āĻāχ URL āϞāĻŋāĻ™ā§āĻ•āϟāĻŋāϤ⧇ āĻ­āĻžāϞ⧋ āϤāĻĨā§āϝ āφāϛ⧇āĨ¤

Emojis

Clean emojis and emoticons from text, replacing them with any given string.

import bkit

text = 'āĻ•āĻŋāϛ⧁ āχāĻŽā§‹āϜāĻŋ āĻšāϞ: 😀đŸĢ…đŸžđŸĢ…đŸŋđŸĢƒđŸŧđŸĢƒđŸŊđŸĢƒđŸžđŸĢƒđŸŋđŸĢ„đŸĢ„đŸģđŸĢ„đŸŧđŸĢ„đŸŊđŸĢ„đŸžđŸĢ„đŸŋ🧌đŸĒ¸đŸĒˇđŸĒšđŸĒēđŸĢ˜đŸĢ—đŸĢ™đŸ›đŸ›žđŸ›ŸđŸĒŦđŸĒŠđŸĒĢđŸŠŧđŸŠģđŸĢ§đŸĒĒ🟰'

clean_text = bkit.transform.clean_emojis(text, replace_with='<EMOJI>')
print(clean_text)
# >>> āĻ•āĻŋāϛ⧁ āχāĻŽā§‹āϜāĻŋ āĻšāϞ: <EMOJI>

HTML tags

Clean HTML tags from text, replacing them with any given string.

import bkit

text = '<a href=some_URL>āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ</a>'

clean_text = bkit.transform.clean_html(text)
print(clean_text)
# >>> āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ

Multiple punctuations

Remove multiple consecutive punctuation marks, keeping only the first one.

import bkit

text = 'āĻ•āĻŋ āφāύāĻ¨ā§āĻĻ!!!!!'

clean_text = bkit.transform.clean_multiple_punctuations(text)
print(clean_text)
# >>> āĻ•āĻŋ āφāύāĻ¨ā§āĻĻ!

Special characters

Remove special characters like $, #, @, etc., replacing them with the given string. If no character list is passed, [$, #, &, %, @] are removed by default.

import bkit

text = '#āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ$'

clean_text = bkit.transform.clean_special_characters(text, characters=['#', '$'], replace_with='')
print(clean_text)
# >>> āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ

Non Bangla characters

Remove non-Bangla characters, i.e., characters and punctuation not used in Bangla, such as English or other languages' alphabets, replacing them with the given string.

import bkit

text = 'āĻāχ āĻļā§‚āĻ•āϕ⧀āϟ āĻšāĻžāϤāĻŋāĻļ⧁āρāĻĄāĻŧ Heliotropium indicum, āĻ…āϤāϏ⧀, āφāĻ•āĻ¨ā§āĻĻ Calotropis gigantea āĻ—āĻžāϛ⧇āϰ āĻĒāĻžāϤāĻžāϰ āϰāϏāĻžāϞ⧋ āĻ…āĻ‚āĻļ āφāĻšāĻžāϰ āĻ•āϰ⧇āĨ¤'

clean_text = bkit.transform.clean_non_bangla(text, replace_with='')
print(clean_text)
# >>> āĻāχ āĻļā§‚āĻ•āϕ⧀āϟ āĻšāĻžāϤāĻŋāĻļ⧁āρāĻĄāĻŧ  , āĻ…āϤāϏ⧀, āφāĻ•āĻ¨ā§āĻĻ  āĻ—āĻžāϛ⧇āϰ āĻĒāĻžāϤāĻžāϰ āϰāϏāĻžāϞ⧋ āĻ…āĻ‚āĻļ āφāĻšāĻžāϰ āĻ•āϰ⧇

Text Analysis

Word count

The bkit.analysis.count_words function can be used to get word counts. It has the following parameters:

"""
Args:
  text (Tuple[str, List[str]]): The text to count words from. If a string is provided,
    it will be split into words. If a list of strings is provided, each string will
    be split into words and counted separately.
  clean_punctuation (bool, optional): Whether to remove punctuation from the words before counting. Defaults to False.
  punct_replacement (str, optional): The replacement for the punctuation. Only applicable if
    clean_punctuation is True. Defaults to "".
  return_dict (bool, optional): Whether to return the word count as a dictionary.
    Defaults to False.
  ordered (bool, optional): Whether to return the word count in descending order. Only
    applicable if return_dict is True. Defaults to False.

Returns:
  Tuple[int, Dict[str, int]]: If return_dict is True, returns a tuple containing the
    total word count and a dictionary where the keys are the words and the values
    are their respective counts. If return_dict is False, returns only the total
    word count as an integer.
"""

# examples

import bkit

text='āĻ…āĻ­āĻŋāώ⧇āϕ⧇āϰ āφāϗ⧇āϰ āĻĻāĻŋāύ āĻ—āϤāĻ•āĻžāϞ āϰ⧋āĻŦāĻŦāĻžāϰ āĻ“ā§ŸāĻžāĻļāĻŋāĻ‚āϟāύ⧇ āĻŦāĻŋāĻļāĻžāϞ āĻāĻ• āϏāĻŽāĻžāĻŦ⧇āĻļ⧇ āĻšāĻžāϜāĻŋāϰ āĻšāύ āĻŸā§āϰāĻžāĻŽā§āĻĒāĨ¤ āϤāĻŋāύāĻŋ āωāĻšā§āĻ›ā§āĻŦāϏāĻŋāϤ āĻ­āĻ•ā§āϤ-āϏāĻŽāĻ°ā§āĻĨāĻ•āĻĻ⧇āϰ āφāĻŽā§‡āϰāĻŋāĻ•āĻžāϰ āĻĒāϤāύ⧇āϰ āϝāĻŦāύāĻŋāĻ•āĻž āϘāϟāĻžāύ⧋āϰ āĻ…āĻ™ā§āĻ—ā§€āĻ•āĻžāϰ āĻ•āϰ⧇āύāĨ¤'
total_words=bkit.analysis.count_words(text)
print(total_words)
# >>> 21
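
Per the documented parameters, the per-word counts can also be returned as a dictionary (a sketch; the exact dictionary contents depend on the tokenized words):

total_words, word_counts = bkit.analysis.count_words(text, return_dict=True, ordered=True)
print(total_words)
# >>> 21
print(word_counts)
# >>> {...}  # each word mapped to its frequency, most frequent first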

Sentence Count

The bkit.analysis.count_sentences function can be used to get sentence counts. It has the following parameters:

"""
Counts the number of sentences in the given text or list of texts.

Args:
  text (Tuple[str, List[str]]): The text or list of texts to count sentences from.
  return_dict (bool, optional): Whether to return the result as a dictionary. Defaults to False.
  ordered (bool, optional): Whether to order the result in descending order.
    Only applicable if return_dict is True. Defaults to False.

Returns:
  int or dict: The count of sentences. If return_dict is True, returns a dictionary with sentences as keys
    and their counts as values. If return_dict is False, returns the total count of sentences.

Raises:
  AssertionError: If ordered is True but return_dict is False.
"""

# examples
import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ\n āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ.ā§¨ā§Š āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'

count = bkit.analysis.count_sentences(text)
print(count)
# >>> 5

count = bkit.analysis.count_sentences(text, return_dict=True, ordered=True)
print(count)
# >>> {'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•?': 1, 'āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ\n': 1, 'āϰāĻžāϜāϧāĻžāύ⧀āĨ¤': 1, 'āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ!': 1, '⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ.ā§¨ā§Š āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤': 1}

Lemmatization

Lemmatization is implemented based on our paper, BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer. See the paper for more details.

Lemmatize text

Lemmatize a given text. It generally expects the text to be a sentence.

import bkit

text = 'āĻĒ⧃āĻĨāĻŋāĻŦā§€āϰ āϜāύāϏāĻ‚āĻ–ā§āϝāĻž ā§Ž āĻŦāĻŋāϞāĻŋ⧟āύ⧇āϰ āĻ•āĻŋāϛ⧁ āĻ•āĻŽ'

lemmatized = bkit.lemmatizer.lemmatize(text)

print(lemmatized)
# >>> āĻĒ⧃āĻĨāĻŋāĻŦā§€ āϜāύāϏāĻ‚āĻ–ā§āϝāĻž ā§Ž āĻŦāĻŋāϞāĻŋ⧟āύ āĻ•āĻŋāϛ⧁ āĻ•āĻŽ

Lemmatize word

Lemmatize a word given the PoS information.

import bkit

text = 'āĻĒ⧃āĻĨāĻŋāĻŦā§€āϰ'

lemmatized = bkit.lemmatizer.lemmatize_word(text, 'noun')

print(lemmatized)
# >>> āĻĒ⧃āĻĨāĻŋāĻŦā§€

Stemmer

Stemming is the process of reducing words to their base or root form. Our implementation achieves this by conditionally stripping away predefined prefixes and suffixes from each word.
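
Conceptually, the conditional stripping resembles the following toy sketch (illustrative only; the suffix list and condition here are hypothetical, not bkit's actual rules):

# Toy illustration of conditional suffix stripping; not bkit's real rule set.
SUFFIXES = ['āĻŦāĻžāϏ⧀']  # hypothetical suffix list

def toy_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Strip the suffix only when a non-empty stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

print(toy_stem('āύāĻ—āϰāĻŦāĻžāϏ⧀'))
# >>> āύāĻ—āϰ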

Stem word

import bkit

stemmer = bkit.stemmer.SimpleStemmer()
stemmer.word_stemer('āύāĻ—āϰāĻŦāĻžāϏ⧀')
# >>> āύāĻ—āϰ

Stem Sentence

import bkit

stemmer = bkit.stemmer.SimpleStemmer()
stemmer.sentence_stemer('āĻŦāĻŋāϕ⧇āϞ⧇ āϰ⧋āĻĻ āĻ•āĻŋāϛ⧁āϟāĻž āĻ•āĻŽā§‡āϛ⧇āĨ¤')
# >>> āĻŦāĻŋāϕ⧇āϞ āϰ⧋āĻĻ āĻ•āĻŋāϛ⧁ āĻ•āĻŽ

Tokenization

The bkit.tokenizer module is used to tokenize text into tokens. It supports three types of tokenization.

Word tokenization

Tokenize text into words. It also separates some punctuation marks, including the comma, danda (āĨ¤), question mark, etc.

import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'

tokens = bkit.tokenizer.tokenize(text)

print(tokens)
# >>> ['āϤ⧁āĻŽāĻŋ', 'āϕ⧋āĻĨāĻžā§Ÿ', 'āĻĨāĻžāĻ•', '?', 'āĻĸāĻžāĻ•āĻž', 'āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ', 'āϰāĻžāϜāϧāĻžāύ⧀', 'āĨ¤', 'āĻ•āĻŋ', 'āĻ…āĻŦāĻ¸ā§āĻĨāĻž', 'āϤāĻžāϰ', '!', '⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍', 'āϤāĻžāϰāĻŋāϖ⧇', 'āϏ⧇', 'ā§Ē/āĻ•', 'āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ', 'āĻ—āĻŋā§Ÿā§‡', '⧧⧍,ā§Šā§Ēā§Ģ', 'āϟāĻžāĻ•āĻž', 'āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞ', 'āĨ¤']

Word and Punctuation tokenization

Tokenize text into words and any punctuation.

import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'

tokens = bkit.tokenizer.tokenize_word_punctuation(text)

print(tokens)
# >>> ['āϤ⧁āĻŽāĻŋ', 'āϕ⧋āĻĨāĻžā§Ÿ', 'āĻĨāĻžāĻ•', '?', 'āĻĸāĻžāĻ•āĻž', 'āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ', 'āϰāĻžāϜāϧāĻžāύ⧀', 'āĨ¤', 'āĻ•āĻŋ', 'āĻ…āĻŦāĻ¸ā§āĻĨāĻž', 'āϤāĻžāϰ', '!', '⧧⧍', '/', 'ā§Ļā§Š', '/', '⧍ā§Ļ⧍⧍', 'āϤāĻžāϰāĻŋāϖ⧇', 'āϏ⧇', 'ā§Ē', '/', 'āĻ•', 'āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ', 'āĻ—āĻŋā§Ÿā§‡', '⧧⧍', ',', 'ā§Šā§Ēā§Ģ', 'āϟāĻžāĻ•āĻž', 'āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞ', 'āĨ¤']

Sentence tokenization

Tokenize text into sentences.

import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'

tokens = bkit.tokenizer.tokenize_sentence(text)

print(tokens)
# >>> ['āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•?', 'āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤', 'āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ!', '⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤']

Named Entity Recognition (NER)

Predicts the named entity tags of a given text.

import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ.ā§¨ā§Š āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'

ner = bkit.ner.Infer('ner-noisy-label')
predictions = ner(text)

print(predictions)
# >>> [('āϤ⧁āĻŽāĻŋ', 'O', 0.9998692), ('āϕ⧋āĻĨāĻžā§Ÿ', 'O', 0.99988306), ('āĻĨāĻžāĻ•?', 'O', 0.99983954), ('āĻĸāĻžāĻ•āĻž', 'B-GPE', 0.99891424), ('āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ', 'B-GPE', 0.99710876), ('āϰāĻžāϜāϧāĻžāύ⧀āĨ¤', 'O', 0.9995414), ('āĻ•āĻŋ', 'O', 0.99989176), ('āĻ…āĻŦāĻ¸ā§āĻĨāĻž', 'O', 0.99980336), ('āϤāĻžāϰ!', 'O', 0.99983263), ('⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍', 'B-D&T', 0.97921854), ('āϤāĻžāϰāĻŋāϖ⧇', 'O', 0.9271435), ('āϏ⧇', 'O', 0.99934834), ('ā§Ē/āĻ•', 'B-NUM', 0.8297553), ('āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ', 'O', 0.99728775), ('āĻ—āĻŋā§Ÿā§‡', 'O', 0.9994825), ('⧧⧍,ā§Šā§Ēā§Ģ.ā§¨ā§Š', 'B-NUM', 0.99740463), ('āϟāĻžāĻ•āĻž', 'B-UNIT', 0.99914896), ('āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤', 'O', 0.9998908)]

Named Entity Recognition (NER) Visualization

It takes the model's output and visualizes the NER tag for every word in the text.

import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ.ā§¨ā§Š āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'
ner = bkit.ner.Infer('ner-noisy-label')
predictions = ner(text)
bkit.ner.visualize(predictions)

[NER visualization: NER.png]

Parts of Speech (PoS) tagging

Predicts the part-of-speech tags of a given text.

import bkit

text = 'āĻ—āϤ āĻ•āĻŋāϛ⧁āĻĻāĻŋāύ āϧāϰ⧇āχ āĻœā§āĻŦāĻžāϞāĻžāύāĻŋāĻšā§€āύ āĻ…āĻŦāĻ¸ā§āĻĨāĻžā§Ÿ āĻāĻ•āϟāĻŋ āϛ⧋āϟ āĻŽāĻžāĻ› āϧāϰāĻžāϰ āύ⧌āĻ•āĻžā§Ÿ ā§§ā§Ģā§Ļ āϜāύ āϰ⧋āĻšāĻŋāĻ™ā§āĻ—āĻž āφāĻ¨ā§āĻĻāĻžāĻŽāĻžāύ āϏāĻžāĻ—āϰ⧇ āĻ­āĻžāϏāĻŽāĻžāύ āĻ…āĻŦāĻ¸ā§āĻĨāĻžā§Ÿ āĻ°ā§Ÿā§‡āϛ⧇ āĨ¤'
pos = bkit.pos.Infer('pos-noisy-label')
predictions = pos(text)

print(predictions)
# >>> [('āĻ—āϤ', 'ADJ', 0.98674506), ('āĻ•āĻŋāϛ⧁āĻĻāĻŋāύ', 'NNC', 0.97954935), ('āϧāϰ⧇āχ', 'PP', 0.96124), ('āĻœā§āĻŦāĻžāϞāĻžāύāĻŋāĻšā§€āύ', 'ADJ', 0.93195957), ('āĻ…āĻŦāĻ¸ā§āĻĨāĻžā§Ÿ', 'NNC', 0.9960413), ('āĻāĻ•āϟāĻŋ', 'QF', 0.9912915), ('āϛ⧋āϟ', 'ADJ', 0.9810739), ('āĻŽāĻžāĻ›', 'NNC', 0.97365385), ('āϧāϰāĻžāϰ', 'NNC', 0.96641904), ('āύ⧌āĻ•āĻžā§Ÿ', 'NNC', 0.99680626), ('ā§§ā§Ģā§Ļ', 'QF', 0.996005), ('āϜāύ', 'NNC', 0.99434316), ('āϰ⧋āĻšāĻŋāĻ™ā§āĻ—āĻž', 'NNP', 0.9141038), ('āφāĻ¨ā§āĻĻāĻžāĻŽāĻžāύ', 'NNP', 0.9856694), ('āϏāĻžāĻ—āϰ⧇', 'NNP', 0.7122378), ('āĻ­āĻžāϏāĻŽāĻžāύ', 'ADJ', 0.93841994), ('āĻ…āĻŦāĻ¸ā§āĻĨāĻžā§Ÿ', 'NNC', 0.9965629), ('āĻ°ā§Ÿā§‡āϛ⧇', 'VF', 0.99680847), ('āĨ¤', 'PUNCT', 0.9963098)]

Parts of Speech (PoS) Visualization

"It takes the model's output and visualizes the Part-of-Speech tag for every word in the text.

import bkit

text = 'āĻ—āϤ āĻ•āĻŋāϛ⧁āĻĻāĻŋāύ āϧāϰ⧇āχ āĻœā§āĻŦāĻžāϞāĻžāύāĻŋāĻšā§€āύ āĻ…āĻŦāĻ¸ā§āĻĨāĻžā§Ÿ āĻāĻ•āϟāĻŋ āϛ⧋āϟ āĻŽāĻžāĻ› āϧāϰāĻžāϰ āύ⧌āĻ•āĻžā§Ÿ ā§§ā§Ģā§Ļ āϜāύ āϰ⧋āĻšāĻŋāĻ™ā§āĻ—āĻž āφāĻ¨ā§āĻĻāĻžāĻŽāĻžāύ āϏāĻžāĻ—āϰ⧇ āĻ­āĻžāϏāĻŽāĻžāύ āĻ…āĻŦāĻ¸ā§āĻĨāĻžā§Ÿ āĻ°ā§Ÿā§‡āϛ⧇ āĨ¤'
pos = bkit.pos.Infer('pos-noisy-label')
predictions = pos(text)
bkit.pos.visualize(predictions)

[PoS visualization: pos.png]

Shallow Parsing (Constituency Parsing)

Predicts the shallow parsing tags of a given text.

import bkit

text = 'āϤ⧁āĻŽāĻŋ āϕ⧋āĻĨāĻžā§Ÿ āĻĨāĻžāĻ•? āĻĸāĻžāĻ•āĻž āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ āϰāĻžāϜāϧāĻžāύ⧀āĨ¤ āĻ•āĻŋ āĻ…āĻŦāĻ¸ā§āĻĨāĻž āϤāĻžāϰ! ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍ āϤāĻžāϰāĻŋāϖ⧇ āϏ⧇ ā§Ē/āĻ• āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ āĻ—āĻŋā§Ÿā§‡ ⧧⧍,ā§Šā§Ēā§Ģ.ā§¨ā§Š āϟāĻžāĻ•āĻž āĻĻāĻŋā§Ÿā§‡āĻ›āĻŋāϞāĨ¤'
shallow = bkit.shallow.Infer(pos_model='pos-noisy-label')
predictions = shallow(text)
print(predictions)
# >>> (S (VP (NP (PRO āϤ⧁āĻŽāĻŋ)) (VP (ADVP (ADV āϕ⧋āĻĨāĻžā§Ÿ)) (VF āĻĨāĻžāĻ•))) (NP (NNP ?) (NNP āĻĸāĻžāĻ•āĻž) (NNC āĻŦāĻžāĻ‚āϞāĻžāĻĻ⧇āĻļ⧇āϰ)) (ADVP (ADV āϰāĻžāϜāϧāĻžāύ⧀)) (NP (NP (NP (NNC āĨ¤)) (NP (PRO āĻ•āĻŋ))) (NP (QF āĻ…āĻŦāĻ¸ā§āĻĨāĻž) (NNC āϤāĻžāϰ)) (NP (PRO !))) (NP (NP (QF ⧧⧍/ā§Ļā§Š/⧍ā§Ļ⧍⧍) (NNC āϤāĻžāϰāĻŋāϖ⧇)) (VNF āϏ⧇) (NP (QF ā§Ē/āĻ•) (NNC āĻ āĻŋāĻ•āĻžāύāĻžā§Ÿ))) (VF āĻ—āĻŋā§Ÿā§‡))

Shallow Parsing Visualization

It converts model predictions into an interactive shallow parsing tree for clear and intuitive analysis.

from bkit import shallow

text = "āĻ•āĻžāϤāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ⧇ āφāĻ°ā§āĻœā§‡āĻ¨ā§āϟāĻŋāύāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ āĻœā§Ÿā§‡ āĻŽāĻžāĻ°ā§āϤāĻŋāύ⧇āĻœā§‡āϰ āĻ…āĻŦāĻĻāĻžāύ āĻ…āύ⧇āĻ•āĨ¤"

# Use a separate variable so the shallow module (and shallow.visualize) is not shadowed.
shallow_infer = shallow.Infer(pos_model='pos-noisy-label')
predictions = shallow_infer(text)
shallow.visualize(predictions)

[Shallow parsing visualization: shallow.png]

Dependency Parsing

Predicts the dependency parsing tags of a given text.

from bkit import dependency

text = "āĻ•āĻžāϤāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ⧇ āφāĻ°ā§āĻœā§‡āĻ¨ā§āϟāĻŋāύāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ āĻœā§Ÿā§‡ āĻŽāĻžāĻ°ā§āϤāĻŋāύ⧇āĻœā§‡āϰ āĻ…āĻŦāĻĻāĻžāύ āĻ…āύ⧇āĻ•āĨ¤"
dep = dependency.Infer('dependency-parsing')
predictions = dep(text)
print(predictions)
# >>>[{'text': 'āĻ•āĻžāϤāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ⧇ āφāĻ°ā§āĻœā§‡āĻ¨ā§āϟāĻŋāύāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ āĻœā§Ÿā§‡ āĻŽāĻžāĻ°ā§āϤāĻŋāύ⧇āĻœā§‡āϰ āĻ…āĻŦāĻĻāĻžāύ āĻ…āύ⧇āĻ• āĨ¤', 'predictions': [{'token_start': 1, 'token_end': 0, 'label': 'compound'}, {'token_start': 7, 'token_end': 1, 'label': 'obl'}, {'token_start': 4, 'token_end': 2, 'label': 'nmod'}, {'token_start': 4, 'token_end': 3, 'label': 'nmod'}, {'token_start': 7, 'token_end': 4, 'label': 'obl'}, {'token_start': 6, 'token_end': 5, 'label': 'nmod'}, {'token_start': 7, 'token_end': 6, 'label': 'nsubj'}, {'token_start': 7, 'token_end': 7, 'label': 'root'}, {'token_start': 7, 'token_end': 8, 'label': 'punct'}]}]

Dependency Parsing Visualization

It converts model predictions into an interactive dependency graph for clear and intuitive analysis.

from bkit import dependency
text = "āĻ•āĻžāϤāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ⧇ āφāĻ°ā§āĻœā§‡āĻ¨ā§āϟāĻŋāύāĻžāϰ āĻŦāĻŋāĻļā§āĻŦāĻ•āĻžāĻĒ āĻœā§Ÿā§‡ āĻŽāĻžāĻ°ā§āϤāĻŋāύ⧇āĻœā§‡āϰ āĻ…āĻŦāĻĻāĻžāύ āĻ…āύ⧇āĻ•āĨ¤"
dep = dependency.Infer('dependency-parsing')
predictions = dep(text)
dependency.visualize(predictions)

[Dependency parsing visualization: dependency-visu.png]

Coreference Resolution

Predicts the coreference clusters of a given text.

import bkit

text = "āϤāĻžāϰāĻžāϏ⧁āĻ¨ā§āĻĻāϰ⧀ ( ā§§ā§Žā§­ā§Ž - ⧧⧝ā§Ēā§Ž ) āĻ…āĻ­āĻŋāύ⧇āĻ¤ā§āϰ⧀ āĨ¤ ā§§ā§Žā§Žā§Ē āϏāĻžāϞ⧇ āĻŦāĻŋāύ⧋āĻĻāĻŋāύ⧀āϰ āϏāĻšāĻžāϝāĻŧāϤāĻžāϝāĻŧ āĻ¸ā§āϟāĻžāϰ āĻĨāĻŋāϝāĻŧ⧇āϟāĻžāϰ⧇ āϝ⧋āĻ—āĻĻāĻžāύ⧇āϰ āĻŽāĻžāĻ§ā§āϝāĻŽā§‡ āϤāĻŋāύāĻŋ āĻ…āĻ­āĻŋāύāϝāĻŧ āĻļ⧁āϰ⧁ āĻ•āϰ⧇āύ āĨ¤ āĻĒā§āϰāĻĨāĻŽā§‡ āϤāĻŋāύāĻŋ āĻ—āĻŋāϰāĻŋāĻļāϚāĻ¨ā§āĻĻā§āϰ āĻ˜ā§‹āώ⧇āϰ āϚ⧈āϤāĻ¨ā§āϝāϞ⧀āϞāĻž āύāĻžāϟāϕ⧇ āĻāĻ• āĻŦāĻžāϞāĻ• āĻ“ āϏāϰāϞāĻž āύāĻžāϟāϕ⧇ āĻ—ā§‹āĻĒāĻžāϞ āϚāϰāĻŋāĻ¤ā§āϰ⧇ āĻ…āĻ­āĻŋāύāϝāĻŧ āĻ•āϰ⧇āύ āĨ¤"
coref = bkit.coref.Infer('coref')
predictions = coref(text)
print(predictions)
# >>> {'text': ['āϤāĻžāϰāĻžāϏ⧁āĻ¨ā§āĻĻāϰ⧀', '(', 'ā§§ā§Žā§­ā§Ž', '-', '⧧⧝ā§Ēā§Ž', ')', 'āĻ…āĻ­āĻŋāύ⧇āĻ¤ā§āϰ⧀', 'āĨ¤', 'ā§§ā§Žā§Žā§Ē', 'āϏāĻžāϞ⧇', 'āĻŦāĻŋāύ⧋āĻĻāĻŋāύ⧀āϰ', 'āϏāĻšāĻžāϝāĻŧāϤāĻžāϝāĻŧ', 'āĻ¸ā§āϟāĻžāϰ', 'āĻĨāĻŋāϝāĻŧ⧇āϟāĻžāϰ⧇', 'āϝ⧋āĻ—āĻĻāĻžāύ⧇āϰ', 'āĻŽāĻžāĻ§ā§āϝāĻŽā§‡', 'āϤāĻŋāύāĻŋ', 'āĻ…āĻ­āĻŋāύāϝāĻŧ', 'āĻļ⧁āϰ⧁', 'āĻ•āϰ⧇āύ', 'āĨ¤', 'āĻĒā§āϰāĻĨāĻŽā§‡', 'āϤāĻŋāύāĻŋ', 'āĻ—āĻŋāϰāĻŋāĻļāϚāĻ¨ā§āĻĻā§āϰ', 'āĻ˜ā§‹āώ⧇āϰ', 'āϚ⧈āϤāĻ¨ā§āϝāϞ⧀āϞāĻž', 'āύāĻžāϟāϕ⧇', 'āĻāĻ•', 'āĻŦāĻžāϞāĻ•', 'āĻ“', 'āϏāϰāϞāĻž', 'āύāĻžāϟāϕ⧇', 'āĻ—ā§‹āĻĒāĻžāϞ', 'āϚāϰāĻŋāĻ¤ā§āϰ⧇', 'āĻ…āĻ­āĻŋāύāϝāĻŧ', 'āĻ•āϰ⧇āύ', 'āĨ¤'], 'mention_indices': {0: [{'start_token': 0, 'end_token': 0}, {'start_token': 6, 'end_token': 6}, {'start_token': 10, 'end_token': 10}, {'start_token': 16, 'end_token': 16}, {'start_token': 22, 'end_token': 22}]}}

Coreference Resolution Visualization

It takes the model's output and creates an interactive visualization that clearly depicts coreference resolution, highlighting the relationships between entities in the text.

from bkit import coref

text = "āϤāĻžāϰāĻžāϏ⧁āĻ¨ā§āĻĻāϰ⧀ ( ā§§ā§Žā§­ā§Ž - ⧧⧝ā§Ēā§Ž ) āĻ…āĻ­āĻŋāύ⧇āĻ¤ā§āϰ⧀ āĨ¤ ā§§ā§Žā§Žā§Ē āϏāĻžāϞ⧇ āĻŦāĻŋāύ⧋āĻĻāĻŋāύ⧀āϰ āϏāĻšāĻžāϝāĻŧāϤāĻžāϝāĻŧ āĻ¸ā§āϟāĻžāϰ āĻĨāĻŋāϝāĻŧ⧇āϟāĻžāϰ⧇ āϝ⧋āĻ—āĻĻāĻžāύ⧇āϰ āĻŽāĻžāĻ§ā§āϝāĻŽā§‡ āϤāĻŋāύāĻŋ āĻ…āĻ­āĻŋāύāϝāĻŧ āĻļ⧁āϰ⧁ āĻ•āϰ⧇āύ āĨ¤ āĻĒā§āϰāĻĨāĻŽā§‡ āϤāĻŋāύāĻŋ āĻ—āĻŋāϰāĻŋāĻļāϚāĻ¨ā§āĻĻā§āϰ āĻ˜ā§‹āώ⧇āϰ āϚ⧈āϤāĻ¨ā§āϝāϞ⧀āϞāĻž āύāĻžāϟāϕ⧇ āĻāĻ• āĻŦāĻžāϞāĻ• āĻ“ āϏāϰāϞāĻž āύāĻžāϟāϕ⧇ āĻ—ā§‹āĻĒāĻžāϞ āϚāϰāĻŋāĻ¤ā§āϰ⧇ āĻ…āĻ­āĻŋāύāϝāĻŧ āĻ•āϰ⧇āύ āĨ¤"

# Use a separate variable so the coref module (and coref.visualize) is not shadowed.
coref_infer = coref.Infer('coref')
predictions = coref_infer(text)
coref.visualize(predictions)

[Coreference visualization: coref.png]
