bo-sent-tokenizer

Package Overview

Dependencies

Maintainers

Alerts

File Explorer

Advanced tools

License

Install Socket

Detect and block malicious and high-risk dependencies

Install

bo-sent-tokenizer

Tibetan sentence tokenizer for segmenting Tibetan text into sentences

PyPI

Version: 0.0.1

Maintainers: 1

tibetan sentence tokenizer.

Description

Tibetan sentence tokenizer designed specifically for data preparation.

Project owner(s)

@tenzin3

Installation

pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git

Usage

Important Note: If speed is essential, prioritize sentence segmentation over sentence tokenization.

1.Sentence tokenization

from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

tokenized_text = tokenize(text)
print(tokenized_text) #Output:> 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'

Explanation

code is refered from op_mt_tools and made minor changes to get the following desired output.

Output Explanation

The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.

The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token 'བབབབབབབབནམ'.

The text 'ངའི་མིང་ལ་Thomas་ཟེར།' includes characters from another language.

The text 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains non-Tibetan symbols '(', and ')'.

If the text is clean, it is retained. If a sentence contains an illegal token or characters from another language, that sentence is excluded. If a sentence contains non-Tibetan symbols, these symbols are filtered out, and the sentence is retained.

2.Sentence segmentation

from bo_sent_tokenizer import segment

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

segmented_text = segment(text)
print(segmented_text) #Output:> 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།\nངའི་མིང་ལ་ ་ཟེར།\nཁྱེད་དེ་རིང(བདེ་མོ་)ཡིན་ནམ།\n'

Terms:

Closing Punctuation: Characters in the Tibetan language that symbolize the end of a sentence, similar to a full stop in English.

Opening Punctuation: Characters in the Tibetan language that symbolize the start of a sentence.

How Sentence Segmentation Works:

Preprocessing: All carriage returns and new lines are removed from the string.
Splitting into Parts: The preprocessed text is then split by closing punctuation using a regular expression.
Joining the Parts:
- Empty parts are ignored.
- In some cases, closing punctuation appears immediately after opening punctuation, so care is taken not to split these instances. Example of a valid Tibetan sentence: ༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།
  - ༄༅ = opening punctuation
  - །། = closing punctuation
Filtering Text: Only Tibetan characters and a few predefined symbols are retained; all other characters are removed.

Note:

Closing punctuation, opening punctuation, and predefined symbols are defined in the file vars.py
To have a better understanding of the code, refer to the test cases in test_segmenter.py

FAQs

What is bo-sent-tokenizer?

Is bo-sent-tokenizer well maintained?

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

bo-sent-tokenizer

tibetan sentence tokenizer.

Description

Project owner(s)

Installation

Usage

1.Sentence tokenization

Explanation

Output Explanation

2.Sentence segmentation

Terms:

How Sentence Segmentation Works:

Related posts

Feross on Risky Business Weekly Podcast: npm’s Ongoing Supply Chain Attacks

Introducing Tier 1 Reachability: Precision CVE Triage for Enterprise Teams