
bo-sent-tokenizer

Tibetan sentence tokenizer for segmenting Tibetan text into sentences

Package index: PyPI (pip)
Version: 0.0.1
Maintainers: 1


OpenPecha

Tibetan sentence tokenizer.

Description

Tibetan sentence tokenizer designed specifically for data preparation.

Project owner(s)

  • @tenzin3

Installation

pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git

Usage

Important note: If speed is essential, prefer sentence segmentation over sentence tokenization.

1. Sentence tokenization

from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

tokenized_text = tokenize(text)
print(tokenized_text)  # Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'


Explanation

The code is adapted from op_mt_tools, with minor changes to produce the output explained below.

Output Explanation

The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.

The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token 'བབབབབབབབནམ'.

The text 'ངའི་མིང་ལ་Thomas་ཟེར།' includes characters from another language.

The text 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.

If the text is clean, it is retained. If a sentence contains an illegal token or characters from another language, that sentence is excluded. If a sentence contains non-Tibetan symbols, these symbols are filtered out, and the sentence is retained.
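The three rules above can be illustrated with a small, self-contained sketch. This is not the package's implementation: `clean_sentence`, `is_tibetan`, and `toy_is_legal_syllable` are hypothetical names, and the legality check is a deliberately crude stand-in (it only rejects a character repeated three times in a row), chosen so that the example sentences behave as described.

```python
import re
from typing import Callable, Optional


def is_tibetan(ch: str) -> bool:
    """True if ch lies in the Tibetan Unicode block (U+0F00 to U+0FFF)."""
    return "\u0f00" <= ch <= "\u0fff"


def toy_is_legal_syllable(syllable: str) -> bool:
    """Toy stand-in for a real legality check (assumption, not the package's rule):
    reject any syllable in which one character occurs three times in a row."""
    return re.search(r"(.)\1\1", syllable) is None


def clean_sentence(
    sentence: str,
    is_legal_syllable: Callable[[str], bool] = toy_is_legal_syllable,
) -> Optional[str]:
    """Return the cleaned sentence, or None if the sentence must be excluded."""
    # Rule: letters from another language exclude the whole sentence.
    if any(ch.isalpha() and not is_tibetan(ch) for ch in sentence):
        return None
    # Rule: non-Tibetan symbols are filtered out and the sentence is kept.
    kept = "".join(ch for ch in sentence if is_tibetan(ch))
    # Rule: an illegal token excludes the whole sentence.
    # Syllables are separated by the tsheg (་) and the shad (།).
    syllables = [s for s in re.split("[་།]", kept) if s]
    if not all(is_legal_syllable(s) for s in syllables):
        return None
    return kept


sentences = [
    "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།",
    "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།",
    "ངའི་མིང་ལ་Thomas་ཟེར།",
    "ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།",
]
for s in sentences:
    print(repr(clean_sentence(s)))
```

With this toy check, the first and fourth sentences are retained (the fourth without its parentheses) and the second and third are excluded, which mirrors the behavior described above. A real legality check would need actual knowledge of valid Tibetan syllables.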

2. Sentence segmentation

from bo_sent_tokenizer import segment

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

segmented_text = segment(text)
print(segmented_text)  # Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།\nངའི་མིང་ལ་ ་ཟེར།\nཁྱེད་དེ་རིང(བདེ་མོ་)ཡིན་ནམ།\n'

Terms:

Closing Punctuation: Characters in the Tibetan language that symbolize the end of a sentence, similar to a full stop in English.

Opening Punctuation: Characters in the Tibetan language that symbolize the start of a sentence.
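To see which characters are involved, the standard library can print the code point and Unicode name of each mark. The three characters inspected here are the ones used in the example further down; the full sets the package uses may be larger. `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(chars: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character in chars."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in chars
    ]


# Two opening marks followed by the shad, the most common closing mark.
for code_point, name in describe("༄༅།"):
    print(code_point, name)
```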

How Sentence Segmentation Works:

  • Preprocessing: All carriage returns and newline characters are removed from the string.

  • Splitting into Parts: The preprocessed text is then split by closing punctuation using a regular expression.

  • Joining the Parts:

    • Empty parts are ignored.
    • In some cases, closing punctuation appears immediately after opening punctuation, so care is taken not to split these instances. Example of a valid Tibetan sentence: ༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།
      • ༄༅ = opening punctuation
      • །། = closing punctuation
  • Filtering Text: Only Tibetan characters and a few predefined symbols are retained; all other characters are removed.

Note:

  • Closing punctuation, opening punctuation, and predefined symbols are defined in the file vars.py
  • To have a better understanding of the code, refer to the test cases in test_segmenter.py
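The four steps under "How Sentence Segmentation Works" can be sketched roughly as follows. This is an illustrative approximation, not the package's code: the `OPENING`, `CLOSING`, and allowed-symbol sets are assumptions (the real ones are defined in vars.py), `segment_sketch` is a hypothetical name, and the output is not guaranteed to match `segment` character for character.

```python
import re

# Assumed, simplified character sets; the real definitions live in vars.py.
OPENING = "༄༅"
CLOSING = "།༎"

# The capturing group makes re.split keep each run of closing punctuation.
SPLIT_RE = re.compile(f"([{CLOSING}]+)")
# Anything outside the Tibetan block (U+0F00 to U+0FFF), space, and parentheses.
DISALLOWED_RE = re.compile(r"[^\u0F00-\u0FFF ()]")


def segment_sketch(text: str) -> str:
    # 1. Preprocessing: remove carriage returns and newlines.
    text = text.replace("\r", "").replace("\n", "")

    # 2. Splitting: parts alternate between text and closing punctuation,
    #    e.g. [body, closing, body, closing, ..., trailing body].
    parts = SPLIT_RE.split(text)

    # 3. Joining: re-attach each closing run to the text before it.
    sentences = []
    pending = ""
    for i in range(0, len(parts), 2):
        body = parts[i].strip()
        closing = parts[i + 1] if i + 1 < len(parts) else ""
        if not body:
            continue  # empty parts are ignored
        pending += body + closing
        if all(ch in OPENING for ch in body):
            # Closing punctuation directly after opening punctuation:
            # do not end the sentence here, keep accumulating.
            continue
        sentences.append(pending)
        pending = ""
    if pending:
        sentences.append(pending)

    # 4. Filtering: drop every character that is not allowed.
    cleaned = (DISALLOWED_RE.sub("", s) for s in sentences)
    return "".join(s + "\n" for s in cleaned if s)


print(segment_sketch("༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།"))
```

In the call above, the text stays a single sentence even though །། appears right after ༄༅, because a part made up only of opening punctuation is carried forward and not emitted on its own.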
