Description • Key Features • Installation • Basic Usage • Advanced Usage • Documentation • Development • Contributing • Acknowledgements
## Description

Botok is a powerful Python library for tokenizing Tibetan text. It segments text into words with high accuracy and provides optional attributes such as lemma, part-of-speech (POS) tags, and clean forms. The library supports various text formats, custom dialects, and multiple tokenization modes, making it a versatile tool for Tibetan Natural Language Processing (NLP).
## Installation

```bash
pip install botok
```

To install from source:

```bash
git clone https://github.com/OpenPecha/botok.git
cd botok
pip install -e .
```
## Basic Usage

```python
from botok import WordTokenizer
from botok.config import Config
from pathlib import Path

# Initialize the tokenizer with the default configuration
config = Config(dialect_name="general", base_path=Path.home())
wt = WordTokenizer(config=config)

# Tokenize a Tibetan sentence
text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་ སེམས་པ་སྐྱིད་པོ་འདུག།"
tokens = wt.tokenize(text, split_affixes=False)

# Print each token
for token in tokens:
    print(token)
```
To process a whole file:

```python
from botok import Text
from pathlib import Path

# Process a file; creates input_pybo.txt with the tokenized output
input_file = Path("input.txt")
t = Text(input_file)
t.tokenize_chunks_plaintext
```
## Advanced Usage

```python
from botok import WordTokenizer
from botok.config import Config
from pathlib import Path

# Configure a custom dialect
config = Config(
    dialect_name="custom",
    base_path=Path.home() / "my_dialects"
)

# Initialize the tokenizer with the custom config
wt = WordTokenizer(config=config)

# Process text with custom settings
text = "བཀྲ་ཤིས་བདེ་ལེགས།"
tokens = wt.tokenize(
    text,
    split_affixes=True,
    pos_tagging=True,
    lemmatize=True
)
```
Botok offers multiple tokenization modes:

```python
from botok import Text

text = """ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་"""
t = Text(text)

# 1. Word tokenization
words = t.tokenize_words_raw_text

# 2. Chunk tokenization (groups of meaningful characters)
chunks = t.tokenize_chunks_plaintext

# 3. Space-based tokenization
spaces = t.tokenize_on_spaces
```
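As a rough illustration of what syllable-level chunking means (this is not botok's actual algorithm, and `naive_syllables` is a hypothetical helper): written Tibetan delimits syllables with the tsheg mark (་, U+0F0B), so a naive chunker can simply split on it.

```python
# Illustrative only: real chunking in botok also handles punctuation
# such as the shad (།), whitespace, and non-Tibetan spans.
TSHEG = "\u0f0b"  # ་ the Tibetan syllable delimiter

def naive_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]

print(naive_syllables("བཀྲ་ཤིས་བདེ་ལེགས"))  # ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']
```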
## Documentation

For comprehensive documentation, visit:
## Development

To build a source distribution:

```bash
rm -rf dist/
python setup.py clean sdist
```
The repository is configured with GitHub Actions to automatically handle version bumping and publishing to PyPI when changes are pushed to the master branch. The workflow uses semantic versioning based on commit messages:
Use the following commit message formats:

- `fix: your message` for bug fixes (triggers a PATCH version bump)
- `feat: your message` for new features (triggers a MINOR version bump)
- `BREAKING CHANGE: description` in the commit body for breaking changes (triggers a MAJOR version bump)

Examples:
```text
# This will trigger a PATCH version bump (e.g., 0.8.12 → 0.8.13)
fix: improve test coverage to 90% and fix Python 3.12 compatibility

# This will trigger a MINOR version bump (e.g., 0.8.12 → 0.9.0)
feat: add new sentence tokenization mode for complex Tibetan sentences

# This will trigger a MAJOR version bump (e.g., 0.8.12 → 1.0.0)
feat: refactor token attributes structure

BREAKING CHANGE: Token.attributes now uses a dictionary format instead of properties, requiring changes to code that accesses token attributes directly
```
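The bump rules above can be sketched as a small classifier. This is illustrative only: the CI workflow uses a semantic-release tool, not this code, and the function name is hypothetical.

```python
import re

def bump_type(message: str) -> str:
    """Classify a conventional-commit message into a version bump.

    Sketch of the rules described above, not the actual CI logic.
    """
    if "BREAKING CHANGE:" in message:
        return "major"
    if re.match(r"^feat(\(.+\))?:", message):
        return "minor"
    if re.match(r"^fix(\(.+\))?:", message):
        return "patch"
    return "none"

print(bump_type("fix: improve test coverage"))       # patch
print(bump_type("feat: add sentence tokenization"))  # minor
print(bump_type("feat: refactor\n\nBREAKING CHANGE: new format"))  # major
```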
When you push to the master branch, the CI workflow will:
For manual publishing (if needed):

```bash
twine upload dist/*
```
To run the test suite:

```bash
pytest tests/
```
## Contributing

We welcome contributions! Here's how you can help:

1. Create your feature branch (`git checkout -b feature/AmazingFeature`)
2. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
3. Push to the branch (`git push origin feature/AmazingFeature`)

Please ensure your PR adheres to:
## Acknowledgements

Botok is an open source library for Tibetan NLP. We are grateful to our sponsors and contributors:

Copyright (C) 2019-2025 OpenPecha. Licensed under Apache 2.0.