unit-bpe

  • 0.2.0
  • PyPI
A byte-pair encoding (BPE) tokenizer that operates on integer sequences. The core is implemented in Rust, with Python bindings built using PyO3 and Maturin.

Installation

pip install unit-bpe

Example usage from Python

from unit_bpe import fit_concurrent_py, encode_concurrent_py, decode_concurrent_py

units_list = [
    [0, 1, 0, 1, 2, 0, 1, 2, 3],
    [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5]
]
vocab_size = 10
# Since there are 6 units in the training data, 10 - 6 = 4 merge operations are performed

encoded_units, merges = fit_concurrent_py(units_list, vocab_size)
print(encoded_units)  # [[6, 7, 8], [9, 9, 5]]
print(merges)  # [((0, 1), 6), ((8, 4), 9), ((7, 3), 8), ((6, 2), 7)]

units_list_to_encode = [[0, 1, 0, 1, 2, 3, 4, 5], [0, 1, 2, 0, 1, 2, 3]]
encoded = encode_concurrent_py(units_list_to_encode, merges)
print(encoded)  # [[6, 9, 5], [7, 8]]

decoded = decode_concurrent_py(encoded, merges)
print(decoded)  # [[0, 1, 0, 1, 2, 3, 4, 5], [0, 1, 2, 0, 1, 2, 3]]
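
To make the merge procedure behind the example easier to follow, here is a minimal pure-Python sketch of BPE on integer sequences. It is not the library's Rust implementation and has no concurrency. The names `merge_pair`, `fit`, `encode`, and `decode` are hypothetical, and three behaviors are assumptions: new token IDs start at the largest unit plus one, ties between equally frequent pairs are broken by `Counter` ordering, and encoding replays merges in ascending order of new ID.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def fit(units_list, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` (assumed semantics)."""
    seqs = [list(s) for s in units_list]
    base_units = {u for s in seqs for u in s}
    next_id = max(base_units) + 1  # assumption: new IDs follow the largest unit
    merges = []
    for _step in range(vocab_size - len(base_units)):
        # Count adjacent pairs across all sequences.
        counts = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        merges.append((pair, next_id))
        next_id += 1
    return seqs, merges


def encode(seq, merges):
    """Apply learned merges in the order they were created (ascending new ID)."""
    seq = list(seq)
    for pair, new_id in sorted(merges, key=lambda m: m[1]):
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged tokens back into base units."""
    table = {new_id: pair for pair, new_id in merges}
    stack, out = list(reversed(seq)), []
    while stack:
        tok = stack.pop()
        if tok in table:
            left, right = table[tok]
            stack.append(right)  # pushed first so `left` is popped first
            stack.append(left)
        else:
            out.append(tok)
    return out


units_list = [
    [0, 1, 0, 1, 2, 0, 1, 2, 3],
    [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5],
]
encoded_units, merges = fit(units_list, 10)
print(encoded_units)  # [[6, 7, 8], [9, 9, 5]]
print(merges)  # [((0, 1), 6), ((6, 2), 7), ((7, 3), 8), ((8, 4), 9)]
```

On this input the sketch learns the same four merges as the example above, listed in creation order rather than the order the library prints them.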

Development Guide

Installation

  • Rust environment

  • Python environment

    • uv is used as the package manager
    • Run uv sync to install dependencies

Running tests

  • Rust

    cargo test --lib
    
  • Python

    uv run pytest
    
    • To install the crate as a Python module in the virtual environment, run maturin develop.
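
As an illustration of the kind of check a Python test could make, the sketch below verifies that decoding an encoded sequence returns the original input. It is not taken from tests/test_unit_bpe.py. The helper name `check_round_trip` is hypothetical, and it takes the three functions as arguments so it does not depend on the package being installed.

```python
def check_round_trip(fit, encode, decode, units_list, vocab_size):
    """Fit on `units_list`, then check that encode followed by decode is lossless.

    `fit`, `encode`, and `decode` are expected to follow the call shapes shown
    in the usage example: fit(units_list, vocab_size) -> (encoded, merges),
    encode(units_list, merges) -> encoded, decode(encoded, merges) -> units_list.
    """
    _encoded_units, merges = fit(units_list, vocab_size)

    # The number of merges should not exceed vocab_size minus the base units.
    n_base = len({u for seq in units_list for u in seq})
    assert len(merges) <= max(vocab_size - n_base, 0)

    # Encoding and then decoding should reproduce the input exactly.
    encoded = encode(units_list, merges)
    assert decode(encoded, merges) == units_list
    return merges
```

With the package installed, the helper could be called as `check_round_trip(fit_concurrent_py, encode_concurrent_py, decode_concurrent_py, units_list, 10)` from inside a pytest test function.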

Directory structure

unit-bpe
├── src
│   ├── lib.rs                # Rust library entry point
│   ├── core.rs               # Core logic of BPE
│   ├── concurrent.rs         # Extension of core.rs for concurrent processing
│   ├── python_bindings.rs    # Bindings to expose Rust functions to Python
│   └── test.rs               # Rust unit tests
├── tests
│   └── test_unit_bpe.py      # Python unit tests
├── .gitignore
├── Cargo.toml                # Rust dependency definitions
├── Cargo.lock                # Rust dependency lock file
├── README.md
├── pyproject.toml            # Python dependency definitions
└── uv.lock                   # Python dependency lock file
