# BP Tokenizer
bp_tokenizer is a high-performance Python tokenizer based on Byte-Pair Encoding (BPE). It supports a vocabulary of up to 100,000 tokens, efficient text encoding/decoding, batch processing, and basic text normalization. It is designed for NLP applications, preprocessing pipelines, or any project that requires custom tokenization.
## Features
- Efficient Encoding: Encode text into token IDs using BPE.
- Decoding: Decode token IDs back to human-readable text.
- Batch Processing: Support for encoding multiple texts at once.
- Special Tokens: Built-in support for `<UNK>`, `<BOS>`, and `<EOS>`.
- Large Vocabulary: Supports a vocabulary size of up to 100,000 tokens.
- Normalization: Basic text normalization (lowercasing, punctuation removal).
- Sentence Splitting: Simple utilities to split text into sentences.
## Installation

You can install directly from PyPI:

```shell
pip install bp-tokenizer
```
## Quick Start
### Import and Initialize

```python
from bp_tokenizer import Tokenizer

# Initialize tokenizer
tokenizer = Tokenizer()
```
### Encode a Single Text

```python
text = "Hello world!"
encoded = tokenizer.encode(text)
print("Encoded:", encoded)
```

Output:

```text
Encoded: [72, 9257, 1295, 33]
```
### Decode Token IDs

```python
decoded = tokenizer.decode(encoded)
print("Decoded:", decoded)
```

Output:

```text
Decoded: Hello world!
```
### Encode a Batch of Texts

```python
texts = ["Hello world!", "Byte-Pair Encoding example."]
encoded_batch = tokenizer.encode_batch(texts)
print("Batch Encoded:", encoded_batch)
```

Output:

```text
Batch Encoded: [[72, 9257, 1295, 33], [66, 121, 507, 45, 80, 937, 18258, 16022, 2461, 46]]
```
## Additional Utilities

```python
# Get number of tokens in a text
print(tokenizer.token_count("Hello world!"))

# Normalize text
print(tokenizer.normalize_text("Hello, World!!!", lower=True, remove_punct=True))

# Split text into words/punctuation
print(tokenizer.tokenize("Hello, world! How are you?"))
```
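The normalization and word/punctuation splitting shown above can be approximated in plain Python. The following is a sketch of the expected behavior, not bp_tokenizer's internals, and the standalone function names are chosen for this example:

```python
import re
import string

def normalize_text(text: str, lower: bool = True, remove_punct: bool = True) -> str:
    """Lowercase and strip punctuation, mirroring the utility above."""
    if lower:
        text = text.lower()
    if remove_punct:
        # Delete every ASCII punctuation character
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text

def simple_tokenize(text: str):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(normalize_text("Hello, World!!!"))            # hello world
print(simple_tokenize("Hello, world! How are you?"))
```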
## Special Tokens

These tokens are used internally to handle unknown tokens or sequence markers.
| Token | ID | Description |
|---------|-----|-----------------------|
| `<BOS>` | 256 | Beginning of Sequence |
| `<EOS>` | 257 | End of Sequence |
| `<UNK>` | 258 | Unknown Token |
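Using the IDs in the table, wrapping an encoded sequence with markers looks like the sketch below. The helper names are assumptions for illustration, not part of the documented API:

```python
BOS_ID, EOS_ID, UNK_ID = 256, 257, 258  # IDs from the table above

def add_special_tokens(token_ids):
    """Wrap a token-ID sequence with beginning/end-of-sequence markers."""
    return [BOS_ID] + list(token_ids) + [EOS_ID]

def strip_special_tokens(token_ids):
    """Remove special-token IDs, e.g. before decoding back to text."""
    specials = {BOS_ID, EOS_ID, UNK_ID}
    return [t for t in token_ids if t not in specials]

print(add_special_tokens([72, 9257, 1295, 33]))  # [256, 72, 9257, 1295, 33, 257]
```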
## Contributing

Contributions are welcome. If you want to improve bp_tokenizer, open an issue to discuss the change or submit a pull request.
## License
This project is licensed under the MIT License – see the LICENSE file for details.