bpeasy is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in 400 lines of Rust. The implementation largely follows the Hugging Face tokenizers library, but makes opinionated decisions to simplify tokenizer training:

- It uses the fancy-regex crate, which supports a richer set of regex features than the regex crate used in Hugging Face.
- It uses int64 types for counting, allowing training on much larger datasets without the risk of overflow.

You can think of bpeasy as the tiktoken training code that never was.
See the benchmarks section for a comparison with the Hugging Face library.
Simply install the package using pip:
pip install bpeasy
The training function is designed to be bare-bones and returns the trained tokenizer vocab as a dictionary of bytes to integers. This allows maximum flexibility in how you use the tokenizer: for example, you can then port it to tiktoken or Hugging Face tokenizers (see below).
import bpeasy

# should be an iterator over str
iterator = jsonl_content_iterator(args)
# example regex from GPT-4
regex_pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

# returns the vocab (dict[bytes, int])
vocab = bpeasy.train_bpe(
    iterator,
    regex_pattern,
    args.max_sentencepiece_length,  # max length of tokens
    args.vocab_size,  # max size of vocab
)
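The jsonl_content_iterator above only needs to yield strings. If you do not already have such a helper, a minimal sketch could look like the following (the args.path attribute and the "text" key are illustrative assumptions about your corpus, not part of bpeasy):

import json

def jsonl_content_iterator(args):
    # args.path is assumed to point at a JSONL file; each line holds one JSON record
    with open(args.path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # yield the raw text of each record as a str
            yield record["text"]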
Alternatively, you can also train using the basic tokenizer class provided:
from bpeasy.tokenizer import BPEasyTokenizer
tokenizer = BPEasyTokenizer.train(
    iterator,  # iterator over str
    vocab_size=vocab_size,
    max_token_length=max_token_length,
    regex_pattern=regex_pattern,
    special_tokens=["<s>", "<pad>", "</s>"],
    fill_to_nearest_multiple_of_eight=True,
    name="bpeasy",
)
To test your tokenizer you can use the BPEasyTokenizer class, which is a wrapper around tiktoken.Encoding that simplifies the handling of vocabularies, special tokens, and regex patterns for tokenization.
from bpeasy.tokenizer import BPEasyTokenizer
your_special_tokens = ["<s>", "<pad>", "</s>"]
tokenizer = BPEasyTokenizer(
    vocab=vocab,
    regex_pattern=regex_pattern,
    special_tokens=your_special_tokens,
    fill_to_nearest_multiple_of_eight=True,  # pad vocab to multiple of 8
    name="bpeasy",  # optional name for the tokenizer
)
test = "hello_world"
# encode and decode use the tiktoken functions
encoded = tokenizer.encode(test)
decoded = tokenizer.decode(encoded)
> "hello_world"
You can also use tiktoken directly, but you would need to handle the special tokens and regex pattern yourself:
import bpeasy
import tiktoken

vocab = bpeasy.train_bpe(...)
special_tokens = ["<s>", "<pad>", "</s>"]

# Sort the vocab by rank
sorted_vocab = sorted(list(vocab.items()), key=lambda x: x[1])

# add special tokens after the regular vocab
special_token_ranks = {}
for special_token in special_tokens:
    special_token_ranks[special_token] = len(sorted_vocab)
    sorted_vocab.append((special_token.encode("utf-8"), len(sorted_vocab)))

full_vocab = dict(sorted_vocab)

encoder = tiktoken.Encoding(
    name="bpeasy",  # any name for the encoding
    pat_str=regex_pattern,
    mergeable_ranks=full_vocab,
    special_tokens=special_token_ranks,
)
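Once the Encoding is built, you can round-trip text with the usual tiktoken calls (a short sanity-check sketch; the sample string is arbitrary):

# allowed_special="all" lets the encoder emit special tokens if they appear in the input
tokens = encoder.encode("hello world", allowed_special="all")
print(encoder.decode(tokens))  # -> "hello world"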
We provide basic utility functions to save and load the tokenizer from a json file.
tokenizer.save("path_to_file.json")
tokenizer = BPEasyTokenizer.from_file("path_to_file.json")
We also support exporting the tokenizer to the Hugging Face format, which can then be used directly with the Hugging Face transformers library.
from bpeasy.tokenizer import BPEasyTokenizer
tokenizer = BPEasyTokenizer(
...
)
tokenizer.export_to_huggingface_format("hf_tokenizer.json")
from transformers import PreTrainedTokenizerFast
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="hf_tokenizer.json")
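The exported file then behaves like any other fast tokenizer in transformers; a brief round-trip sketch using the hf_tokenizer created above (the sample string is arbitrary):

ids = hf_tokenizer.encode("hello world")
print(hf_tokenizer.decode(ids))  # -> "hello world"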
You can also export the vocab to the tiktoken txt format:

from bpeasy import save_vocab_to_tiktoken
vocab = bpeasy.train_bpe(...)
# saves the vocab to a tiktoken txt file format
save_vocab_to_tiktoken(vocab, "vocab.txt", special_tokens=["<s>", "<pad>", "</s>"])
If you want to use the tiktoken txt format, you will still need to handle the regex and special tokens yourself, as shown above.
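If you go that route, loading the saved vocab back is also up to you. A minimal sketch, assuming the file uses tiktoken's usual layout of one base64-encoded token and its integer rank per line (that layout is an assumption here, not something documented above):

import base64

def load_tiktoken_vocab(path):
    # assumes each line is "<base64 token> <rank>", as in tiktoken's published vocab files
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

mergeable_ranks = load_tiktoken_vocab("vocab.txt")
# rebuild the tiktoken.Encoding from mergeable_ranks exactly as shown earlier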
Contributions are welcome! Please open an issue if you have any suggestions or improvements.
This project is licensed under the MIT License.