bpeasy
Overview
bpeasy is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in 400 lines of Rust. The implementation largely follows the huggingface tokenizers library, but makes opinionated decisions to simplify tokenizer training, specifically to:
- Treat text data at the byte level first: all text is converted to bytes before training, rather than using a character-level approach (as in Huggingface).
- Always use a regex-based split pre-tokenizer. This is a customisable regex applied to the text before training; it decides where the text can be split and limits what kinds of tokens are possible. This is technically possible in Huggingface but not well documented. We also use the fancy-regex crate, which supports a richer set of regex features than the regex crate used in Huggingface (see the sketch after this list).
- Use int64 types for counting, to allow training on much larger datasets without the risk of overflow.
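To get a feel for what a split pre-tokenizer does, here is a small sketch (not part of bpeasy) using the third-party regex package, which supports the \p{...} Unicode classes these patterns rely on:

import regex  # pip install regex

# a GPT-style split pattern (the same one used in the training example below)
pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

# BPE merges are only ever learned within these chunks, never across them
print(regex.findall(pattern, "Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']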
You can think of bpeasy as the tiktoken training code that never was.
See the benchmarks section for a comparison with the Huggingface library.
Installation
Simply install the package using pip:
pip install bpeasy
Training
The training function is designed to be bare-bones and returns the trained tokenizer vocab as a dictionary of bytes to integers. This allows maximum flexibility in how you use the tokenizer: for example, you can then port the vocab to tiktoken or Huggingface tokenizers (see below).
import bpeasy

# any iterator or generator that yields str
iterator = jsonl_content_iterator(args)

# a GPT-style split pattern
regex_pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

vocab = bpeasy.train_bpe(
    iterator,
    regex_pattern,
    args.max_sentencepiece_length,  # maximum token length
    args.vocab_size,
)
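Note that jsonl_content_iterator above is not part of bpeasy: train_bpe just needs an iterator over strings. A minimal sketch of such a helper, assuming JSONL files with a "content" field (and taking a file path rather than the args object above), might look like this:

import json

def jsonl_content_iterator(path: str):
    # yield the "content" field of each line in a JSONL file
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["content"]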
Alternatively, you can also train using the basic tokenizer class provided:
from bpeasy.tokenizer import BPEasyTokenizer
tokenizer = BPEasyTokenizer.train(
    iterator,  # iterator over str
    vocab_size=vocab_size,
    max_token_length=max_token_length,
    regex_pattern=regex_pattern,
    special_tokens=["<s>", "<pad>", "</s>"],
    fill_to_nearest_multiple_of_eight=True,  # pad the vocab size to a multiple of eight
    name="bpeasy",
)
Encoding/Decoding
To test your tokenizer, you can use the BPEasyTokenizer class, a wrapper around tiktoken.Encoding that simplifies handling vocabularies, special tokens, and regex patterns for tokenization.
from bpeasy.tokenizer import BPEasyTokenizer

your_special_tokens = ["<s>", "<pad>", "</s>"]

tokenizer = BPEasyTokenizer(
    vocab=vocab,
    regex_pattern=regex_pattern,
    special_tokens=your_special_tokens,
    fill_to_nearest_multiple_of_eight=True,
    name="bpeasy",
)

test = "hello_world"
encoded = tokenizer.encode(test)
decoded = tokenizer.decode(encoded)
# "hello_world"
You can also use tiktoken directly, but you would need to handle the special tokens and regex pattern yourself:
import bpeasy
import tiktoken

vocab = bpeasy.train_bpe(...)
special_tokens = ["<s>", "<pad>", "</s>"]

# sort the vocab by rank
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])

# append the special tokens after the regular vocab
special_token_ranks = {}
for special_token in special_tokens:
    special_token_ranks[special_token] = len(sorted_vocab)
    sorted_vocab.append((special_token.encode("utf-8"), len(sorted_vocab)))

full_vocab = dict(sorted_vocab)

encoder = tiktoken.Encoding(
    name="bpeasy",
    pat_str=regex_pattern,
    mergeable_ranks=full_vocab,
    special_tokens=special_token_ranks,
)
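As a quick check of the resulting encoder (note that tiktoken refuses to encode special tokens unless they are explicitly allowed):

ids = encoder.encode("hello world")
print(encoder.decode(ids))

# special tokens must be explicitly allowed at encode time
ids = encoder.encode("<s>hello world</s>", allowed_special="all")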
Save/Load tokenizer from file
We provide basic utility functions to save and load the tokenizer from a JSON file.
tokenizer.save("path_to_file.json")
tokenizer = BPEasyTokenizer.from_file("path_to_file.json")
Export to HuggingFace format
We also support exporting the tokenizer to the HuggingFace format, which can then be used directly with the HuggingFace transformers library.
from bpeasy.tokenizer import BPEasyTokenizer
tokenizer = BPEasyTokenizer(
    ...
)
tokenizer.export_to_huggingface_format("hf_tokenizer.json")
from transformers import PreTrainedTokenizerFast
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="hf_tokenizer.json")
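A quick sanity check of the exported tokenizer (assuming transformers is installed):

print(hf_tokenizer("hello world")["input_ids"])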
Export vocab to tiktoken txt format
import bpeasy
from bpeasy import save_vocab_to_tiktoken

vocab = bpeasy.train_bpe(...)

save_vocab_to_tiktoken(vocab, "vocab.txt", special_tokens=["<s>", "<pad>", "</s>"])
If you want to use the tiktoken txt format, you will still need to handle the regex and special tokens yourself, as shown above.
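For reference, tiktoken's own loader can read the exported ranks back (a sketch, assuming the exported file follows the standard tiktoken base64 rank layout):

from tiktoken.load import load_tiktoken_bpe

mergeable_ranks = load_tiktoken_bpe("vocab.txt")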
Contributing
Contributions are welcome! Please open an issue if you have any suggestions or improvements.
License
This project is licensed under the MIT License.