⏳ tiktoken-async
tiktoken is a fast BPE tokeniser for use with
OpenAI's models.
import asyncio
import tiktoken_async
enc = asyncio.run(tiktoken_async.get_encoding("cl100k_base"))
assert enc.decode(enc.encode("hello world")) == "hello world"
enc = asyncio.run(tiktoken_async.encoding_for_model("gpt-4"))
The open source version of tiktoken-async can be installed from PyPI:
pip install tiktoken-async
The tokeniser API is documented in tiktoken_async/core.py.
Example code using tiktoken can be found in the OpenAI Cookbook.
Performance
tiktoken is between 3 and 6 times faster than a comparable open source tokeniser.
Performance was measured on 1GB of text using the GPT-2 tokeniser, using GPT2TokenizerFast from tokenizers==0.13.2, transformers==4.24.0 and tiktoken==0.2.0.
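A throughput comparison of this kind can be sketched with the standard library alone. This is a minimal sketch and not the benchmark behind the numbers above: `measure_throughput` and `whitespace_tokenise` are hypothetical names, and the whitespace splitter is only a stand-in for whichever tokeniser's encode function you want to time.

```python
import time


def measure_throughput(encode, text, repeats=3):
    """Return the best observed throughput of `encode` in bytes per second."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return n_bytes / max(best, 1e-9)


def whitespace_tokenise(text):
    """Stand-in tokeniser (assumption): replace with a real encode function."""
    return text.split()


sample = "hello world " * 100_000
throughput = measure_throughput(whitespace_tokenise, sample)
print(f"{throughput / 1e6:.1f} MB/s")
```

Taking the best of several runs reduces noise from other processes. A fair comparison would pass the same text to each tokeniser being measured.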
Getting help
Please post questions in the issue tracker.
If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.
Extending tiktoken
You may wish to extend tiktoken-async to support new encodings. There are two ways to do this.
1. Create your Encoding object exactly the way you want and simply pass it around.
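import asyncio
import tiktoken_async

# Load the base encoding so its pattern, ranks and special tokens can be reused.
cl100k_base = asyncio.run(tiktoken_async.get_encoding("cl100k_base"))

# Build a new encoding that adds two special tokens on top of cl100k_base.
enc = tiktoken_async.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)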
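When adding special tokens this way, it can help to check that the new ids do not collide with ids already in use. The following is a minimal sketch in plain Python: `extend_special_tokens` is a hypothetical helper, not part of tiktoken_async, and the base values are stand-ins rather than data read from a real encoding. It also assumes that mergeable ranks occupy the ids from 0 up to `n_mergeable - 1`.

```python
def extend_special_tokens(base, extra, n_mergeable):
    """Merge special-token maps, rejecting duplicate names or ids.

    Hypothetical helper. Assumes mergeable ranks use ids 0..n_mergeable-1.
    """
    merged = dict(base)
    used_ids = set(base.values())
    for token, token_id in extra.items():
        if token in merged:
            raise ValueError(f"special token {token!r} is already defined")
        if token_id in used_ids or token_id < n_mergeable:
            raise ValueError(f"token id {token_id} is already in use")
        merged[token] = token_id
        used_ids.add(token_id)
    return merged


# Stand-in values for illustration only.
base_special = {"<|endoftext|>": 100257}
special_tokens = extend_special_tokens(
    base_special,
    {"<|im_start|>": 100264, "<|im_end|>": 100265},
    n_mergeable=100256,
)
print(special_tokens)
```

The resulting dictionary could then be passed as the special_tokens argument in place of the inline dictionary literal.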
2. Use the tiktoken_async_ext plugin mechanism to register your Encoding objects with tiktoken_async.
This is only useful if you need tiktoken_async.get_encoding to find your encoding; otherwise, prefer option 1.
To do this, you'll need to create a namespace package under tiktoken_async_ext.
Lay out your project like this, making sure to omit the tiktoken_async_ext/__init__.py file:
my_tiktoken_extension
├── tiktoken_async_ext
│   └── my_encodings.py
└── setup.py
my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS.
This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken_async.Encoding to construct that encoding. For an example, see tiktoken_async_ext/openai_public.py. For precise details, see tiktoken_async/registry.py.
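As a rough illustration of that shape, a my_encodings.py module might look like the sketch below. The encoding name, the toy vocabulary and the split pattern are all assumptions made up for this example. The keyword names mirror the Encoding call shown under option 1, and the sketch assumes the returned dictionary is used as keyword arguments.

```python
# Hypothetical contents of tiktoken_async_ext/my_encodings.py.
# The toy vocabulary and pattern below are assumptions for illustration only.


def my_toy_encoding():
    """Return arguments intended for tiktoken_async.Encoding."""
    # Give every single byte a rank so that any input can be tokenised.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    # One example merge on top of the single-byte vocabulary.
    mergeable_ranks[b"he"] = 256
    return {
        "name": "my_toy_encoding",
        "pat_str": r"\S+|\s+",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 257},
    }


# Maps each encoding name to a zero-argument constructor function.
ENCODING_CONSTRUCTORS = {
    "my_toy_encoding": my_toy_encoding,
}
```

Keeping the constructor as a function rather than a prebuilt dictionary means the vocabulary is only built when the encoding is actually requested.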
Your setup.py should look something like this:
from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_async_ext*']),
    install_requires=["tiktoken_async"],
    ...
)
Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.