tiktoken is a fast BPE tokeniser for use with OpenAI's models.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")
The open source version of tiktoken can be installed from PyPI:
pip install tiktoken
The tokeniser API is documented in tiktoken/core.py.
Example code using tiktoken can be found in the OpenAI Cookbook.
tiktoken is between 3x and 6x faster than a comparable open source tokeniser:
Performance measured on 1GB of text using the GPT-2 tokeniser, using GPT2TokenizerFast from tokenizers==0.13.2, transformers==4.24.0 and tiktoken==0.2.0.
Please post questions in the issue tracker.
If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.
Language models don't see text like you and I do; instead, they see a sequence of numbers (known as tokens). Byte pair encoding (BPE) is a way of converting text into tokens. It has a couple of desirable properties:
tiktoken contains an educational submodule that is friendlier if you want to learn more about the details of BPE, including code that helps visualise the BPE procedure:
from tiktoken._educational import *
# Train a BPE tokeniser on a small amount of text
enc = train_simple_encoding()
# Visualise how the GPT-4 encoder encodes text
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
enc.encode("hello world aaaaaaaaaaaa")
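To make the BPE procedure concrete without relying on the educational submodule, here is a minimal pure-Python sketch of byte-level BPE. It is not tiktoken's implementation: the function names (train_bpe, merge_pair, encode, decode) are made up for illustration, and it skips the regex pre-splitting step that real encodings apply before merging. It only shows the core idea of repeatedly merging the most frequent adjacent pair, and that the result can be decoded back to the original text.

```python
from collections import Counter


def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = {}  # (left_id, right_id) -> new token id, in learned order
    for step in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        merges[best] = new_id
        tokens = merge_pair(tokens, best, new_id)
    return merges


def encode(text, merges):
    """Convert text to token ids by applying merges in the order learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        tokens = merge_pair(tokens, pair, new_id)
    return tokens


def decode(tokens, merges):
    """Convert token ids back to text by expanding each id to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


text = "hello world aaaaaaaaaaaa"
merges = train_bpe(text, num_merges=5)
ids = encode(text, merges)
assert decode(ids, merges) == text  # the round trip is lossless
```

Because every byte value keeps its own token id, text that never appeared in the training data can still be encoded; it simply benefits from fewer merges.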
You may wish to extend tiktoken to support new encodings. There are two ways to do this.
1. Create your Encoding object exactly the way you want and simply pass it around.
cl100k_base = tiktoken.get_encoding("cl100k_base")
# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)
2. Use the tiktoken_ext plugin mechanism to register your Encoding objects with tiktoken.
This is only useful if you need tiktoken.get_encoding to find your encoding, otherwise prefer option 1.
To do this, you'll need to create a namespace package under tiktoken_ext.
Lay out your project like this, making sure to omit the tiktoken_ext/__init__.py file:
my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py
my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS. This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken.Encoding to construct that encoding. For an example, see tiktoken_ext/openai_public.py. For precise details, see tiktoken/registry.py.
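As a rough illustration of that description, a my_encodings.py could look something like the sketch below. The encoding name my_byte_level, the toy byte-only vocabulary, and the choice of split pattern are all assumptions made for this example; a real extension would load its own merge ranks. The keys of the returned dictionary mirror the keyword arguments passed to tiktoken.Encoding in option 1.

```python
# tiktoken_ext/my_encodings.py (hypothetical contents, for illustration only)

# A GPT-2 style split pattern, used here purely as an example value.
# Any pattern string that tiktoken accepts could be used instead.
PAT_STR = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""


def my_byte_level():
    """Return keyword arguments for constructing a toy encoding.

    The vocabulary is just the 256 single bytes with no learned merges,
    which keeps the example self-contained.
    """
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_byte_level",
        "pat_str": PAT_STR,
        "mergeable_ranks": mergeable_ranks,
        # Special token ids must not reuse an id from mergeable_ranks.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to a zero-argument function returning its arguments.
ENCODING_CONSTRUCTORS = {
    "my_byte_level": my_byte_level,
}
```

Keeping the constructor lazy (a function rather than a prebuilt dictionary) means any expensive loading happens only when the encoding is actually requested.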
Your setup.py should look something like this:
from setuptools import setup, find_namespace_packages
setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_ext*']),
    install_requires=["tiktoken"],
    ...
)
Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.
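With either approach, adding special tokens means choosing ids that are not already taken. The helper below is a small sanity check you might run before constructing an Encoding. It is a hypothetical utility, not part of tiktoken, and the values in the usage lines are toy values rather than those of any real encoding.

```python
def add_special_tokens(mergeable_ranks, base_special, extra_special):
    """Return base_special plus extra_special, rejecting reused names or ids.

    mergeable_ranks: dict mapping token bytes to ids (ordinary tokens)
    base_special:    dict mapping existing special token strings to ids
    extra_special:   dict mapping new special token strings to ids
    """
    used_ids = set(mergeable_ranks.values()) | set(base_special.values())
    for name, token_id in extra_special.items():
        if name in base_special:
            raise ValueError(f"special token {name!r} is already defined")
        if token_id in used_ids:
            raise ValueError(f"token id {token_id} for {name!r} is already in use")
        used_ids.add(token_id)
    return {**base_special, **extra_special}


# Toy vocabulary: 256 single-byte tokens plus one existing special token.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_special = {"<|endoftext|>": 256}
special_tokens = add_special_tokens(
    toy_ranks, toy_special, {"<|im_start|>": 257, "<|im_end|>": 258}
)
```

The resulting dictionary can then be passed as the special_tokens argument, under a new encoding name as the comments in option 1 recommend.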
FAQs
tiktoken is a fast BPE tokeniser for use with OpenAI's models
We found that tiktoken demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.