@dqbd/tiktoken
Comparing version 0.2.0 to 0.2.1
 {
   "name": "@dqbd/tiktoken",
-  "version": "0.2.0",
+  "version": "0.2.1",
   "description": "Javascript bindings for tiktoken",
   "files": [
-    "_tiktoken_bg.wasm",
-    "_tiktoken.js",
-    "_tiktoken.d.ts"
+    "dist/**/*",
+    "package.json"
   ],
-  "main": "_tiktoken.js",
-  "types": "_tiktoken.d.ts"
-}
+  "license": "Apache-2.0",
+  "main": "dist/node/_tiktoken.js",
+  "browser": "dist/web/_tiktoken.js",
+  "types": "dist/node/_tiktoken.d.ts",
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/dqbd/tiktoken"
+  },
+  "devDependencies": {},
+  "scripts": {
+    "build": "rm -rf dist/ && npm run build:node && npm run build:bundler && npm run build:web",
+    "build:bundler": "wasm-pack build --target bundler --release --out-dir dist/bundler && rm dist/bundler/.gitignore",
+    "build:node": "wasm-pack build --target nodejs --release --out-dir dist/node && rm dist/node/.gitignore",
+    "build:web": "wasm-pack build --target no-modules --release --out-dir dist/web && rm dist/web/.gitignore"
+  }
+}
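The 0.2.1 manifest adds a `browser` entry next to `main`, so Node and browser-oriented tooling can be pointed at different builds under `dist/`. The sketch below is a simplified, hypothetical model of that selection: `resolveEntry`, `Target`, and `PackageManifest` are illustrative names that do not come from the package, and real resolvers apply more rules than this.

```typescript
// Hypothetical, simplified model of entry-point selection.
// Real resolvers (Node, bundlers) consult more fields and conditions.
type Target = "node" | "browser";

interface PackageManifest {
  main?: string;
  browser?: string;
  types?: string;
}

// Entry-point fields as they appear in the 0.2.1 manifest.
const manifest: PackageManifest = {
  main: "dist/node/_tiktoken.js",
  browser: "dist/web/_tiktoken.js",
  types: "dist/node/_tiktoken.d.ts",
};

function resolveEntry(pkg: PackageManifest, target: Target): string | undefined {
  // Assumption: browser-oriented tools prefer "browser" and fall back to "main".
  if (target === "browser" && pkg.browser !== undefined) {
    return pkg.browser;
  }
  return pkg.main;
}

console.log(resolveEntry(manifest, "node"));    // dist/node/_tiktoken.js
console.log(resolveEntry(manifest, "browser")); // dist/web/_tiktoken.js
```

Under this model, a package that declares only `main` (as 0.2.0 did) resolves to the same file for both targets.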
README.md
 # ⏳ tiktoken
-tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
-OpenAI's models.
+tiktoken is a [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
+OpenAI's models, forked from the original tiktoken library to provide NPM bindings for Node and other JS runtimes.
-```python
-import tiktoken
-enc = tiktoken.get_encoding("gpt2")
-assert enc.decode(enc.encode("hello world")) == "hello world"
+```typescript
+import assert from "node:assert";
+import { get_encoding, encoding_for_model } from "@dqbd/tiktoken";
-# To get the tokeniser corresponding to a specific model in the OpenAI API:
-enc = tiktoken.encoding_for_model("text-davinci-003")
+const enc = get_encoding("gpt2");
+assert(
+  new TextDecoder().decode(enc.decode(enc.encode("hello world"))) ===
+    "hello world"
+);
+// To get the tokeniser corresponding to a specific model in the OpenAI API:
+const enc = encoding_for_model("text-davinci-003");
 ```
 The open source version of `tiktoken` can be installed from PyPI:
 ```
-pip install tiktoken
-```
-The tokeniser API is documented in `tiktoken/core.py`.
-Example code using `tiktoken` can be found in the
-[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
-## Performance
-`tiktoken` is between 3-6x faster than a comparable open source tokeniser:
-![image](./perf.svg)
-Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
-`tokenizers==0.13.2` and `transformers==4.24.0`.
-## Getting help
-Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
-If you work at OpenAI, make sure to check the internal documentation or feel free to contact
-@shantanu.
-## Extending tiktoken
-You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
-**Create your `Encoding` object exactly the way you want and simply pass it around.**
-```python
-cl100k_base = tiktoken.get_encoding("cl100k_base")
-# In production, load the arguments directly instead of accessing private attributes
-# See openai_public.py for examples of arguments for specific encodings
-enc = tiktoken.Encoding(
-    # If you're changing the set of special tokens, make sure to use a different name
-    # It should be clear from the name what behaviour to expect.
-    name="cl100k_im",
-    pat_str=cl100k_base._pat_str,
-    mergeable_ranks=cl100k_base._mergeable_ranks,
-    special_tokens={
-        **cl100k_base._special_tokens,
-        "<|im_start|>": 100264,
-        "<|im_end|>": 100265,
-    }
-)
-```
-**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
-This is only useful if you need `tiktoken.get_encoding` to find your encoding, otherwise prefer
-option 1.
-To do this, you'll need to create a namespace package under `tiktoken_ext`.
-Layout your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
+npm install @dqbd/tiktoken
 ```
-my_tiktoken_extension
-├── tiktoken_ext
-│   └── my_encodings.py
-└── setup.py
-```
-`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
-This is a dictionary from an encoding name to a function that takes no arguments and returns
-arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
-`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
+## Acknowledgements
-Your `setup.py` should look something like this:
-```python
-from setuptools import setup, find_namespace_packages
-setup(
-    name="my_tiktoken_extension",
-    packages=find_namespace_packages(include=['tiktoken_ext.*'])
-    install_requires=["tiktoken"],
-    ...
-)
-```
-Then simply `pip install my_tiktoken_extension` and you should be able to use your custom encodings!
-Make sure **not** to use an editable install.
+- https://github.com/zurawiki/tiktoken-rs
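The TypeScript example added to the README wraps `enc.decode(...)` in `new TextDecoder().decode(...)`, where the Python original compared the result of `decode` to a string directly. That suggests the JS bindings hand back raw bytes rather than text. The sketch below illustrates the round trip with a hypothetical stand-in, `ByteEncoding`, that emits one token per UTF-8 byte. It is not a BPE tokeniser and not the package's implementation, and the `Uint32Array` and `Uint8Array` return types are assumptions made for illustration.

```typescript
// Hypothetical stand-in for an encoder whose decode() yields bytes.
// One token per UTF-8 byte; this is NOT how a BPE tokeniser assigns tokens.
class ByteEncoding {
  // Assumed shape: text in, numeric token ids out.
  encode(text: string): Uint32Array {
    return Uint32Array.from(new TextEncoder().encode(text));
  }

  // Assumed shape: token ids in, raw UTF-8 bytes out (not a string).
  decode(tokens: Uint32Array): Uint8Array {
    return Uint8Array.from(tokens);
  }
}

const enc = new ByteEncoding();
const tokens = enc.encode("héllo wörld");

// decode() returns bytes, so a TextDecoder turns them back into a string.
const bytes = enc.decode(tokens);
const text = new TextDecoder().decode(bytes);

console.log(tokens.length);          // 13 (é and ö take two bytes each)
console.log(text === "héllo wörld"); // true
```

Returning bytes leaves the caller to decide how to handle token sequences that end partway through a multi-byte character, which a string-returning API would have to resolve on its own.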
License Policy Violation
License: This package is not allowed per your license policy. Review the package's license to ensure compliance.
Found 1 instance in 1 package
Major refactor
Supply chain risk: Package has recently undergone a major refactor. It may be unstable or indicate significant internal changes. Use caution when updating to versions that include significant changes.
Found 1 instance in 1 package
Network access
Supply chain risk: This module accesses the network.
Found 1 instance in 1 package
Mixed license
License (Experimental): Package contains multiple licenses.
Found 1 instance in 1 package
Native code
Supply chain risk: Contains native code (e.g., compiled binaries or shared libraries). Including native code can obscure malicious behavior.
Found 1 instance in 1 package
No License Found
License (Experimental): License information could not be found.
Found 1 instance in 1 package
No repository
Supply chain risk: Package does not have a linked source code repository. Without this field, a package has no reference to the location of the source code used to generate it.
Found 1 instance in 1 package