@dqbd/tiktoken - npm Package Compare versions

Comparing version 0.2.0 to 0.2.1

CHANGELOG.md

27

package.json
  {
  "name": "@dqbd/tiktoken",
- "version": "0.2.0",
+ "version": "0.2.1",
  "description": "Javascript bindings for tiktoken",
  "files": [
- "_tiktoken_bg.wasm",
- "_tiktoken.js",
- "_tiktoken.d.ts"
+ "dist/**/*",
+ "package.json"
  ],
- "main": "_tiktoken.js",
- "types": "_tiktoken.d.ts"
- }
+ "license": "Apache-2.0",
+ "main": "dist/node/_tiktoken.js",
+ "browser": "dist/web/_tiktoken.js",
+ "types": "dist/node/_tiktoken.d.ts",
+ "repository": {
+ "type": "git",
+ "url": "https://github.com/dqbd/tiktoken"
+ },
+ "devDependencies": {},
+ "scripts": {
+ "build": "rm -rf dist/ && npm run build:node && npm run build:bundler && npm run build:web",
+ "build:bundler": "wasm-pack build --target bundler --release --out-dir dist/bundler && rm dist/bundler/.gitignore",
+ "build:node": "wasm-pack build --target nodejs --release --out-dir dist/node && rm dist/node/.gitignore",
+ "build:web": "wasm-pack build --target no-modules --release --out-dir dist/web && rm dist/web/.gitignore"
+ }
+ }
# ⏳ tiktoken
tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
OpenAI's models.
tiktoken is a [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
OpenAI's models, forked from the original tiktoken library to provide NPM bindings for Node and other JS runtimes.
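To make the BPE idea concrete, here is a minimal Python sketch of greedy lowest-rank merging. It is a toy illustration, not the library's optimised implementation: the names `bpe_encode`, `bpe_decode`, and `TOY_RANKS` are hypothetical, and the rank table is invented for the example.

```python
from typing import Dict, List


def bpe_encode(mergeable_ranks: Dict[bytes, int], data: bytes) -> List[int]:
    """Toy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    # Start from single bytes.
    parts = [bytes([b]) for b in data]
    while True:
        best_index, best_rank = None, None
        # Find the adjacent pair whose concatenation has the lowest rank.
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no mergeable pair is left
        merged = parts[best_index] + parts[best_index + 1]
        parts = parts[:best_index] + [merged] + parts[best_index + 2:]
    # Each remaining chunk maps to its rank, which serves as the token id.
    return [mergeable_ranks[part] for part in parts]


def bpe_decode(mergeable_ranks: Dict[bytes, int], tokens: List[int]) -> bytes:
    """Invert the rank table and concatenate the chunks."""
    by_rank = {rank: chunk for chunk, rank in mergeable_ranks.items()}
    return b"".join(by_rank[token] for token in tokens)


# Hypothetical rank table, made up for this example only.
TOY_RANKS = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}

tokens = bpe_encode(TOY_RANKS, b"abcab")
assert bpe_decode(TOY_RANKS, tokens) == b"abcab"
```

With this toy table, `b"abcab"` first merges both `ab` pairs and then `ab` + `c`, leaving the two chunks `abc` and `ab`.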
```python
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
```typescript
import assert from "node:assert";
import { get_encoding, encoding_for_model } from "@dqbd/tiktoken";
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("text-davinci-003")
const enc = get_encoding("gpt2");
assert(
new TextDecoder().decode(enc.decode(enc.encode("hello world"))) ===
"hello world"
);
// To get the tokeniser corresponding to a specific model in the OpenAI API:
const modelEnc = encoding_for_model("text-davinci-003");
```
The open source version of `tiktoken` can be installed from PyPI:
```
pip install tiktoken
```
The tokeniser API is documented in `tiktoken/core.py`.
Example code using `tiktoken` can be found in the
[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
## Performance
`tiktoken` is 3-6x faster than a comparable open source tokeniser:
![image](./perf.svg)
Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
`tokenizers==0.13.2` and `transformers==4.24.0`.
## Getting help
Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
If you work at OpenAI, make sure to check the internal documentation or feel free to contact
@shantanu.
## Extending tiktoken
You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
**Create your `Encoding` object exactly the way you want and simply pass it around.**
```python
cl100k_base = tiktoken.get_encoding("cl100k_base")
# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
# If you're changing the set of special tokens, make sure to use a different name
# It should be clear from the name what behaviour to expect.
name="cl100k_im",
pat_str=cl100k_base._pat_str,
mergeable_ranks=cl100k_base._mergeable_ranks,
special_tokens={
**cl100k_base._special_tokens,
"<|im_start|>": 100264,
"<|im_end|>": 100265,
}
)
```
**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
This is only useful if you need `tiktoken.get_encoding` to find your encoding, otherwise prefer
option 1.
To do this, you'll need to create a namespace package under `tiktoken_ext`.
Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
npm install @dqbd/tiktoken
```
my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py
```
`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
This is a dictionary from an encoding name to a function that takes no arguments and returns
arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
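As a rough sketch of what such a module might contain, the following shows one zero-argument constructor registered under `ENCODING_CONSTRUCTORS`. The encoding name, the split pattern, and the byte-level rank table are all placeholders invented for illustration; the keyword names mirror the `tiktoken.Encoding(...)` call shown earlier. Passing the result to `tiktoken.Encoding` is not exercised here.

```python
# Hypothetical contents of tiktoken_ext/my_encodings.py


def my_toy_encoding():
    """Return keyword arguments for tiktoken.Encoding (toy values only)."""
    # Assumption: one rank per single byte and no merges, just to keep this small.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_toy_encoding",
        # Placeholder split pattern, not taken from any real encoding.
        "pat_str": r"\S+|\s+",
        "mergeable_ranks": mergeable_ranks,
        # Special tokens get ids outside the range used by mergeable_ranks.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to a function that takes no arguments.
ENCODING_CONSTRUCTORS = {
    "my_toy_encoding": my_toy_encoding,
}
```

Keeping the values behind a function means the rank table is only built when the encoding is actually requested.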
## Acknowledgements
Your `setup.py` should look something like this:
```python
from setuptools import setup, find_namespace_packages
setup(
name="my_tiktoken_extension",
packages=find_namespace_packages(include=['tiktoken_ext.*']),
install_requires=["tiktoken"],
...
)
```
Then simply `pip install my_tiktoken_extension` and you should be able to use your custom encodings!
Make sure **not** to use an editable install.
- https://github.com/zurawiki/tiktoken-rs