gpt-tokenizer
gpt-tokenizer is a highly optimized byte pair encoder/decoder for all of OpenAI's models (including those used by GPT-2, GPT-3, GPT-3.5 and GPT-4). It is written in TypeScript and is fully compatible with all modern JavaScript environments.
OpenAI's GPT models utilize byte pair encoding to transform text into a sequence of integers before feeding them into the model.
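To make the idea of byte pair encoding concrete, here is a toy sketch. It is not how gpt-tokenizer is implemented: the merge table and the `toyEncode` name are invented for illustration, it works on characters rather than UTF-8 bytes, and it applies merges in simple sequential passes.

```typescript
// Toy byte pair encoding: NOT the gpt-tokenizer implementation.
// The merge table below is invented for illustration only.
type Merge = { pair: [string, string]; id: number }

// Hypothetical merge rules, in priority order (earlier rules are applied first).
const merges: Merge[] = [
  { pair: ['l', 'l'], id: 256 },
  { pair: ['h', 'e'], id: 257 },
  { pair: ['he', 'll'], id: 258 },
]

function toyEncode(text: string): number[] {
  // Start from single characters; real BPE starts from UTF-8 bytes.
  let symbols: string[] = Array.from(text)
  for (const { pair } of merges) {
    const merged: string[] = []
    let i = 0
    while (i < symbols.length) {
      const matches =
        i + 1 < symbols.length &&
        symbols[i] === pair[0] &&
        symbols[i + 1] === pair[1]
      if (matches) {
        // Fuse the adjacent pair into one longer symbol.
        merged.push(pair[0] + pair[1])
        i += 2
      } else {
        merged.push(symbols[i])
        i += 1
      }
    }
    symbols = merged
  }
  // Merged symbols map to their rule id; single characters fall back
  // to their code point.
  return symbols.map((s) => {
    const rule = merges.find((m) => m.pair[0] + m.pair[1] === s)
    return rule ? rule.id : s.codePointAt(0)!
  })
}

console.log(toyEncode('hello')) // [258, 111]
```

With these made-up rules, 'hello' collapses to the symbols 'hell' and 'o', so five characters become two integers. Real encodings such as cl100k_base use the same principle with a much larger learned merge table.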
As of 2023, it is the most feature-complete open-source GPT tokenizer on npm. It implements some unique features, such as:
- Support for all current OpenAI models (available encodings: r50k_base, p50k_base, p50k_edit and cl100k_base)
- The ability to decode an asynchronous stream of data (using decodeAsyncGenerator and decodeGenerator with any iterable input)
- An isWithinTokenLimit function to assess the token limit without encoding the entire text

This package is a port of OpenAI's tiktoken, with some additional features sprinkled on top.
Thanks to @dmitry-brazhenko's SharpToken, whose code served as a reference for the port.
Historical note: This package started off as a fork of latitudegames/GPT-3-Encoder, but version 2.0 was rewritten from scratch.
npm install gpt-tokenizer
<script src="https://unpkg.com/gpt-tokenizer"></script>
<script>
// the package is now available as a global:
const { encode, decode } = GPTTokenizer
</script>
If you wish to use a custom encoding, fetch the relevant script:
Refer to the supported models and their encodings section for more information.
You can play with the package in the browser using the Playground.
The playground mimics the official OpenAI Tokenizer.
import {
  encode,
  decode,
  isWithinTokenLimit,
  encodeGenerator,
  decodeGenerator,
  decodeAsyncGenerator,
} from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenLimit = 10

// Encode text into tokens
const tokens = encode(text)

// Decode tokens back into text
const decodedText = decode(tokens)

// Check if text is within the token limit
// returns false if the limit is exceeded, otherwise returns the actual number of tokens (truthy value)
const withinTokenLimit = isWithinTokenLimit(text, tokenLimit)

// Encode text using generator
for (const tokenChunk of encodeGenerator(text)) {
  console.log(tokenChunk)
}

// Decode tokens using generator
for (const textChunk of decodeGenerator(tokens)) {
  console.log(textChunk)
}

// Decode tokens using async generator
// (assuming `asyncTokens` is an AsyncIterableIterator<number>)
for await (const textChunk of decodeAsyncGenerator(asyncTokens)) {
  console.log(textChunk)
}
By default, importing from gpt-tokenizer uses the cl100k_base encoding, which is used by gpt-3.5-turbo and gpt-4.
To get a tokenizer for a different model, import it directly, for example:
import {
  encode,
  decode,
  isWithinTokenLimit,
} from 'gpt-tokenizer/model/text-davinci-003'
If you're dealing with a resolver that doesn't support package.json exports resolution, you might need to import from the respective cjs or esm directory, e.g.:
import {
  encode,
  decode,
  isWithinTokenLimit,
} from 'gpt-tokenizer/cjs/model/text-davinci-003'
chat:
- gpt-4 (cl100k_base)
- gpt-3.5-turbo (cl100k_base)

text:
- text-davinci-003 (p50k_base)
- text-davinci-002 (p50k_base)
- text-davinci-001 (r50k_base)
- text-curie-001 (r50k_base)
- text-babbage-001 (r50k_base)
- text-ada-001 (r50k_base)
- davinci (r50k_base)
- curie (r50k_base)
- babbage (r50k_base)
- ada (r50k_base)

code:
- code-davinci-002 (p50k_base)
- code-davinci-001 (p50k_base)
- code-cushman-002 (p50k_base)
- code-cushman-001 (p50k_base)
- davinci-codex (p50k_base)
- cushman-codex (p50k_base)

edit:
- text-davinci-edit-001 (p50k_edit)
- code-davinci-edit-001 (p50k_edit)

embeddings:
- text-embedding-ada-002 (cl100k_base)

old embeddings:
- text-similarity-davinci-001 (r50k_base)
- text-similarity-curie-001 (r50k_base)
- text-similarity-babbage-001 (r50k_base)
- text-similarity-ada-001 (r50k_base)
- text-search-davinci-doc-001 (r50k_base)
- text-search-curie-doc-001 (r50k_base)
- text-search-babbage-doc-001 (r50k_base)
- text-search-ada-doc-001 (r50k_base)
- code-search-babbage-code-001 (r50k_base)
- code-search-ada-code-001 (r50k_base)

encode(text: string): number[]
Encodes the given text into a sequence of tokens. Use this method when you need to transform a piece of text into the token format that the GPT models can process.
Example:
import { encode } from 'gpt-tokenizer'
const text = 'Hello, world!'
const tokens = encode(text)
decode(tokens: number[]): string
Decodes a sequence of tokens back into text. Use this method when you want to convert the output tokens from GPT models back into human-readable text.
Example:
import { decode } from 'gpt-tokenizer'
const tokens = [18435, 198, 23132, 328]
const text = decode(tokens)
isWithinTokenLimit(text: string, tokenLimit: number): false | number
Checks if the text is within the token limit. Returns false if the limit is exceeded, otherwise returns the number of tokens. Use this method to quickly check whether a given text is within the token limit imposed by GPT models, without encoding the entire text.
Example:
import { isWithinTokenLimit } from 'gpt-tokenizer'
const text = 'Hello, world!'
const tokenLimit = 10
const withinTokenLimit = isWithinTokenLimit(text, tokenLimit)
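The early-exit idea behind such a check can be sketched without the library. The code below is an assumption-laden illustration, not gpt-tokenizer's implementation: `fakeTokenChunks` is a hypothetical stand-in for a chunked encoder (it emits one fake token per whitespace-separated word), and `withinLimit` is a made-up name.

```typescript
// Hypothetical stand-in for a chunked encoder: yields one fake token per word.
function* fakeTokenChunks(text: string): Generator<number[], void, undefined> {
  for (const word of text.split(/\s+/).filter(Boolean)) {
    yield [word.length] // fake token id; a real encoder yields BPE token ids
  }
}

// Returns false as soon as the limit is exceeded, otherwise the token count.
function withinLimit(text: string, tokenLimit: number): false | number {
  let count = 0
  for (const chunk of fakeTokenChunks(text)) {
    count += chunk.length
    // Stop early: the remainder of the text is never tokenized.
    if (count > tokenLimit) return false
  }
  return count
}

console.log(withinLimit('one two three', 2)) // false
console.log(withinLimit('one two three', 3)) // 3
```

Because the return type is false | number, it is safer to compare the result against false explicitly than to rely on truthiness. In this sketch an empty text returns 0, which is falsy even though it is within the limit.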
encodeGenerator(text: string): Generator<number[], void, undefined>
Encodes the given text using a generator, yielding chunks of tokens. Use this method when you want to encode text in chunks, which can be useful for processing large texts or streaming data.
Example:
import { encodeGenerator } from 'gpt-tokenizer'
const text = 'Hello, world!'
const tokens = []
for (const tokenChunk of encodeGenerator(text)) {
  tokens.push(...tokenChunk)
}
decodeGenerator(tokens: Iterable<number>): Generator<string, void, undefined>
Decodes a sequence of tokens using a generator, yielding chunks of decoded text. Use this method when you want to decode tokens in chunks, which can be useful for processing large outputs or streaming data.
Example:
import { decodeGenerator } from 'gpt-tokenizer'
const tokens = [18435, 198, 23132, 328]
let decodedText = ''
for (const textChunk of decodeGenerator(tokens)) {
  decodedText += textChunk
}
decodeAsyncGenerator(tokens: AsyncIterable<number>): AsyncGenerator<string, void, undefined>
Decodes a sequence of tokens asynchronously using a generator, yielding chunks of decoded text. Use this method when you want to decode tokens in chunks asynchronously, which can be useful for processing large outputs or streaming data in an asynchronous context.
Example:
import { decodeAsyncGenerator } from 'gpt-tokenizer'
async function processTokens(asyncTokensIterator) {
  let decodedText = ''
  for await (const textChunk of decodeAsyncGenerator(asyncTokensIterator)) {
    decodedText += textChunk
  }
  // return the accumulated text so the caller can use it
  return decodedText
}
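The example above leaves open where the async token source comes from. As a minimal sketch, an async generator is one way to produce an AsyncIterable<number>. Both `tokenStream` and `collect` below are hypothetical helpers written for this illustration and are not part of gpt-tokenizer.

```typescript
// Hypothetical token source: yields token ids one at a time, giving up
// control between items to mimic tokens arriving from a stream.
async function* tokenStream(
  ids: number[],
): AsyncGenerator<number, void, undefined> {
  for (const id of ids) {
    await Promise.resolve()
    yield id
  }
}

// Generic consumer: drains any AsyncIterable into an array.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
  const items: T[] = []
  for await (const item of source) {
    items.push(item)
  }
  return items
}
```

A value such as tokenStream([18435, 198, 23132, 328]) satisfies the AsyncIterable<number> parameter type shown in the signature above, so it could be passed in place of asyncTokensIterator.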
There are a few special tokens that are used by the GPT models. Not all models support all of these tokens.
gpt-tokenizer allows you to specify custom sets of allowed special tokens when encoding text. To do this, pass a Set containing the allowed special tokens as a parameter to the encode function:
import {
  EndOfPrompt,
  EndOfText,
  FimMiddle,
  FimPrefix,
  FimSuffix,
  encode,
} from 'gpt-tokenizer'
const inputText = `Some Text ${EndOfPrompt}`
const allowedSpecialTokens = new Set([EndOfPrompt])
const encoded = encode(inputText, allowedSpecialTokens)
const expectedEncoded = [8538, 2991, 220, 100276]
// toEqual compares array contents; toBe would compare references and fail here
expect(encoded).toEqual(expectedEncoded)
Similarly, you can specify custom sets of disallowed special tokens when encoding text. Pass a Set containing the disallowed special tokens as a parameter to the encode function:
import { encode } from 'gpt-tokenizer'
const inputText = `Some Text`
const disallowedSpecial = new Set(['Some'])
// throws an error:
const encoded = encode(inputText, undefined, disallowedSpecial)
In this example, an Error is thrown, because the input text contains a disallowed special token.
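The behavior described above can be pictured as a guard that runs before tokenization. This is only a sketch under that assumption: `assertNoDisallowedTokens` and its error message are hypothetical and do not reflect the library's internals.

```typescript
// Hypothetical helper, not part of gpt-tokenizer: rejects text that contains
// any string from the disallowed set before tokenization would begin.
function assertNoDisallowedTokens(text: string, disallowed: Set<string>): void {
  for (const token of disallowed) {
    if (text.includes(token)) {
      throw new Error(`Disallowed special token found: ${token}`)
    }
  }
}

try {
  assertNoDisallowedTokens('Some Text', new Set(['Some']))
} catch (error) {
  // Reached because 'Some Text' contains the disallowed string 'Some'.
  console.log((error as Error).message)
}
```

Failing loudly at this point keeps text that happens to contain a special token string from being silently encoded as a control token.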
gpt-tokenizer includes a set of test cases in the TestPlans.txt file to ensure its compatibility with OpenAI's Python tiktoken library. These test cases validate the functionality and behavior of gpt-tokenizer, providing a reliable reference for developers.
Running the unit tests and verifying the test cases helps maintain consistency between the library and the original Python implementation.
MIT
Contributions are welcome! Please open a pull request or an issue to discuss your bug reports, or use the discussions feature for ideas or any other inquiries.
Hope you find gpt-tokenizer useful in your projects!
FAQs
A pure JavaScript implementation of a BPE tokenizer (Encoder/Decoder) for GPT-2 / GPT-3 / GPT-4 and other OpenAI models
The npm package gpt-tokenizer receives a total of 100,400 weekly downloads. As such, gpt-tokenizer is classified as popular.
We found that gpt-tokenizer demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 0 open source maintainers collaborating on the project.