llama3-tokenizer-js
JavaScript tokenizer for LLaMA 3 and LLaMA 3.1.
The intended use case is calculating token counts accurately on the client side.
Works client-side in the browser, in Node, in TypeScript codebases, in ES6 projects, and in CommonJS projects.
Install as an npm package and import as an ES6 module:
npm install llama3-tokenizer-js
import llama3Tokenizer from 'llama3-tokenizer-js'
console.log(llama3Tokenizer.encode("Hello world!").length) // returns token count 5
It's possible to load the main bundle file with simple <script> tags:
<script type="module" src="https://belladoreai.github.io/llama3-tokenizer-js/bundle/llama3-tokenizer-with-baked-data.js"></script>
If you decide to load with script tags, be sure to either copy the file into your local build or change the GitHub URL so that the file is locked to a specific commit.
Alternative import syntax for CommonJS projects:
async function main() {
const llama3Tokenizer = await import('llama3-tokenizer-js')
console.log(llama3Tokenizer.default.encode("Hello world!").length)
}
main();
If you need to use CommonJS with the normal import syntax, you can try loading this experimental CommonJS version of the library: bundle/commonjs-llama3-tokenizer-with-baked-data.js.
Once you have the module imported, you can encode or decode with it. Training is not supported.
When used in the browser, llama3-tokenizer-js pollutes the global namespace with llama3Tokenizer.
Encode:
llama3Tokenizer.encode("Hello world!")
> [128000, 9906, 1917, 0, 128001]
Decode:
llama3Tokenizer.decode([128000, 9906, 1917, 0, 128001])
> '<|begin_of_text|>Hello world!<|end_of_text|>'
Note the special tokens at the beginning and end. These affect the token count. You can pass an options object if you don't want to add them:
llama3Tokenizer.encode("Hello world!", { bos: false, eos: false })
> [9906, 1917, 0]
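Because the special tokens change the count, it can help to make the choice explicit when checking text against a token limit. The following is a minimal sketch, not part of the library: countTokens, fitsInBudget and maxTokens are hypothetical names, and stubTokenizer is a stand-in that only mimics the encode(text, { bos, eos }) call shape shown above so the snippet runs without the package installed. In real use you would pass the imported llama3Tokenizer instead.

```javascript
// Sketch only: `tokenizer` is any object exposing encode(text, options)
// in the shape shown above. All helper names here are hypothetical.
function countTokens(tokenizer, text, { includeSpecialTokens = true } = {}) {
  const options = { bos: includeSpecialTokens, eos: includeSpecialTokens };
  return tokenizer.encode(text, options).length;
}

function fitsInBudget(tokenizer, text, maxTokens) {
  // Count with bos/eos included, since those tokens also occupy context space.
  return countTokens(tokenizer, text) <= maxTokens;
}

// Stand-in tokenizer so the sketch is runnable on its own.
// It is NOT the real algorithm: it emits one fake id per whitespace-separated word.
const stubTokenizer = {
  encode(text, { bos = true, eos = true } = {}) {
    const ids = text.split(/\s+/).filter(Boolean).map((_, i) => i + 1);
    return [...(bos ? [128000] : []), ...ids, ...(eos ? [128001] : [])];
  },
};

// The numbers printed here come from the stub, not from the real tokenizer.
console.log(countTokens(stubTokenizer, "Hello world!"));
console.log(countTokens(stubTokenizer, "Hello world!", { includeSpecialTokens: false }));
console.log(fitsInBudget(stubTokenizer, "Hello world!", 3));
```

Whether to include the special tokens in a budget check depends on whether the software that eventually runs the model adds them itself.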
Note that, unlike the LLaMA 1 tokenizer, the LLaMA 3 tokenizer does not add a preceding space (please open an issue if there are circumstances in which a preceding space is still added).
This tokenizer is mostly* compatible with all models which have been trained on top of checkpoints released by Facebook in April 2024 ("LLaMA 3").
What this means in practice:
*See below section "Special tokens and fine tunes".
If you are unsure about compatibility, try it and see if the token ids are the same (compared to running the model with, for example, the transformers library). If you are testing a fine tune, remember to test with the relevant special tokens.
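The comparison described above can be sketched as a small helper that reports the first position where two id sequences differ. This is only an illustration: firstTokenMismatch is a hypothetical name, and the "reference" array below is made up for the demo rather than taken from a real model run.

```javascript
// Returns null when the sequences match, otherwise details of the first difference.
function firstTokenMismatch(actualIds, referenceIds) {
  const sharedLength = Math.min(actualIds.length, referenceIds.length);
  for (let i = 0; i < sharedLength; i++) {
    if (actualIds[i] !== referenceIds[i]) {
      return { index: i, actual: actualIds[i], reference: referenceIds[i] };
    }
  }
  if (actualIds.length !== referenceIds.length) {
    // One sequence is a prefix of the other; the missing side is undefined.
    return {
      index: sharedLength,
      actual: actualIds[sharedLength],
      reference: referenceIds[sharedLength],
    };
  }
  return null;
}

// Ids from the encode example in this README.
const fromThisLibrary = [128000, 9906, 1917, 0, 128001];
// Hypothetical reference output that differs in the last position.
const fromReference = [128000, 9906, 1917, 0, 128009];

console.log(firstTokenMismatch(fromThisLibrary, fromThisLibrary)); // null
console.log(firstTokenMismatch(fromThisLibrary, fromReference));
```

A mismatch only at the first or last position usually points at special-token handling rather than at the vocabulary itself.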
If you want to make this library work with different tokenizer data, you may be interested in this script which was used to convert the data.
You can pass custom vocab and merge data to the tokenizer by instantiating it like this:
import { Llama3Tokenizer } from 'llama3-tokenizer-js'
const tokenizer = new Llama3Tokenizer(custom_vocab, custom_merge_data);
Please note that if you try to adapt this library to work for a different tokenizer, there are many footguns and it's easy to set up something that almost works. If the only thing that needs to change is the vocab and merge data, and they are the same size as the previous vocab and merge data, you should be fine. But if anything else needs to change in addition to the vocab and merge data, you have to read and understand the full source code and make changes where needed.
It's common with language models, including Llama 3, to denote the end of sequence (eos) with a special token. Please note that in May 2024 the eos token in the official Huggingface repo for Llama 3 instruct was changed by Huggingface staff from <|end_of_text|> to <|eot_id|>. Both of these special tokens already existed in the tokenizer; the change merely affects how these tokens are treated in commonly used software such as oobabooga. This change makes sense in the context of Llama 3 instruct, but it does not make sense in the context of the Llama 3 base model. Therefore, I have decided I will not change the eos token in this library. In any case, this discrepancy will not affect token counts. It's something you need to be aware of only if you use the generated tokens for purposes other than counting.
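If you do use the generated ids for something other than counting and need a different end-of-sequence token, one possible approach is to encode without the eos token and append your own id. This is a sketch under assumptions: encodeWithCustomEos is a hypothetical helper, it assumes the bos and eos options can be set independently, the eos id is supplied by the caller (no id for <|eot_id|> is stated in this README, so the value below is a placeholder), and stubTokenizer merely imitates the encode call shape so the snippet runs on its own.

```javascript
// Encode with the default beginning token but a caller-chosen end token.
// Assumes { bos, eos } can be toggled independently (not confirmed by the README).
function encodeWithCustomEos(tokenizer, text, eosTokenId) {
  const ids = tokenizer.encode(text, { bos: true, eos: false });
  return [...ids, eosTokenId];
}

// Stand-in tokenizer: one fake id per whitespace-separated word. Not the real algorithm.
const stubTokenizer = {
  encode(text, { bos = true, eos = true } = {}) {
    const ids = text.split(/\s+/).filter(Boolean).map((_, i) => i + 1);
    return [...(bos ? [128000] : []), ...ids, ...(eos ? [128001] : [])];
  },
};

// Placeholder value only; look up the real id for your model before using this.
const PLACEHOLDER_EOS_ID = -1;

console.log(encodeWithCustomEos(stubTokenizer, "Hello world!", PLACEHOLDER_EOS_ID));
```

The resulting array has the same length as the default encoding, which is consistent with the point above that the discrepancy does not affect token counts.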
There is a large number of special tokens in Llama 3 (e.g. <|end_of_text|>). You can pass these inside text input; they will be parsed and counted correctly (try the example-demo playground if you are unsure).
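To picture what parsing special tokens inside text input involves, here is a rough sketch that splits a string into special-token segments and plain-text segments. It is not the library's implementation: splitOnSpecialTokens and KNOWN_SPECIAL_TOKENS are hypothetical names, and the list contains only the three special tokens mentioned in this README rather than the full set.

```javascript
// Only the special tokens named in this README; the real tokenizer knows many more.
const KNOWN_SPECIAL_TOKENS = ["<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>"];

// Escape characters that have a meaning in regular expressions.
function escapeForRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitOnSpecialTokens(text, specialTokens = KNOWN_SPECIAL_TOKENS) {
  if (specialTokens.length === 0) {
    return text.length > 0 ? [{ text, isSpecial: false }] : [];
  }
  // A capturing group makes split() keep the matched special tokens in the result.
  const pattern = new RegExp(`(${specialTokens.map(escapeForRegExp).join("|")})`);
  return text
    .split(pattern)
    .filter((segment) => segment.length > 0)
    .map((segment) => ({ text: segment, isSpecial: specialTokens.includes(segment) }));
}

console.log(splitOnSpecialTokens("<|begin_of_text|>Hello world!<|end_of_text|>"));
```

In a real tokenizer each special segment would map to a single id, while the plain segments would go through the normal pretokenization and BPE steps.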
However, sometimes when people fine tune models, they change the special tokens by adding their own tokens and even shifting the ids of pre-existing special tokens. For example: Hermes-2-Pro-Llama-3-8B. This is unfortunate for our token counting purposes. If you are using this library to count tokens, and you are using a fine tune which messes around with special tokens, you can choose one of the following approaches:
Instead of calling .encode(str).length, you can call .optimisticCount(str). Optimistic count is a convenience function which parses the text with the assumption that anything that looks like a special token (e.g. <|boom|>) is actually a special token.
Tests:
node test/node-test.js
Run live-server and open test/browser-test.html
Run cd example-demo && npm install && npm run build && live-server and open the "build" folder
Note that some parts of the code might behave differently in Node compared to a browser environment.
Release steps:
cd src && node create-bundle.js
LLaMA3-tokenizer-js is a fork of my earlier LLaMA 1 tokenizer llama-tokenizer-js.
Several helper functions used in LLaMA 3 pretokenization were adapted from the fantastic transformers.js library. The BPE implementation, which is the core of this library, is original work and was adapted into transformers.js. In other words, some work has been adapted from llama-tokenizer-js into transformers.js, and some work has been adapted the other way, from transformers.js into llama3-tokenizer-js.
The example-demo (tokenizer playground) is a fork of gpt-tokenizer playground.
Developed by belladore.ai with contributions from xenova, blaze2004, imoneoi and ConProgramming.