llama3-tokenizer-js - npm Package Compare versions

Comparing version 1.1.0 to 1.1.1

src/data-conversion.py


package.json
 {
   "name": "llama3-tokenizer-js",
-  "version": "1.1.0",
+  "version": "1.1.1",
   "description": "JS tokenizer for LLaMA 3",
-  "main": "llama-tokenizer.js",
+  "main": "src/llama3-tokenizer.js",
   "types": "types.d.ts",
   "scripts": {
     "test": "node test-llama-tokenizer.js"
   },
   "repository": {

@@ -11,0 +8,0 @@ "type": "git",
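Since `main` now points at `src/llama3-tokenizer.js`, consumers who import the package by its bare name are unaffected; only deep file paths change. A minimal usage sketch, assuming the package's default export (declared at the bottom of types.d.ts):

```
// Minimal sketch: the "main" entry resolves imports of the bare package name,
// so this works the same in 1.1.0 and 1.1.1. Assumes the default export
// declared in types.d.ts.
import llama3Tokenizer from 'llama3-tokenizer-js';

const ids = llama3Tokenizer.encode('Hello, world!');
console.log(ids.length); // number of tokens in the prompt
```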

README.md
@@ -11,3 +11,3 @@ # 🦙 llama3-tokenizer-js 🦙

-- Easy to use: 0 dependencies, code and data baked into a [single file](llama-tokenizer.js).
+- Easy to use: 0 dependencies, code and data baked into a [single file](src/llama3-tokenizer.js).
 - Compatible with most LLaMA 3 models (see [Compatibility](#compatibility))

@@ -71,20 +71,10 @@ - Optimized running time (highly efficient BPE implementation)

-## Tests
-You can run tests with:
-```
-llama3Tokenizer.runTests()
-```
-Note that tests can be run both in browser and in Node (this is necessary because some parts of the code work differently in different environments).
 ## Compatibility
-This tokenizer is compatible with all models which have been trained on top of checkpoints released by Facebook in April 2024 ("LLaMA 3").
+This tokenizer is mostly* compatible with all models which have been trained on top of checkpoints released by Facebook in April 2024 ("LLaMA 3").
 What this means in practice:
 - ✅ LLaMA 3 models released by Facebook: yes, they are compatible
-- ✅ New LLaMA 3 based fine tune by somebody other than Facebook: yes, it's compatible
-- ❌ New LLaMA 3 model trained from scratch by somebody other than Facebook: probably not compatible, depends if they also retrained the tokenizer
+- ✅ New LLaMA 3 based fine tune by somebody other than Facebook: yes, it's compatible (except possibly for some special tokens*)
+- ❌ New LLaMA 3 model trained from scratch by somebody other than Facebook: probably not compatible, depends if they also retrained the tokenizer (and/or if they added their own special tokens*)
 - ❌ LLaMA 1 or LLaMA 2 based models: no, not compatible (use [llama-tokenizer-js](https://github.com/belladoreai/llama-tokenizer-js) instead)
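One practical form of the compatibility check the README recommends below: encode a short sample with this library and compare the ids against a reference tokenizer. A minimal sketch, assuming the package's default export; the reference ids are placeholders to be copied from, for example, the transformers library:

```
// Sketch of a compatibility check: compare this library's token ids against
// a reference tokenizer's output for the same string.
import llama3Tokenizer from 'llama3-tokenizer-js';

const sample = 'Hello world';
// Placeholder values: paste the ids your reference tokenizer produces here.
const referenceIds = [128000, 9906, 1917];

const ids = llama3Tokenizer.encode(sample, { bos: true, eos: false });
const compatible = ids.length === referenceIds.length &&
    ids.every((id, i) => id === referenceIds[i]);
console.log(compatible ? 'token ids match' : 'token ids differ');
```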

@@ -94,6 +84,8 @@ - ❌ OpenAI models: no, not compatible

-If you are unsure about compatibility, try it and see if the token ids are the same (compared to running the model with, for example, the transformers library).
-If you want to make this library work with different tokenizer data, you may be interested in [this script](data-conversion.py) which was used to convert the data.
+_*See below section "Special tokens and fine tunes"._
+If you are unsure about compatibility, try it and see if the token ids are the same (compared to running the model with, for example, the transformers library). If you are testing a fine tune, remember to test with the relevant special tokens.
+If you want to make this library work with different tokenizer data, you may be interested in [this script](src/data-conversion.py) which was used to convert the data.
 You can pass custom vocab and merge data to the tokenizer by instantiating it like this:
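The instantiation example itself falls outside the hunk shown here; its shape follows from the constructor signature in types.d.ts (`constructor(vocab_base64?: string, merges_binary?: string)`). A hedged sketch with placeholder data:

```
// Sketch inferred from the constructor signature in types.d.ts; the variable
// names and contents are placeholders for data produced by the conversion script.
import { Llama3Tokenizer } from 'llama3-tokenizer-js';

const customVocabBase64 = '...';   // base64-encoded vocab
const customMergesBinary = '...';  // binarized merge data

const tokenizer = new Llama3Tokenizer(customVocabBase64, customMergesBinary);
console.log(tokenizer.encode('test'));
```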

@@ -108,2 +100,19 @@

+## Special tokens and fine tunes
+There is a large number of special tokens in Llama 3 (e.g. `<|end_of_text|>`). You can pass these inside text input, they will be parsed and counted correctly (try the example-demo playground if you are unsure).
+However, sometimes when people fine tune models, they change the special tokens by adding their own tokens and even shifting the ids of pre-existing special tokens. For example: [Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json). This is unfortunate for our token counting purposes. If you are using this library to count tokens, and you are using a fine tune which messes around with special tokens, you can choose one of the following approaches:
+1) If you need exact token counts, you can work around this issue by using this library to tokenize _only_ user input text (which shouldn't contain any special tokens) and then programmatically adding the relevant counts for the special tokens that you are using to wrap the input text.
+2) Alternatively, you can choose to ignore this issue, in which case you will be overcounting tokens by a little bit, which is not too bad (in typical use cases, undercounting can lead to more severe quality issues than overcounting).
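A minimal sketch of approach 1 above: count only the raw user text with the tokenizer, then add the wrapper special tokens yourself. The wrapper token strings here are the standard Llama 3 chat-template tokens, used illustratively; substitute whatever your fine tune actually uses:

```
// Approach 1 sketch: tokenize only user text (no special tokens inside it),
// then add one count per special token used to wrap the input.
import llama3Tokenizer from 'llama3-tokenizer-js';

function countPromptTokens(userText) {
    const textTokenCount = llama3Tokenizer
        .encode(userText, { bos: false, eos: false })
        .length;

    // Illustrative wrapper: one id per special token in your chat template.
    const wrapperTokenCount = [
        '<|begin_of_text|>',
        '<|start_header_id|>',
        '<|end_header_id|>',
        '<|eot_id|>',
    ].length;

    return textTokenCount + wrapperTokenCount;
}

console.log(countPromptTokens('How many tokens is this?'));
```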
+## Tests
+Some parts of the code might behave differently in node versus browser, so it is necessary to run tests in both:
+1. Node test: node test/node-test.js
+2. Browser test: run `live-server` and open test/browser-test.html
+3. TypeScript test: run `cd test/typescript-test`, bump the dependency in its package.json, run `npm i && npm test`.
 ## Repo maintenance

@@ -113,19 +122,21 @@

-1. node test-llama-tokenizer.js
-2. open test.html (with live-server or similar)
-3. do you need to update this README?
-4. bump version number in root package.json
-5. push tokenizer changes to github
-6. npm publish --dry-run
-7. npm publish
-8. bump version number in example-demo/package.json
-9. cd example-demo && npm run build && live-server
-10. push example demo changes to github
+1. run/update tests
+2. do you need to update this README?
+3. bump version number in root package.json
+4. push tokenizer changes to github
+5. npm publish --dry-run
+6. npm publish
+7. bump version number in example-demo/package.json
+8. cd example-demo && npm install && npm run build && live-server
+9. push example demo changes to github
+10. create new release on github
 ## Who did this
-LLaMA3-tokenizer-js is a fork of my earlier LLaMA 1 tokenizer [llama-tokenizer-js](https://github.com/belladoreai/llama-tokenizer-js). Several helper functions used in LLaMA 3 pretokenization were adapted from the fantastic [transformers.js](https://github.com/xenova/transformers.js) library. The BPE implementation, which is the core of this library, is original work and [was adapted into transformers.js](https://github.com/belladoreai/llama-tokenizer-js/issues/9). In other words, some work has been adapted from llama-tokenizer-js into transformers.js, and some work has been adapted the other way, from transformers.js into llama3-tokenizer-js.
+LLaMA3-tokenizer-js is a fork of my earlier LLaMA 1 tokenizer [llama-tokenizer-js](https://github.com/belladoreai/llama-tokenizer-js).
+Several helper functions used in LLaMA 3 pretokenization were adapted from the fantastic [transformers.js](https://github.com/xenova/transformers.js) library. The BPE implementation, which is the core of this library, is original work and [was adapted into transformers.js](https://github.com/belladoreai/llama-tokenizer-js/issues/9). In other words, some work has been adapted from llama-tokenizer-js into transformers.js, and some work has been adapted the other way, from transformers.js into llama3-tokenizer-js.
 The example-demo (tokenizer playground) is a fork of [gpt-tokenizer playground](https://github.com/niieani/gpt-tokenizer).
 Developed by [belladore.ai](https://belladore.ai) with contributions from [xenova](https://github.com/xenova), [blaze2004](https://github.com/blaze2004), [imoneoi](https://github.com/imoneoi) and [ConProgramming](https://github.com/ConProgramming).

types.d.ts
@@ -0,1 +1,6 @@

+type EncodeOptions = {
+    bos?: boolean,
+    eos?: boolean
+}
 export declare class Llama3Tokenizer {

@@ -6,6 +11,6 @@ vocabById: string[];

     constructor(vocab_base64?: string, merges_binary?: string);
-    encode(prompt: string, add_bos_token?: boolean, add_preceding_space?: boolean, log_performance?: boolean): number[];
-    decode(tokenIds: number[], add_bos_token?: boolean, add_preceding_space?: boolean): string;
+    encode(prompt: string, options?: EncodeOptions): number[];
+    decode(tokenIds: number[]): string;
     getSpecialTokenId(tokenString: string): number;
-    runTests(tests?: (tokenizer: LlamaTokenizer) => boolean): void
+    runTests(tests?: (tokenizer: Llama3Tokenizer) => boolean): void
 }

@@ -12,0 +17,0 @@ declare const llama3Tokenizer: Llama3Tokenizer;
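Taken together, the 1.1.1 surface replaces the positional boolean flags with an options object. A usage sketch based directly on the declarations above (assuming the default export):

```
// Usage sketch for the 1.1.1 declarations above.
import llama3Tokenizer from 'llama3-tokenizer-js';

// encode takes an options object instead of positional booleans.
const ids = llama3Tokenizer.encode('Hello world', { bos: true, eos: false });

// decode no longer takes add_bos_token / add_preceding_space flags.
const text = llama3Tokenizer.decode(ids);

// Special token ids can be looked up by their string form.
const eot = llama3Tokenizer.getSpecialTokenId('<|end_of_text|>');
console.log(ids, text, eot);
```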
