llama3-tokenizer-js
Comparing version 1.1.0 to 1.1.1
{ | ||
"name": "llama3-tokenizer-js", | ||
"version": "1.1.0", | ||
"version": "1.1.1", | ||
"description": "JS tokenizer for LLaMA 3", | ||
"main": "llama-tokenizer.js", | ||
"main": "src/llama3-tokenizer.js", | ||
"types": "types.d.ts", | ||
"scripts": { | ||
"test": "node test-llama-tokenizer.js" | ||
}, | ||
"repository": { | ||
@@ -11,0 +8,0 @@ "type": "git", |
@@ -11,3 +11,3 @@ # 🦙 llama3-tokenizer-js 🦙 | ||
- Easy to use: 0 dependencies, code and data baked into a [single file](llama-tokenizer.js). | ||
- Easy to use: 0 dependencies, code and data baked into a [single file](src/llama3-tokenizer.js). | ||
- Compatible with most LLaMA 3 models (see [Compatibility](#compatibility)) | ||
@@ -71,20 +71,10 @@ - Optimized running time (highly efficient BPE implementation) | ||
## Tests | ||
You can run tests with: | ||
``` | ||
llama3Tokenizer.runTests() | ||
``` | ||
Note that tests can be run both in browser and in Node (this is necessary because some parts of the code work differently in different environments). | ||
## Compatibility | ||
This tokenizer is compatible with all models which have been trained on top of checkpoints released by Facebook in April 2024 ("LLaMA 3"). | ||
This tokenizer is mostly* compatible with all models which have been trained on top of checkpoints released by Facebook in April 2024 ("LLaMA 3"). | ||
What this means in practice: | ||
- ✅ LLaMA 3 models released by Facebook: yes, they are compatible | ||
- ✅ New LLaMA 3 based fine tune by somebody other than Facebook: yes, it's compatible | ||
- ❌ New LLaMA 3 model trained from scratch by somebody other than Facebook: probably not compatible, depends if they also retrained the tokenizer | ||
- ✅ New LLaMA 3 based fine tune by somebody other than Facebook: yes, it's compatible (except possibly for some special tokens*) | ||
- ❌ New LLaMA 3 model trained from scratch by somebody other than Facebook: probably not compatible, depends if they also retrained the tokenizer (and/or if they added their own special tokens*) | ||
- ❌ LLaMA 1 or LLaMA 2 based models: no, not compatible (use [llama-tokenizer-js](https://github.com/belladoreai/llama-tokenizer-js) instead) | ||
@@ -94,6 +84,8 @@ - ❌ OpenAI models: no, not compatible | ||
If you are unsure about compatibility, try it and see if the token ids are the same (compared to running the model with, for example, the transformers library). | ||
_*See below section "Special tokens and fine tunes"._ | ||
If you want to make this library work with different tokenizer data, you may be interested in [this script](data-conversion.py) which was used to convert the data. | ||
If you are unsure about compatibility, try it and see if the token ids are the same (compared to running the model with, for example, the transformers library). If you are testing a fine tune, remember to test with the relevant special tokens. | ||
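The compatibility check described above can be sketched as a small comparison helper. This is a minimal sketch, not part of the library: `firstTokenMismatch` is a hypothetical name, and `referenceIds` is assumed to be an array of token ids that you obtained separately (for example from the transformers library) for the same input text.

```javascript
// Hypothetical helper (not part of llama3-tokenizer-js).
// Returns the index of the first position where the two token id arrays
// differ, or -1 if they are identical.
function firstTokenMismatch(actualIds, referenceIds) {
  const shared = Math.min(actualIds.length, referenceIds.length);
  for (let i = 0; i < shared; i++) {
    if (actualIds[i] !== referenceIds[i]) {
      return i;
    }
  }
  // Same prefix: the arrays match only if they also have the same length.
  return actualIds.length === referenceIds.length ? -1 : shared;
}

// Usage sketch (assumes `llama3Tokenizer` is imported and `referenceIds`
// was produced by another tokenizer implementation for the same text):
//   const mismatch = firstTokenMismatch(llama3Tokenizer.encode(text), referenceIds);
//   if (mismatch !== -1) console.log("Token ids diverge at index", mismatch);
```

When testing a fine tune, include its special tokens in the test text, since that is where the ids are most likely to diverge.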
If you want to make this library work with different tokenizer data, you may be interested in [this script](src/data-conversion.py) which was used to convert the data. | ||
You can pass custom vocab and merge data to the tokenizer by instantiating it like this: | ||
@@ -108,2 +100,19 @@ | ||
## Special tokens and fine tunes | ||
There is a large number of special tokens in Llama 3 (e.g. `<|end_of_text|>`). You can pass these inside text input; they will be parsed and counted correctly (try the example-demo playground if you are unsure). | ||
However, sometimes when people fine tune models, they change the special tokens by adding their own tokens and even shifting the ids of pre-existing special tokens. For example: [Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json). This is unfortunate for our token counting purposes. If you are using this library to count tokens, and you are using a fine tune which messes around with special tokens, you can choose one of the following approaches: | ||
1) If you need exact token counts, you can work around this issue by using this library to tokenize _only_ user input text (which shouldn't contain any special tokens) and then programmatically adding the relevant counts for the special tokens that you are using to wrap the input text. | ||
2) Alternatively, you can choose to ignore this issue, in which case you will be overcounting tokens by a little bit, which is not too bad (in typical use cases, undercounting can lead to more severe quality issues than overcounting). | ||
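Approach 1 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's own code: `countPromptTokens` and `wrapperTokenCount` are hypothetical names, and the number of special tokens that wrap the input depends on the prompt template of your fine tune, so you have to determine it yourself. The `bos` and `eos` option names come from the `EncodeOptions` type in version 1.1.1.

```javascript
// Hypothetical helper (not part of llama3-tokenizer-js).
// `tokenizer` is any object with an encode(text, options) method,
// for example the llama3Tokenizer instance exported by the package.
// `wrapperTokenCount` is an assumption: the number of special tokens your
// own prompt template adds around the user text.
function countPromptTokens(tokenizer, userText, wrapperTokenCount) {
  // Disable bos/eos so that only the tokens of the text itself are counted;
  // the special tokens are accounted for by wrapperTokenCount instead.
  const textTokens = tokenizer.encode(userText, { bos: false, eos: false });
  return textTokens.length + wrapperTokenCount;
}

// Usage sketch (the value 4 is only a placeholder):
//   const total = countPromptTokens(llama3Tokenizer, "Hello world", 4);
```

This only gives exact counts if the user text itself contains no special token strings, as noted in approach 1.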
## Tests | ||
Some parts of the code might behave differently in node versus browser, so it is necessary to run tests in both: | ||
1. Node test: node test/node-test.js | ||
2. Browser test: run `live-server` and open test/browser-test.html | ||
3. TypeScript test: run `cd test/typescript-test`, bump the dependency in its package.json, run `npm i && npm test`. | ||
## Repo maintenance | ||
@@ -113,19 +122,21 @@ | ||
1. node test-llama-tokenizer.js | ||
2. open test.html (with live-server or similar) | ||
3. do you need to update this README? | ||
4. bump version number in root package.json | ||
5. push tokenizer changes to github | ||
6. npm publish --dry-run | ||
7. npm publish | ||
8. bump version number in example-demo/package.json | ||
9. cd example-demo && npm run build && live-server | ||
10. push example demo changes to github | ||
1. run/update tests | ||
2. do you need to update this README? | ||
3. bump version number in root package.json | ||
4. push tokenizer changes to github | ||
5. npm publish --dry-run | ||
6. npm publish | ||
7. bump version number in example-demo/package.json | ||
8. cd example-demo && npm install && npm run build && live-server | ||
9. push example demo changes to github | ||
10. create new release on github | ||
## Who did this | ||
LLaMA3-tokenizer-js is a fork of my earlier LLaMA 1 tokenizer [llama-tokenizer-js](https://github.com/belladoreai/llama-tokenizer-js). Several helper functions used in LLaMA 3 pretokenization were adapted from the fantastic [transformers.js](https://github.com/xenova/transformers.js) library. The BPE implementation, which is the core of this library, is original work and [was adapted into transformers.js](https://github.com/belladoreai/llama-tokenizer-js/issues/9). In other words, some work has been adapted from llama-tokenizer-js into transformers.js, and some work has been adapted the other way, from transformers.js into llama3-tokenizer-js. | ||
LLaMA3-tokenizer-js is a fork of my earlier LLaMA 1 tokenizer [llama-tokenizer-js](https://github.com/belladoreai/llama-tokenizer-js). | ||
Several helper functions used in LLaMA 3 pretokenization were adapted from the fantastic [transformers.js](https://github.com/xenova/transformers.js) library. The BPE implementation, which is the core of this library, is original work and [was adapted into transformers.js](https://github.com/belladoreai/llama-tokenizer-js/issues/9). In other words, some work has been adapted from llama-tokenizer-js into transformers.js, and some work has been adapted the other way, from transformers.js into llama3-tokenizer-js. | ||
The example-demo (tokenizer playground) is a fork of [gpt-tokenizer playground](https://github.com/niieani/gpt-tokenizer). | ||
Developed by [belladore.ai](https://belladore.ai) with contributions from [xenova](https://github.com/xenova), [blaze2004](https://github.com/blaze2004), [imoneoi](https://github.com/imoneoi) and [ConProgramming](https://github.com/ConProgramming). |
@@ -0,1 +1,6 @@ | ||
type EncodeOptions = { | ||
bos?: boolean, | ||
eos?: boolean | ||
} | ||
export declare class Llama3Tokenizer { | ||
@@ -6,6 +11,6 @@ vocabById: string[]; | ||
constructor(vocab_base64?: string, merges_binary?: string); | ||
encode(prompt: string, add_bos_token?: boolean, add_preceding_space?: boolean, log_performance?: boolean): number[]; | ||
decode(tokenIds: number[], add_bos_token?: boolean, add_preceding_space?: boolean): string; | ||
encode(prompt: string, options?: EncodeOptions): number[]; | ||
decode(tokenIds: number[]): string; | ||
getSpecialTokenId(tokenString: string): number; | ||
runTests(tests?: (tokenizer: LlamaTokenizer) => boolean): void | ||
runTests(tests?: (tokenizer: Llama3Tokenizer) => boolean): void | ||
} | ||
@@ -12,0 +17,0 @@ declare const llama3Tokenizer: Llama3Tokenizer; |
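The signature change above means that callers which passed positional booleans to `encode` need to switch to an options object. A minimal migration sketch follows. `encodeWithLegacyArgs` is a hypothetical name, and the mapping of the old `add_bos_token` argument onto the new `bos` option is an assumption based on the names alone. The old `add_preceding_space` and `log_performance` arguments have no counterpart in the new `EncodeOptions` type, so the sketch drops them.

```javascript
// Hypothetical shim (not part of llama3-tokenizer-js).
// Accepts the old-style positional `addBosToken` flag and forwards it as the
// new-style options object. ASSUMPTION: old add_bos_token maps to new `bos`.
function encodeWithLegacyArgs(tokenizer, prompt, addBosToken) {
  const options = {};
  if (typeof addBosToken === "boolean") {
    options.bos = addBosToken;
  }
  // When the flag is omitted, the options object stays empty and the
  // tokenizer's own defaults apply.
  return tokenizer.encode(prompt, options);
}

// Usage sketch:
//   const ids = encodeWithLegacyArgs(llama3Tokenizer, "Hello world", false);
```

In new code it is simpler to call `encode(prompt, { bos, eos })` directly; a shim like this is only useful while migrating existing call sites.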
Major refactor
Supply chain risk: Package has recently undergone a major refactor. It may be unstable or indicate significant internal changes. Use caution when updating to versions that include significant changes.
Found 1 instance in 1 package