llama-tokenizer-js
Comparing version 1.1.3 to 1.2.0
@@ -0,0 +0,0 @@ MIT LICENSE
{
"name": "llama-tokenizer-js",
"version": "1.1.3",
"version": "1.2.0",
"description": "JS tokenizer for LLaMA-based LLMs",
@@ -19,3 +19,3 @@ "main": "llama-tokenizer.js",
"type": "module",
"author": "Belladore <135602125+belladoreai@users.noreply.github.com> (https://belladore.ai/tools)",
"author": "Belladore <135602125+belladoreai@users.noreply.github.com> (https://belladore.ai)",
"license": "MIT",
@@ -22,0 +22,0 @@ "bugs": {
# 🦙 llama-tokenizer-js 🦙
The first JavaScript tokenizer for LLaMA which works client-side in the browser (and also in Node).
JavaScript tokenizer for LLaMA which works client-side in the browser (and also in Node).
@@ -9,5 +9,7 @@ Intended use case is calculating token count accurately on the client-side.
Developed by [belladore.ai](https://belladore.ai)
## Features
- Easy to use: 0 dependencies, code and data baked into a single file.
- Easy to use: 0 dependencies, code and data baked into a [single file](llama-tokenizer.js).
- Compatible with most LLaMA models (see [Compatibility](#compatibility))
@@ -19,3 +21,3 @@ - Optimized running time: tokenize a sentence in roughly 1ms, or 2000 tokens in roughly 20ms.
Option 1: Install as an npm package and import as ES6 module
Recommended way: Install as an npm package and import as ES6 module
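The exact commands for this route sit in the elided diff lines; as a minimal sketch (assuming the package's default export is the ready-to-use tokenizer instance, as the import shown further down suggests):
```
// After installing with: npm install llama-tokenizer-js
import llamaTokenizer from 'llama-tokenizer-js'

// The default export is a ready-made tokenizer instance
console.log(llamaTokenizer.encode("Hello world!").length)
```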
@@ -32,3 +34,3 @@ ```
Option 2: Load as ES6 module with `<script>` tags in your HTML
Alternative: Load as ES6 module with `<script>` tags in your HTML
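The exact markup is in the elided diff lines; a minimal sketch of this route, assuming `llama-tokenizer.js` is served next to your page:
```
<!-- Load the single-file tokenizer as an ES6 module -->
<script type="module">
  import llamaTokenizer from './llama-tokenizer.js'
  console.log(llamaTokenizer.encode("Hello world!").length)
</script>
```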
@@ -39,2 +41,6 @@ ```
Alternative: for TypeScript projects, imports [should](https://github.com/belladoreai/llama-tokenizer-js/issues/12#issuecomment-1790073415) work now with the `types.d.ts` file, but please file an issue if I need to change something.
Alternative: for CommonJS projects, importing with `const llamaTokenizer = await import('llama-tokenizer-js');` [should](https://github.com/belladoreai/llama-tokenizer-js/issues/10) work.
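One nuance worth a sketch here: dynamic `import()` resolves to the module namespace object, so the tokenizer instance lives on its `default` property (the helper name below is illustrative, not part of the library):
```
// Hedged sketch for CommonJS callers (see issue #10)
async function countTokens(text) {
  // import() returns the module namespace; the tokenizer is the default export
  const mod = await import('llama-tokenizer-js')
  return mod.default.encode(text).length
}
```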
## Usage
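The usage lines themselves are unchanged and elided from this diff; in outline, encoding and decoding look like this (a sketch; exact token ids depend on the vocabulary):
```
import llamaTokenizer from 'llama-tokenizer-js'

// Encode text into LLaMA token ids (the intended use case is counting them)
const tokens = llamaTokenizer.encode("Hello world!")
console.log(tokens.length)

// Decode token ids back into the original text
console.log(llamaTokenizer.decode(tokens))
```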
@@ -81,6 +87,7 @@
As mentioned, llama-tokenizer-js is the first JavaScript tokenizer for LLaMA which works client-side in the browser. You might be wondering, what are people currently using to count tokens in web applications?
llama-tokenizer-js is the first JavaScript tokenizer for LLaMA which works client-side in the browser. You might be wondering, what other solutions are people using to count tokens in web applications?
- Many web applications currently use client-side JavaScript libraries for other, _incompatible_ tokenizers. In particular, OpenAI's tokenizers are popular (see [tiktoken](https://www.npmjs.com/package/@dqbd/tiktoken) and [gpt-tokenizer](https://www.npmjs.com/package/gpt-tokenizer)). It's not entirely clear to me why people using LLaMA would want to count tokens with an OpenAI tokenizer that is not compatible with LLaMA. I guess people assume there's not much difference between tokenizers? However, in my own testing the token counts commonly differed by as much as 20% between these tokenizers, so an OpenAI tokenizer gives only a _very rough_ approximation of LLaMA token count.
- Some web applications make network calls to Python applications that run the Huggingface transformers tokenizer. For example, oobabooga's text-generation-webui exposes an API endpoint for token count. The drawback of this approach is latency: although the Python tokenizer itself is very fast, oobabooga adds a lot of overhead. In my testing, making a network call to a locally running oobabooga instance to count tokens for short strings of text took roughly 300ms (compared to ~1ms when counting tokens client-side with llama-tokenizer-js). The latency will be even higher when a real web client is making requests over the internet, and worse still if an application needs to iteratively trim down a prompt to fit within a context limit, requiring multiple network calls.
- Since releasing llama-tokenizer-js, alternative llama tokenizers have been released. One notable example is [transformers.js](https://github.com/xenova/transformers.js), which introduced a llama tokenizer by [integrating llama-tokenizer-js into transformers.js](https://github.com/belladoreai/llama-tokenizer-js/issues/9).
@@ -104,56 +111,9 @@ ## Compatibility
## Adding support for incompatible LLaMA models
If you want to modify this library to support a new LLaMA tokenizer (new as in trained from scratch, not using the same tokenizer as most LLaMA models do), you should be able to do so by swapping the vocabulary and merge data (the 2 long variables near the end of the `llama-tokenizer.js` file). This repo has [a Python script](data-conversion.py) for your convenience.
If you want to modify this library to support a new LLaMA tokenizer (new as in trained from scratch, not using the same tokenizer as most LLaMA models do), you should be able to do so by swapping the vocabulary and merge data (the 2 long variables near the end of the `llama-tokenizer.js` file). Below is Python code that you can use for this.
```
# Convert a LLaMA tokenizer.json into the vocab and merge data files
# that llama-tokenizer.js embeds.
import base64
import json
import struct

# Load the tokenizer.json file that was distributed with the LLaMA model
with open(r"tokenizer.json", 'r', encoding='utf-8') as f:
    d = json.load(f)

# Extract the vocabulary as a list of token strings
vocab = []
for token in d['model']['vocab']:
    vocab.append(token)

# Transform the vocabulary into a UTF-8 string delimited by line breaks,
# base64 encode it, and save to a file
with open('vocab_base64.txt', 'wb') as f:
    f.write(base64.b64encode(('\n').join(vocab).encode("utf-8")))

# Extract the merge data as a list of strings, where location in list indicates priority of merge.
# Example: one merge might be "gr a" (indicating that "gr" and "a" merge into "gra")
merges = []
for merge in d['model']['merges']:
    merges.append(merge)

# Create helper map where keys are token strings, values are their positions in the vocab.
# Note that positions in the vocabulary do not have any special meaning in the tokenizer,
# we are merely using them to aid with compressing the data.
vocab_map = {}
for i, v in enumerate(vocab):
    vocab_map[v] = i

# Each merge can be represented with 2 integers, e.g. "merge the 5th and the 11th token in vocab".
# Since the vocabulary has fewer than 2^16 entries, each integer can be represented with 16 bits (2 bytes).
# We are going to compress the merge data into a binary format, where
# the first 4 bytes define the first merge, the next 4 bytes define the second merge, and so on.
integers = []
for merge in merges:
    f, t = merge.split(" ")
    integers.append(vocab_map[f])
    integers.append(vocab_map[t])

# Pack the integers into bytes using the 'H' format (2 bytes per integer)
byte_array = struct.pack(f'{len(integers)}H', *integers)

# Save the byte array as a base64 encoded file
with open('merges_binary.bin', 'wb') as file:
    file.write(base64.b64encode(byte_array))
```
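Going the other way, here is a minimal sketch of how such compressed merge data could be decoded back into pairs of vocab indices in JavaScript (the helper name is illustrative, not the library's internals; it assumes the little-endian byte order that `struct.pack` produces on typical hardware):
```
// Illustrative helper: decode base64 merge data into [first, second] vocab-index pairs
function decodeMerges(mergesBase64) {
  const binary = atob(mergesBase64)            // base64 -> binary string
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i)
  }
  // Reinterpret as unsigned 16-bit integers: every 2 integers describe one merge
  const ints = new Uint16Array(bytes.buffer)
  const merges = []
  for (let i = 0; i < ints.length; i += 2) {
    merges.push([ints[i], ints[i + 1]])
  }
  return merges
}
```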
You can pass custom vocab and merge data to the tokenizer by instantiating it like this:
```
import { LlamaTokenizer } from 'llama-tokenizer-js'
const tokenizer = new LlamaTokenizer(custom_vocab, custom_merge_data);
```
## Credit
You are free to use llama-tokenizer-js for basically whatever you want (MIT license).
You are not required to give anything in exchange, but I kindly ask that you give back by linking to [https://belladore.ai/tools](https://belladore.ai/tools) in an appropriate place on your website. For example, you might link with the text "Using llama-tokenizer-js by belladore.ai" or something similar.
@@ -0,0 +0,0 @@ import llamaTokenizer from './llama-tokenizer.js'
Sorry, the diff of this file is too big to display
License Policy Violation
License: This package is not allowed per your license policy. Review the package's license to ensure compliance.
Found 1 instance in 1 package