llama-tokenizer-js

llama-tokenizer-js - npm Package Compare versions

Comparing version 1.0.1 to 1.1.0


package.json
{
"name": "llama-tokenizer-js",
"version": "1.0.1",
"version": "1.1.0",
"description": "JS tokenizer for LLaMA-based LLMs",

@@ -5,0 +5,0 @@ "main": "llama-tokenizer.js",

@@ -55,2 +55,9 @@ # 🦙 llama-tokenizer-js 🦙

Special use case: decode only selected individual tokens, without including the beginning-of-prompt token and the preceding space:
```
llamaTokenizer.decode([3186], false, false)
> 'Hello'
```
## Tests

@@ -77,4 +84,8 @@

The tokenizer is the same for all LLaMA models which have been trained on top of the checkpoints (model weights) leaked by Facebook in early 2023.
The tokenizer used by LLaMA is a SentencePiece Byte-Pair Encoding tokenizer.
Note that this is a tokenizer for LLaMA models, and it's different from the tokenizers used by OpenAI models. If you need a tokenizer for OpenAI models, I recommend [gpt-tokenizer](https://www.npmjs.com/package/gpt-tokenizer).
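To illustrate what "Byte-Pair Encoding with ranked merges" means, here is a toy sketch in Python. It is not this library's implementation: the function name `bpe_merge` and the merge list are made up for illustration, and it omits details of the real tokenizer such as whitespace handling and byte fallback.

```python
def bpe_merge(symbols, merges):
    """Greedy toy BPE: repeatedly apply the highest-priority merge that is present."""
    # Position in the merge list is the priority (earlier = applied first)
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        # Collect every adjacent pair that has a known merge, with its rank and position
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank
        ]
        if not candidates:
            break
        # Apply the best-ranked merge (leftmost on ties) and repeat
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge list; "gr" + "a" -> "gra" mirrors the "gr a" style of merge entry
toy_merges = [("g", "r"), ("gr", "a"), ("s", "s"), ("gra", "ss")]
print(bpe_merge("grass", toy_merges))  # ['grass']
print(bpe_merge("grab", toy_merges))   # ['gra', 'b']
```

The second call shows that characters with no applicable merge are simply left as separate symbols.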
What is this tokenizer compatible with? All LLaMA models which have been trained on top of the checkpoints (model weights) leaked by Facebook in early 2023.
Examples of compatible models:

@@ -84,4 +95,56 @@ - wizard-vicuna-13b-uncensored-gptq

Incompatible models are those which have been trained from scratch, not on top of the checkpoints leaked by Facebook. For example, [OpenLLaMA](https://github.com/openlm-research/open_llama) models are incompatible. I'd be happy to adapt this to any LLaMA models that people need, just open an issue for it.
Incompatible LLaMA models are those which have been trained from scratch, not on top of the checkpoints leaked by Facebook. For example, [OpenLLaMA](https://github.com/openlm-research/open_llama) models are incompatible.
When you see a new LLaMA model released, this tokenizer is most likely compatible with it without any modifications. If you are unsure, try it and see if the token ids are the same (compared to running the model with, for example, oobabooga webui). You can find great test input/output samples by searching for `runTests` inside `llama-tokenizer.js`.
## Adding support for incompatible LLaMA models
If you want to modify this library to support a new LLaMA tokenizer (new as in trained from scratch, not using the same tokenizer as most LLaMA models do), you should be able to do so by swapping the vocabulary and merge data (the two long variables near the end of the `llama-tokenizer.js` file). Below is Python code that you can use for this.
```
import base64
import json
import struct

# Load the tokenizer.json file that was distributed with the LLaMA model
d = None
with open(r"tokenizer.json", 'r', encoding='utf-8') as f:
    d = json.load(f)

# Extract the vocabulary as a list of token strings
vocab = []
for token in d['model']['vocab']:
    vocab.append(token)

# Transform the vocabulary into a UTF-8 String delimited by line breaks, base64 encode it, and save to a file
with open('vocab_base64.txt', 'wb') as f:
    f.write(base64.b64encode(('\n').join(vocab).encode("utf-8")))

# Extract the merge data as a list of strings, where location in list indicates priority of merge.
# Example: one merge might be "gr a" (indicating that "gr" and "a" merge into "gra")
merges = []
for merge in d['model']['merges']:
    merges.append(merge)

# Create helper map where keys are token Strings, values are their positions in the vocab.
# Note that positions of the vocabulary do not have any special meaning in the tokenizer,
# we are merely using them to aid with compressing the data.
vocab_map = {}
for i, v in enumerate(vocab):
    vocab_map[v] = i

# Each merge can be represented with 2 integers, e.g. "merge the 5th and the 11th token in vocab".
# Since the vocabulary has fewer than 2^16 entries, each integer can be represented with 16 bits (2 bytes).
# We are going to compress the merge data into a binary format, where
# the first 4 bytes define the first merge, the next 4 bytes define the second merge, and so on.
integers = []
for merge in merges:
    f, t = merge.split(" ")
    integers.append(vocab_map[f])
    integers.append(vocab_map[t])

# Pack the integers into bytes using the 'H' format (2 bytes per integer, native byte order)
byte_array = struct.pack(f'{len(integers)}H', *integers)

# Save the byte array as base64 encoded file
with open('merges_binary.bin', 'wb') as file:
    file.write(base64.b64encode(byte_array))
```
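Before pasting the generated data into `llama-tokenizer.js`, it may be worth checking that the encoding round-trips. The following is a minimal sketch, assuming the same base64 and `'H'` packing scheme as the script above. The tiny `vocab` and `merges` lists are hypothetical stand-ins for a real `tokenizer.json`, and everything is kept in memory instead of being written to files.

```python
import base64
import struct

# Hypothetical toy data standing in for the contents of a real tokenizer.json
vocab = ["g", "r", "a", "gr", "gra"]
merges = ["g r", "gr a"]

# Encode the same way as the conversion script: newline-joined vocab, index-pair merges
vocab_b64 = base64.b64encode("\n".join(vocab).encode("utf-8"))
vocab_map = {token: i for i, token in enumerate(vocab)}
integers = []
for merge in merges:
    left, right = merge.split(" ")
    integers += [vocab_map[left], vocab_map[right]]
merges_b64 = base64.b64encode(struct.pack(f"{len(integers)}H", *integers))

# Decode: recover the vocab list, then map each pair of 16-bit indices back to token strings
decoded_vocab = base64.b64decode(vocab_b64).decode("utf-8").split("\n")
raw = base64.b64decode(merges_b64)
ids = struct.unpack(f"{len(raw) // 2}H", raw)
decoded_merges = [
    f"{decoded_vocab[ids[i]]} {decoded_vocab[ids[i + 1]]}"
    for i in range(0, len(ids), 2)
]

assert decoded_vocab == vocab
assert decoded_merges == merges
print(decoded_merges)  # ['g r', 'gr a']
```

Note that the `'H'` format without a byte-order prefix uses the machine's native byte order, so the check is only meaningful when encoding and decoding happen on the same kind of machine.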
## Credit

@@ -88,0 +151,0 @@

@@ -1,2 +0,2 @@

import llamaTokenizer from './llama_tokenizer.js'
import llamaTokenizer from './llama-tokenizer.js'

@@ -3,0 +3,0 @@ // The reason why tests are in llama_tokenizer.js is that I want to be able to run tests on browser-side too, not only in Node.

Sorry, the diff of this file is too big to display
