llama-tokenizer-js
JavaScript tokenizer for LLaMA which works client-side in the browser (and also in Node).
Intended use case is calculating token count accurately on the client-side.
Developed by belladore.ai with contributions from xenova, blaze2004, imoneoi and ConProgramming.
Recommended way: Install as an npm package and import as ES6 module
npm install llama-tokenizer-js
import llamaTokenizer from 'llama-tokenizer-js'
console.log(llamaTokenizer.encode("Hello world!").length)
Alternative: Load as ES6 module with <script> tags in your HTML
<script type="module" src="https://belladoreai.github.io/llama-tokenizer-js/llama-tokenizer.js"></script>
Alternative: for TypeScript projects, imports should work now with the types.d.ts file, but please file an issue if I need to change something.
Alternative: for CommonJS projects, a dynamic import should work from inside an async function (CommonJS has no top-level await): const llamaTokenizer = (await import('llama-tokenizer-js')).default;
Once you have the module imported, you can encode or decode with it. Training is not supported.
When used in the browser, llama-tokenizer-js pollutes the global namespace with llamaTokenizer.
Encode:
llamaTokenizer.encode("Hello world!")
> [1, 15043, 3186, 29991]
Decode:
llamaTokenizer.decode([1, 15043, 3186, 29991])
> 'Hello world!'
Note that a special "beginning of sentence" token and a preceding space are added by default when encoding (and are correspondingly expected when decoding). Both affect the token count. In use cases where you don't want them added, you can pass additional boolean parameters. For example, to decode an individual token:
llamaTokenizer.decode([15043], false, false)
> 'Hello'
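As a rough sketch of the intended use case (counting tokens on the client), the helpers below wrap a tokenizer object passed in as a parameter. The names countTokens and fitsInContext are hypothetical, and the assumption that encode accepts the same two boolean parameters as decode is not confirmed above. The stand-in tokenizer exists only so the snippet runs without the package installed; it does not produce real LLaMA token ids.

```javascript
// Count tokens for a piece of text.
// `tokenizer` is any object exposing encode(text, addBos, addPrecedingSpace).
// ASSUMPTION: encode takes the same two booleans that decode is shown taking.
function countTokens(tokenizer, text, includeSpecial = true) {
  const ids = includeSpecial
    ? tokenizer.encode(text)
    : tokenizer.encode(text, false, false);
  return ids.length;
}

// Hypothetical helper: does the prompt fit in the context window while
// leaving `reservedForReply` tokens free for the model's answer?
function fitsInContext(tokenizer, prompt, contextSize, reservedForReply = 0) {
  return countTokens(tokenizer, prompt) + reservedForReply <= contextSize;
}

// Stand-in tokenizer for demonstration only: one id per whitespace-separated
// word, plus a leading "beginning of sentence" id of 1 by default.
// This is NOT the real LLaMA tokenizer.
const fakeTokenizer = {
  encode(text, addBos = true) {
    const words = text.split(/\s+/).filter(Boolean);
    const ids = words.map((_, i) => 100 + i);
    return addBos ? [1, ...ids] : ids;
  },
};

console.log(countTokens(fakeTokenizer, "Hello world!"));         // 3
console.log(countTokens(fakeTokenizer, "Hello world!", false));  // 2
console.log(fitsInContext(fakeTokenizer, "Hello world!", 4, 1)); // true
```

In an application, the imported llamaTokenizer would be passed in place of fakeTokenizer.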
You can run tests with:
llamaTokenizer.runTests()
The test suite is small, but it covers different edge cases very well.
Note that tests can be run both in browser and in Node (this is necessary because some parts of the code work differently in different environments).
llama-tokenizer-js is the first JavaScript tokenizer for LLaMA which works client-side in the browser. You might be wondering, what other solutions are people using to count tokens in web applications?
The tokenizer used by LLaMA is a SentencePiece Byte-Pair Encoding tokenizer.
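To illustrate the general idea behind byte-pair encoding, here is a simplified toy merge loop. It is not this library's implementation and leaves out SentencePiece specifics such as byte fallback and whitespace handling; the function name bpeEncode and the merge table are invented for the example.

```javascript
// Toy byte-pair-encoding merge loop.
// `mergeRanks` maps "left right" pairs to a priority (lower merges first).
function bpeEncode(word, mergeRanks) {
  // Start from individual characters.
  const pieces = Array.from(word);
  while (pieces.length > 1) {
    // Find the adjacent pair with the best (lowest) rank.
    let bestIndex = -1;
    let bestRank = Infinity;
    for (let i = 0; i < pieces.length - 1; i++) {
      const rank = mergeRanks.get(pieces[i] + " " + pieces[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no applicable merge left
    // Replace the pair with its concatenation.
    pieces.splice(bestIndex, 2, pieces[bestIndex] + pieces[bestIndex + 1]);
  }
  return pieces;
}

// Invented merge table; a real vocabulary has tens of thousands of entries.
const toyMerges = new Map([
  ["l l", 0],
  ["h e", 1],
  ["ll o", 2],
  ["he llo", 3],
]);

console.log(bpeEncode("hello", toyMerges)); // [ 'hello' ]
console.log(bpeEncode("help", toyMerges));  // [ 'he', 'l', 'p' ]
```

Each resulting piece would then be looked up in the vocabulary to obtain its token id.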
Note that this is a tokenizer for LLaMA models, and it's different than the tokenizers used by OpenAI models. If you need a tokenizer for OpenAI models, I recommend gpt-tokenizer.
What is this tokenizer compatible with? All LLaMA models which have been trained on top of checkpoints (model weights) released by Facebook in March 2023 ("LLaMA") and July of 2023 ("LLaMA2").
Examples of compatible models:
Incompatible LLaMA models are those which have been trained from scratch, not on top of the checkpoints released by Facebook. For example, OpenLLaMA models are incompatible.
When you see a new LLaMA model released, this tokenizer is most likely compatible with it without any modifications. If you are unsure, try it and see if the token ids are the same (compared to running the model with, for example, oobabooga webui). You can find great test input/output samples by searching for runTests inside llama-tokenizer.js.
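The compatibility check described above comes down to comparing two sequences of token ids. A minimal sketch follows; compareTokenIds is a hypothetical helper, not part of this library, and the reference ids would in practice come from the model's own tokenizer.

```javascript
// Compare two token id sequences and report the first difference, if any.
function compareTokenIds(actual, expected) {
  const shared = Math.min(actual.length, expected.length);
  for (let i = 0; i < shared; i++) {
    if (actual[i] !== expected[i]) {
      return { match: false, index: i, actual: actual[i], expected: expected[i] };
    }
  }
  if (actual.length !== expected.length) {
    // One sequence is a prefix of the other; the missing side is undefined.
    return { match: false, index: shared, actual: actual[shared], expected: expected[shared] };
  }
  return { match: true };
}

// Reference ids taken from the Encode example for "Hello world!".
const reference = [1, 15043, 3186, 29991];

console.log(compareTokenIds([1, 15043, 3186, 29991], reference).match); // true
console.log(compareTokenIds([1, 15043, 9999, 29991], reference).index); // 2
```

A mismatch at any index suggests the model uses a different vocabulary or merge data.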
If you want to modify this library to support a new LLaMA tokenizer (new as in trained from scratch, not using the same tokenizer as most LLaMA models do), you should be able to do so by swapping the vocabulary and merge data (the 2 long variables near the end of llama-tokenizer.js file). This repo has a Python script for your convenience.
You can pass custom vocab and merge data to the tokenizer by instantiating it like this:
import { LlamaTokenizer } from 'llama-tokenizer-js'
const tokenizer = new LlamaTokenizer(custom_vocab, custom_merge_data);
Release steps: