rwkv-tokenizer-node

RWKV / gpt-NeoX / Pythia, 0-dep tokenizer library, for nodejs

Version 1.0.5 (latest on npm), 1 maintainer

Native Node.js tokenizer for RWKV

A zero-dependency tokenizer for the RWKV project.

It should also work for EleutherAI's GPT-NeoX and Pythia models, as they use the same tokenizer.

Setup

npm i rwkv-tokenizer-node

Usage

// The module name matches the lowercase package name used in "npm i" above
const tokenizer = require("rwkv-tokenizer-node");

// Encode into an array of integer token IDs: [12092, 3645, 2]
const tokens = tokenizer.encode("Hello World!");

// Decode back to "Hello World!"
const decoded = tokenizer.decode(tokens);
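
A quick sanity check on top of the two calls above is a round trip: decoding the encoded tokens should give back the original text. The sketch below is not part of the package. `roundTrips` is a hypothetical helper that only assumes an object exposing `encode(text)` and `decode(tokens)` as in the usage above, and `byteTokenizer` is a stand-in (one token per UTF-8 byte) so the snippet runs without installing anything.

```javascript
// Hypothetical helper (not part of rwkv-tokenizer-node): checks that
// decode(encode(text)) returns the original text for a given tokenizer.
function roundTrips(tokenizer, text) {
  const tokens = tokenizer.encode(text);
  return tokenizer.decode(tokens) === text;
}

// Stand-in tokenizer for illustration only: one token per UTF-8 byte.
// It is NOT the RWKV vocabulary; it just has the same encode/decode shape.
const byteTokenizer = {
  encode: (text) => Array.from(Buffer.from(text, "utf8")),
  decode: (tokens) => Buffer.from(tokens).toString("utf8"),
};

console.log(roundTrips(byteTokenizer, "Hello World!")); // true
```

To check the real tokenizer, pass the object returned by require("rwkv-tokenizer-node") in place of `byteTokenizer`.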

Its primary purpose is to support the implementation of RWKV-cpp-node, though it could probably be used for other use cases (e.g. a pure-JS implementation of GPT-NeoX or RWKV).

What can be improved?

  • Performance: it's somewhat disappointing that this is easily 10x slower than the Python implementation (which I believe uses the Rust library). However, it is generally still good enough for most use cases.
  • Why not use the Hugging Face library? Sadly, the official Hugging Face tokenizer library for Node.js is broken: https://github.com/huggingface/tokenizers/issues/911

PS: Anyone with ideas on how to improve its performance without failing the test suite is welcome to contribute.
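
For anyone trying out performance ideas, a rough timing harness makes it easier to compare before and after. The following is a minimal sketch, not something shipped with the package: `measureEncode` is a hypothetical helper, the sample text and iteration count are arbitrary, and `standInTokenizer` is a byte-level placeholder so the snippet runs on its own. It only assumes an `encode(text)` method as shown in the usage section.

```javascript
const { performance } = require("perf_hooks");

// Hypothetical helper (not part of the package): times repeated encode()
// calls and reports elapsed time and throughput in characters per second.
function measureEncode(tokenizer, text, iterations = 100) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    tokenizer.encode(text);
  }
  const elapsedMs = performance.now() - start;
  return {
    elapsedMs,
    charsPerSecond: (text.length * iterations) / (elapsedMs / 1000),
  };
}

// Placeholder tokenizer (one token per UTF-8 byte), for illustration only.
const standInTokenizer = {
  encode: (text) => Array.from(Buffer.from(text, "utf8")),
};

const sample = "The quick brown fox jumps over the lazy dog. ".repeat(200);
const result = measureEncode(standInTokenizer, sample);
console.log(
  `${result.elapsedMs.toFixed(2)} ms, ${Math.round(result.charsPerSecond)} chars/s`
);
```

Wall-clock timings like this are noisy, so it is worth running the measurement several times and comparing on the same machine and input.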

How to run the test?

# This runs the sole test file, test/tokenizer.test.js
npm run test

The Python script used to seed the reference data (using the Hugging Face tokenizer) is found at test/build-test-token-json.py. This test includes a very extensive UTF-8 test file covering all major (and many minor) languages.
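
The general idea of testing against reference data can be sketched as follows. This is an illustration only and does not reflect the actual contents of test/tokenizer.test.js: the `{ text, tokens }` case shape, the `findMismatches` helper, and the byte-level stand-in tokenizer are all assumptions made so the snippet is self-contained.

```javascript
// Element-wise comparison of two token arrays.
function sameTokens(a, b) {
  return a.length === b.length && a.every((token, i) => token === b[i]);
}

// Hypothetical helper: returns the reference cases whose expected tokens
// differ from what tokenizer.encode() produces.
// The { text, tokens } shape is an assumption, not the package's real format.
function findMismatches(tokenizer, cases) {
  return cases.filter(
    ({ text, tokens }) => !sameTokens(tokenizer.encode(text), tokens)
  );
}

// Stand-in tokenizer (one token per UTF-8 byte), for illustration only.
const referenceStandIn = {
  encode: (text) => Array.from(Buffer.from(text, "utf8")),
};

const cases = [
  { text: "Hi", tokens: [72, 105] },  // matches the stand-in
  { text: "ok", tokens: [111, 107] }, // matches the stand-in
  { text: "no", tokens: [0] },        // deliberately wrong reference
];

const mismatches = findMismatches(referenceStandIn, cases);
console.log(mismatches.map((c) => c.text)); // [ 'no' ]
```

Reporting the mismatching inputs, rather than stopping at the first failure, makes it easier to spot which scripts or languages an encoding change has broken.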

Designated maintainer

@picocreator is the current maintainer of the project. Ping him on the RWKV Discord if you have any questions about it.

Special thanks & references

@saharNooby - whose work the current implementation is heavily based on

@cztomsik @josephrocca @BlinkDL - for their various implementations, which were used as references to squash encoding mismatches with the HF implementation.

Keywords

RWKV


Package last updated on 06 May 2023
