
AI21 Labs Tokenizer

A SentencePiece-based tokenizer for production use with AI21's models



Installation

pip

pip install ai21-tokenizer

poetry

poetry add ai21-tokenizer
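
To verify the installation, a quick sanity check is to import the package:

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers  # should import without errors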

Usage

Tokenizer Creation

Jamba 1.5 Mini Tokenizer

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
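
Once you have a tokenizer instance you can round-trip text through it; a minimal sketch (encode and decode are covered in detail under Functions below):

ids = tokenizer.encode("Hello, world!")  # text -> list of token IDs
text = tokenizer.decode(ids)  # token IDs -> original text
print(ids, text)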

Another way would be to use our Jamba 1.5 Mini tokenizer directly:

from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here

Async usage

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here

Jamba 1.5 Large Tokenizer

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here

Another way would be to use our Jamba 1.5 Large tokenizer directly:

from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here

Async usage

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here

Jamba Instruct Tokenizer

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here

Another way would be to use our Jamba tokenizer directly:

from ai21_tokenizer import JambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here

Async usage

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here

Another way would be to use the create class method on our async Jamba Instruct tokenizer:

from ai21_tokenizer import AsyncJambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
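
Since create is awaited, it has to run inside an event loop; a minimal driver sketch using the standard-library asyncio (the model path is a placeholder):

import asyncio

from ai21_tokenizer import AsyncJambaInstructTokenizer

async def main():
    # model_path is a placeholder for your local vocabs file
    tokenizer = await AsyncJambaInstructTokenizer.create(model_path="<Path to your vocabs file>")
    # Your code here

asyncio.run(main())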

J2 Tokenizer

from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
# Your code here

Another way would be to use our Jurassic tokenizer directly:

from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
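
If your config lives in a config.json file on disk, one way to load it into that dictionary (a sketch using only the standard library; the file paths are placeholders):

import json
from pathlib import Path

from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabs file>"
config = json.loads(Path("<Path to your config.json>").read_text())
tokenizer = JurassicTokenizer(model_path=model_path, config=config)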

Async usage

from ai21_tokenizer import Tokenizer

tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here

Another way would be to use the create class method on our async Jurassic tokenizer:

from ai21_tokenizer import AsyncJurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here

Functions

Encode and Decode

These functions let you encode text into a list of token IDs and decode those IDs back to plaintext:

text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")

Async

# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
What if you want to convert your token IDs back to tokens, or vice versa?

tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)

Async

# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = await tokenizer.convert_tokens_to_ids(tokens)
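
Putting the async pieces together, a minimal end-to-end sketch (assuming the Jamba 1.5 Mini tokenizer; any of the pre-trained tokenizer names works):

import asyncio

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

async def main():
    tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
    encoded = await tokenizer.encode("apple orange banana")
    decoded = await tokenizer.decode(encoded)
    print(encoded, decoded)

asyncio.run(main())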

For more examples, please see our examples folder.
