Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
A SentencePiece based tokenizer for production uses with AI21's models
pip install ai21-tokenizer
poetry add ai21-tokenizer
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
Another way would be to use our Jamba 1.5 Mini tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
Another way would be to use our Jamba 1.5 Large tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our Jamba tokenizer directly:
from ai21_tokenizer import JambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our async Jamba tokenizer class method create:
from ai21_tokenizer import AsyncJambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
from ai21_tokenizer import Tokenizer
tokenizer = Tokenizer.get_tokenizer()
# Your code here
Another way would be to use our Jurassic model directly:
from ai21_tokenizer import JurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that end with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
from ai21_tokenizer import Tokenizer
tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
Another way would be to use our async Jamba tokenizer class method create:
from ai21_tokenizer import AsyncJurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that end with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
These functions allow you to encode your text to a list of token ids and back to plaintext
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
For more examples, please see our examples folder.
FAQs
Unknown package
We found that ai21-tokenizer demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.