OpenVINO Tokenizers
OpenVINO Tokenizers adds text processing operations to OpenVINO.
Features
- Perform tokenization and detokenization without third-party dependencies
- Convert a HuggingFace tokenizer into OpenVINO model tokenizer and detokenizer
- Combine OpenVINO models into a single model
- Add greedy decoding pipeline to text generation model
Installation
(Recommended) Create and activate a virtual environment, using either venv:
python3 -m venv venv
source venv/bin/activate
or conda:
conda create --name openvino_tokenizers
conda activate openvino_tokenizers
Minimal Installation
Use the minimal installation when you already have a converted OpenVINO tokenizer. Install with pip:
pip install openvino-tokenizers
or with conda:
conda install -c conda-forge openvino openvino-tokenizers
Convert Tokenizers Installation
If you want to convert HuggingFace tokenizers into OpenVINO tokenizers, install with pip:
pip install openvino-tokenizers[transformers]
or with conda:
conda install -c conda-forge openvino openvino-tokenizers && pip install transformers[sentencepiece] tiktoken
Install Pre-release Version
Use openvino-tokenizers[transformers] to install tokenizers conversion dependencies.
pip install --pre -U openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Build and Install from Source
Using OpenVINO PyPI package
The openvino-tokenizers build depends on the openvino package, which is installed automatically from PyPI during the build process. To build against an unreleased version, install the openvino package from the nightly distribution channel using --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install . --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
This command is the equivalent of minimal installation. Install tokenizers conversion dependencies if needed:
pip install transformers[sentencepiece] tiktoken
:warning: The latest commit of OpenVINO Tokenizers might rely on features that are not present in the release OpenVINO version. Use a nightly build of OpenVINO or build OpenVINO Tokenizers from a release branch if you have issues with the build process.
Using OpenVINO archive
Install the OpenVINO archive distribution. Use --no-deps to avoid OpenVINO installation from PyPI into your current environment. --extra-index-url is needed to resolve build dependencies only.
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install --no-deps . --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
This command is the equivalent of minimal installation. Install tokenizers conversion dependencies if needed:
pip install transformers[sentencepiece] tiktoken
:warning: The latest commit of OpenVINO Tokenizers might rely on features that are not present in the release OpenVINO version. Use a nightly build of OpenVINO or build OpenVINO Tokenizers from a release branch if you have issues with the build process.
Build and install for development
Using OpenVINO PyPI package
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install -e .[all] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
cd tests/
pytest .
Using OpenVINO archive
Install the OpenVINO archive distribution. Use --no-deps to avoid OpenVINO installation from PyPI into your current environment. --extra-index-url is needed to resolve build dependencies only.
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install -e .[all] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
cd tests/
pytest .
C++ Installation
You can use converted tokenizers in C++ pipelines with prebuilt binaries.
- Download the OpenVINO archive distribution for your OS from here and extract the archive.
- Download the OpenVINO Tokenizers prebuilt libraries from here. To ensure compatibility, the first three numbers of the OpenVINO Tokenizers version must match the OpenVINO version, and the archive must be built for the same OS.
- Extract the OpenVINO Tokenizers archive into the OpenVINO installation directory. The OpenVINO Tokenizers archive keeps the same directory structure as the OpenVINO archive:
- Windows:
<openvino_dir>\runtime\bin\intel64\Release\
- MacOS_x86:
<openvino_dir>/runtime/lib/intel64/Release
- MacOS_arm64:
<openvino_dir>/runtime/lib/arm64/Release/
- Linux_x86:
<openvino_dir>/runtime/lib/intel64/
- Linux_arm64:
<openvino_dir>/runtime/lib/aarch64/
After that, you can add the binary extension in the code with:
- core.add_extension("openvino_tokenizers.dll") for Windows
- core.add_extension("libopenvino_tokenizers.dylib") for MacOS
- core.add_extension("libopenvino_tokenizers.so") for Linux
and read/compile converted (de)tokenizer models.
If you use version 2023.3.0.0, the binary extension file is called (lib)user_ov_extension.(dll/dylib/so).
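The two rules above (matching the first three version numbers, and picking the library file name per OS) can be sketched in Python. This is only an illustration: the helper names `release_triplet`, `is_compatible` and `tokenizers_extension_name` are hypothetical and not part of OpenVINO Tokenizers, and the version strings are made-up examples.

```python
import re
import sys


def release_triplet(version: str) -> tuple:
    """Return the first three numeric components of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_compatible(tokenizers_version: str, openvino_version: str) -> bool:
    """True when the first three version numbers of both packages match."""
    return release_triplet(tokenizers_version) == release_triplet(openvino_version)


def tokenizers_extension_name(platform_name: str = sys.platform, legacy_2023_3: bool = False) -> str:
    """Pick the extension file name listed above for the given platform.

    Assumption: any platform that is neither Windows nor macOS is treated as Linux.
    """
    stem = "user_ov_extension" if legacy_2023_3 else "openvino_tokenizers"
    if platform_name.startswith("win"):
        return f"{stem}.dll"
    if platform_name == "darwin":
        return f"lib{stem}.dylib"
    return f"lib{stem}.so"


# Illustrative version strings, not taken from a real installation.
print(is_compatible("2024.2.0.0", "2024.2.0"))
print(tokenizers_extension_name("linux"))
```

The returned file name still has to be resolved against the directory where the archive was extracted before it is passed to add_extension.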
C++ Build
To build OpenVINO Tokenizers binaries locally, use these commands:
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
After that, you can transfer all binaries from build/src to <openvino_dir> as described in the C++ installation instructions above.
Reducing the ICU Data Size
By default, all available ICU locales are supported, which significantly increases the package size. To reduce the size of the ICU libraries included in your final package, follow these steps:
- Use the ICU Data Configuration File:
  - This file specifies which features and locales to include in a custom data bundle. You can find more information here.
- Set the ICU Data Filter File as an Environment Variable:
  - On Unix-like systems (Linux, macOS), set the ICU_DATA_FILTER_FILE environment variable to the path of your configuration file (filters.json):
    export ICU_DATA_FILTER_FILE="filters.json"
  - On Windows, set the ICU_DATA_FILTER_FILE environment variable using the Command Prompt or PowerShell.
    Command Prompt:
    set ICU_DATA_FILTER_FILE=filters.json
    PowerShell:
    $env:ICU_DATA_FILTER_FILE="filters.json"
- Create a Configuration File:
  - An example configuration file (filters.json) might look like this:
    {
      "localeFilter": {
        "filterType": "language",
        "includelist": [
          "en"
        ]
      }
    }
- Configure OpenVINO Tokenizers:
  - When building OpenVINO tokenizers, set the following CMake option during the project configuration:
    -DBUILD_FAST_TOKENIZERS=ON
  - Example for a pip installation path:
    ICU_DATA_FILTER_FILE=</path/to/filters.json> pip install git+https://github.com/openvinotoolkit/openvino_tokenizers.git --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly --config-settings=override=cmake.options.BUILD_FAST_TOKENIZERS=ON
By following these instructions, you can effectively reduce the size of the ICU libraries in your final package.
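If you need more than one language, the filter file can also be generated instead of written by hand. The sketch below only reproduces the structure of the example filters.json shown above; the helper names `build_icu_locale_filter` and `write_icu_locale_filter` are hypothetical, and it assumes a plain list of language codes is all you want to keep.

```python
import json
from pathlib import Path


def build_icu_locale_filter(languages):
    """Build a filter dict with the same layout as the filters.json example."""
    return {
        "localeFilter": {
            "filterType": "language",
            "includelist": list(languages),
        }
    }


def write_icu_locale_filter(path, languages):
    """Write the filter to disk and return its absolute path."""
    target = Path(path)
    target.write_text(json.dumps(build_icu_locale_filter(languages), indent=2) + "\n", encoding="utf-8")
    return target.resolve()


# Show the generated JSON for two languages without touching the file system.
print(json.dumps(build_icu_locale_filter(["en", "de"]), indent=2))
```

The absolute path returned by `write_icu_locale_filter` is a convenient value for ICU_DATA_FILTER_FILE, since the build may not run from the directory where the file was created.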
Build OpenVINO Tokenizers without FastTokenizer Library
If a tokenizer doesn't use CaseFold, UnicodeNormalization or Wordpiece operations, you can drastically reduce the package binary size by building OpenVINO Tokenizers without the FastTokenizer dependency with this flag:
-DENABLE_FAST_TOKENIZERS=OFF
This option can also help with building for a platform that is not supported by FastTokenizer, for example Android x86_64.
Example for a pip installation path:
pip install git+https://github.com/openvinotoolkit/openvino_tokenizers.git --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly --config-settings=override=cmake.options.ENABLE_FAST_TOKENIZERS=OFF
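One way to check which operations a converted tokenizer uses is to list the layer types in its saved IR .xml file. The sketch below is a rough aid, not an official tool: it assumes the IR stores each operation as a `layer` element with a `type` attribute, and it assumes the type strings are spelled exactly like the operation names in this section, which may not hold for your version. Inspect the full list from `layer_types` before relying on the check. The sample XML is synthetic.

```python
import xml.etree.ElementTree as ET

# Names taken from the text above; the exact strings used in a real IR file are an assumption.
FAST_TOKENIZER_OPS = ("CaseFold", "UnicodeNormalization", "Wordpiece")

# Synthetic IR fragment for illustration only.
SAMPLE_IR = """
<net name="tokenizer" version="11">
  <layers>
    <layer id="0" name="string_input" type="Parameter" version="opset1"/>
    <layer id="1" name="fold" type="CaseFold" version="extension"/>
  </layers>
</net>
"""


def layer_types(ir_xml: str) -> set:
    """Collect the distinct values of the type attribute of all layer elements."""
    root = ET.fromstring(ir_xml)
    return {layer.get("type", "") for layer in root.iter("layer")}


def ops_requiring_fast_tokenizer(ir_xml: str, names=FAST_TOKENIZER_OPS) -> list:
    """Return the layer types that match one of the given names (case-insensitive)."""
    wanted = {name.lower() for name in names}
    return sorted(t for t in layer_types(ir_xml) if t.lower() in wanted)


print(sorted(layer_types(SAMPLE_IR)))
print(ops_requiring_fast_tokenizer(SAMPLE_IR))
```

For a real tokenizer, read the .xml file produced by save_model as text and pass its contents in place of `SAMPLE_IR`.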
Usage
:warning: OpenVINO Tokenizers can be inferred on a CPU device only.
Convert HuggingFace tokenizer
OpenVINO Tokenizers ships with a CLI tool that can convert tokenizers from the Huggingface Hub or Huggingface tokenizers saved on disk:
convert_tokenizer codellama/CodeLlama-7b-hf --with-detokenizer -o output_dir
There is also a convert_tokenizer function that can convert a tokenizer Python object.
import numpy as np
from transformers import AutoTokenizer
from openvino import compile_model, save_model
from openvino_tokenizers import convert_tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ov_tokenizer = convert_tokenizer(hf_tokenizer)
compiled_tokenizer = compile_model(ov_tokenizer)
text_input = ["Test string"]
hf_output = hf_tokenizer(text_input, return_tensors="np")
ov_output = compiled_tokenizer(text_input)
for output_name in hf_output:
    print(f"OpenVINO {output_name} = {ov_output[output_name]}")
    print(f"HuggingFace {output_name} = {hf_output[output_name]}")
# save the tokenizer to disk, then load it back and check the outputs are unchanged
save_model(ov_tokenizer, "openvino_tokenizer.xml")
loaded_tokenizer = compile_model("openvino_tokenizer.xml")
loaded_ov_output = loaded_tokenizer(text_input)
for output_name in hf_output:
    assert np.all(loaded_ov_output[output_name] == ov_output[output_name])
Connect Tokenizer to a Model
To infer and convert the original model, install torch or torch-cpu in the virtual environment.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from openvino import compile_model, convert_model
from openvino_tokenizers import convert_tokenizer, connect_models
checkpoint = "mrm8488/bert-tiny-finetuned-sms-spam-detection"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
text_input = ["Free money!!!"]
hf_input = hf_tokenizer(text_input, return_tensors="pt")
hf_output = hf_model(**hf_input)
ov_tokenizer = convert_tokenizer(hf_tokenizer)
ov_model = convert_model(hf_model, example_input=hf_input.data)
combined_model = connect_models(ov_tokenizer, ov_model)
compiled_combined_model = compile_model(combined_model)
openvino_output = compiled_combined_model(text_input)
print(f"OpenVINO logits: {openvino_output['logits']}")
print(f"HuggingFace logits: {hf_output.logits}")
Use Extension With Converted (De)Tokenizer or Model With (De)Tokenizer
Importing openvino_tokenizers adds all tokenizer-related operations to OpenVINO, after which you can work with saved tokenizers and detokenizers.
import numpy as np
import openvino_tokenizers
from openvino import Core
core = Core()
compiled_detokenizer = core.compile_model("detokenizer.xml")
token_ids = np.random.randint(100, 1000, size=(3, 5))
openvino_output = compiled_detokenizer(token_ids)
print(openvino_output["string_output"])
Text generation pipeline
import numpy as np
from openvino import compile_model, convert_model
from openvino_tokenizers import add_greedy_decoding, convert_tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint, use_cache=False)
text_input = ["Quick brown fox jumped "]
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
compiled_tokenizer = compile_model(ov_tokenizer)
ov_input = compiled_tokenizer(text_input)
hf_input = hf_tokenizer(text_input, return_tensors="pt")
ov_model = convert_model(hf_model, example_input=hf_input.data)
ov_model_with_greedy_decoding = add_greedy_decoding(ov_model)
compiled_model = compile_model(ov_model_with_greedy_decoding)
new_tokens_size = 10
prompt_size = ov_input["input_ids"].shape[-1]
input_dict = {
    output.any_name: np.hstack([tensor, np.zeros(shape=(1, new_tokens_size), dtype=np.int_)])
    for output, tensor in ov_input.items()
}
for idx in range(prompt_size, prompt_size + new_tokens_size):
    output = compiled_model(input_dict)["token_ids"]
    input_dict["input_ids"][:, idx] = output[:, idx - 1]
    input_dict["attention_mask"][:, idx] = 1
ov_token_ids = input_dict["input_ids"]
hf_token_ids = hf_model.generate(
    **hf_input,
    min_new_tokens=new_tokens_size,
    max_new_tokens=new_tokens_size,
    temperature=0,
)
compiled_detokenizer = compile_model(ov_detokenizer)
ov_output = compiled_detokenizer(ov_token_ids)["string_output"]
hf_output = hf_tokenizer.batch_decode(hf_token_ids, skip_special_tokens=True)
print(f"OpenVINO output string: `{ov_output}`")
print(f"HuggingFace output string: `{hf_output}`")
TensorFlow Text Integration
OpenVINO Tokenizers includes converters for certain TensorFlow Text operations.
Currently, only the MUSE model is supported.
Here is an example of model conversion and inference:
import numpy as np
import tensorflow_hub as hub
import tensorflow_text
from openvino import convert_model, compile_model
import openvino_tokenizers
sentences = ["dog", "I cuccioli sono carini.", "私は犬と一緒にビーチを散歩するのが好きです"]
tf_embed = hub.load(
"https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/"
"TensorFlow2/variations/multilingual/versions/2"
)
ov_model = convert_model(tf_embed)
ov_embed = compile_model(ov_model, "CPU")
ov_result = ov_embed(sentences)[ov_embed.output()]
tf_result = tf_embed(sentences)
assert np.all(np.isclose(ov_result, tf_result, atol=1e-4))
RWKV Tokenizer
from urllib.request import urlopen
from openvino import compile_model
from openvino_tokenizers import build_rwkv_tokenizer
rwkv_vocab_url = (
    "https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/tokenizer/rwkv_vocab_v20230424.txt"
)
with urlopen(rwkv_vocab_url) as vocab_file:
    vocab = map(bytes.decode, vocab_file)
    tokenizer, detokenizer = build_rwkv_tokenizer(vocab)
tokenizer, detokenizer = compile_model(tokenizer), compile_model(detokenizer)
print(tokenized := tokenizer(["Test string"])["input_ids"])
print(detokenizer(tokenized)["string_output"])
Supported Tokenizer Types
| Huggingface Tokenizer Type | Tokenizer Model Type | Tokenizer | Detokenizer |
|---|---|---|---|
| Fast | WordPiece | ✅ | ❌ |
| | BPE | ✅ | ✅ |
| | Unigram | ❌ | ❌ |
| Legacy | SentencePiece .model | ✅ | ✅ |
| Custom | tiktoken | ✅ | ✅ |
| RWKV | Trie | ✅ | ✅ |
Test Results
This report is autogenerated and includes tokenizers and detokenizers tests. The Output Matched, % column shows the percent of test strings for which the results of OpenVINO and Huggingface Tokenizers are the same. To update the report, run pytest --update_readme tokenizers_test.py in the tests directory.
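As a rough illustration of how such a percentage can be computed, the sketch below counts the test strings whose OpenVINO and Huggingface outputs are identical. It is not the code used by the test suite; the function names and the toy token ids are hypothetical. The pooled example uses match counts back-calculated from the two Tiktoken rows reported in this section (261 of 261 and 245 of 263), which is consistent with the per-type figure being pooled over all tests of that type.

```python
def output_matched_percent(ov_outputs, hf_outputs):
    """Percent of test strings whose OpenVINO and Huggingface outputs are equal."""
    if len(ov_outputs) != len(hf_outputs):
        raise ValueError("both lists must contain one output per test string")
    if not ov_outputs:
        raise ValueError("at least one test string is required")
    matched = sum(ov == hf for ov, hf in zip(ov_outputs, hf_outputs))
    return 100.0 * matched / len(ov_outputs)


def pooled_percent(counts):
    """Pool (matched, total) pairs from several models into one percentage."""
    matched = sum(m for m, _ in counts)
    total = sum(t for _, t in counts)
    return 100.0 * matched / total


# Toy token ids: three of the four outputs are identical.
ov = [[101, 7], [101, 8, 9], [101], [5, 6]]
hf = [[101, 7], [101, 8, 9], [101], [5, 7]]
print(output_matched_percent(ov, hf))                    # 75.0
print(round(pooled_percent([(261, 261), (245, 263)]), 2))  # 96.56
```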
Output Match by Tokenizer Type
Tokenizer Type | Output Matched, % | Number of Tests |
---|---|---|
BPE | 97.10 | 4544 |
SentencePiece | 88.32 | 6633 |
Tiktoken | 96.56 | 524 |
WordPiece | 98.39 | 747 |
Output Match by Model
Tokenizer Type | Model | Output Matched, % | Number of Tests |
---|---|---|---|
BPE | EleutherAI/gpt-neox-20b | 95.92 | 245 |
BPE | NousResearch/Meta-Llama-3-8B-Instruct | 100.00 | 247 |
BPE | Salesforce/codegen-16B-multi | 96.17 | 261 |
BPE | Xenova/gpt-4o | 100.00 | 261 |
BPE | ai-forever/rugpt3large_based_on_gpt2 | 94.64 | 261 |
BPE | bigscience/bloom | 97.55 | 245 |
BPE | databricks/dolly-v2-3b | 95.92 | 245 |
BPE | deepseek-ai/deepseek-coder-6.7b-instruct | 99.24 | 263 |
BPE | facebook/galactica-120b | 95.92 | 245 |
BPE | facebook/opt-66b | 96.73 | 245 |
BPE | gpt2 | 95.40 | 261 |
BPE | koalajun/Gemma-2-9b-it-Ko-Crypto-Translate | 100.00 | 247 |
BPE | laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 98.47 | 261 |
BPE | microsoft/deberta-base | 96.73 | 245 |
BPE | roberta-base | 95.40 | 261 |
BPE | stabilityai/stablecode-completion-alpha-3b-4k | 95.92 | 245 |
BPE | stabilityai/stablelm-2-1_6b | 100.00 | 245 |
BPE | tiiuae/falcon-7b | 93.87 | 261 |
SentencePiece | NousResearch/Llama-2-13b-hf | 96.73 | 245 |
SentencePiece | NousResearch/Llama-2-13b-hf_legacy | 95.92 | 245 |
SentencePiece | NousResearch/Llama-2-13b-hf_sp_backend | 95.10 | 245 |
SentencePiece | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 96.76 | 247 |
SentencePiece | TinyLlama/TinyLlama-1.1B-Chat-v1.0_legacy | 95.14 | 247 |
SentencePiece | TinyLlama/TinyLlama-1.1B-Chat-v1.0_sp_backend | 94.33 | 247 |
SentencePiece | baichuan-inc/Baichuan2-7B-Chat_legacy | 100.00 | 245 |
SentencePiece | camembert-base | 52.24 | 245 |
SentencePiece | camembert-base_legacy | 75.51 | 245 |
SentencePiece | facebook/musicgen-small | 83.67 | 245 |
SentencePiece | facebook/musicgen-small_legacy | 78.37 | 245 |
SentencePiece | microsoft/Phi-3-mini-128k-instruct | 95.95 | 247 |
SentencePiece | microsoft/Phi-3-mini-128k-instruct_legacy | 94.33 | 247 |
SentencePiece | microsoft/Phi-3-mini-128k-instruct_sp_backend | 95.14 | 247 |
SentencePiece | microsoft/deberta-v3-base | 96.73 | 245 |
SentencePiece | microsoft/deberta-v3-base_legacy | 100.00 | 245 |
SentencePiece | mlx-community/quantized-gemma-7b-it | 96.76 | 247 |
SentencePiece | mlx-community/quantized-gemma-7b-it_legacy | 97.57 | 247 |
SentencePiece | mlx-community/quantized-gemma-7b-it_sp_backend | 97.57 | 247 |
SentencePiece | rinna/bilingual-gpt-neox-4b | 82.04 | 245 |
SentencePiece | rinna/bilingual-gpt-neox-4b_legacy | 86.12 | 245 |
SentencePiece | t5-base | 85.31 | 245 |
SentencePiece | t5-base_legacy | 80.00 | 245 |
SentencePiece | xlm-roberta-base | 95.10 | 245 |
SentencePiece | xlm-roberta-base_legacy | 95.10 | 245 |
SentencePiece | xlnet-base-cased | 64.49 | 245 |
SentencePiece | xlnet-base-cased_legacy | 57.96 | 245 |
Tiktoken | Qwen/Qwen-14B-Chat | 100.00 | 261 |
Tiktoken | THUDM/glm-4-9b-chat | 93.16 | 263 |
WordPiece | ProsusAI/finbert | 100.00 | 109 |
WordPiece | bert-base-multilingual-cased | 100.00 | 109 |
WordPiece | cointegrated/rubert-tiny2 | 100.00 | 109 |
WordPiece | distilbert-base-uncased-finetuned-sst-2-english | 100.00 | 109 |
WordPiece | google/mobilebert-uncased | 100.00 | 93 |
WordPiece | rasa/LaBSE | 88.99 | 109 |
WordPiece | sentence-transformers/all-MiniLM-L6-v2 | 100.00 | 109 |
Recreating Tokenizers From Tests
For some tokenizers, you need to select certain settings so that their output is closer to the Huggingface tokenizers:
- THUDM/chatglm2-6b detokenizer always skips special tokens. Use skip_special_tokens=True during conversion
- THUDM/chatglm3-6b detokenizer doesn't skip special tokens. Use skip_special_tokens=False during conversion
- All tested tiktoken based detokenizers leave extra spaces. Use clean_up_tokenization_spaces=False during conversion
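These settings can be kept in a small lookup so that the same options are applied every time a tokenizer is recreated. The sketch below is one possible arrangement, not part of the library: `conversion_kwargs` and the two constants are hypothetical names, and the `is_tiktoken` flag must be supplied by the caller because the sketch does not detect the tokenizer type.

```python
# Per-model settings listed above.
DETOKENIZER_OVERRIDES = {
    "THUDM/chatglm2-6b": {"skip_special_tokens": True},
    "THUDM/chatglm3-6b": {"skip_special_tokens": False},
}

# Setting listed above for tiktoken based detokenizers.
TIKTOKEN_OVERRIDES = {"clean_up_tokenization_spaces": False}


def conversion_kwargs(model_id: str, is_tiktoken: bool = False) -> dict:
    """Collect the keyword arguments to use when converting the given tokenizer."""
    kwargs = {"with_detokenizer": True}
    if is_tiktoken:
        kwargs.update(TIKTOKEN_OVERRIDES)
    kwargs.update(DETOKENIZER_OVERRIDES.get(model_id, {}))
    return kwargs


print(conversion_kwargs("THUDM/chatglm3-6b"))
print(conversion_kwargs("Qwen/Qwen-14B-Chat", is_tiktoken=True))
```

The resulting dictionary is meant to be unpacked into the conversion call, for example convert_tokenizer(hf_tokenizer, **conversion_kwargs(model_id)), which returns both a tokenizer and a detokenizer because with_detokenizer is set.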