# ICE Tokenizer
- Token ids `[0, 20000)` are image tokens.
- Token ids `[20000, 20100)` are common tokens, mainly punctuation, e.g. `icetk[20000] == '<unk>'`, `icetk[20003] == '<pad>'`, `icetk[20006] == ','`.
- Token ids `[20100, 83823)` are English tokens.
- Token ids `[83823, 145653)` are Chinese tokens.
- Token ids `[145653, 150000)` are rare tokens, e.g. `icetk[145803] == 'α'` (a helper that classifies an id by these ranges is sketched after this list).
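Because the vocabulary is partitioned into contiguous id ranges, a token's modality can be recovered from its id alone. Below is a minimal sketch built from the boundaries listed above; the helper `token_category` is ours, not part of the icetk API.

```python
def token_category(token_id: int) -> str:
    """Classify an icetk token id by the vocabulary ranges listed above.
    Hypothetical helper, not part of the icetk API."""
    if 0 <= token_id < 20000:
        return 'image'
    if 20000 <= token_id < 20100:
        return 'common'
    if 20100 <= token_id < 83823:
        return 'english'
    if 83823 <= token_id < 145653:
        return 'chinese'
    if 145653 <= token_id < 150000:
        return 'rare'
    raise ValueError(f'{token_id} is outside the 150000-token base vocabulary')

assert token_category(145803) == 'rare'  # icetk[145803] == 'α'
```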
You can install the package via

```bash
pip install icetk
```
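To verify the install, you can round-trip a short string with the module-level `icetk` object introduced in the next section. As an assumption on our part, the first call may need to download the underlying tokenizer models.

```python
from icetk import icetk

# Round-trip a short string; expect the original text back.
print(icetk.decode(icetk.encode('hello icetk')))  # hello icetk
```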
## Tokenization
```python
from icetk import icetk

# Text: the same tokenizer handles both English and Chinese.
tokens = icetk.tokenize('Hello World! I am icetk.')  # subword strings
ids = icetk.encode('Hello World! I am icetk.')       # token ids
en = icetk.decode(ids)                               # 'Hello World! I am icetk.'

ids = icetk.encode('你好世界！这里是 icetk。')  # 'Hello world! This is icetk.'

# Images: encode a picture into discrete image tokens and decode it back.
ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
img = icetk.decode(image_ids=ids, compress_rate=8)  # reconstructed image tensor

from torchvision.utils import save_image
save_image(img, 'recover.jpg')
```
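As a quick piece of bookkeeping: `compress_rate=8` downsamples each spatial side by 8, so a 256×256 image becomes a 32×32 token grid, i.e. 1024 image tokens. The arithmetic below is a sketch of that relationship, under our assumption that `encode` emits one token per `compress_rate` × `compress_rate` patch.

```python
# Assumption: one image token per compress_rate x compress_rate patch.
image_size, compress_rate = 256, 8
tokens_per_side = image_size // compress_rate  # 32
num_image_tokens = tokens_per_side ** 2        # 1024
print(num_image_tokens)
```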
You can add special tokens:

```python
icetk.add_special_tokens(['<start_of_image>', '<start_of_english>', '<start_of_chinese>'])
```
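One natural use for these special tokens is delimiting modalities when concatenating text ids and image ids into a single sequence. The sketch below assumes that indexing `icetk` with a token string returns its id; the list at the top of this README only shows the id-to-string direction, so treat that lookup as an assumption.

```python
# Hedged sketch: one multimodal sequence with special-token delimiters.
# Assumption: icetk['<start_of_english>'] returns the special token's id.
text_ids = icetk.encode('A photo of a cat.')
image_ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)

sequence = (
    [icetk['<start_of_english>']] + text_ids +
    [icetk['<start_of_image>']] + image_ids.flatten().tolist()
)
```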
By default the tokenizer ignores line breaks; pass `ignore_linebreak=False` to keep them:

```python
icetk.decode(icetk.encode('abc\nhi', ignore_linebreak=False))  # 'abc\nhi'
icetk.decode(icetk.encode('abc\nhi'))                          # 'abc hi' (default)
```
You can also discourage a range of token ids, so the tokenizer avoids them and falls back to smaller pieces:

```python
icetk.tokenize('//--------')  # e.g. ['▁//', '--------']
icetk.text_tokenizer.discourage_ids(range(125653, 130000))
icetk.tokenize('//--------')  # the '--------' piece is now split into single '-' tokens
```