tkseem (تقسيم) is a tokenization library that encapsulates different approaches for tokenization and preprocessing of Arabic text.
Documentation
Please visit readthedocs for the full documentation.
Installation
pip install tkseem
Usage
Tokenization
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
tokenizer.tokenize("السلام عليكم")
tokenizer.encode("السلام عليكم")
tokenizer.decode([536, 829])
Caching
tokenizer.tokenize(open('data/raw/train.txt').read(), use_cache = True)
Save and Load
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
tokenizer.save_model('vocab.pl')
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
Model Agnostic
import tkseem as tk
import time
import seaborn as sns
import pandas as pd
def calc_time(fun):
start_time = time.time()
fun().train()
return time.time() - start_time
running_times = {}
running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
Notebooks
We show how to use tkseem
to train some nlp models.
Name | Description | Notebook |
---|
Demo | Explain the syntax of all tokenizers. |
|
Sentiment Classification | WordTokenizer for processing sentences and then train a classifier for sentiment classification. |
|
Meter Classification | CharacterTokenizer for meter classification using bidirectional GRUs. |
|
Translation | Seq-to-seq model with attention. |
|
Citation
@misc{tkseem2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tkseem: A Tokenization Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tkseem}}
}
Contribution
This is an open source project where we encourage contributions from the community.
License
MIT license.