import jagger
model_path = "model/kwdlc/patterns"
tokenizer = jagger.Jagger()
tokenizer.load_model(model_path)
text = "吾輩は猫である。名前はまだない。"
toks = tokenizer.tokenize(text)
for tok in toks:
    print(tok.surface(), tok.feature())
print("EOS")
"""
吾輩 名詞,普通名詞,*,*,吾輩,わがはい,代表表記:我が輩/わがはい カテゴリ:人
は 助詞,副助詞,*,*,は,は,*
猫 名詞,普通名詞,*,*,猫,ねこ,*
である 判定詞,*,判定詞,デアル列基本形,だ,である,*
。 特殊,句点,*,*,。,。,*
名前 名詞,普通名詞,*,*,名前,なまえ,*
は 助詞,副助詞,*,*,は,は,*
まだ 副詞,*,*,*,まだ,まだ,*
ない 形容詞,*,イ形容詞アウオ段,基本形,ない,ない,*
。 特殊,句点,*,*,。,。,*
EOS
"""

# print tags
for tok in toks:
    # print each tag (feature() split by comma)
    print(tok.surface())
    for i in range(tok.n_tags()):
        print(" tag[{}] = {}".format(i, tok.tag(i)))
print("EOS")
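Since feature() is a comma-separated string, it can also be unpacked into named fields in plain Python. The following is a minimal sketch: the field names are hypothetical labels inferred from the sample output above and are not part of the jagger API, and parse_feature is a helper introduced only for illustration.

```python
# Hypothetical field labels inferred from the sample output above;
# they are an assumption of this sketch, not names defined by jagger.
FIELD_NAMES = (
    "pos", "pos_detail", "conj_type", "conj_form",
    "lemma", "reading", "semantic_info",
)


def parse_feature(feature):
    """Split a comma-separated feature string into a dict of named fields."""
    # maxsplit keeps the last field intact even if it contains commas.
    values = feature.split(",", len(FIELD_NAMES) - 1)
    return dict(zip(FIELD_NAMES, values))


# Feature string copied from the sample output for "である".
feature = "判定詞,*,判定詞,デアル列基本形,だ,である,*"
fields = parse_feature(feature)
print(fields["pos"], fields["lemma"], fields["reading"])
```

With a loaded tokenizer, the same helper could be applied as parse_feature(tok.feature()) inside the loop above.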
Batch processing (experimental)
tokenize_batch tokenizes multiple lines at once. Lines are delimited by a newline ('\n', '\r', or '\r\n').
Line splitting is done on the C++ side.
import jagger
model_path = "model/kwdlc/patterns"
tokenizer = jagger.Jagger()
tokenizer.load_model(model_path)
text = """
吾輩は猫である。
名前はまだない。
明日の天気は晴れです。
"""

# optional: set the number of C++ threads (CPU cores) to use
# default: use all CPU cores.
# tokenizer.set_threads(4)
toks_list = tokenizer.tokenize_batch(text)
for toks in toks_list:
    for tok in toks:
        print(tok.surface(), tok.feature())
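To preview how a text would be divided under the delimiters named above ('\n', '\r', '\r\n'), the split can be mimicked in pure Python. This is only a sketch of the delimiter rule, not the C++ implementation; split_lines is a hypothetical helper, and whether tokenize_batch returns an entry for empty lines is not stated here, so verify that before pairing input lines with results.

```python
import re

# Match '\r\n' first so a Windows line ending counts as one delimiter.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text on '\\r\\n', '\\r', or '\\n' only."""
    return _NEWLINE.split(text)


text = "吾輩は猫である。\r\n名前はまだない。\r明日の天気は晴れです。\n"
lines = split_lines(text)
non_empty = [line for line in lines if line]
print(len(lines), len(non_empty))
```

A regular expression is used instead of str.splitlines() because splitlines() also breaks on other characters such as form feed and U+2028, which are not among the delimiters listed above.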
Train a model
A Python interface for training a model is not provided yet.
For now, you can build the C++ trainer CLI using CMake (Windows is supported).
See train/ for details.
Limitation
A single line must be less than 262,144 bytes (~= 87,000 Japanese characters in UTF-8).
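Because the limit is counted in bytes rather than characters, it can help to check input before tokenizing. The sketch below measures each line's UTF-8 length against the limit stated above; oversized_lines and MAX_LINE_BYTES are hypothetical names introduced for illustration and are not part of the jagger API.

```python
import re

MAX_LINE_BYTES = 262_144  # limit stated above; a line must be strictly smaller


def oversized_lines(text):
    """Return (line_index, byte_length) for each line at or over the limit."""
    too_long = []
    for index, line in enumerate(re.split(r"\r\n|\r|\n", text)):
        size = len(line.encode("utf-8"))
        if size >= MAX_LINE_BYTES:
            too_long.append((index, size))
    return too_long


ok_line = "あ" * 87_381    # 3 bytes per character: 262,143 bytes
long_line = "あ" * 87_382  # 262,146 bytes, over the limit
print(oversized_lines(ok_line + "\n" + long_line))
```

Lines reported by such a check would need to be split (for example at sentence boundaries) before being passed to tokenize or tokenize_batch.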
Jagger version
The Jagger version used in this Python binding is 2023-02-18.
For developer
Edit dev_mode=True in to enable asan + debug build
License
The Python binding is available under the 2-clause BSD licence.
Jagger and ccedar_core.h are licensed under GPLv2/LGPLv2.1/BSD triple licenses.
Third party licences
stack_container.h: BSD-like license.
nanocsv.h: MIT license.