Nagisa is a Python module for Japanese word segmentation/POS-tagging.
It is designed to be a simple and easy-to-use tool.
This tool has the following features.
- Based on recurrent neural networks.
- The word segmentation model uses character- and word-level features [池田+].
- The POS-tagging model uses tag dictionary information [Inoue+].
For more details, refer to the following links.
- The stop words for nagisa are available here.
- The presentation slide at PyCon JP (2022) is available here.
- The article in Japanese is available here.
- The documentation is available here.
Installation
Python 3.6 through 3.12 on Linux,
or Python 3.6 through 3.11 on macOS Intel is required.
This tool uses DyNet (the Dynamic Neural Network Toolkit) for its neural network computations.
You can install nagisa by using the following command.
pip install nagisa
Windows users should run it with Python 3.6, 3.7, or 3.8 (64-bit).
It is also compatible with the Windows Subsystem for Linux (WSL).
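The version requirements above can be checked before installing. The following is a minimal sketch, not part of nagisa: the names `SUPPORTED`, `is_supported`, and `current_environment_supported` are hypothetical, and it assumes that any platform or version not listed above should be treated as unsupported.

```python
import platform
import sys

# (minimum, maximum) Python versions per platform, copied from the text above.
# Treating everything else as unsupported is an assumption of this sketch.
SUPPORTED = {
    "Linux": ((3, 6), (3, 12)),
    "Darwin": ((3, 6), (3, 11)),   # macOS, Intel only
    "Windows": ((3, 6), (3, 8)),   # 64-bit only
}


def is_supported(system, version, machine="x86_64", is_64bit=True):
    """Return True if the platform/version pair falls in the listed ranges."""
    if system not in SUPPORTED:
        return False
    if system == "Darwin" and machine != "x86_64":
        return False
    if system == "Windows" and not is_64bit:
        return False
    low, high = SUPPORTED[system]
    return low <= tuple(version[:2]) <= high


def current_environment_supported():
    """Apply the check to the interpreter that is running this code."""
    return is_supported(
        platform.system(),
        sys.version_info[:2],
        platform.machine(),
        sys.maxsize > 2**32,
    )
```

Under WSL, `platform.system()` reports `Linux`, so the Linux range applies there.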
Basic usage
Sample of word segmentation and POS-tagging for Japanese.
import nagisa
text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
# Get a list of words
print(words.words)
# Get a list of POS tags
print(words.postags)
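Because `words.words` and `words.postags` are parallel lists, they can be combined into (word, tag) pairs for downstream processing. This is a minimal sketch: `pair_tokens` and `tag_pairs` are hypothetical helper names, not part of nagisa, and nagisa is imported lazily so the pure helper can be used on its own.

```python
def pair_tokens(words, postags):
    """Zip parallel word and POS-tag lists into (word, tag) tuples."""
    if len(words) != len(postags):
        raise ValueError("words and postags must have the same length")
    return list(zip(words, postags))


def tag_pairs(text):
    """Tag `text` with nagisa and return (word, tag) tuples."""
    import nagisa  # lazy import: only needed when this function is called

    result = nagisa.tagging(text)
    return pair_tokens(result.words, result.postags)
```

Calling `tag_pairs(text)` requires nagisa to be installed; `pair_tokens` works on any two lists of equal length.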
Post-processing functions
Filter and extract words by specific POS tags.
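# Remove the words tagged as 助詞 or 助動詞
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
# Keep only the words tagged as 名詞
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
# List the available POS tags
print(nagisa.tagger.postags)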
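The difference between filtering and extracting can be illustrated on plain lists. The sketch below is not nagisa's implementation; `filter_by_postags` and `extract_by_postags` are hypothetical names, and the tags in the usage example are hand-written for illustration rather than tagger output.

```python
def filter_by_postags(words, postags, filter_postags):
    """Drop every (word, tag) pair whose tag is in filter_postags."""
    drop = set(filter_postags)
    return [(w, t) for w, t in zip(words, postags) if t not in drop]


def extract_by_postags(words, postags, extract_postags):
    """Keep only the (word, tag) pairs whose tag is in extract_postags."""
    keep = set(extract_postags)
    return [(w, t) for w, t in zip(words, postags) if t in keep]


# Hand-written illustrative input, not produced by nagisa.
words = ["猫", "が", "走る"]
postags = ["名詞", "助詞", "動詞"]
without_particles = filter_by_postags(words, postags, ["助詞"])
nouns_only = extract_by_postags(words, postags, ["名詞"])
```

Filtering is a deny-list and extracting is an allow-list over the same tag set, so the two are complementary when given the same tags.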
Add a user dictionary the easy way.
text = "3月に見た「3月のライオン」"
# Tag with the default tagger
print(nagisa.tagging(text))
# Register '3月のライオン' as a single word and tag again
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
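When the user dictionary grows, it may be more convenient to keep the words in a text file. The following sketch assumes a file with one word per line; `parse_word_list`, `load_word_list`, and `build_tagger` are hypothetical helpers, and only the `single_word_list` argument of `nagisa.Tagger` comes from the example above.

```python
def parse_word_list(lines):
    """Return unique, non-empty words in first-seen order."""
    words = []
    seen = set()
    for line in lines:
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def load_word_list(path):
    """Read a UTF-8 file containing one word per line."""
    with open(path, encoding="utf-8") as f:
        return parse_word_list(f)


def build_tagger(path):
    """Create a tagger whose user dictionary comes from the file at `path`."""
    import nagisa  # lazy import: only needed when this function is called

    return nagisa.Tagger(single_word_list=load_word_list(path))
```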
Train a model
Nagisa (v0.2.0+) provides a simple training method
for a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.
The train/dev/test files are in TSV format.
Each line holds one word and its tag, separated by a tab: word \t(tab) tag.
Put EOS on its own line between sentences.
Refer to the sample datasets and the tutorial (Train a model for Universal Dependencies).
$ cat sample.train
唯一 NOUN
の ADP
趣味 NOUN
は ADP
料理 NOUN
EOS
とても ADV
おいしかっ ADJ
た AUX
です AUX
。 PUNCT
EOS
ドル NOUN
は ADP
主要 ADJ
通貨 NOUN
EOS
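Files in this format can be generated and checked with a few lines of code. The following is a minimal sketch, independent of nagisa: `format_tsv` and `parse_tsv` are hypothetical helpers, and requiring an EOS line after the final sentence is an assumption based on the sample above.

```python
def format_tsv(sentences):
    """Render sentences (lists of (word, tag) pairs) as word<TAB>tag lines."""
    lines = []
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word}\t{tag}")
        lines.append("EOS")  # sentence separator
    return "\n".join(lines) + "\n"


def parse_tsv(text):
    """Parse word<TAB>tag text back into sentences, raising on malformed lines."""
    sentences = []
    current = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        if line.strip() == "EOS":
            sentences.append(current)
            current = []
            continue
        parts = line.split("\t")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"line {number}: expected 'word<TAB>tag', got {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        # Assumption of this sketch: every sentence ends with an EOS line.
        raise ValueError("the last sentence is missing its EOS line")
    return sentences


data = [[("ドル", "NOUN"), ("は", "ADP")], [("料理", "NOUN")]]
rendered = format_tsv(data)
```

Running `parse_tsv` over a file before training catches lines that use spaces instead of a tab.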
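import nagisa
# After training finishes, three model files are saved (*.vocabs, *.params, *.hp).
nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")
# Build the tagger by loading the trained model files.
sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')
text = "福岡・博多の観光情報"
words = sample_tagger.tagging(text)
print(words)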
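Loading a trained model needs all three files, so it can help to check for them first. This sketch assumes the files are named `<model_name>.vocabs`, `<model_name>.params`, and `<model_name>.hp`, as in the example above; `model_files`, `missing_model_files`, and `load_tagger` are hypothetical helpers, not part of nagisa.

```python
import os

# File suffixes taken from the Tagger call above.
MODEL_SUFFIXES = ("vocabs", "params", "hp")


def model_files(model_name):
    """Map each Tagger keyword argument to its expected file path."""
    return {suffix: f"{model_name}.{suffix}" for suffix in MODEL_SUFFIXES}


def missing_model_files(model_name):
    """List the expected model files that do not exist on disk."""
    return [p for p in model_files(model_name).values() if not os.path.exists(p)]


def load_tagger(model_name):
    """Build a tagger from trained model files, failing early if any is missing."""
    missing = missing_model_files(model_name)
    if missing:
        raise FileNotFoundError("missing model files: " + ", ".join(missing))
    import nagisa  # lazy import: only needed once the files are confirmed

    return nagisa.Tagger(**model_files(model_name))
```

With the example above, `load_tagger("sample")` would pass the same three paths to `nagisa.Tagger`.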