Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

kuro2sudachi

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

kuro2sudachi

  • 0.4.6
  • PyPI
  • Socket score

Maintainers
1

kuro2sudachi

PyPi version PyTest

kuro2sudachi lets you to convert kuromoji user dict to sudachi user dict.

Usage

$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt

Custom pos convert dict

you can overwrite convert config with setting json file.

{
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000
    }
}

$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json

if you want to ignore unsupported pos error & invalid format, use --ignore flag.

Dictionary type

You can specify the dictionary with the tokenize option -s (default: core).

$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full

Auto Splitting

kuro2sudachi supports suto splitting.

{
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
        "split_mode": "C",
        "unit_div_mode": [
            "A", "B"
        ]
    }
}

output includes unit devision info.

$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞

$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore

$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*

Splitting Words defined by kuromoji

Currently, the CLI does not support word splitting defined by kuromoji. Therefore, the split representation of kuromoji is ignored.

中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*

For Developer

test kuro2sudachi

$ poetry install
$ poetry run pytest

exec kuro2sudachi command

$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc