yurenizer

A library for standardizing terms with spelling variations using a synonym dictionary.

0.2.2
PyPI

This is a Japanese text normalizer that resolves spelling inconsistencies.

A Japanese README is available here (日本語のREADMEはこちら):
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md

Overview

yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the Sudachi Synonym Dictionary.
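
To make the idea concrete, here is a purely conceptual sketch of dictionary-based unification. It is not how yurenizer is implemented (yurenizer relies on Sudachi's morphological analysis and the Sudachi Synonym Dictionary); the `VARIANTS` table and the `unify` function are hypothetical names used only for illustration. Longer surface forms are matched first so that text already in the representative form is left untouched.

```python
import re

# Hypothetical variant table: spelling variation -> representative form.
VARIANTS = {
    "パソコン": "パーソナルコンピューター",
    "パーソナル・コンピュータ": "パーソナルコンピューター",
    "パーソナルコンピュータ": "パーソナルコンピューター",
}


def build_pattern(variants):
    """Compile an alternation of all known surfaces, longest first."""
    # Include the representative forms too, so that an already-normalized
    # word is consumed whole instead of being matched by a shorter variant.
    surfaces = set(variants) | set(variants.values())
    ordered = sorted(surfaces, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))


def unify(text, variants=VARIANTS):
    """Replace every known variant in text with its representative form."""
    pattern = build_pattern(variants)
    return pattern.sub(lambda m: variants.get(m.group(0), m.group(0)), text)


unified = unify("パソコンとパーソナル・コンピュータ")
```

A plain string replacement like this ignores word boundaries and context, which is one reason a real normalizer tokenizes the text first.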

Web-based Demo

You can try the web-based demo here.
yurenizer Web-demo

Installation

pip install yurenizer

Download Synonym Dictionary

curl -L -o synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt

Usage

Quick Start

from yurenizer import SynonymNormalizer

# Load the Sudachi synonym dictionary downloaded in the previous step.
normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。

Customizing Settings

You can control normalization by passing a NormalizerConfig to the normalize method.

Example with Custom Settings

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「東日本旅客鉄道」は「JR東」や「JR-East」とも呼ばれます"
config = NormalizerConfig(
    taigen=True,
    yougen=False,
    expansion="from_another",
    unify_level="lexeme",
    other_language=False,
    alias=False,
    old_name=False,
    misuse=False,
    alphabetic_abbreviation=True,  # Normalize only alphabetic abbreviations
    non_alphabetic_abbreviation=False,
    alphabet=False,
    orthographic_variation=False,
    misspelling=False,
)
print(f"Output: {normalizer.normalize(text, config)}")
# Output: 「東日本旅客鉄道」は「JR東」や「東日本旅客鉄道」とも呼ばれます

Configuration Details

The settings in yurenizer are organized hierarchically, allowing you to control the scope and target of normalization.

1. taigen / yougen (Target Selection)

Use the taigen and yougen flags to control which parts of speech are included in the normalization.

| Setting | Default Value | Description |
| --- | --- | --- |
| taigen | True | Includes nouns and other substantives in the normalization. Set to False to exclude them. |
| yougen | False | Includes verbs and other predicates in the normalization. Set to True to include them (normalized to their lemma). |

2. expansion (Expansion Flag)

The expansion flag determines how synonyms are expanded based on the synonym dictionary's internal control flags.

| Value | Description |
| --- | --- |
| from_another | Expands only the synonyms with a control flag value of 0 in the synonym dictionary. |
| any | Expands all synonyms regardless of their control flag value. |
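
As a rough illustration of the two expansion values, the sketch below filters a few made-up entries by their control flag. The `Entry` class, the `select_expandable` function, and the sample flag values are hypothetical and are not part of yurenizer's API; only the rule itself (from_another keeps flag 0, any keeps everything) comes from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one synonym dictionary entry."""
    surface: str
    control_flag: int


def select_expandable(entries, expansion="from_another"):
    """Return the entries that would be expanded under the given setting."""
    if expansion == "any":
        # Every entry is expanded, whatever its control flag says.
        return list(entries)
    if expansion == "from_another":
        # Only entries whose control flag is 0 are expanded.
        return [e for e in entries if e.control_flag == 0]
    raise ValueError(f"unknown expansion value: {expansion!r}")


# Made-up entries; the flag values are for illustration only.
entries = [Entry("word_a", 0), Entry("word_b", 1), Entry("word_c", 2)]
strict = select_expandable(entries, "from_another")
loose = select_expandable(entries, "any")
```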

3. unify_level (Normalization Level)

Specify the level of normalization with the unify_level parameter.

| Value | Description |
| --- | --- |
| lexeme | Performs the most comprehensive normalization, targeting all groups (a, b, c) mentioned below. |
| word_form | Normalizes by word form, targeting groups b and c. |
| abbreviation | Normalizes by abbreviation, targeting group c only. |

4. Detailed Normalization Settings (a, b, c Groups)

a Group: Comprehensive Lexical Normalization

Controls normalization based on vocabulary and semantics using the following settings:

| Setting | Default Value | Description |
| --- | --- | --- |
| other_language | True | Normalizes non-Japanese terms (e.g., English) to Japanese. Set to False to disable this feature. |
| alias | True | Normalizes aliases. Set to False to disable this feature. |
| old_name | True | Normalizes old names. Set to False to disable this feature. |
| misuse | True | Normalizes misused terms. Set to False to disable this feature. |

b Group: Abbreviation Normalization

Controls normalization of abbreviations using the following settings:

| Setting | Default Value | Description |
| --- | --- | --- |
| alphabetic_abbreviation | True | Normalizes abbreviations written in alphabetic characters. Set to False to disable this feature. |
| non_alphabetic_abbreviation | True | Normalizes abbreviations written in non-alphabetic characters (e.g., Japanese). Set to False to disable this feature. |

c Group: Orthographic Normalization

Controls normalization of orthographic variations and errors using the following settings:

| Setting | Default Value | Description |
| --- | --- | --- |
| alphabet | True | Normalizes alphabetic variations. Set to False to disable this feature. |
| orthographic_variation | True | Normalizes orthographic variations. Set to False to disable this feature. |
| misspelling | True | Normalizes misspellings. Set to False to disable this feature. |
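
One way to read the hierarchy is that unify_level selects which groups are in scope, and the per-group flags then switch individual settings on or off within that scope. The sketch below encodes that reading. It is an assumption about how the settings combine, not yurenizer's internal logic, and `effective_settings` is a hypothetical helper, not a library function.

```python
# Groups targeted by each unify_level value, as listed in the unify_level table.
LEVEL_GROUPS = {
    "lexeme": ("a", "b", "c"),
    "word_form": ("b", "c"),
    "abbreviation": ("c",),
}

# Detailed settings belonging to each group, as listed in the group tables.
GROUP_SETTINGS = {
    "a": ("other_language", "alias", "old_name", "misuse"),
    "b": ("alphabetic_abbreviation", "non_alphabetic_abbreviation"),
    "c": ("alphabet", "orthographic_variation", "misspelling"),
}


def effective_settings(unify_level, **flags):
    """List the detailed settings that remain active (assumed semantics)."""
    active = []
    for group in LEVEL_GROUPS[unify_level]:
        for name in GROUP_SETTINGS[group]:
            # Every detailed setting defaults to True in the tables.
            if flags.get(name, True):
                active.append(name)
    return active


# Mirrors the flags used in "Example with Custom Settings".
only_alpha_abbrev = effective_settings(
    "lexeme",
    other_language=False, alias=False, old_name=False, misuse=False,
    alphabetic_abbreviation=True, non_alphabetic_abbreviation=False,
    alphabet=False, orthographic_variation=False, misspelling=False,
)
```

Under this reading, the custom-settings example leaves only alphabetic_abbreviation active, which matches its output: "JR-East" is normalized while "JR東" is left as is.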

5. custom_synonym (Custom Dictionary)

If you want to use a custom dictionary, control its behavior with the following setting:

| Setting | Default Value | Description |
| --- | --- | --- |
| custom_synonym | True | Enables the use of a custom dictionary. Set to False to disable it. |

This hierarchical configuration allows for flexible normalization by defining the scope and target in detail.

Custom Dictionary Specification

You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.
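
The precedence rule can be pictured as a simple dictionary merge in which custom entries overwrite dictionary entries for the same word. This is only a conceptual sketch: `merge_dictionaries` and the sample entries are hypothetical, and yurenizer's actual lookup may be organized differently.

```python
def merge_dictionaries(sudachi_map, custom_map):
    """Combine two variant -> representative maps; custom entries win."""
    merged = dict(sudachi_map)
    # update() overwrites any key that appears in both maps.
    merged.update(custom_map)
    return merged


# Made-up entries for illustration only.
sudachi_map = {"variant_x": "standard_form", "variant_y": "standard_form"}
custom_map = {"variant_x": "house_style_form"}
merged = merge_dictionaries(sudachi_map, custom_map)
```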

Custom Dictionary Format

The custom dictionary file should be in JSON, CSV, or TSV format.

  • JSON file
{
    "Representative word 1": ["Synonym 1_1", "Synonym 1_2", ...],
    "Representative word 2": ["Synonym 2_1", "Synonym 2_2", ...]
}
  • CSV file
Representative word 1,Synonym 1_1,Synonym 1_2,...
Representative word 2,Synonym 2_1,Synonym 2_2,...
  • TSV file
Representative word 1	Synonym 1_1	Synonym 1_2	...
Representative word 2	Synonym 2_1	Synonym 2_2	...
...

Example

If you create a file like the one below, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書".

  • JSON file
{
    "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}
  • CSV file
幽遊白書,幽白,ゆうはく,幽☆遊☆白書
  • TSV file
幽遊白書	幽白	ゆうはく	幽☆遊☆白書
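
To check a custom dictionary file before handing it to yurenizer, a small standalone reader such as the one below may help. It uses only the Python standard library; `load_custom_dictionary` is a hypothetical helper, and picking the format from the file extension is an assumption of this sketch, not documented yurenizer behavior.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_custom_dictionary(path):
    """Read a JSON, CSV, or TSV dictionary into a variant -> representative map."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            groups = json.load(f)
    elif suffix in (".csv", ".tsv"):
        delimiter = "," if suffix == ".csv" else "\t"
        with path.open(encoding="utf-8", newline="") as f:
            # First column is the representative word, the rest are synonyms.
            groups = {row[0]: row[1:] for row in csv.reader(f, delimiter=delimiter) if row}
    else:
        raise ValueError(f"unsupported dictionary format: {suffix!r}")
    return {variant: rep for rep, variants in groups.items() for variant in variants}


# Write the example dictionary in two formats and read both back.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "custom.json"
    json_path.write_text(
        json.dumps({"幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]}, ensure_ascii=False),
        encoding="utf-8",
    )
    tsv_path = Path(tmp) / "custom.tsv"
    tsv_path.write_text("幽遊白書\t幽白\tゆうはく\t幽☆遊☆白書\n", encoding="utf-8")
    from_json = load_custom_dictionary(json_path)
    from_tsv = load_custom_dictionary(tsv_path)
```

Both files describe the same group, so the two resulting maps should be identical. Note that `json.load` rejects trailing commas, so the JSON file must not end its last entry with one.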

How to Specify

normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict_file")

Normalization Using a CSV File

You can also normalize the contents of a CSV file in bulk.

Example input file (input.csv):

JR東日本
JR東
JR-East

Normalize using CsvSynonymNormalizer as shown below.

from yurenizer import CsvSynonymNormalizer
input_file_path = "input.csv"
output_file_path = "output.csv"
csv_normalizer = CsvSynonymNormalizer(synonym_file_path="synonyms.txt")
csv_normalizer.normalize_csv(input_file_path, output_file_path)

The resulting output.csv contains the following:

raw,normalized
JR東日本,東日本旅客鉄道
JR東,東日本旅客鉄道
JR-East,東日本旅客鉄道
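
The shape of that output can be reproduced with the standard csv module, which is handy for post-processing or testing. The sketch below is not CsvSynonymNormalizer's implementation: `normalize_rows` is a hypothetical helper, and a fixed lookup table stands in for the real normalizer.

```python
import csv
import io


def normalize_rows(lines, normalize):
    """Build 'raw,normalized' CSV text from an iterable of input lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["raw", "normalized"])
    for line in lines:
        raw = line.strip()
        if raw:  # skip blank lines
            writer.writerow([raw, normalize(raw)])
    return out.getvalue()


# Stand-in for a real normalizer: a fixed lookup table (illustration only).
stand_in = {
    "JR東日本": "東日本旅客鉄道",
    "JR東": "東日本旅客鉄道",
    "JR-East": "東日本旅客鉄道",
}
result = normalize_rows(
    ["JR東日本\n", "JR東\n", "JR-East\n"],
    lambda term: stand_in.get(term, term),
)
```

With yurenizer installed, the lambda could be swapped for a call to a SynonymNormalizer's normalize method.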

Specifying SudachiDict

The granularity of text segmentation depends on which SudachiDict edition is used. The default is "full", but you can specify "small" or "core" instead.
To use "small" or "core", install the corresponding package and pass its name to SynonymNormalizer():

pip install sudachidict_small
# or
pip install sudachidict_core

normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")

Note: See the SudachiDict documentation for details.

License

This project is licensed under the Apache License 2.0.

Open Source Software Used

This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.

For detailed license information, please check the LICENSE files of SudachiPy and SudachiDict.

Keywords

nlp
