
A library for standardizing terms with spelling variations using a synonym dictionary.
This is a Japanese text normalizer that resolves spelling inconsistencies.
The Japanese README is available here: https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md
yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the Sudachi Synonym Dictionary.
You can try the web-based demo here: yurenizer Web-demo.
Install the package with pip and download the Sudachi synonym dictionary file:
pip install yurenizer
curl -L -o synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt
Then use it as follows:
from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。
You can control normalization by specifying a NormalizerConfig as an argument to the normalize function.
from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「東日本旅客鉄道」は「JR東」や「JR-East」とも呼ばれます"
config = NormalizerConfig(
    taigen=True,
    yougen=False,
    expansion="from_another",
    unify_level="lexeme",
    other_language=False,
    alias=False,
    old_name=False,
    misuse=False,
    alphabetic_abbreviation=True,  # Normalize only alphabetic abbreviations
    non_alphabetic_abbreviation=False,
    alphabet=False,
    orthographic_variation=False,
    misspelling=False
)
print(f"Output: {normalizer.normalize(text, config)}")
# Output: 「東日本旅客鉄道」は「JR東」や「東日本旅客鉄道」とも呼ばれます
The settings in yurenizer are organized hierarchically, allowing you to control the scope and target of normalization.
Use the taigen and yougen flags to control which parts of speech are included in the normalization.
Setting | Default Value | Description |
---|---|---|
taigen | True | Includes nouns and other substantives in the normalization. Set to False to exclude them. |
yougen | False | Includes verbs and other predicates in the normalization. Set to True to include them (normalized to their lemma). |
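For example, the following sketch enables yougen so that predicates are also normalized to their lemma. The input sentence is illustrative and unspecified flags are assumed to keep the defaults listed above; the exact result depends on the entries in the Sudachi synonym dictionary.
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
# Include predicates (verbs, adjectives) in addition to substantives.
config = NormalizerConfig(taigen=True, yougen=True)
print(normalizer.normalize("パソコンを買い換える", config))
# With yougen=False (the default), predicates would be left untouched.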
The expansion flag determines how synonyms are expanded based on the synonym dictionary's internal control flags.
Value | Description |
---|---|
from_another | Expands only the synonyms with a control flag value of 0 in the synonym dictionary. |
any | Expands all synonyms regardless of their control flag value. |
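As a sketch, the only change below is expansion="any", which also expands synonyms whose control flag is not 0. Whether a given pair is affected depends on the control flags recorded in the dictionary; all other settings are assumed to keep their defaults.
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
# Expand every synonym, regardless of the control flag in the synonym dictionary.
config = NormalizerConfig(expansion="any")
print(normalizer.normalize("「パーソナル・コンピュータ」と「パソコン」を比較する", config))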
Specify the level of normalization with the unify_level parameter.
Value | Description |
---|---|
lexeme | Performs the most comprehensive normalization, targeting all groups (a, b, and c) covered by the flag sections below. |
word_form | Normalizes by word form, targeting groups b and c. |
abbreviation | Normalizes by abbreviation, targeting group c only. |
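For example, restricting unification to the abbreviation level leaves higher-level variation alone. This is a sketch under the assumption that unspecified flags keep their defaults; which terms are actually affected depends on the dictionary.
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
# Restrict unification to the narrowest level ("abbreviation" in the table above).
config = NormalizerConfig(unify_level="abbreviation")
print(normalizer.normalize("「JR東」は「東日本旅客鉄道」の略称です", config))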
The following settings control normalization based on vocabulary and semantics:
Setting | Default Value | Description |
---|---|---|
other_language | True | Normalizes non-Japanese terms (e.g., English) to Japanese. Set to False to disable this feature. |
alias | True | Normalizes aliases. Set to False to disable this feature. |
old_name | True | Normalizes old names. Set to False to disable this feature. |
misuse | True | Normalizes misused terms. Set to False to disable this feature. |
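The sketch below turns off only other_language, so non-Japanese terms are left as written while aliases, old names, and misused terms are still normalized. The input sentence is illustrative and the outcome depends on the dictionary entries.
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
# Do not normalize non-Japanese terms to their Japanese representative form;
# alias, old-name, and misuse normalization remain enabled.
config = NormalizerConfig(other_language=False)
print(normalizer.normalize("「personal computer」は「パソコン」とも呼ばれます", config))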
The following settings control normalization of abbreviations:
Setting | Default Value | Description |
---|---|---|
alphabetic_abbreviation | True | Normalizes abbreviations written in alphabetic characters. Set to False to disable this feature. |
non_alphabetic_abbreviation | True | Normalizes abbreviations written in non-alphabetic characters (e.g., Japanese). Set to False to disable this feature. |
The following settings control normalization of orthographic variations and errors:
Setting | Default Value | Description |
---|---|---|
alphabet | True | Normalizes alphabetic variations. Set to False to disable this feature. |
orthographic_variation | True | Normalizes orthographic variations. Set to False to disable this feature. |
misspelling | True | Normalizes misspellings. Set to False to disable this feature. |
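Here is a sketch that disables only misspelling correction while keeping the other orthographic normalizations. As above, unspecified flags are assumed to keep their defaults, and the affected strings depend on the dictionary entries.
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
# Leave misspellings as written; alphabetic and orthographic variations are still unified.
config = NormalizerConfig(misspelling=False)
print(normalizer.normalize("「パーソナル・コンピュータ」を購入した", config))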
If you want to use a custom dictionary, control its behavior with the following setting:
Setting | Default Value | Description |
---|---|---|
custom_synonym | True | Enables the use of a custom dictionary. Set to False to disable it. |
This hierarchical configuration allows for flexible normalization by defining the scope and target in detail.
You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.
The custom dictionary file should be in JSON, CSV, or TSV format.
JSON:
{
  "Representative word 1": ["Synonym 1_1", "Synonym 1_2", ...],
  "Representative word 2": ["Synonym 2_1", "Synonym 2_2", ...]
}
CSV:
Representative word 1,Synonym 1_1,Synonym 1_2,...
Representative word 2,Synonym 2_1,Synonym 2_2,...
TSV (tab-separated):
Representative word 1	Synonym 1_1	Synonym 1_2	...
Representative word 2	Synonym 2_1	Synonym 2_2	...
...
If you create a file like one of the following, "幽白", "ゆうはく", and "幽☆遊☆白書" will all be normalized to "幽遊白書".
JSON:
{
  "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}
CSV:
幽遊白書,幽白,ゆうはく,幽☆遊☆白書
TSV:
幽遊白書	幽白	ゆうはく	幽☆遊☆白書
Pass the file path when creating the normalizer:
normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict_file")
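Putting it together, here is a minimal sketch that assumes the JSON example above is saved as custom_dict.json (the filename is illustrative) and that custom_synonyms_file can be combined with the synonym_file_path argument shown in the earlier examples.
from yurenizer import SynonymNormalizer

# Entries in the custom dictionary take precedence over the Sudachi synonym dictionary.
normalizer = SynonymNormalizer(
    synonym_file_path="synonyms.txt",
    custom_synonyms_file="custom_dict.json"
)
print(normalizer.normalize("「幽白」は「幽☆遊☆白書」の略称です"))
# "幽白" and "幽☆遊☆白書" are expected to be unified to "幽遊白書".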
You can also normalize the contents of a CSV file. For example, prepare an input.csv with one term per line:
JR東日本
JR東
JR-East
Normalize it using CsvSynonymNormalizer as shown below.
from yurenizer import CsvSynonymNormalizer
input_file_path = "input.csv"
output_file_path = "output.csv"
csv_normalizer = CsvSynonymNormalizer(synonym_file_path="synonyms.txt")
csv_normalizer.normalize_csv(input_file_path, output_file_path)
The resulting output.csv file will look like this:
raw,normalized
JR東日本,東日本旅客鉄道
JR東,東日本旅客鉄道
JR-East,東日本旅客鉄道
The length of text segmentation varies depending on the type of SudachiDict. The default is "full", but you can specify "small" or "core".
To use "small" or "core", install the corresponding dictionary and specify it in the SynonymNormalizer() arguments:
pip install sudachidict_small
# or
pip install sudachidict_core
normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")
Note: Please refer to the SudachiDict documentation for details.
This project is licensed under the Apache License 2.0.
This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.
For detailed license information, please check the LICENSE files of each project.