UnihanLang
unihan_lang is a Ruby library for identifying the language of a text (Traditional Chinese, Simplified Chinese) and performing various checks on Chinese characters.
This document can also be read in Japanese.
Installation
Add this line to your application's Gemfile:
gem 'unihan_lang'
And then execute:
bundle install
Or install it yourself as:
gem install unihan_lang
Usage
require 'unihan_lang'
unihan = UnihanLang::Unihan.new
# Determine the overall language of a string
puts unihan.determine_language("這是繁體中文")
puts unihan.determine_language("这是简体中文")
# Check whether the text is Traditional Chinese
puts unihan.zh_tw?("這是繁體中文")
puts unihan.zh_tw?("这不是繁体中文")
# Check whether the text is Simplified Chinese
puts unihan.zh_cn?("这是简体中文")
puts unihan.zh_cn?("這不是簡體中文")
# Check for, and extract, Chinese characters in mixed text
puts unihan.contains_chinese?("This text contains 中文")
puts unihan.contains_chinese?("This text has no Chinese")
puts unihan.extract_chinese_characters("This text contains 中文").join
# Check whether the text consists only of Traditional or only of Simplified characters
puts unihan.only_zh_tw?("繁體")
puts unihan.only_zh_tw?("繁體简体")
puts unihan.only_zh_cn?("简体")
puts unihan.only_zh_cn?("简体繁體")
# Check whether the text contains any Traditional or any Simplified characters
puts unihan.contains_zh_tw?("這個text包含繁體字")
puts unihan.contains_zh_tw?("这个text不包含繁体字")
puts unihan.contains_zh_cn?("这个text包含简体字")
puts unihan.contains_zh_cn?("這個text不包含簡體字")
Features
determine_language(text): Determines the language of the text ("ZH_TW", "ZH_CN", "JA", "Unknown").
zh_tw?(text): Checks if the text is in Traditional Chinese.
zh_cn?(text): Checks if the text is in Simplified Chinese.
contains_chinese?(text): Checks if the text contains Chinese characters.
extract_chinese_characters(text): Extracts Chinese characters from the text.
only_zh_tw?(text): Checks if the text consists only of Traditional Chinese characters.
only_zh_cn?(text): Checks if the text consists only of Simplified Chinese characters.
contains_zh_tw?(text): Checks if the text contains Traditional Chinese characters.
contains_zh_cn?(text): Checks if the text contains Simplified Chinese characters.
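To give a feel for what the character-level checks involve, here is a minimal standalone sketch that does not use unihan_lang at all. It relies on Ruby's built-in `\p{Han}` script property, and the helper names `sketch_contains_han?` and `sketch_extract_han` are hypothetical. How the library itself implements `contains_chinese?` and `extract_chinese_characters` is not stated here and may well differ.

```ruby
# Standalone illustration; NOT the unihan_lang implementation.
# Both helper names are hypothetical and exist only for this sketch.

# True if the text contains at least one character in the Han script.
def sketch_contains_han?(text)
  text.match?(/\p{Han}/)
end

# Returns the Han-script characters of the text as an array, in order.
def sketch_extract_han(text)
  text.scan(/\p{Han}/)
end

puts sketch_contains_han?("This text contains 中文")
puts sketch_contains_han?("This text has no Chinese")
puts sketch_extract_han("This text contains 中文").join
```

Note that the Han script covers characters shared by Chinese and Japanese, so a script-level test like this cannot tell the two languages apart on its own.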
Note
This library does not guarantee 100% accuracy in language identification.
Short texts and texts that mix multiple languages are particularly hard to classify.
The distinction between Traditional and Simplified Chinese is based on the Unihan database.
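The following sketch illustrates, under simplifying assumptions, why short texts are hard to classify. The Unihan database records variant relationships in fields such as kSimplifiedVariant and kTraditionalVariant. The sketch replaces that data with two tiny hand-picked sets, and both the sets and the method name `sketch_language` are hypothetical. It is not the library's actual algorithm.

```ruby
require 'set'

# Hypothetical, hand-picked samples. A real table would be derived from the
# Unihan variant data rather than typed in by hand.
TRADITIONAL_SAMPLE = Set.new(%w[這 體 簡 個]).freeze
SIMPLIFIED_SAMPLE  = Set.new(%w[这 体 简 个]).freeze

# Classify text by counting characters that belong to only one of the sets.
# Returning "Unknown" on a tie is an assumption made for this sketch.
def sketch_language(text)
  trad = text.each_char.count { |c| TRADITIONAL_SAMPLE.include?(c) }
  simp = text.each_char.count { |c| SIMPLIFIED_SAMPLE.include?(c) }
  return "Unknown" if trad == simp
  trad > simp ? "ZH_TW" : "ZH_CN"
end

puts sketch_language("這是繁體中文")
puts sketch_language("这是简体中文")
puts sketch_language("中文")
```

The last call shows the limitation: 中 and 文 are written identically in both scripts, so a text made up only of shared characters offers no evidence either way.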