biangbiang
Chinese NLP utilities
Installation
For npm:
npm install biangbiang
For Yarn:
yarn add biangbiang
Getting started
With import:
import biangbiang from "biangbiang";
With require:
const biangbiang = require('biangbiang');
Methods
Dictionary
define(word, dictionary)
Look up the pinyin and definition of a word, where dictionary is "simplified", "traditional", or "merged". The result also includes the word's frequency index (rank).
define('面条', 'simplified');
{
simplified: '面条',
traditional: '麵條',
pinyin: 'mian4 tiao2',
definition: 'noodles',
index: 6029
}
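The pinyin field uses tone numbers rather than tone marks. As a minimal sketch of how a caller might post-process it, the snippet below splits the string into syllables and tones. splitPinyin is a hypothetical helper, not part of biangbiang, and it assumes space-separated syllables with an optional trailing tone digit from 1 to 5, as in the example output.

```javascript
// Hypothetical helper (not part of biangbiang): split a numbered-pinyin
// string such as 'mian4 tiao2' into syllables and tone numbers.
// Assumes syllables are separated by whitespace and end in an optional digit 1-5.
function splitPinyin(pinyin) {
  return pinyin
    .trim()
    .split(/\s+/)
    .filter(Boolean) // drop the empty token produced by an empty input string
    .map((token) => {
      const match = /^(.+?)([1-5])?$/.exec(token);
      return {
        syllable: match[1],
        tone: match[2] ? Number(match[2]) : null, // null when no tone digit is present
      };
    });
}

// Entry copied from the define('面条', 'simplified') example above.
const entry = {
  simplified: '面条',
  traditional: '麵條',
  pinyin: 'mian4 tiao2',
  definition: 'noodles',
  index: 6029,
};

const syllables = splitPinyin(entry.pinyin);
console.log(syllables);
```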
kind(character)
Check whether a character is simplified or traditional and, if so, return its other form. type is 1 for simplified, 2 for traditional, and 3 for both.
kind("面");
{ type: 1, other: '麵'}
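Since type is a numeric code, callers may want a readable label. The following is a small sketch under that assumption. KIND_LABELS and describeKind are hypothetical names, not part of biangbiang, and the input object is copied from the example output.

```javascript
// Labels for the three type codes documented above.
const KIND_LABELS = { 1: 'simplified', 2: 'traditional', 3: 'both' };

// Hypothetical helper (not part of biangbiang): describe an object shaped
// like the result of kind().
function describeKind(result) {
  const label = KIND_LABELS[result.type] ?? 'unknown';
  // Only mention the other form when the result actually carries one.
  return result.other ? `${label} (other form: ${result.other})` : label;
}

const description = describeKind({ type: 1, other: '麵' });
console.log(description); // simplified (other form: 麵)
```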
wordsContaining(character)
Get a list of all dictionary words containing a character, sorted in order of decreasing frequency.
wordsContaining('面');
[
{
word: '面',
index: 322,
},
{
word: '里面',
index: 706,
},
{
word: '面对',
index: 930,
},
{
word: '外面',
index: 1234,
},
{
word: '后面',
index: 1270,
},
...
]
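Because each entry carries its frequency index, the list can be narrowed to common words with a plain filter. The sketch below works on a sample copied from the output above. commonWords is a hypothetical helper, and it assumes that a lower index means a more frequent word, which matches the ordering shown.

```javascript
// First five entries of the wordsContaining('面') example above.
const words = [
  { word: '面', index: 322 },
  { word: '里面', index: 706 },
  { word: '面对', index: 930 },
  { word: '外面', index: 1234 },
  { word: '后面', index: 1270 },
];

// Hypothetical helper (not part of biangbiang): keep words whose frequency
// index is at or below maxIndex. Assumes a lower index means more frequent.
function commonWords(entries, maxIndex) {
  return entries
    .filter((entry) => entry.index >= 0 && entry.index <= maxIndex)
    .map((entry) => entry.word);
}

const common = commonWords(words, 1000);
console.log(common); // [ '面', '里面', '面对' ]
```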
Frequency
characterFrequency(character)
Get frequency statistics for a character.
characterFrequency('面');
{
symbol: '面',
index: 211,
frequency: 1631866,
percentage: 0.0006532897206780486,
cumulativePercentage: 0.7101332080329651,
}
wordFrequency(word)
Get frequency statistics for a word.
wordFrequency('面条');
{
symbol: '面条',
index: 6029,
frequency: 66879,
percentage: 0.000015823013308250793,
cumulativePercentage: 0.8864603725508198,
}
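In the examples above, percentage and cumulativePercentage lie between 0 and 1, so they read as fractions rather than values out of 100. Assuming that holds generally, a display helper might look like the sketch below. toPercent is a hypothetical name, not part of biangbiang.

```javascript
// Hypothetical helper (not part of biangbiang): render a fraction between
// 0 and 1 as a percent string with a fixed number of decimal places.
function toPercent(fraction, digits = 2) {
  return `${(fraction * 100).toFixed(digits)}%`;
}

// Values copied from the wordFrequency('面条') example above.
const stats = {
  symbol: '面条',
  index: 6029,
  frequency: 66879,
  percentage: 0.000015823013308250793,
  cumulativePercentage: 0.8864603725508198,
};

const share = toPercent(stats.percentage, 4);
const cumulative = toPercent(stats.cumulativePercentage);
console.log(`${stats.symbol}: ${share} of the corpus, ${cumulative} cumulative`);
```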
multiFrequency(sentence)
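Get frequency statistics for a body of text. In the example below, the full stop has no frequency data, so every field of its byCharacter entry is -1 and it is left out of the indices, percentages, and cumulativePercentages arrays.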
multiFrequency('我喜欢吃面条。');
{
byCharacter: [
{
symbol: '我',
index: 1,
frequency: 107133693,
percentage: 0.042889146765223256,
cumulativePercentage: 0.12608816399204145,
},
{
symbol: '喜',
index: 479,
frequency: 681772,
percentage: 0.0002729357921827617,
cumulativePercentage: 0.8216732504061582,
},
{
symbol: '欢',
index: 1490,
frequency: 140530,
percentage: 0.000056258788679270345,
cumulativePercentage: 0.9496496712024702,
},
{
symbol: '吃',
index: 42,
frequency: 9348265,
percentage: 0.0037424184526636244,
cumulativePercentage: 0.46991986609112824,
},
{
symbol: '面',
index: 211,
frequency: 1631866,
percentage: 0.0006532897206780486,
cumulativePercentage: 0.7101332080329651,
},
{
symbol: '条',
index: 169,
frequency: 2102653,
percentage: 0.0008417612665824651,
cumulativePercentage: 0.6785621013285376,
},
{
symbol: '。',
index: -1,
frequency: -1,
percentage: -1,
cumulativePercentage: -1,
},
],
indices: [1, 479, 1490, 42, 211, 169],
percentages: [
0.042889146765223256,
0.0002729357921827617,
0.000056258788679270345,
0.0037424184526636244,
0.0006532897206780486,
0.0008417612665824651,
],
cumulativePercentages: [
0.12608816399204145,
0.8216732504061582,
0.9496496712024702,
0.46991986609112824,
0.7101332080329651,
0.6785621013285376,
],
}
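One use for this result is finding the least common character in a sentence. The sketch below does that on a trimmed copy of the byCharacter array above. rarestSymbol is a hypothetical helper, not part of biangbiang, and it assumes that an index of -1 marks a symbol without frequency data, as the full stop in the example suggests.

```javascript
// Trimmed copy of the byCharacter array from the example above
// (only the fields this sketch needs).
const byCharacter = [
  { symbol: '我', index: 1 },
  { symbol: '喜', index: 479 },
  { symbol: '欢', index: 1490 },
  { symbol: '吃', index: 42 },
  { symbol: '面', index: 211 },
  { symbol: '条', index: 169 },
  { symbol: '。', index: -1 },
];

// Hypothetical helper (not part of biangbiang): return the entry with the
// highest frequency index, skipping entries marked with -1.
function rarestSymbol(entries) {
  const known = entries.filter((entry) => entry.index >= 0);
  if (known.length === 0) return null;
  return known.reduce((rarest, entry) => (entry.index > rarest.index ? entry : rarest));
}

const rarest = rarestSymbol(byCharacter);
console.log(rarest); // { symbol: '欢', index: 1490 }
```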
Components
decompose(character, depth)
Decompose a character into its components up to a specified depth. If depth is omitted, the full component tree is returned.
decompose('面');
{
丆: {
'㇐': '㇐',
'㇓': '㇓',
},
囬: {
'55103': {
'10001': {
'10001': '㇑',
},
二: {
二: '㇐',
},
},
囗: {
'⺆': {
'㇑': '㇑',
'㇆': '㇆',
},
'㇐': '㇐',
},
},
}
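The result is a nested object, so walking it takes a short recursive function. The sketch below collects the leaf values of the tree shown above. collectLeaves is a hypothetical helper, not part of biangbiang, and it assumes that every leaf is a string and every inner node is a plain object, as in the example.

```javascript
// Copy of the decompose('面') example output above.
const tree = {
  '丆': {
    '㇐': '㇐',
    '㇓': '㇓',
  },
  '囬': {
    '55103': {
      '10001': {
        '10001': '㇑',
      },
      '二': {
        '二': '㇐',
      },
    },
    '囗': {
      '⺆': {
        '㇑': '㇑',
        '㇆': '㇆',
      },
      '㇐': '㇐',
    },
  },
};

// Hypothetical helper (not part of biangbiang): gather every leaf value.
// Assumes leaves are strings and inner nodes are plain objects.
function collectLeaves(node) {
  if (typeof node === 'string') return [node];
  return Object.values(node).flatMap((child) => collectLeaves(child));
}

const leaves = collectLeaves(tree);
console.log(leaves.length); // 7
```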
charactersWithComponent(component)
Get a list of characters containing a component, sorted in order of decreasing frequency.
charactersWithComponent('囗');
[
{ character: '回', index: 139 },
{ character: '图', index: 166 },
{ character: '口', index: 307 },
{ character: '因', index: 381 },
{ character: '西', index: 382 },
{ character: '团', index: 388 },
{ character: '困', index: 413 },
{ character: '国', index: 544 },
{ character: '围', index: 644 },
{ character: '圈', index: 717 },
...
]
How it works
JSON files containing character/word/component information are generated by /src/prepare.js from raw files contained in /data/raw, with outputs saved to /data/processed.
The preparation script can also be run with npm run prepare or yarn prepare.
Sources
- Dictionary entries are entirely from CEDICT
- Frequency statistics are from BCC_LEX_Zh
- Character composition entries are from CJK-decomp
This project was inspired by HanziJS and offers much of the same functionality.