nodescws
scws
About
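scws stands for Simple Chinese Word Segmentation. It is a mechanical Chinese word segmentation engine written in C and based on a word-frequency dictionary. scws was written by hightman and is released under the BSD license. The author of nodescws added features on top of libscws (including stop words, punctuation ignoring, and JSON-format configuration) and added a node.js binding; apart from that code, the nodescws author holds no copyright over libscws.
scws homepage: http://www.xunsearch.com/scws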
GitHub: https://github.com/hightman/scws
nodescws
Current release: v0.5.1
Install
npm install scws
Usage
var Scws = require("scws");
var scws = new Scws(settings);
var results = scws.segment(text);
scws.destroy();
new Scws(settings)
Note: before v0.5.0, use new Scws.init(settings) to initialize.
- settings: Object. Segmentation settings; supports charset, dicts, rule, ignorePunct, multi, debug, applyStopWord:
  - charset: String, Optional
    The encoding to use. Supported values are "utf8" and "gbk"; the default is "utf8".
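  - dicts: String, Required
    Filenames of the dictionary files to use; separate multiple files with ':'.
    Both xdb and txt formats are supported; give custom dictionaries a ".txt" file extension.
    For example: "./dicts/dict.utf8.xdb:./dicts/dict_cht.utf8.xdb:./dicts/dict.test.txt"
    The xdb dictionaries that ship with scws are in ./dicts/ under this extension's directory (usually node_modules/scws/),
    in Simplified and Traditional Chinese variants. If this setting is missing, the bundled utf8 Simplified Chinese dictionary is used by default.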
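  - rule: String, Optional
    The rule file to use; it defines place names, personal names, stop words, and so on for the corresponding encoding.
    See rules/rules.utf8.ini under this extension's directory (usually node_modules/scws/).
    If this setting is missing, the bundled utf8 rule file is used by default.
    v0.2.3 added JSON support to avoid the cumbersome ini syntax.
    If the value ends with .json, the corresponding JSON rule file is parsed; a JSON string can also be passed directly as the configuration. For the syntax, see ./rules/rules.utf8.json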
  - ignorePunct: Bool, Optional
    Whether to ignore punctuation.
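  - multi: String, Optional
    Whether to perform compound segmentation of long words; for example, the word 中国人 yields multiple results: "中国人", "中国", and "人". Possible values are "short", "duality", "zmain", "zall":
    short: short words
    duality: combine two adjacent single characters
    zmain: important single characters
    zall: all single characters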
  - debug: Bool, Optional
    Whether to run in debug mode. If true, scws writes its log, warning, and error output to stdout. The default is false.
  - applyStopWord: Bool, Optional
    Whether to apply the stop words specified in the [nostats] block of the rule file. The default is true.
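The option list above can be assembled programmatically. The following is a minimal sketch, not part of scws: buildSettings is a hypothetical helper that joins an array of dictionary paths with ':' (as the dicts option expects) and rejects charset or multi values outside the documented ones. The option names and allowed values are taken from the list above; everything else is an assumption made for illustration.

```javascript
"use strict";

// Allowed values, copied from the option list in this README.
const CHARSETS = ["utf8", "gbk"];
const MULTI_MODES = ["short", "duality", "zmain", "zall"];

// Hypothetical helper (not part of scws): builds a settings object
// and joins dictionary paths with ':' as the dicts option expects.
function buildSettings(options) {
  const opts = options || {};
  const settings = {};

  if (opts.charset !== undefined) {
    if (!CHARSETS.includes(opts.charset)) {
      throw new Error("unsupported charset: " + opts.charset);
    }
    settings.charset = opts.charset;
  }

  if (opts.dicts !== undefined) {
    // Accept either a single path or an array of paths.
    const list = Array.isArray(opts.dicts) ? opts.dicts : [opts.dicts];
    settings.dicts = list.join(":");
  }

  if (opts.multi !== undefined) {
    if (!MULTI_MODES.includes(opts.multi)) {
      throw new Error("unsupported multi mode: " + opts.multi);
    }
    settings.multi = opts.multi;
  }

  // Pass the remaining documented options through unchanged.
  for (const key of ["rule", "ignorePunct", "debug", "applyStopWord"]) {
    if (opts[key] !== undefined) {
      settings[key] = opts[key];
    }
  }

  return settings;
}
```

The resulting object could then be passed to new Scws(...). Validating early keeps a misspelled multi value from reaching the native binding, whose behavior on unknown values is not documented here.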
scws.segment(text)
Returns: Array
[
  {
    word: '可读性',
    offset: 183,
    length: 9,
    attr: 'n',
    idf: 7.800000190734863
  },
  ...
]
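A short sketch of how the returned array might be post-processed. In the sample above the three-character word '可读性' has length 9, which suggests that offset and length count bytes of the UTF-8 input rather than characters; this is an assumption inferred from the sample, not something the README states. sliceWord and topByIdf are hypothetical helpers, not part of scws.

```javascript
"use strict";

// Hypothetical helper: recovers a word from the source text,
// ASSUMING offset and length are byte positions in the UTF-8 input.
function sliceWord(text, token) {
  const bytes = Buffer.from(text, "utf8");
  return bytes
    .subarray(token.offset, token.offset + token.length)
    .toString("utf8");
}

// Hypothetical helper: keeps tokens whose attr starts with attrPrefix
// and returns the `limit` entries with the highest idf.
function topByIdf(tokens, attrPrefix, limit) {
  return tokens
    .filter((t) => typeof t.attr === "string" && t.attr.startsWith(attrPrefix))
    .sort((a, b) => b.idf - a.idf) // filter() returned a copy, so the input is not mutated
    .slice(0, limit);
}
```

For example, topByIdf(results, "n", 10) would return up to ten tokens whose attr begins with 'n', ordered by descending idf.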
Example
var fs = require("fs");
var Scws = require("scws");

fs.readFile("./test_doc.txt", {
  encoding: "utf8"
}, function (err, data) {
  if (err)
    return console.error(err);

  // Create one instance and reuse it for both segment() calls
  var scws = new Scws({
    charset: "utf8",
    dicts: "./dicts/dict.utf8.xdb",
    rule: "./rules/rules.utf8.ini",
    ignorePunct: true,
    multi: "duality",
    debug: true
  });
  var res = scws.segment(data);
  var res1 = scws.segment("大家好我来自德国,我是德国人");
  console.log(res, res1);
  // Destroy the instance once it is no longer needed
  scws.destroy();
});
For more, see the tests in test/.
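Because each instance has to be released with destroy(), it can help to wrap the create, segment, destroy sequence so that cleanup happens even when segment() throws. The following is a minimal sketch under that assumption; segmentAll is a hypothetical helper, and the constructor is passed in as a parameter so the sketch does not depend on how the module is loaded.

```javascript
"use strict";

// Hypothetical helper (not part of scws): creates one instance,
// segments every text with it, and always calls destroy().
// ScwsCtor is expected to be the constructor exported by require("scws").
function segmentAll(ScwsCtor, settings, texts) {
  const scws = new ScwsCtor(settings);
  try {
    // Reuse the same instance for every input text.
    return texts.map((text) => scws.segment(text));
  } finally {
    // Runs on success and on error, so the instance is not leaked.
    scws.destroy();
  }
}
```

A call might look like segmentAll(require("scws"), { dicts: "./dicts/dict.utf8.xdb" }, [textA, textB]), where textA and textB are placeholder strings.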
Changelog
v0.5.1
- Fix macOS build issue #18 (thanks to agj)
v0.5.0
- Update NAN, supports all major node.js versions
- New js API design
- Fix #11
v0.2.4
v0.2.3
- Changed project structure
- Refactored node bindings
- Added rule setting by JSON file and JSON string, making it easier to add stop words from node
v0.2.2
- Some small bug fixes, including issue #5 (thanks to @Frully)
v0.2.1
- Add stop words support
- Remove line endings when ignorePunct is set to true
You can add your own stop words in the [nostats] entry of the rule file. Turn off the stop words feature by setting applyStopWord to false.
v0.2.0
- New syntax to initialize scws: scws = new Scws(config); result = scws.segment(text); scws.destroy(). The scws instance can now be reused, which greatly improves performance under repeated use (approximately 1/4 faster).
- Added new setting entry debug. Setting config.debug = true makes scws output its log, error, and warning messages to stdout.
v0.1.3
Published to npm registry. usage: scws(text, settings);
available setting entries: charset, dicts, rule, ignorePunct, multi.
Contributors