sentence-spliter

This is a sentence-splitting tool that currently supports English and Chinese.

  • 2.1.8
  • PyPI

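Introduction

sentence-spliter is a sentence-splitting tool: it splits a long sentence or paragraph into a list of shorter sentences. It supports natural splitting, mid-sentence splitting, and other modes.

Currently supported languages: Chinese, English, Korean

Architecture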

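  • Project structure
.
├── doc                     # Supplementary documentation
├── LICENSE                 # License
├── MANIFEST.in             # Lists extra files to include during setup
├── pyproject.toml          # Build configuration for the project
├── README.md
├── requirements.txt
├── sentence_spliter
│   ├── architect           # Basic building blocks for sentence splitting
│   ├── cutter4grammar      # Sentence splitting customized for grammar correction
│   ├── en_cutter           # English sentence splitting
│   ├── test                # Unit tests
│   ├── utility             # Other utility functions
│   └── zh_cutter           # Chinese sentence splitting
└── setup.py                # setup.py

For a more detailed directory structure, see the link.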

Setup

Install from git

git clone git@git.duowan.com:ai/nlp/sentence-spliter.git
cd sentence-spliter
pip install -U pip
pip install -r requirements.txt

Install from PyPI

pip install sentence_spliter

API

Request example

curl --location --request POST 'https://rosetta-nlp-api.duowan.com/api/v1/sentence-spliter/en-sentence-spliter' \
--header 'Content-Type: application/json' \
--data-raw '{
  "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] ,
  "options": {
      "max_len": 30,
      "min_len": 6
  }
} '
  • Request
{
  "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] ,
  "options": {
      "max_len": 30,
      "min_len": 6
  }
} 
  • Response
{
    "code": 0,
    "data": {
        "paragraphs": [
            "A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
        ],
        "sub_sentences": [
            [
                [
                    "A long time ago..... there is a mountain, and there is a temple in the mountain!!!"
                ],
                [
                    " And here is an old monk in the temple!?...."
                ]
            ]
        ],
        "version": "1.0.0"
    },
    "message": "success"
}

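Response parameters

| Field | Type | Description |
| --- | --- | --- |
| paragraphs | String | List of paragraphs to split |
| sub_sentences | String | Sub-sentences produced by the split |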
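
The nested `sub_sentences` field in the response can be awkward to consume directly. As a minimal sketch, assuming the response always has the shape shown in the example (one entry per paragraph, each holding a list of sentence groups) and that `code == 0` means success, the helper below flattens it into one list of sentences per paragraph. The name `flatten_sub_sentences` is hypothetical and is not part of the package or the API.

```python
import json

# Sample response from the example above, trimmed to the fields used here.
SAMPLE_RESPONSE = json.loads("""
{
    "code": 0,
    "data": {
        "sub_sentences": [
            [
                ["A long time ago..... there is a mountain, and there is a temple in the mountain!!!"],
                [" And here is an old monk in the temple!?...."]
            ]
        ],
        "version": "1.0.0"
    },
    "message": "success"
}
""")


def flatten_sub_sentences(response: dict) -> list:
    """Return one flat list of sentences per input paragraph.

    Assumption: code == 0 signals success, as in the sample response.
    """
    if response.get("code") != 0:
        raise ValueError(f"request failed: {response.get('message')}")
    flattened = []
    for paragraph_groups in response["data"]["sub_sentences"]:
        # Each group is itself a list of strings; concatenate the groups in order.
        flattened.append([s for group in paragraph_groups for s in group])
    return flattened


sentences_per_paragraph = flatten_sub_sentences(SAMPLE_RESPONSE)
```

Note that the second sentence keeps its leading space, exactly as the API returns it; strip it on the caller side if that matters.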

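For more details about the API, see the API documentation.

Important: a change to the version field affects whether the Guangdong department needs to rerun its pipeline (see the link).

State machine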

Data

The two main auxiliary data files are:

  • Whitelist: /white_list.txt
  • Weight list: /weights_list.txt

Format

Whitelist format:

Dr.
U!S!A!
No.
abbr.
Brig.
Ltd.
b.
N.
hr.

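One string per line. When the algorithm encounters an end-of-sentence symbol inside a string recorded in the whitelist, that symbol is not treated as a sentence terminator.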
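
To make the whitelist rule concrete, here is a minimal sketch of such a check. It is not the package's implementation: the function names, the set of end symbols, and the word-boundary rule (an entry only matches when it does not start in the middle of a word) are all assumptions made for illustration.

```python
END_SYMBOLS = ".!?"  # assumed set of end-of-sentence symbols


def load_whitelist(lines):
    """Build a set of whitelist entries from an iterable of lines (one entry per line)."""
    return {line.strip() for line in lines if line.strip()}


def is_sentence_end(text: str, index: int, whitelist: set) -> bool:
    """Return True if text[index] is an end symbol that no whitelist entry covers."""
    if text[index] not in END_SYMBOLS:
        return False
    for entry in whitelist:
        # Try every alignment of the entry that would cover position `index`.
        for offset in range(len(entry)):
            start = index - offset
            if start < 0 or not text.startswith(entry, start):
                continue
            # Assumed rule: the entry must not begin in the middle of a word,
            # so that "b." does not match the end of "club.".
            if start == 0 or not text[start - 1].isalnum():
                return False
    return True


whitelist = load_whitelist(["Dr.", "No.", "Ltd.", "b."])
text = "Dr. Smith joined the club. He left."
ends = [i for i in range(len(text)) if is_sentence_end(text, i, whitelist)]
```

Here `ends` holds the positions of the periods after "club" and "left", while the period in "Dr." is skipped.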

Weight list format:

and 10
or 10
but 10
even 10
however 10
whenever 10
whatever 10
although 10
thought 10

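Each line has the format word + weight. The weight indicates how strongly a word that marks a transition, continuation, or similar link to the surrounding context should be preferred as a split point when a sentence has to be cut internally.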
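
The sketch below shows one way a weight list in this format could be parsed and used to pick a split point inside a long sentence. How the package actually applies the weights is not described here, so the selection policy (take the highest-weighted word, first one wins on ties, never split before the first word) and the function names are assumptions.

```python
def parse_weights(lines):
    """Parse 'word weight' lines into a dict; malformed lines are skipped."""
    weights = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, weight = parts
        weights[word.lower()] = int(weight)
    return weights


def best_split_index(words, weights):
    """Return the index of the highest-weighted word, or None if no word is weighted.

    Assumed policy: index 0 is never chosen (it would leave an empty left part),
    and the first candidate wins when several share the top weight.
    """
    best_index, best_weight = None, 0
    for i, word in enumerate(words):
        if i == 0:
            continue
        weight = weights.get(word.lower(), 0)
        if weight > best_weight:
            best_index, best_weight = i, weight
    return best_index


weights = parse_weights(["and 10", "or 10", "but 10"])
words = "I like chicken but I also like fish".split()
split_at = best_split_index(words, weights)
left, right = words[:split_at], words[split_at:]
```

With these inputs the sentence is cut in front of "but", giving "I like chicken" and "but I also like fish".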

Overview

The following sentence is used as the sample:

sentence = 'I like chicken. I like chicken.'

Sequence

The Sequence module first converts the sentence to be split into a special sequence format.

graph LR
A[I like chicken.] -->B[I]
subgraph sequence
    B -->C[<space>]
    C -->D[like]
    D --> E[<space>]
    E --> F[chicken]
    F --> G[.]
end

The sequence is then passed directly into the state machine.
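
As a rough illustration of the sequence format in the diagram, the sketch below tokenizes a sentence into words, single whitespace characters (shown as `<space>`), and punctuation marks. It only mirrors the diagram; the package's own `Sequence` class may tokenize differently, and `to_sequence` is a hypothetical name.

```python
import re

# A run of word characters, a single whitespace character, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|\s|[^\w\s]")


def to_sequence(sentence: str) -> list:
    """Split a sentence into word, <space>, and punctuation tokens."""
    return ["<space>" if tok.isspace() else tok
            for tok in TOKEN_PATTERN.findall(sentence)]


print(to_sequence("I like chicken."))
# ['I', '<space>', 'like', '<space>', 'chicken', '.']
```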

Condition and Operation

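The Condition module represents a condition or check that is evaluated before an action is executed: if the condition is met, the corresponding action runs; otherwise the action for the unmet condition runs.

The Operation module represents an action, also called an operation.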

graph LR
A{Condition} -->|True| B[Operation1]
A -->|False| C[Operation2]
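
The branch in the diagram can be sketched with plain callables. This is an illustration of the pattern only, not the classes shipped in the package: the dict-based state, the `branch` helper, and the example condition and operations are all hypothetical.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Condition = Callable[[State], bool]
Operation = Callable[[State], State]


def branch(condition: Condition, on_true: Operation, on_false: Operation) -> Operation:
    """Combine a condition and two operations into a single step."""
    def step(state: State) -> State:
        if condition(state):
            return on_true(state)
        return on_false(state)
    return step


def is_end_symbol(state: State) -> bool:
    """Condition: the current token is an end-of-sentence symbol."""
    return state["token"] in {".", "!", "?"}


def mark_cut(state: State) -> State:
    """Operation1: record that the sentence should be cut here."""
    return {**state, "cut": True}


def keep_going(state: State) -> State:
    """Operation2: record that no cut happens here."""
    return {**state, "cut": False}


step = branch(is_end_symbol, mark_cut, keep_going)
```

Calling `step({"token": "."})` returns a state with `cut` set to `True`, while `step({"token": "like"})` returns one with `cut` set to `False`. Because `branch` itself returns an operation, steps built this way can be nested inside other branches.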
    
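Condition&Operation module

A module composed of a series of the Condition&Operation units shown in the diagram above.

It represents a chain of checks and actions combined and stacked together,

which in turn gives: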

graph LR
A{Condition1} -->|True| B[Condition&Operation1]
A -->|False| C[Condition&Operation2]
B -->D[Condition&Operation3]
C -->E[Condition&Operation4]

Logic

The Condition&Operation modules above make up the whole Logic.

Stacking all the Condition&Operation modules further yields the complete logic graph.
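
To show how stacked conditions and operations can add up to a complete splitting logic, here is a deliberately small state machine. It is a sketch under stated assumptions, not the package's `SimpleLogic`: it has two states, treats `.`, `!`, and `?` as end symbols, cuts once a run of end symbols is over, and ignores the whitelist and length limits entirely.

```python
import re

END_SYMBOLS = {".", "!", "?"}  # assumed end-of-sentence symbols


def tokenize(text: str) -> list:
    """Split text into words, single whitespace characters, and punctuation marks."""
    return re.findall(r"\w+|\s|[^\w\s]", text)


def run_logic(text: str) -> list:
    """Walk the token sequence and cut after each run of end symbols."""
    sentences, buffer = [], []
    state = "IN_SENTENCE"
    for token in tokenize(text):
        # Condition: we are past an end symbol and the run of end symbols is over.
        if state == "AFTER_END" and token not in END_SYMBOLS:
            # Operation: cut, emitting the buffered tokens as one sentence.
            sentences.append("".join(buffer))
            buffer = []
            state = "IN_SENTENCE"
        # Operation: always consume the current token.
        buffer.append(token)
        # Condition: the current token is an end symbol, so switch state.
        if token in END_SYMBOLS:
            state = "AFTER_END"
    if buffer:
        sentences.append("".join(buffer))
    return sentences


print(run_logic("I like chicken. I like chicken."))
# ['I like chicken.', ' I like chicken.']
```

The pieces concatenate back to the original text, and the second sentence keeps its leading space, which matches the style of the API response shown earlier.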

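Usage

  • Import the required packages
from sentence_spliter.en_cutter.en_sequence import Sequence                 # Sequence class from the English splitting framework
from sentence_spliter.en_cutter.logic import SimpleLogic, LongShortLogic   # Logic classes from the English splitting framework
  • Load the sentence as a Sequence object
sentence = 'I like chicken. I like chicken.'                                # Sample sentence
seq = Sequence(sentence)                                                    # Convert it to a sequence
simple_logic = SimpleLogic()                                                # Natural splitting logic
long_logic = LongShortLogic(max_len=30, min_len=6)                          # Long/short splitting; example limits taken from the API request above
  • Run the splitter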
simple_result = simple_logic.run(seq, debug=True)
long_results = long_logic.run(seq, debug=True)

Packaging and uploading

  • Open setup.py and update the relevant settings (version, etc.)
from setuptools import setup, find_packages

setup(
    name="sentence-spliter",
    version="X.X.X",
    author="<your name>",
    author_email="<your email>",
    ...
)
  • Run the following command from the project root
./bin/package.sh
  • Enter your username and password
Enter your username: <your username>
Enter your password: <your password>
  • Wait for the upload to finish

For a detailed tutorial, see the link.
