
###############
Tokenizer Tools
###############
.. image:: https://img.shields.io/pypi/v/tokenizer_tools.svg
        :target: https://pypi.python.org/pypi/tokenizer_tools

.. image:: https://travis-ci.com/howl-anderson/tokenizer_tools.svg?branch=master
        :target: https://travis-ci.com/howl-anderson/tokenizer_tools

.. image:: https://readthedocs.org/projects/tokenizer-tools/badge/?version=latest
        :target: https://tokenizer-tools.readthedocs.io/en/latest/?badge=latest
        :alt: Documentation Status

.. image:: https://pyup.io/repos/github/howlandersonn/tokenizer_tools/shield.svg
        :target: https://pyup.io/repos/github/howlandersonn/tokenizer_tools/
        :alt: Updates

Tools and utilities for NLP, including dataset reading, tagset encoding and decoding, and metrics computation.

Features
--------

* `BMES scheme <tokenizer_tools/tagset/BMES.py>`_
* `BILUO scheme <tokenizer_tools/tagset/NER/BILUO.py>`_
* `IOB scheme <tokenizer_tools/tagset/NER/IOB.py>`_

Functionality
-------------

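This package provides an on-disk file format for storing corpora (tentatively named conllx) and an in-memory object format (tentatively named offset).

Task: read the ``corpus.conllx`` file and print every corpus item in turn.

Code:
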
.. code-block:: python

    from tokenizer_tools.tagset.offset.corpus import Corpus

    corpus = Corpus.read_from_file("corpus.conllx")
    for document in corpus:
        print(document)  # document is a single corpus item

Task: write several corpus items to the ``corpus.conllx`` file.

Code:

.. code-block:: python

    from tokenizer_tools.tagset.offset.corpus import Corpus

    # corpus_item_one and corpus_item_two are Document objects created elsewhere
    corpus_list = [corpus_item_one, corpus_item_two]
    corpus = Corpus(corpus_list)
    corpus.write_to_file("corpus.conllx")

Each corpus item is a ``Document`` object. Its attributes and methods are described below.

text
^^^^

Type ``list``. Holds the text.

domain
^^^^^^

Type string. The domain.

function
^^^^^^^^

Type string. The function point.

sub_function
^^^^^^^^^^^^

Type string. The sub-function point.

intent
^^^^^^

Type string. The intent.

entities
^^^^^^^^

Type ``SpanSet``. The entities; described in detail below.

compare_entities
^^^^^^^^^^^^^^^^

Compares whether the text and the entities match.

convert_to_md
^^^^^^^^^^^^^

Converts the text and entities to Markdown, for rendering as plain-text output.

__iter__
^^^^^^^^

A ``SpanSet`` can be accessed like a list; each element is a ``Span`` object.

check_overlap
^^^^^^^^^^^^^

Checks whether any spans overlap.

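To make the idea of overlapping spans concrete, here is a minimal sketch of an overlap check over half-open ``[start, end)`` intervals. It is an illustration only, not the library's implementation: ``SimpleSpan`` and ``has_overlap`` are hypothetical names, and the sketch assumes every span is non-empty.

```python
from typing import List, NamedTuple


class SimpleSpan(NamedTuple):
    """Hypothetical stand-in for a span; covers positions start..end-1."""

    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    entity: str


def has_overlap(spans: List[SimpleSpan]) -> bool:
    """Return True if any two (non-empty) spans share a position."""
    ordered = sorted(spans, key=lambda s: (s.start, s.end))
    # Once sorted by start, any overlap shows up between neighbours:
    # the right span begins before the left span has ended.
    for left, right in zip(ordered, ordered[1:]):
        if right.start < left.end:
            return True
    return False


touching = [SimpleSpan(0, 2, "PERSON"), SimpleSpan(2, 4, "GPE")]
crossing = [SimpleSpan(0, 3, "PERSON"), SimpleSpan(2, 4, "GPE")]
print(has_overlap(touching), has_overlap(crossing))
```

Because ``end`` is exclusive, two spans that merely touch (one ends where the next starts) do not count as overlapping in this sketch.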
start
^^^^^

``int``, 0-based, inclusive.

end
^^^

``int``, 0-based, exclusive.

entity
^^^^^^

String. The entity type.

value
^^^^^

String. The value of the entity.

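The ``start``/``end``/``entity`` triple maps naturally onto per-token tags such as BILUO (Begin, Inside, Last, Unit, Outside). The sketch below shows one way to encode and decode that mapping. It is a self-contained illustration under stated assumptions, not the code in ``tokenizer_tools/tagset/NER/BILUO.py``: the tuple type ``SpanTuple`` and the functions ``spans_to_biluo`` and ``biluo_to_spans`` are hypothetical, and the input spans are assumed to be non-overlapping and non-empty.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end, entity), with end exclusive.
SpanTuple = Tuple[int, int, str]


def spans_to_biluo(length: int, spans: List[SpanTuple]) -> List[str]:
    """Encode non-overlapping spans as one BILUO tag per token."""
    tags = ["O"] * length
    for start, end, entity in spans:
        if end - start == 1:
            tags[start] = "U-" + entity        # single-token entity
        else:
            tags[start] = "B-" + entity        # first token
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + entity        # tokens strictly inside
            tags[end - 1] = "L-" + entity      # last token
    return tags


def biluo_to_spans(tags: List[str]) -> List[SpanTuple]:
    """Decode well-formed BILUO tags back into (start, end, entity) spans."""
    spans: List[SpanTuple] = []
    start = None
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "U":
            spans.append((i, i + 1, entity))
        elif prefix == "B":
            start = i                          # remember where the entity began
        elif prefix == "L" and start is not None:
            spans.append((start, i + 1, entity))
            start = None
    return spans


text = ["Alice", "Mary", "Smith", "visited", "New", "York"]
spans = [(0, 3, "PERSON"), (4, 6, "GPE")]
tags = spans_to_biluo(len(text), spans)
print(tags)
print(biluo_to_spans(tags))
```

Keeping ``end`` exclusive means ``text[start:end]`` yields exactly the entity's tokens, and decoding the tags returns the original spans.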
TODO

Credits
-------

This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage

#######
History
#######

0.1.0 (2018-09-05)