Python binding for J.DepP (a C++ implementation of Japanese dependency parsers): https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/
$ python -m pip install jdepp
Note that pip install does not install the model (dictionary).
You can get precompiled model files (MeCab POS tagging, trained on the KNBC corpus) from
https://github.com/lighttransport/jdepp-python/releases/tag/v0.1.0
The precompiled KNBC model file is licensed under the 3-clause BSD license.
FEATURE_SEP ','
See jdepp/typedf.h for more info about ifdef macros.
Download the precompiled model file.
$ wget https://github.com/lighttransport/jdepp-python/releases/download/v0.1.0/knbc-mecab-jumandic-2ndpoly.tar.gz
$ tar xvf knbc-mecab-jumandic-2ndpoly.tar.gz
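Where tar is not available (for example on some Windows setups), the extraction step can be sketched with Python's standard library. This is a minimal sketch, assuming the archive has already been downloaded; `extract_model` is a hypothetical helper name and is not part of jdepp.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, dest_dir="."):
    """Extract a .tar.gz model archive into dest_dir and return dest_dir as a Path.

    Hypothetical helper for illustration; not part of the jdepp package.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        try:
            # The "data" filter rejects unsafe members such as absolute paths.
            # It is only accepted by recent Python versions.
            tar.extractall(dest, filter="data")
        except TypeError:
            # Older Python versions do not accept the `filter` argument.
            tar.extractall(dest)
    return dest
```

For example, `extract_model("knbc-mecab-jumandic-2ndpoly.tar.gz")` unpacks the archive into the current directory. The Python example in this README loads the model from model/knbc, so that directory is expected to exist afterwards.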
import jdepp
model_path = "model/knbc"
parser = jdepp.Jdepp()
parser.load_model(model_path)
# NOTE: MeCab format: surface + TAB + feature (7 comma-separated fields).
# The separator after each surface form must be a literal TAB character.
input_postagged = """吾輩 名詞,普通名詞,*,*,吾輩,わがはい,代表表記:我が輩/わがはい カテゴリ:人
は 助詞,副助詞,*,*,は,は,*
猫 名詞,普通名詞,*,*,猫,ねこ,*
である 判定詞,*,判定詞,デアル列基本形,だ,である,*
。 特殊,句点,*,*,。,。,*
名前 名詞,普通名詞,*,*,名前,なまえ,*
は 助詞,副助詞,*,*,は,は,*
まだ 副詞,*,*,*,まだ,まだ,*
ない 形容詞,*,イ形容詞アウオ段,基本形,ない,ない,*
。 特殊,句点,*,*,。,。,*
EOS
"""
sent = parser.parse_from_postagged(input_postagged)
print(sent)
print(jdepp.to_tree(str(sent)))
# S-ID: 1; J.DepP
0: 吾輩は━━┓
1: 猫である。━━┓
2: 名前は━━┫
3: まだ━━┫
4: ない。EOS
jdepp.to_dot is provided to export the graph in dot (Graphviz) format.
dot_text = jdepp.to_dot(str(sent))
# feed output text to graphviz viewer, e.g. https://dreampuf.github.io/GraphvizOnline/
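As an alternative to an online viewer, the DOT text can be written to a file and rendered locally. The following is a minimal sketch, assuming the Graphviz `dot` command is installed and on PATH; `save_dot` and `render_png` are hypothetical helper names, not part of jdepp.

```python
import shutil
import subprocess
from pathlib import Path


def save_dot(dot_text, path="sentence.dot"):
    """Write DOT text to a UTF-8 file and return its path (hypothetical helper)."""
    out = Path(path)
    out.write_text(dot_text, encoding="utf-8")
    return out


def render_png(dot_path, png_path="sentence.png"):
    """Render a DOT file to PNG with the Graphviz `dot` CLI.

    Returns the PNG path, or None if `dot` is not found on PATH.
    """
    if shutil.which("dot") is None:
        return None
    subprocess.run(["dot", "-Tpng", str(dot_path), "-o", str(png_path)], check=True)
    return Path(png_path)
```

Typical usage would be `render_png(save_dot(dot_text))`, where `dot_text` is the string returned by jdepp.to_dot. Rendering Japanese labels may also require a suitable font to be configured in Graphviz.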
See examples/ for more details
POS-tagged input uses MeCab style: surface + TAB + feature (7 comma-separated fields), with the input terminated by an EOS line.
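To make the format concrete, here is a minimal sketch that assembles such input from (surface, feature) pairs. `to_mecab_format` is a hypothetical helper, not part of jdepp; the feature strings are taken from the example earlier in this README.

```python
def to_mecab_format(tokens):
    """Join (surface, feature) pairs into MeCab-style text terminated by EOS.

    Hypothetical helper for illustration; not part of the jdepp package.
    """
    # One token per line: surface, a literal TAB, then the feature string.
    lines = [f"{surface}\t{feature}" for surface, feature in tokens]
    lines.append("EOS")
    return "\n".join(lines) + "\n"


tokens = [
    ("猫", "名詞,普通名詞,*,*,猫,ねこ,*"),
    ("である", "判定詞,*,判定詞,デアル列基本形,だ,である,*"),
]
text = to_mecab_format(tokens)
```

The resulting string has the shape that parse_from_postagged is given in the examples in this README.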
You can use jagger-python for POS tagging.
import jagger
import jdepp
jagger_model_path = "model/kwdlc/patterns"
tokenizer = jagger.Jagger()
tokenizer.load_model(jagger_model_path)
text = "吾輩は猫である。名前はまだない。"
toks = tokenizer.tokenize(text)
pos_tagged_input = ""
for tok in toks:
    # Build one MeCab-style line per token: surface + TAB + feature
    pos_tagged_input += tok.surface() + '\t' + tok.feature() + '\n'
pos_tagged_input += "EOS\n"
jdepp_model_path = "model/knbc"
parser = jdepp.Jdepp()
parser.load_model(jdepp_model_path)
parser.parse_from_postagged(pos_tagged_input)
If you just want to use J.DepP from the CLI (e.g. for batch processing), you can build a standalone C++ app using CMake.
We modified the J.DepP source code to improve portability (e.g. ours works well on Windows).
Training a model from the Python binding is not yet supported. For the time being, you can train a model using the standalone C++ jdepp app.
This is for developer use cases. To build the Python module for end users, use setup.py (pyproject.toml).
Install pybind11 devkit.
$ python -m pip install pybind11
Then invoke cmake with -DJDEPP_WITH_PYTHON and pybind11_DIR set.
$ pybind11_DIR=/path/to/pybind11 cmake -DJDEPP_WITH_PYTHON=1 ...
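If pybind11 was installed through pip, the directory to pass as pybind11_DIR can be looked up from Python. This is a minimal sketch using `pybind11.get_cmake_dir()`; `find_pybind11_cmake_dir` is a hypothetical wrapper name, not part of this project.

```python
def find_pybind11_cmake_dir():
    """Return pybind11's CMake config directory, or None if it is unavailable.

    Hypothetical wrapper for illustration; not part of jdepp-python.
    """
    try:
        import pybind11

        # get_cmake_dir() raises ImportError when the CMake files are missing.
        return pybind11.get_cmake_dir()
    except ImportError:
        return None


if __name__ == "__main__":
    print(find_pybind11_cmake_dir())
```

The printed path can then be supplied as pybind11_DIR when invoking cmake. Running `python -m pybind11 --cmakedir` prints the same directory from the shell.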
git tag vX.Y.Z
git push --tags
Versioning is handled automatically by setuptools_scm.
jdepp-python is licensed under 2-Clause BSD license.
J.DepP https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/ is licensed under GPLv2/LGPLv2.1/BSD triple license.