
“TinySegmenter in Python” is a Python port_ by Masato Hagiwara of TinySegmenter_, which is an extremely compact Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo.
The library was eventually packaged by Jehan. This resulted in a fork because Masato Hagiwara did not answer emails, so the packaging patches could not be committed upstream. It is a friendly fork, though, and Masato Hagiwara is welcome to take back maintenance of his project. For the time being, I (Jehan) have taken up the maintenance, so please treat this new website_ as official and direct any new patch_ there. I will follow up on patches and bug reports, but probably won't pursue active development. Anyone wishing to improve the library is welcome to participate and will gladly be given commit rights.
It works on Python 2.6 or above (works on Python 3 too).
.. _port: http://lilyx.net/tinysegmenter-in-python/
.. _TinySegmenter: http://chasen.org/~taku/software/TinySegmenter/
.. _website: http://tinysegmenter.tuxfamily.org/
See all authors and contributors in the ``AUTHORS`` file.
This library can be installed in the usual ways: with ``setup.py``, as a pip package, and so on.
See the ``INSTALL`` file in the package for more details.
If you simply want to download the source package, refer to the PyPI repository: http://pypi.python.org/pypi/tinysegmenter
The development version can be downloaded anonymously from the Git repository::

    $ git clone git://git.tuxfamily.org/gitroot/tinysegmente/tinysegmenter.git

or browsed online at: http://git.tuxfamily.org/tinysegmente/tinysegmenter/
Example code for direct usage::

    >>> import tinysegmenter
    >>> segmenter = tinysegmenter.TinySegmenter()
    >>> print(' | '.join(segmenter.tokenize(u"私の名前は中野です")))
    私 | の | 名前 | は | 中野 | です
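
The same pattern extends to several lines of text. The following is only a minimal sketch, assuming that ``tokenize`` returns a list of strings as the example above suggests; the helper name ``segment_lines`` and its ``sep`` parameter are hypothetical and not part of the library.

```python
def segment_lines(segmenter, lines, sep=" | "):
    """Tokenize each line and join its tokens with `sep`.

    `segmenter` is any object exposing tokenize(text) -> list of str.
    This helper is illustrative only; it is not part of tinysegmenter.
    """
    return [sep.join(segmenter.tokenize(line)) for line in lines]


if __name__ == "__main__":
    try:
        import tinysegmenter
    except ImportError:
        # The helper itself has no dependency on the package.
        print("tinysegmenter is not installed")
    else:
        segmenter = tinysegmenter.TinySegmenter()
        for row in segment_lines(segmenter, [u"私の名前は中野です"]):
            print(row)
```

Because the helper only relies on a ``tokenize`` method, any other tokenizer with the same interface could be passed in its place.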
TinySegmenter's interface is compatible with NLTK's ``TokenizerI`` class, although the distribution does not directly depend on NLTK.
Here is one way to use it as a tokenizer in NLTK (the order of the base classes matters)::

    import nltk.tokenize.api
    import tinysegmenter

    class myTinySegmenter(tinysegmenter.TinySegmenter, nltk.tokenize.api.TokenizerI):
        pass

    segmenter = myTinySegmenter()
    # This segmenter can be used anywhere an NLTK TokenizerI subclass is expected.

For more about NLTK (Natural Language Toolkit module), see: http://nltk.org/api/nltk.tokenize.html#nltk.tokenize.api.TokenizerI
.. _patch:

All bug reports, patches, questions, etc. can be sent to tinysegmenter at zemarmot dot net.
This package is distributed under a New BSD License (see the ``COPYING`` file).