A modified version of langid.c with Python bindings: a straightforward replacement for langid.py that offers the same features but runs about 200 times faster.
pip install langid-pyc
from langid_pyc import (
classify,
rank,
)
classify("This is English text")
# ('en', 0.9999999239251556)
rank("This is English text")
# [('en', 0.9999999239251556),
# ('la', 5.0319768731501096e-08),
# ('br', 1.2684715402216825e-08),
# ...]
from langid_pyc import (
classify,
nb_classes,
set_languages,
)
nb_classes()
# ['af',
# 'am',
# 'an',
# ...]
len(nb_classes())
# 97
set_languages(["en", "ru"])
nb_classes()
# ['en', 'ru']
classify("This is English text")
# ('en', 1.0)
classify("А это текст на русском")
# ('ru', 1.0)
set_languages() # reset languages
len(nb_classes())
# 97
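Because set_languages changes module-level state, it is easy to forget the reset call. A minimal sketch of a context manager that restricts the languages and always resets them afterwards; restricted_languages is a hypothetical helper name, not part of langid-pyc, and the set_languages function is passed in explicitly so the sketch has no hard dependency on the package:

```python
from contextlib import contextmanager


@contextmanager
def restricted_languages(set_languages, languages):
    """Temporarily restrict the candidate languages, then reset them.

    `set_languages` is expected to behave like langid_pyc.set_languages:
    called with a list it restricts the classes, called with no
    arguments it resets them (as in the example above).
    """
    set_languages(languages)
    try:
        yield
    finally:
        # Runs even if the body raises, so the restriction never leaks.
        set_languages()


# Intended usage (assumes langid-pyc is installed):
#
#   from langid_pyc import classify, set_languages
#
#   with restricted_languages(set_languages, ["en", "ru"]):
#       classify("This is English text")
```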
LanguageIdentifier class
from langid_pyc import LanguageIdentifier
identifier = LanguageIdentifier.from_modelpath("ldpy3.pmodel") # default model
len(identifier.nb_classes)
# 97
identifier.classify("This is English text")
# ('en', 0.9999999239251556)
# identifier.rank(...)
# identifier.set_languages(...)
Install relevant protobuf
packages
apt install protobuf-c-compiler libprotobuf-c-dev
Install dev python requirements
pip install -r requirements.txt
Run build
make build
See Makefile for more details.
Train a new model using langid.py
package. You will get the model file as described here:
# output the model
output_path = os.path.join(model_dir, 'your_new_model.model')
model = nb_ptc, nb_pc, nb_classes,tk_nextmove, tk_output
string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
with open(output_path, 'w') as f:
    f.write(string)
print "wrote model to %s (%d bytes)" % (output_path, len(string))
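The snippet above is Python 2 code (print statement, cPickle). As a rough illustration of the same encoding steps in Python 3, the sketch below pickles a tuple, compresses it with bz2, and base64-encodes the result, then reverses the steps to read it back. dump_model and load_model are hypothetical helper names, and the tuple holds placeholder values rather than a trained model. Whether a file written this way is accepted by the conversion step is not confirmed here.

```python
import base64
import bz2
import os
import pickle
import tempfile


def dump_model(model, output_path):
    """Pickle, bz2-compress, and base64-encode `model`, then write it."""
    encoded = base64.b64encode(bz2.compress(pickle.dumps(model)))
    # b64encode returns bytes in Python 3, so the file is opened in binary mode.
    with open(output_path, "wb") as f:
        f.write(encoded)
    return len(encoded)


def load_model(path):
    """Reverse the three encoding steps and return the original object."""
    with open(path, "rb") as f:
        return pickle.loads(bz2.decompress(base64.b64decode(f.read())))


# Round trip with placeholder data standing in for
# (nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output).
placeholder_model = ([0.1, 0.2], [0.5, 0.5], ["en", "ru"], [0, 1], {0: [1]})
with tempfile.TemporaryDirectory() as model_dir:
    output_path = os.path.join(model_dir, "your_new_model.model")
    size = dump_model(placeholder_model, output_path)
    print("wrote model to %s (%d bytes)" % (output_path, size))
    assert load_model(output_path) == placeholder_model
```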
Move your_new_model.model to the models directory and run
make your_new_model.model
Now you have a your_new_model.pmodel file in the root, which can be fed to LanguageIdentifier.from_modelpath
from langid_pyc import LanguageIdentifier
your_new_identifier = LanguageIdentifier.from_modelpath("your_new_model.pmodel")
The benchmark was run on a Mac M2 Max with 32 GB RAM and Python 3.8.18 and can be found here.
TL;DR: langid.pyc is ~200x faster than langid.py and ~1-1.5x faster than pycld2, especially on long texts.
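To check figures like these on your own hardware, a small timing harness is enough. The sketch below is not the benchmark referenced above. It is a minimal, assumed setup that times any classify-style callable over a list of texts and reports the ratio between two runs. time_classifier and speedup are hypothetical helper names.

```python
import time


def time_classifier(classify, texts, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            classify(text)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Intended usage (assumes both packages are installed):
#
#   import langid
#   import langid_pyc
#
#   texts = ["This is English text"] * 1000
#   slow = time_classifier(langid.classify, texts)
#   fast = time_classifier(langid_pyc.classify, texts)
#   print("speedup: %.1fx" % speedup(slow, fast))
```

Taking the best of several passes reduces noise from warm-up and other processes. Results will vary with text length, so it is worth timing short and long inputs separately.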
langid.c readme
langid.c is an experimental implementation of the language identifier described by [1] in pure C. It is largely based on the design of langid.py [2], and uses langid.py to train models.
See TODO
Initial comparisons against Google's cld2 [3] suggest that langid.c is about twice as fast.
(langid.c) @mlui langid.c git:[master] wc -l wikifiles
28600 wikifiles
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
cat wikifiles 0.00s user 0.00s system 0% cpu 7.989 total
./compact_lang_det_batch > xxx 7.77s user 0.60s system 98% cpu 8.479 total
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx
cat wikifiles 0.00s user 0.00s system 0% cpu 3.577 total
./langidOs -b > xxx 3.44s user 0.24s system 97% cpu 3.759 total
(langid.c) @mlui langid.c git:[master] wc -l rcv2files
20000 rcv2files
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx
cat rcv2files 0.00s user 0.00s system 0% cpu 31.702 total
./langidO2 -b > xxx 8.23s user 0.54s system 22% cpu 38.644 total
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx
cat rcv2files 0.00s user 0.00s system 0% cpu 18.343 total
./compact_lang_det_batch > xxx 18.14s user 0.53s system 97% cpu 19.155 total
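The "about twice as fast" figure can be checked against the timings above. The sketch below compares user CPU seconds rather than total wall-clock time. That choice is an assumption: the langidO2 run on rcv2files reports only 22% CPU, which suggests its total time was dominated by something other than computation. The dictionary keys are labels chosen for this example.

```python
# User CPU seconds copied from the `time` output above.
user_seconds = {
    "wikifiles": {"cld2": 7.77, "langid.c": 3.44},   # langidOs build
    "rcv2files": {"cld2": 18.14, "langid.c": 8.23},  # langidO2 build
}

# Ratio of cld2 time to langid.c time for each corpus.
ratios = {
    corpus: times["cld2"] / times["langid.c"]
    for corpus, times in user_seconds.items()
}

for corpus, ratio in ratios.items():
    print("%s: langid.c is %.2fx faster (user CPU time)" % (corpus, ratio))
```

Both corpora give a ratio of roughly 2.2 to 2.3, consistent with the claim above.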
Google's protocol buffers [4] are used to transfer models between languages. The Python program ldpy2ldc.py can convert a model produced by langid.py [2] into the protocol-buffer format, and also into the C source format used to compile a built-in model directly into the executable.
Protocol buffers [4] protobuf-c [5]
Marco Lui saffsd@gmail.com
[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c