Security News
Opengrep Emerges as Open Source Alternative Amid Semgrep Licensing Controversy
Opengrep forks Semgrep to preserve open source SAST in response to controversial licensing changes.
The only language identification that includes Cantonese (廣東話), traditional and simplified Chinese.
This is a language identification language focus on providing higher accuracy in Japanese, Korean, and Chinese language compares to the original Fasttext model ( lid.176.ftz ). This package also include identification for cantonese, simplified and traditional Chinese language.
First stage model F1, which is same from fasttext language identification model
Model | F1@1 |
---|---|
lid.176.ftz | 0.977 |
We can achieve higher accuracy by including an additional language identification model to handle low confidence scores for Japanese, Korean, Chinese. The table below shows F1 (k=1) scores in identifying 3 languages. (we updated the validation corpus which is much harder to the first revision : shorter text, latest news text )
2nd-Stage Model | F1@1 | Acc@1 |
---|---|---|
version 1.0.0 | 0.826 | 0.744 |
master | 0.801 | 0.894 |
Master version is also trained with identifying Cantonese (zh-yue) text from Mozilla Common Voice corpus text. Currently the model is senstive to non cantonese text mixing inside the sentence, hence please use the model with care.
To use Cantonese prediction, it recommended to force inference using the second stage prediction
lang_code = langid.predict('平嘢有冇好嘢?', force_second=True)
For more edge case detail please refer to fasttext_issues.py
The training data for the supplement model was drawn from Common Crawl Corpus and Currents API internal language dataset.
We wish to support Cantonese language in the upcoming future. Feel free to contact us if you would like to provide any related corpus.
$ pip install fastlangid
Only one function call away to handle single or multiple sentences
from fastlangid.langid import LID
langid = LID()
result = langid.predict('This is a test')
print(result)
from fastlangid.langid import LID
langid = LID()
examples = [
'中文繁體',
'中文简体',
'Lorem Ipsum is simply dummy text of the printing and typesetting industry',
'Lorem Ipsum adalah text contoh digunakan didalam industri pencetakan dan typesetting',
'Le Lorem Ipsum est simplement du faux texte employé dans la composition et la mise en page avant impression'
]
results = langid.predict(examples)
print(results)
Supports 177 languages. The ISO codes for the corresponding languages are as below.
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk
ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga
gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km
kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms
mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu
rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr
tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh-hans zh-hant zh-yue
Bag of words method doesn't work well in short text classification as found by this article by Apple. Hence it's recommend that you ensure the text have at least more than 5 characters/words.
Cantonese language identification is trained on daily conversation text which may not represent well in article types text. Hence it may confuse with traditional chinese (zh-hant) as they share the exact same characters.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
FAQs
Language detection for news powered by fasttext
We found that fastlangid demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Opengrep forks Semgrep to preserve open source SAST in response to controversial licensing changes.
Security News
Critics call the Node.js EOL CVE a misuse of the system, sparking debate over CVE standards and the growing noise in vulnerability databases.
Security News
cURL and Go security teams are publicly rejecting CVSS as flawed for assessing vulnerabilities and are calling for more accurate, context-aware approaches.