fast-langdetect 🚀
Overview
fast-langdetect provides fast, accurate language detection based on FastText, a library developed by
Facebook. According to the project, it is 80x faster than traditional methods and reaches 95% accuracy.
It supports Python versions 3.9 to 3.12.
It also supports offline usage.
This project builds upon zafercavdar/fasttext-langdetect
with enhancements in packaging.
For more information on the underlying FastText model, refer to the official
documentation: FastText Language Identification.
Note: In low memory mode, this library still requires over 200MB of memory.
Installation 💻
To install fast-langdetect, you can use either pip or pdm:
Using pip
pip install fast-langdetect
Using pdm
pdm add fast-langdetect
Usage 🖥️
For optimal performance and accuracy in language detection, use detect(text, low_memory=False) to load the larger model.
The model will be downloaded to the /tmp/fasttext-langdetect directory upon first use.
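Because the model is fetched on first use, it can help to check the cache directory before running offline. The following is a minimal sketch using only the standard library. The helper name cached_model_files is hypothetical, and the names of the files inside the directory are not specified by this README, so the sketch only lists whatever is present.

```python
from pathlib import Path

# Cache location named in this README; adjust if your setup differs.
DEFAULT_CACHE_DIR = Path("/tmp/fasttext-langdetect")


def cached_model_files(cache_dir=DEFAULT_CACHE_DIR):
    """Return the sorted names of regular files in cache_dir, or [] if it is missing."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_file())


files = cached_model_files()
if files:
    print("Cached model files:", files)
else:
    print("No cached model files found; a download may be needed on first use.")
```

An empty result only means nothing has been cached in that directory yet. It does not tell you which model a later call will load.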
Native API (Recommended)
Note: This function expects a single line of text. Remove \n characters before passing the text.
If the sample is too long or too short, accuracy decreases (for example, very short Chinese text
may be predicted as Japanese).
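from fast_langdetect import detect, detect_multilingual

# Detect with the default settings.
print(detect("Hello, world!"))
# Load the larger model for better accuracy.
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

multiline_text = """
Hello, world!
This is a multiline text.
But we need to remove newline characters, or detect() will raise a ValueError.
"""
# Replace newlines with spaces so words on adjacent lines do not run together.
multiline_text = multiline_text.replace("\n", " ").strip()
print(detect(multiline_text))

# The detected language code is available under the "lang" key of the result.
print(detect("Привет, мир!")["lang"])

# Text that mixes several languages.
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))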
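Since detect expects a single line, a small normalization step before each call keeps that requirement in one place. The helper below is a sketch, not part of fast-langdetect, and the name to_single_line is hypothetical. It collapses every run of whitespace, including newlines and tabs, into one space.

```python
def to_single_line(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading and trailing whitespace.
    return " ".join(text.split())


raw = """
Hello, world!
This is a multiline text.
"""
print(to_single_line(raw))
```

The result can then be passed on as detect(to_single_line(raw)). Joining lines with a space suits space-delimited languages; for text such as Chinese, where a line break may fall inside a sentence, joining with an empty string may be preferable.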
Convenient detect_language Function
from fast_langdetect import detect_language
print(detect_language("Hello, world!"))
print(detect_language("Привет, мир!"))
print(detect_language("你好,世界!"))
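When many strings need labelling, the one-call-per-text pattern above can be wrapped in a small grouping helper. This is a sketch under stated assumptions: group_by_language is a hypothetical name, and the detector is passed in as a callable so that the helper makes no assumption about the exact label format detect_language returns.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str], detect_fn: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group texts under the label that detect_fn returns for each one."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        # The detector expects a single line, so collapse whitespace first.
        single_line = " ".join(text.split())
        if not single_line:
            continue  # skip empty or whitespace-only input
        groups[detect_fn(single_line)].append(text)
    return dict(groups)
```

With the library installed, the call would look like group_by_language(texts, detect_language). Passing the detector as an argument also makes the helper easy to exercise with a stub function.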
Splitting Text by Language 🌐
For text splitting based on language, please refer to the split-lang repository.
Benchmark 📊
For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
References 📚
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}