fast-langdetect 🚀
Overview
fast-langdetect is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. It is 80x faster than conventional methods and delivers up to 95% accuracy.
- Supports Python 3.9 to 3.13.
- Works offline in low-memory mode.
- No numpy required (thanks to @dalf).
Background
This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging.
For more information about the underlying model, see the official FastText documentation: Language Identification.
Possible memory usage
This library requires at least 200 MB of memory in low-memory mode.
Installation 💻
To install fast-langdetect, you can use either pip or pdm:
Using pip
pip install fast-langdetect
Using pdm
pdm add fast-langdetect
Usage 🖥️
When accuracy matters, do not rely on the detection results of the small model. Pass low_memory=False to download and use the larger model.
Prerequisites
- Newline characters (\n) must be removed from the input string before calling the function.
- Accuracy drops if the sample is too long or too short (for example, very short Chinese text may be predicted as Japanese).
- The model is downloaded to the /tmp/fasttext-langdetect directory on first use.
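The newline requirement above can be met with a small preprocessing step. The following is a minimal sketch and not part of fast-langdetect: prepare_text and its max_chars parameter are hypothetical names, and the 500-character cap is an arbitrary illustration, not a recommended value.

```python
import re


def prepare_text(text: str, max_chars: int = 500) -> str:
    """Collapse whitespace runs (including newlines) into single spaces.

    Hypothetical helper, not part of fast-langdetect. max_chars is an
    arbitrary illustrative cap for very long samples.
    """
    # \s+ matches any whitespace run: "\n", "\r\n", tabs, repeated spaces.
    single_line = re.sub(r"\s+", " ", text).strip()
    # Truncate long input, then drop a trailing space left by the cut.
    return single_line[:max_chars].rstrip()


sample = "Hello, world!\nThis is a multiline text.\r\n"
print(prepare_text(sample))
```

The result can then be passed to detect. Replacing newlines with a space rather than an empty string keeps words on adjacent lines from being fused together.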
Native API (Recommended)
from fast_langdetect import detect, detect_multilingual

# Default call.
print(detect("Hello, world!"))
# Large model, with strict mode enabled (no fallback to the small model).
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

# Newline characters must be removed before calling detect().
multiline_text = """
Hello, world!
This is a multiline text.
But we need to remove `\n` characters or it will raise a ValueError.
REMOVE \n
"""
multiline_text = multiline_text.replace("\n", "")
print(detect(multiline_text))

# The result is subscriptable; "lang" holds the detected language.
print(detect("Привет, мир!")["lang"])

# Mixed-language input.
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
Fallbacks
We provide a fallback mechanism: when use_strict_mode=False, if the program fails to load the large model (low_memory=False), it falls back to the offline small model to complete the prediction task.
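The fallback behavior described above follows a common try-then-degrade pattern. The sketch below illustrates that pattern in plain Python. It is not the library's actual implementation: load_with_fallback, failing_loader, and the loader callables are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load_large: Callable[[], T],
    load_small: Callable[[], T],
    strict: bool = False,
) -> T:
    """Try the large model first; fall back to the small one unless strict."""
    try:
        return load_large()
    except Exception:
        if strict:
            # Strict mode: surface the failure instead of silently degrading.
            raise
        return load_small()


def failing_loader() -> str:
    # Hypothetical stand-in for a download that fails, e.g. no network access.
    raise OSError("large model unavailable")


print(load_with_fallback(failing_loader, lambda: "small-model"))
```

With strict=True the same call re-raises the OSError, which is the useful choice when a silent drop in accuracy would be worse than a visible failure.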
Convenient detect_language Function
from fast_langdetect import detect_language
print(detect_language("Hello, world!"))
print(detect_language("Привет, мир!"))
print(detect_language("你好,世界!"))
Splitting Text by Language 🌐
To split text by language, see the split-lang repository.
Benchmark 📊
For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
References 📚
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}