# fast-langdetect 🚀

## Overview

fast-langdetect is an ultra-fast, highly accurate language detection library based on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.

- Supports Python 3.9 to 3.13
- Works offline with the lite model
- No numpy required (thanks to @dalf)
## Background

This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging.

For more information about the underlying model, see the official FastText documentation on language identification.
## Memory note

The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.

Approximate memory usage (RSS after load):

- Lite: ~45–60 MB
- Full: ~170–210 MB
- Auto: tries the full model first and falls back to lite only on `MemoryError`

Notes:

- Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
- Validate on your system if memory is constrained; see examples/memory_usage_check.py (credit: script by github@JackyHe398).
- Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (`ru_maxrss`), leading to very large peaks with near-zero deltas.

Choose the model that best fits your constraints.
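If you want a quick sanity check of your own process footprint, a minimal sketch using only the standard library is shown below (this is not the bundled examples/memory_usage_check.py script, just an illustration of the idea; `resource` is Unix-only):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return peak RSS of the current process in MB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    ru = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return ru / (1024 * 1024)
    return ru / 1024

before = peak_rss_mb()
# ... load a model here, e.g. detect("Hello", model="lite") ...
after = peak_rss_mb()
print(f"peak RSS delta: {after - before:.1f} MB")
```

Remember that `ru_maxrss` is a high-water mark: once inflated (e.g., by an IDE), it never goes back down, which is why a clean terminal session matters.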
## Installation 💻

To install fast-langdetect, you can use either pip or pdm:

Using pip:

```bash
pip install fast-langdetect
```

Using pdm:

```bash
pdm add fast-langdetect
```
## Usage 🖥️

For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on `MemoryError`.
### Prerequisites

- If the sample is too long or too short, accuracy will be reduced.
- The model is downloaded to the system temporary directory by default. You can customize this by:
  - Setting the `FTLANG_CACHE` environment variable
  - Using `LangDetectConfig(cache_dir="your/path")`
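For the environment-variable route, set `FTLANG_CACHE` before fast-langdetect is imported for the first time; the path below is illustrative:

```python
import os

# Must be set before the first `import fast_langdetect`,
# since the cache location is resolved at import/load time.
os.environ["FTLANG_CACHE"] = os.path.expanduser("~/.cache/fast_langdetect")

# Subsequent imports pick up the variable:
# from fast_langdetect import detect
print(os.environ["FTLANG_CACHE"])
```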
### Simple Usage (Recommended)

Call a model explicitly for clear, predictable behavior, and use `k` to get multiple candidates. The function always returns a list of results:

```python
from fast_langdetect import detect

print(detect("Hello", model='lite', k=1))
print(detect("Hello", model='full', k=1))
print(detect("Hello", model='auto', k=1))
print(detect("Hello 世界 こんにちは", model='auto', k=3))
```
If you need a custom cache directory, pass a `LangDetectConfig`:

```python
from fast_langdetect import LangDetectConfig, detect

cfg = LangDetectConfig(cache_dir="/custom/cache/path")
print(detect("Hello", model='full', config=cfg))

cfg_lite = LangDetectConfig(model="lite")
print(detect("Hello", config=cfg_lite))
print(detect("Bonjour", config=cfg_lite))
print(detect("Hello", model='full', config=cfg_lite))
```
### Native API (Recommended)

```python
from fast_langdetect import detect, LangDetector, LangDetectConfig

print(detect("Hello, world!", k=1))
print(detect("Hello, world!", model='full', k=1))

config = LangDetectConfig(cache_dir="/custom/cache/path", model="auto")
detector = LangDetector(config)
result = detector.detect("Hello world", k=1)
print(result)

multiline_text = "Hello, world!\nThis is a multiline text."
print(detect(multiline_text, k=1))

results = detect(
    "Hello 世界 こんにちは",
    model='auto',
    k=3,
)
print(results)
```
### Fallback Policy (Keep It Simple)

- Only `MemoryError` triggers fallback (with `model='auto'`): when loading the full model runs out of memory, the library falls back to the lite model.
- I/O, network, permission, path, and integrity errors raise standard exceptions (e.g., `FileNotFoundError`, `PermissionError`) or library-specific errors where applicable; there is no silent fallback.
- `model='lite'` and `model='full'` never fall back, by design.
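Conceptually, the `'auto'` policy is just a narrow try/except around model loading. A minimal sketch of the idea (not the library's actual internals; the loader callables are hypothetical stand-ins):

```python
def load_auto(load_full, load_lite):
    """Try the full model; fall back to lite only on MemoryError.

    Any other exception (FileNotFoundError, PermissionError, ...)
    propagates unchanged -- there is no silent fallback.
    """
    try:
        return load_full()
    except MemoryError:
        return load_lite()

def full_oom():
    raise MemoryError("full model too large")

model = load_auto(full_oom, lambda: "lite-model")
print(model)  # -> lite-model
```

Catching only `MemoryError` keeps failures diagnosable: a missing file or bad permissions surfaces immediately instead of being masked by a quieter model swap.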
### Errors

- Base error: `FastLangdetectError` (library-specific failures).
- Model loading failures: `ModelLoadError`.
- Standard Python exceptions (e.g., `ValueError`, `TypeError`, `FileNotFoundError`, `MemoryError`) propagate when they are not library-specific.
### Convenient `detect_language` Function

```python
from fast_langdetect import detect_language

print(detect_language("Hello, world!"))
print(detect_language("Привет, мир!"))
print(detect_language("你好,世界!"))
```
### Load Custom Models

```python
from fast_langdetect import LangDetector, LangDetectConfig

config = LangDetectConfig(custom_model_path="/path/to/your/model.bin")
detector = LangDetector(config)
result = detector.detect("Hello world", model='auto', k=1)
```
## Splitting Text by Language 🌐

For text splitting based on language, please refer to the split-lang repository.
## Input Handling

You can control log verbosity and input normalization via `LangDetectConfig`:

```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(
    max_input_length=80,
)
detector = LangDetector(config)
print(detector.detect("Some very long text..."))
```

- Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- `max_input_length=80` truncates overly long inputs; set it to `None` to disable truncation.
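The preprocessing described in the bullets above amounts to two steps. A standalone sketch of the equivalent behavior (the library's internal implementation may differ):

```python
import logging
from typing import Optional

logger = logging.getLogger("fast_langdetect.sketch")

def preprocess(text: str, max_input_length: Optional[int] = 80) -> str:
    # Newlines are always replaced with spaces (silent, no log)
    text = text.replace("\n", " ")
    # Truncation may reduce accuracy, so it is logged as a WARNING
    if max_input_length is not None and len(text) > max_input_length:
        logger.warning("input truncated from %d to %d chars",
                       len(text), max_input_length)
        text = text[:max_input_length]
    return text

print(preprocess("Hello\nworld"))   # -> Hello world
print(len(preprocess("x" * 200)))   # -> 80
```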
## Cache Directory Behavior

- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
- User-provided `cache_dir`: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `FileNotFoundError` instead of silently creating it or using another location. Create the directory yourself if that's intended.
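So when supplying your own `cache_dir`, create it first. A short sketch (the library calls are commented out so the snippet stands alone; the directory name is illustrative):

```python
import os
import tempfile

cache_dir = os.path.join(tempfile.gettempdir(), "my_fastlang_cache")
os.makedirs(cache_dir, exist_ok=True)  # the library will NOT create this for you

# from fast_langdetect import LangDetectConfig, detect
# cfg = LangDetectConfig(cache_dir=cache_dir)
# print(detect("Hello", model="lite", config=cfg))
print(cache_dir)
```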
## Advanced Options (Optional)

The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.
## Language Codes → English Names

The detector returns fastText language codes (e.g., `en`, `zh`, `ja`, `pt-br`). To present user-friendly names, you can map codes to English names using a third-party library. Example using langcodes:

```python
from langcodes import Language

# Overrides for tags that need a specific display name
OVERRIDES = {
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt-br": "Portuguese (Brazil)",
}

def code_to_english_name(code: str) -> str:
    code = code.replace("_", "-").lower()
    if code in OVERRIDES:
        return OVERRIDES[code]
    try:
        return Language.get(code).display_name("en")
    except Exception:
        # Fall back to the base subtag (e.g., "pt" for "pt-xx")
        base = code.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code
```

```python
from fast_langdetect import detect

result = detect("Olá mundo", model='full', k=1)
print(code_to_english_name(result[0]["lang"]))
```

Alternatively, pycountry can be used for ISO 639 lookups (install with `pip install pycountry`), combined with a small override dict for non-standard tags like `pt-br`, `zh-cn`, `yue`, etc.
## Benchmark 📊

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
## References 📚

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

```bibtex
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

```bibtex
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```
## License 📄

- Code: released under the MIT License (see LICENSE).
- Models: this package uses the pre-trained fastText language identification models (`lid.176.ftz`, bundled for offline use, and `lid.176.bin`, downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
- Attribution: fastText language identification models by Facebook AI Research; see the fastText docs and license for details.
- Note: if you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.