TinyLD
Tiny Language Detector: detects the language of a Unicode (UTF-8) text.
- pure JavaScript, no API calls, and no dependencies (Node.js and browser compatible)
- alternative to libraries like CLD
- blazing fast, with a low memory footprint (unlike ML methods)
- supports 62 languages (30 for the web version)
- output in ISO 639-1 format
Getting Started
Install
yarn add tinyld
API
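import { detect, detectAll } from 'tinyld'

// detect() returns the most likely language as an ISO 639-1 code
detect('これは日本語です.') // expected: 'ja'
detect('and this is english.') // expected: 'en'

// detectAll() returns a list of candidate languages instead of a single code
detectAll('ceci est un text en francais.')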
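As an illustration of how `detect` might be used in practice, here is a minimal sketch that buckets a list of texts by detected language. The helper `groupByLanguage` and the stand-in `fakeDetect` are hypothetical names, not part of TinyLD; the detector is passed in as a parameter so the snippet runs without installing anything. It also assumes that an empty result means no language was identified.

```javascript
// Hypothetical helper (not part of TinyLD): buckets texts by detected language.
// `detect` is injected, so tinyld's detect or any stand-in can be used.
function groupByLanguage(texts, detect) {
  const groups = new Map()
  for (const text of texts) {
    // Assumption: an empty or falsy result means "no language identified".
    const lang = detect(text) || 'unknown'
    if (!groups.has(lang)) groups.set(lang, [])
    groups.get(lang).push(text)
  }
  return groups
}

// Stand-in detector for illustration only: kana means 'ja', other non-blank text means 'en'.
const fakeDetect = (text) =>
  /[\u3040-\u30ff]/.test(text) ? 'ja' : text.trim() ? 'en' : ''

const grouped = groupByLanguage(
  ['これは日本語です.', 'and this is english.', '   '],
  fakeDetect
)
```

In a real project, `fakeDetect` would be replaced by the `detect` function imported from `tinyld`.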
TinyLD CLI
tinyld This is the text that I want to check
Benchmark
Benchmark done on tatoeba dataset (~9M sentences) on 16 of the most common languages.
Library | Script | Properly Identified | Improperly Identified | Not Identified | Avg Execution Time | Disk Size |
---|---|---|---|---|---|---|
TinyLD | yarn bench:tinyld | **96.1747%** | **2.6938%** | **1.1315%** | 0.1315 ms | 778KB |
TinyLD Web | yarn bench:tinyld-light | **92.1169%** | 3.9536% | **3.9295%** | **0.0616 ms** | **89KB** |
node-cld | yarn bench:cld | **88.9148%** | **1.7489%** | 9.3363% | **0.0612 ms** | > 10MB |
node-lingua | yarn bench:lingua | 82.3157% | **0.2158%** | 17.4685% | 0.7085 ms | ~100MB |
franc | yarn bench:franc | 68.7783% | 26.3432% | **4.8785%** | 0.1381 ms | 267KB |
franc-min | yarn bench:franc-min | 65.5163% | 23.5794% | 10.9044% | **0.0614 ms** | **119KB** |
languagedetect | yarn bench:languagedetect | 61.6068% | 12.295% | 26.0982% | 0.1585 ms | **240KB** |
- For each category, the top 3 results are in bold
- Languages evaluated in this benchmark:
  - Asia: jpn, cmn, kor, hin
  - Europe: fra, spa, por, ita, nld, eng, deu, fin, rus
  - Middle East: tur, heb, ara
- This kind of benchmark is not perfect, and the percentages can vary over time, but it gives a good idea of overall performance
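The three accuracy columns split the samples into three groups, which is why each row sums to 100%. The sketch below shows one way such percentages could be computed. It is not the project's actual benchmark script: `scoreDetector`, the sample data, and the `stub` detector are hypothetical, and it assumes "not identified" means the detector returned an empty result.

```javascript
// Hypothetical scoring helper, not the project's benchmark script.
// samples: array of { text, lang }, where lang is the expected language code.
function scoreDetector(samples, detect) {
  let correct = 0
  let wrong = 0
  let missing = 0
  for (const { text, lang } of samples) {
    const result = detect(text)
    if (!result) missing++ // assumption: falsy result = "not identified"
    else if (result === lang) correct++
    else wrong++
  }
  // Convert a count to a percentage of all samples (0 when there are none).
  const pct = (n) => (samples.length ? (100 * n) / samples.length : 0)
  return {
    properly: pct(correct),
    improperly: pct(wrong),
    notIdentified: pct(missing),
  }
}

// Made-up samples for illustration only.
const samples = [
  { text: 'a', lang: 'en' },
  { text: 'b', lang: 'fr' },
  { text: 'c', lang: 'ja' },
  { text: 'd', lang: 'en' },
]

// Stand-in detector: answers 'en' for 'a', 'b' and 'd', and nothing for 'c'.
const stub = (text) => ({ a: 'en', b: 'en', c: '', d: 'en' }[text])

const score = scoreDetector(samples, stub)
```

With these made-up samples, two of four answers are right, one is wrong, and one is empty, so the three values are 50, 25, and 25.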
Conclusion
Recommended
- For NodeJS: TinyLD or node-cld (fast and accurate)
- For Browser: TinyLD Light or franc-min (small, decent accuracy; franc is less accurate but supports more languages)
Not recommended
- node-lingua is just too big and slow
- languagedetect is light but not accurate enough, and is mostly focused on Indo-European languages (it supports Kazakh but not Chinese, Korean, or Japanese)