Fast-Text Language Detection
In a search for a way to predict the language of a text without a large machine learning model, fastText, created by Facebook, appeared to be the best option (https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c).
Installation
npm install which-lang
Note: This will install the fastText model by Facebook, which is about 150MB. You also need Python installed; if you're running an Alpine Docker image, see how to easily do this here
Usage
Prediction
Testing
import LanguageDetection from 'which-lang';

async function run() {
  const lid = new LanguageDetection();
  // Each call resolves to an array of { lang, prob, isReliableLanguage } objects.
  console.log(await lid.predict('FastText-LID provides a great language identification'));
  console.log(await lid.predict('FastText-LID bietet eine hervorragende Sprachidentifikation'));
  console.log(await lid.predict('FastText-LID fornisce un ottimo linguaggio di identificazione'));
  console.log(await lid.predict('FastText-LID fournit une excellente identification de la langue'));
  console.log(await lid.predict('FastText-LID proporciona una gran identificación de idioma'));
  console.log(await lid.predict('FastText-LID обеспечивает отличную идентификацию языка'));
  console.log(await lid.predict('这个case我想close.'));
  console.log(await lid.predict('FastText-LID提供了很好的語言識別'));
}

run();
The second argument is the number of results to return, e.g. lid.predict(text, 10) will return an array of 10 results.
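When several candidates are requested, it can help to see how far the best guess is ahead of the runner-up. The sketch below is one possible way to do that. It assumes each candidate has the same { lang, prob, isReliableLanguage } shape as the single results shown in this README, and it makes no assumption about the order in which candidates are returned. rankCandidates is a hypothetical helper, not part of which-lang, and the sample values are made up.

```javascript
// Hypothetical helper, not part of which-lang: ranks the candidates returned by
// lid.predict(text, k) and reports how far the best one is ahead of the runner-up.
function rankCandidates(results) {
  // Copy before sorting so the caller's array is left untouched.
  const ranked = [...results].sort((a, b) => b.prob - a.prob);
  const best = ranked[0] ?? null;
  const runnerUp = ranked[1] ?? null;
  // Margin is null when there is no second candidate to compare against.
  const margin = best && runnerUp ? best.prob - runnerUp.prob : null;
  return { best, runnerUp, margin };
}

// Sample data shaped like the documented output; the values are invented.
const sample = [
  { lang: 'pt', prob: 0.21, isReliableLanguage: true },
  { lang: 'es', prob: 0.72, isReliableLanguage: true },
  { lang: 'it', prob: 0.04, isReliableLanguage: true },
];
const { best, margin } = rankCandidates(sample);
console.log(best.lang, margin.toFixed(2));
```

A small margin suggests the text is ambiguous between two languages, which is common for short or mixed-language input.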
Output
[ { lang: 'en', prob: 0.6313226222991943, isReliableLanguage: true } ]
[ { lang: 'de', prob: 0.9137917160987854, isReliableLanguage: true } ]
[ { lang: 'it', prob: 0.974501371383667, isReliableLanguage: true } ]
[ { lang: 'fr', prob: 0.7358829379081726, isReliableLanguage: true } ]
[ { lang: 'es', prob: 0.9211937189102173, isReliableLanguage: true } ]
[ { lang: 'ru', prob: 0.9899846911430359, isReliableLanguage: true } ]
[ { lang: 'zh', prob: 0.9437162280082703, isReliableLanguage: true } ]
[ { lang: 'zh', prob: 0.8515647649765015, isReliableLanguage: true } ]
isReliableLanguage
is true if there were 10 or more test results and the accuracy was 95% or higher
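The flag can be combined with prob to decide whether to trust a prediction at all. The following is a minimal sketch under assumptions: pickLanguage and its minProb threshold of 0.5 are hypothetical and not part of which-lang, and the input is an array shaped like the output above.

```javascript
// Hypothetical helper, not part of which-lang: returns the language code of the
// most probable candidate, or null if that candidate should not be trusted.
function pickLanguage(results, minProb = 0.5) {
  if (results.length === 0) return null;
  // Find the candidate with the highest probability, whatever the input order.
  const top = results.reduce((a, b) => (b.prob > a.prob ? b : a));
  // Accept it only if it is flagged reliable and clears the threshold.
  return top.isReliableLanguage && top.prob >= minProb ? top.lang : null;
}

// First result from the Output section above.
const english = [{ lang: 'en', prob: 0.6313226222991943, isReliableLanguage: true }];
console.log(pickLanguage(english));      // prints en
console.log(pickLanguage(english, 0.8)); // prints null
```

Returning null instead of a low-confidence guess lets the caller fall back to a default language or ask the user.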