
@martincik/fast-text-language-detection

Language detection with facebook fast-text model


Fast-Text Language Detection

A search for the best way to predict the language of a text without requiring a large machine learning model showed that fast-text, created by Facebook, was the best option (https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c).

Installation

npm i --save @smodin/fast-text-language-detection

Note: This installs the fast-text model by Facebook, which is about 150MB. You also need Python installed; if you're running an Alpine Docker image, see the Sample Dockerfile section below for an easy way to do this.

Usage

Prediction

Testing

;(async () => {
  const LanguageDetection = require('@smodin/fast-text-language-detection')
  const lid = new LanguageDetection()

  console.log(await lid.predict('FastText-LID provides a great language identification'))
  console.log(await lid.predict('FastText-LID bietet eine hervorragende Sprachidentifikation'))
  console.log(await lid.predict('FastText-LID fornisce un ottimo linguaggio di identificazione'))
  console.log(await lid.predict('FastText-LID fournit une excellente identification de la langue'))
  console.log(await lid.predict('FastText-LID proporciona una gran identificación de idioma'))
  console.log(await lid.predict('FastText-LID обеспечивает отличную идентификацию языка'))
  console.log(await lid.predict('FastText-LID提供了很好的語言識別'))
})()

The second argument is the number of results to return; for example, lid.predict(text, 10) returns an array of 10 results.
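
As a rough illustration of the second argument, the sketch below wraps predict in a small helper. The name detectTopK is hypothetical and not part of the package. It assumes only what is described above: predict(text, k) resolves to an array of objects with a lang field.

```javascript
// Hypothetical helper, not part of the package.
// Asks the detector for up to k candidates and returns only the language
// codes, in the same order predict() returned them.
async function detectTopK(lid, text, k = 3) {
  const results = await lid.predict(text, k)
  return results.map((r) => r.lang)
}
```

With a real detector this could be called as detectTopK(new LanguageDetection(), 'Bonjour tout le monde', 3) from inside an async function.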

Output

[ { lang: 'en', prob: 0.6313226222991943, isReliableLanguage: true } ]
[ { lang: 'de', prob: 0.9137917160987854, isReliableLanguage: true } ]
[ { lang: 'it', prob: 0.974501371383667, isReliableLanguage: true } ]
[ { lang: 'fr', prob: 0.7358829379081726, isReliableLanguage: true } ]
[ { lang: 'es', prob: 0.9211937189102173, isReliableLanguage: true } ]
[ { lang: 'ru', prob: 0.9899846911430359, isReliableLanguage: true } ]
[ { lang: 'zh', prob: 0.8515647649765015, isReliableLanguage: true } ]

isReliableLanguage is true if the language had 10 or more test results in the benchmark and its accuracy was 95% or higher.
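
The rule above can be sketched as a small predicate. This is only a restatement of the stated criteria, not the package's actual implementation; the function name and both parameters are assumptions made for illustration.

```javascript
// Sketch of the stated rule; the package's real implementation may differ.
// testCount: number of benchmark sentences for the language (assumed input)
// accuracy:  fraction of those sentences predicted correctly, 0 to 1 (assumed input)
function isReliable(testCount, accuracy) {
  return testCount >= 10 && accuracy >= 0.95
}
```

Under this reading, a language with a perfect score but only 3 test sentences would not be flagged as reliable, because the sample is too small.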

Other Helpers

const LanguageDetection = require('@smodin/fast-text-language-detection')
const lid = new LanguageDetection()
const languageIsoCodes = lid.languageIsoCodes // ['af', 'als', 'am', 'an', 'ar', ...]

Similar Libraries

FastText has been used and implemented in other computer languages.

Reference Documents

Accuracy from Benchmark Testing

Long Input (30 to 250 characters)

Translated sentence data was obtained from tatoeba.org. Additional metadata can be found in benchmark-testing/results/RESULTS_with_metadata.csv.

Testing the 550k sentences of 30 - 250 characters took less than 30 seconds (on a personal MacBook Pro).

| Language (101) | Symbol (alternates) | Count (558260) | Accuracy (30 - 250 chars) | Mislabels | False Positives |
| --- | --- | --- | --- | --- | --- |
| English | en | 22428 | 1 | | 120 |
| Greek | el | 12039 | 1 | | 0 |
| Hebrew | he | 8616 | 1 | | 0 |
| Japanese | ja | 2169 | 1 | | 0 |
| Georgian | ka | 1973 | 1 | | 0 |
| Bengali | bn | 1164 | 1 | | 131 |
| Thai | th | 572 | 1 | | 0 |
| Mandarin Chinese | zh | 568 | 1 | | 0 |
| Malayalam | ml | 517 | 1 | | 0 |
| Korean | ko | 482 | 1 | | 7 |
| Burmese | my | 216 | 1 | | 0 |
| Tamil | ta | 205 | 1 | | 0 |
| Kannada | kn | 118 | 1 | | 1 |
| Telugu | te | 102 | 1 | | 0 |
| Punjabi (Eastern) | pa | 88 | 1 | | 0 |
| Lao | lo | 70 | 1 | | 0 |
| Gujarati | gu | 57 | 1 | | 0 |
| Tibetan | bo | 20 | 1 | | 0 |
| Divehi, Dhivehi, Maldivian | dv | 15 | 1 | | 0 |
| Sinhala | si | 9 | 1 | | 0 |
| Amharic | am | 3 | 1 | | 0 |
| German | de | 22014 | 0.9998637230853094 | en | 64 |
| Polish | pl | 17768 | 0.999718595227375 | en,eo,de,ro | 88 |
| Russian | ru | 17329 | 0.9997114663281205 | bg,kk,uk,mk | 241 |
| Hungarian | hu | 17942 | 0.9996655891204994 | tr,br,it,de,en | 43 |
| Hindi | hi | 5362 | 0.999627004848937 | mr | 0 |
| Vietnamese | vi | 13000 | 0.9996153846153846 | eo,hu,fr | 9 |
| Turkish | tr | 19919 | 0.9995983734123199 | eo,en,it,fr,nds | 1092 |
| Esperanto | eo | 17841 | 0.999551594641556 | it,es,pt,fr,ceb | 13 |
| French | fr | 23076 | 0.999523314265904 | en,es,it,ru | 238 |
| Marathi | mr | 10461 | 0.9995220342223496 | hi | 2 |
| Uyghur | ug | 3692 | 0.9991874322860238 | ba,ru,hu | 0 |
| Finnish | fi | 17406 | 0.9990807767436516 | it,et,en,hr,de | 37 |
| Italian | it | 18326 | 0.9989632216522972 | es,de,fr,en,la | 2207 |
| Spanish | es | 18227 | 0.998134635430954 | pt,it,io,ca,ia | 3476 |
| Armenian | hy | 518 | 0.9980694980694981 | de | 0 |
| Arabic | ar | 8761 | 0.9978312977970552 | arz,fa,es,mzn,en | 0 |
| Ukrainian | uk | 14285 | 0.9963598179908996 | ru,sr | 133 |
| Macedonian | mk | 14465 | 0.9959903214656066 | bg,sr,ru | 93 |
| Dutch | nl | 19626 | 0.9934780393355752 | en,af,de,nds,fr | 382 |
| Lithuanian | lt | 13835 | 0.9933501987712324 | fi,pl,eo,pt,sr | 20 |
| Portuguese | pt | 20174 | 0.9933082184990581 | es,gl,it,en,fr | 1149 |
| Khmer | km | 379 | 0.9920844327176781 | az,et | 0 |
| Urdu | ur | 963 | 0.9906542056074766 | pnb,fa,ro,en | 9 |
| Czech | cs | 10863 | 0.9898738838258307 | sk,pl,hu,sl,en | 1 |
| Swedish | sv | 12188 | 0.9886773875943551 | no,da,en,fi,id | 174 |
| Romanian | ro | 13560 | 0.9886430678466077 | es,fr,it,en,pt | 133 |
| Bulgarian | bg | 11144 | 0.9869885139985642 | mk,ru,uk,sr | 2 |
| Ossetian | os | 59 | 0.9830508474576272 | ru | 0 |
| Icelandic | is | 6364 | 0.9803582652419862 | et,no,da,hu,cs | 4 |
| Kazakh | kk | 2232 | 0.9802867383512545 | ru,tr,tt,uk,ky | 4 |
| Tagalog | tl | 10351 | 0.9737223456670853 | ceb,en,id,es,war | 21 |
| Tatar | tt | 8178 | 0.9680851063829787 | az,tr,ru,fi,kk | 13 |
| Basque | eu | 2999 | 0.9676558852950984 | it,nl,id,en,io | 14 |
| Tajik | tg | 30 | 0.9666666666666667 | ru | 0 |
| Belarusian | be | 6253 | 0.9625779625779626 | uk,ru,pl,bg,sr | 0 |
| Latvian | lv | 1243 | 0.9597747385358005 | lt,hr,sr,fi,eo | 4 |
| Chuvash | cv | 460 | 0.9543478260869566 | ru,uk,ba,sr | 0 |
| Breton | br | 2451 | 0.9543043655650755 | fr,nl,eu,de,pt | 0 |
| Bashkir | ba | 120 | 0.95 | tt,av | 0 |
| Indonesian | id | 9372 | 0.949637217242851 | ms,it,en,eo,tr | 16 |
| Danish | da | 15299 | 0.948035819334597 | no,sv,de,en,nn | 2 |
| Estonian | et | 1227 | 0.9356153219233904 | fi,en,hu,it,nl | 5 |
| Latin | la | 11437 | 0.9206085511934948 | fr,it,en,es,pt | 292 |
| Irish | ga | 867 | 0.9065743944636678 | en,gd,ca,kv,cs | 14 |
| Scottish Gaelic | gd | 542 | 0.8966789667896679 | en,ga,de,fr,pam | 2 |
| Welsh | cy | 619 | 0.8917609046849758 | es,en,la,kw,de | 8 |
| Catalan | ca | 4725 | 0.8833862433862434 | es,pt,fr,it,ro | 0 |
| Kyrgyz | ky | 66 | 0.8787878787878788 | ru,kk | 4 |
| Cornish | kw | 426 | 0.8779342723004695 | en,cy,de,br,sq | 1 |
| Assamese | as | 960 | 0.8635416666666667 | bn | 0 |
| Volapük | vo | 806 | 0.8511166253101737 | id,de,fi,en,eo | 15 |
| Serbian | sr | 13494 | 0.8489699125537276 | hr,sh,mk,bs,sl | 1050 |
| Slovak | sk | 4370 | 0.8263157894736842 | cs,pl,sl,no,sr | 45 |
| Maltese | mt | 52 | 0.8076923076923077 | es,cs,pt,sr,eo | 7 |
| Norwegian Nynorsk | nn (no) | 657 | 0.7990867579908676 | da,sv,de,es,fi | 29 |
| Afrikaans | af | 1632 | 0.7879901960784313 | nl,en,fr,de,nds | 0 |
| Occitan | oc | 2861 | 0.7679133170220203 | ca,es,fr,pt,it | 27 |
| Interlingua | ia | 18782 | 0.7500798636992866 | es,it,fr,la,pt | 82 |
| Sanskrit | sa | 11 | 0.7272727272727273 | hi,ne | 0 |
| Chechen | ce | 7 | 0.7142857142857143 | mn,ru | 0 |
| Slovenian | sl | 372 | 0.6774193548387096 | sr,hr,bs,pl,eo | 62 |
| Frisian | fy | 107 | 0.6635514018691588 | nl,en,de,af,fr | 8 |
| Javanese | jv | 260 | 0.6461538461538462 | id,en,ms,ko,su | 5 |
| Yoruba | yo | 5 | 0.6 | sk,rm | 1 |
| Luxembourgish | lb | 217 | 0.5944700460829493 | de,nds,sv,fr,nl | 3 |
| Galician | gl | 2618 | 0.5790679908326967 | pt,es,it,fr,ca | 8 |
| Turkmen | tk | 3793 | 0.5710519377801213 | tr,uz,en,et,io | 0 |
| Croatian | hr | 2222 | 0.5333033303330333 | sr,sh,bs,sl,pl | 45 |
| Aragonese | an | 4 | 0.5 | es | 0 |
| Ido | io | 2905 | 0.48055077452667816 | eo,es,it,pt,tr | 7 |
| Interlingue | ie | 2007 | 0.4718485301444943 | es,it,fr,en,ia | 7 |
| Limburgan, Limburger, Limburgish | li | 3 | 0.3333333333333333 | de | 1 |
| Walloon | wa | 16 | 0.3125 | fr,pt,tl,oc,en | 1 |
| Somali | so | 32 | 0.21875 | fi,eo,cy,en,az | 1 |
| Corsican | co | 5 | 0.2 | it,fr | 0 |
| Sundanese | su | 11 | 0.18181818181818182 | id,ms,es | 19 |
| Haitian Creole | ht | 15 | 0.06666666666666667 | br,fr,su,diq,no | 3 |
| Romansh | rm | 16 | 0.0625 | it,fr,en,tl,qu | 3 |
| Bosnian | bs | 139 | 0.03597122302158273 | sr,hr,sh,pl,sl | 0 |
| Manx | gv | 6 | 0 | cy,fr,nl,et,en | 0 |
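
For readers who want to reproduce figures like those in the table above, here is a minimal sketch of how such per-language tallies could be computed from labelled predictions. It is not the benchmark script from this repository. The function name, the input shape, and the reading of the columns are assumptions: "Mislabels" is taken to mean the languages a sentence was wrongly predicted as, and "False Positives" the number of sentences from other languages that were wrongly predicted as this one.

```javascript
// Hypothetical tally, not the repository's benchmark code.
// samples: [{ expected: 'en', predicted: 'en' }, ...] (assumed input shape)
function tallyBenchmark(samples) {
  const stats = {}
  // Returns the stats row for a language, creating it on first use.
  const row = (lang) =>
    stats[lang] ||
    (stats[lang] = { count: 0, correct: 0, mislabels: {}, falsePositives: 0 })

  for (const { expected, predicted } of samples) {
    const r = row(expected)
    r.count += 1
    if (predicted === expected) {
      r.correct += 1
    } else {
      // Record which language the sentence was mistaken for...
      r.mislabels[predicted] = (r.mislabels[predicted] || 0) + 1
      // ...and charge a false positive to that language.
      row(predicted).falsePositives += 1
    }
  }

  // Accuracy is correct / count; languages seen only as wrong predictions get null.
  for (const r of Object.values(stats)) {
    r.accuracy = r.count === 0 ? null : r.correct / r.count
  }
  return stats
}
```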

Short Form (10 to 40 characters)

To test accuracy on shorter phrases, the minimum and maximum character counts were changed to 10 and 40. Results for major languages are similar, but accuracy for less widely used languages drops significantly:

| Language (102) | Symbol (alternates) | Count (837539) | Accuracy (10 - 40 chars) | Mislabels |
| --- | --- | --- | --- | --- |
| Thai | th | 3399 | 1 | |
| Malayalam | ml | 525 | 1 | |
| Burmese | my | 243 | 1 | |
| Tamil | ta | 229 | 1 | |
| Telugu | te | 220 | 1 | |
| Punjabi (Eastern) | pa | 156 | 1 | |
| Amharic | am | 154 | 1 | |
| Kannada | kn | 126 | 1 | |
| Gujarati | gu | 116 | 1 | |
| Sinhala | si | 37 | 1 | |
| Tibetan | bo | 29 | 1 | |
| Divehi, Dhivehi, Maldivian | dv | 15 | 1 | |
| Japanese | ja | 28060 | 0.9999643620812545 | zh |
| Greek | el | 24980 | 0.9999599679743795 | en |
| Hebrew | he | 26461 | 0.9999244170666264 | en,yi |
| Korean | ko | 6128 | 0.9996736292428199 | tr,ja |
| Armenian | hy | 1855 | 0.9994609164420485 | de |
| Bengali | bn | 4132 | 0.9992739593417231 | bpy,as |
| Marathi | mr | 25633 | 0.9989466703078064 | hi,gom,pt,new |
| English | en | 17094 | 0.9986544986544986 | nl,it,hu,eo,es |
| Mandarin Chinese | zh | 17801 | 0.9978652884669401 | wuu,yue,ja,sr,pt |
| Turkish | tr | 18879 | 0.9978282748026909 | en,eo,az,es,it |
| Russian | ru | 20855 | 0.9977942939343083 | uk,bg,mk,sr,be |
| German | de | 17223 | 0.9974452766649248 | en,it,fr,es,sv |
| Uyghur | ug | 6135 | 0.9973920130399349 | ar,ba,tt,ca,hu |
| Vietnamese | vi | 13130 | 0.9971058644325971 | it,pms,eo,pt,fr |
| Esperanto | eo | 21641 | 0.9966729818400258 | it,es,tr,pt,pl |
| Georgian | ka | 4550 | 0.996043956043956 | xmf,en |
| Hindi | hi | 11497 | 0.9958249978255197 | mr,dty,new,bh,ne |
| Italian | it | 20449 | 0.995598806787618 | es,en,fr,eo,pt |
| Arabic | ar | 25531 | 0.9955348399984333 | arz,fa,en,mzn,ps |
| French | fr | 16040 | 0.9953865336658354 | en,it,ia,es,pt |
| Hungarian | hu | 20843 | 0.9952502039053879 | en,pt,it,nl,eo |
| Lao | lo | 183 | 0.994535519125683 | el |
| Polish | pl | 21386 | 0.9940147760216964 | en,it,eo,de,cs |
| Khmer | km | 1252 | 0.9920127795527156 | az,ru,sr,et |
| Spanish | es | 20498 | 0.9895599570689824 | pt,it,fr,ca,en |
| Finnish | fi | 20731 | 0.9849500747672567 | it,en,eo,et,nl |
| Portuguese | pt | 18352 | 0.9833805579773321 | es,it,gl,fr,en |
| Macedonian | mk | 23602 | 0.9830099144140327 | ru,bg,sr,uk |
| Ukrainian | uk | 23251 | 0.982667412154316 | ru,mk,bg,be,sr |
| Urdu | ur | 1583 | 0.9797852179406191 | pnb,fa,ug,en,ro |
| Dutch | nl | 19349 | 0.9720915809602564 | en,de,nds,af,fr |
| Lithuanian | lt | 24184 | 0.9597667879589812 | eo,fi,sr,pt,pl |
| Czech | cs | 25189 | 0.951605859700663 | sk,pl,hu,en,sl |
| Chuvash | cv | 1332 | 0.9481981981981982 | ru,uk,krc,ba,sr |
| Tatar | tt | 8283 | 0.9471206084751902 | ru,tr,az,kk,ky |
| Swedish | sv | 24466 | 0.9464563067113545 | da,no,en,de,eo |
| Icelandic | is | 7745 | 0.9449967721110394 | da,et,cs,no,de |
| Bulgarian | bg | 19328 | 0.9352235099337748 | mk,ru,uk,sr,tg |
| Sanskrit | sa | 135 | 0.9259259259259259 | hi,ne,mr |
| Kazakh | kk | 2373 | 0.9258322798145807 | uk,tt,tr,ru,ky |
| Romanian | ro | 18367 | 0.9235041106332008 | it,es,en,fr,pt |
| Tagalog | tl | 11133 | 0.9193389023623462 | ceb,en,it,id,es |
| Ossetian | os | 205 | 0.9170731707317074 | ru,hy,sr,kv,mrj |
| Indonesian | id | 9707 | 0.9138765839085197 | ms,en,it,eo,tr |
| Danish | da | 22539 | 0.9081591907360576 | no,sv,de,en,fr |
| Latin | la | 24699 | 0.8979310903275436 | it,fr,en,es,pt |
| Basque | eu | 4570 | 0.8851203501094091 | it,id,hu,nl,eo |
| Belarusian | be | 9005 | 0.8785119378123265 | ru,uk,bg,mk,pl |
| Cornish | kw | 3757 | 0.8759648655842428 | en,de,cy,es,br |
| Tajik | tg | 48 | 0.875 | ru,uk |
| Latvian | lv | 2198 | 0.8735213830755232 | lt,es,sr,en,fr |
| Breton | br | 5468 | 0.8579005120702268 | en,fr,pt,de,eu |
| Irish | ga | 1977 | 0.840161861406171 | en,pt,es,ca,gd |
| Bashkir | ba | 128 | 0.8359375 | tt,ru,sr,av,kk |
| Sindhi | sd | 6 | 0.8333333333333334 | ur |
| Serbian | sr | 23128 | 0.8054738844690419 | hr,mk,sh,ru,sl |
| Estonian | et | 3077 | 0.8043548911277218 | fi,en,hu,tr,it |
| Scottish Gaelic | gd | 753 | 0.7822045152722443 | en,ga,de,fr,pam |
| Welsh | cy | 1167 | 0.7660668380462725 | en,es,kw,la,it |
| Volapük | vo | 3941 | 0.7609743719868054 | id,en,eo,fi,de |
| Kyrgyz | ky | 227 | 0.7533039647577092 | ru,kk,tt,mn,bg |
| Catalan | ca | 5313 | 0.7504234895539243 | es,pt,it,fr,en |
| Assamese | as | 2635 | 0.7127134724857686 | bn,bpy,en,tl,bh |
| Yoruba | yo | 31 | 0.7096774193548387 | ga,pl,en,qu,ckb |
| Occitan | oc | 4096 | 0.70751953125 | es,fr,ca,pt,it |
| Interlingua | ia | 14949 | 0.7073382834972239 | it,es,fr,en,la |
| Afrikaans | af | 3299 | 0.6808123673840558 | nl,en,de,fr,nds |
| Norwegian Nynorsk | nn (no) | 1287 | 0.6798756798756799 | da,sv,de,es,hu |
| Maltese | mt | 165 | 0.6727272727272727 | hu,en,es,it,pl |
| Slovak | sk | 13877 | 0.6105786553289616 | cs,pl,sl,no,sr |
| Chechen | ce | 25 | 0.6 | bg,sr,mn,ba,uk |
| Interlingue | ie | 6538 | 0.5183542367696543 | es,it,en,fr,eo |
| Ido | io | 6495 | 0.4857582755966128 | eo,es,it,pt,tr |
| Slovenian | sl | 908 | 0.46255506607929514 | sr,hr,cs,pl,bs |
| Javanese | jv | 548 | 0.45255474452554745 | id,en,ko,ms,hu |
| Turkmen | tk | 4585 | 0.45169029443838604 | tr,en,uz,et,pl |
| Croatian | hr | 4186 | 0.4362159579550884 | sr,sh,bs,sl,pl |
| Galician | gl | 3245 | 0.4200308166409861 | pt,es,it,en,fr |
| Luxembourgish | lb | 732 | 0.3975409836065574 | de,fr,en,nds,nl |
| Frisian | fy | 282 | 0.36879432624113473 | nl,en,nds,de,fr |
| Walloon | wa | 37 | 0.2972972972972973 | fr,en,no,it,gn |
| Corsican | co | 13 | 0.23076923076923078 | it,min,ro,ilo,id |
| Sundanese | su | 18 | 0.2222222222222222 | id,es,en,it,lmo |
| Somali | so | 61 | 0.14754098360655737 | en,fi,et,cy,su |
| Limburgan, Limburger, Limburgish | li | 34 | 0.14705882352941177 | de,nl,en,no,is |
| Haitian Creole | ht | 58 | 0.1206896551724138 | en,fr,br,la,de |
| Manx | gv | 30 | 0.06666666666666667 | en,it,pt,fr,kw |
| Bosnian | bs | 520 | 0.04423076923076923 | sr,hr,sh,it,pl |
| Aragonese | an | 73 | 0.0136986301369863 | es,pt,it,en,fr |
| Romansh | rm | 11 | 0 | it,pt,fr,en,tl |

Additional Insights

  • During testing, the highest probability for an incorrect prediction was often near 1, so a high probability cannot be taken as confirmation that a prediction is correct.

  • The lowest probability for a correct prediction varied widely. Although low probabilities were good predictors of errors for some of the very accurate languages (99.9%), correct predictions for other languages sometimes had a probability as low as 0.09. This means a low probability cannot be used as a reliable indicator of a false positive.

  • To better anticipate an incorrect result, you can use the difference in probability between results 1 and 2. The average probability difference between results 1 and 2 appears to be somewhat of an indicator of a potentially incorrect prediction.

  • Predictions for anything over 100 characters are highly accurate, though there aren't enough test sentences to confirm this for all the test languages. Of the 82 languages that had this data, 55 had 99% or better accuracy, 63 had 90%+ accuracy, and 72 had 75%+ accuracy. For 200+ characters, 42 of 47 languages had a perfect score, though most had fewer than 10 test cases.

  • Spanish tends to give the most false positives, whether measured by sheer quantity or by percentage.

  • Adding a second check with franc when the difference in probabilities between languages 1 and 2 was small (i.e. less than 0.2) showed significant benefit only for the worst-performing languages. There doesn't seem to be a trend for any other languages. You can see this data in COMPARISONS.md.
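
The probability-difference idea from the list above can be sketched as follows. The helper names are hypothetical and not part of the package. The code assumes results is the array returned by predict(text, 2) or a larger count, and the default threshold of 0.2 is the value mentioned for the franc experiment, not a tuned recommendation.

```javascript
// Hypothetical helpers, not part of the package.
// results: array of { lang, prob } objects, e.g. from lid.predict(text, 2)

// Difference between the two highest probabilities.
// Sorts a copy so the input order does not matter.
function probabilityMargin(results) {
  const probs = results.map((r) => r.prob).sort((a, b) => b - a)
  if (probs.length === 0) return 0
  if (probs.length === 1) return probs[0]
  return probs[0] - probs[1]
}

// Flags a prediction as worth double-checking when the top two candidates
// are close together. A small margin is only a hint, not proof of an error.
function needsSecondCheck(results, threshold = 0.2) {
  return probabilityMargin(results) < threshold
}
```

Because the list above notes that this signal is only "somewhat of an indicator", a flagged result is best treated as a prompt for a fallback check or human review rather than as a rejected prediction.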

Improving Accuracy

Most incorrect suggestions are due to non-text characters (e.g. punctuation), which should be filtered out before prediction for better results. Please submit an issue for incorrect suggestions so we can work on improving the accuracy.
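
One possible way to filter out non-text characters before calling predict is sketched below. The function name cleanText is hypothetical, and the regular expression is just one reasonable choice (keep Unicode letters, combining marks, and whitespace), not something the package does for you.

```javascript
// Hypothetical pre-processing step, not part of the package.
// Replaces anything that is not a Unicode letter (\p{L}), a combining
// mark (\p{M}), or whitespace with a space, then collapses repeated
// whitespace and trims the ends. Digits and punctuation are dropped.
function cleanText(text) {
  return text
    .replace(/[^\p{L}\p{M}\s]+/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim()
}
```

The cleaned string would then be passed on, for example lid.predict(cleanText(input)). Keeping combining marks matters for scripts where letters are composed of a base character plus marks.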

Comparison NPM Libraries

Accuracy has also been benchmarked against other popular libraries (notably franc and languagedetect); the results are included in benchmark-testing/results/COMPARISONS.md

Sample Dockerfile

Note: You need to have python installed to make this work in alpine-node

FROM mhart/alpine-node:14

WORKDIR /usr/src/app

COPY package*.json ./

RUN apk add --no-cache --virtual .gyp \
  python \
  make \
  g++ \
  && npm ci --only=production \
  && apk del .gyp

COPY . ./

CMD [ "npm", "start" ]

TODO List

This is an improved modification of https://www.npmjs.com/package/fasttext-lid

Created with <3 for https://smodin.io


Last updated on 18 Jul 2023
