Efficient Language Detector


Efficient Language Detector (Nito-ELD or ELD) is a fast and accurate language detector. It is one of the fastest non-compiled detectors, while its accuracy is within the range of the heaviest and slowest detectors.

It's 100% vanilla Javascript, with easy installation and no dependencies.
ELD is also available in Python and PHP.

This is the first version of a port from the original PHP implementation; the structure might not be definitive, and the code could be further optimized.

  1. Installation
  2. How to use
  3. Benchmarks
  4. Languages

Installation

  • For Node.js
$ npm install eld
  • For Web, just download or clone the files
    git clone https://github.com/nitotm/efficient-language-detector-js

How to use?

Load ELD

  • At Node.js REPL
const { langDetector } = await import('eld')
  • At Node.js
import {langDetector} from 'eld'
  • At the Web Browser
<script type="module">
  import {langDetector} from './languageDetector.js' // Update path.
/* code */</script>
  • To load the minified version, which is not a module
<script src="minified/eld.min.js"></script>

Usage

console.log( langDetector.detect('Hola, cómo te llamas?') )

detect() expects a UTF-8 string and returns an object with a property named language, which will be either an ISO 639-1 code or false

{ language: 'es' }
{ language: false, error: 'Some error', scores: {} }
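Since detect() can return a false language, callers should handle that case explicitly. A minimal sketch of consuming the return shapes shown above; the helper name and hard-coded result objects are illustrative, not part of ELD's API:

```javascript
// Illustrative helper (not part of ELD): fall back to a default code
// when the detector could not identify the language.
function languageOrDefault(result, fallback = 'en') {
  // Per the shapes above: { language: 'es' } on success,
  // { language: false, error: '...', scores: {} } on failure.
  return result.language === false ? fallback : result.language
}

console.log(languageOrDefault({ language: 'es' })) // 'es'
console.log(languageOrDefault({ language: false, error: 'Some error', scores: {} })) // 'en'
```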
  • To get the best guess, deactivate the minimum length & confidence threshold; useful for benchmarking.
langDetector.detect('To', {cleanText: false, checkConfidence: false, minByteLength: 0, minNgrams: 1})
// cleanText: true removes URLs, domains, emails, alphanumerical strings & numbers
  • To retrieve the scores of all detected languages, set returnScores to true; this only needs to be done once
langDetector.returnScores = true
langDetector.detect('How are you? Bien, gracias')
// {'language': 'en', 'scores': {'en': 0.32, 'es': 0.31, ...}}
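The scores object maps ISO 639-1 codes to relative scores, so ranking the candidates is a short pass over Object.entries. A sketch; the helper name and the sample scores are hypothetical:

```javascript
// Illustrative helper: rank a scores object like the one returned
// when langDetector.returnScores is true.
function topLanguages(scores, n = 3) {
  return Object.entries(scores)
    .sort(([, a], [, b]) => b - a) // highest score first
    .slice(0, n)
    .map(([lang, score]) => ({ lang, score }))
}

const sample = { en: 0.32, es: 0.31, fr: 0.12 }
console.log(topLanguages(sample, 2)) // en first, then es
```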
  • To reduce the languages to be detected, there are 2 different options; each only needs to be executed once. (Check the available languages below)
let langsSubset = ['en', 'es', 'fr', 'it', 'nl', 'de']

// with dynamicLangsSubset() the detector executes normally, and then filters excluded languages
langDetector.dynamicLangsSubset(langsSubset)

// to remove the subset
langDetector.dynamicLangsSubset(false)
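Before passing a subset to dynamicLangsSubset(), it can help to filter the requested codes against the supported list (the full 60-code list appears under Languages below). A sketch; the helper name and the truncated code set are illustrative, not part of ELD:

```javascript
// Illustrative pre-check (not part of ELD): drop codes the detector
// does not support before building a subset.
const supportedCodes = new Set(['en', 'es', 'fr', 'it', 'nl', 'de']) // truncated for brevity

function validSubset(requested) {
  return requested.filter((code) => supportedCodes.has(code))
}

console.log(validSubset(['en', 'es', 'xx'])) // ['en', 'es'] — 'xx' is dropped
```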

If you regularly use the same subset, the optimal approach is to first use saveSubset() to download a new Ngrams database containing only the subset languages.

langDetector.saveSubset(langsSubset) // ONLY for the Web Browser; not included at minified files

Finally, import the new file, replacing the old Ngrams import in languageDetector.js

import {eld_ngrams} from './ngrams/ngrams-subset.js' // Or load other files ngrams-L.js, ngrams-xs.js

Benchmarks

I compared ELD with a variety of detectors, since the interesting part is the algorithm.

| URL | Version | Language |
|-----|---------|----------|
| https://github.com/nitotm/efficient-language-detector-js/ | 0.9.0 | Javascript |
| https://github.com/nitotm/efficient-language-detector/ | 1.0.0 | PHP |
| https://github.com/pemistahl/lingua-py | 1.3.2 | Python |
| https://github.com/CLD2Owners/cld2 | Aug 21, 2015 | C++ |
| https://github.com/google/cld3 | Aug 28, 2020 | C++ |
| https://github.com/wooorm/franc | 6.1.0 | Javascript |

Tests:

  • Tweets: 760KB, short sentences of 140 chars max.
  • Big test: 10MB, sentences in all 60 supported languages.
  • Sentences: 8MB; this is the Lingua sentences test, minus unsupported languages.

Short sentences are what ELD and most detectors focus on, as very short text is unreliable, but I also included the Lingua Word pairs (1.5MB) and Single words (880KB) tests to see how they all compare beyond their reliable limits.

These are the results: first accuracy, then execution time.

1. Lingua could have a small advantage, as it participates with 54 languages, 6 fewer than the others.
2. CLD2 and CLD3 return a list of languages; the ones not included in this test were discarded. But since they usually return only one language, I believe they are at a disadvantage. Also, I confirm that the CLD2 results for short text are correct: contrary to the test on the Lingua page, they did not use the parameter bestEffort = True, so their benchmark for CLD2 is unfair.

Lingua is the average accuracy winner, but at what cost: the same test that runs in under 6 seconds with ELD or CLD2 takes more than 5 hours with Lingua! It acts like brute-force software. Also, its lead comes from single words and word pairs, which are unreliable regardless of the detector.

I added ELD-L for comparison, which has a 2.3x bigger database, but only increases execution time marginally, a testament to the efficiency of the algorithm. ELD-L is not the main database as it does not improve language detection in sentences.

For a client-side solution, I included an all-in-one detector+Ngrams minified file of the standard version (M), and an XS version which still performs great for sentences. The XS version only weighs 865kb, and just 245kb gzipped. The standard version is 486kb gzipped.

Here is the average, per test, of Tweets, Big test & Sentences.

(Chart: Sentences tests average)

Languages

These are the ISO 639-1 codes of the 60 languages supported by Nito-ELD v1:

'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko', 'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'

Full name languages:

'Amharic', 'Arabic', 'Azerbaijani (Latin)', 'Belarusian', 'Bulgarian', 'Bengali', 'Catalan', 'Czech', 'Danish', 'German', 'Greek', 'English', 'Spanish', 'Estonian', 'Basque', 'Persian', 'Finnish', 'French', 'Gujarati', 'Hebrew', 'Hindi', 'Croatian', 'Hungarian', 'Armenian', 'Icelandic', 'Italian', 'Japanese', 'Georgian', 'Kannada', 'Korean', 'Kurdish (Arabic)', 'Lao', 'Lithuanian', 'Latvian', 'Malayalam', 'Marathi', 'Malay (Latin)', 'Dutch', 'Norwegian', 'Oriya', 'Punjabi', 'Polish', 'Portuguese', 'Romanian', 'Russian', 'Slovak', 'Slovene', 'Albanian', 'Serbian (Cyrillic)', 'Swedish', 'Tamil', 'Telugu', 'Thai', 'Tagalog', 'Turkish', 'Ukrainian', 'Urdu', 'Vietnamese', 'Yoruba', 'Chinese'
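The two lists above are parallel (same order), so a code-to-name lookup can be built by zipping them. A sketch with truncated arrays; the full pair of lists is given above:

```javascript
// Build a code → full-name lookup by zipping the parallel lists above
// (truncated here for brevity; both arrays keep the README's order).
const isoCodes = ['am', 'ar', 'az', 'be', 'bg', 'bn']
const fullNames = ['Amharic', 'Arabic', 'Azerbaijani (Latin)', 'Belarusian', 'Bulgarian', 'Bengali']

const languageName = Object.fromEntries(isoCodes.map((code, i) => [code, fullNames[i]]))

console.log(languageName['bn']) // 'Bengali'
```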

Future improvements

  • Train from bigger datasets, and more languages.
  • The tokenizer could separate characters from languages that have their own alphabet, potentially improving accuracy and reducing the N-grams database. Retraining and testing is needed.

If you wish to Donate for open source improvements, Hire me for private modifications / upgrades, or to Contact me, use the following link: https://linktr.ee/nitotm

Package last updated on 31 May 2023
