
io.github.kju2.languagedetector:language-detector

Detect the language of a given text. This library is able to distinguish 68 languages.

Version published: 6 years ago

Maintainers: 5

language-detector

Language Detection Library for Java

<dependency>
    <groupId>io.github.kju2.languagedetector</groupId>
    <artifactId>language-detector</artifactId>
    <version>1.0.3</version>
</dependency>

How to Use

LanguageDetector detector = new LanguageDetector();

String text = "This is the text you want to know the language it is written in.";
Language detectedLanguage = detector.detectPrimaryLanguageOf(text);

Language Support

68 Built-in Language Profiles

AFRIKAANS (af)
ALBANIAN (sq)
ARABIC (ar)
ARAGONESE (an)
BASQUE (eu)
BELARUSIAN (be)
BENGALI (bn)
BRETON (br)
BULGARIAN (bg)
CATALAN (ca)
CENTRAL_KHMER (km)
CHINESE (zh)
CROATIAN (hr)
CZECH (cs)
DANISH (da)
DUTCH (nl)
ENGLISH (en)
ESTONIAN (et)
FINNISH (fi)
FRENCH (fr)
GALICIAN (gl)
GERMAN (de)
GREEK (el)
GUJARATI (gu)
HAITIAN (ht)
HEBREW (he)
HINDI (hi)
HUNGARIAN (hu)
ICELANDIC (is)
INDONESIAN (id)
IRISH (ga)
ITALIAN (it)
JAPANESE (ja)
KANNADA (kn)
KOREAN (ko)
LATVIAN (lv)
LITHUANIAN (lt)
MACEDONIAN (mk)
MALAY (ms)
MALAYALAM (ml)
MALTESE (mt)
MARATHI (mr)
NEPALI (ne)
NORWEGIAN (no)
OCCITAN (oc)
PANJABI (pa)
PERSIAN (fa)
POLISH (pl)
PORTUGUESE (pt)
ROMANIAN (ro)
RUSSIAN (ru)
SERBIAN (sr)
SLOVAK (sk)
SLOVENIAN (sl)
SOMALI (so)
SPANISH (es)
SWAHILI (sw)
SWEDISH (sv)
TAGALOG (tl)
TAMIL (ta)
TELUGU (te)
THAI (th)
TURKISH (tr)
UKRAINIAN (uk)
URDU (ur)
VIETNAMESE (vi)
WELSH (cy)
YIDDISH (yi)

Other Languages

You can create a language profile for your own language easily. See https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md

How it Works

The software uses language profiles built from typical text in each language. N-grams (http://en.wikipedia.org/wiki/N-gram) were extracted from that text and stored in the profiles.

When determining the language of a text, the program goes through the same process: it extracts the same kind of n-grams from the input text, compares their relative frequencies against each profile, and picks the language that matches best.
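The idea can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class name NGramSketch, its methods, the choice of 1- to 3-character n-grams, and the use of cosine similarity as the "best match" measure are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; hypothetical names, NOT the library's actual code.
final class NGramSketch {

    // Builds a relative-frequency profile of character n-grams (n = 1..3, an assumed range).
    static Map<String, Double> profile(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + n), 1.0, Double::sum);
                total++;
            }
        }
        // Turn raw counts into relative frequencies (guarding against empty input).
        final double denominator = Math.max(total, 1);
        counts.replaceAll((gram, count) -> count / denominator);
        return counts;
    }

    // Cosine similarity between two profiles; an assumed stand-in for the real scoring.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the key of the most similar profile, or null if no profiles are given.
    static String bestMatch(String text, Map<String, Map<String, Double>> profiles) {
        Map<String, Double> input = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = similarity(input, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Profiles built from only a sentence or two, as in a quick experiment with this sketch, are far too small for real use; the point is only to show the extract, compare, and pick-the-best sequence described above.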

Challenges

This software works less well when the input text is short or unclean, as is the case with tweets, for example.

When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the individual parts. At best, running the language guesser on the whole text will only tell you which language is most dominant.
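Splitting by sentence could look roughly like the sketch below. The class PerSentenceDetection and its methods are hypothetical helpers, not part of this library; the detector is passed in as a plain function so the example stays self-contained. Sentence splitting uses the JDK's java.text.BreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper, not part of the library: detect the language of each sentence separately.
final class PerSentenceDetection {

    // Splits text into trimmed, non-empty sentences using the JDK sentence BreakIterator.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Applies the given detector function to every sentence and collects the results in order.
    static <T> List<T> detectEach(String text, Function<String, T> detector) {
        List<T> results = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            results.add(detector.apply(sentence));
        }
        return results;
    }
}
```

With this library, the function passed in could be a method reference such as detector::detectPrimaryLanguageOf, using the detector from the How to Use section. Keep in mind that single sentences are short input, so per-sentence results are likely to be less reliable than results for longer text.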

This software does not cope well with input text that is in none of the expected (and supported) languages. For example, if you load only the English and German language profiles but the text is written in French, the program may pick the more likely of the two, or say it doesn't know. (An improvement would be to detect clearly that the text is unlikely to be in any of the supported languages.)
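One simple way to approximate such a check is a minimum-score threshold, sketched below. This is an assumption-laden illustration: the class UnknownLanguageGuard, the per-language score map, and the threshold value are all hypothetical, and this README does not say whether or how the library exposes per-language scores.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical guard, not part of the library: treat low-scoring results as "unknown".
final class UnknownLanguageGuard {

    // Returns the best-scoring language only if its score reaches minScore;
    // otherwise returns an empty Optional, meaning "probably none of the loaded languages".
    static Optional<String> bestOrUnknown(Map<String, Double> scores, double minScore) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .filter(entry -> entry.getValue() >= minScore)
                .map(Map.Entry::getKey);
    }
}
```

A suitable threshold would have to be found empirically; one that is too high rejects valid short texts, and one that is too low lets unsupported languages through.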

If you are looking for a language detector / language guesser library in Java, this seems to be the best open source library you can get at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/

How You Can Help

If your language is not supported yet, then you can provide clean "training text", that is, common text written in your language. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open a ticket.

If your language is supported already, but not identified clearly all the time, you can still provide such training text. We might then be able to improve detection for your language.

If you're a programmer, dig into the source and see what you can improve. Check the open tasks.

History and Changes

This project is a fork of a fork; the original author is Nakatani Shuyo. For details, see https://github.com/optimaize/language-detector/wiki/History-and-Changes

License

Apache 2 (business friendly)

Authors

Nakatani Shuyo, Fabian Kessler, Francois ROLAND, Robert Theis, Kju2

For details, see https://github.com/optimaize/language-detector/wiki/Authors

References

Research Papers and Articles on Language Identification

Automatic Language Identification in Texts: A Survey (2018)

Libraries for Language Identification

Compact Language Detector 2 (C++, 83 languages supported) uses a Naïve Bayesian classifier for language identification. The library can handle HTML and is optimized for texts with 200 characters.
Compact Language Detector 3 (C++, ? languages supported) is used in Chromium and based on a neural network.
Langid (C, Javascript, Python) is a library for language detection with models for 97 languages included. It is based on Langid.py: an off-the-shelf language identification tool (2012).
Language-Detection (Java, Wiki, Slides) for 53 languages. This is the origin of this library and is unfortunately no longer maintained.

Webservices for Language Identification

Package last updated on 02 Dec 2018
