com.optimaize.languagedetector:language-detector
Language Detection Library for Java
<dependency>
<groupId>com.optimaize.languagedetector</groupId>
<artifactId>language-detector</artifactId>
<version>0.5</version>
</dependency>
User danielnaber has made available a profile for Esperanto on his website, see open tasks.
You can create a language profile for your own language easily. See https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md
The software uses language profiles which were created based on common text for each language. N-grams (http://en.wikipedia.org/wiki/N-gram) were then extracted from that text, and that's what is stored in the profiles.
When trying to figure out in what language a certain text is written, the program goes through the same process: it creates the same kind of n-grams from the input text. Then it compares their relative frequencies with those in each profile, and finds the language that matches best.
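The idea described above can be sketched in a few lines of plain Java. This is a simplified illustration, not the library's actual implementation: the class `NgramSketch`, its method names, and the "sum of minimum frequencies" overlap score are all assumptions made for this example, and it works on UTF-16 code units rather than full code points.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names and scoring are hypothetical, not the library's API. */
final class NgramSketch {

    /** Extracts all character n-grams of length 1..maxN and returns their relative frequency per length. */
    static Map<String, Double> relativeFrequencies(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        int[] totals = new int[maxN + 1]; // totals[n] = number of n-grams of length n
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
                totals[n]++;
            }
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) totals[e.getKey().length()]);
        }
        return freq;
    }

    /** Assumed similarity measure: sum of the smaller relative frequency over all shared n-grams. */
    static double overlap(Map<String, Double> a, Map<String, Double> b) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                score += Math.min(e.getValue(), other);
            }
        }
        return score;
    }

    /** Returns the name of the best-matching profile, or null if there are no profiles. */
    static String bestMatch(String input, Map<String, Map<String, Double>> profiles, int maxN) {
        Map<String, Double> inputFreq = relativeFrequencies(input, maxN);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> p : profiles.entrySet()) {
            double score = overlap(inputFreq, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }
}
```

Because frequencies are relative (divided by the total number of n-grams of that length), a profile built from a longer training text does not automatically score higher in this sketch.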
This software does not work as well when the input text is short or unclean, for example tweets.
When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the individual parts. Running the language guesser on the whole text will just tell you the language that is most dominant, in the best case.
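Splitting by sentence and detecting each part separately could look roughly like the sketch below. It uses only the JDK's `java.text.BreakIterator`; the class `PerSentenceDetection` and the `detector` function parameter are hypothetical stand-ins. In real use, the function would wrap a call such as `languageDetector.detect(textObjectFactory.forText(part))`.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Illustrative sketch only; not part of the library. */
final class PerSentenceDetection {

    /** Splits text into trimmed, non-empty sentences using the JDK sentence iterator. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Applies a detector (hypothetical: maps a text part to a language code) to every sentence. */
    static List<String> detectEach(String text, Function<String, String> detector) {
        List<String> languages = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            languages.add(detector.apply(sentence));
        }
        return languages;
    }
}
```

Keep in mind that single sentences are short, so each individual result is less reliable than a detection on a long text; splitting by paragraph trades granularity for accuracy.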
This software does not cope well when the input text is in none of the expected (and supported) languages. For example, if you only load the language profiles for English and German, but the text is written in French, the program may pick the more likely of the two, or say it doesn't know. (An improvement would be to clearly detect that the text is unlikely to be in one of the supported languages.)
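One possible way to make "it doesn't know" explicit is to reject a result unless the best candidate is both strong enough and clearly ahead of the runner-up. The sketch below is an assumption about how such a check could work, not what the library does: the class `UnknownLanguageSketch`, the score map, and both thresholds are hypothetical. It uses `java.util.Optional`, not the Guava `Optional`.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; scores and thresholds are hypothetical inputs. */
final class UnknownLanguageSketch {

    /**
     * Returns the best-scoring language only if its score reaches minScore
     * and it leads the second-best score by at least minMargin.
     */
    static Optional<String> pick(Map<String, Double> scores, double minScore, double minMargin) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        double secondScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double s = e.getValue();
            if (s > bestScore) {
                secondScore = bestScore;
                bestScore = s;
                best = e.getKey();
            } else if (s > secondScore) {
                secondScore = s;
            }
        }
        if (best == null || bestScore < minScore || bestScore - secondScore < minMargin) {
            return Optional.empty(); // no confident answer
        }
        return Optional.of(best);
    }
}
```

Suitable threshold values depend on how the scores are computed and would need to be tuned on real data.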
If you are looking for a language detector / language guesser library in Java, this seems to be the best open source library you can get at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/
//load all languages:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
//build language detector:
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
.withProfiles(languageProfiles)
.build();
//create a text object factory
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
//query:
TextObject textObject = textObjectFactory.forText("my text");
Optional<String> lang = languageDetector.detect(textObject);
//create text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forIndexingCleanText();
//load your training text:
TextObject inputText = textObjectFactory.create()
.append("this is my")
.append("training text");
//create the profile:
LanguageProfile languageProfile = new LanguageProfileBuilder("en")
.ngramExtractor(NgramExtractors.standard())
.minimalFrequency(5) //adjust please
.addText(inputText)
.build();
//store it to disk if you like:
new LanguageProfileWriter().writeToDirectory(languageProfile, "c:/foo/bar");
For the profile name, use the ISO 639-1 language code if there is one, otherwise the ISO 639-3 code.
The training text should be rather clean; it is a good idea to remove parts written in other languages (like English phrases, or Latin script content in a Cyrillic text for example). Some also like to remove proper nouns like (international) place names in case there are too many. It's up to you how far you go. As a general rule, the cleaner the text, the better its profile. If you scrape text from Wikipedia, then please only use the main content, without the left side navigation etc.
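As one example of such cleaning, the sketch below drops whitespace-separated tokens that contain letters from a script other than the expected one, for instance Latin words inside a Cyrillic text. It relies only on the JDK's `Character.UnicodeScript`; the class `ScriptFilter` and the method `keepScript` are hypothetical names, and the approach is deliberately crude (it cannot remove foreign-language phrases written in the same script).

```java
/** Illustrative sketch only; not part of the library. */
final class ScriptFilter {

    /** Keeps only tokens whose letters all belong to the wanted script; digits and punctuation are ignored. */
    static String keepScript(String text, Character.UnicodeScript wanted) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            boolean hasForeignLetter = token.codePoints()
                    .filter(Character::isLetter)
                    .anyMatch(cp -> Character.UnicodeScript.of(cp) != wanted);
            if (!hasForeignLetter) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

A call such as `ScriptFilter.keepScript(rawText, Character.UnicodeScript.CYRILLIC)` would then be applied to the raw text before it is appended to the training `TextObject`.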
The profile size should be similar to the existing profiles for practical reasons. When the likelihood of a language is computed, the index size is factored in, so a language with a larger profile does not have a higher probability of being chosen.
Please contribute your new language profile to this project. The file can be added to the languages folder, and then referenced in the BuiltInLanguages class. Or else open a ticket, and provide a download link.
Also, it's a good idea to put the original text along with the modifying (cleaning) code into a new project on GitHub. This gives others the possibility to improve on your work. Or maybe even use the training text in other, non-Java software.
If your language is not supported yet, then you can provide clean "training text", that is, common text written in your language. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open a ticket.
If your language is supported already, but not identified clearly all the time, you can still provide such training text. We might then be able to improve detection for your language.
If you're a programmer, dig in the source and see what you can improve. Check the open tasks.
This is a fork from https://code.google.com/p/lang-guess/ (forked on 2014-02-27) which itself is a fork of the original project https://code.google.com/p/language-detection/
Apache2 license, just like the work from which this is derived. (I had temporarily changed it to LGPLv3, but that change was invalid and therefore reverted.)
The software works well, but there are things that can be improved. Check the Issues list.
The original project hasn't seen any commits in a while, and its issue list is growing. The news page for 2012 says that it now has Maven support, but there is no pom in git. Version 1.1-20120112 has been released to Maven (see http://mvnrepository.com/artifact/com.cybozu.labs/langdetect/1.1-20120112), but that release is not in git. So I don't know what's going on there.
The lang-guess fork saw quite a few commits in 2011 and up to March 2012, then nothing more. It uses Maven.
The two projects are not in sync; it looks like they stopped integrating each other's changes.
Both are on Google Code; I believe that GitHub is a much better place for contributing.
My goals were to bring the code up to current standards and to update it for Java 7. I quickly noticed that I would have to touch pretty much all of the code. Given the status of the other two projects, I figured I had better start my own. This ensures that my work is published.
An adapted version of this is used by the http://www.NameAPI.org server.
https://www.languagetool.org/ is proofreading software for LibreOffice/OpenOffice, for the desktop, and for Firefox.
Apache 2 (business friendly)
The project is on Maven Central (http://search.maven.org/#artifactdetails%7Ccom.optimaize.languagedetector%7Clanguage-detector%7C0.4%7Cjar). This is the latest version:
<dependency>
<groupId>com.optimaize.languagedetector</groupId>
<artifactId>language-detector</artifactId>
<version>0.5</version>
</dependency>