com.salesforce.transmogrifai:language-detector
Language Detection Library for Java
<dependency>
<groupId>com.salesforce.transmogrifai</groupId>
<artifactId>language-detector</artifactId>
<version>0.0.1</version>
</dependency>
User danielnaber has made available a profile for Esperanto on his website, see open tasks.
There are two kinds of profiles: the standard ones, created from Wikipedia articles and similar text, and the "short text" profiles, created from Twitter tweets. Fewer language profiles exist for short text; more could be made available, see https://github.com/optimaize/language-detector/issues/57
You can create a language profile for your own language easily. See https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md
The software uses language profiles that were created from common text for each language. N-grams (http://en.wikipedia.org/wiki/N-gram) were extracted from that text, and those n-grams are what is stored in the profiles.
When trying to figure out in what language a certain text is written, the program goes through the same process: it creates the same kind of n-grams from the input text, compares their relative frequencies with those stored in the profiles, and picks the language that matches best.
This software does not work as well when the input text to analyze is short or unclean, for example tweets.
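The process described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not the library's code: the class `NgramSketch`, its methods, and the tiny training sentences are all made up for this example, and plain cosine similarity is swapped in as the comparison step in place of whatever scoring the library actually uses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: character n-gram profiles compared by cosine similarity.
final class NgramSketch {

    // Relative frequency of every character n-gram of length n in the text.
    static Map<String, Double> profile(String text, int n) {
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + n <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + n), 1.0, Double::sum);
            total++;
        }
        final double denominator = Math.max(total, 1);
        counts.replaceAll((gram, count) -> count / denominator);
        return counts;
    }

    // Cosine similarity of two profiles: 0 means no shared n-grams, 1 means identical.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the key of the profile that matches the text best, or null if there are none.
    static String bestMatch(String text, Map<String, Map<String, Double>> profiles, int n) {
        Map<String, Double> input = profile(text, n);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = similarity(input, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy "training text"; real profiles are built from far more text.
        Map<String, Map<String, Double>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat", 3));
        profiles.put("de", profile("der schnelle braune fuchs springt über den faulen hund und die katze", 3));
        System.out.println(bestMatch("the dog and the fox", profiles, 3));
    }
}
```

With training text this small the result is fragile, which is consistent with the note that short or unclean input is harder to classify.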
When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the language of each part. Running the language guesser on the whole text will, at best, just tell you the most dominant language.
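A minimal sketch of the splitting approach, assuming sentence boundaries from the JDK's `java.text.BreakIterator` are good enough for your text. The class `PerSentenceDetection` and the toy detector in `main` are hypothetical; in practice the `Function<String, String>` would wrap a call such as `languageDetector.detect(textObjectFactory.forText(sentence))`.

```java
import java.text.BreakIterator;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: split mixed-language text and detect each sentence separately.
final class PerSentenceDetection {

    // Splits text into trimmed, non-empty sentences using the JDK sentence iterator.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Applies the given detector to each sentence and keeps the pairs in document order.
    static List<Map.Entry<String, String>> detectPerSentence(
            String text, Function<String, String> detector) {
        List<Map.Entry<String, String>> results = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            results.add(new SimpleEntry<>(sentence, detector.apply(sentence)));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in detector for illustration only; replace with a real language detector.
        Function<String, String> toyDetector = s -> s.contains(" ist ") ? "de" : "en";
        for (Map.Entry<String, String> e :
                detectPerSentence("This is English. Das ist Deutsch.", toyDetector)) {
            System.out.println(e.getValue() + ": " + e.getKey());
        }
    }
}
```

Keep in mind that per-sentence input is shorter, so each individual detection is less reliable than one run over a long single-language text.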
This software does not cope well when the input text is in none of the expected (and supported) languages. For example, if you only load the language profiles for English and German, but the text is written in French, the program may pick the more likely of the two, or say it doesn't know. (An improvement would be to clearly detect that the text is unlikely to be in one of the supported languages.)
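One way to approximate the "say it doesn't know" behaviour on the caller's side is to reject weak or ambiguous results. The sketch below is an assumption-laden illustration, not part of the library: the class `UnknownLanguageGuard`, the score map, and both thresholds are hypothetical, and it presumes you can obtain a per-language score from whatever detector you use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: accept the best language only if it is strong and clearly ahead.
final class UnknownLanguageGuard {

    // Returns the top-scoring language, or empty if its score is below minScore
    // or its lead over the runner-up is smaller than minMargin.
    static Optional<String> pick(Map<String, Double> scores, double minScore, double minMargin) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        double secondScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double score = e.getValue();
            if (score > bestScore) {
                secondScore = bestScore;
                bestScore = score;
                best = e.getKey();
            } else if (score > secondScore) {
                secondScore = score;
            }
        }
        if (best == null || bestScore < minScore) {
            return Optional.empty();
        }
        if (bestScore - secondScore < minMargin) {
            return Optional.empty();
        }
        return Optional.of(best);
    }

    public static void main(String[] args) {
        // Made-up scores, as might result from French text with only en/de profiles loaded.
        Map<String, Double> scores = new HashMap<>();
        scores.put("en", 0.35);
        scores.put("de", 0.33);
        System.out.println(pick(scores, 0.5, 0.2).orElse("unknown"));
    }
}
```

The thresholds are arbitrary here and would need tuning against real text for the set of profiles you load.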
If you are looking for a language detector / language guesser library in Java, this seems to be the best open source library you can get at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/
// Load all built-in language profiles:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();

// Build the language detector:
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();

// Create a text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

// Query:
TextObject textObject = textObjectFactory.forText("my text");
Optional<LdLocale> lang = languageDetector.detect(textObject);
See https://github.com/optimaize/language-detector/wiki/Creating-Language-Profiles
If your language is not supported yet, then you can provide clean "training text", that is, common text written in your language. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open a ticket.
If your language is supported already, but not identified clearly all the time, you can still provide such training text. We might then be able to improve detection for your language.
If you're a programmer, dig into the source and see what you can improve. Check the open tasks.
Loading all 71 language profiles uses 74 MB of RAM to store the data in memory. For memory considerations, see https://github.com/optimaize/language-detector/wiki/Memory-Consumption
This project is a fork of a fork; the original author is Nakatani Shuyo. For details, see https://github.com/optimaize/language-detector/wiki/History-and-Changes
An adapted version of this is used by the http://www.NameAPI.org server.
https://www.languagetool.org/ is proofreading software for LibreOffice/OpenOffice, for the desktop, and for Firefox.
Apache 2 (business friendly)
Nakatani Shuyo, Fabian Kessler, Francois ROLAND, Robert Theis
For details, see https://github.com/optimaize/language-detector/wiki/Authors
The project is in Maven Central (https://search.maven.org/search?q=g:com.salesforce.transmogrifai%20AND%20a:language-detector). This is the latest version:
<dependency>
<groupId>com.salesforce.transmogrifai</groupId>
<artifactId>language-detector</artifactId>
<version>0.0.1</version>
</dependency>
We found that com.salesforce.transmogrifai:language-detector has an unhealthy version release cadence and level of project activity: the last version was released a year ago, and 0 open source maintainers are collaborating on the project.