
Research
/Security News
Toptal’s GitHub Organization Hijacked: 10 Malicious Packages Published
Threat actors hijacked Toptal’s GitHub org, publishing npm packages with malicious payloads that steal tokens and attempt to wipe victim systems.
lc4j
Java library (and program) implement the technique described in this paper:
Cavnar, W. B. and J. M. Trenkle, ''N-Gram-Based Text Categorization'' In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
and can be used to find the language in which any given input text is written by using a technique based on n-grams.
The idea behind this program and most of the sample texts have their roots in TextCat, a free Perl library which implements the text categorization algorithm.
This project was made in 2005 and updated in 2009, and finally published on Github around mid-2020.
No warranty is provided with the code; however, as far as I know, many people and international projects have found it useful and implemented it in their code over the years.
As of today (2020), this library offers a functionality which is offered as a PaaS service by many cloud providers; however, if one wants to incorporate the functionality in his own code and not depend on the cloud, this library could still be beneficial, even nowawadays.
Despite this library being written in Java, the code is engineered using memory-efficient and computational-efficient libraries such as fastutil, mg4j and colt (links below); also, eccessive usage of Java garbage collection is trying to be avoided by trying to re-use objects and avoid their creation in the first place, where possible.
Installation of lc4j
library is very simple, since there are no special needs for an installation at all.
If you want to use the plain library in your project, just build the JAR file with Maven and then copy the generated .jar
file from the target Maven folder to your desired destination directory (command to run to build: mvn package
)
Please note that to compile/run lc4j
you will also need the following libraries:
and they must be placed in your $CLASSPATH
env variable. Please note that you should also add to your classpath the lc4j
library.
The $CLASSPATH
can be set by sourcing this file at shell command prompt: it should work, since the paths provided are all relative to the home and to base repo folder.
In the models directory you will find some default (and valid) language models. These models have been extracted from the texts contained in the sample-texts directory.
To regenerate language-models starting from the *.txt
files contained in the sample-texts directory you can launch this command at your prompt:
for i in sample-texts/*.txt ; do $(cat $i | java net.olivo.lc4j.LanguageCategorization \
-c > models/`basename $i | sed s/.txt//g`.lm) ; done ;
(it will take a few minutes, depending also on your CPU processing speed)
To run lc4j just run:
java net.olivo.lc4j.LanguageCategorization
after properly setting your $CLASSPATH
(see installation instructions above) at your command prompt.
For more help and options, run:
java net.olivo.lc4j.LanguageCategorization -h
FAQs
Java library for language detection of a text snippet
We found that net.olivo:lc4j demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 0 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
/Security News
Threat actors hijacked Toptal’s GitHub org, publishing npm packages with malicious payloads that steal tokens and attempt to wipe victim systems.
Research
/Security News
Socket researchers investigate 4 malicious npm and PyPI packages with 56,000+ downloads that install surveillance malware.
Security News
The ongoing npm phishing campaign escalates as attackers hijack the popular 'is' package, embedding malware in multiple versions.