com.github.crawler-commons:crawler-commons
Crawler-Commons is a set of reusable Java components that implement functionality common to any web crawler.
These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.
There is a mailing list on Google Groups.
Using Maven, add the following dependency to your pom.xml:
<dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>crawler-commons</artifactId>
    <version>1.3</version>
</dependency>
Using Gradle, add the following to your build file:
dependencies {
    implementation group: 'com.github.crawler-commons', name: 'crawler-commons', version: '1.3'
}
We are glad to announce the 1.3 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. The new release includes multiple dependency upgrades, improvements to the automatic builds, and tighter protections against XXE vulnerabilities in the Sitemap parser.
We are glad to announce the 1.2 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. This version fixes an XXE vulnerability issue in the Sitemap parser and includes several improvements to the URL normalizer and the Sitemaps parser.
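As an illustration of the kind of XXE protection mentioned in the 1.2 and 1.3 notes, here is a minimal sketch that hardens a JDK SAX parser by refusing DOCTYPE declarations and external entities. It uses only standard JDK classes; the class and method names (XxeHardeningSketch, hardenedFactory, parses) are hypothetical, and this is not the code Crawler-Commons itself uses.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not part of Crawler-Commons.
public class XxeHardeningSketch {

    /** Builds a namespace-aware SAX factory that rejects DOCTYPE declarations. */
    static SAXParserFactory hardenedFactory() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refusing any DOCTYPE means no entity can be declared at all.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defence in depth: also switch off external entity resolution.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        return factory;
    }

    /** Returns true if the document parses, false if the parser rejects it. */
    static boolean parses(String xml) throws Exception {
        SAXParser parser = hardenedFactory().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String plain = "<urlset><url><loc>https://example.com/</loc></url></urlset>";
        String withDoctype = "<!DOCTYPE urlset [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]>"
                + "<urlset><url><loc>&xxe;</loc></url></urlset>";
        System.out.println("plain sitemap accepted: " + parses(plain));
        System.out.println("DOCTYPE sitemap accepted: " + parses(withDoctype));
    }
}
```

Rejecting DOCTYPE outright is the bluntest option; it is reasonable for sitemaps, which do not need a DTD.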
We are glad to announce the 1.1 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details.
We are glad to announce the 1.0 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. Among other bug fixes and improvements this version adds support for parsing sitemap extensions (image, video, news, alternate links).
We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. This version contains among other things improvements to the Sitemap parsing and the removal of the Tika dependency.
We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details.
The main change is the removal of the DOM-based sitemap parser, as the SAX equivalent introduced in the previous version performs better and is more robust. You might need to change your code to replace SiteMapParserSAX with SiteMapParser.
The parser is now aware of namespaces and, by default, does not force the namespace to be the one recommended in the specification (http://www.sitemaps.org/schemas/sitemap/0.9), as variants can be found in the wild. You can set the behaviour using the method setStrictNamespace(boolean).
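To make the strict versus lenient namespace distinction concrete, here is a minimal sketch built on the JDK SAX parser rather than on Crawler-Commons. The class NamespaceCheckSketch, the method extractLocs and the strictNamespace parameter are hypothetical names chosen for illustration; the library's own strict mode may behave differently (for example, by reporting an error instead of skipping elements).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not part of Crawler-Commons.
public class NamespaceCheckSketch {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Collects <loc> values; in strict mode only elements in SITEMAP_NS count. */
    static List<String> extractLocs(String xml, boolean strictNamespace) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so the handler sees namespace URIs
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        List<String> locs = new ArrayList<>();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private StringBuilder text; // non-null only while inside an accepted <loc>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                boolean accepted = !strictNamespace || SITEMAP_NS.equals(uri);
                if ("loc".equals(localName) && accepted) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("loc".equals(localName) && text != null) {
                    locs.add(text.toString().trim());
                    text = null;
                }
            }
        });
        return locs;
    }

    public static void main(String[] args) throws Exception {
        // A namespace variant of the kind found in the wild.
        String variant = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\">"
                + "<url><loc>https://example.com/a</loc></url></urlset>";
        System.out.println("lenient: " + extractLocs(variant, false));
        System.out.println("strict:  " + extractLocs(variant, true));
    }
}
```

In lenient mode the variant sitemap still yields its URL; in strict mode the same document yields nothing, which is why a lenient default is convenient for real-world crawling.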
As usual, version 0.9 contains numerous improvements and bug fixes, and all users are invited to upgrade to it.
We are glad to announce the 0.8 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are the removal of the HTTP fetcher support, which has been put in a separate project. We also added a SAX-based parser for processing sitemaps, which requires less memory and is more robust to malformed documents than its DOM-based counterpart. The latter has been kept for now but might be removed in the future.
We are glad to announce the 0.7 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are that Crawler-Commons now requires Java 8 and that the package crawlercommons.url has been replaced with crawlercommons.domains. If your project uses Crawler-Commons, you might want to run the following command on it:
find . -type f -print0 | xargs -0 sed -i 's/import crawlercommons\.url\./import crawlercommons\.domains\./'
Please note also that this is the last release containing the HTTP fetcher support, which is deprecated and will be removed from the next version.
Version 0.7 contains numerous improvements and bug fixes, and all users are invited to upgrade to it.
We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt file included with the release for a full list of details.
We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central. Please note that the groupId has changed to com.github.crawler-commons.
The Java documentation can be found here.
The crawler-commons project is now being hosted at GitHub, due to the demise of Google code hosting.
We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing as well as an upgrade to Apache Tika 1.6.
See the CHANGES.txt file included with the release for a full list of details. Additionally the Java documentation can be found here.
We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at Maven Central.
We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of httpclient to v4.2.6.
See the CHANGES.txt file included with the release for a full list of details.
We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central.
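For readers unfamiliar with Googlebot-style patterns in robots.txt rules, the following sketch shows the usual semantics: a rule is a path prefix, '*' matches any run of characters, and a trailing '$' anchors the end of the URL path. The class RobotsPatternSketch and its methods are hypothetical and written only against java.util.regex; this is not how Crawler-Commons implements matching.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Crawler-Commons.
public class RobotsPatternSketch {

    /** Translates a robots.txt path rule into an equivalent Java regex. */
    static Pattern compile(String rule) {
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
        StringBuilder regex = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') {
                regex.append(".*"); // wildcard: any sequence of characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        if (!anchored) {
            regex.append(".*"); // rules without '$' are prefix matches
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the URL path is covered by the rule. */
    static boolean matches(String rule, String path) {
        return compile(rule).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/fish", "/fishing/rods"));
        System.out.println(matches("/*.php$", "/index.php"));
        System.out.println(matches("/*.php$", "/index.php?x=1"));
    }
}
```

Quoting every literal character keeps regex metacharacters such as '.' or '?' in a rule from being interpreted as regex syntax.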
This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.
See the CHANGES.txt file included with the release for a full list of details.
Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See Apache Nutch v1.7 Released for more details.
See Apache Nutch v2.2 Released for more details.
This release improves robots.txt and sitemap parsing support.
See the CHANGES.txt file included with the release for a full list of details.
Published under Apache License 2.0, see LICENSE