CoCrawler is a versatile web crawler built using modern tools and concurrency.
Crawling the web can be easy or hard, depending upon the details. Mature crawlers like Nutch and Heritrix work great in many situations, and fall short in others. Some of the most demanding crawl situations include open-ended crawling of the whole web.
The goal of this project is to create a modular crawler with pluggable modules, capable of working well across a wide variety of crawl tasks. The core of the crawler is written in Python 3.5+ using coroutines.
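As a flavor of that coroutine style, here is a minimal sketch using asyncio and aiohttp; the function names are illustrative, not CoCrawler's actual internals:

```python
# Minimal coroutine-based fetching sketch (illustrative; not CoCrawler's API).
import asyncio
import aiohttp

async def fetch(session, url):
    # A real crawler would also handle timeouts, redirects, robots.txt,
    # and retries here.
    async with session.get(url) as resp:
        return url, resp.status, await resp.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        # Coroutines allow many fetches to be in flight on a single thread.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

results = asyncio.get_event_loop().run_until_complete(
    crawl(['https://example.com/']))
```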
CoCrawler is pre-release and undergoing major restructuring. It currently crawls at roughly 170 megabits/sec and 170 pages/sec on a 4-core machine.
We recommend that you use pyenv, because (1) CoCrawler requires Python 3.5+, and (2) requirements.txt specifies exact module versions.
```sh
git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
make init  # installs requirements using pip
make pytest
make test_coverage
```
Pluggable modules make the policy decisions, leaning on utility routines to keep each policy module short and sweet.
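The shape of such a policy module might look like the following sketch; the function name and config keys are hypothetical, not CoCrawler's real plugin interface:

```python
# Hypothetical policy module; names invented for illustration.
from urllib.parse import urlparse

def url_allowed(url, config):
    """Policy decision: should this URL enter the crawl frontier?"""
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https'):
        return False
    # In practice a shared utility routine would handle lookups like this,
    # keeping the policy module itself short.
    return parts.hostname not in config.get('banned_hosts', set())
```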
An additional set of pluggable modules provide support for a variety of databases. These databases are mostly used to orchestrate the cooperation of multiple crawl processes, enabling the horizontal scalability of the crawler over many cores and many nodes.
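As one way this can work (an assumption for illustration; the actual database modules and key names differ), a shared queue in an external store lets independent crawler processes pull work without talking to each other directly:

```python
# Illustrative only: a Redis list as a shared crawl frontier, letting
# multiple crawler processes on multiple nodes cooperate.
import redis

r = redis.Redis(host='localhost', port=6379)

def enqueue(url):
    r.lpush('frontier', url)  # any process may add work

def next_url(timeout=5):
    item = r.brpop('frontier', timeout=timeout)  # blocking pop
    return item[1].decode() if item else None
```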
Crawled web assets are intended to be stored as WARC files, although this interface should also be pluggable.
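For reference, writing a fetched page as a WARC response record with the warcio library looks roughly like this; treat it as a sketch rather than a description of CoCrawler's storage module:

```python
# Sketch: write one fetched page to a gzipped WARC file with warcio.
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open('crawl-00000.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')
    record = writer.create_warc_record('https://example.com/', 'response',
                                       payload=BytesIO(b'<html>...</html>'),
                                       http_headers=http_headers)
    writer.write_record(record)
```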
Everyone knows that ranking is extremely important to search queries, but it's also important to crawling. Crawling the most important stuff is one of the best ways to avoid crawling too much webspam, soft 404s, and crawler trap pages.
SEO is a multi-billion-dollar industry built to game search engine rankings, and any crawl of a wide swath of the web will run into poor-quality content dressed up to look high-quality. There's little chance that CoCrawler's algorithms will beat the most sophisticated SEO techniques, but a little ranking goes a long way.
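Even a crude priority scheme shows the idea: score every discovered URL and always fetch the highest-ranked work first. The scoring heuristic below is a placeholder, not CoCrawler's actual ranking:

```python
# Toy prioritized frontier: heapq pops the smallest value, so scores are
# negated to fetch the highest-ranked URLs first.
import heapq
from urllib.parse import urlparse

frontier = []

def score(url):
    # Placeholder heuristic: prefer shallow paths on short hostnames.
    parts = urlparse(url)
    return 1.0 / (1 + parts.path.count('/') + len(parts.hostname or ''))

def add_url(url):
    heapq.heappush(frontier, (-score(url), url))

def next_url():
    return heapq.heappop(frontier)[1] if frontier else None
```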
CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or Less", which can be found at https://github.com/aosabook/500lines. It is also heavily influenced by the experiences that Greg acquired while working at blekko and the Internet Archive.
Apache 2.0