com.github.crawler-commons:urlfrontier

Package Overview
Maintainers
2
API definition, resources and reference implementation of URL Frontiers

  • 2.4
  • Source
  • Maven

URL Frontier


Discovering content on the web is possible thanks to web crawlers. Luckily, there are many excellent open-source solutions for this; however, most of them have their own way of storing and accessing information about URLs.

The aim of the URL Frontier project is to develop a crawler- and language-neutral API for the operations that web crawlers perform when communicating with a web frontier, e.g. get the next URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get statistics, etc. Such an API can be used by a variety of web crawlers, regardless of whether they are implemented in Java, like StormCrawler and Heritrix, or in Python, like Scrapy.
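To make the operations listed above concrete, here is a minimal in-memory sketch in Python. It is not the project's gRPC API or its reference implementation: the class `ToyFrontier` and every method name in it are hypothetical, chosen only to mirror the operations named in the paragraph (next URLs, status updates, per-host crawl rate, active hosts, statistics). Time is passed in explicitly as `now` so the behaviour is easy to follow.

```python
from collections import deque
from urllib.parse import urlsplit


class ToyFrontier:
    """Illustrative in-memory frontier; all names here are hypothetical."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.queues = {}        # hostname -> deque of pending URLs
        self.delays = {}        # hostname -> per-host delay override (seconds)
        self.next_allowed = {}  # hostname -> earliest time of the next fetch
        self.status = {}        # url -> "queued" | "in_progress" | "done"

    def put_url(self, url):
        """Register a discovered URL; return False if it is already known."""
        if url in self.status:
            return False
        self.status[url] = "queued"
        host = urlsplit(url).hostname
        self.queues.setdefault(host, deque()).append(url)
        return True

    def get_urls(self, now, max_urls=10):
        """Return at most one URL per host whose crawl delay has elapsed."""
        batch = []
        for host, queue in self.queues.items():
            if len(batch) >= max_urls:
                break
            if queue and now >= self.next_allowed.get(host, 0.0):
                url = queue.popleft()
                self.status[url] = "in_progress"
                delay = self.delays.get(host, self.default_delay)
                self.next_allowed[host] = now + delay
                batch.append(url)
        return batch

    def mark_done(self, url):
        """Update the information about a URL that has been processed."""
        self.status[url] = "done"

    def set_delay(self, host, seconds):
        """Change the crawl rate for one hostname."""
        self.delays[host] = seconds

    def active_hosts(self):
        """List hostnames that still have URLs waiting."""
        return [host for host, queue in self.queues.items() if queue]

    def stats(self):
        """Count URLs per status, plus the number of active hosts."""
        counts = {"queued": 0, "in_progress": 0, "done": 0}
        for state in self.status.values():
            counts[state] += 1
        counts["active_hosts"] = len(self.active_hosts())
        return counts


frontier = ToyFrontier(default_delay=1.0)
frontier.put_url("https://example.com/a")
frontier.put_url("https://example.com/b")
frontier.put_url("https://example.org/")
frontier.set_delay("example.com", 5.0)  # slow down one host only
print(frontier.get_urls(now=0.0))       # one URL per host
print(frontier.get_urls(now=1.0))       # example.com is still waiting
print(frontier.stats())
```

A real frontier service would expose these operations over the network and persist its state; the sketch only shows why a crawler needs nothing beyond this small set of calls, whatever language it is written in.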

The outcomes of the project are to:

  • design an API with gRPC, provide Java stubs for the API and instructions on how to achieve the same for other languages
  • deliver a robust reference implementation of the URL Frontier service
  • implement a command line client for basic interactions with a service
  • provide a test suite to check that any implementation of the API behaves as expected

One of the objectives of URL Frontier is to involve as many actors in the web crawling community as possible and get real users to give continuous feedback on our proposals.

Please use the project mailing list or Discussions section for questions, comments or suggestions.

There are many ways to get involved if you want to.

This project is funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.


License information

This project is available as open source under the terms of Apache 2.0. For accurate information, please check individual files.


Package last updated on 27 Sep 2024
