schul-cloud-url-crawler
=======================

.. image:: https://travis-ci.org/schul-cloud/url-crawler.svg?branch=master
   :target: https://travis-ci.org/schul-cloud/url-crawler
   :alt: Build Status

.. image:: https://badge.fury.io/py/schul-cloud-url-crawler.svg
   :target: https://pypi.python.org/pypi/schul-cloud-url-crawler
   :alt: Python Package Index

This crawler fetches resources from URLs and posts them to a server.

Purpose
-------

The purpose of this crawler is:

- We can provide test data to the API.
- It can crawl resources which are not active services and therefore cannot
  post themselves.
- Other crawl services can use this crawler to upload their conversions.
- It has the full crawler logic but does not transform the resources into
  other formats.
- Maybe we can create recommendations or a library for crawlers from this
  use case.

Requirements
------------

The crawler should work as follows:

- Provide URLs

  - as command line arguments
  - as a link to a file with one URL per line

- Provide resources_

  - as one resource in a file
  - as a list of resources

The crawler must be invoked to crawl; an example invocation with a URL file
is sketched below.
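
The following is a minimal sketch of both ways of providing URLs. It assumes
that a link to a URL file is passed like any other argument and that the file
is served somewhere the crawler can reach it; all file names and URLs below
are made up for illustration.

.. code:: shell

   # Hypothetical URL file with one resource URL per line.
   printf '%s\n' \
       "https://example.org/ressource-1.json" \
       "https://example.org/ressource-2.json" > urls.txt

   # Crawl both URLs given as command line arguments ...
   python3 -m ressource_url_crawler http://localhost:8080 \
            https://example.org/ressource-1.json \
            https://example.org/ressource-2.json

   # ... or pass a link to the hosted URL file instead (assumed form).
   python3 -m ressource_url_crawler http://localhost:8080 \
            http://static.example.org/urls.txt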

Example
-------

This example gets a resource from the URL and posts it to the API.

.. code:: shell

   python3 -m ressource_url_crawler http://localhost:8080 \
            https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json
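
To check that the resource arrived, you can query the API afterwards. This is
a sketch only: it assumes the server exposes the resource collection under
``/v1/ressources``, which depends on the server implementation.

.. code:: shell

   # List the stored resources; adjust the path to your server.
   curl -i http://localhost:8080/v1/ressources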

Authentication
--------------

You can specify the authentication_ like this:

- ``--basic=username:password`` for basic authentication
- ``--apikey=apikey`` for api key authentication
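
Combined with the example above, an invocation could look like this. This is
a sketch: it assumes the authentication options are accepted before the
positional arguments, and the credentials are placeholders.

.. code:: shell

   # Basic authentication with made-up credentials.
   python3 -m ressource_url_crawler --basic=user:secret \
            http://localhost:8080 \
            https://example.org/ressource.json

   # API key authentication with a made-up key.
   python3 -m ressource_url_crawler --apikey=0123456789abcdef \
            http://localhost:8080 \
            https://example.org/ressource.json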

Further Requirements
--------------------

This may require some form of state for the crawler.
The state could be added to the resources in an
``X-Ressources-Url-Crawler-Source`` field.
This would allow local caching but requires getting the objects back from
the database.
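
Below is a minimal sketch of how such state could be used, assuming the
stored resources are readable under ``/v1/ressources`` and carry the field
as a plain attribute; neither is implemented behaviour.

.. code:: shell

   # List the crawl sources already known to the server so that URLs which
   # are already present could be skipped. Endpoint and field layout are
   # assumptions.
   curl -s http://localhost:8080/v1/ressources \
       | grep -o '"X-Ressources-Url-Crawler-Source": *"[^"]*"'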

.. _resources: https://github.com/schul-cloud/ressources-api-v1#ressources-api
.. _authentication: https://github.com/schul-cloud/ressources-api-v1#authorization