.. image:: https://readthedocs.org/projects/scrapy-redis/badge/?version=latest
   :alt: Documentation Status
   :target: https://readthedocs.org/projects/scrapy-redis/?badge=latest

.. image:: https://img.shields.io/pypi/v/scrapy-redis.svg
   :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://img.shields.io/pypi/pyversions/scrapy-redis.svg
   :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml/badge.svg
   :target: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml/badge.svg
   :target: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml/badge.svg
   :target: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml

.. image:: https://codecov.io/github/rmax/scrapy-redis/coverage.svg?branch=master
   :alt: Coverage Status
   :target: https://codecov.io/github/rmax/scrapy-redis

.. image:: https://img.shields.io/badge/security-bandit-green.svg
   :alt: Security Status
   :target: https://github.com/rmax/scrapy-redis
Redis-based components for Scrapy.
Features
--------

* Distributed crawling/scraping

  You can start multiple spider instances that share a single Redis queue.
  Best suited for broad multi-domain crawls.

* Distributed post-processing

  Scraped items get pushed into a Redis queue, meaning that you can start as
  many post-processing processes as needed, all sharing the same items queue.

* Scrapy plug-and-play components

  Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
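These components can be wired into a project with a small settings fragment.
The setting names below come from scrapy-redis itself; the Redis URL and
pipeline priority are only illustrative values:

```python
# Minimal settings.py sketch (illustrative values; adjust for your project).
# Route request scheduling and deduplication through Redis:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items into a shared Redis queue for post-processing workers:
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Example connection string; point this at your Redis instance.
REDIS_URL = "redis://localhost:6379"
```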
In this forked version, JSON-formatted data in Redis is supported. The data
contains ``url``, ``meta`` and other optional parameters, where ``meta`` is a
nested JSON object carrying sub-data. The spider extracts this data and sends
another ``FormRequest`` with ``url``, ``meta`` and additional ``formdata``.
For example:

.. code-block:: json

   {
     "url": "https://example.com",
     "meta": {
       "job-id": "123xsd",
       "start-date": "dd/mm/yy"
     },
     "url_cookie_key": "fertxsas"
   }
This data can then be accessed in a Scrapy spider through the response, e.g.
``request.url``, ``request.meta``, ``request.cookies``.
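As a rough illustration of what a consumer does with such a message, the
following stdlib-only sketch decodes a payload shaped like the example above
before the follow-up request would be built. No Redis connection is made; the
byte string stands in for a message popped from the queue:

```python
import json

# Stand-in for a raw message popped from the Redis queue.
raw = (
    b'{"url": "https://example.com",'
    b' "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},'
    b' "url_cookie_key": "fertxsas"}'
)

data = json.loads(raw)
url = data["url"]                       # target of the follow-up request
meta = data.get("meta", {})             # nested sub-data (job-id, start-date, ...)
cookie_key = data.get("url_cookie_key") # optional parameter

print(url, meta["job-id"], cookie_key)
```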
.. note:: This feature covers the basic case of distributing the workload
   across multiple workers. If you need more features like URL expiration or
   advanced URL prioritization, we suggest you take a look at the Frontera_
   project.
Requirements
------------

* Scrapy >= 2.0
* redis-py >= 4.0

Installation
------------

From pip:

.. code-block:: bash

   pip install scrapy-redis
From GitHub:

.. code-block:: bash

   git clone https://github.com/darkrho/scrapy-redis.git
   cd scrapy-redis
   python setup.py install
.. note:: To use the JSON-supported data feature, make sure you have not
   installed scrapy-redis through pip. If you already have, uninstall it
   first:

.. code-block:: bash

   pip uninstall scrapy-redis
Frontera_ is a web crawling framework consisting of a `crawl frontier`_ and
distribution/scaling primitives, allowing you to build a large-scale online
web crawler.

.. _Frontera: https://github.com/scrapinghub/frontera
.. _crawl frontier: http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html
.. bumpversion marker
Changelog
---------

* Fixed ``Scheduler`` not compatible with ``BaseDupeFilter`` (#294).
* Deprecated ``REDIS_START_URLS_BATCH_SIZE``.
* Added ``REDIS_ENCODING`` setting (default ``utf-8``).
* Default to ``CONCURRENT_REQUESTS`` value for ``REDIS_START_URLS_BATCH_SIZE``.
* Added ``REDIS_START_URLS_KEY`` setting.
* Updated ``from_crawler`` signature.
* Support ``redis_cls`` parameter in ``REDIS_PARAMS`` setting.
* Added ``SCHEDULER_SERIALIZER`` setting.
* Added ``DUPEFILTER_CLASS`` setting.
* Added ``SCHEDULER_FLUSH_ON_START`` setting.
* Added ``REDIS_START_URLS_AS_SET`` setting.
* Added ``REDIS_ITEMS_KEY`` setting.
* Added ``REDIS_ITEMS_SERIALIZER`` setting.
* Added ``REDIS_PARAMS`` setting.
* Added ``REDIS_START_URLS_BATCH_SIZE`` spider attribute to read start urls
  in batches.
* Added ``RedisCrawlSpider``.
* Added ``-a domain=...`` option for example spiders.
* Added ``REDIS_URL`` setting to support Redis connection string.
* Added ``SCHEDULER_IDLE_BEFORE_CLOSE`` setting to prevent the spider closing
  too quickly when the queue is empty. Default value is zero, keeping the
  previous behavior.
* Added ``RedisSpider`` and ``RedisMixin`` classes as building blocks for
  spiders to be fed through a redis queue.
* Changed requests serialization from ``marshal`` to ``cPickle``.