os-scrapy-rq-crawler
This project provides a Crawler for RQ mode. It is based on Scrapy 2.0+ and requires Python 3.6+.
The Scrapy framework is designed for crawling specific sites; it is not well suited for "Broad Crawls". Scrapy's built-in scheduling mechanism is not meant for many domains: it uses a single channel queue for the requests of all domains, so the scheduler cannot decide to crawl requests of a specified domain.
RQ mode is the key mechanism/concept for broad crawls. Its core is the RQ (request queue), which is actually a bunch of queues: requests of different domains live in different sub-queues, as the sketch below illustrates.
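To make the idea concrete, here is a minimal sketch of the RQ concept in plain Python. It is an illustration only, not the project's actual implementation: each domain gets its own sub-queue, so a scheduler can choose which domain to crawl next.

```python
from collections import deque

class SketchRequestQueue:
    """Illustrative RQ: one sub-queue per domain (not the project's real class)."""

    def __init__(self):
        self.queues = {}  # domain -> deque of pending requests

    def push(self, domain, request):
        # append the request to its domain's sub-queue
        self.queues.setdefault(domain, deque()).append(request)

    def pop(self, domain):
        # pop a request of the specified domain, a per-domain choice
        # that Scrapy's single built-in queue cannot make
        queue = self.queues.get(domain)
        if not queue:
            return None
        request = queue.popleft()
        if not queue:
            del self.queues[domain]  # drop empty sub-queues
        return request
```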
Deployed together with os-rq-pod and os-rq-hub, it lets you build a large, scalable, distributed "Broad Crawls" system.
We offer a Crawler class for RQ mode. Because the Scrapy framework does not allow customizing the Crawler class, you can use os-scrapy (installed with this project) to start crawling.
install
pip install os-scrapy-rq-crawler
start your project
os-scrapy startproject <your_project>
set the Crawler class and enable the asyncio reactor in the project settings.py
CRAWLER_CLASS = "os_scrapy_rq_crawler.asyncio.Crawler"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
start crawling example spider
os-scrapy crawl example
If you have already installed os-scrapy, you can run the example spider directly from this project's root path:
os-scrapy crawl example
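A spider run this way is just an ordinary Scrapy spider. A minimal sketch is shown below; the actual example spider shipped with the project may differ.

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    # hypothetical spider named like the bundled example
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # extract some data; follow-up requests would be scheduled
        # through the configured request queue by the RQ mode Crawler
        yield {"url": response.url, "title": response.css("title::text").get()}
```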
This project offers a Crawler class that can be used in your project by configuring it in the settings.py file. It can also be used as a Scrapy project directly.
The provided Crawler enables RQ mode. Because the Scrapy framework offers no setting to configure the Crawler class, you can use os-scrapy to specify it
in settings.py:
CRAWLER_CLASS = "os_scrapy_rq_crawler.asyncio.Crawler"
or with the -c command line option:
os-scrapy crawl -c os_scrapy_rq_crawler.asyncio.Crawler example
Because the Crawler depends on the asyncio reactor, you need to enable it
in settings.py:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
or with the -r command line option:
os-scrapy crawl -r asyncio example
RQ is just a concept; it can be implemented in different ways.
Configure the request queue class in the settings.py file:
SCHEDULER_REQUEST_QUEUE = "os_scrapy_rq_crawler.AsyncRequestQueue"
The available implementations:
- os_scrapy_rq_crawler.MemoryRequestQueue
- os_scrapy_rq_crawler.AsyncRequestQueue
- os_scrapy_rq_crawler.MultiUpstreamRequestQueue: the same as os_scrapy_rq_crawler.AsyncRequestQueue, but it can be configured with multiple upstream request queues
For the upstream queues, config the pod/hub HTTP APIs in settings.py:
RQ_API = ["http://server01:6789/api/", "http://server02:6789/api/"]
Our RQ mode Crawler is a substitute for the Scrapy built-in Crawler, so most Scrapy functionality (middlewares/extensions) can still be used as normal.
There are some tips:
- CONCURRENT_REQUESTS: controls the max concurrency
- DOWNLOAD_DELAY: controls the download delay
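For example, in settings.py (these are standard Scrapy settings; the values are illustrative):

```python
# settings.py: throttle the crawl with standard Scrapy settings
CONCURRENT_REQUESTS = 16  # maximum number of concurrent requests
DOWNLOAD_DELAY = 0.5      # seconds to wait between downloads
```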
Unit tests can be run with:
sh scripts/test.sh
MIT licensed.