Scrapyscript

Embed Scrapy jobs directly in your code

What is Scrapyscript?

Scrapyscript is a Python library you can use to run Scrapy spiders directly from your code. Scrapy is a great framework for scraping projects, but sometimes you don't need the whole framework and just want to run a small spider from a script or a Celery job. That's where Scrapyscript comes in.

With Scrapyscript, you can:

  • wrap regular Scrapy Spiders in a Job
  • load the Job(s) in a Processor
  • call processor.run() to execute them

... returning all results when the last job completes.

Let's see an example.

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)
# Output: [{'title': 'Welcome to Python.org'}]

See the examples directory for more, including a complete Celery example.
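As a rough sketch of what that looks like, a Celery task can build a Job and run it inside a Processor. The broker URL, module layout, and task name below are illustrative assumptions (reusing the PythonSpider defined above); see the examples directory for the complete, canonical version.

from celery import Celery
from scrapyscript import Job, Processor

# Assumes PythonSpider from the example above lives in an importable module.
from myspiders import PythonSpider

# Broker URL is an assumption; point this at your own broker.
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def crawl(url):
    # Each task builds its own Processor, so the crawl runs in a fresh process
    # and the consolidated results are returned as the task result.
    job = Job(PythonSpider, url=url)
    return Processor(settings=None).run(job)

# Usage (from application code): crawl.delay("http://www.python.org")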

Install

pip install scrapyscript

Requirements

  • Linux or macOS
  • Python 3.8+
  • Scrapy 2.5+

API

Job (spider, *args, **kwargs)

A single request to call a spider, optionally passing in *args or **kwargs, which will be passed through to the spider constructor at runtime.

# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')

Processor (settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a scrapy.settings.Settings object to configure the Scrapy runtime.

settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)

Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list. jobs can be a single instance of Job, or a list.

results = processor.run(myjob)

or

results = processor.run([myjob1, myjob2, ...])

A word about Spider outputs

As per the Scrapy docs, a Spider must return an iterable of Request and/or dict or Item objects.

Requests will be consumed by Scrapy inside the Job. dict or scrapy.Item objects will be queued and output together when all spiders are finished.

Due to the way billiard handles communication between processes, each dict or Item must be pickle-able using pickle protocol 0. It's generally best to output dict objects from your Spider.
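As an illustrative sketch (the spider name, selectors, and five-link limit are assumptions, not part of the library), a spider that mixes the two output types might look like this:

import scrapy

class LinkTitleSpider(scrapy.spiders.Spider):
    name = "linktitles"

    def start_requests(self):
        # self.url arrives via Job(LinkTitleSpider, url=...)
        yield scrapy.Request(self.url)

    def parse(self, response):
        # Requests yielded here are followed by Scrapy inside the Job ...
        for href in response.css("a::attr(href)").getall()[:5]:
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        # ... while plain dicts are queued and returned together by processor.run()
        yield {"url": response.url,
               "title": response.css("title::text").get()}

Keeping the yielded items as plain dicts of simple values also sidesteps the pickle protocol 0 restriction mentioned above.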

Contributing

Updates, additional features or bug fixes are always welcome.

Setup
  • Install Poetry
  • git clone git@github.com:jschnurr/scrapyscript.git
  • poetry install
Tests
  • make test or make tox

Version History

See CHANGELOG.md

License

The MIT License (MIT). See LICENCE file for details.
