🚀 Big News: Socket Acquires Coana to Bring Reachability Analysis to Every Appsec Team.Learn more →

Book a Demo Install Sign in

aioscpy

Package Overview

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

aioscpy

An asyncio + aiolibs crawler imitate scrapy framework

0.3.13

PyPI

Maintainers: 1

aioscpy

Aioscpy

A powerful, high-performance asynchronous web crawling and scraping framework built on Python's asyncio ecosystem.

English | 中文

Overview

Aioscpy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It draws inspiration from Scrapy and scrapy_redis but is designed from the ground up to leverage the full power of asynchronous programming.

Key Features

Fully Asynchronous: Built on Python's asyncio for high-performance concurrent operations
Scrapy-like API: Familiar API for those coming from Scrapy
Distributed Crawling: Support for distributed crawling using Redis
Multiple HTTP Backends: Support for aiohttp, httpx, and requests
Dynamic Variable Injection: Powerful dependency injection system
Flexible Middleware System: Customizable request/response processing pipeline
Robust Item Processing: Pipeline for processing scraped items

Requirements

Python 3.8+
Works on Linux, Windows, macOS, BSD

Installation

Basic Installation

pip install aioscpy

With All Dependencies

pip install aioscpy[all]

With Specific HTTP Backends

pip install aioscpy[aiohttp,httpx]

Latest Version from GitHub

pip install git+https://github.com/ihandmine/aioscpy

Quick Start

Creating a New Project

aioscpy startproject myproject
cd myproject

Creating a Spider

aioscpy genspider myspider

This will create a basic spider in the spiders directory.

tree

Example Spider

from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Creating a Single Spider Script

aioscpy onespider single_quotes

Advanced Spider Example

from aioscpy.spider import Spider
from anti_header import Header
from pprint import pprint, pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    quotes = SingleQuotesSpider()
    quotes.start()

Running Spiders

# Run a spider from a project
aioscpy crawl quotes


# Run a single spider script
aioscpy runspider quotes.py

run

Running from Python Code

from aioscpy.crawler import call_grace_instance
from aioscpy.utils.tools import get_project_settings

# Method 1: Load all spiders from a directory
def load_spiders_from_directory():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./spiders')
    process.start()

# Method 2: Run a specific spider by name
def run_specific_spider():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('myspider')
    process.start()

if __name__ == '__main__':
    run_specific_spider()

Configuration

Aioscpy can be configured through the settings.py file in your project. Here are the most important settings:

Concurrency Settings

# Maximum number of concurrent items being processed
CONCURRENT_ITEMS = 100

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 16

# Maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Maximum number of concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 0

Download Settings

# Delay between requests (in seconds)
DOWNLOAD_DELAY = 0

# Timeout for requests (in seconds)
DOWNLOAD_TIMEOUT = 20

# Whether to randomize the download delay
RANDOMIZE_DOWNLOAD_DELAY = True

# HTTP backend to use
DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.httpx.HttpxDownloadHandler"
# Other options:
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler"
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.requests.RequestsDownloadHandler"

Scheduler Settings

# Scheduler to use (memory-based or Redis-based)
SCHEDULER = "aioscpy.core.scheduler.memory.MemoryScheduler"
# For distributed crawling:
# SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"

# Redis connection settings (for Redis scheduler)
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"

Response API

Aioscpy provides a rich API for working with responses:

Extracting Data

# Using CSS selectors
title = response.css('title::text').get()
all_links = response.css('a::attr(href)').getall()

# Using XPath
title = response.xpath('//title/text()').get()
all_links = response.xpath('//a/@href').getall()

Following Links

# Follow a link
yield response.follow('next-page.html', self.parse)

# Follow a link with a callback
yield response.follow('details.html', self.parse_details)

# Follow all links matching a CSS selector
yield from response.follow_all(css='a.product::attr(href)', callback=self.parse_product)

More Commands

aioscpy -h

Distributed Crawling

To enable distributed crawling with Redis:

Configure Redis in settings:

SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"

Run multiple instances of your spider on different machines, all connecting to the same Redis server.

Contributing

Please submit your suggestions to the owner by creating an issue.

Thanks

Keywords

crawler scrapy asyncio aiohttp anti-header anti-useragent python3

FAQs

What is aioscpy?

Is aioscpy well maintained?

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

aioscpy

Aioscpy

Overview

Key Features

Requirements

Installation

Basic Installation

With All Dependencies

With Specific HTTP Backends

Latest Version from GitHub

Quick Start

Creating a New Project

Creating a Spider

Example Spider

Creating a Single Spider Script

Advanced Spider Example

Running Spiders

Running from Python Code

Configuration

Concurrency Settings

Download Settings

Scheduler Settings

Response API

Extracting Data

Following Links

More Commands

Distributed Crawling

Contributing

Thanks

Keywords

Related posts

Open Source CAI Framework Handles Pen Testing Tasks up to 3,600× Faster Than Humans

Deno 2.4 Brings Back deno bundle, Improves Dependency Management and Observability