
async-scrape


A package designed to scrape webpages using aiohttp and asyncio. It includes error handling for common issues, such as sites blocking you after a number of requests in a short period.



Async-scrape

Perform web scraping asynchronously

Async-scrape is a package which uses asyncio and aiohttp to scrape websites and has useful features built in.

Features

  • Breaks - pause scraping when a website blocks your requests consistently
  • Rate limit - slow down scraping to prevent being blocked
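As an illustration of the rate-limit idea, the following is a minimal sketch of spacing out concurrent asyncio tasks so that calls start at least a fixed interval apart. It is not Async-scrape's implementation; the class `MinIntervalLimiter`, the helper `fake_fetch`, and the URLs are hypothetical names used only for this example, and no real network request is made.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Illustrative limiter: lets one caller through per `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self):
        # The lock serialises callers so each one reserves its own time slot.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_allowed = max(now, self._next_allowed) + self.interval


async def fake_fetch(url, limiter, started):
    """Stand-in for a real request: waits for the limiter, records the start time."""
    await limiter.wait()
    started.append(time.monotonic())
    return url


async def main(urls):
    limiter = MinIntervalLimiter(0.05)  # hypothetical value: 20 calls per second
    started = []
    results = await asyncio.gather(*(fake_fetch(u, limiter, started) for u in urls))
    return results, started


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results, started = asyncio.run(main(urls))
gaps = [later - earlier for earlier, later in zip(started, started[1:])]
```

All three tasks are scheduled at once, but the limiter staggers when each one starts, which is the general behaviour a rate limit is meant to give.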

Installation

Async-scrape requires C++ Build tools v15+ to run.

pip install async-scrape

How to use it

Key input parameters:

  • post_process_func - the callable used to process the returned response
  • post_process_kwargs - any kwargs to be passed to the callable
  • use_proxy - whether a proxy should be used (if True, provide either the proxy or the pac_url parameter)
  • attempt_limit - how many attempts each request is given before it is marked as failed
  • rest_wait - how long the programme should pause between loops
  • call_rate_limit - limits the rate of requests (useful to avoid being blocked by websites)
  • randomise_headers - if set to True, a new set of headers is generated for each request
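To make the interplay of attempt_limit and rest_wait more concrete, here is a rough sketch of a retry loop in plain asyncio. It is an assumption about the general pattern, not the package's actual code: `fetch_with_attempts` and `flaky_fetch` are hypothetical helpers, and treating rest_wait as a pause in seconds between attempts is also an assumption.

```python
import asyncio


async def fetch_with_attempts(fetch, url, attempt_limit=5, rest_wait=0.0):
    """Try `fetch(url)` up to `attempt_limit` times, pausing between failures."""
    last_error = None
    for attempt in range(1, attempt_limit + 1):
        try:
            result = await fetch(url)
            return {"url": url, "result": result, "attempts": attempt, "error": None}
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = exc
            if attempt < attempt_limit:
                await asyncio.sleep(rest_wait)
    # Every attempt failed, so the request is marked as failed.
    return {"url": url, "result": None, "attempts": attempt_limit, "error": last_error}


calls = {"n": 0}


async def flaky_fetch(url):
    """Hypothetical fetch that fails twice (simulating a block) and then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated block")
    return "ok"


outcome = asyncio.run(
    fetch_with_attempts(flaky_fetch, "https://example.com", attempt_limit=5, rest_wait=0.01)
)
```

Here the request succeeds on the third attempt, so it is never marked as failed; with attempt_limit=2 the same call would return the last error instead.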

Get requests

# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
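The post_process function above only checks the status code. A slightly richer sketch, using only the standard library, might pull the page title out of the HTML. This assumes that `html` arrives as a string and that `resp` exposes a `status` attribute, as the example above suggests; the `max_len` keyword is a hypothetical entry that would be supplied through post_process_kwargs, and the response object here is a stand-in built for the demonstration.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def post_process(html, resp, **kwargs):
    """Return the page title, optionally truncated to `max_len` characters."""
    if resp.status != 200:
        return None
    parser = TitleParser()
    parser.feed(html)
    title = parser.title.strip()
    max_len = kwargs.get("max_len")  # hypothetical kwarg for this sketch
    return title[:max_len] if max_len else title


# Stand-in response object; a real run would receive the library's response.
fake_resp = SimpleNamespace(status=200)
page = "<html><head><title> Example Page </title></head><body></body></html>"
result = post_process(page, fake_resp, max_len=7)
```

Whatever the function returns ends up in the func_resp field of the corresponding result dict.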

Post requests

# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://eos1jv6curljagq.m.pipedream.net",
    "https://eos1jv6curljagq.m.pipedream.net",
]
payloads = [
    {"value": 0},
    {"value": 1}
]

resps = async_Scrape.scrape_all(urls, payloads=payloads)

Response

The response object is a list of dicts in the following format:

{
    "url":url, # url of request
    "req":req, # combination of url and params
    "func_resp":func_resp, # response from post processing function
    "status":resp.status, # http status
    "error":None # any error encountered
}
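Given that format, one way to work with the results is to split them into successes and failures and collect the URLs worth retrying. This is a sketch on hand-written sample data: the exact type stored under "error" and the status of a failed request are assumptions, and `split_results` is a hypothetical helper rather than part of the package.

```python
def split_results(resps):
    """Partition result dicts into (succeeded, failed) lists."""
    succeeded, failed = [], []
    for r in resps:
        if r.get("error") is None and r.get("status") == 200:
            succeeded.append(r)
        else:
            failed.append(r)
    return succeeded, failed


# Hypothetical sample in the documented format (not real output).
sample = [
    {
        "url": "https://www.google.com",
        "req": "https://www.google.com",
        "func_resp": "Request worked",
        "status": 200,
        "error": None,
    },
    {
        "url": "https://www.bing.com",
        "req": "https://www.bing.com",
        "func_resp": None,
        "status": None,
        "error": "TimeoutError",
    },
]

succeeded, failed = split_results(sample)
retry_urls = [r["url"] for r in failed]
```

The `retry_urls` list could then be passed to another scrape_all call.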

License

MIT

Free Software, Hell Yeah!
