Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

robotsparser

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

robotsparser

Parse robots.txt files and find indexed urls

  • 0.0.12
  • PyPI
  • Socket score

Maintainers
1

robotsparser

Python library that parses robots.txt files

Functionalities

  • Automatically discover all sitemap files
  • Unzip gziped files
  • Fetch all URLs from sitemaps

Install

pip install robotsparser

Usage

from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
# To initiate the crawl of sitemaps and indexed urls. sitemap_crawl_limit argument is optional
rb.read(fetch_sitemap_urls=True, sitemap_url_crawl_limit=5)

# Show information
rb.get_sitemap_indexes() # returns sitemap indexes
rb.get_sitemaps() # returns sitemaps
rb.get_urls() # returns a list of all urls

Crawl in the background using thread

Crawl in the background and output new entries to file

This is useful for sites where sitemaps are heavily nested and take a long time to crawl

from robotsparser.parser import Robotparser
import threading

if __name__ == '__main__':
    robots_url = "https://www.example.com/robots.txt"
    rb = Robotparser(url=robots_url, verbose=False)

    sitemap_crawl_proc = threading.Thread(target = rb.read, kwargs = {'fetch_sitemap_urls': False}, daemon=True)
    sitemap_crawl_proc.start()

    while sitemap_crawl_proc.is_alive():
        time.sleep(1)
        print(f"entries_count: {len(rb.get_sitemap_entries())}, indexes: {len(rb.get_sitemap_indexes())}")
        # any logic here to get object data

Keywords

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc