Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

sitemapbuilder

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

sitemapbuilder

Simple sitemap builder

  • 0.0.7
  • PyPI
  • Socket score

Maintainers
1

A simple sitemap builder

The sitemap builder traverses links from a website and constrains itself to the given domain name. The final result will be a simple sitemap deduced from the links visited. The crawler will accept & process only URLs with http or https schemes.

Installation and usage

To run the following command to install the tool:

.. code-block:: bash

pip install -U sitemapbuilder

To run the sitemap builder:

.. code-block:: bash

sitemapbuilder -u 'https://monzo.com' -o test_monzo.dot

Some websites have strong protection and the tool will not work for them:

.. code-block:: bash

sitemapbuilder -u 'https://bloomberg.com' -o test_bloomberg.dot

Highlights

#. Generate Graphviz .dot file showing directed links between pages. One can generate PNG/PDF and other image/document formats. #. Have configurable decay (maximum depth) to avoid abuse. #. Visit web link within the same hostname by default. #. Use 5 threads by default and times out after 10 seconds. #. Timeout after 5 seconds when fetching a URL. #. Handle timeout exceptions when querying a website. #. Send a HTTP HEAD request and verify that Content-Type is text/html and charset is either UTF-8 or US-ASCII. #. Have a map of visited URLs to avoid revisiting them. #. Follow HTTP redirects.

Upcoming features

  • Configure the number of threads and timeout via cmd args.
  • Allow web links from all subdomains.
  • Allow web links from a list of domains.
  • Allow web links matching a pattern.
  • Add an option for hierarchical sitemap instead of directed graph.
  • Use PriorityQueue instead of Queue to process links with higher decay first.
  • Fine-graned info, warn and error logging.
  • Pass seed links from a file.
  • Save to and resume from a DB/persistent data source.
  • Faster concurrency and better performance with asyncio.

Keywords

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc