robotsparse
A Python package that makes parsing robots files faster and simpler.
Usage
Basic usage, such as getting robots contents:
import robotsparse
robots = robotsparse.getRobots("https://github.com/", find_url=True)
print(list(robots))
The user-agents key contains each user agent found in the robots file, along with the information associated with it.
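As a rough illustration of working with that key, the sketch below lists the user agents in a parsed result. It runs on stand-in data rather than a live request. The nested layout of sample_robots and the helper name user_agent_names are assumptions made for this example, not the documented return format of robotsparse.

```python
# Stand-in data: this layout is an assumption for illustration only,
# not the documented return value of robotsparse.getRobots.
sample_robots = {
    "user-agents": {
        "*": {"disallow": ["/private/"]},
        "ExampleBot": {"disallow": ["/"]},
    }
}


def user_agent_names(robots):
    """Return the user-agent names stored under the "user-agents" key."""
    # Iterating over a dict yields its keys, so this also works
    # if the value turns out to be a plain list of names.
    return list(robots.get("user-agents", []))


print(user_agent_names(sample_robots))
```

If the real structure differs, only the stand-in data needs to change; the helper reads nothing beyond the user-agents key.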
Alternatively, we can load the robots contents as an object, which allows faster access to its values:
import robotsparse
robots = robotsparse.getRobotsObject("https://duckduckgo.com/", find_url=True)
assert isinstance(robots, object)
print(robots.allow)
print(robots.disallow)
print(robots.crawl_delay)
print(robots.robots)
Additional Features
When parsing robots files, it may sometimes be useful to parse sitemap files as well:
import robotsparse
sitemap = robotsparse.getSitemap("https://pypi.org/", find_url=True)
The sitemap variable in the code above holds data that looks like this:
[{"url": "", "lastModified": ""}]
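A list in this shape is easy to filter with plain Python. The sketch below keeps only the URLs modified on or after a given date. The sample entries, the helper name urls_modified_since, and the assumption that lastModified begins with a YYYY-MM-DD date are all hypothetical; real sitemaps may use other date formats.

```python
from datetime import datetime

# Stand-in entries in the shape shown above; the URLs and dates are made up.
sample_sitemap = [
    {"url": "https://example.com/", "lastModified": "2024-01-15"},
    {"url": "https://example.com/about", "lastModified": ""},
    {"url": "https://example.com/blog", "lastModified": "2023-06-01T08:30:00Z"},
]


def urls_modified_since(entries, cutoff):
    """Return URLs whose lastModified date is on or after cutoff (YYYY-MM-DD)."""
    cutoff_date = datetime.strptime(cutoff, "%Y-%m-%d").date()
    urls = []
    for entry in entries:
        stamp = entry.get("lastModified", "")
        if not stamp:
            continue  # no date recorded for this entry
        # Assumption: the value starts with YYYY-MM-DD, so the first
        # ten characters are enough even for full timestamps.
        modified = datetime.strptime(stamp[:10], "%Y-%m-%d").date()
        if modified >= cutoff_date:
            urls.append(entry["url"])
    return urls


print(urls_modified_since(sample_sitemap, "2024-01-01"))
```

Entries with an empty lastModified value are skipped rather than treated as old or new, since the date is simply unknown.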