YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.
YiraBot is a versatile Python package designed for crawling, scraping, and analyzing web pages. It provides a range of functionalities from basic webpage crawling to detailed SEO analysis, mobile responsiveness checks, and social media integration verification. This document serves as a comprehensive guide to using YiraBot, including installation, usage examples, and an explanation of its core features.
Before you can use YiraBot, you need to ensure Python is installed on your system. YiraBot is compatible with Python 3.6 and above. You can install YiraBot using pip:
pip install yirabot
YiraBot offers a range of functionalities, including:
YiraBot can be invoked directly from the command line with various commands and options:
yirabot <command> [options]
crawl: Crawls a given URL to extract data.
scrape: Specifically extracts main content from a URL.
seo: Performs an SEO analysis of the specified web page.
get-html: Downloads and saves the complete HTML content of a web page.
-mobile: Uses a mobile user agent for requests.
-file: Saves the extracted data in text format.
-json: Saves the extracted data in JSON format.
Crawling a Web Page
To crawl a web page and display extracted data:
yirabot crawl example.com
Saving Crawled Data
To crawl a web page and save the extracted data in JSON format:
yirabot crawl example.com -json
Performing SEO Analysis
To perform an SEO analysis on a web page:
yirabot seo example.com
Checking Mobile Responsiveness
Mobile responsiveness is part of the SEO analysis. To check if a page is mobile responsive:
yirabot seo example.com
Look for the "Mobile Responsiveness" section in the output.
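The guide does not say how the "Mobile Responsiveness" result is computed. As a rough illustration only, one common heuristic is to check whether the page declares a viewport meta tag. The sketch below assumes that heuristic; `ViewportFinder` and `looks_mobile_friendly` are hypothetical names and are not part of YiraBot.

```python
from html.parser import HTMLParser


class ViewportFinder(HTMLParser):
    """Records whether the page declares a <meta name="viewport"> tag."""

    def __init__(self):
        super().__init__()
        self.has_viewport = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "viewport":
            self.has_viewport = True


def looks_mobile_friendly(html: str) -> bool:
    """Heuristic only (an assumption, not YiraBot's check): True if a viewport meta tag exists."""
    finder = ViewportFinder()
    finder.feed(html)
    return finder.has_viewport
```

A page that passes this check may still render poorly on small screens, so treat the result as a hint, not a verdict.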
Crawling Protected Pages
YiraBot also supports crawling pages that require authentication. This process is more involved and requires setting up a session:
yirabot session
Follow the interactive prompts to enter login details and choose the crawling method.
When using YiraBot from the command line, you can modify its behavior with various flags. These flags allow you to tailor the crawling and analysis process to your specific needs. Here’s how the functionality changes with different flags:
Each flag is designed to offer flexibility and control over the crawling and analysis process, ensuring that you can obtain the data you need in the format that best suits your project.
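When scripting the command line from Python, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the commands and flags listed in this guide; `build_command` is a hypothetical helper, and whether the flags can be combined in a single call is an assumption.

```python
import shlex

# Commands documented in this guide.
COMMANDS = {"crawl", "scrape", "seo", "get-html"}


def build_command(command, url, mobile=False, save_text=False, save_json=False):
    """Return an argument list such as ['yirabot', 'crawl', 'example.com', '-json']."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    args = ["yirabot", command, url]
    if mobile:
        args.append("-mobile")  # use a mobile user agent
    if save_text:
        args.append("-file")    # save extracted data as text
    if save_json:
        args.append("-json")    # save extracted data as JSON
    return args


print(shlex.join(build_command("crawl", "example.com", save_json=True)))
# prints: yirabot crawl example.com -json
```

The returned list can be passed to `subprocess.run(args, check=True)` once YiraBot is installed.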
YiraBot is a powerful tool for developers, SEO specialists, and anyone interested in web page analysis. By following this guide, you should be able to install YiraBot, understand its capabilities, and start using it for your web crawling and analysis needs.
seo_analysis: Uses get_random_user_agent() for requests, simulating different browser types for more accurate SEO testing.
crawl: Supports the force parameter and uses a dynamic_delay mechanism to adjust request timing based on the server's response, simulating more natural browsing behavior.
scrape: Supports the force parameter.
validate: Uses the dynamic_delay function to adjust the frequency of requests dynamically, reducing the risk of being blocked by the target server.
To use Yirabot, instantiate the class and call the desired method with appropriate parameters. For SEO analysis and content scraping, pass the target URL and, if available, a session object for authenticated requests.
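The guide mentions a dynamic_delay mechanism but does not describe how it works. The sketch below shows one plausible way to pick a wait time from the server's last response; `next_delay` and all of its parameters are hypothetical and should not be read as YiraBot's actual implementation.

```python
def next_delay(response_seconds, status_code, base=1.0, factor=2.0, max_delay=30.0):
    """Pick how long to wait before the next request (illustrative assumption only).

    Slow responses and throttling status codes lead to longer waits,
    capped at max_delay seconds.
    """
    if status_code in (429, 503):
        # The server is signalling overload or rate limiting: back off harder.
        delay = max(base, response_seconds) * factor * 2
    else:
        # Otherwise wait in proportion to how long the server took to answer.
        delay = max(base, response_seconds * factor)
    return min(delay, max_delay)
```

A crawler loop would call `time.sleep(next_delay(elapsed, status))` between requests, where `elapsed` is the measured response time of the previous request.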
When performing actions that might be restricted by robots.txt, consider the ethical implications and the legality of bypassing such restrictions with the force parameter.
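Before reaching for force, it is worth checking what a site's robots.txt allows. The sketch below uses the standard library's `urllib.robotparser` on an inline robots.txt string so it runs without network access; `is_allowed` and the sample rules are illustrative and independent of YiraBot's internals.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, made up for this example.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(ROBOTS, "https://example.com/blog"))            # prints: True
print(is_allowed(ROBOTS, "https://example.com/private/report"))  # prints: False
```

If the check returns False, the page owner has asked crawlers to stay away, and that is the case where using force deserves the most scrutiny.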
This class serves as a versatile tool for developers, SEO specialists, and content managers looking to automate the process of web data extraction and analysis, enhancing SEO strategies and website maintenance practices.
from yirabot import Yirabot
bot = Yirabot()
# SEO analysis: inspect a few of the returned metrics
url = "https://example.com"
seo_data = bot.seo_analysis(url)
print("Title Length:", seo_data['title_length'])
print("Meta Description Length:", seo_data['meta_desc_length'])
print("Responsive:", "Yes" if seo_data['is_responsive'] else "No")
# Crawling: force=True bypasses robots.txt restrictions, so use it only where that is ethical and permitted
url = "https://example.com"
crawl_data = bot.crawl(url, force=True)
print("Page Title:", crawl_data['title'])
print("Number of Internal Links:", len(crawl_data['internal_links']))
print("Number of External Links:", len(crawl_data['external_links']))
# Content scraping: display the first paragraph and heading
url = "https://example.com/blog"
content_data = bot.scrape(url)
print("First Paragraph:", content_data['paragraphs'][0])
print("First Heading:", content_data['headings'][0])
# Sitemap validation: collect every URL that did not return HTTP 200
sitemap_url = "https://example.com/sitemap.xml"
validation_results = bot.validate(sitemap_url)
inaccessible_urls = {url: status for url, status in validation_results.items() if status != 200}
print("Inaccessible URLs:", inaccessible_urls)
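For larger sitemaps, grouping the results by status code can be easier to read than a flat dictionary. This sketch assumes, as the example above implies, that validate returns a mapping of URL to HTTP status code; `group_by_status` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_status(validation_results):
    """Group URLs by their status code, e.g. {200: [...], 404: [...]}."""
    grouped = defaultdict(list)
    for url, status in validation_results.items():
        grouped[status].append(url)
    return dict(grouped)


# Made-up sample shaped like the assumed output of bot.validate().
sample_results = {
    "https://example.com/": 200,
    "https://example.com/about": 200,
    "https://example.com/old-page": 404,
}

for status, urls in sorted(group_by_status(sample_results).items()):
    print(status, len(urls), "URL(s)")
```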
Contributions to the YiraBot project are welcomed. Feel free to fork the repository, make your changes, and submit pull requests.
All contributors must follow the Contribution Policy to ensure a smooth and collaborative development process.
YiraBot is open-source software licensed under the GNU General Public License, version 3.
Owen Orcan
Yigit Ocak
We found that YiraBot demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.