Abstract Web Tools is a Python package that provides various utility functions for web scraping tasks. It is built on top of popular libraries such as `requests`, `BeautifulSoup`, and `urllib3` to simplify the process of fetching and parsing web content.
It provides utilities for inspecting and parsing web content, including React component detection, along with URL handling and enhanced management of HTTP requests and TLS configurations.

## Description

Abstract WebTools offers a suite of utilities designed for web content inspection and parsing. One of its standout features is its ability to analyze URLs, checking their validity and automatically trying different URL variations until one resolves correctly. It provides a custom HTTP request management system that tailors user-agent strings and employs a specialized TLS adapter for heightened security. The toolkit also offers robust capabilities for extracting source code, including detecting React components on web pages. Additionally, it can extract all internal website links and perform in-depth web content analysis. This makes Abstract WebTools an indispensable tool for web developers, cybersecurity professionals, and digital analysts.
The toolkit builds on the following core imports:

```python
import requests
import ssl
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager
from urllib3.util import ssl_
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
```
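These imports point to the custom TLS handling mentioned above. As a rough sketch of that pattern (not the package's actual adapter; the class name here is hypothetical), a custom `HTTPAdapter` can mount a pinned `SSLContext` onto a `requests` session:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager

class TLSAdapter(HTTPAdapter):
    """Hypothetical adapter that pins a custom SSL context (illustrative)."""
    def __init__(self, ssl_context=None, **kwargs):
        # Must be set before super().__init__, which calls init_poolmanager.
        self.ssl_context = ssl_context or ssl.create_default_context()
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        # Route all pooled connections through the custom SSL context.
        self.poolmanager = PoolManager(
            num_pools=connections,
            maxsize=maxsize,
            block=block,
            ssl_context=self.ssl_context,
            **pool_kwargs,
        )

session = requests.Session()
session.mount('https://', TLSAdapter())
```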
## UrlManager

The `UrlManager` is a Python class designed to handle and manipulate URLs. It provides methods for cleaning and normalizing URLs, determining the correct version of a URL, extracting URL components, and more. This class is particularly useful for web scraping, web crawling, or any application where URL management is essential.
To use the `UrlManager` class, first import it into your Python script:

```python
from abstract_webtools import UrlManager
```
You can create a `UrlManager` object by providing an initial URL and an optional `requests` session. If no URL is provided, it defaults to `'www.example.com'`:

```python
url_manager = UrlManager(url='https://www.example.com')
```
The `clean_url` method takes a URL and returns a list of potential URL variations, including versions with and without `'www.'`, `'http://'`, and `'https://'`:

```python
cleaned_urls = url_manager.clean_url()
```
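For intuition, the kind of variant list this is described as producing can be sketched as follows (hypothetical helper; the actual output may differ):

```python
def url_variants(domain='example.com'):
    """Enumerate scheme/www variants of a bare domain."""
    bases = [domain, f'www.{domain}']
    return [f'{scheme}://{base}' for scheme in ('https', 'http') for base in bases]

print(url_variants())
# ['https://example.com', 'https://www.example.com',
#  'http://example.com', 'http://www.example.com']
```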
The `get_correct_url` method tries each possible URL variation with an HTTP request to determine the correct version of the URL:

```python
correct_url = url_manager.get_correct_url()
```
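The probing presumably amounts to something like this standalone sketch (hypothetical helper, not the package's implementation):

```python
import requests

def first_reachable(variants, timeout=5):
    """Return the first variant that answers with a non-error status."""
    for candidate in variants:
        try:
            if requests.get(candidate, timeout=timeout).ok:
                return candidate
        except requests.RequestException:
            continue  # unreachable variant; try the next one
    return None
```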
You can update the URL associated with the `UrlManager` object using the `update_url` method:

```python
url_manager.update_url('https://www.example2.com')
```
The `url_to_pieces` method extracts various components of the URL, such as protocol, domain name, path, and query:

```python
url_manager.url_to_pieces()
print(url_manager.protocol)     # e.g. 'https'
print(url_manager.domain_name)  # e.g. 'www.example.com'
print(url_manager.path)         # e.g. '/some/path'
print(url_manager.query)        # e.g. 'key=value'
```
Other helper methods include:

- `get_domain_name(url)`: Returns the domain name (netloc) of a given URL.
- `is_valid_url(url)`: Checks if a URL is valid.
- `make_valid(href, url)`: Ensures a relative or incomplete URL is valid by joining it with a base URL.
- `get_relative_href(url, href)`: Converts a relative URL to an absolute URL based on a base URL.

The `get_domain` method is kept for compatibility but is inconsistent; use it only for `"webpage_url_domain"`. The `url_basename`, `base_url`, and `urljoin` methods are likewise available for URL manipulation; the join-based helpers are illustrated below.
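For intuition, the join-based helpers behave like the standard library's `urljoin` (shown here for illustration; this is not the package's implementation):

```python
from urllib.parse import urljoin, urlparse

base = 'https://www.example.com/blog/post'
print(urljoin(base, '/about'))   # https://www.example.com/about
print(urljoin(base, 'next'))     # https://www.example.com/blog/next
print(urlparse(base).netloc)     # www.example.com (the "domain name")
```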
Here's a quick example of using the `UrlManager` class:

```python
from abstract_webtools import UrlManager

url_manager = UrlManager(url='https://www.example.com')
cleaned_urls = url_manager.clean_url()
correct_url = url_manager.get_correct_url()
url_manager.update_url('https://www.example2.com')

print(f"Cleaned URLs: {cleaned_urls}")
print(f"Correct URL: {correct_url}")
```
The `UrlManager` class relies on the `requests` library for making HTTP requests. Ensure you have the `requests` library installed in your Python environment:
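```bash
pip install requests
```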
## SafeRequest

The `SafeRequest` class is a versatile Python utility designed to handle HTTP requests with enhanced safety features. It integrates with other managers like `UrlManager`, `NetworkManager`, and `UserAgentManager` to manage various aspects of the request, such as user-agent, SSL/TLS settings, proxies, headers, and more.
To use the `SafeRequest` class, first import it into your Python script:

```python
from abstract_webtools import SafeRequest
```
You can create a `SafeRequest` object with various configuration options. By default, it uses sensible default values, but you can customize it as needed:

```python
safe_request = SafeRequest(url='https://www.example.com')
```
You can update the URL associated with the `SafeRequest` object using the `update_url` method, which also updates the underlying `UrlManager`:

```python
safe_request.update_url('https://www.example2.com')
```
You can also update the `UrlManager` directly:

```python
from abstract_webtools import UrlManager

url_manager = UrlManager(url='https://www.example3.com')
safe_request.update_url_manager(url_manager)
```
The `SafeRequest` class makes HTTP requests through the `try_request` method, which handles retries, timeouts, and rate limiting:

```python
response = safe_request.try_request()
if response:
    # Process the response here
    print(response.status_code)
```
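Conceptually, the retry behavior is along these lines (a minimal sketch with exponential backoff, not the package's internals):

```python
import time
import requests

def request_with_retries(url, attempts=3, timeout=(3.05, 30)):
    """Retry a GET with exponential backoff between failed attempts."""
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=timeout)
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(2 ** attempt)  # back off before retrying
```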
You can access the response data in various formats:

- `safe_request.source_code`: HTML source code as a string.
- `safe_request.source_code_bytes`: HTML source code as bytes.
- `safe_request.source_code_json`: JSON data from the response (if the content type is JSON).
- `safe_request.react_source_code`: JavaScript and JSX source code extracted from `<script>` tags.

The `SafeRequest` class provides several options for customizing the request, such as headers, user-agent, proxies, SSL/TLS settings, and more. These can be set during initialization or updated later.
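As a rough illustration of the `react_source_code` idea listed above, script bodies can be pulled out of a page with BeautifulSoup (illustrative only, not the package's exact logic):

```python
from bs4 import BeautifulSoup

html = '<html><script>const App = () => null;</script></html>'
soup = BeautifulSoup(html, 'html.parser')
# Collect the text content of every <script> tag on the page.
print([tag.get_text() for tag in soup.find_all('script')])
# ['const App = () => null;']
```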
The class can handle rate limiting scenarios by implementing rate limiters and waiting between requests.
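A minimal "wait between requests" limiter of the kind described might look like this (illustrative; the class and parameter names here are hypothetical):

```python
import time

class MinIntervalLimiter:
    """Block until at least min_interval seconds separate consecutive calls."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```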
The `SafeRequest` class handles various request-related exceptions and provides error messages for easier debugging.

The `SafeRequest` class relies on the `requests` library for making HTTP requests. Ensure you have the `requests` library installed in your Python environment:

```bash
pip install requests
```
Here's a quick example of using the `SafeRequest` class:

```python
from abstract_webtools import SafeRequest

safe_request = SafeRequest(url='https://www.example.com')
response = safe_request.try_request()
if response:
    print(f"Response status code: {response.status_code}")
    print(f"HTML source code: {safe_request.source_code}")
```
## SoupManager

The `SoupManager` class is a Python utility designed to simplify web scraping by providing easy access to the BeautifulSoup library. It allows you to parse and manipulate HTML or XML source code from a URL or provided source code.
To use the `SoupManager` class, first import it into your Python script:

```python
from abstract_webtools import SoupManager
```
You can create a `SoupManager` object with various configuration options. By default, it uses sensible default values, but you can customize it as needed:

```python
soup_manager = SoupManager(url='https://www.example.com')
```
You can update the URL associated with the `SoupManager` object using the `update_url` method, which also updates the underlying `UrlManager` and `SafeRequest`:

```python
soup_manager.update_url('https://www.example2.com')
```
You can also update the source code directly:

```python
source_code = '<html>...</html>'
soup_manager.update_source_code(source_code)
```
The `SoupManager` class provides easy access to the BeautifulSoup object, allowing you to search, extract, and manipulate HTML elements. You can use methods like `find_all`, `get_class`, `has_attributes`, and more to work with the HTML content:

```python
elements = soup_manager.find_all(tag='a')
```
The class also includes methods for extracting all website links from the HTML source code:

```python
all_links = soup_manager.all_links
```
You can extract meta tags from the HTML source code using the `meta_tags` property:

```python
meta_tags = soup_manager.meta_tags
```
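A meta-tag extraction like this boils down to collecting `<meta>` tags with BeautifulSoup; here is a rough standalone equivalent (not the package's exact output format):

```python
from bs4 import BeautifulSoup

html = '<html><head><meta name="description" content="demo"></head></html>'
soup = BeautifulSoup(html, 'html.parser')
# Each tag's attributes form a small dict of name/content pairs.
print([dict(tag.attrs) for tag in soup.find_all('meta')])
# [{'name': 'description', 'content': 'demo'}]
```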
You can customize the parsing behavior by specifying the parser type during initialization or updating it later (note that the `'lxml'` parser requires the `lxml` package to be installed):

```python
soup_manager.update_parse_type('lxml')
```
The `SoupManager` class relies on the `BeautifulSoup` library for parsing HTML or XML. Ensure you have the `beautifulsoup4` library installed in your Python environment:

```bash
pip install beautifulsoup4
```
Here's a quick example of using the `SoupManager` class:

```python
from abstract_webtools import SoupManager

soup_manager = SoupManager(url='https://www.example.com')
all_links = soup_manager.all_links
print(f"All Links: {all_links}")
```
## LinkManager

The `LinkManager` class is a Python utility designed to simplify the extraction and management of links (URLs) and associated data from HTML source code. It leverages other classes like `UrlManager`, `SafeRequest`, and `SoupManager` to facilitate link extraction and manipulation.
To use the `LinkManager` class, first import it into your Python script:

```python
from abstract_webtools import LinkManager
```
You can create a `LinkManager` object with various configuration options. By default, it uses sensible default values, but you can customize it as needed:

```python
link_manager = LinkManager(url='https://www.example.com')
```
You can update the URL associated with the `LinkManager` object using the `update_url` method, which also updates the underlying `UrlManager`, `SafeRequest`, and `SoupManager`:

```python
link_manager.update_url('https://www.example2.com')
```
The `LinkManager` class provides easy access to extracted links and associated data:

```python
all_links = link_manager.all_desired_links
```
You can customize the link extraction behavior by specifying various parameters during initialization or updating them later:

```python
link_manager.update_desired(
    img_attr_value_desired=['thumbnail', 'image'],
    img_attr_value_undesired=['icon'],
    link_attr_value_desired=['blog', 'article'],
    link_attr_value_undesired=['archive'],
    image_link_tags='img',
    img_link_attrs='src',
    link_tags='a',
    link_attrs='href',
    strict_order_tags=True,
    associated_data_attr=['data-title', 'alt', 'title'],
    get_img=['data-title', 'alt', 'title']
)
```
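The desired/undesired lists read like substring filters over link attributes. A rough standalone equivalent follows (the filtering semantics here are an assumption, not the package's documented behavior):

```python
links = ['/blog/post-1', '/archive/2020', '/article/x']
desired, undesired = ['blog', 'article'], ['archive']

# Keep links matching any desired substring and no undesired substring.
kept = [link for link in links
        if any(d in link for d in desired)
        and not any(u in link for u in undesired)]
print(kept)  # ['/blog/post-1', '/article/x']
```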
The `LinkManager` class relies on other classes within the `abstract_webtools` module, such as `UrlManager`, `SafeRequest`, and `SoupManager`. Ensure you have these classes and their dependencies correctly set up in your Python environment.
Here's a quick example of using the `LinkManager` class:

```python
from abstract_webtools import LinkManager

link_manager = LinkManager(url='https://www.example.com')
all_links = link_manager.all_desired_links
print(f"All Links: {all_links}")
```
## Overall Use Cases
```python
from abstract_webtools import UrlManager, SafeRequest, SoupManager, LinkManager, VideoDownloader

# --- UrlManager: Manages and manipulates URLs for web scraping/crawling ---
url = "example.com"
url_manager = UrlManager(url=url)

# --- SafeRequest: Safely handles HTTP requests by managing user-agent, SSL/TLS, proxies, headers, etc. ---
request_manager = SafeRequest(
    url_manager=url_manager,
    # proxies follow the requests convention of a scheme -> proxy-URL mapping
    proxies={'http': 'http://8.219.195.47', 'https': 'http://8.219.197.111'},
    timeout=(3.05, 70)
)

# --- SoupManager: Simplifies web scraping with easy access to BeautifulSoup ---
soup_manager = SoupManager(
    url_manager=url_manager,
    request_manager=request_manager
)

# --- LinkManager: Extracts and manages links and associated data from HTML source code ---
link_manager = LinkManager(
    url_manager=url_manager,
    soup_manager=soup_manager,
    link_attr_value_desired=['/view_video.php?viewkey='],
    link_attr_value_undesired=['phantomjs']
)

# Download videos from the extracted links (accepts a list or a single string)
video_manager = VideoDownloader(link=link_manager.all_desired_links).download()

# Each manager can also be used on its own, with default dependencies built from basic inputs:
standalone_soup = SoupManager(url=url).soup
standalone_links = LinkManager(url=url).all_desired_links

# Update methods for the manager classes
url_1 = 'thedailydialectics.com'
print(f"updating URL to {url_1}")
url_manager.update_url(url=url_1)
request_manager.update_url(url=url_1)
soup_manager.update_url(url=url_1)
link_manager.update_url(url=url_1)

# Update URL manager references
request_manager.update_url_manager(url_manager=url_manager)
soup_manager.update_url_manager(url_manager=url_manager)
link_manager.update_url_manager(url_manager=url_manager)

# Update source code for the parsing managers
source_code_bytes = request_manager.source_code_bytes
soup_manager.update_source_code(source_code=source_code_bytes)
link_manager.update_source_code(source_code=source_code_bytes)
```
This project is licensed under the MIT License - see the LICENSE file for details.