
WebLeaf is a Python package that brings the power of graph neural networks (GNNs) to HTML parsing and element comparison. It encodes HTML elements into feature-rich graph embeddings, allowing for advanced tasks like element extraction, structural comparison, and distance measurement between elements. WebLeaf is perfect for web scraping, semantic HTML analysis, and automated web page comparison tasks.
You can install WebLeaf using pip:
pip install webleaf
WebLeaf represents an HTML document as a graph, where each HTML element is a node, and the parent-child relationships between elements form the edges of the graph. The graph is then processed by a GCN (Graph Convolutional Network) that creates embeddings for each HTML element. These embeddings capture both the semantic content and structural relationships of the elements, allowing for tasks like element comparison, similarity measurement, and extraction.
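As a rough illustration of the graph representation described above, the following sketch turns an HTML string into a node list and a parent-child edge list using only the standard library. The names `DomGraphBuilder`, `build_dom_graph`, and `VOID_TAGS` are hypothetical and are not part of WebLeaf; the library's own graph construction may differ.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag (partial list).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Collects one node per element and one edge per parent-child pair."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # node id -> tag name
        self.edges = []   # (parent id, child id)
        self._stack = []  # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append(tag)
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the most recent open element with this tag name.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]] == tag:
                del self._stack[i:]
                break


def build_dom_graph(html):
    """Return (nodes, edges) for the given HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


nodes, edges = build_dom_graph(
    "<html><body><div><p>Hi</p><p>Bye</p></div></body></html>"
)
print(nodes)  # ['html', 'body', 'div', 'p', 'p']
print(edges)  # [(0, 1), (1, 2), (2, 3), (2, 4)]
```

Each element gets an integer id in document order, so an edge `(2, 3)` means node 2 (`div`) is the parent of node 3 (the first `p`).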
The model also combines tag embeddings (representing HTML tags) and text embeddings (representing the textual content of elements), creating a powerful representation of the HTML page.
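To make the idea of combining tag and text information concrete, here is a toy sketch that concatenates a one-hot tag vector with a hashed bag-of-words text vector, then averages each node's features with those of its neighbors. This is a simplified, weight-free stand-in for a graph convolution step, not WebLeaf's trained GCN; `TAG_VOCAB`, `TEXT_DIM`, and all function names are assumptions made for illustration.

```python
import zlib

TAG_VOCAB = ["div", "p", "span", "a"]  # hypothetical tag vocabulary
TEXT_DIM = 4                            # hypothetical text feature size


def tag_features(tag):
    """One-hot vector over TAG_VOCAB (all zeros for unknown tags)."""
    return [1.0 if tag == known else 0.0 for known in TAG_VOCAB]


def text_features(text):
    """Hashed bag-of-words: each word increments one of TEXT_DIM buckets."""
    vec = [0.0] * TEXT_DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % TEXT_DIM] += 1.0
    return vec


def node_features(tag, text):
    """Concatenate the tag and text vectors into one feature vector."""
    return tag_features(tag) + text_features(text)


def propagate(features, edges):
    """One round of mean aggregation over each node and its neighbors."""
    neighbors = {i: {i} for i in range(len(features))}
    for parent, child in edges:
        neighbors[parent].add(child)
        neighbors[child].add(parent)
    out = []
    for i in range(len(features)):
        group = [features[j] for j in sorted(neighbors[i])]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


elements = [("div", ""), ("p", "hello world"), ("span", "hello")]
edges = [(0, 1), (0, 2)]  # the div is the parent of both other nodes
features = [node_features(tag, text) for tag, text in elements]
embeddings = propagate(features, edges)
print(embeddings[1][:4])  # tag part of the <p> node after mixing with its parent
```

After one round, the `p` node's vector already carries information about its parent `div`, which is the intuition behind embeddings that capture structure as well as content.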
Here's a quick example of how to use WebLeaf:
from webleaf import Web

# Load your HTML content
with open('example.html') as f:
    html_content = f.read()
# Create a Web object
web = Web(html_content)
# Extract an element using XPath
leaf = web.leaf(xpath=".//p")
# Extract an element using CSS selectors
leaf_css = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")
# Compare two elements
similarity = leaf.similarity(leaf_css)
print(f"Similarity: {similarity}")
# Output: Similarity: 1.0
# Find the closest match for an element
path = web.find(leaf)
print(f"Element found at: {path}")
# Output: Element found at: /html/body/div/div/div[1]/div[1]/p
Find Similar Elements: You can also find the top n most similar elements to a given one:
similar_paths = web.find_n(leaf, n=3)
print(f"Top 3 similar elements: {similar_paths}")
# Output: Top 3 similar elements: ['/html/body/div/div/div[1]/div[1]/p', '/html/body/div/div/div[2]/div[1]/p', '/html/body/div/div/div[3]/div[1]/span']
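The ranking idea behind a top-n lookup can be sketched with plain vectors. The example below assumes cosine similarity as the metric and uses made-up three-dimensional embeddings keyed by XPath; the README does not state which metric `find_n` uses, and `cosine_similarity` and `top_n_similar` are hypothetical helpers rather than WebLeaf functions.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(query, candidates, n):
    """Return the n candidate keys most similar to query, best first."""
    return heapq.nlargest(
        n, candidates, key=lambda path: cosine_similarity(query, candidates[path])
    )


# Made-up embeddings for illustration only.
embeddings = {
    "/html/body/div/p[1]": [1.0, 0.0, 1.0],
    "/html/body/div/p[2]": [0.9, 0.1, 1.0],
    "/html/body/div/span": [0.0, 1.0, 0.0],
}
query = embeddings["/html/body/div/p[1]"]
print(top_n_similar(query, embeddings, n=2))
```

As in the `find_n` output above, the query element itself ranks first because it is maximally similar to its own embedding.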
Distance Measurement: Measure how unique or similar two elements are using mdist():
distance = leaf.mdist(leaf_css)
print(f"Distance: {distance}")
# Output: Distance: 0.0
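A distance of 0.0 is what any distance metric returns when both inputs resolve to the same element. The sketch below uses Euclidean distance purely as an assumed example; the README does not say which metric `mdist()` computes, and `euclidean_distance` is a hypothetical helper.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


same = euclidean_distance([0.2, 0.7, 0.1], [0.2, 0.7, 0.1])
different = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(same)       # 0.0
print(different)  # 5.0
```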
Web(html)
    html (str): The HTML content as a string.

leaf(xpath=None, css_select=None)
    Returns a Leaf object using either an XPath or CSS selector.
    xpath (str): The XPath of the desired element.
    css_select (str): The CSS selector for the desired element.

similarity(leaf)
    Compares two Leaf objects based on their embeddings.

mdist(leaf)
    Measures the distance between two Leaf objects, representing how unique or different they are.

find(leaf)
    Finds the closest match for a Leaf object within the HTML structure.

find_n(leaf, n)
    Finds the n most similar elements to a given Leaf object, sorted by similarity.
    n (int): The number of most similar elements to return.

WebLeaf comes with a suite of unit tests to ensure everything works as expected. These tests cover basic operations like element extraction, similarity comparisons, and graph encoding. To run the tests:
Install the required dependencies:
pip install -r requirements.txt
Run the tests with pytest:
pytest
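from webleaf import Web

# example_html is assumed to be a string holding the HTML under test,
# e.g. loaded from a fixture file.

def test_leaf_extraction():
    # An element selected by XPath should yield a usable Leaf.
    web = Web(example_html)
    leaf = web.leaf(xpath=".//p")
    assert leaf

def test_element_comparison():
    # The same element selected two ways should be near-identical.
    web = Web(example_html)
    leaf1 = web.leaf(xpath=".//p")
    leaf2 = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")
    assert leaf1.similarity(leaf2) > 0.9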
The WebLeaf model uses a pretrained Graph Convolutional Network (GCN) that has been trained on a diverse set of web pages to learn the structure and semantic relationships within HTML. The model is loaded from product_page_model_4_80.torch and is used to encode HTML elements into embeddings.
This t-SNE (t-Distributed Stochastic Neighbor Embedding) plot provides a 2D visualization of the WebLeaf-encoded web elements. t-SNE projects high-dimensional data, such as the embeddings generated by WebLeaf, into two dimensions, which makes relationships and groupings among different types of web elements easier to see.
We welcome contributions! Feel free to submit issues, feature requests, or pull requests. Here's how you can contribute:
Create a new branch:
git checkout -b feature/new-feature
Commit your changes:
git commit -m 'Add new feature'
Push the branch, then open a pull request:
git push origin feature/new-feature
This project is licensed under the MIT License. See the LICENSE file for details.
🌿 WebLeaf is a powerful and flexible tool for working with HTML as structured graph data. Give it a try and start leveraging the power of graph neural networks for your web scraping and analysis needs!
FAQs
HTML DOM Tree Leaf Structure Identification Package
We found that webleaf demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.