WebLeaf is a Python package that applies graph neural networks (GNNs) to HTML parsing and element comparison. It encodes HTML elements into graph embeddings, which support tasks such as element extraction, structural comparison, and distance measurement between elements. WebLeaf is well suited to web scraping, semantic HTML analysis, and automated web page comparison.
You can install WebLeaf using pip:
pip install webleaf
WebLeaf represents an HTML document as a graph, where each HTML element is a node, and the parent-child relationships between elements form the edges of the graph. The graph is then processed by a GCN (Graph Convolutional Network) that creates embeddings for each HTML element. These embeddings capture both the semantic content and structural relationships of the elements, allowing for tasks like element comparison, similarity measurement, and extraction.
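The graph construction described above can be sketched with the standard library alone. This is a minimal illustration, not WebLeaf's actual implementation: the `GraphBuilder` class and `html_to_graph` function are hypothetical names, and the sketch records only tag names and parent-child edges.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class GraphBuilder(HTMLParser):
    """Collects one node per element and one edge per parent-child pair."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # nodes[i] is the tag name of node i
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        index = len(self.nodes)
        self.nodes.append(tag)
        if self._stack:
            self.edges.append((self._stack[-1], index))
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray closing tags.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[pos]] == tag:
                del self._stack[pos:]
                break


def html_to_graph(html):
    """Hypothetical helper: returns (nodes, edges) for an HTML string."""
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


nodes, edges = html_to_graph(
    "<html><body><div><p>Hello</p><p>World</p></div></body></html>"
)
print(nodes)  # ['html', 'body', 'div', 'p', 'p']
print(edges)  # [(0, 1), (1, 2), (2, 3), (2, 4)]
```

Each edge is a (parent, child) index pair, which is the kind of edge list a graph neural network library typically consumes.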
The model also combines tag embeddings (representing HTML tags) and text embeddings (representing the textual content of elements), creating a powerful representation of the HTML page.
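To make the idea of combining tag and text features concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the `TAG_VOCAB` list, the two-number text feature, and the unweighted mean aggregation are stand-ins, whereas a real GCN layer applies learned weights and a normalized adjacency matrix.

```python
TAG_VOCAB = ["div", "p", "span"]  # hypothetical toy vocabulary


def tag_features(tag):
    """One-hot vector over TAG_VOCAB (all zeros for unknown tags)."""
    return [1.0 if tag == known else 0.0 for known in TAG_VOCAB]


def text_features(text):
    """Toy stand-in for a text embedding: [has text, word count]."""
    words = text.split()
    return [1.0 if words else 0.0, float(len(words))]


def node_features(tag, text):
    # Concatenate the two views into a single node vector.
    return tag_features(tag) + text_features(text)


def propagate(features, edges):
    """One mean-aggregation step over each node and its neighbours."""
    neighbours = {i: {i} for i in range(len(features))}  # self-loops
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    updated = []
    for i in range(len(features)):
        group = [features[j] for j in sorted(neighbours[i])]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated


elements = [("div", ""), ("p", "hello world"), ("span", "hi")]
edges = [(0, 1), (0, 2)]
features = [node_features(tag, text) for tag, text in elements]
embeddings = propagate(features, edges)
print(features[1])    # [0.0, 1.0, 0.0, 1.0, 2.0]
print(embeddings[1])  # [0.5, 0.5, 0.0, 0.5, 1.0]
```

After one step, the `p` node's vector already mixes in its parent's tag, which is how structural context ends up in an element's embedding.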
Here's a quick example of how to use WebLeaf:
from webleaf import Web
# Load your HTML content
with open('example.html', encoding='utf-8') as f:
    html_content = f.read()
# Create a Web object
web = Web(html_content)
# Extract an element using XPath
leaf = web.leaf(xpath=".//p")
# Extract an element using CSS selectors
leaf_css = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")
# Compare two elements
similarity = leaf.similarity(leaf_css)
print(f"Similarity: {similarity}")
>>> Similarity: 1.0
# Find the closest match for an element
path = web.find(leaf)
print(f"Element found at: {path}")
>>> Element found at: /html/body/div/div/div[1]/div[1]/p
Find Similar Elements: You can also find the top n most similar elements to a given one:
similar_paths = web.find_n(leaf, n=3)
print(f"Top 3 similar elements: {similar_paths}")
>>> Top 3 similar elements: ['/html/body/div/div/div[1]/div[1]/p', '/html/body/div/div/div[2]/div[1]/p', '/html/body/div/div/div[3]/div[1]/span']
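A top-n lookup of this kind can be sketched as a ranking over embedding vectors. The sketch below assumes cosine similarity as the measure, which the README does not confirm; `top_n`, `cosine_similarity`, and the embedding values are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query, candidates, n):
    """Return the n paths whose vectors are most similar to query."""
    return heapq.nlargest(
        n, candidates, key=lambda path: cosine_similarity(query, candidates[path])
    )


# Hypothetical embeddings keyed by XPath; the values are made up.
candidates = {
    "/html/body/div[1]/p": [1.0, 0.0, 1.0],
    "/html/body/div[2]/p": [0.9, 0.1, 1.0],
    "/html/body/footer/a": [0.0, 1.0, 0.0],
}
print(top_n([1.0, 0.0, 1.0], candidates, n=2))
# ['/html/body/div[1]/p', '/html/body/div[2]/p']
```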
Distance Measurement: Measure how unique or similar two elements are using mdist():
distance = leaf.mdist(leaf_css)
print(f"Distance: {distance}")
>>> Distance: 0.0
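The README does not say which metric mdist() uses. As a rough illustration of why identical elements yield 0.0, here is a sketch that assumes plain Euclidean distance between two embedding vectors; the function name and the vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


first = [1.0, 2.0]
second = [4.0, 6.0]
print(euclidean_distance(first, first))   # 0.0
print(euclidean_distance(first, second))  # 5.0
```

Under any true distance metric, an element compared with itself is at distance zero, and the value grows as the embeddings diverge.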
Web(html): Creates a Web object from an HTML document.
    html (str): The HTML content as a string.
leaf(xpath=None, css_select=None): Extracts a Leaf object using either an XPath or CSS selector.
    xpath (str): The XPath of the desired element.
    css_select (str): The CSS selector for the desired element.
similarity(leaf): Compares two Leaf objects based on their embeddings.
mdist(leaf): Measures the distance between two Leaf objects, representing how unique or different they are.
find(leaf): Finds the closest match for a Leaf object within the HTML structure.
find_n(leaf, n): Finds the n most similar elements to a given Leaf object, sorted by similarity.
    n (int): The number of most similar elements.

WebLeaf comes with a suite of unit tests covering basic operations such as element extraction, similarity comparisons, and graph encoding. To run the tests:
Install the dependencies:
pip install -r requirements.txt
Run the tests with pytest:
pytest
from webleaf import Web

# example_html is assumed to hold the HTML of a test page as a string.

def test_leaf_extraction():
    web = Web(example_html)
    leaf = web.leaf(xpath=".//p")
    assert leaf

def test_element_comparison():
    web = Web(example_html)
    leaf1 = web.leaf(xpath=".//p")
    leaf2 = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")
    assert leaf1.similarity(leaf2) > 0.9
The WebLeaf model uses a Graph Convolutional Network (GCN) pretrained on a diverse set of web pages to learn the structure and semantic relationships within HTML. The model is loaded from product_page_model_4_80.torch and is used to encode HTML elements into embeddings.
This t-SNE (t-Distributed Stochastic Neighbor Embedding) plot provides a 2D visualization of the WebLeaf-encoded web elements, which have been projected into a lower-dimensional space. The purpose of t-SNE is to represent high-dimensional data (such as the embeddings generated by WebLeaf) in two dimensions, allowing us to better visualize relationships and groupings among different types of web elements.
We welcome contributions! Feel free to submit issues, feature requests, or pull requests. Here's how you can contribute:
Create a feature branch: git checkout -b feature/new-feature
Commit your changes: git commit -m 'Add new feature'
Push the branch: git push origin feature/new-feature

This project is licensed under the MIT License. See the LICENSE file for details.
🌿 WebLeaf is a powerful and flexible tool for working with HTML as structured graph data. Give it a try and start leveraging the power of graph neural networks for your web scraping and analysis needs!