Process & Manipulate HTML via Python API

Product Page | Docs | Demos | API Reference | Examples | Blog | Search | Free Support
Aspose.HTML for Python via .NET is a powerful API for Python that provides headless browser functionality, allowing you to work with HTML documents. With this API, you can create new HTML documents or open existing ones from different sources. Once you have a document, you can manipulate it: remove and replace HTML nodes, render it, or convert it to other popular formats.
HTML API Features
The following are some popular features of Aspose.HTML for Python via .NET:
General Features
- Create, Load, and Read Documents. Create, load, and modify HTML, XHTML, Markdown, or SVG documents with full control over elements, attributes, and structure using a powerful DOM-based API.
- Load EPUB and MHTML File Formats. Open, read, and convert EPUB and MHTML documents with full support for their internal structure and linked resources.
- Edit Documents. Insert, remove, clone, or replace HTML elements at any level of the DOM tree for granular control over content.
- Save HTML Documents. Save documents along with all linked resources like CSS, fonts, and images using customizable saving options.
- Navigate HTML. Navigate through documents using either NodeIterator or TreeWalker.
- Sandboxing. Configure a Sandbox environment that is independent of the execution machine, ensuring a secure and isolated environment for running and testing.
- DOM Traversal. Navigate and manipulate the DOM tree using W3C-compliant traversal interfaces to inspect and retrieve content from HTML documents.
- XPath Queries. Perform high-performance XPath queries to find and extract target content from large HTML documents.
- CSS Selector and JavaScript. Use CSS selector queries and JavaScript execution to dynamically locate and extract specific elements.
- Extract CSS Styling Information. Retrieve and analyze inline styles, embedded <style> blocks, and external stylesheets within HTML documents.
- Extract any Data from HTML Documents. Text, attributes, form values, metadata, tables, links, or media elements: Aspose.HTML for Python via .NET enables the accurate and efficient extraction of any content for processing, analysis, or editing.
Conversion and Rendering
- Convert Documents. Convert HTML, XHTML, SVG, MHTML, MD, and EPUB files to a wide range of formats, including PDF, XPS, DOCX, and different image formats (PNG, JPEG, BMP, TIFF, and GIF).
- Custom Conversion Settings. Adjust page size, resolution, stylesheets, resource management, script execution, and other settings during conversion to fine-tune the output.
- Markdown Support. Convert HTML to Markdown or vice versa for content migration and Markdown-based workflows.
- Timeout Control. Set and control the timeout for the rendering process.
Advanced HTML Features
- Monitor DOM Changes. Use MutationObserver to monitor DOM modifications.
- HTML Templates. Populate HTML documents with external data sources such as XML and JSON.
- Output Streams. Support for both single (PDF, XPS) and multiple (image formats) output file streams.
- Check Web Accessibility. Check web documents against WCAG standards using built-in validators and accessibility rule sets.
Supported File Formats
| Format | Description | Load | Save |
| --- | --- | --- | --- |
| HTML | HyperText Markup Language format | ✔️ | ✔️ |
| XHTML | eXtensible HyperText Markup Language format | ✔️ | ✔️ |
| MHTML | MIME HTML format | ✔️ | ✔️ |
| EPUB | E-book file format | ✔️ | |
| SVG | Scalable Vector Graphics format | ✔️ | ✔️ |
| MD | Markdown markup language format | ✔️ | ✔️ |
| PDF | Portable Document Format | | ✔️ |
| XPS | XML Paper Specification format | | ✔️ |
| DOCX | Microsoft Word Open XML document format | | ✔️ |
| TIFF | Tagged Image File Format | | ✔️ |
| JPEG | Joint Photographic Experts Group format | | ✔️ |
| PNG | Portable Network Graphics format | | ✔️ |
| BMP | Bitmap Picture format | | ✔️ |
| GIF | Graphics Interchange Format | | ✔️ |
| WEBP | Modern image format providing both lossy and lossless compression | | ✔️ |
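The table above can be transcribed into a small lookup, which is handy for validating user input before calling a converter. This is a minimal sketch: the names `LOADABLE`, `SAVABLE`, `can_load`, `can_save`, and `is_listed_pair` are hypothetical helpers, not part of the Aspose.HTML API, and the sets only mirror the table rows.

```python
# Load/save support transcribed from the table above.
# These helpers are illustrative and are not part of the Aspose.HTML API.
LOADABLE = {"html", "xhtml", "mhtml", "epub", "svg", "md"}
SAVABLE = {
    "html", "xhtml", "mhtml", "svg", "md",
    "pdf", "xps", "docx",
    "tiff", "jpeg", "png", "bmp", "gif", "webp",
}


def _normalize(fmt):
    """Accept 'PDF', '.pdf', or ' pdf ' and return 'pdf'."""
    return fmt.strip().lower().lstrip(".")


def can_load(fmt):
    """True when the table marks the format as loadable."""
    return _normalize(fmt) in LOADABLE


def can_save(fmt):
    """True when the table marks the format as savable."""
    return _normalize(fmt) in SAVABLE


def is_listed_pair(src, dst):
    """True when src is listed as loadable and dst as savable."""
    return can_load(src) and can_save(dst)
```

The table lists load and save support independently, so `is_listed_pair` returning `True` does not by itself guarantee that a direct converter exists for that pair; check the API reference for the specific conversion.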
Platform Independence
Aspose.HTML for Python via .NET can be used to develop applications on operating systems such as Windows, provided that Python 3.5 or later is installed. You can build both 32-bit and 64-bit Python applications.
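To check whether an interpreter meets the stated requirements, a short standard-library script is enough. This is a sketch only: `MIN_PYTHON`, `python_is_supported`, and `interpreter_bits` are hypothetical names, and the minimum version is simply the one quoted above.

```python
import struct
import sys

MIN_PYTHON = (3, 5)  # minimum version stated in the text above


def python_is_supported(version_info=None):
    """Return True when the (major, minor) version meets MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def interpreter_bits():
    """Return the pointer size of the running interpreter in bits."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Supported Python version: {}".format(python_is_supported()))
    print("Interpreter build: {}-bit".format(interpreter_bits()))
```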
Get Started
Are you ready to give Aspose.HTML for Python via .NET a try?
Simply run pip install aspose-html-net from the Console to fetch the package.
If you already have Aspose.HTML for Python via .NET and want to upgrade the version, please run pip install --upgrade aspose-html-net to get the latest version.
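After installing or upgrading, you may want to confirm which version pip actually put in the environment. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8, which is newer than the minimum version mentioned above); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution="aspose-html-net"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("aspose-html-net is not installed in this environment")
    else:
        print("aspose-html-net " + version)
```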
You can run the following snippets in your environment to see how Aspose.HTML works, or check out the GitHub Repository or Aspose.HTML for Python via .NET Documentation for other common use cases.
Create a New HTML Document
If you want to create an HTML document programmatically from scratch, use the parameterless constructor:
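import aspose.html as ah

# Create an empty HTML document
with ah.HTMLDocument() as document:
    # Create a text node and append it to the document body
    text = document.create_text_node("Hello, World!")
    document.body.append_child(text)
    # Save the document to a file
    document.save("create-new-document.html")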
Source - Create a Document in Python
Here is an example of how to use Aspose.HTML for Python via .NET to find the images referenced by <img> elements on a web page and save them to a local folder:
import os
import aspose.html as ah
import aspose.html.net as ahnet

# Prepare the output folder
output_dir = "output/"
os.makedirs(output_dir, exist_ok=True)

# Load the web page
with ah.HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-color/") as doc:
    # Collect the distinct src values of all <img> elements
    images = doc.get_elements_by_tag_name("img")
    urls = set(img.get_attribute("src") for img in images)
    # Resolve each src against the document's base URI
    abs_urls = [ah.Url(url, doc.base_uri) for url in urls]

    for url in abs_urls:
        # Request the image through the document's network service
        request = ahnet.RequestMessage(url)
        response = doc.context.network.send(request)
        if response.is_success:
            # Save the image under its original file name
            file_name = os.path.basename(url.pathname)
            with open(os.path.join(output_dir, file_name), "wb") as f:
                f.write(response.content.read_as_byte_array())
Source - Extract Images From Website in Python
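The two path-handling steps in the snippet above (resolving each `src` against the page's base URI, then deriving a local file name from the URL path) can be tried without any network access. The sketch below deliberately swaps in the standard library's `urllib.parse` instead of the Aspose `Url` class, so it illustrates the idea rather than the library's behavior; `resolve_image_urls` and `file_name_for` are hypothetical helpers. It also skips `<img>` elements that have no `src` value.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def resolve_image_urls(base_uri, srcs):
    """Resolve src values against base_uri, skipping empty ones and duplicates."""
    resolved = []
    for src in srcs:
        if not src:  # <img> without a usable src attribute
            continue
        absolute = urljoin(base_uri, src)
        if absolute not in resolved:
            resolved.append(absolute)
    return resolved


def file_name_for(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)
```

For example, with the base URI `https://docs.aspose.com/svg/net/drawing-basics/svg-color/`, a made-up root-relative `src` such as `/img/a.png` resolves against the site root, while a plain `b.svg` resolves against the page's own folder.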
HTML to PDF in one line of code
Aspose.HTML for Python via .NET allows you to convert HTML to PDF, XPS, Markdown, MHTML, PNG, JPEG, and other file formats. The following snippet demonstrates the conversion from HTML to PDF literally with a single line of code!
import aspose.html.converters as conv
import aspose.html.saving as sav
conv.Converter.convert_html("document.html", sav.PdfSaveOptions(), "document.pdf")
Source - Convert HTML to PDF in Python
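The same one-line call extends naturally to a folder of files. The sketch below reuses only the `Converter.convert_html(source, PdfSaveOptions(), target)` call shown above; `plan_pdf_outputs` and `convert_all` are hypothetical helper names, and the Aspose imports are placed inside `convert_all` so that the path-planning part can be exercised without the package installed.

```python
import os


def plan_pdf_outputs(html_paths, output_dir):
    """Pair each HTML path with a .pdf path of the same stem in output_dir."""
    plan = []
    for path in html_paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        plan.append((path, os.path.join(output_dir, stem + ".pdf")))
    return plan


def convert_all(html_paths, output_dir):
    """Convert every HTML file to PDF using default save options."""
    # Imported here so plan_pdf_outputs works without aspose-html-net installed.
    import aspose.html.converters as conv
    import aspose.html.saving as sav

    os.makedirs(output_dir, exist_ok=True)
    for source, target in plan_pdf_outputs(html_paths, output_dir):
        conv.Converter.convert_html(source, sav.PdfSaveOptions(), target)
```

Note that two inputs with the same file name in different folders would map to the same output path; a real script would need to handle that case.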
Convert HTML to Markdown (MD)
The following snippet demonstrates the conversion from HTML to Git-based Markdown (MD) format:
import aspose.html.converters as conv
import aspose.html.saving as sav
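# Prepare a small HTML document
code = "<h1>Header 1</h1>" \
       "<h2>Header 2</h2>" \
       "<p>Hello, World!!</p>"

# Write it to disk; the with block closes the file automatically
with open("document.html", "w", encoding="utf-8") as f:
    f.write(code)

# Convert the HTML file to Git-based Markdown
conv.Converter.convert_html("document.html", sav.MarkdownSaveOptions.git, "output.md")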
Source - Creating an HTML Document
Convert EPUB to PDF using SaveOptions
The PdfSaveOptions class provides numerous properties that give you full control over a wide range of parameters and improve the process of converting EPUB to PDF format. In the example, we use the page_setup, jpeg_quality, and css.media_type properties:
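import os
import aspose.html.converters as conv
import aspose.html.saving as sav
import aspose.html.drawing as dr
# Assumption: the MediaType enum is exposed by aspose.html.rendering, as in the .NET API
import aspose.html.rendering as rn

# Set up the input and output paths
output_dir = "output/"
input_dir = "data/"
os.makedirs(output_dir, exist_ok=True)
document_path = os.path.join(input_dir, "input.epub")
save_path = os.path.join(output_dir, "epub-to-pdf.pdf")

# Open the EPUB file for reading
with open(document_path, "rb") as stream:
    options = sav.PdfSaveOptions()
    # Use an 800x600 page with 10-unit margins on every side
    options.page_setup.any_page = dr.Page(dr.Size(800, 600), dr.Margin(10, 10, 10, 10))
    # Apply print-media CSS rules (assign the value; a bare attribute access has no effect)
    options.css.media_type = rn.MediaType.PRINT
    # JPEG compression quality; 80 is an example value (assumption)
    options.jpeg_quality = 80
    # Convert while the stream is still open
    conv.Converter.convert_epub(stream, options, save_path)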
Source - Convert EPUB to PDF in Python