`PyTidyLib`_ is a Python package that wraps the `HTML Tidy`_ library. This
allows you, from Python code, to "fix" invalid (X)HTML markup. Some of the
library's many capabilities include:

- Clean up unclosed tags and unescaped characters such as ampersands
- Output HTML 4 or XHTML, strict or transitional, and add missing doctypes
- Convert named entities to numeric entities, which can then be used in XML
  documents without an HTML doctype.
- Clean up HTML from programs such as Word (to an extent)
- Indent the output, including proper (i.e. no) indenting for ``pre`` elements,
  which some (X)HTML indenting code overlooks.
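
Each of these capabilities is selected through an HTML Tidy configuration
option passed in the ``options`` dictionary. The following is a minimal sketch,
assuming the option names ``output-xhtml``, ``doctype``, ``numeric-entities``,
``indent`` and ``word-2000`` from HTML Tidy's own configuration reference.
``CAPABILITY_OPTIONS`` and ``build_options`` are hypothetical helpers written
for this illustration and are not part of PyTidyLib.

```python
# Hypothetical helper: PyTidyLib itself only needs a plain dict of options.
CAPABILITY_OPTIONS = {
    "output-xhtml": 1,      # emit XHTML rather than HTML 4
    "doctype": "strict",    # request a strict doctype
    "numeric-entities": 1,  # numeric rather than named entities
    "indent": 1,            # indent block-level elements
}


def build_options(**overrides):
    """Return a copy of CAPABILITY_OPTIONS with keyword overrides applied.

    Underscores in keyword names become hyphens, so word_2000=1 sets the
    Tidy option "word-2000".
    """
    options = dict(CAPABILITY_OPTIONS)
    for name, value in overrides.items():
        options[name.replace("_", "-")] = value
    return options


if __name__ == "__main__":
    try:
        from tidylib import tidy_document

        document, errors = tidy_document(
            "<p>caf&eacute; <b>unclosed",
            options=build_options(word_2000=1),
        )
    except (ImportError, OSError) as exc:
        # Raised when PyTidyLib or the HTML Tidy shared library is missing.
        print("PyTidyLib or HTML Tidy is unavailable:", exc)
    else:
        print(document)
        print(errors)
```

Keeping the defaults in one dictionary and copying it per call means a
one-off override, such as turning indentation off, does not leak into later
calls.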

Changes
=======

- 0.3.2: Initialization bug fix
- 0.3.1: find_library support while still allowing a list of library names
- 0.3.0: Refactored to use Tidy and PersistentTidy classes while keeping the
  functional interface (which will lazily create a global Tidy() object) for
  backward compatibility. You can now pass a list of library names and base
  options when instantiating Tidy. The keep_doc argument is now deprecated
  and does nothing; use PersistentTidy.
- 0.2.4: Bugfix for a strange memory allocation corner case in Tidy.
- 0.2.3: Python 3 support (2 + 3 cross compatible) with passing Tox tests.

Small example of use
====================

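The following code cleans up an invalid HTML document and sets an option::

    from tidylib import tidy_document

    # numeric-entities makes Tidy emit numeric rather than named entities
    document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''',
                                     options={'numeric-entities': 1})
    print(document)  # the cleaned-up markup
    print(errors)    # warnings and errors reported by Tidy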

Docs
====

Documentation is shipped with the source distribution and is available at
the `PyTidyLib`_ web page.

.. _HTML Tidy: http://tidy.sourceforge.net/
.. _PyTidyLib: http://countergram.com/open-source/pytidylib/