FilterHTML
v0.6 - White-list tags, attributes, classes, styles. With tag-specific text filtering and tag contents removal.
Demo (JS)
A dictionary-defined white-listing HTML filter. Useful for filtering HTML to leave behind a supported or safe sub-set.
Python installation:
pip install FilterHTML
Node.js installation:
npm install filterhtml
Browser: use ./lib/FilterHTML.js in a <script> tag
Run Python Tests: nosetests --with-coverage
Run JavaScript Tests: nodeunit tests/run_tests.js
Filtering Example, in Python:
import FilterHTML

whitelist = {
    'a': {
        'href': 'url',
        'target': [
            '_blank',
            '_self'
        ],
        'class': [
            'button'
        ]
    },
    'img': {
        'src': 'url',
        'width': 'measurement',
        'height': 'measurement'
    },
    'span': {
        'style': {
            'color': 'color',
            'background-color': 'color'
        }
    }
}
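# Text filter: receives each text node and the tags it is inside.
def replace_text(text, tags):
    return text.replace('sad', '<strong>happy</strong>')

# With explicit URL schemes and a text filter:
filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist, ('http', 'https', 'mailto', 'ftp'), replace_text)

# Or with the white-list only:
filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)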
What this does:
- Lets you easily define a subset of HTML and it filters out everything else
- Ensures there are no encoded characters in attributes (e.g. &#58; in HTML or \3A in CSS)
- Lets you use regular expressions, lists, function delegates or built-ins as rules/filters
- Lets you filter or match attributes on tags
- Lets you filter or match individual CSS styles in style attributes
- Lets you define allowed classes as a list
- Lets you specify a function delegate to define the specification for a tag, depending on which tags it is inside
- Lets you specify a function delegate for modifying or filtering text nodes, i.e. text between tags (e.g. URL auto-linking, emoticon parsing, #tagging, @mentioning, etc.). The output of the delegate is also HTML filtered
- Lets you convert one tag into another (with specified attributes)
- Lets you completely remove contents of specified tags from HTML
- Runs in Python (e.g. Flask, Bottle, Django) or JavaScript (e.g. Node.js, io.js, or the browser)
- Helps reduce XSS/code injection vulnerabilities, provided the white-list itself is written carefully
What this doesn't do:
- Clean up tag soup (use something else for that, like BeautifulSoup): this assumes the HTML is valid and complete. It will throw exceptions if it detects unclosed opening tags, or extra closing tags.
- Claim to be XSS-safe out of the box: be careful with your white-list specification and test it thoroughly (here's a handy resource: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet).
Class and Style filtering
- parses the 'class' attribute into a list of values to match against allowed classes (list of values or regular expressions)
- parses the 'style' attribute to match each style against a list of allowed styles, each with individual rules
e.g.
{
    'div': {
        'style': {
            'width': 'measurement',
            'height': 'measurement',
            'background-color': 'color',
            'text-align': ['left', 'right', 'center', 'justify', 'inherit'],
            'border': border_filter_function,
            'border-radius': re.compile(r'^\d+px$')
        }
    },
    'span': {
        'class': [
            'icon',
            re.compile(r'^icon\-[a-zA-Z0-9\-]+$')
        ]
    }
}
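The `border_filter_function` used above is not defined in this document. As a minimal sketch, assuming a style filter function receives the raw value string and returns either a replacement string or None to reject it (as described for function filters under the built-in filters), it could look like the following. The accepted widths, line styles and colour formats are illustrative choices, not rules taken from the library.

```python
import re

# Hypothetical rule: "<width>px <style> <color>", e.g. "1px solid #ccc".
# Colours are limited to #rgb / #rrggbb or a plain alphabetic name.
BORDER_RE = re.compile(
    r'^(\d{1,2})px\s+(solid|dashed|dotted)\s+'
    r'(#[0-9a-fA-F]{3}(?:[0-9a-fA-F]{3})?|[a-zA-Z]+)$'
)

def border_filter_function(value):
    """Return a normalised border value, or None to reject it."""
    match = BORDER_RE.match(value.strip())
    if match is None:
        return None
    width, style, color = match.groups()
    # Collapse whitespace and lower-case the colour.
    return '%spx %s %s' % (width, style, color.lower())
```

Anything that does not fit the pattern, such as a value containing `url(...)`, is rejected rather than repaired.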
Text filtering/modification
- Text (between tags) can be filtered or modified with a delegate function. This function is passed each string of text between tags, as well as a list of the tags this string is inside (and their attributes). The string is replaced with the output of this function, and it is also filtered according to the supplied white-list specification.
The following Python example does simple auto-linking of URLs, but only those not already inside 'a' tags.
N.B. the output HTML of the urlize function is also HTML filtered using the same spec.
import re

URLIZE_RE = '(%s)' % '|'.join([
    r'<(?:f|ht)tps?://[^>]*>',
    r'\b(?:f|ht)tps?://[^)<>\s]+[^.,)<>\s]',
])

def urlize(text, stack):
    # stack lists the enclosing tags as (tag_name, attributes) pairs
    is_inside_a_tag = False
    for tag in stack:
        tag_name, attributes = tag
        if tag_name == 'a':
            is_inside_a_tag = True
            break

    if is_inside_a_tag:
        return text
    else:
        return re.sub(URLIZE_RE, r'<a href="\1">\1</a>', text)

# Filter with URL auto-linking:
result = FilterHTML.filter_html(html, spec, text_filter=urlize)

# As above, and also remove 'script' and 'style' tags together with their contents:
result = FilterHTML.filter_html(html, spec, text_filter=urlize, remove=['script', 'style'])
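The same delegate shape works for other text transformations. Below is a hedged sketch of an @mention linker, assuming the stack is a list of (tag_name, attributes) pairs as in `urlize` above. The `link_mentions` name, the skipped tags and the `https://example.com/users/` URL prefix are placeholders invented for illustration.

```python
import re

# "@name" not preceded by a word character or another "@",
# so e-mail addresses such as bob@example.com are left alone.
MENTION_RE = re.compile(r'(?<![\w@])@([A-Za-z0-9_]{1,30})')

# Illustrative choice: do not rewrite text inside these tags.
SKIP_TAGS = ('a', 'code', 'pre')

def link_mentions(text, stack):
    """Text filter delegate: wrap @mentions in links."""
    if any(tag_name in SKIP_TAGS for tag_name, attributes in stack):
        return text
    return MENTION_RE.sub(
        r'<a href="https://example.com/users/\1">@\1</a>', text)
```

Because the delegate's output is filtered with the same spec, the spec has to allow 'a' tags with an 'href' rule for the generated links to survive.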
Built-In Filters and Whitelist Types:
At the attribute, class, or style level of the whitelist, the following are valid filters:
- "url", for parsing URLs and matching against allowed schemes (http://, ftp://, mailto:, etc.). This also escapes unsafe URL characters (if not already escaped). Invalid URL attributes will be replaced with "#". Leading and trailing spaces will be stripped.
- "url|empty", same as above, but also allows empty-string attributes. Invalid URL attributes will be replaced with "" (empty string).
- "boolean", for attributes which have no value (are either present, or not, such as the "checked" attribute). N.B. values such as checked="checked" or checked="" keep the attribute; any other value causes the attribute to be removed.
- "color", for matching an HTML color value (either a string, like "red", "blue", etc. or "#fff", "#f0f0f0", or valid "rgb", "rgba", "hsl", or "hsla" values)
- "measurement", for matching style measurements, e.g. "42px", "10%", "6em", etc.
- "int", for matching an integer
- "alpha", for matching alphabetical characters
- "alphanumeric", for matching alphabetical and digit characters
- "alpha|empty", for matching alphabetical characters, or empty string
- "alphanumeric|empty", for matching alphabetical and digit characters, or empty string
- "text", for matching against HTML-entity escaped text (e.g. alt attributes). Greater-than, less-than, ampersand, semicolon, and single/double-quote characters will be replaced with their HTML escaped entity equivalents. Existing escape sequences will remain unmodified.
- "[allowedchars]", for allowing only the characters listed between the opening [ and closing ]
- regular expressions (which must match the value, or the value will be removed)
- a function, which takes the value as an argument, and returns a string replacement (or a None/null value to reject and remove the attribute)
- "*", which matches anything, and will allow any value to remain unchanged
Additionally for attributes:
- a list of allowed (string) values can be provided for any attribute
- "class" attributes can be treated like a standard attribute, or can be given a list of allowed (string) values which match against any of the provided class names. This may also include a function or regular expression to decide which class names are kept
- "style" attributes can be given an object/dictionary with the keys as style names, and any of the above filters as the values.
At the tag-level:
- An object/dictionary defines the allowed attributes (keys are attribute names, values are the above filters)
- A false boolean value to remove this tag and its contents
- A function, which takes two arguments: the tag name, and the stack of tags above the current tag in the document. This function returns either of the above (object/dictionary, or boolean)
Special Attribute Values
The following can be used instead of attribute names:
- "*" to allow these rules on all attributes which have not otherwise been specified
- A regular expression object (Python only), to use this rule-set for matching attributes which have not otherwise been specified
- "^$" to define a list of [RegEx, rule] pairs, to be used instead of the above, when a regular expression cannot be given as a key (i.e. JavaScript), or the regular expressions need to be evaluated in a specific order
e.g.
{
    "tag_name": {
        "attribute_name": attribute_rules,
        "^$": [
            [/^regex$/, matching_attribute_rules]
        ],
        "*": remaining_attribute_rules
    }
}
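In Python the two regular-expression forms could be written as follows. This is a sketch of the spec shape only: the `DATA_ATTR` and `ARIA_ATTR` patterns, and the rules assigned to them, are hypothetical examples rather than recommended settings.

```python
import re

# Hypothetical patterns for two attribute families.
DATA_ATTR = re.compile(r'^data-[a-z][a-z0-9-]*$')
ARIA_ATTR = re.compile(r'^aria-[a-z]+$')

# Python only: a compiled regular expression used directly as a key.
spec_with_regex_key = {
    'div': {
        'id': 'alphanumeric',
        DATA_ATTR: 'text',
    }
}

# Portable form: [regex, rule] pairs under the "^$" key,
# listed in the order they should be tried.
spec_with_pairs = {
    'div': {
        'id': 'alphanumeric',
        '^$': [
            [ARIA_ATTR, 'alpha'],
            [DATA_ATTR, 'text'],
        ],
    }
}
```

The "^$" form is the one to reach for when the same spec has to be mirrored in JavaScript, or when the matching order matters.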
White-list
Define an allowed HTML subset as a JavaScript Object/Python Dictionary.
For regular expression filters, you can use /pattern/modifiers syntax in JavaScript (or new RegExp), or in Python: re.compile()
White-list format for allowing a tag can use many combinations of different filtering options, e.g.
{
    "tag_name_a": {
        "attribute_a": ["allowed-value", "another-allowed-value"],
        "attribute_b": "url",
        "attribute_c": re.compile(r'^regex$'),
        "attribute_d": attribute_filtering_function,
        "attribute_e": [
            "allowed-value",
            re.compile(r'^regex$'),
            attribute_filtering_function
        ],
        "class": [
            "allowed-class-name",
            "another-allowed-class-name",
            re.compile(r'^class-name-regex$')
        ],
        "style": {
            "style-name-a": "color",
            "style-name-b": [
                "value-1", "value-2"
            ],
            "style-name-c": re.compile(r'^regex$'),
            "style-name-d": style_filtering_function
        }
    },
    "tag_name_b": {},
    "tag_name_c": tag_filtering_function,
    "tag_name_d": False,
}
White-list tag filtering functions are defined as:
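def tag_filtering_function(tag_name, tag_stack):
    # Return one of the following (alternatives shown commented out):
    #   return False    # remove this tag and its contents
    #   return None
    return {
        # dictionary of allowed attributes and their rules
        'attribute_name': ['attribute_value']
    }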
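As an illustration of a context-dependent tag rule, the sketch below allows 'li' only when it sits somewhere inside a 'ul' or 'ol'. It assumes the tag stack uses the same (tag_name, attributes) pairs that the text filter receives; the `list_item_filter` name and the class names are made up for the example.

```python
# Rules applied to <li> when it appears inside a list (hypothetical classes).
LIST_ITEM_RULES = {'class': ['done', 'todo']}

def list_item_filter(tag_name, tag_stack):
    """Tag filter: allow the tag only inside <ul> or <ol>."""
    inside_list = any(
        parent_name in ('ul', 'ol')
        for parent_name, attributes in tag_stack
    )
    if inside_list:
        return LIST_ITEM_RULES
    # A false value removes the tag and its contents.
    return False

# The function is used in place of a rule dictionary.
spec = {
    'ul': {},
    'ol': {},
    'li': list_item_filter,
}
```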
Attribute/Style filtering functions are defined as:
def attr_filter(attribute_value):
    return "new-attribute-value"

def style_filter(style_value):
    return "new-style-value"
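A slightly more realistic attribute filter might both validate and rewrite the value. The sketch below caps an integer width and returns None for anything else, which per the filter rules above rejects and removes the attribute. `clamp_width` and the 800 limit are illustrative assumptions.

```python
import re

MAX_WIDTH = 800  # illustrative limit

def clamp_width(attribute_value):
    """Keep plain integer widths, capped at MAX_WIDTH; reject the rest."""
    value = attribute_value.strip()
    if re.fullmatch(r'[0-9]{1,6}', value) is None:
        return None
    return str(min(int(value), MAX_WIDTH))

# Used like any other rule in the white-list.
image_rules = {
    'img': {
        'src': 'url',
        'width': clamp_width,
    }
}
```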
Python example whitelist:
import re

spec = {
    "div": {
        "class": [
            "container",
            "content"
        ]
    },

    "p": {
        "class": [
            "centered"
        ],
        "style": {
            "color": re.compile(r'^#[0-9A-Fa-f]{6}$')
        }
    },

    "a": {
        "href": "url",
        "target": [
            "_blank"
        ]
    },

    "img": {
        "src": "url",
        "width": "int",
        "height": "int"
    },

    "input": {
        "type": "alpha",
        "name": "[abcdefghijklmnopqrstuvwxyz-]",
        "value": "alphanumeric"
    },

    "hr": {},
    "br": {},
    "strong": {},

    "i": {
        "class": re.compile(r'^icon-[a-z0-9_]+$')
    },

    "*": {
        "class": ["text-left", "text-right", "text-centered"]
    },

    # convert one tag into another (optionally with attributes)
    "b": "strong",
    "center": "p class=\"text-centered\""
}