filterhtml
v0.3 - White-list tags, attributes, classes, styles. With tag-specific text filtering and tag contents removal.
A dictionary-defined white-listing HTML filter. Useful for filtering HTML to leave behind a supported or safe sub-set.
Python and JavaScript versions
Python installation:
pip install FilterHTML
Node.js installation:
npm install filterhtml
Browser: copy ./lib/FilterHTML.js into your project
Example:
import FilterHTML

# only allow:
# <a> tags with valid href URLs
# <img> tags with valid src URLs and measurements
# <span> tags with valid color styles
whitelist = {
    'a': {
        'href': 'url',
        'target': [
            '_blank',
            '_self'
        ],
        'class': [
            'button'
        ]
    },
    'img': {
        'src': 'url',
        'width': 'measurement',
        'height': 'measurement'
    },
    'span': {
        'style': {
            'color': 'color',
            'background-color': 'color'
        }
    }
}

# perform replacements on text (between tags)
def replace_text(text, tags):
    return text.replace('sad', '<strong>happy</strong>')

# filter the unfiltered_html, using the above whitelist, using specified allowed url schemes, and a text replacement function
filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist, ('http', 'https', 'mailto', 'ftp'), replace_text)

# simpler usage: filter using the default (same as above) url schemes, and no replacement function:
filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)
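To make the role of the allowed URL schemes more concrete, here is a minimal sketch of a scheme check written with the standard library only. It is not FilterHTML's own implementation: the helper name is_allowed_url and the use of urllib.parse are assumptions made for illustration. Rejecting values that contain &# follows the "disallow &# unicode encoding" note in this README.

```python
from urllib.parse import urlsplit

# Default schemes named in this README.
DEFAULT_SCHEMES = ('http', 'https', 'mailto', 'ftp')

def is_allowed_url(url, allowed_schemes=DEFAULT_SCHEMES):
    """Hypothetical helper: True if the URL is local or uses an allowed scheme."""
    # Reject character references such as &#106; that could hide a scheme.
    if '&#' in url:
        return False
    # urlsplit lower-cases the scheme; local URIs have an empty scheme.
    scheme = urlsplit(url).scheme
    return scheme == '' or scheme in allowed_schemes

print(is_allowed_url('https://example.com/page'))
print(is_allowed_url('javascript:alert(1)'))
```

A check like this is only a pre-filter for experimentation; the library's built-in 'url' rule remains the thing to rely on when filtering real input.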
What this does:
What this doesn't do:
e.g.
{
    'div': {
        # style filtering:
        'style': {
            'width': 'measurement',
            'height': 'measurement',
            'background-color': 'color',
            'text-align': ['left', 'right', 'center', 'justify', 'inherit'],
            'border': border_filter_function,  # implement your own function
            'border-radius': re.compile(r'^\d+px$')
        }
    },
    'span': {
        # class filtering (a list of allowed matches, strings, regex or functions):
        'class': [
            'icon',
            re.compile(r'^icon\-[a-zA-Z0-9\-]+$')
        ]
    }
}
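The spec above refers to a border_filter_function that you are expected to write yourself. The sketch below shows one possible shape for it. The contract is an assumption, not something this README states: the function is taken to receive the raw style value and to return either a cleaned value or None to reject it. The accepted border syntax (pixel width, one of three line styles, six-digit hex color) is also just an illustrative choice.

```python
import re

# Hypothetical accepted form: "<width>px <style> <#rrggbb>", e.g. "1px solid #cccccc".
BORDER_RE = re.compile(r'^(\d+)px (solid|dashed|dotted) (#[0-9A-Fa-f]{6})$')

def border_filter_function(value):
    """Return a normalised border value, or None to reject it (assumed contract)."""
    match = BORDER_RE.match(value.strip())
    if match is None:
        return None
    width, style, color = match.groups()
    # Normalise the color to lower case so output is consistent.
    return '%spx %s %s' % (width, style, color.lower())

print(border_filter_function('1px solid #CCCCCC'))
print(border_filter_function('1px solid red; background: url(x)'))
```

Check the library source for the exact calling convention before wiring a function like this into a real spec.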
The following Python example does simple auto-linking of URLs, but only those not already inside 'a' tags. N.B. the HTML output of the urlize function is itself filtered using the same spec.
import re
import FilterHTML

URLIZE_RE = '(%s)' % '|'.join([
    r'<(?:f|ht)tps?://[^>]*>',
    r'\b(?:f|ht)tps?://[^)<>\s]+[^.,)<>\s]',
])

# second argument is a list of tags which this text is inside,
# each element a tuple: (tag_name, attributes)
def urlize(text, stack):
    is_inside_a_tag = False
    for tag in stack:
        tag_name, attributes = tag
        if tag_name == 'a':
            is_inside_a_tag = True
            break

    if is_inside_a_tag:
        return text
    else:
        return re.sub(URLIZE_RE, r'<a href="\1">\1</a>', text)

result = FilterHTML.filter_html(html, spec, text_filter=urlize)

# script and style tag contents can be removed:
result = FilterHTML.filter_html(html, spec, text_filter=urlize, remove=['script', 'style'])
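Because a text filter is an ordinary function of (text, stack), it can be exercised on its own before it is handed to the library. The sketch below calls a compact variant of urlize directly with hand-built stacks. Representing each tag's attributes as a dict is an assumption for illustration; the README only says each stack element is a (tag_name, attributes) tuple.

```python
import re

URLIZE_RE = '(%s)' % '|'.join([
    r'<(?:f|ht)tps?://[^>]*>',
    r'\b(?:f|ht)tps?://[^)<>\s]+[^.,)<>\s]',
])

def urlize(text, stack):
    # Leave the text alone if any enclosing tag is an <a>.
    if any(tag_name == 'a' for tag_name, _attributes in stack):
        return text
    return re.sub(URLIZE_RE, r'<a href="\1">\1</a>', text)

# Text inside a <p>: the URL is linked, the trailing full stop is left outside.
plain = urlize('Visit http://example.com.', [('p', {})])

# Text already inside an <a>: returned unchanged.
nested = urlize('http://example.com',
                [('p', {}), ('a', {'href': 'http://example.com'})])

print(plain)
print(nested)
```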
Matching can also be done against regular expressions or a list of allowed values. Values can also be passed through custom filtering functions.
Define an allowed HTML subset as a JSON object (for the JS version) or a Python dictionary.
For regular expression filters, you can use /pattern/modifiers syntax in JavaScript (or new RegExp), or in Python: re.compile()
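As a rough illustration of how a list of rules mixing strings, regular expressions and functions might be applied to a class attribute, consider the sketch below. It is a stand-alone approximation, not the library's code: value_allowed and filter_classes are hypothetical names, and treating a truthy function result as "allowed" is an assumption.

```python
import re

def value_allowed(rule, value):
    """Hypothetical check of a single value against a single rule."""
    if isinstance(rule, str):
        return value == rule                   # exact string match
    if isinstance(rule, re.Pattern):
        return rule.match(value) is not None   # regular expression match
    if callable(rule):
        return bool(rule(value))               # custom function (assumed contract)
    return False

def filter_classes(class_attr, rules):
    """Keep only the class names that satisfy at least one rule."""
    kept = [name for name in class_attr.split()
            if any(value_allowed(rule, name) for rule in rules)]
    return ' '.join(kept)

rules = ['icon', re.compile(r'^icon\-[a-zA-Z0-9\-]+$')]
print(filter_classes('icon icon-home evil', rules))
```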
Python example whitelist:
spec = {
    "div": {
        # list allowed attribute values, as a list
        "class": [
            "container",
            "content"
        ]
    },

    "p": {
        "class": [
            "centered"
        ],
        # style parsing
        "style": {
            "color": re.compile(r'^#[0-9A-Fa-f]{6}$')
        }
    },

    "a": {
        # parse urls to ensure there's no javascript, by using the "url" string.
        # disallow &# unicode encoding
        # by default allowed schemes are 'http', 'https', 'mailto', and 'ftp' (as well as local URIs)
        # this can be changed by passing in allowed_schemes=('http', 'myscheme')
        "href": "url",
        "target": [
            "_blank"
        ]
    },

    "img": {
        "src": "url",
        # make sure these fields are integers, by using the "int" string
        "width": "int",
        "height": "int"
    },

    "input": {
        # only allow alphabetical characters
        "type": "alpha",
        # allow any of these characters (within the [])
        "name": "[abcdefghijklmnopqrstuvwxyz-]",
        # allow alphabetical and digit characters
        "value": "alphanumeric"
    },

    # filter out all attributes for these tags
    "hr": {},
    "br": {},
    "strong": {},

    "i": {
        # use a regex match
        # in javascript you can use /this style/ regex.
        "class": re.compile(r'^icon-[a-z0-9_]+$')
    },

    # global attributes (allowed on all elements):
    # (N.B. only applies to tags already supplied as keys)
    # element's specific attributes take precedence, but if they are all filtered out
    # these global rules are applied to the original attribute value
    "*": {
        "class": ["text-left", "text-right", "text-centered"]
    },

    # aliases (convert one tag to another):
    # convert <b> tags to <strong> tags
    "b": "strong",

    # convert <center> tags to <p class="text-centered"> tags
    "center": "p class=\"text-centered\""
}