filterhtml
v0.6 - White-list tags, attributes, classes, styles. With tag-specific text filtering and tag contents removal.
A dictionary-defined white-listing HTML filter. Useful for filtering HTML to leave behind a supported or safe sub-set.
Python installation:
pip install FilterHTML
Node.js installation:
npm install filterhtml
Browser: use ./lib/FilterHTML.js in a <script> tag
Run Python Tests: nosetests --with-coverage
Run JavaScript Tests: nodeunit tests/run_tests.js
Filtering Example, in Python:
import FilterHTML

# only allow:
#   <a> tags with valid href URLs
#   <img> tags with valid src URLs and measurements
#   <span> tags with valid color styles
whitelist = {
    'a': {
        'href': 'url',
        'target': [
            '_blank',
            '_self'
        ],
        'class': [
            'button'
        ]
    },
    'img': {
        'src': 'url',
        'width': 'measurement',
        'height': 'measurement'
    },
    'span': {
        'style': {
            'color': 'color',
            'background-color': 'color'
        }
    }
}

# perform replacements on text (between tags)
def replace_text(text, tags):
    return text.replace('sad', '<strong>happy</strong>')

# filter the unfiltered_html, using the above whitelist, using specified allowed url schemes, and a text replacement function
filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist, ('http', 'https', 'mailto', 'ftp'), replace_text)

# simpler usage: filter using the default (same as above) url schemes, and no replacement function:
filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)
What this does:
What this doesn't do:
e.g.
{
    'div': {
        # style filtering:
        'style': {
            'width': 'measurement',
            'height': 'measurement',
            'background-color': 'color',
            'text-align': ['left', 'right', 'center', 'justify', 'inherit'],
            'border': border_filter_function,  # implement your own function
            'border-radius': re.compile(r'^\d+px$')
        }
    },
    'span': {
        # class filtering (a list of allowed matches: strings, regex or functions):
        'class': [
            'icon',
            re.compile(r'^icon\-[a-zA-Z0-9\-]+$')
        ]
    }
}
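The `border_filter_function` placeholder in the example above is left for you to implement. The following is a minimal sketch, assuming a style filter function receives the raw style value and returns either a replacement string or None to drop the style. The pattern and the accepted line styles are illustrative assumptions, not part of FilterHTML.

```python
import re

# Hypothetical pattern: "<width>px <line-style> <#rrggbb>", e.g. "1px solid #ff0000".
BORDER_RE = re.compile(
    r'^(\d{1,2})px\s+(solid|dashed|dotted)\s+(#[0-9A-Fa-f]{6})$'
)


def border_filter_function(style_value):
    """Return a normalised border value, or None to drop the style."""
    match = BORDER_RE.match(style_value.strip())
    if match is None:
        # Anything that is not a plain width/style/colour triple is rejected.
        return None
    width, line_style, color = match.groups()
    # Collapse repeated whitespace and lower-case the colour.
    return '%spx %s %s' % (width, line_style, color.lower())
```

Rejecting everything that does not match a narrow pattern keeps the function in the same white-listing spirit as the rest of the spec.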
The following Python example does simple auto-linking of URLs, but only for those not already inside 'a' tags. N.B. the HTML output by the urlize function is itself filtered using the same spec.
import re

URLIZE_RE = '(%s)' % '|'.join([
    r'<(?:f|ht)tps?://[^>]*>',
    r'\b(?:f|ht)tps?://[^)<>\s]+[^.,)<>\s]',
])

# second argument is a list of tags which this text is inside,
# each element a tuple: (tag_name, attributes)
def urlize(text, stack):
    is_inside_a_tag = False
    for tag in stack:
        tag_name, attributes = tag
        if tag_name == 'a':
            is_inside_a_tag = True
            break

    if is_inside_a_tag:
        return text
    else:
        return re.sub(URLIZE_RE, r'<a href="\1">\1</a>', text)

result = FilterHTML.filter_html(html, spec, text_filter=urlize)

# script and style tag contents can be removed:
result = FilterHTML.filter_html(html, spec, text_filter=urlize, remove=['script', 'style'])
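A text filter like urlize can also be exercised on its own, without running the HTML filter, by passing it a hand-built stack. The sketch below restates urlize compactly with any() and calls it directly. The stacks are assumptions standing in for what the filter would pass: the attributes element is shown as a dict, which the text above does not confirm.

```python
import re

URLIZE_RE = '(%s)' % '|'.join([
    r'<(?:f|ht)tps?://[^>]*>',
    r'\b(?:f|ht)tps?://[^)<>\s]+[^.,)<>\s]',
])


def urlize(text, stack):
    # Same behaviour as the urlize function above, written with any().
    if any(tag_name == 'a' for tag_name, _attributes in stack):
        return text
    return re.sub(URLIZE_RE, r'<a href="\1">\1</a>', text)


# Hand-built stacks (assumed shape: a list of (tag_name, attributes) tuples).
inside_paragraph = [('p', {})]
inside_link = [('p', {}), ('a', {'href': 'http://example.com/'})]

# The trailing full stop is left outside the generated link.
linked = urlize('see http://example.com/page.', inside_paragraph)

# Text that is already inside an <a> tag is returned unchanged.
untouched = urlize('see http://example.com/page.', inside_link)
```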
At the attribute, class, or style level of the whitelist, the following are valid filters:
"[allowedchars]", for allowing characters specified between starting and ending [ ]
Additionally for attributes:
At the tag-level:
The following can be used instead of attribute names:
[RegEx, rule] pairs, to be used instead of the above when a regular expression cannot be given as a key (i.e. in JavaScript), or when the regular expressions need to be evaluated in a specific order, e.g.
{
    "tag_name": {
        "attribute_name": attribute_rules,
        "^$": [
            [/^regex$/, matching_attribute_rules]
        ],
        "*": remaining_attribute_rules
    }
}
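One way to picture how such a tag entry might be read is the lookup sketched below. It is an illustration only, not FilterHTML's implementation: the function name `resolve_attribute_rule` is hypothetical, and the precedence shown (exact attribute name first, then the ordered "^$" pairs, then the "*" fallback) is an assumption inferred from the layout above.

```python
import re


def resolve_attribute_rule(tag_spec, attribute_name):
    """Pick the rule for one attribute (illustrative lookup order only)."""
    # 1. An exact attribute name is checked first (assumed precedence).
    if attribute_name not in ('^$', '*') and attribute_name in tag_spec:
        return tag_spec[attribute_name]
    # 2. Then the [regex, rule] pairs under "^$", in the order they are listed.
    for pattern, rule in tag_spec.get('^$', []):
        if pattern.match(attribute_name):
            return rule
    # 3. Finally the "*" catch-all; None means no rule applies.
    return tag_spec.get('*')


div_spec = {
    'id': 'alphanumeric',
    '^$': [
        [re.compile(r'^data-id$'), 'int'],
        [re.compile(r'^data-[a-z-]+$'), 'alpha'],
    ],
    '*': ['allowed-value'],
}
```

Both patterns match `data-id`, so listing the more specific pair first is what selects the `int` rule for it. This is the situation where an ordered list of pairs is useful.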
Define an allowed HTML subset as a JavaScript Object/Python Dictionary.
For regular expression filters, you can use /pattern/modifiers syntax in JavaScript (or new RegExp), or in Python: re.compile()
White-list format for allowing a tag can use many combinations of different filtering options, e.g.
{
    "tag_name_a": {
        # attribute filtering by list of allowed values, built-in, regex, function delegate,
        # or a list of these types
        "attribute_a": ["allowed-value", "another-allowed-value"],
        "attribute_b": "url",
        "attribute_c": re.compile(r'^regex$'),
        "attribute_d": attribute_filtering_function,
        "attribute_e": [
            "allowed-value",
            re.compile(r'^regex$'),
            attribute_filtering_function
        ],

        # class filtering by a list of allowed values, or class-name matching regex
        "class": [
            "allowed-class-name",
            "another-allowed-class-name",
            re.compile(r'^class-name-regex$')
        ],

        # style filtering by object of allowed styles
        # filtered by: built-in, list of allowed values, regex, function delegate
        "style": {
            "style-name-a": "color",
            "style-name-b": [
                "value-1", "value-2"
            ],
            "style-name-c": re.compile(r'^regex$'),
            "style-name-d": style_filtering_function
        }
    },

    # Allow this tag, but no attributes
    "tag_name_b": {},

    # Use a function delegate to specify this tag's white-list
    "tag_name_c": tag_filtering_function,

    # Remove this tag, and all its contents
    "tag_name_d": False,

    # Unlisted tags will be removed, but their contents left intact
}
White-list tag filtering functions are defined as:
def tag_filtering_function(tag_name, tag_stack):
    # tag_name: the name of the tag being filtered
    # tag_stack: a list of (tag_name, attributes) for each tag
    #            above the current tag (in its parsing context)
    #            where the last in the list is the direct parent tag

    # Return one of the following:

    # Delete this tag and all its contents
    return False

    # Delete this tag, but not its contents
    return None

    # Return a custom specification for how to filter this tag
    return {
        'attribute_name': ['attribute_value']
    }
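As a concrete illustration of the signature above, here is a sketch of a tag filtering function that keeps `<li>` only when its direct parent is a list. The function name `list_item_filter`, the `item` class, and the dict used for attributes in the test stacks are hypothetical choices for this example.

```python
def list_item_filter(tag_name, tag_stack):
    """Allow <li> only directly inside <ul> or <ol>."""
    if tag_stack:
        # The last entry in the stack is the direct parent tag.
        parent_name, _parent_attributes = tag_stack[-1]
        if parent_name in ('ul', 'ol'):
            # Keep the tag, allowing only a white-listed class attribute.
            return {'class': ['item']}
    # Stray <li>: delete the tag but keep its contents.
    return None


spec = {
    'ul': {},
    'ol': {},
    'li': list_item_filter,
}
```

Returning None rather than False here keeps the text of a misplaced list item, which is usually the less surprising outcome for user-supplied HTML.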
Attribute/Style filtering functions are defined as:
def attr_filter(attribute_value):
    return "new-attribute-value"
    # or return None, or return '' to remove this attribute

def style_filter(style_value):
    return "new-style-value"
    # or return None, or return '' to remove this style
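For instance, an attribute filter can rewrite a value instead of only accepting or rejecting it. The sketch below caps an image width, following the signature above. The name `clamp_width` and the `MAX_WIDTH` limit are hypothetical.

```python
import re

MAX_WIDTH = 800  # hypothetical limit chosen for this example


def clamp_width(attribute_value):
    """Keep plain integer widths, capped at MAX_WIDTH; drop anything else."""
    value = attribute_value.strip()
    if re.fullmatch(r'[0-9]+', value) is None:
        # Not a plain integer: returning None removes the attribute.
        return None
    return str(min(int(value), MAX_WIDTH))


whitelist = {
    'img': {
        'src': 'url',
        'width': clamp_width,
    }
}
```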
Python example whitelist:
spec = {
    "div": {
        # list allowed attribute values, as a list
        "class": [
            "container",
            "content"
        ]
    },

    "p": {
        "class": [
            "centered"
        ],
        # style parsing
        "style": {
            "color": re.compile(r'^#[0-9A-Fa-f]{6}$')
        }
    },

    "a": {
        # parse urls to ensure there's no javascript, by using the "url" string.
        # disallow &# unicode encoding
        # by default allowed schemes are 'http', 'https', 'mailto', and 'ftp' (as well as local URIs)
        # this can be changed by passing in allowed_schemes=('http', 'myscheme')
        "href": "url",
        "target": [
            "_blank"
        ]
    },

    "img": {
        "src": "url",
        # make sure these fields are integers, by using the "int" string
        "width": "int",
        "height": "int"
    },

    "input": {
        # only allow alphabetical characters
        "type": "alpha",
        # allow any of these characters (within the [])
        "name": "[abcdefghijklmnopqrstuvwxyz-]",
        # allow alphabetical and digit characters
        "value": "alphanumeric"
    },

    # filter out all attributes for these tags
    "hr": {},
    "br": {},
    "strong": {},

    "i": {
        # use a regex match
        # in JavaScript you can use /this style/ regex.
        "class": re.compile(r'^icon-[a-z0-9_]+$')
    },

    # global attributes (allowed on all elements):
    # (N.B. only applies to tags already supplied as keys)
    # element's specific attributes take precedence, but if they are all filtered out
    # these global rules are applied to the original attribute value
    "*": {
        "class": ["text-left", "text-right", "text-centered"]
    },

    # aliases (convert one tag to another):

    # convert <b> tags to <strong> tags
    "b": "strong",

    # convert <center> tags to <p class="text-centered"> tags
    "center": "p class=\"text-centered\""
}
FAQs
FilterHTML: A whitelisting HTML filter for Python and JavaScript
The npm package filterhtml receives a total of 25 weekly downloads. As such, filterhtml popularity was classified as not popular.
We found that filterhtml demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.