|build| |version| |license| |downloads|

.. |build| image:: https://img.shields.io/github/actions/workflow/status/matthewwithanm/python-markdownify/python-app.yml?branch=develop
    :alt: GitHub Workflow Status
    :target: https://github.com/matthewwithanm/python-markdownify/actions/workflows/python-app.yml?query=workflow%3A%22Python+application%22

.. |version| image:: https://img.shields.io/pypi/v/markdownify
    :alt: Pypi version
    :target: https://pypi.org/project/markdownify/

.. |license| image:: https://img.shields.io/pypi/l/markdownify
    :alt: License
    :target: https://github.com/matthewwithanm/python-markdownify/blob/develop/LICENSE

.. |downloads| image:: https://pepy.tech/badge/markdownify
    :alt: Pypi Downloads
    :target: https://pepy.tech/project/markdownify
``pip install markdownify``
Convert some HTML to Markdown:

.. code:: python

    from markdownify import markdownify as md

    md('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'

Specify tags to exclude:

.. code:: python

    from markdownify import markdownify as md

    md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a'])  # > '**Yay** GitHub'

...or specify the tags you want to include:

.. code:: python

    from markdownify import markdownify as md

    md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b'])  # > '**Yay** GitHub'

Markdownify supports the following options:
strip
  A list of tags to strip. This option can't be used with the
  ``convert`` option.

convert
  A list of tags to convert. This option can't be used with the
  ``strip`` option.

autolinks
  A boolean indicating whether the "automatic link" style should be used when
  an ``a`` tag's contents match its href. Defaults to ``True``.

default_title
  A boolean to enable setting the title of a link to its href, if no title is
  given. Defaults to ``False``.

heading_style
  Defines how headings should be converted. Accepted values are ``ATX``,
  ``ATX_CLOSED``, ``SETEXT``, and ``UNDERLINED`` (which is an alias for
  ``SETEXT``). Defaults to ``UNDERLINED``.

bullets
  An iterable (string, list, or tuple) of bullet styles to be used. If the
  iterable only contains one item, it will be used regardless of how deeply
  lists are nested. Otherwise, the bullet will alternate based on nesting
  level. Defaults to ``'*+-'``.

strong_em_symbol
  In Markdown, both ``*`` and ``_`` are used to encode strong or
  emphasized text. Choose between them with the values ``ASTERISK``
  (default) and ``UNDERSCORE``, respectively.

sub_symbol, sup_symbol
  Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
  empty string, because this is non-standard behavior. Could be something like
  ``~`` and ``^`` to result in ``~sub~`` and ``^sup^``. If the value starts
  with ``<`` and ends with ``>``, it is treated as an HTML tag and a ``/`` is
  inserted after the ``<`` in the string used after the text; this allows
  specifying ``<sub>`` to use raw HTML in the output for subscripts, for
  example.

newline_style
  Defines the style of marking linebreaks (``<br>``) in markdown. The default
  value ``SPACES`` of this option will adopt the usual two spaces and a newline,
  while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
  newline). While the latter convention is non-standard, it is commonly
  preferred and supported by a lot of interpreters.

code_language
  Defines the language that should be assumed for all ``<pre>`` sections.
  Useful if all code on a page is in the same programming language and
  should be annotated with `````python`` or similar. Defaults to
  ``''`` (empty string) and can be any string.

code_language_callback
  When the HTML code contains ``pre`` tags that in some way provide the code
  language, for example as a class, this callback can be used to extract the
  language from the tag and prefix it to the converted ``pre`` tag.
  The callback gets a single argument, a BeautifulSoup object, and returns
  a string containing the code language, or ``None``.
  An example that uses the class name as the code language could be::

    def callback(el):
        return el['class'][0] if el.has_attr('class') else None

  Defaults to ``None``.

escape_asterisks
  If set to ``False``, do not escape ``*`` to ``\*`` in text.
  Defaults to ``True``.

escape_underscores
  If set to ``False``, do not escape ``_`` to ``\_`` in text.
  Defaults to ``True``.

escape_misc
  If set to ``True``, escape miscellaneous punctuation characters
  that sometimes have Markdown significance in text.
  Defaults to ``False``.

keep_inline_images_in
  Images are converted to their alt-text when the images are located inside
  headlines or table cells. If some inline images should be converted to
  markdown images instead, this option can be set to a list of parent tags
  that should be allowed to contain inline images, for example ``['td']``.
  Defaults to an empty list.

wrap, wrap_width
  If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
  ``wrap_width`` characters. The defaults are ``False`` for ``wrap`` and
  ``80`` for ``wrap_width``.
  Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.

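Two of the rules above are easy to picture in code: how a ``bullets`` iterable
can be cycled by nesting depth, and how the closing string for ``sub_symbol``
and ``sup_symbol`` can be derived from the opening one. The following is a
minimal, standard-library-only sketch of those two rules. The helper names
``bullet_for_depth`` and ``closing_symbol`` are invented for this illustration
and are not part of markdownify, and the library's own implementation may
differ.

.. code:: python

    def bullet_for_depth(bullets, depth):
        """Pick the bullet for a list nested `depth` levels deep (0 = top level).

        Assumption: once nesting is deeper than the number of bullet
        styles, the styles are reused cyclically.
        """
        bullets = list(bullets)  # accepts a string, list, or tuple
        return bullets[depth % len(bullets)]


    def closing_symbol(symbol):
        """Return the string placed after <sub>/<sup> text for a given symbol.

        A value that starts with '<' and ends with '>' is treated as an
        HTML tag, so a '/' is inserted after the '<'. Any other value is
        reused unchanged.
        """
        if symbol.startswith('<') and symbol.endswith('>'):
            return '</' + symbol[1:]
        return symbol


    print([bullet_for_depth('*+-', d) for d in range(4)])  # ['*', '+', '-', '*']
    print(bullet_for_depth(['-'], 5))                      # -
    print(closing_symbol('<sub>'))                         # </sub>
    print(closing_symbol('~'))                             # ~

With a single bullet style the depth has no effect, which matches the
description of ``bullets`` above.
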
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.

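To make the two ways of passing options concrete, here is a toy sketch of the
nested ``Options`` class pattern. It does not import markdownify:
``ToyConverter``, ``AtxConverter`` and ``_public_attrs`` are hypothetical
names, and the precedence shown (keyword arguments override the nested class,
which overrides the defaults) is an assumption made for this illustration.

.. code:: python

    def _public_attrs(cls):
        """Collect the public class attributes of `cls` into a dict."""
        return {k: getattr(cls, k) for k in dir(cls) if not k.startswith('_')}


    class ToyConverter:
        """Hypothetical stand-in for a converter; not markdownify's code."""

        class DefaultOptions:
            heading_style = 'underlined'
            bullets = '*+-'

        class Options(DefaultOptions):
            pass

        def __init__(self, **options):
            # Start from the defaults, layer the nested Options class on
            # top, then let explicit keyword arguments win (assumed order).
            self.options = _public_attrs(self.DefaultOptions)
            self.options.update(_public_attrs(self.Options))
            self.options.update(options)


    class AtxConverter(ToyConverter):
        # A subclass sets its own defaults through a nested Options class.
        class Options:
            heading_style = 'atx'


    print(AtxConverter().options['heading_style'])                        # atx
    print(AtxConverter(heading_style='setext').options['heading_style'])  # setext
    print(AtxConverter().options['bullets'])                              # *+-
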
To convert an existing BeautifulSoup object instead of an HTML string, use
``convert_soup``:

.. code:: python

    from markdownify import MarkdownConverter

    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

If you have a special use case that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change.
The function that handles an HTML tag named ``abc`` is called
``convert_abc(self, el, text, convert_as_inline)`` and returns a string
containing the converted HTML tag.
The ``MarkdownConverter`` object will handle the conversion based on the
function names:

.. code:: python

    from markdownify import MarkdownConverter

    class ImageBlockConverter(MarkdownConverter):
        """
        Create a custom MarkdownConverter that adds two newlines after an image
        """
        def convert_img(self, el, text, convert_as_inline):
            return super().convert_img(el, text, convert_as_inline) + '\n\n'

    # Create shorthand method for conversion
    def md(html, **options):
        return ImageBlockConverter(**options).convert(html)


.. code:: python

    from markdownify import MarkdownConverter

    class IgnoreParagraphsConverter(MarkdownConverter):
        """
        Create a custom MarkdownConverter that ignores paragraphs
        """
        def convert_p(self, el, text, convert_as_inline):
            return ''

    # Create shorthand method for conversion
    def md(html, **options):
        return IgnoreParagraphsConverter(**options).convert(html)

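The idea of choosing a handler from the tag name can be pictured with a toy
class that needs nothing beyond the standard library. This is only a sketch of
the general technique, not markdownify's implementation: ``ToyDispatcher``,
``ShoutingDispatcher`` and ``convert_tag`` are hypothetical names, and the
handlers take just the text, whereas the real ``convert_abc`` methods take
``(self, el, text, convert_as_inline)``.

.. code:: python

    class ToyDispatcher:
        """Hypothetical illustration of name-based dispatch."""

        def convert_tag(self, tag_name, text):
            # Look for a method named convert_<tag>; if there is none,
            # pass the text through unchanged.
            handler = getattr(self, 'convert_' + tag_name, None)
            if handler is None:
                return text
            return handler(text)

        def convert_b(self, text):
            return '**' + text + '**'


    class ShoutingDispatcher(ToyDispatcher):
        # Overriding convert_b changes how <b> is rendered; no
        # registration step is needed because lookup is by method name.
        def convert_b(self, text):
            return '**' + text.upper() + '**'


    print(ToyDispatcher().convert_tag('b', 'Yay'))        # **Yay**
    print(ShoutingDispatcher().convert_tag('b', 'Yay'))   # **YAY**
    print(ToyDispatcher().convert_tag('span', 'plain'))   # plain
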
Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.

To run the tests and the linter, run ``pip install tox`` once, then ``tox``.