Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

readability-lxml

Package Overview
Dependencies
Maintainers
2
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

readability-lxml

fast html to text parser (article readability tool) with python 3 support

  • 0.8.1
  • PyPI
  • Socket score

Maintainers
2

.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master :target: https://travis-ci.org/buriy/python-readability

python-readability

Given a html document, it pulls out the main body text and cleans it up.

This is a python port of a ruby port of arc90's readability project <http://lab.arc90.com/experiments/readability/>__.

Installation

It's easy using pip, just run:

.. code-block:: bash

$ pip install readability-lxml

Usage

.. code-block:: python

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""

Change Log

  • 0.8.1 Fixed processing of non-ascii HTMLs via regexps.
  • 0.8 Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7 . Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added Videos loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is under the Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>__ license.

Thanks to

  • Latest readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>__
  • Ruby port by starrhorne and iterationlabs
  • Python port <https://github.com/gfxmonk/python-readability>__ by gfxmonk
  • Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml
  • "BR to P" fix from readability.js which improves quality for smaller texts
  • Github users contributions.

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc