html-text
+14
-0
@@ -5,2 +5,16 @@ ======= | ||
| 0.5.0 (2018-11-19) | ||
| ------------------ | ||
| The parsel dependency is removed in this release, | ||
| though parsel is still supported. | ||
| * ``parsel`` package is no longer required to install and use html-text; | ||
| * ``html_text.etree_to_text`` function allows extracting text from | ||
| lxml Elements; | ||
| * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance with | ||
| options tuned for text extraction speed and quality; | ||
| * test and documentation improvements; | ||
| * Python 3.7 support. | ||
| 0.4.1 (2018-09-25) | ||
@@ -7,0 +21,0 @@ ------------------ |
| Metadata-Version: 1.1 | ||
| Name: html-text | ||
| Version: 0.4.1 | ||
| Version: 0.5.0 | ||
| Summary: Extract text from HTML | ||
@@ -28,26 +28,16 @@ Home-page: https://github.com/TeamHG-Memex/html-text | ||
| * Free software: MIT license | ||
| How is html_text different from ``.xpath('//text()')`` from LXML | ||
| or ``.get_text()`` from Beautiful Soup? | ||
| Text extracted with ``html_text`` does not contain inline styles, | ||
| javascript, comments and other text that is not normally visible to the users. | ||
| It normalizes whitespace, but is also smarter than | ||
| ``.xpath('normalize-space())``, adding spaces around inline elements | ||
| (which are often used as block elements in html markup), | ||
| tries to avoid adding extra spaces for punctuation and | ||
| can add newlines so that the output text looks like how it is rendered in | ||
| browsers. | ||
| Apart from just getting text from the page (e.g. for display or search), | ||
| one intended usage of this library is for machine learning (feature extraction). | ||
| If you want to use the text of the html page as a feature (e.g. for classification), | ||
| this library gives you plain text that you can later feed into a standard text | ||
| classification pipeline. | ||
| If you feel that you need html structure as well, check out | ||
| `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library. | ||
| * Text extracted with ``html_text`` does not contain inline styles, | ||
| javascript, comments and other text that is not normally visible to users; | ||
| * ``html_text`` normalizes whitespace, but in a smarter way than | ||
| ``.xpath('normalize-space()')``, adding spaces around inline elements | ||
| (which are often used as block elements in html markup), and trying to | ||
| avoid adding extra spaces for punctuation; | ||
| * ``html-text`` can add newlines (e.g. after headers or paragraphs), so | ||
| that the output text looks more like how it is rendered in browsers. | ||
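The behaviour the bullets above describe can be imitated with a small stdlib-only sketch. This is not the html_text implementation — ``VisibleText`` and ``visible_text`` are hypothetical names used here only for illustration of the idea (skip invisible content, normalize whitespace):

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect only text that a browser would normally render."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Drop script/style content; collapse runs of whitespace.
        if not self._skip and data.strip():
            self.parts.append(' '.join(data.split()))

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    return ' '.join(parser.parts)
```

For example, ``visible_text('<h1>Hello<style>.foo{}</style></h1> world!')`` yields ``'Hello world!'`` — the stylesheet text is dropped and whitespace is normalized, which is the core of what html_text does (html_text additionally handles comments, layout newlines, and punctuation spacing).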
| Install | ||
@@ -60,3 +50,3 @@ ------- | ||
| The package depends on lxml, so you might need to install some additional | ||
| The package depends on lxml, so you might need to install additional | ||
| packages: http://lxml.de/installation.html | ||
@@ -77,6 +67,7 @@ | ||
| Passed html is first cleaned from invisible non-text content such | ||
| as styles, and then text is extracted. | ||
| You can also pass an already parsed ``lxml.html.HtmlElement``: | ||
| You can also pass already parsed ``lxml.html.HtmlElement``: | ||
| >>> import html_text | ||
@@ -87,5 +78,15 @@ >>> tree = html_text.parse_html('<h1>Hello</h1> world!') | ||
| Or define a selector to extract text only from specific elements: | ||
| If you want, you can handle cleaning manually; use lower-level | ||
| ``html_text.etree_to_text`` in this case: | ||
| >>> import html_text | ||
| >>> tree = html_text.parse_html('<h1>Hello<style>.foo{}</style>!</h1>') | ||
| >>> cleaned_tree = html_text.cleaner.clean_html(tree) | ||
| >>> html_text.etree_to_text(cleaned_tree) | ||
| 'Hello!' | ||
| parsel.Selector objects are also supported; you can define | ||
| a parsel.Selector to extract text only from specific elements: | ||
| >>> import html_text | ||
| >>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!') | ||
@@ -96,10 +97,14 @@ >>> subsel = sel.xpath('//h1') | ||
| Passed html will be first cleaned from invisible non-text content such | ||
| as styles, and then text would be extracted. | ||
| NB Selectors are not cleaned automatically you need to call | ||
| NB parsel.Selector objects are not cleaned automatically; you need to call | ||
| ``html_text.cleaned_selector`` first. | ||
| Main functions: | ||
| Main functions and objects: | ||
| * ``html_text.extract_text`` accepts html and returns extracted text. | ||
| * ``html_text.etree_to_text`` accepts a parsed lxml Element and returns | ||
| extracted text; it is a lower-level function which does not handle | ||
| cleaning. | ||
| * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which | ||
| can be used with ``html_text.etree_to_text``; its options are tuned for | ||
| speed and text extraction quality. | ||
| * ``html_text.cleaned_selector`` accepts html as text or as | ||
@@ -127,7 +132,10 @@ ``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``. | ||
| Credits | ||
| ------- | ||
| Apart from just getting text from the page (e.g. for display or search), | ||
| one intended usage of this library is for machine learning (feature extraction). | ||
| If you want to use the text of the html page as a feature (e.g. for classification), | ||
| this library gives you plain text that you can later feed into a standard text | ||
| classification pipeline. | ||
| If you feel that you need html structure as well, check out | ||
| `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library. | ||
| The code is extracted from utilities used in several projects, written by Mikhail Korobov. | ||
| ---- | ||
@@ -144,2 +152,16 @@ | ||
| 0.5.0 (2018-11-19) | ||
| ------------------ | ||
| The parsel dependency is removed in this release, | ||
| though parsel is still supported. | ||
| * ``parsel`` package is no longer required to install and use html-text; | ||
| * ``html_text.etree_to_text`` function allows extracting text from | ||
| lxml Elements; | ||
| * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance with | ||
| options tuned for text extraction speed and quality; | ||
| * test and documentation improvements; | ||
| * Python 3.7 support. | ||
| 0.4.1 (2018-09-25) | ||
@@ -199,3 +221,3 @@ ------------------ | ||
| Platform: UNKNOWN | ||
| Classifier: Development Status :: 3 - Alpha | ||
| Classifier: Development Status :: 4 - Beta | ||
| Classifier: Intended Audience :: Developers | ||
@@ -209,1 +231,2 @@ Classifier: License :: OSI Approved :: MIT License | ||
| Classifier: Programming Language :: Python :: 3.6 | ||
| Classifier: Programming Language :: Python :: 3.7 |
| lxml | ||
| parsel |
| # -*- coding: utf-8 -*- | ||
| __version__ = '0.4.1' | ||
| __version__ = '0.5.0' | ||
| from .html_text import (extract_text, parse_html, cleaned_selector, | ||
| selector_to_text, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS) | ||
| from .html_text import (etree_to_text, extract_text, selector_to_text, | ||
| parse_html, cleaned_selector, cleaner, | ||
| NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS) |
+27
-15
@@ -7,4 +7,2 @@ # -*- coding: utf-8 -*- | ||
| from lxml.html.clean import Cleaner | ||
| import parsel | ||
| from parsel.selector import create_root_node | ||
@@ -22,3 +20,3 @@ | ||
| _clean_html = Cleaner( | ||
| cleaner = Cleaner( | ||
| scripts=True, | ||
@@ -38,3 +36,3 @@ javascript=False, # onclick attributes are fine | ||
| safe_attrs_only=False, | ||
| ).clean_html | ||
| ) | ||
@@ -47,3 +45,3 @@ | ||
| tree = parse_html(html) | ||
| return _clean_html(tree) | ||
| return cleaner.clean_html(tree) | ||
@@ -53,4 +51,10 @@ | ||
| """ Create an lxml.html.HtmlElement from a string with html. | ||
| XXX: mostly copy-pasted from parsel.selector.create_root_node | ||
| """ | ||
| return create_root_node(html, lxml.html.HTMLParser) | ||
| body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>' | ||
| parser = lxml.html.HTMLParser(recover=True, encoding='utf8') | ||
| root = lxml.etree.fromstring(body, parser=parser) | ||
| if root is None: | ||
| root = lxml.etree.fromstring(b'<html/>', parser=parser) | ||
| return root | ||
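The input sanitization that the new ``parse_html`` body performs before handing the bytes to lxml can be isolated into a short sketch (``prepare_body`` is a hypothetical helper name, not part of html_text):

```python
def prepare_body(html):
    # Mirror of parse_html's preprocessing (sketch): strip surrounding
    # whitespace, drop NUL characters (lxml refuses to parse them),
    # encode to UTF-8, and fall back to a minimal document when the
    # input is empty so the parser always gets something valid.
    body = html.strip().replace('\x00', '').encode('utf8')
    return body or b'<html/>'
```

The ``or b'<html/>'`` fallback, together with the second ``fromstring`` call in ``parse_html``, guarantees a non-None root element even for empty or unparseable input.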
@@ -68,3 +72,3 @@ | ||
| def _html_to_text(tree, | ||
| def etree_to_text(tree, | ||
| guess_punct_space=True, | ||
@@ -75,5 +79,8 @@ guess_layout=True, | ||
| """ | ||
| Convert a cleaned html tree to text. | ||
| See html_text.extract_text docstring for description of the approach | ||
| and options. | ||
| Convert a html tree to text. Tree should be cleaned with | ||
| ``html_text.html_text.cleaner.clean_html`` before passing to this | ||
| function. | ||
| See html_text.extract_text docstring for description of the | ||
| approach and options. | ||
| """ | ||
@@ -141,6 +148,7 @@ chunks = [] | ||
| def selector_to_text(sel, guess_punct_space=True, guess_layout=True): | ||
| """ Convert a cleaned selector to text. | ||
| """ Convert a cleaned parsel.Selector to text. | ||
| See html_text.extract_text docstring for description of the approach | ||
| and options. | ||
| """ | ||
| import parsel | ||
| if isinstance(sel, parsel.SelectorList): | ||
@@ -150,3 +158,3 @@ # if selecting a specific xpath | ||
| for s in sel: | ||
| extracted = _html_to_text( | ||
| extracted = etree_to_text( | ||
| s.root, | ||
@@ -159,3 +167,3 @@ guess_punct_space=guess_punct_space, | ||
| else: | ||
| return _html_to_text( | ||
| return etree_to_text( | ||
| sel.root, | ||
@@ -167,4 +175,5 @@ guess_punct_space=guess_punct_space, | ||
| def cleaned_selector(html): | ||
| """ Clean selector. | ||
| """ Clean parsel.selector. | ||
| """ | ||
| import parsel | ||
| try: | ||
@@ -197,2 +206,5 @@ tree = _cleaned_html_tree(html) | ||
| ``html_text.etree_to_text`` is a lower-level function which only accepts | ||
| an already parsed lxml.html Element, and does not do html cleaning itself. | ||
| When guess_punct_space is True (default), no extra whitespace is added | ||
@@ -213,3 +225,3 @@ for punctuation. This has a slight (around 10%) performance overhead | ||
| cleaned = _cleaned_html_tree(html) | ||
| return _html_to_text( | ||
| return etree_to_text( | ||
| cleaned, | ||
@@ -216,0 +228,0 @@ guess_punct_space=guess_punct_space, |
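The ``guess_punct_space`` idea — avoid inserting a space before punctuation when joining text chunks — can be sketched in isolation. ``join_chunks`` is a hypothetical name for illustration; the real heuristic in html_text is regex-based and also covers opening brackets:

```python
def join_chunks(chunks, guess_punct_space=True):
    # Join text chunks with single spaces, but skip the space before a
    # chunk that starts with trailing punctuation, so that markup like
    # '<a>Hello</a>, world' comes out as 'Hello, world'.
    out = []
    for chunk in chunks:
        if out and not (guess_punct_space and chunk[0] in ',.!?;:)'):
            out.append(' ')
        out.append(chunk)
    return ''.join(out)
```

With the option disabled, every chunk boundary gets a space, which is cheaper (the docstring above mentions roughly 10% overhead for the heuristic) but produces output like ``'Hello , world'``.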
+55
-32
| Metadata-Version: 1.1 | ||
| Name: html_text | ||
| Version: 0.4.1 | ||
| Version: 0.5.0 | ||
| Summary: Extract text from HTML | ||
@@ -28,26 +28,16 @@ Home-page: https://github.com/TeamHG-Memex/html-text | ||
| * Free software: MIT license | ||
| How is html_text different from ``.xpath('//text()')`` from LXML | ||
| or ``.get_text()`` from Beautiful Soup? | ||
| Text extracted with ``html_text`` does not contain inline styles, | ||
| javascript, comments and other text that is not normally visible to the users. | ||
| It normalizes whitespace, but is also smarter than | ||
| ``.xpath('normalize-space())``, adding spaces around inline elements | ||
| (which are often used as block elements in html markup), | ||
| tries to avoid adding extra spaces for punctuation and | ||
| can add newlines so that the output text looks like how it is rendered in | ||
| browsers. | ||
| Apart from just getting text from the page (e.g. for display or search), | ||
| one intended usage of this library is for machine learning (feature extraction). | ||
| If you want to use the text of the html page as a feature (e.g. for classification), | ||
| this library gives you plain text that you can later feed into a standard text | ||
| classification pipeline. | ||
| If you feel that you need html structure as well, check out | ||
| `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library. | ||
| * Text extracted with ``html_text`` does not contain inline styles, | ||
| javascript, comments and other text that is not normally visible to users; | ||
| * ``html_text`` normalizes whitespace, but in a smarter way than | ||
| ``.xpath('normalize-space()')``, adding spaces around inline elements | ||
| (which are often used as block elements in html markup), and trying to | ||
| avoid adding extra spaces for punctuation; | ||
| * ``html-text`` can add newlines (e.g. after headers or paragraphs), so | ||
| that the output text looks more like how it is rendered in browsers. | ||
| Install | ||
@@ -60,3 +50,3 @@ ------- | ||
| The package depends on lxml, so you might need to install some additional | ||
| The package depends on lxml, so you might need to install additional | ||
| packages: http://lxml.de/installation.html | ||
@@ -77,6 +67,7 @@ | ||
| Passed html is first cleaned from invisible non-text content such | ||
| as styles, and then text is extracted. | ||
| You can also pass an already parsed ``lxml.html.HtmlElement``: | ||
| You can also pass already parsed ``lxml.html.HtmlElement``: | ||
| >>> import html_text | ||
@@ -87,5 +78,15 @@ >>> tree = html_text.parse_html('<h1>Hello</h1> world!') | ||
| Or define a selector to extract text only from specific elements: | ||
| If you want, you can handle cleaning manually; use lower-level | ||
| ``html_text.etree_to_text`` in this case: | ||
| >>> import html_text | ||
| >>> tree = html_text.parse_html('<h1>Hello<style>.foo{}</style>!</h1>') | ||
| >>> cleaned_tree = html_text.cleaner.clean_html(tree) | ||
| >>> html_text.etree_to_text(cleaned_tree) | ||
| 'Hello!' | ||
| parsel.Selector objects are also supported; you can define | ||
| a parsel.Selector to extract text only from specific elements: | ||
| >>> import html_text | ||
| >>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!') | ||
@@ -96,10 +97,14 @@ >>> subsel = sel.xpath('//h1') | ||
| Passed html will be first cleaned from invisible non-text content such | ||
| as styles, and then text would be extracted. | ||
| NB Selectors are not cleaned automatically you need to call | ||
| NB parsel.Selector objects are not cleaned automatically; you need to call | ||
| ``html_text.cleaned_selector`` first. | ||
| Main functions: | ||
| Main functions and objects: | ||
| * ``html_text.extract_text`` accepts html and returns extracted text. | ||
| * ``html_text.etree_to_text`` accepts a parsed lxml Element and returns | ||
| extracted text; it is a lower-level function which does not handle | ||
| cleaning. | ||
| * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which | ||
| can be used with ``html_text.etree_to_text``; its options are tuned for | ||
| speed and text extraction quality. | ||
| * ``html_text.cleaned_selector`` accepts html as text or as | ||
@@ -127,7 +132,10 @@ ``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``. | ||
| Credits | ||
| ------- | ||
| Apart from just getting text from the page (e.g. for display or search), | ||
| one intended usage of this library is for machine learning (feature extraction). | ||
| If you want to use the text of the html page as a feature (e.g. for classification), | ||
| this library gives you plain text that you can later feed into a standard text | ||
| classification pipeline. | ||
| If you feel that you need html structure as well, check out | ||
| `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library. | ||
| The code is extracted from utilities used in several projects, written by Mikhail Korobov. | ||
| ---- | ||
@@ -144,2 +152,16 @@ | ||
| 0.5.0 (2018-11-19) | ||
| ------------------ | ||
| The parsel dependency is removed in this release, | ||
| though parsel is still supported. | ||
| * ``parsel`` package is no longer required to install and use html-text; | ||
| * ``html_text.etree_to_text`` function allows extracting text from | ||
| lxml Elements; | ||
| * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance with | ||
| options tuned for text extraction speed and quality; | ||
| * test and documentation improvements; | ||
| * Python 3.7 support. | ||
| 0.4.1 (2018-09-25) | ||
@@ -199,3 +221,3 @@ ------------------ | ||
| Platform: UNKNOWN | ||
| Classifier: Development Status :: 3 - Alpha | ||
| Classifier: Development Status :: 4 - Beta | ||
| Classifier: Intended Audience :: Developers | ||
@@ -209,1 +231,2 @@ Classifier: License :: OSI Approved :: MIT License | ||
| Classifier: Programming Language :: Python :: 3.6 | ||
| Classifier: Programming Language :: Python :: 3.7 |
+38
-30
@@ -20,26 +20,16 @@ ============ | ||
| * Free software: MIT license | ||
| How is html_text different from ``.xpath('//text()')`` from LXML | ||
| or ``.get_text()`` from Beautiful Soup? | ||
| Text extracted with ``html_text`` does not contain inline styles, | ||
| javascript, comments and other text that is not normally visible to the users. | ||
| It normalizes whitespace, but is also smarter than | ||
| ``.xpath('normalize-space())``, adding spaces around inline elements | ||
| (which are often used as block elements in html markup), | ||
| tries to avoid adding extra spaces for punctuation and | ||
| can add newlines so that the output text looks like how it is rendered in | ||
| browsers. | ||
| Apart from just getting text from the page (e.g. for display or search), | ||
| one intended usage of this library is for machine learning (feature extraction). | ||
| If you want to use the text of the html page as a feature (e.g. for classification), | ||
| this library gives you plain text that you can later feed into a standard text | ||
| classification pipeline. | ||
| If you feel that you need html structure as well, check out | ||
| `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library. | ||
| * Text extracted with ``html_text`` does not contain inline styles, | ||
| javascript, comments and other text that is not normally visible to users; | ||
| * ``html_text`` normalizes whitespace, but in a smarter way than | ||
| ``.xpath('normalize-space()')``, adding spaces around inline elements | ||
| (which are often used as block elements in html markup), and trying to | ||
| avoid adding extra spaces for punctuation; | ||
| * ``html-text`` can add newlines (e.g. after headers or paragraphs), so | ||
| that the output text looks more like how it is rendered in browsers. | ||
| Install | ||
@@ -52,3 +42,3 @@ ------- | ||
| The package depends on lxml, so you might need to install some additional | ||
| The package depends on lxml, so you might need to install additional | ||
| packages: http://lxml.de/installation.html | ||
@@ -69,6 +59,7 @@ | ||
| Passed html is first cleaned from invisible non-text content such | ||
| as styles, and then text is extracted. | ||
| You can also pass an already parsed ``lxml.html.HtmlElement``: | ||
| You can also pass already parsed ``lxml.html.HtmlElement``: | ||
| >>> import html_text | ||
@@ -79,5 +70,15 @@ >>> tree = html_text.parse_html('<h1>Hello</h1> world!') | ||
| Or define a selector to extract text only from specific elements: | ||
| If you want, you can handle cleaning manually; use lower-level | ||
| ``html_text.etree_to_text`` in this case: | ||
| >>> import html_text | ||
| >>> tree = html_text.parse_html('<h1>Hello<style>.foo{}</style>!</h1>') | ||
| >>> cleaned_tree = html_text.cleaner.clean_html(tree) | ||
| >>> html_text.etree_to_text(cleaned_tree) | ||
| 'Hello!' | ||
| parsel.Selector objects are also supported; you can define | ||
| a parsel.Selector to extract text only from specific elements: | ||
| >>> import html_text | ||
| >>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!') | ||
@@ -88,10 +89,14 @@ >>> subsel = sel.xpath('//h1') | ||
| Passed html will be first cleaned from invisible non-text content such | ||
| as styles, and then text would be extracted. | ||
| NB Selectors are not cleaned automatically you need to call | ||
| NB parsel.Selector objects are not cleaned automatically; you need to call | ||
| ``html_text.cleaned_selector`` first. | ||
| Main functions: | ||
| Main functions and objects: | ||
| * ``html_text.extract_text`` accepts html and returns extracted text. | ||
| * ``html_text.etree_to_text`` accepts a parsed lxml Element and returns | ||
| extracted text; it is a lower-level function which does not handle | ||
| cleaning. | ||
| * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which | ||
| can be used with ``html_text.etree_to_text``; its options are tuned for | ||
| speed and text extraction quality. | ||
| * ``html_text.cleaned_selector`` accepts html as text or as | ||
@@ -119,7 +124,10 @@ ``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``. | ||
| Credits | ||
| ------- | ||
| Apart from just getting text from the page (e.g. for display or search), | ||
| one intended usage of this library is for machine learning (feature extraction). | ||
| If you want to use the text of the html page as a feature (e.g. for classification), | ||
| this library gives you plain text that you can later feed into a standard text | ||
| classification pipeline. | ||
| If you feel that you need html structure as well, check out | ||
| `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library. | ||
| The code is extracted from utilities used in several projects, written by Mikhail Korobov. | ||
| ---- | ||
@@ -126,0 +134,0 @@ |
+1
-1
| [bumpversion] | ||
| current_version = 0.4.1 | ||
| current_version = 0.5.0 | ||
| commit = True | ||
@@ -4,0 +4,0 @@ tag = True |
+5
-12
@@ -12,14 +12,6 @@ #!/usr/bin/env python | ||
| requirements = [ | ||
| 'lxml', | ||
| 'parsel', | ||
| ] | ||
| test_requirements = [ | ||
| 'pytest', | ||
| ] | ||
| setup( | ||
| name='html_text', | ||
| version='0.4.1', | ||
| version='0.5.0', | ||
| description="Extract text from HTML", | ||
@@ -32,7 +24,7 @@ long_description=readme + '\n\n' + history, | ||
| include_package_data=True, | ||
| install_requires=requirements, | ||
| install_requires=['lxml'], | ||
| license="MIT license", | ||
| zip_safe=False, | ||
| classifiers=[ | ||
| 'Development Status :: 3 - Alpha', | ||
| 'Development Status :: 4 - Beta', | ||
| 'Intended Audience :: Developers', | ||
@@ -46,5 +38,6 @@ 'License :: OSI Approved :: MIT License', | ||
| 'Programming Language :: Python :: 3.6', | ||
| 'Programming Language :: Python :: 3.7', | ||
| ], | ||
| test_suite='tests', | ||
| tests_require=test_requirements | ||
| tests_require=['pytest'], | ||
| ) |
+46
-11
| # -*- coding: utf-8 -*- | ||
| import pytest | ||
| import glob | ||
| import os | ||
| import six | ||
| import pytest | ||
| from html_text import (extract_text, parse_html, cleaned_selector, | ||
| selector_to_text, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS) | ||
| etree_to_text, cleaner, selector_to_text, NEWLINE_TAGS, | ||
| DOUBLE_NEWLINE_TAGS) | ||
| ROOT = os.path.dirname(os.path.abspath(__file__)) | ||
| @pytest.fixture(params=[ | ||
@@ -45,2 +52,6 @@ {'guess_punct_space': True, 'guess_layout': False}, | ||
| def test_comment(all_options): | ||
| assert extract_text(u"<!-- hello world -->", **all_options) == '' | ||
| def test_extract_text_from_tree(all_options): | ||
@@ -94,2 +105,3 @@ html = (u'<html><style>.div {}</style>' | ||
| def test_selectors(all_options): | ||
| pytest.importorskip("parsel") | ||
| html = (u'<span><span id="extract-me">text<a>more</a>' | ||
@@ -112,2 +124,10 @@ '</span>and more text <a> and some more</a> <a></a> </span>') | ||
| def test_nbsp(): | ||
| if six.PY2: | ||
| raise pytest.xfail(" produces '\xa0' in Python 2, " | ||
| "but ' ' in Python 3") | ||
| html = "<h1>Foo Bar</h1>" | ||
| assert extract_text(html) == "Foo Bar" | ||
| def test_guess_layout(): | ||
@@ -155,10 +175,25 @@ html = (u'<title> title </title><div>text_1.<p>text_2 text_3</p>' | ||
| def test_webpages(): | ||
| webpages = sorted(glob.glob('./test_webpages/*.html')) | ||
| extracted = sorted(glob.glob('./test_webpages/*.txt')) | ||
| for page, extr in zip(webpages, extracted): | ||
| with open(page, 'r', encoding='utf8') as f_in: | ||
| html = f_in.read() | ||
| with open(extr, 'r', encoding='utf8') as f_in: | ||
| expected = f_in.read() | ||
| assert extract_text(html) == expected | ||
| def _webpage_paths(): | ||
| webpages = sorted(glob.glob(os.path.join(ROOT, 'test_webpages', '*.html'))) | ||
| extracted = sorted(glob.glob(os.path.join(ROOT, 'test_webpages', '*.txt'))) | ||
| return list(zip(webpages, extracted)) | ||
| def _load_file(path): | ||
| with open(path, 'rb') as f: | ||
| return f.read().decode('utf8') | ||
| @pytest.mark.parametrize(['page', 'extracted'], _webpage_paths()) | ||
| def test_webpages(page, extracted): | ||
| html = _load_file(page) | ||
| if not six.PY3: | ||
| # FIXME: &nbsp; produces '\xa0' in Python 2, but ' ' in Python 3 | ||
| # this difference is ignored in this test. | ||
| # What is the correct behavior? | ||
| html = html.replace('&nbsp;', ' ') | ||
| expected = _load_file(extracted) | ||
| assert extract_text(html) == expected | ||
| tree = cleaner.clean_html(parse_html(html)) | ||
| assert etree_to_text(tree) == expected |