html-table-takeout

HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies.

Version: 1.1.1
Maintainers: 1

HTML Table Takeout

A fast, lightweight HTML table parser that supports rowspan, colspan, links and nested tables. No external dependencies are needed.

The input may be HTML text, a URL, or a local file Path.

HTML5 logo by W3C.

Quick Start

Install the package:

pip install html-table-takeout

Pass in a URL and print out the parsed Table as CSV:

from html_table_takeout import parse_html

# start with http:// or https:// to source from a URL
tables = parse_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

print(tables[0].to_csv())

# output:
# Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
# MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902
# ...
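The CSV output above contains a quoted field ("Saint Paul, Minnesota") with an embedded comma, so splitting lines on "," would break that row. A minimal sketch of reading the text back with the standard library csv module follows. It assumes to_csv() returns a plain string, as the print() call suggests. The sample text is copied from the output shown above, and csv_to_records is a hypothetical helper name, not part of the package.

```python
import csv
import io

# Sample text copied from the output shown above. In practice this would be
# the string returned by tables[0].to_csv() (assumed to be a str).
csv_text = (
    "Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded\n"
    'MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902\n'
)


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per data row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


records = csv_to_records(csv_text)
print(records[0]["Headquarters Location"])  # Saint Paul, Minnesota
print(records[0]["CIK"])  # 0000066740
```

Every value stays a string, which preserves the leading zeros in the CIK column.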

Pass in HTML text and print out the parsed Table as valid HTML:

from html_table_takeout import parse_html

tables = parse_html("""
<table>
    <tr>
        <td rowspan='2'>1</td> <!-- rowspan will be expanded -->
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
</table>""")

print(tables[0].to_html(indent=4))

# output:
# <table data-table-id='0'>
# <tbody>
#     <tr>
#         <td>1</td>
#         <td>2</td>
#     </tr>
#     <tr>
#         <td>1</td>
#         <td>3</td>
#     </tr>
# </tbody>
# </table>
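To illustrate what "rowspan will be expanded" means in the example above, here is a small standalone sketch of span expansion on a grid. It is not the package's implementation. Cell and expand_spans are hypothetical names introduced only for this illustration, and overlapping spans in malformed tables are not handled specially.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical stand-in for a <td>/<th> with its span attributes."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[str]]:
    """Copy each spanning cell's text into every grid slot it covers."""
    filled: dict[tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip slots already taken by a rowspan from an earlier row.
            while (r, c) in filled:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    filled[(r + dr, c + dc)] = cell.text
            c += cell.colspan
    # Emit only the rows present in the input; spans running past the
    # last row are clipped.
    grid = []
    for r in range(len(rows)):
        cols = sorted(col for (row_idx, col) in filled if row_idx == r)
        grid.append([filled[(r, col)] for col in cols])
    return grid


# Same shape as the HTML example above: "1" spans two rows.
print(expand_spans([[Cell("1", rowspan=2), Cell("2")], [Cell("3")]]))
# [['1', '2'], ['1', '3']]
```

The result matches the expanded rows in the HTML output above: the spanning cell's content is repeated in the second row.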

Usage

The core parse_html() function returns a list of zero or more top-level Table objects. Each Table is guaranteed to have this structure:

  • rows: List of one or more TRow
    • cells: List of zero or more TCell resulting from rowspan and colspan expansion
      • elements: List of zero or more TText, TLink, TRef
Type    Description
Table   Each parsed table has an auto-assigned unique id
TRow    Equal to each <tr> in the original table
TCell   Expanded <td> or <th> cells from row/colspan
TText   HTML-decoded text inside <td> or <th>
TLink   Equal to each <a> inside <td> or <th>
TRef    Reference to the child Table

All tables are guaranteed to have at least one TRow containing one TCell.
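The nesting described above (Table, rows, cells, elements, with TRef pointing at a child Table) can be walked recursively. The sketch below uses stand-in dataclasses that only mirror the documented shape, so it runs without the package installed. The names rows, cells and elements come from the structure above. Every other field (id, text, href, table) and both helper functions are assumptions made for illustration, so check the real classes before relying on them.

```python
from dataclasses import dataclass, field
from typing import Union


# Stand-in classes mirroring the documented shape only. Field names other
# than rows, cells and elements are hypothetical.
@dataclass
class TText:
    text: str  # hypothetical field name


@dataclass
class TLink:
    href: str  # hypothetical field name
    text: str = ""  # hypothetical field name


@dataclass
class TRef:
    table: "Table"  # hypothetical field name


@dataclass
class TCell:
    elements: list[Union[TText, TLink, TRef]] = field(default_factory=list)


@dataclass
class TRow:
    cells: list[TCell] = field(default_factory=list)


@dataclass
class Table:
    id: int  # hypothetical attribute name for the auto-assigned id
    rows: list[TRow] = field(default_factory=list)


def cell_text(cell: TCell) -> str:
    """Render one cell's elements as a single plain-text string."""
    parts = []
    for el in cell.elements:
        if isinstance(el, TText):
            parts.append(el.text)
        elif isinstance(el, TLink):
            parts.append(f"{el.text} <{el.href}>")
        elif isinstance(el, TRef):
            parts.append(f"[table {el.table.id}]")
    return " ".join(parts)


def count_tables(table: Table) -> int:
    """Count a table plus every table nested beneath it via TRef elements."""
    total = 1
    for row in table.rows:
        for cell in row.cells:
            for el in cell.elements:
                if isinstance(el, TRef):
                    total += count_tables(el.table)
    return total


inner = Table(1, [TRow([TCell([TText("inner")])])])
outer = Table(0, [TRow([
    TCell([TText("see"), TLink("https://example.com", "docs")]),
    TCell([TRef(inner)]),
])])

print(cell_text(outer.rows[0].cells[0]))  # see docs <https://example.com>
print(count_tables(outer))  # 2
```

Keeping nested tables behind a reference element, rather than inlining them, lets the caller decide whether to recurse into child tables or treat them as opaque cells.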

The parse_html() function also provides filtering by text or attributes to target the tables you want. Check out its docstring for all options.

Why did you make this?

Most HTML table parsers require extra DOM and data processing libraries that aren't needed for my application. I need a parser that handles nesting and gives me the flexibility to process the parsed result however I want.

Now you too can take out tables to go.

Developing

Install development dependencies:

pip install build mypy pytest

Run tests:

pytest

Build the package:

python -m build
