
Research
PyPI Package Disguised as Instagram Growth Tool Harvests User Credentials
A deceptive PyPI package posing as an Instagram growth tool collects user credentials and sends them to third-party bot services.
HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies.
A fast, lightweight HTML table parser that supports rowspan, colspan, links and nested tables. No external dependencies are needed.
The input may be text, a URL or local file Path
.
HTML5 logo by W3C.
Install the package:
pip install html-table-takeout
Pass in a URL and print out the parsed Table
as CSV:
from html_table_takeout import parse_html
# start with http:// or https:// to source from a URL
tables = parse_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
print(tables[0].to_csv())
# output:
# Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
# MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902
# ...
Pass in HTML text and print out the parsed Table
as valid HTML:
from html_table_takeout import parse_html
tables = parse_html("""
<table>
<tr>
<td rowspan='2'>1</td> <!-- rowspan will be expanded -->
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
</table>""")
print(tables[0].to_html(indent=4))
# output:
# <table data-table-id='0'>
# <tbody>
# <tr>
# <td>1</td>
# <td>2</td>
# </tr>
# <tr>
# <td>1</td>
# <td>3</td>
# </tr>
# </tbody>
# </table>
The core parse_html()
function returns a list of zero or more top-level Table
. A Table
is guaranteed to have this structure:
TRow
TCell
resulting from rowspan and colspan expansion
TText
, TLink
, TRef
Type | Description |
---|---|
Table | Each parsed table has an auto-assigned unique id |
TRow | Equal to each <tr> in the original table |
TCell | Expanded <td> or <th> cells from row/colspan |
TText | HTML-decoded text inside <td> or <th> |
TLink | Equal to each <a> inside <td> or <th> |
TRef | Reference to the child Table |
All tables are guaranteed to have at least one TRow
containing one TCell
.
The parse_html()
function also provides filtering by text or attributes to target the tables you want. Check out its docstring for all options.
Most HTML table parsers require extra DOM and data processing libraries that aren't needed for my application. I need a parser that handles nesting and gives me the flexibility to process the parsed result however I want.
Now you too can take out tables to go.
Install development dependencies:
pip install build mypy pytest
Run tests:
pytest
Build the package:
python -m build
FAQs
HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies.
We found that html-table-takeout demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
A deceptive PyPI package posing as an Instagram growth tool collects user credentials and sends them to third-party bot services.
Product
Socket now supports pylock.toml, enabling secure, reproducible Python builds with advanced scanning and full alignment with PEP 751's new standard.
Security News
Research
Socket uncovered two npm packages that register hidden HTTP endpoints to delete all files on command.