This Python 3 library extracts the information represented in any HTML table. This project has been developed in the context of the paper TOMATE: On extracting information from HTML tables.
Some of the main features of this library are:
Some features that will be added soon:
You can install this library via pip using:
pip install tablextract
>>> from pprint import pprint
>>> from tablextract import tables
>>>
>>> ts = tables('https://en.wikipedia.org/wiki/Fiji')
>>> ts
[
Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[2]),
Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[3]),
Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[4])
]
>>> ts[0].record
[
{'Confederacy': 'Burebasaga', 'Chief': 'Ro Teimumu Vuikaba Kepa'},
{'Confederacy': 'Kubuna', 'Chief': 'Vacant'},
{'Confederacy': 'Tovata', 'Chief': 'Ratu Naiqama Tawake Lalabalavu'}
]
>>> ts[2].record # it automatically identifies that it's laid out vertically
[
{
'English': 'Hello/hi',
'Fijian': 'bula',
'Fiji Hindi': 'नमस्ते (namaste)'
}, {
'English': 'Good morning',
'Fijian': 'yadra (Pronounced Yandra)',
'Fiji Hindi': 'सुप्रभात (suprabhat)'
}, {
'English': 'Goodbye',
'Fijian': 'moce (Pronounced Mothe)',
'Fiji Hindi': 'अलविदा (alavidā)'
}
]
This library has a single function, tables, which returns a list of Table objects.
tables(url, css_filter='table', xpath_filter=None, request_cache_time=None, add_link_urls=False, normalization='min-max-global', clustering_features=['style', 'syntax', 'structural', 'semantic'], dimensionality_reduction='off', clustering_method='k-means')
url: str: URL of the site the tables should be downloaded from.
css_filter: str: Return only tables that match the CSS selector.
xpath_filter: str: Return only tables that match the XPath.
request_cache_time: int: Cache the downloaded documents for that number of seconds.
add_image_text: bool: Extract the image title/alt/URL as part of the cell text in Table.texts.
add_link_urls: bool: Extract the link URLs as part of the cell text in Table.texts.
text_metadata_dict: dict: Dictionary of cell texts and the likelihood of each being metadata. See the Meta-data probability corpus.
normalization: str: The kind of normalization applied to the features. Allowed values are min-max-global to use MinMax normalization with values obtained from a large corpus of tables after removing outliers, min-max-local to use MinMax normalization with the minimum and maximum values of each feature in the table, standard to apply a Standard normalization, and softmax to apply a SoftMax normalization.
clustering_features: list: The clustering feature groups that are used to identify the cell functions. Any non-empty subset of 'style', 'syntax', 'structural' and 'semantic' is allowed.
dimensionality_reduction: The technique used to reduce the cell dimensionality before clustering. Allowed values are off to disable it, pca and feature-agglomeration.
clustering_method: The method used to cluster the cells. Allowed methods are k-means and agglomerative.
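For instance, a sketch of a call combining several of these parameters (the URL, the CSS selector and the chosen values are illustrative, not defaults):

from tablextract import tables

# Illustrative call: only extract tables matching a CSS selector, cache the
# downloaded page for an hour, and switch the normalization and clustering
# settings away from their defaults.
fiji_tables = tables(
    'https://en.wikipedia.org/wiki/Fiji',
    css_filter='table.wikitable',
    request_cache_time=3600,
    normalization='min-max-local',
    clustering_method='agglomerative'
)
for table in fiji_tables:
    print(table.xpath, table.kind)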
Each Table object has the following properties and methods:
cols(): int: Number of columns of the table.
rows(): int: Number of rows of the table.
cells(): int: Number of cells of the table (same as table.cols() * table.rows()).
error: str or None: If an error occurred during table extraction, it contains its stack trace. Otherwise, it is None.
url: str: URL of the page the table was extracted from.
xpath: str: XPath of the table within the page.
element: bs4.element.Tag: BeautifulSoup element that represents the table.
elements: list of list of bs4.element.Tag: 2D table of BeautifulSoup elements that represents the table after cell segmentation.
texts: list of list of str: 2D table of strings that represents the text of each cell.
context: dict of {tuple, str}: Texts inside or outside the table that provide contextual information for it. The keys of the dictionary represent the context position.
features: list of list of dict of {str, float/str}: 2D table of feature vectors for each cell in the table.
functions: list of list of int: 2D table of the functions of the cells of the table. Functions can be EMPTY (-1), DATA (0), or METADATA (1).
kind: str: Type of table extracted. Types can be 'horizontal listing', 'vertical listing', 'matrix', 'enumeration' or 'unknown'.
record: list of dict of {str, str}: Database-like records extracted from the table.
score: float: Estimation of how properly the table was extracted, between 0 and 1, where 1 means a perfect extraction.
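As an illustration, a minimal sketch of how these properties might be inspected once the tables have been extracted (the URL comes from the example above; the selection of printed fields is arbitrary):

from tablextract import tables

# Illustrative inspection of the extracted Table objects.
for table in tables('https://en.wikipedia.org/wiki/Fiji'):
    if table.error is not None:
        continue  # skip tables whose extraction raised an error
    print(table.xpath)
    print(table.kind, '-', table.rows(), 'rows x', table.cols(), 'columns')
    print('extraction score:', table.score)
    if table.record:
        print('first record:', table.record[0])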
If you update this library and you get the error sre_constants.error: bad escape \p at position 257, you might be using a corrupted environment. You can either re-download the spaCy language model (python3 -m spacy download en), or create and activate a fresh virtual environment (python3 -m venv my_new_env, then source my_new_env/bin/activate).
Release history:
Released on Mar 03, 2020.
Released on Feb 26, 2020.
Released on May 12, 2019.
Released on Mar 25, 2019.
Released on Feb 05, 2019.
Released on Jan 24, 2019: quit is called instead of close.
Released on Jan 22, 2019.