Consistent interface for stream reading and writing tabular data (csv/xls/json/etc)
A library for reading and writing tabular data (csv/xls/json/etc).
[Important Notice] We have released Frictionless Framework. This framework is a logical continuation of tabulator that was extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the Migration Guide from tabulator to Frictionless Framework.
- we continue to bug-fix tabulator@1.x in this repository, and it remains available on PyPI as before
- please note that the frictionless@3.x version's API, which we're working on at the moment, is not stable
- we will release frictionless@4.x by the end of 2020 as the first SemVer/stable version
$ pip install tabulator
Tabulator ships with a simple CLI called tabulator
to read tabular data. For
example:
$ tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
id,name
1,english
2,中国人
You can see all supported options by running tabulator --help.
from tabulator import Stream
with Stream('data.csv', headers=1) as stream:
stream.headers # [header1, header2, ..]
for row in stream:
print(row) # [value1, value2, ..]
You can find other examples in the examples directory.
In the following sections, we'll walk through some usage examples of this library. All examples were tested with Python 3.6, but should run fine with Python 3.3+.
The Stream
class represents a tabular stream. It takes the file path as the
source
argument. For example:
<scheme>://path/to/file.<format>
It uses this path to determine the file format (e.g. CSV or XLS) and scheme
(e.g. HTTP or postgresql). It also supports format extraction from URLs like http://example.com?format=csv
. If necessary, you can also define these explicitly.
Let's try it out. First, we create a Stream
object passing the path to a CSV file.
import tabulator
stream = tabulator.Stream('data.csv')
At this point, the file hasn't been read yet. Let's open the stream so we can read the contents.
try:
stream.open()
except tabulator.TabulatorException as e:
pass # Handle exception
This will open the underlying data stream, read a small sample to detect the
file encoding, and prepare the data to be read. We catch
tabulator.TabulatorException
here, in case something goes wrong.
We can now read the file contents. To iterate over each row, we do:
for row in stream.iter():
print(row) # [value1, value2, ...]
The stream.iter()
method will return each row data as a list of values. If
you prefer, you could call stream.iter(keyed=True)
instead, which returns a
dictionary with the column names as keys. Either way, this method keeps only a
single row in memory at a time. This means it can handle large files
without consuming too much memory.
If you want to read the entire file, use stream.read()
. It accepts the same
arguments as stream.iter()
, but returns all rows at once.
stream.reset()
rows = stream.read()
Notice that we called stream.reset()
before reading the rows. This is because
internally, tabulator only keeps a pointer to its current location in the file.
If we didn't reset this pointer, we would read starting from where we stopped.
For example, if we ran stream.read()
again, we would get an empty list, as
the internal file pointer is at the end of the file (because we've already read
it all). Depending on the file location, it might be necessary to download the
file again to rewind (e.g. when the file was loaded from the web).
After we're done, close the stream with:
stream.close()
The entire example looks like:
import tabulator
stream = tabulator.Stream('data.csv')
try:
stream.open()
except tabulator.TabulatorException as e:
pass # Handle exception
for row in stream.iter():
print(row) # [value1, value2, ...]
stream.reset() # Rewind internal file pointer
rows = stream.read()
stream.close()
It could be rewritten to use Python's context manager interface as:
import tabulator
try:
with tabulator.Stream('data.csv') as stream:
for row in stream.iter():
print(row)
stream.reset()
rows = stream.read()
except tabulator.TabulatorException as e:
pass
This is the preferred way, as Python closes the stream automatically, even if some exception was thrown along the way.
The full API documentation is available as docstrings in the Stream source code.
By default, tabulator considers that all file rows are values (i.e. there is no header).
with Stream([['name', 'age'], ['Alex', 21]]) as stream:
stream.headers # None
stream.read() # [['name', 'age'], ['Alex', 21]]
If you have a header row, you can use the headers argument with its row number (starting from 1).
# Integer
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
stream.headers # ['name', 'age']
stream.read() # [['Alex', 21]]
You can also pass a list of strings to define the headers explicitly:
with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
stream.headers # ['name', 'age']
stream.read() # [['Alex', 21]]
Tabulator also supports multiline headers for the xls
and xlsx
formats.
with Stream('data.xlsx', headers=[1, 3], fill_merged_cells=True) as stream:
stream.headers # ['header from row 1-3']
stream.read() # [['value1', 'value2', 'value3']]
You can specify the file encoding (e.g. utf-8
and latin1
) via the encoding
argument.
with Stream(source, encoding='latin1') as stream:
stream.read()
If this argument isn't set, Tabulator will try to infer it from the data. If you
get a UnicodeDecodeError
while loading a file, try setting the encoding to
utf-8
.
Tabulator supports both ZIP and GZIP compression methods. By default it'll infer from the file name:
with Stream('http://example.com/data.csv.zip') as stream:
stream.read()
You can also set it explicitly:
with Stream('data.csv.ext', compression='gz') as stream:
stream.read()
Options
The Stream
class raises tabulator.exceptions.FormatError
if it detects HTML
contents. This helps avoid the relatively common mistake of trying to load a
CSV file inside an HTML page, for example on GitHub.
You can disable this behaviour using the allow_html
option:
with Stream(source_with_html, allow_html=True) as stream:
stream.read() # no exception on open
To detect the file's headers, and to run other checks like validating that the file doesn't contain HTML, Tabulator reads a sample of rows when stream.open() is called. This data is available via the stream.sample property. The number of rows used can be defined via the sample_size parameter (defaults to 100).
with Stream(two_rows_source, sample_size=1) as stream:
stream.sample # only first row
stream.read() # first and second rows
You can disable this by setting sample_size
to zero. This way, no data will be
read on stream.open()
.
Tabulator needs to read a part of the file to infer its encoding. The
bytes_sample_size
argument controls how many bytes will be read for this
detection (defaults to 10000).
source = 'data/special/latin1.csv'
with Stream(source) as stream:
stream.encoding # 'iso8859-2'
You can disable this by setting bytes_sample_size
to zero, in which case it'll
use the machine locale's default encoding.
When True
, tabulator will ignore columns that have blank headers (defaults to
False
).
# Default behaviour
source = 'text://header1,,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1) as stream:
stream.headers # ['header1', '', 'header3']
stream.read(keyed=True) # {'header1': 'value1', '': 'value2', 'header3': 'value3'}
# Ignoring columns with blank headers
source = 'text://header1,,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_blank_headers=True) as stream:
stream.headers # ['header1', 'header3']
stream.read(keyed=True) # {'header1': 'value1', 'header3': 'value3'}
This option is similar to ignore_blank_headers. It removes arbitrary columns from the data based on the corresponding column names:
# Ignore listed headers (omit columns)
source = 'text://header1,header2,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_listed_headers=['header2']) as stream:
assert stream.headers == ['header1', 'header3']
assert stream.read(keyed=True) == [
{'header1': 'value1', 'header3': 'value3'},
]
# Ignore NOT listed headers (pick columns)
source = 'text://header1,header2,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_not_listed_headers=['header2']) as stream:
assert stream.headers == ['header2']
assert stream.read(keyed=True) == [
{'header2': 'value2'},
]
When True
, all rows' values will be converted to strings (defaults to
False
). None
values will be converted to empty strings.
# Default behaviour
with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 0)]]) as stream:
stream.read() # [['string', 1, datetime.datetime(2017, 12, 1, 17, 0)]]
# Forcing rows' values as strings
with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 0)]], force_strings=True) as stream:
stream.read() # [['string', '1', '2017-12-01 17:00:00']]
When True
, don't raise an exception when parsing a malformed row, but simply
return an empty row. Otherwise, tabulator raises
tabulator.exceptions.SourceError
when a row can't be parsed. Defaults to False
.
# Default behaviour
with Stream([[1], 'bad', [3]]) as stream:
stream.read() # raises tabulator.exceptions.SourceError
# With force_parse
with Stream([[1], 'bad', [3]], force_parse=True) as stream:
stream.read() # [[1], [], [3]]
List of row numbers and/or strings to skip. If it's a string, all rows that begin with it will be skipped (e.g. '#' and '//'). If it's the empty string, all rows that begin with an empty column will be skipped.
source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4], ['John', 5]]
with Stream(source, skip_rows=[1, 2, -1, '#']) as stream:
stream.read() # [['Mike', 4]]
If the headers parameter is also set to an integer, it will use the first not-skipped row as the headers.
source = [['#comment'], ['name', 'order'], ['John', 1], ['Alex', 2]]
with Stream(source, headers=1, skip_rows=['#']) as stream:
stream.headers # ['name', 'order']
stream.read() # [['John', 1], ['Alex', 2]]
List of functions that can filter or transform rows after they are parsed. These
functions receive the extended_rows
containing the row's number, headers
list, and the row values list. They then process the rows, and yield or discard
them, modified or not.
def skip_odd_rows(extended_rows):
for row_number, headers, row in extended_rows:
if not row_number % 2:
yield (row_number, headers, row)
def multiply_by_two(extended_rows):
for row_number, headers, row in extended_rows:
doubled_row = list(map(lambda value: value * 2, row))
yield (row_number, headers, doubled_row)
rows = [
[1],
[2],
[3],
[4],
]
with Stream(rows, post_parse=[skip_odd_rows, multiply_by_two]) as stream:
stream.read() # [[4], [8]]
These functions are applied in order, as a simple data pipeline. In the example
above, multiply_by_two
just sees the rows yielded by skip_odd_rows
.
The methods stream.iter()
and stream.read()
accept the keyed
and
extended
flag arguments to modify how the rows are returned.
By default, every row is returned as a list of its cell values:
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
stream.read() # [['Alex', 21]]
With keyed=True
, the rows are returned as dictionaries, mapping the column names to their values in the row:
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]
And with extended=True, the rows are returned as tuples of (row_number, headers, row), where row_number is the current row number (starting from 1), headers is a list with the header names, and row is a list with the row values:
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
stream.read(extended=True) # [(1, ['name', 'age'], ['Alex', 21])]
It loads data from AWS S3. For private files, you should provide credentials supported by the boto3 library, for example via the corresponding environment variables. Read more about configuring boto3.
stream = Stream('s3://bucket/data.csv')
Options
- s3_endpoint_url: the endpoint URL of the S3 API (defaults to https://s3.amazonaws.com). For complex use cases, for example goodtables's runs on a data package, this option can be provided as the environment variable S3_ENDPOINT_URL.

The default scheme, a file in the local filesystem.
stream = Stream('data.csv')
In Python 2,
tabulator
can't stream remote data sources because of a limitation in the underlying libraries. The whole data source will be loaded into memory. In Python 3 there is no such problem, and remote files are streamed.
stream = Stream('https://example.com/data.csv')
Options
- http_session: a requests.Session object. Read more in the requests docs.
- http_timeout: this timeout will be used for the requests session construction.

The source is a file-like Python object.
with open('data.csv') as fp:
stream = Stream(fp)
The source is a string containing the tabular data. Both scheme
and format
must be set explicitly, as it's not possible to infer them.
stream = Stream(
'name,age\nJohn, 21\n',
scheme='text',
format='csv'
)
In this section, we'll describe the supported file formats, and their respective configuration options and operations. Some formats only support read operations, while others support both reading and writing.
stream = Stream('data.csv', delimiter=',')
Options
It supports all options from the Python CSV library. Check their documentation for more information.
Tabulator is unable to stream xls files, so the entire file is loaded in memory. Streaming is supported for xlsx files.
stream = Stream('data.xls', sheet=1)
Options
- sheet: sheet name or number (starting from 1)
- workbook_cache: a dictionary that tabulator will fill with source: tmpfile_path pairs for remote workbooks. Each workbook will be downloaded only once, and all the temporary files will be deleted on process exit. Defaults to None.
- fill_merged_cells: if True, it will unmerge and fill all merged cells with the visible value. With this option enabled, the parser can't stream data and loads the whole document into memory.
- preserve_formatting: if True, it will try to preserve the text formatting of numeric and temporal cells, returning them as strings according to how they look in the spreadsheet (EXPERIMENTAL)
- adjust_floating_point_error: if True, it will correct the Excel behaviour regarding floating point numbers

This format is not included in the package by default. To use it, please install tabulator with the ods extra:
$ pip install tabulator[ods]
Source should be a valid Open Office document.
stream = Stream('data.ods', sheet=1)
Options
- sheet: sheet name or number (starting from 1)

A publicly-accessible Google Spreadsheet.
stream = Stream('https://docs.google.com/spreadsheets/d/<id>?usp=sharing')
stream = Stream('https://docs.google.com/spreadsheets/d/<id>edit#gid=<gid>')
Any database URL supported by sqlalchemy.
stream = Stream('postgresql://name:pass@host:5432/database', table='data')
Options
- table (required): database table name
- order_by: SQL expression for row ordering (e.g. name DESC)

This format is not included in the package by default. You can enable it by installing tabulator using pip install tabulator[datapackage].
stream = Stream('datapackage.json', resource=1)
Options
- resource: resource name or index
Either a list of lists, or a list of dicts mapping the column names to their respective values.
stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])
JSON document containing a list of lists, or a list of dicts mapping the column
names to their respective values (see the inline
format for an example).
stream = Stream('data.json', property='key1.key2')
Options
- property: a path to the property containing the tabular data. For example, considering the JSON {"response": {"data": [...]}}, the property should be set to response.data.

stream = Stream('data.ndjson')
stream = Stream('data.tsv')
This format is not included in the package by default. To use it, please install tabulator with the html extra:
$ pip install tabulator[html]
An HTML table element residing inside an HTML document.
Supports simple tables (no merged cells) with any legal combination of the td, th, tbody & thead elements.
Usually format='html' would need to be specified explicitly, as web URLs don't always use the .html extension.
stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1', raw_html=True)
Options
- selector: CSS selector for specifying which table element to extract. By default it's table, which takes the first table element in the document. If empty, the entire page is assumed to be the table to be extracted (useful with some Excel formats).
- raw_html: False (default) to extract the textual contents of each cell; True to return the inner HTML without modification.
Tabulator is written with extensibility in mind, allowing you to add support for new tabular file formats, schemes (e.g. ssh), and writers (e.g. MongoDB). There are three components that allow this:
In this section, we'll see how to write custom classes to extend any of these components.
You can add support for a new scheme (e.g. ssh) by creating a custom loader.
Custom loaders are implemented by inheriting from the Loader
class, and
implementing its methods. This loader can then be used by Stream
to load data
by passing it via the custom_loaders={'scheme': CustomLoader}
argument.
The skeleton of a custom loader looks like:
from tabulator import Loader
class CustomLoader(Loader):
options = []
def __init__(self, bytes_sample_size, **options):
pass
def load(self, source, mode='t', encoding=None):
# load logic
with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
stream.read()
You can see examples of how the loaders are implemented by looking in the
tabulator.loaders
module.
You can add support for a new file format by creating a custom parser. Similarly
to custom loaders, custom parsers are implemented by inheriting from the
Parser
class, and implementing its methods. This parser can then be used by
Stream
to parse data by passing it via the custom_parsers={'format': CustomParser}
argument.
The skeleton of a custom parser looks like:
from tabulator import Parser
class CustomParser(Parser):
options = []
def __init__(self, loader, force_parse, **options):
self.__loader = loader
def open(self, source, encoding=None):
# open logic
def close(self):
# close logic
def reset(self):
# reset logic
@property
def closed(self):
return False
@property
def extended_rows(self):
# extended rows logic
with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
stream.read()
You can see examples of how parsers are implemented by looking in the
tabulator.parsers
module.
You can add support to write files in a specific format by creating a custom
writer. The custom writers are implemented by inheriting from the base Writer
class, and implementing its methods. This writer can then be used by Stream
to
write data via the custom_writers={'format': CustomWriter}
argument.
The skeleton of a custom writer looks like:
from tabulator import Writer
class CustomWriter(Writer):
options = []
def __init__(self, **options):
pass
def write(self, source, target, headers=None, encoding=None):
# write logic
with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
stream.save(target)
You can see examples of how writers are implemented by looking in the
tabulator.writers
module.
cli
cli(source, limit, **options)
Command-line interface
Usage: tabulator [OPTIONS] SOURCE
Options:
--headers INTEGER
--scheme TEXT
--format TEXT
--encoding TEXT
--limit INTEGER
--sheet TEXT/INTEGER (excel)
--fill-merged-cells BOOLEAN (excel)
--preserve-formatting BOOLEAN (excel)
--adjust-floating-point-error BOOLEAN (excel)
--table TEXT (sql)
--order_by TEXT (sql)
--resource TEXT/INTEGER (datapackage)
--property TEXT (json)
--keyed BOOLEAN (json)
--version Show the version and exit.
--help Show this message and exit.
Stream
Stream(self,
source,
headers=None,
scheme=None,
format=None,
encoding=None,
compression=None,
allow_html=False,
sample_size=100,
bytes_sample_size=10000,
ignore_blank_headers=False,
ignore_listed_headers=None,
ignore_not_listed_headers=None,
multiline_headers_joiner=' ',
multiline_headers_duplicates=False,
force_strings=False,
force_parse=False,
pick_rows=None,
skip_rows=None,
pick_fields=None,
skip_fields=None,
pick_columns=None,
skip_columns=None,
post_parse=[],
custom_loaders={},
custom_parsers={},
custom_writers={},
**options)
Stream of tabular data.
This is the main tabulator
class. It loads a data source, and allows you
to stream its parsed contents.
Arguments
source (str):
Path to file as ``<scheme>://path/to/file.<format>``.
If not explicitly set, the scheme (file, http, ...) and
format (csv, xls, ...) are inferred from the source string.
headers (Union[int, List[int], List[str]], optional):
Either a row
number or list of row numbers (in case of multi-line headers) to be
considered as headers (rows start counting at 1), or the actual
headers defined as a list of strings. If not set, all rows will be
treated as containing values.
scheme (str, optional):
Scheme for loading the file (file, http, ...).
If not set, it'll be inferred from `source`.
format (str, optional):
File source's format (csv, xls, ...). If not
set, it'll be inferred from `source`.
encoding (str, optional):
Source encoding. If not set, it'll be inferred.
compression (str, optional):
Source file compression (zip, ...). If not set, it'll be inferred.
pick_rows (List[Union[int, str, dict]], optional):
The same as `skip_rows` but it's for picking rows instead of skipping.
skip_rows (List[Union[int, str, dict]], optional):
List of row numbers, strings and regex patterns as dicts to skip.
If a string, it'll skip rows whose first cell begins with it, e.g. '#' and '//'.
To skip only completely blank rows use `{'type': 'preset', 'value': 'blank'}`
To provide a regex pattern use `{'type': 'regex', 'value': '^#'}`
For example: `skip_rows=[1, '# comment', {'type': 'regex', 'value': '^# (regex|comment)'}]`
pick_fields (str[]):
When passed, ignores all columns with headers
that the given list DOES NOT include
skip_fields (str[]):
When passed, ignores all columns with headers
that the given list includes. If it contains an empty string it will skip
empty headers
sample_size (int, optional):
Controls the number of sample rows used to
infer properties from the data (headers, encoding, etc.). Set to
``0`` to disable sampling, in which case nothing will be inferred
from the data. Defaults to ``config.DEFAULT_SAMPLE_SIZE``.
bytes_sample_size (int, optional):
Same as `sample_size`, but instead
of number of rows, controls number of bytes. Defaults to
``config.DEFAULT_BYTES_SAMPLE_SIZE``.
allow_html (bool, optional):
Allow the file source to be an HTML page.
If False, raises ``exceptions.FormatError`` if the loaded file is
an HTML page. Defaults to False.
multiline_headers_joiner (str, optional):
When passed, it's used to join multiline headers
as `<passed-value>.join(header1_1, header1_2)`
Defaults to ' ' (space).
multiline_headers_duplicates (bool, optional):
By default tabulator will exclude a cell of a multiline header from joining
if it's exactly the same as the previously seen value in this field.
Enabling this option will force the inclusion of duplicates.
Defaults to False.
force_strings (bool, optional):
When True, casts all data to strings.
Defaults to False.
force_parse (bool, optional):
When True, don't raise exceptions when
parsing malformed rows, simply returning an empty row. Defaults
to False.
post_parse (List[function], optional):
List of generator functions that
receive a list of rows and headers, process them, and yield
them (or not). Useful to pre-process the data. Defaults to None.
custom_loaders (dict, optional):
Dictionary with keys as scheme names,
and values as their respective ``Loader`` class implementations.
Defaults to None.
custom_parsers (dict, optional):
Dictionary with keys as format names,
and values as their respective ``Parser`` class implementations.
Defaults to None.
custom_writers (dict, optional):
Dictionary with keys as writer format
names, and values as their respective ``Writer`` class
implementations. Defaults to None.
**options (Any, optional): Extra options passed to the loaders and parsers.
stream.closed
Returns True if the underlying stream is closed, False otherwise.
Returns
bool
: whether closed
stream.compression
Stream's compression ("no" if no compression)
Returns
str
: compression
stream.encoding
Stream's encoding
Returns
str
: encoding
stream.format
Path's format
Returns
str
: format
stream.fragment
Path's fragment
Returns
str
: fragment
stream.hash
Returns the SHA256 hash of the read chunks if available
Returns
str/None
: SHA256 hash
stream.headers
Headers
Returns
str[]/None
: headers if available
stream.sample
Returns the stream's rows used as sample.
These sample rows are used internally to infer characteristics of the source file (e.g. encoding, headers, ...).
Returns
list[]
: sample
stream.scheme
Path's scheme
Returns
str
: scheme
stream.size
Returns the BYTE count of the read chunks if available
Returns
int/None
: BYTE count
stream.source
Source
Returns
any
: stream source
stream.open
stream.open()
Opens the stream for reading.
Raises:
TabulatorException: if an error occurs
stream.close
stream.close()
Closes the stream.
stream.reset
stream.reset()
Resets the stream pointer to the beginning of the file.
stream.iter
stream.iter(keyed=False, extended=False)
Iterate over the rows.
Each row is returned in a format that depends on the arguments keyed
and extended
. By default, each row is returned as a list of its values.
Arguments
- keyed (bool, optional): When True, each returned row will be a dict mapping the header name to its value in the current row. For example, [{'name': 'J Smith', 'value': '10'}]. Ignored if extended is True. Defaults to False.
- extended (bool, optional): When True, returns each row as a tuple of (row_number, headers, row), e.g. (1, ['name', 'value'], ['J Smith', '10']). Defaults to False.

Raises
exceptions.TabulatorException: If the stream is closed.

Returns
Iterator[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]
:
The row itself. The format depends on the values of keyed
and
extended
arguments.
stream.read
stream.read(keyed=False, extended=False, limit=None)
Returns a list of rows.
Arguments
- keyed (bool, optional): See Stream.iter.
- extended (bool, optional): See Stream.iter.
- limit (int, optional): Number of rows to return. If None, returns all rows. Defaults to None.

Returns
List[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]
:
The list of rows. The format depends on the values of keyed
and extended
arguments.
stream.save
stream.save(target, format=None, encoding=None, **options)
Save stream to the local filesystem.
Arguments
- target (str): Path where to save the stream.
- format (str, optional): The format the stream will be saved as. If None, detects from the target path. Defaults to None.
- encoding (str, optional): Saved file encoding. Defaults to config.DEFAULT_ENCODING.
- **options: Extra options passed to the writer.

Returns
count (int?)
: Written rows count if available
Loader
Loader(self, bytes_sample_size, **options)
Abstract class implemented by the data loaders
The loaders inherit and implement this class' methods to add support for a new scheme (e.g. ssh).
Arguments
- bytes_sample_size (int): Sample size in bytes.
- **options (dict): Loader options.

loader.options
loader.load
loader.load(source, mode='t', encoding=None)
Load source file.
Arguments
- source (str): Path to the tabular source file.
- mode (str, optional): 't' (text) or 'b' (binary). Defaults to 't'.
- encoding (str, optional): Source encoding. Defaults to None.

Returns
Union[TextIO, BinaryIO]
: I/O stream opened either as text or binary.
Parser
Parser(self, loader, force_parse, **options)
Abstract class implemented by the data parsers.
The parsers inherit and implement this class' methods to add support for a new file type.
Arguments
- loader (tabulator.Loader): Loader instance to read the file.
- force_parse (bool): When True, the parser yields an empty extended row tuple (row_number, None, []) when there is an error parsing a row. Otherwise, it stops the iteration by raising tabulator.exceptions.SourceError.

parser.closed
Flag telling if the parser is closed.
Returns
bool
: whether closed
parser.encoding
Encoding
Returns
str
: encoding
parser.extended_rows
Returns extended rows iterator.
The extended rows are tuples containing (row_number, headers, row).
Raises
SourceError: If force_parse is False and a row can't be parsed, this exception will be raised. Otherwise, an empty extended row is returned (i.e. (row_number, None, [])).

Returns:
Iterator[Tuple[int, List[str], List[Any]]]:
Extended rows containing
(row_number, headers, row)
, where headers
is a list of the
header names (can be None
), and row
is a list of row
values.
parser.options
parser.open
parser.open(source, encoding=None)
Opens the underlying file stream at the beginning of the file.
The parser gets a byte or text stream from the tabulator.Loader instance and starts emitting items.
Arguments
- source (str): Path to the source.
- encoding (str, optional): Source encoding. Defaults to None.

Returns
None
parser.close
parser.close()
Closes underlying file stream.
parser.reset
parser.reset()
Resets underlying stream and current items list.
After reset()
is called, iterating over the items will start from the beginning.
Writer
Writer(self, **options)
Abstract class implemented by the data writers.
The writers inherit and implement this class' methods to add support for a new file destination.
Arguments
writer.options
writer.write
writer.write(source, target, headers, encoding=None)
Writes source data to target.
Arguments
- source: Data to write.
- target (str): Target path.
- headers (List[str]): List of header names.
- encoding (str, optional): Saved file encoding. Defaults to None.
Returns
count (int?)
: Written rows count if available
validate
validate(source, scheme=None, format=None)
Check if tabulator is able to load the source.
Arguments
- source (str): Path to the source.
- scheme (str, optional): Source scheme. If None, it'll be inferred. Defaults to None.
- format (str, optional): Source format. If None, it'll be inferred. Defaults to None.

Raises
SchemeError: The file scheme is not supported.
FormatError: The file format is not supported.

Returns
bool: Whether tabulator is able to load the source file.
TabulatorException
TabulatorException()
Base class for all tabulator exceptions.
SourceError
SourceError()
The source file could not be parsed correctly.
SchemeError
SchemeError()
The file scheme is not supported.
FormatError
FormatError()
The file format is unsupported or invalid.
EncodingError
EncodingError()
Encoding error
CompressionError
CompressionError()
Compression error
IOError
IOError()
Local loading error
LoadingError
LoadingError()
Loading error
HTTPError
HTTPError()
Remote loading error
The project follows the Open Knowledge International coding standards.
The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:
$ make install
To run tests with linting and coverage:
$ make test
To run tests without Internet:
$ pytest -m 'not remote'
Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.
- Added the workbook_cache argument for the XLSX format
- Added the stream.hashing_algorithm property
- Added the hashing_algorithm parameter
- Added the multiline_headers_duplicates flag
- Added stream.compression
- Added stream.source
- Added the pick_rows parameter (opposite to skip_rows)
- stream.save() now returns the count of written rows
- Use chardet for encoding detection by default; for cchardet: pip install tabulator[cchardet]. Due to a great deal of problems caused by cchardet for non-Linux/Conda installations, we're returning to chardet by default.
- Added the blank preset for skip_rows (#302)
- Added skip/pick_columns aliases (#293)
- Added the multiline_headers_joiner argument (#291)
- Updated skip_rows (#290)
- Added the xlsx writer
- Added the html reader
- Added the adjust_floating_point_error parameter to the xlsx parser
- Added the stream.size and stream.hash properties
- Added the http_timeout argument for the http/https format
- Added the stream.fragment field showing e.g. the Excel sheet's or DP resource's name
- Added the s3 file scheme (data loading from AWS S3)
- Added the stream.headers property
- The headers parameter will now use the first not-skipped row if the skip_rows parameter is provided and there are comments on top of a data source (see #264)
- Added preserve_formatting for xlsx
- Updated behaviour: in the ods format, the boolean, integer and datetime native types are now detected
- Updated behaviour: in the xls format, the boolean, integer and datetime native types are now detected
- New API added: skip_rows support for an empty string to skip rows with an empty first column
- New API added: format extraction from URLs like http://example.com?format=csv
- Updated behaviour: xls booleans will be parsed as booleans, not integers
- New API added: the skip_rows argument now supports negative numbers to skip rows starting from the end
- Updated behaviour: a UserWarning will be emitted if an option isn't recognized
- New API added: the http_session argument for the http/https format (it uses requests now)
- New API added: the headers argument accepts ranges like [1,3]
- New API added: support for compressed files (zip and gz) on Python 3
- New API added: the Stream constructor now accepts a compression argument
- New API added: the http/https scheme now accepts an http_stream flag
- Improved behaviour: the headers argument allows setting the order for keyed sources and cherry-picking values
- New API added: XLS/XLSX/ODS support sheet names passed via the sheet argument
- New API added: the Stream constructor accepts an ignore_blank_headers option
- Improved behaviour: rebased the datapackage format on the datapackage@1 library
- New API added: source for the Stream constructor can be a pathlib.Path
- New API added: bytes_sample_size for the Stream constructor
- New API added: stream.scheme, stream.format and stream.encoding
- Promoted provisional API to stable API: Loader (custom loaders), Parser (custom parsers), Writer (custom writers), validate
- New API added: the fill_merged_cells option for the xls/xlsx formats
- New API added: updated Loader/Parser/Writer API
- New API added: Stream arguments force_strings, force_parse and custom_writers
- Deprecated API removal: topen and Table (use Stream instead)
- Deprecated API removal: Stream arguments loader/parser_options (use **options instead)
- Provisional API changed: updated the Loader/Parser/Writer API (please use an updated version)
- Provisional API added: Stream arguments custom_loaders/parsers