If you often find yourself processing CSV files using python, you will quickly notice that, while being more comfortable, csv.DictReader remains way slower than csv.reader:
# To read a 1.5G CSV file:
csv.reader: 24s
csv.DictReader: 84s
casanova.reader: 25s
casanova is therefore an attempt to stick to csv.reader performance while still keeping a comfortable interface that can still handle headers (even duplicate ones, which csv.DictReader cannot), etc.
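The general idea of keeping csv.reader speed while still addressing columns by name can be sketched with the standard library alone. This is only an illustration of the principle, not casanova's implementation, and the sample data is made up:

```python
import csv
import io

# Hypothetical sample data standing in for a real CSV file.
data = io.StringIO("name,surname\nJohn,Moran\nMary,Albert\n")

rows = csv.reader(data)
fieldnames = next(rows)  # the first row holds the headers

# Resolve the column position once, outside the loop,
# so each row is then accessed by plain list indexing.
name_pos = fieldnames.index("name")

names = [row[name_pos] for row in rows]
print(names)
```

Rows stay plain lists here, which is what avoids the per-row dictionary construction that makes csv.DictReader slower.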
casanova is thus a good fit for you if you need to:
casanova also packs exotic utilities able to read csv files in reverse (without loading the whole file into memory and in regular O(n) time), so you can, for instance, fetch useful information at the end of a file to restart some aborted process.
Finally, casanova can be used as a command line tool able to evaluate python expressions (that can be parallelized if required) for each row of a given CSV file to produce typical results such as adding a column based on others etc.
The command line tool documentation can be found here.
For more generic tasks that don't require python evaluation, we recommend the very performant xsv tool instead, or our own fork of the tool.
You can install casanova with pip with the following command:
pip install casanova
If you want to be able to feed CSV files from the web to casanova readers & enrichers you will also need to install at least urllib3 and optionally certifi (if you want secure SSL). Note that a lot of python packages, including the popular requests library, already depend on those two, so it is likely you already have them installed anyway:
# Installing them explicitly
pip install urllib3 certifi
# Installing casanova with those implicitly
pip install casanova[http]
Straightforward CSV reader yielding list rows but giving some information about potential headers and their positions.
import casanova

with open('./people.csv') as f:
    # Creating a reader
    reader = casanova.reader(f)

    # Getting header information
    reader.fieldnames
    >>> ['name', 'surname']
    reader.headers
    >>> Headers(name=0, surname=1)
    name_pos = reader.headers.name
    name_pos = reader.headers['name']
    'name' in reader.headers
    >>> True

    # Iterating over the rows
    for row in reader:
        name = row[name_pos] # it's better to cache your pos outside the loop
        name = row[reader.headers.name] # this works, but is slower

    # Interested in a single column?
    for name in reader.cells('name'):
        print(name)

    # Need also the current row when iterating on cells?
    for row, name in reader.cells('name', with_rows=True):
        print(row, name)

    # Want to iterate over records
    # NOTE: this has a performance cost
    for name, surname in reader.records('name', 'surname'):
        print(name, surname)
    for record in reader.records(['name', 'age']):
        print(record[0])
    for record in reader.records({'name': 'name', 'age': 1}):
        print(record['age'])

    # No headers? No problem.
    reader = casanova.reader(f, no_headers=True)

# Note that you can also create a reader from a path
with casanova.reader('./people.csv') as reader:
    ...

# And if you need exotic encodings
with casanova.reader('./people.csv', encoding='latin1') as reader:
    ...

# The reader will also handle gzipped files out of the box
with casanova.reader('./people.csv.gz') as reader:
    ...

# If you have `urllib3` installed, casanova is also able to stream
# remote CSV file out of the box
with casanova.reader('https://mydomain.fr/some-file.csv') as reader:
    ...

# The reader will also accept iterables of rows
rows = [['name', 'surname'], ['John', 'Moran']]
reader = casanova.reader(rows)

# And you can of course use the typical dialect-related kwargs
reader = casanova.reader('./french-semicolons.csv', delimiter=';')

# Readers can also be closed if you want to avoid context managers
reader.close()
Arguments
False]: set to True if input_file has no headers.
utf-8]: encoding to use to open the file if input_file is a path.
prebuffer_bytes was set.
False]: before python 3.11, the csv module will raise when attempting to read a CSV file containing null bytes. If set to True, the reader will strip null bytes on the fly while parsing rows.
False]: whether to read the file in reverse (except for the header of course).
Properties
total kwarg.
no_headers=False.
no_headers=False.
Methods
with_rows=True if you want to iterate over a value, row tuple instead if required.
index, row tuples. Takes an optional start kwarg like builtin enumerate.
index, cell or index, row, cell if given with_rows=True. Takes an optional start kwarg like builtin enumerate.
RowWrapper object to wrap it.
Multiplexing
Sometimes, one column of your CSV file might contain multiple values, separated by an arbitrary separator character such as |.
In this case, it might be desirable to "multiplex" the file by making a reader emit one copy of the line with each of the values contained by a cell.
To do so, casanova exposes a special Multiplexer object you can give to any reader like so:
import casanova

# Most simple case: a column named "colors", separated by "|"
reader = casanova.reader(
    input_file,
    multiplex=casanova.Multiplexer('colors')
)

# Customizing the separator:
reader = casanova.reader(
    input_file,
    multiplex=casanova.Multiplexer('colors', separator='$')
)

# Renaming the column on the fly:
reader = casanova.reader(
    input_file,
    multiplex=casanova.Multiplexer('colors', new_column='color')
)
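To make the effect of multiplexing concrete, here is a minimal standard-library sketch of the same idea. The `multiplex` helper and the sample data are hypothetical and only illustrate the behavior described above, not how casanova implements it:

```python
import csv
import io


def multiplex(rows, pos, separator="|"):
    """Yield one copy of each row per value found in the cell at `pos`."""
    for row in rows:
        for value in row[pos].split(separator):
            copy = list(row)
            copy[pos] = value
            yield copy


# Hypothetical sample data with a multi-valued "colors" column.
data = io.StringIO("name,colors\nJohn,blue|red\nMary,green\n")

reader = csv.reader(data)
fieldnames = next(reader)
colors_pos = fieldnames.index("colors")

multiplexed = list(multiplex(reader, colors_pos))
print(multiplexed)
```

The row for John is emitted twice, once per color, while rows holding a single value pass through unchanged.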
A reverse CSV reader might sound silly, but it can be useful in some scenarios, especially when you need to read the last line of an output file in constant time, without reading the whole thing first.
It is mostly used by casanova resumers and it is unlikely you will need to use it on your own.
import casanova

# people.csv looks like this
# name,surname
# John,Doe
# Mary,Albert
# Quentin,Gold

with open('./people.csv', 'rb') as f:
    reader = casanova.reverse_reader(f)
    reader.fieldnames
    >>> ['name', 'surname']
    next(reader)
    >>> ['Quentin', 'Gold']
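For intuition, reading the last line of a file without scanning it from the start can be sketched as follows. This is a simplified illustration that assumes UTF-8 content and no newlines inside quoted cells. The `read_last_line` helper is hypothetical and is not casanova's implementation:

```python
import csv
import os
import tempfile


def read_last_line(path, chunk_size=4096):
    """Return the last non-empty line of a file, reading backwards in chunks."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b""
        while position > 0:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            buffer = f.read(step) + buffer
            # Ignore trailing line breaks, then look for the previous one.
            stripped = buffer.rstrip(b"\r\n")
            newline = stripped.rfind(b"\n")
            if newline != -1:
                return stripped[newline + 1:].decode("utf-8")
        # The whole file was read without finding an earlier line break.
        return buffer.rstrip(b"\r\n").decode("utf-8")


# Writing a throwaway file mirroring the people.csv example.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("name,surname\nJohn,Doe\nMary,Albert\nQuentin,Gold\n")
    path = tmp.name

last_row = next(csv.reader([read_last_line(path)]))
last_row_small_chunks = next(csv.reader([read_last_line(path, chunk_size=4)]))
os.remove(path)
print(last_row)
```

Only the tail of the file is read, so the cost depends on the length of the last line rather than on the size of the file.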
A class representing the headers of a CSV file. It is useful to find the row position of some columns and perform complex selection.
import casanova
# Headers can be instantiated thusly
headers = casanova.headers(['name', 'surname', 'age'])
# But you will usually use a reader or an enricher's one:
headers = casanova.reader(input_file).headers
# Accessing a column through attributes
headers.surname
>>> 1
# Accessing a column by indexing:
headers['surname']
>>> 1
# Getting a column
headers.get('surname')
>>> 1
headers.get('not-found')
>>> None
# Getting a duplicated column name
casanova.headers(['surname', 'name', 'name'])['name', 1]
>>> 2
casanova.headers(['surname', 'name', 'name']).get('name', index=1)
>>> 2
# Asking if a column exists:
'name' in headers
>>> True
# Retrieving fieldnames:
headers.fieldnames
>>> ['name', 'surname', 'age']
# Iterating over headers
for col in headers:
    print(col)
# Counting columns:
len(headers)
>>> 3
# Retrieving the nth header:
headers.nth(1)
>>> 'surname'
# Wrapping a row
headers.wrap(['John', 'Matthews', '45'])
>>> RowWrapper(name='John', surname='Matthews', age='45')
# Selecting some columns (by name and/or index):
headers.select(['name', 2])
>>> [0, 2]
# Selecting using xsv mini DSL:
headers.select('name,age')
>>> [0, 2]
headers.select('!name')
>>> [1, 2]
For more info about xsv mini DSL, check this part of the documentation.
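The duplicated column lookup shown above can be pictured as a mapping from each name to the list of positions where it appears. The following is a hypothetical sketch of that idea, not casanova's internal data structure:

```python
from collections import defaultdict


def index_headers(fieldnames):
    """Map each column name to every position where it occurs."""
    positions = defaultdict(list)
    for pos, name in enumerate(fieldnames):
        positions[name].append(pos)
    return dict(positions)


positions = index_headers(["surname", "name", "name"])

# Position of the second column named "name" (occurrence index 1).
second_name = positions["name"][1]
print(second_name)
```

A plain dict keyed by name, as csv.DictReader uses, would keep only one of the two "name" columns.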
casanova also exports a csv writer. It can automatically write headers when needed and is able to resume some tasks.
import casanova

with open('output.csv', 'w') as f:
    writer = casanova.writer(f, fieldnames=['name', 'surname'])
    writer.writerow(['John', 'Davis'])

    # If you want to write headers yourself:
    writer = casanova.writer(f, fieldnames=['name', 'surname'], write_header=False)
    writer.writeheader()
Arguments
False]: whether to strip null bytes when writing rows. Note that on python 3.10, there is a bug that causes a csv.writer to raise an error when attempting to write a row containing a null byte.
True]: whether to automatically write header if required (takes resuming into account).
Properties
Resuming
A casanova.writer is able to resume through a LastCellResumer.
casanova enrichers are basically a smart combination of both a reader and a writer.
They can be used to transform a given CSV file. This means you can transform its values on the fly, select some columns to keep from input and add new ones very easily.
Note that enrichers inherit from both casanova.reader and casanova.writer and therefore keep both their properties and methods.
import casanova

with open('./people.csv') as input_file, \
     open('./enriched-people.csv', 'w') as output_file:
    enricher = casanova.enricher(input_file, output_file)

    # The enricher inherits from casanova.reader
    enricher.fieldnames
    >>> ['name', 'surname']

    # You can iterate over its rows
    name_pos = enricher.headers.name
    for row in enricher:
        # Editing a cell, so that everyone is called John now
        row[name_pos] = 'John'
        enricher.writerow(row)

    # Want to add columns?
    enricher = casanova.enricher(input_file, output_file, add=['age', 'hair'])
    for row in enricher:
        enricher.writerow(row, ['34', 'blond'])

    # Want to keep only some columns from input?
    enricher = casanova.enricher(input_file, output_file, add=['age'], select=['surname'])
    for row in enricher:
        enricher.writerow(row, ['45'])

    # Want to select columns to keep using xsv mini dsl?
    enricher = casanova.enricher(input_file, output_file, select='!1-4')

    # You can of course still use #.cells etc.
    for row, name in enricher.cells('name', with_rows=True):
        print(row, name)
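To clarify what an enricher saves you from writing by hand, here is a rough standard-library equivalent of the "keep some columns and add new ones" case. It is only a sketch with made-up data and in-memory files, and it says nothing about how casanova is implemented:

```python
import csv
import io

# Hypothetical in-memory files standing in for people.csv and its enriched output.
input_file = io.StringIO("name,surname\nJohn,Moran\nMary,Albert\n")
output_file = io.StringIO()

reader = csv.reader(input_file)
writer = csv.writer(output_file, lineterminator="\n")

fieldnames = next(reader)
select = ["surname"]  # columns kept from the input
add = ["age"]         # columns appended to the output

# Resolve the kept positions once, then write the output header.
kept = [fieldnames.index(name) for name in select]
writer.writerow(select + add)

for row in reader:
    writer.writerow([row[i] for i in kept] + ["45"])

result = output_file.getvalue()
print(result)
```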
Arguments
False]: set to True if input_file has no headers.
utf-8]: encoding to use to open the file if input_file is a path.
prebuffer_bytes was set.
False]: whether to read the file in reverse (except for the header of course).
False]: before python 3.11, the csv module will raise when attempting to read a CSV file containing null bytes. If set to True, the reader will strip null bytes on the fly while parsing rows.
False]: whether to strip null bytes when writing rows. Note that on python 3.10, there is a bug that causes a csv.writer to raise an error when attempting to write a row containing a null byte.
True]: whether to automatically write header if required (takes resuming into account).
Properties
total kwarg.
no_headers=False.
no_headers=False.
no_headers=False.
no_headers=False.
Resuming
A casanova.enricher is able to resume through a RowCountResumer or a LastCellComparisonResumer.
Sometimes, you might want to process multiple input rows concurrently. This can mean that you will emit rows in an arbitrary order, different from the input one.
This is fine, of course, but if you still want to be able to resume an aborted process efficiently (using the IndexedResumer), your output will need specific additions for it to work, namely a column containing the index of an output row in the original input.
casanova.indexed_enricher makes it simpler by providing a tailored writerow method and iterators that always provide the index of a row safely.
Note that such resuming is only possible if one row in the input will produce exactly one row in the output.
import casanova

with open('./people.csv') as f, \
     open('./enriched-people.csv', 'w') as of:
    enricher = casanova.indexed_enricher(f, of, add=['age', 'hair'])

    for index, row in enricher:
        enricher.writerow(index, row, ['67', 'blond'])

    for index, value in enricher.cells('name'):
        ...

    for index, row, value in enricher.cells('name', with_rows=True):
        ...
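The reason an index column is needed becomes clearer with a concurrent example. The sketch below uses only the standard library. The `work` function, the sample data and the placement of the index as the first output column are all assumptions made for illustration:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(index, row):
    """Hypothetical per-row task returning the index, the row and new cells."""
    return index, row, [str(len(row[0]))]


input_file = io.StringIO("name\nJohn\nMary\nQuentin\n")
output_file = io.StringIO()

reader = csv.reader(input_file)
fieldnames = next(reader)

writer = csv.writer(output_file, lineterminator="\n")
writer.writerow(["index"] + fieldnames + ["name_length"])

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(work, i, row) for i, row in enumerate(reader)]

    # Results arrive in completion order, which may differ from input order.
    for future in as_completed(futures):
        index, row, addendum = future.result()
        writer.writerow([index] + row + addendum)

written = list(csv.reader(io.StringIO(output_file.getvalue())))
print(written)
```

Because each output row carries the index of its input row, the set of finished rows can be recovered later even though the output order is arbitrary.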
Arguments
Everything from casanova.enricher plus:
index]: name of the automatically added index column.
Resuming
A casanova.indexed_enricher is able to resume through an IndexedResumer.
Sometimes, you might want to process a CSV file and paginate API calls per row. This means that each row of your input file should produce multiple new lines, which will be written in batch each time one call from the API returns.
Sometimes, the pagination might be quite long (think collecting the Twitter followers of a very popular account), and it would not be a good idea to accumulate all the results for a single row before flushing them to file atomically because if something goes wrong, you will lose a lot of work.
But if you still want to be able to resume if the process is aborted, you will need to add some things to your output, namely a column containing optional "cursor" data to resume your API calls and an "end" symbol indicating we finished the current input row.
import casanova

with open('./twitter-users.csv') as input_file, \
     casanova.BatchResumer('./output.csv') as output_file:
    enricher = casanova.batch_enricher(input_file, output_file)

    for row in enricher:
        for results, next_cursor in paginate_api_calls(row):
            # NOTE: if we reached the end, next_cursor is None
            enricher.writebatch(row, results, next_cursor)
Arguments
Everything from casanova.enricher plus:
cursor]: name of the cursor column to add.
end]: unambiguous (from cursor) end symbol to mark end of input row processing.
Resuming
A casanova.batch_enricher is able to resume through a BatchResumer.
Through handy Resumer classes, casanova lets its enrichers and writers resume an aborted process.
Those classes must be used as a wrapper to open the output file and can assess whether resuming is actually useful or not for you.
All resumers act like file handles, can be used as a context manager using the with keyword and can be manually closed using the close method if required.
Finally, know that resumers should work perfectly well with multiplexing.
The RowCountResumer works by counting the number of lines of the output and skipping that many lines from the input.
It can only work in 1-to-1 scenarios where you only emit a single row per input row.
It works in O(2n) => O(n) time and O(1) memory, n being the number of already processed rows.
It is only supported by casanova.enricher.
import casanova

with open('input.csv') as input_file, \
     casanova.RowCountResumer('output.csv') as resumer:
    # Want to know if we can resume?
    resumer.can_resume()

    # Want to know how many rows were already done?
    resumer.already_done_count()

    # Giving the resumer to an enricher as if it was the output file
    enricher = casanova.enricher(input_file, resumer)
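The row-count strategy described above boils down to two passes: count the rows already written, then skip as many input rows. Here is a minimal standard-library sketch using made-up in-memory files, not the actual resumer code:

```python
import csv
import io
import itertools

# Hypothetical input file and the partial output left by an aborted run.
input_file = io.StringIO("name\nJohn\nMary\nQuentin\n")
previous_output = io.StringIO("name,age\nJohn,34\n")

# First pass: count the rows already written (header excluded).
output_reader = csv.reader(previous_output)
next(output_reader, None)
already_done = sum(1 for _ in output_reader)

# Second pass: skip that many rows from the input.
input_reader = csv.reader(input_file)
next(input_reader)
remaining = list(itertools.islice(input_reader, already_done, None))
print(already_done, remaining)
```

Both passes stream the files, which is why this approach only needs constant memory.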
casanova exports an indexed resumer that allows rows to be processed concurrently and emitted in a different order.
In this precise case, counting the rows is not enough and we need to be smarter.
One way to proceed is to leverage the index column added by the indexed enricher to compute a set of already processed rows while reading the output. Then we can just skip the input rows whose indices are in this set.
The issue here is that this consumes up to O(n) memory, which is prohibitive in some use cases.
To make sure this still can be done while consuming very little memory, casanova uses an exotic data structure we named a "contiguous range set".
This means we can resume operation in O(n + log(h) * n) => O(log(h) * n) time and O(log(h)) memory, n being the number of already processed rows and h being the size of the largest hole in the sorted indices of those same rows. Note that most of the time h << n since the output is mostly sorted (albeit not at a local level).
You can read more about this data structure in this blog post.
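To give a feel for why storing ranges instead of individual indices saves memory, here is a simplified interval-based set. It is only a sketch of the general idea. The `RangeSet` class is hypothetical, it is not casanova's ContiguousRangeSet, and it makes no claim to match the complexity figures given above:

```python
import bisect


class RangeSet:
    """Store integers as sorted, disjoint, inclusive [start, end] intervals."""

    def __init__(self):
        self.starts = []
        self.ends = []

    def __contains__(self, value):
        i = bisect.bisect_right(self.starts, value) - 1
        return i >= 0 and value <= self.ends[i]

    def add(self, value):
        if value in self:
            return
        # Index of the first interval starting after `value`.
        i = bisect.bisect_right(self.starts, value)
        merge_left = i > 0 and self.ends[i - 1] == value - 1
        merge_right = i < len(self.starts) and self.starts[i] == value + 1
        if merge_left and merge_right:
            # `value` fills the hole between two intervals: fuse them.
            self.ends[i - 1] = self.ends[i]
            del self.starts[i]
            del self.ends[i]
        elif merge_left:
            self.ends[i - 1] = value
        elif merge_right:
            self.starts[i] = value
        else:
            self.starts.insert(i, value)
            self.ends.insert(i, value)

    def intervals(self):
        return list(zip(self.starts, self.ends))


done = RangeSet()
for index in [0, 1, 3, 2, 5]:  # indices read from a mostly sorted output
    done.add(index)

print(done.intervals())
```

Five indices collapse into two intervals here, and the memory used grows with the number of holes rather than with the number of processed rows.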
Note finally this resumer can only work in 1-to-1 scenarios where you only emit a single row per input row.
It is supported by casanova.indexed_enricher only.
import casanova

with open('input.csv') as input_file, \
     casanova.IndexedResumer('output.csv') as resumer:
    # Want to know if we can resume?
    resumer.can_resume()

    # Want to know how many rows were already done?
    resumer.already_done_count()

    # Giving the resumer to an enricher as if it was the output file
    enricher = casanova.indexed_enricher(input_file, resumer)
# If you want to use casanova ContiguousRangeSet for whatever reason
from casanova import ContiguousRangeSet
todo...
Sometimes you might write an output CSV file while performing some paginated action. Said action could be aborted and you might want to resume it where you left it.
The LastCellResumer therefore enables you to resume writing a CSV file by reading its output's last row using a casanova.reverse_reader and extracting the value you need to resume in constant time and memory.
It is only supported by casanova.writer.
import casanova

with casanova.LastCellResumer('output.csv', value_column='user_id') as resumer:
    # Giving the resumer to a writer as if it was the output file
    writer = casanova.writer(resumer)

    # Extracting last relevant value if any, so we can properly resume
    last_value = resumer.get_state()
In some scenarios, it is possible to resume the operation of an enricher if you can know what was the last value of some column emitted in the output.
Fortunately, using casanova.reverse_reader, one can read the last line of a CSV file in constant time.
As such the LastCellComparisonResumer enables you to resume the work of an enricher in O(n) time and O(1) memory, with n being the number of already done lines that you must quickly skip when repositioning yourself in the input.
Note that it only works when the enricher emits a single line per line in the input and when the considered column value is unique across the input file.
It is only supported by casanova.enricher.
import casanova

with open('input.csv') as input_file, \
     casanova.LastCellComparisonResumer('output.csv', value_column='user_id') as resumer:
    # Giving the resumer to an enricher as if it was the output file
    enricher = casanova.enricher(input_file, resumer)
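The repositioning step described above can be sketched as follows: once the last emitted value is known, input rows are skipped until that value is met. The data and the hard-coded `last_value` are hypothetical, and this is only an illustration of the principle:

```python
import csv
import io

# Hypothetical input file, with a unique user_id per row.
input_file = io.StringIO("user_id,name\n1,John\n2,Mary\n3,Quentin\n")

# Assumed to have been read from the last row of the existing output.
last_value = "2"

reader = csv.reader(input_file)
fieldnames = next(reader)
pos = fieldnames.index("user_id")

# Skip every row up to and including the one already processed last.
for row in reader:
    if row[pos] == last_value:
        break

remaining = list(reader)
print(remaining)
```

This is why the column value must be unique across the input: a repeated value would make the loop stop at the wrong row.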
casanova exposes a helper function that one can use to quickly count the number of lines in a CSV file.
import casanova
count = casanova.count('./people.csv')
# You can also stop reading the file if you go beyond a number of rows
count = casanova.count('./people.csv', max_rows=100)
>>> None # if the file has more than 100 rows
>>> 34 # else the actual count
# Any additional kwarg will be passed to the underlying reader as-is
count = casanova.count('./people.csv', delimiter=';')
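The early-exit behavior of max_rows can be illustrated with a small standard-library function. The `count_rows` helper is hypothetical, and excluding the header row from the count is an assumption of this sketch:

```python
import csv
import io


def count_rows(f, max_rows=None):
    """Count data rows, returning None as soon as `max_rows` is exceeded."""
    reader = csv.reader(f)
    next(reader, None)  # assumption: the header row is not counted
    count = 0
    for _ in reader:
        count += 1
        if max_rows is not None and count > max_rows:
            return None
    return count


sample = "name\nJohn\nMary\nQuentin\n"

total = count_rows(io.StringIO(sample))
capped = count_rows(io.StringIO(sample), max_rows=2)
exact = count_rows(io.StringIO(sample), max_rows=3)
print(total, capped, exact)
```

Stopping early avoids reading a huge file to the end when you only need to know whether it is larger than some threshold.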
casanova exposes a helper function using a reverse_reader to read only the last cell value from a given column of a CSV file.
import casanova
last_cell = casanova.last_cell('./people.csv', column='name')
>>> 'Quentin'
# Will return None if the file is empty
last_cell = casanova.last_cell('./empty.csv', column='name')
>>> None
# Any additional kwarg will be passed to the underlying reader as-is
last_cell = casanova.last_cell('./people.csv', column='name', delimiter=';')
casanova.set_defaults lets you edit global defaults for casanova:
import casanova

casanova.set_defaults(strip_null_bytes_on_read=True)

# As a context manager:
with casanova.temporary_defaults(strip_null_bytes_on_read=True):
    ...
Arguments
False]: should readers and enrichers strip null bytes on read?
False]: should writers and enrichers strip null bytes on write?
xsv, a command line tool written in Rust to handle csv files, uses a clever mini DSL to let users specify column selections.
casanova has a working python implementation of this mini DSL that can be used by the headers.select method and the enrichers select kwargs.
Here is the gist of it (copied right from xsv documentation itself):
Select one column by name:
* name
Select one column by index (1-based):
* 2
Select the first and fourth columns:
* 1,4
Select the first 4 columns (by index and by name):
* 1-4
* Header1-Header4
Ignore the first 2 columns (by range and by omission):
* 3-
* '!1-2'
Select the third column named 'Foo':
* 'Foo[2]'
Re-order and duplicate columns arbitrarily:
* 3-1,Header3-Header1,Header1,Foo[2],Header1
Quote column names that conflict with selector syntax:
* '"Date - Opening","Date - Actual Closing"'
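To show how such selections resolve to column positions, here is a deliberately reduced sketch. The `select` function is hypothetical and only supports names, 1-based indices, ascending closed index ranges and a leading "!". Open-ended ranges, name ranges, reversed ranges, `Foo[2]` and quoting are left out:

```python
def select(fieldnames, selection):
    """Resolve a simplified selection string to a list of column positions."""
    negate = selection.startswith("!")
    if negate:
        selection = selection[1:]

    positions = []
    for part in selection.split(","):
        part = part.strip()
        if part in fieldnames:
            # Selection by name.
            positions.append(fieldnames.index(part))
        elif part.isdigit():
            # Selection by 1-based index.
            positions.append(int(part) - 1)
        else:
            # Closed, ascending, 1-based index range such as "1-4".
            start, _, end = part.partition("-")
            positions.extend(range(int(start) - 1, int(end)))

    if negate:
        excluded = set(positions)
        return [i for i in range(len(fieldnames)) if i not in excluded]

    return positions


fieldnames = ["name", "surname", "age"]

by_name = select(fieldnames, "name,age")
without_name = select(fieldnames, "!name")
by_range = select(fieldnames, "1-2")
without_range = select(fieldnames, "!1-2")
print(by_name, without_name, by_range, without_range)
```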
Specialized & performant CSV readers, writers and enrichers for python.