Losser
A little UNIX command and Python library for lossy filter, transform, and
export of JSON to Excel-compatible CSV.
Created for ckanapi-exporter.
Losser can either be run as a UNIX command or used as a Python library
(see Usage below). It takes a JSON-formatted list of objects
(or a list of Python dicts) as input and produces a "table" as output.
The input objects don't all have to have the same fields or structure as each
other, and may contain sub-lists and sub-objects arbitrarily nested.
The output "table" is a list of objects that all have the same keys in the same
order, and with sub-objects and sub-lists nested no more than one level deep.
It can be output as:
- A list of Python OrderedDicts each having the same keys in the same order
- A string of JSON-formatted text representing a list of objects each having
the same keys in the same order
(TODO)
- A string of CSV-formatted text, one object per CSV row. The rows of the CSV
correspond to the objects in the list of output objects if they had been
returned as Python or JSON Data, and the columns correspond to the objects'
keys.
The input objects can be filtered and transformed before producing the output
table. You provide a list of "column query" objects in a columns.json
file
that specifies what columns the output table should have, and how the values
for those columns should be retrieved from the input objects.
For example, if you had some input objects that looked like this:
[
{
"author": "Sean Hammond",
"title": "An Example Input Object",
"extras":
{
"Delivery Unit": "Commissioning"
{
},
...
]
You might transform them using a columns.json
file like this:
{
"Data Owner": {
"pattern_path": "^author$"
},
"Title": {
"pattern_path": "^title$"
},
"Delivery Unit": {
"pattern_path": ["^extras$", "^Delivery Unit$"]
}
}
This would output a CSV file like this:
Data Owner,Title,Delivery Unit
Sean Hammond,An Example Input Object,Commissioning
Frank Black,Another Example Object,Some Other Unit
...
The columns.json
file above specifies three column headings for the output
table:
- Data Owner
- Title
- Delivery Unit
The values for each column are retrieved from the input objects by following a
"pattern path": a list of regular expressions that are matched against the keys
of the input object and its sub-objects in turn to find a value.
For example the "Data Owner" field above has the pattern path "^author$"
which
matches the string "author". This will find top-level keys named "author" in
the input objects and output their values in the "Data Owner" column of the
output table.
The "Delivery Unit" column above has a more complex pattern path:
["^extras$", "^Delivery Unit$"]
. This will find the top-level key "extras" in
an input object and, assuming the value for the "extras" key is a sub-object,
will find and return the value for the "Delivery Unit" key in the sub-object.
Pattern paths can be arbitrarily long, recursing into arbitrarily deeply nested
sub-objects.
One of the patterns in a pattern path may match multiple keys in an object or
sub-object. In that case losser recurses on each of the matched keys and ends
up returning a list of values instead of a single value.
For example given this input object:
{
"update": "yearly",
"update frequency": "monthly",
...
}
The pattern path "^update.*"
(which matches both "update" and "update
frequency") would output "yearly, monthly"
(a quoted, comma-separated list)
in the corresponding cell in the CSV output.
If a pattern path goes through a sub-list in the input dict losser will iterate
over the list and recurse on each of its items. Again it will end up returning
a list of values instead of a single value.
For example, given a list of input objects like this:
[
{
"resources": [
{
"format": "CSV",
...
},
{
"format": "JSON",
...
},
...
],
...
},
...
]
The pattern path ["^resources$", "^format$"]
will find each object's
"resources" sub-list and then find the "format" field in each object in the
sub-list. The values in the CSV column will be lists like "CSV, JSON"
.
List can optionally be deduplicated.
Nested lists can occur (when the input object contains a list of lists, for
example). These are flattened in the output cells.
Some of the filtering and transformations you can do with losser include:
-
Extract some fields from the objects and filter out others.
Any fields in an input object that do not match any of the pattern paths in
the columns.json
file are filtered out.
(TODO: Support appending unmatched
fields to the end of the ouput table as additional columns).
-
Specify the order of the columns in the output table.
Columns are output in the same order that they appear in the columns.json
file, which does not have to be the same order as the corresponding fields in
the input objects.
-
Rename fields, using a different name for the column in the output table than
for the field in the input objects.
For example to get the "notes" field from each input object and place them
all in a "description" column in the output table, put this object in your
columns.json
:
"Description": {
"pattern_path": "^notes$",
}
-
Match patterns case-sensitively.
By default patterns are matched case-insensitively. To do case-sensitive
matching put "case_sensitive": true
in a column query in your
columns.json
file:
"Title": {
"pattern_path": "^title$",
"case_sensitive": true
},
This will match "title" in the input object, but not "Title" or "TITLE".
-
Transform the matched values, for example truncating or stripping whitespace
from strings.
-
Provide arbitrary pre-processor and post-processor functions to do custom
transformations on the input and output objects
(TODO).
-
Find inconsistently-named fields using a pattern that matches any of the
names and combine them into a single column in the output table.
For example you can provide a pattern like "^update.*"
that will find keys
named "update", "Update", "Update Frequency" etc. in different input objects
and collect their values in a single "Update Frequency" column.
-
Collect multiple fields together in a single column.
If a pattern matches multiple fields they'll be output as a quoted
comma-separated list in a single cell in the CSV.
For example with an input object like this:
{
"Contributor 1": "Thom Yorke",
"Contributor 2": "Nigel Godrich",
"Contributor 3": "Jonny Greenwood",
...
}
The pattern "^Contributor.*"
will match all three fields and the value in
the CSV cell will be "Thom Yorke,Nigel Godrich,Jonny Greenwood"
.
-
You can specify that a pattern path should find a unique value in the object,
and if more than one value in the object matches the pattern (and a list
would be returned) an exception will be raised.
Use "unique": true
in a column query in your columns.json
file:
"Title": {
"pattern_path": "^title$",
"unique": true
},
This is useful for debugging pattern paths that you expect to be unique.
-
You can specify that a pattern path must match a value in the object, and
an exception will be raised if there's no matching path through the object
(TODO).
-
When a pattern matches multiple paths through the input object, or matches a
path going through a sub-list, the resulting list of values in the output
table cell can be deduplicated. Put "deduplicate": true
in a column query
in your columns.json
file:
"Format": {
"pattern_path": ["^resources$", "^format$"],
"deduplicate": true
},
What it can't do (yet):
-
Pattern match against the values of items (as opposed to their keys).
When following a pattern path through an object, when losser hits an
object/dictionary in the input, either one of the top-level objects in the
list of input objects or a sub-object, losser matches the relevant regex
against the object's keys and then recurses on the values of each of the
matched keys.
If the key matches the pattern it recurses, you can't also specify a pattern
to match the value against.
When it hits a string, number, boolean or None
/null
losser returns
it. You can't give it a pattern to match the value against to decide whether
to return it or not.
When it hits a list losser iterates over the items in the list and for each
item either returns it or, if it's a sub-list or sub-object, recurses.
(When sub-lists or sub-objects would cause a nested list to be returned it's
flattened into a single list and optionally deduplicated.) Again, you can't
provide a pattern to be matched against each item to decide whether to
return/recurse or not.
Adding pattern matching against values as well as keys would add a lot of
power.
Requirements
Python 2.7.
Installation
To install run:
pip install losser
To install for development, create and activate a Python virtual environment
then do:
git clone https://github.com/ckan/losser.git
cd losser
python setup.py develop
pip install -r dev-requirements.txt
Usage
On the command-line losser reads input objects from stdin and writes the output
table to stdout, making it composable with other UNIX commands. For example:
losser --columns columns.json < input.json > output.csv
You can also specify columns on the command line instead of using a
columns.json file. For example:
losser --column "Data Owner" --pattern '^author$' --unique --case-sensitive --column "Description" --pattern '^notes$' --unique --case-sensitive --max-length 255 --column Formats --pattern '^resources$' '^format$' --case-sensitive --deduplicate
You specify one or more --column
options with the column title as argument
and followed by a --pattern
option and any other column options (--unique
,
--case-sensitive
, etc).
The --pattern
option can take more than one space-separated argument, in the
case where the column's pattern path contains more than one pattern, for
example: --pattern "^resources$" "^format$"
.
Column options like --pattern
, --unique
, --max-length
etc apply to the
preceding --column
on the command-line.
See losser -h
for help.
This will read input objects from input.json
, read column queries from
columns.json
, and write output objects to output.csv
.
To use losser as a Python library:
import losser.losser as losser
table = losser.table(input_objects, columns)
input_objects
should be a list of dicts. columns
can be either a list of
dicts or the path to a columns.json
file (string). The returned table
will
be a list of dicts. If you pass csv=True
to table()
it'll return a
CSV-formatted string instead. See table()
's docstring for more arguments.
Inheriting Losser's Command Line Interface
Losser's command line interface with --column
and related arguments is fairly
complicated to implement. You may want to offer the same command line features
in your own losser-based command without having to reimplement them.
For example ckanapi-exporter offers
the losser command line interface but adds its own --url
and --apikey
arguments to pull the input data directly from a CKAN site instead of reading
it from stdin.
losser.cli
provides make_parser()
and parse()
functions to enable
inheriting its command-line interface. Here's how you use them:
parent_parser = losser.cli.make_parser(add_help=False, exclude_args=["-i"])
parser = argparse.ArgumentParser(
description="Export datasets from a CKAN site to JSON or CSV.",
parents=[parent_parser],
)
parser.add_argument("--url", required=True)
parser.add_argument("--apikey")
try:
parsed_args = losser.cli.parse(parser=parser)
except losser.cli.CommandLineExit as err:
sys.exit(err.code)
except losser.cli.CommandLineError as err:
if err.message:
parser.error(err.message)
url = parsed_args.url
columns = parsed_args.columns
apikey = parsed_args.apikey
datasets = get_datasets_from_ckan(url, apikey)
csv_string = losser.losser.table(datasets, columns, csv=True)
See ckanapi-exporter for a
working example.
Running the Tests
First activate your virtualenv then install the dev requirements:
pip install -r dev-requirements.txt
Then to run the tests do:
nosetests
To run the tests and produce a test coverage report do:
nosetests --with-coverage --cover-inclusive --cover-erase --cover-tests