Socket
Socket
Sign inDemoInstall

losser

Package Overview
Dependencies
0
Maintainers
2
Alerts
File Explorer

Install Socket

Detect and block malicious and high-risk dependencies

Install

    losser

Filter, transform and export a list of JSON objects to JSON or CSV


Maintainers
2

Readme

Build Status Coverage Status Latest Version Downloads Supported Python versions Development Status License

Losser

A little UNIX command and Python library for lossy filter, transform, and export of JSON to Excel-compatible CSV. Created for ckanapi-exporter.

Losser can either be run as a UNIX command or used as a Python library (see Usage below). It takes a JSON-formatted list of objects (or a list of Python dicts) as input and produces a "table" as output.

The input objects don't all have to have the same fields or structure as each other, and may contain sub-lists and sub-objects arbitrarily nested.

The output "table" is a list of objects that all have the same keys in the same order, and with sub-objects and sub-lists nested no more than one level deep. It can be output as:

  • A list of Python OrderedDicts each having the same keys in the same order
  • A string of JSON-formatted text representing a list of objects each having the same keys in the same order (TODO)
  • A string of CSV-formatted text, one object per CSV row. The rows of the CSV correspond to the objects in the list of output objects if they had been returned as Python or JSON Data, and the columns correspond to the objects' keys.

The input objects can be filtered and transformed before producing the output table. You provide a list of "column query" objects in a columns.json file that specifies what columns the output table should have, and how the values for those columns should be retrieved from the input objects.

For example, if you had some input objects that looked like this:

[
  {
    "author": "Sean Hammond",
    "title": "An Example Input Object",
    "extras":
      {
        "Delivery Unit": "Commissioning"
      {
  },
  ...
]

You might transform them using a columns.json file like this:

{
    "Data Owner": {
        "pattern_path": "^author$"
    },
    "Title": {
        "pattern_path": "^title$"
    },
    "Delivery Unit": {
        "pattern_path": ["^extras$", "^Delivery Unit$"]
    }
}

This would output a CSV file like this:

Data Owner,Title,Delivery Unit
Sean Hammond,An Example Input Object,Commissioning
Frank Black,Another Example Object,Some Other Unit
...

The columns.json file above specifies three column headings for the output table:

  1. Data Owner
  2. Title
  3. Delivery Unit

The values for each column are retrieved from the input objects by following a "pattern path": a list of regular expressions that are matched against the keys of the input object and its sub-objects in turn to find a value.

For example the "Data Owner" field above has the pattern path "^author$" which matches the string "author". This will find top-level keys named "author" in the input objects and output their values in the "Data Owner" column of the output table.

The "Delivery Unit" column above has a more complex pattern path: ["^extras$", "^Delivery Unit$"]. This will find the top-level key "extras" in an input object and, assuming the value for the "extras" key is a sub-object, will find and return the value for the "Delivery Unit" key in the sub-object.

Pattern paths can be arbitrarily long, recursing into arbitrarily deeply nested sub-objects.

One of the patterns in a pattern path may match multiple keys in an object or sub-object. In that case losser recurses on each of the matched keys and ends up returning a list of values instead of a single value.

For example given this input object:

{
  "update": "yearly",
  "update frequency": "monthly",
  ...
}

The pattern path "^update.*" (which matches both "update" and "update frequency") would output "yearly, monthly" (a quoted, comma-separated list) in the corresponding cell in the CSV output.

If a pattern path goes through a sub-list in the input dict losser will iterate over the list and recurse on each of its items. Again it will end up returning a list of values instead of a single value.

For example, given a list of input objects like this:

[
  {
    "resources": [
      {
        "format": "CSV",
        ...
      },
      {
        "format": "JSON",
        ...
      },
      ...
    ],
    ...
  },
  ...
]

The pattern path ["^resources$", "^format$"] will find each object's "resources" sub-list and then find the "format" field in each object in the sub-list. The values in the CSV column will be lists like "CSV, JSON". List can optionally be deduplicated.

Nested lists can occur (when the input object contains a list of lists, for example). These are flattened in the output cells.

Some of the filtering and transformations you can do with losser include:

  • Extract some fields from the objects and filter out others.

    Any fields in an input object that do not match any of the pattern paths in the columns.json file are filtered out.

    (TODO: Support appending unmatched fields to the end of the ouput table as additional columns).

  • Specify the order of the columns in the output table.

    Columns are output in the same order that they appear in the columns.json file, which does not have to be the same order as the corresponding fields in the input objects.

  • Rename fields, using a different name for the column in the output table than for the field in the input objects.

    For example to get the "notes" field from each input object and place them all in a "description" column in the output table, put this object in your columns.json:

    "Description": {
      "pattern_path": "^notes$",
    }
    
  • Match patterns case-sensitively.

    By default patterns are matched case-insensitively. To do case-sensitive matching put "case_sensitive": true in a column query in your columns.json file:

    "Title": {
      "pattern_path": "^title$",
      "case_sensitive": true
    },
    

    This will match "title" in the input object, but not "Title" or "TITLE".

  • Transform the matched values, for example truncating or stripping whitespace from strings.

  • Provide arbitrary pre-processor and post-processor functions to do custom transformations on the input and output objects (TODO).

  • Find inconsistently-named fields using a pattern that matches any of the names and combine them into a single column in the output table.

    For example you can provide a pattern like "^update.*" that will find keys named "update", "Update", "Update Frequency" etc. in different input objects and collect their values in a single "Update Frequency" column.

  • Collect multiple fields together in a single column.

    If a pattern matches multiple fields they'll be output as a quoted comma-separated list in a single cell in the CSV.

    For example with an input object like this:

    {
      "Contributor 1": "Thom Yorke",
      "Contributor 2": "Nigel Godrich",
      "Contributor 3": "Jonny Greenwood",
      ...
    }
    

    The pattern "^Contributor.*" will match all three fields and the value in the CSV cell will be "Thom Yorke,Nigel Godrich,Jonny Greenwood".

  • You can specify that a pattern path should find a unique value in the object, and if more than one value in the object matches the pattern (and a list would be returned) an exception will be raised.

    Use "unique": true in a column query in your columns.json file:

    "Title": {
      "pattern_path": "^title$",
      "unique": true
    },
    

    This is useful for debugging pattern paths that you expect to be unique.

  • You can specify that a pattern path must match a value in the object, and an exception will be raised if there's no matching path through the object (TODO).

  • When a pattern matches multiple paths through the input object, or matches a path going through a sub-list, the resulting list of values in the output table cell can be deduplicated. Put "deduplicate": true in a column query in your columns.json file:

    "Format": {
        "pattern_path": ["^resources$", "^format$"],
        "deduplicate": true
    },
    

What it can't do (yet):

  • Pattern match against the values of items (as opposed to their keys).

    When following a pattern path through an object, when losser hits an object/dictionary in the input, either one of the top-level objects in the list of input objects or a sub-object, losser matches the relevant regex against the object's keys and then recurses on the values of each of the matched keys.

    If the key matches the pattern it recurses, you can't also specify a pattern to match the value against.

    When it hits a string, number, boolean or None/null losser returns it. You can't give it a pattern to match the value against to decide whether to return it or not.

    When it hits a list losser iterates over the items in the list and for each item either returns it or, if it's a sub-list or sub-object, recurses. (When sub-lists or sub-objects would cause a nested list to be returned it's flattened into a single list and optionally deduplicated.) Again, you can't provide a pattern to be matched against each item to decide whether to return/recurse or not.

    Adding pattern matching against values as well as keys would add a lot of power.

Requirements

Python 2.7.

Installation

To install run:

pip install losser

To install for development, create and activate a Python virtual environment then do:

git clone https://github.com/ckan/losser.git
cd losser
python setup.py develop
pip install -r dev-requirements.txt

Usage

On the command-line losser reads input objects from stdin and writes the output table to stdout, making it composable with other UNIX commands. For example:

losser --columns columns.json < input.json > output.csv

You can also specify columns on the command line instead of using a columns.json file. For example:

losser --column "Data Owner" --pattern '^author$' --unique --case-sensitive --column "Description" --pattern '^notes$' --unique --case-sensitive --max-length 255 --column Formats --pattern '^resources$' '^format$' --case-sensitive --deduplicate

You specify one or more --column options with the column title as argument and followed by a --pattern option and any other column options (--unique, --case-sensitive, etc).

The --pattern option can take more than one space-separated argument, in the case where the column's pattern path contains more than one pattern, for example: --pattern "^resources$" "^format$".

Column options like --pattern, --unique, --max-length etc apply to the preceding --column on the command-line.

See losser -h for help.

This will read input objects from input.json, read column queries from columns.json, and write output objects to output.csv.

To use losser as a Python library:

import losser.losser as losser
table = losser.table(input_objects, columns)

input_objects should be a list of dicts. columns can be either a list of dicts or the path to a columns.json file (string). The returned table will be a list of dicts. If you pass csv=True to table() it'll return a CSV-formatted string instead. See table()'s docstring for more arguments.

Inheriting Losser's Command Line Interface

Losser's command line interface with --column and related arguments is fairly complicated to implement. You may want to offer the same command line features in your own losser-based command without having to reimplement them.

For example ckanapi-exporter offers the losser command line interface but adds its own --url and --apikey arguments to pull the input data directly from a CKAN site instead of reading it from stdin.

losser.cli provides make_parser() and parse() functions to enable inheriting its command-line interface. Here's how you use them:

parent_parser = losser.cli.make_parser(add_help=False, exclude_args=["-i"])
parser = argparse.ArgumentParser(
    description="Export datasets from a CKAN site to JSON or CSV.",
    parents=[parent_parser],
)
parser.add_argument("--url", required=True)
parser.add_argument("--apikey")
try:
    parsed_args = losser.cli.parse(parser=parser)
except losser.cli.CommandLineExit as err:
    sys.exit(err.code)
except losser.cli.CommandLineError as err:
    if err.message:
        parser.error(err.message)
url = parsed_args.url
columns = parsed_args.columns
apikey = parsed_args.apikey
datasets = get_datasets_from_ckan(url, apikey)
csv_string = losser.losser.table(datasets, columns, csv=True)

See ckanapi-exporter for a working example.

Running the Tests

First activate your virtualenv then install the dev requirements:

pip install -r dev-requirements.txt

Then to run the tests do:

nosetests

To run the tests and produce a test coverage report do:

nosetests --with-coverage --cover-inclusive --cover-erase --cover-tests

Keywords

FAQs


Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc