Socket
Socket
Sign inDemoInstall

skeem

Package Overview
Dependencies
28
Maintainers
1
Alerts
File Explorer

Install Socket

Detect and block malicious and high-risk dependencies

Install

    skeem

Infer SQL DDL statements from tabular data


Maintainers
1

Readme

Skeem


About


You can use Skeem to infer SQL DDL statements from tabular data.

Skeem is, amongst others, based on the excellent ddlgenerator, frictionless, fsspec, pandas, SQLAlchemy, and xarray packages, and can be used both as a standalone program, and as a library.

Supported input data:

  • Apache Parquet_
  • CSV_
  • Google Sheets_
  • GRIB_
  • InfluxDB line protocol_
  • JSON_
  • NetCDF_
  • NDJSON_ (formerly LDJSON) aka. JSON Lines, see also JSON streaming
  • Office Open XML Workbook_ (Microsoft Excel_)
  • OpenDocument Spreadsheet_ (LibreOffice_)

Supported input sources:

  • Amazon S3_
  • File system_
  • GitHub_
  • Google Cloud Storage_
  • HTTP_

Please note that Skeem is alpha-quality software, and a work in progress. Contributions of all kinds are very welcome, in order to make it more solid. Breaking changes should be expected until a 1.0 release, so version pinning is recommended, especially when you use it as a library.


Synopsis


.. code-block:: sh

skeem infer-ddl --dialect=postgresql data.ndjson

.. code-block:: sql

CREATE TABLE "data" (
    "id" SERIAL NOT NULL,
    "name" TEXT NOT NULL,
    "date" TIMESTAMP WITHOUT TIME ZONE,
    "fruits" TEXT NOT NULL,
    "price" DECIMAL(2, 2) NOT NULL,
    PRIMARY KEY ("id")
);

Quickstart


If you are in a hurry, and want to run Skeem without any installation, just use the OCI image on Podman or Docker.

.. code-block:: sh

docker run --rm ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql \
    https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.ndjson

Setup


Install Skeem from PyPI.

.. code-block:: sh

pip install skeem

Install Skeem with support for additional data formats like NetCDF.

.. code-block:: sh

pip install 'skeem[scientific]'

Usage


This section outlines some example invocations of Skeem, both on the command line, and per library use. Other than the resources available from the web, testing data can be acquired from the repository's testdata_ folder.

Command line use

Help

.. code-block:: sh

skeem info
skeem --help
skeem infer-ddl --help

Read from files

.. code-block:: sh

# NDJSON, Parquet, and InfluxDB line protocol (ILP) formats.
skeem infer-ddl --dialect=postgresql data.ndjson
skeem infer-ddl --dialect=postgresql data.parquet
skeem infer-ddl --dialect=postgresql data.lp

# CSV, JSON, ODS, and XLSX formats.
skeem infer-ddl --dialect=postgresql data.csv
skeem infer-ddl --dialect=postgresql data.json
skeem infer-ddl --dialect=postgresql data.ods
skeem infer-ddl --dialect=postgresql data.xlsx
skeem infer-ddl --dialect=postgresql data.xlsx --address="Sheet2"

Read from URLs

.. code-block:: sh

# CSV, NDJSON, XLSX
skeem infer-ddl --dialect=postgresql https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.csv
skeem infer-ddl --dialect=postgresql https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.ndjson
skeem infer-ddl --dialect=postgresql https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.xlsx --address="Sheet2"

# Google Sheets: Address first sheet, and specific sheet of workbook.
skeem infer-ddl --dialect=postgresql --table-name=foo https://docs.google.com/spreadsheets/d/1ExyrawjlyksbC6DOM6nLolJDbU8qiRrrhxSuxf5ScB0/view
skeem infer-ddl --dialect=postgresql --table-name=foo https://docs.google.com/spreadsheets/d/1ExyrawjlyksbC6DOM6nLolJDbU8qiRrrhxSuxf5ScB0/view#gid=883324548

# InfluxDB line protocol (ILP)
skeem infer-ddl --dialect=postgresql https://github.com/influxdata/influxdb2-sample-data/raw/master/air-sensor-data/air-sensor-data.lp

# Compressed files in gzip format
skeem --verbose infer-ddl --dialect=crate --content-type=ndjson https://s3.amazonaws.com/crate.sampledata/nyc.yellowcab/yc.2019.07.gz

# CSV on S3
skeem --verbose infer-ddl --dialect=postgresql s3://noaa-ghcn-pds/csv/by_year/2022.csv

# CSV on Google Cloud Storage
skeem --verbose infer-ddl --dialect=postgresql gs://tinybird-assets/datasets/nations.csv
skeem --verbose infer-ddl --dialect=postgresql gs://tinybird-assets/datasets/medals1.csv

# CSV on GitHub
skeem --verbose infer-ddl --dialect=postgresql github://daq-tools:skeem@/tests/testdata/basic.csv

# GRIB2, NetCDF
skeem infer-ddl --dialect=postgresql https://github.com/earthobservations/testdata/raw/main/opendata.dwd.de/weather/nwp/icon/grib/18/t/icon-global_regular-lat-lon_air-temperature_level-90.grib2
skeem infer-ddl --dialect=postgresql https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc
skeem infer-ddl --dialect=postgresql https://www.unidata.ucar.edu/software/netcdf/examples/WMI_Lear.nc

OCI

OCI images are available on the GitHub Container Registry (GHCR). In order to run them on Podman or Docker, invoke:

.. code-block:: sh

docker run --rm ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql \
    https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.csv

If you want to work with files on your filesystem, you will need to mount the working directory into the container when running it, like:

.. code-block:: sh

docker run --rm --volume=$(pwd):/data ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql /data/basic.csv

In order to always run the latest development version, and to use a shortcut for that, this section outlining how to use an alias for skeem, and a variable for storing the URL, may be useful to save a few keystrokes.

.. code-block:: sh

alias skeem="docker run --rm --pull=always ghcr.io/daq-tools/skeem-standard:nightly skeem"
URL=https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.ndjson
skeem infer-ddl --dialect=postgresql $URL

More

Use a different backend (default: ddlgen)::

skeem infer-ddl --dialect=postgresql --backend=frictionless data.ndjson

Reading data from stdin needs to obtain both the table name and content type separately::

skeem infer-ddl --dialect=crate --table-name=foo --content-type=ndjson - < data.ndjson
skeem infer-ddl --dialect=crate --table-name=foo --content-type=json - < data.json
skeem infer-ddl --dialect=crate --table-name=foo --content-type=csv - < data.csv

Reading data from stdin also works like this, if you prefer to use pipes::

cat data.ndjson | skeem infer-ddl --dialect=crate --table-name=foo --content-type=ndjson -
cat data.json | skeem infer-ddl --dialect=crate --table-name=foo --content-type=json -
cat data.csv | skeem infer-ddl --dialect=crate --table-name=foo --content-type=csv -

Library use

.. code-block:: python

import io
from skeem.core import SchemaGenerator
from skeem.model import Resource, SqlTarget

INDATA = io.StringIO(
    """
    {"id":1,"name":"foo","date":"2014-10-31 09:22:56","fruits":"apple,banana","price":0.42}
    {"id":2,"name":"bar","date":null,"fruits":"pear","price":0.84}
    """
)

sg = SchemaGenerator(
    resource=Resource(data=INDATA, content_type="ndjson"),
    target=SqlTarget(dialect="crate", table_name="testdrive"),
)

print(sg.to_sql_ddl().pretty)

.. code-block:: sql

CREATE TABLE "testdrive" (
    "id" INT NOT NULL,
    "name" STRING NOT NULL,
    "date" TIMESTAMP,
    "fruits" STRING NOT NULL,
    "price" DOUBLE NOT NULL,
    PRIMARY KEY ("id")
);

Development


For installing the project from source, please follow the development_ documentation.


Project information


Credits

  • Catherine Devlin_ for ddlgenerator_ and data_dispenser_.
  • Mike Bayer_ for SQLAlchemy_.
  • Paul Walsh_ and Evgeny Karev_ for frictionless_.
  • Wes McKinney_ for pandas_.
  • All other countless contributors and authors of excellent Python packages, Python itself, and turtles all the way down.

Prior art

We are maintaining a list of other projects_ with the same or similar goals like Skeem.

Etymology

The program was about to be called Eskema, but it turned out that there is already another Eskema_ out there. So, it has been renamed to Skeem, which is Estonian, and means "schema", "outline", or "(to) plan".

.. _Amazon S3: https://en.wikipedia.org/wiki/Amazon_S3 .. _Apache Parquet: https://en.wikipedia.org/wiki/Apache_Parquet .. _Catherine Devlin: https://github.com/catherinedevlin .. _CSV: https://en.wikipedia.org/wiki/Comma-separated_values .. _data_dispenser: https://pypi.org/project/data_dispenser/ .. _ddlgenerator: https://pypi.org/project/ddlgenerator/ .. _development: doc/development.rst .. _Eskema: https://github.com/nombrekeff/eskema .. _Evgeny Karev: https://github.com/roll .. _file system: https://en.wikipedia.org/wiki/File_system .. _frictionless: https://github.com/frictionlessdata/framework .. _fsspec: https://pypi.org/project/fsspec/ .. _GitHub: https://github.com/ .. _Google Cloud Storage: https://en.wikipedia.org/wiki/Google_Cloud_Storage .. _Google Sheets: https://en.wikipedia.org/wiki/Google_Sheets .. _GRIB: https://en.wikipedia.org/wiki/GRIB .. _HTTP: https://en.wikipedia.org/wiki/HTTP .. _InfluxDB line protocol: https://docs.influxdata.com/influxdb/latest/reference/syntax/line-protocol/ .. _JSON: https://www.json.org/ .. _JSON Lines: https://jsonlines.org/ .. _JSON streaming: https://en.wikipedia.org/wiki/JSON_streaming .. _LibreOffice: https://en.wikipedia.org/wiki/LibreOffice .. _list of other projects: doc/prior-art.rst .. _Microsoft Excel: https://en.wikipedia.org/wiki/Microsoft_Excel .. _Mike Bayer: https://github.com/zzzeek .. _NDJSON: http://ndjson.org/ .. _NetCDF: https://en.wikipedia.org/wiki/NetCDF .. _Office Open XML Workbook: https://en.wikipedia.org/wiki/Office_Open_XML .. _OpenDocument Spreadsheet: https://en.wikipedia.org/wiki/OpenDocument .. _pandas: https://pandas.pydata.org/ .. _Paul Walsh: https://github.com/pwalsh .. _SQLAlchemy: https://pypi.org/project/SQLAlchemy/ .. _testdata: https://github.com/daq-tools/skeem/tree/main/tests/testdata .. _Wes McKinney: https://github.com/wesm .. _xarray: https://xarray.dev/

Keywords

FAQs


Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc