DuckDB Kernel for Jupyter
This is a simple DuckDB wrapper kernel which accepts SQL as input, executes it
using a previously loaded DuckDB instance and formats the output as a table.
There are some magic commands that make teaching easier with this kernel.
Quick Start
Table of Contents
Setup
Using pip
Run pip
to install the corresponding package from
pypi
after Jupyter is already installed.
pip install jupyter-duckdb
Register the kernel.
jupyter kernelspec install <path to the site-packages directory>/duckdb_kernel
Now start Jupyter the usual way and the kernel should be available.
If DuckDB cannot be installed on your system, you can use SQLite as a backend
instead. To do this, set the environment variable SQLITE
when running pip:
SQLITE=1 pip install jupyter-duckdb
Using Docker
Execute the following command to pull and run a prepared image.
docker run -p 8888:8888 troebs/jupyter-duckdb
This image can also be used with JupyterHub and the
DockerSpawner / SwarmSpawner
and probably with the
kubespawner. You can also build
your own image using the Dockerfile in the repository.
Usage
A detailed example can be found in the repository. The rest of this
section describes the magic commands.
A Note on Magic Commands
Many Jupyter kernels make a difference between magic commands for a single line
starting with one percent sign and others for a whole cell starting with two
percent signs. The upcoming magic commands always apply to a whole cell.
Therefore, it does not matter whether you use a single or two percent signs.
However, the magic commands must always be used at the beginning of a cell.
It is also possible to use more than one magic command per cell.
Load a Database
To load the database two magic commands are available.
CREATE
creates a new database and therefore overwrites files with the same
name without prompting. Using the optional parameter OF
you can either provide
another DuckDB file or a file with SQL statements. In the first case the
included tables will be copied to the new database, while in the second case the
SQL statements are just executed. We find this feature very useful to work in a
temporary copy of the data and therefore be able to restart at any time. The
last optional parameter WITH_TESTS
is described in
detail below.
%CREATE data.duckdb OF my_statements.sql
LOAD
on the other hand loads an existing database and returns an error if it
does not exist. (That is why OF
cannot be used with LOAD
! WITH_TESTS
on
the other hand is available also with this magic command.)
%LOAD data.duckdb
Only one database can be open at any time. If a new database is created or
loaded, the current one is closed first and saved to disk if necessary.
Please note that :memory:
is also a valid file path for DuckDB. The data is
then stored exclusively in the main memory. In combination with CREATE
and OF
this makes it possible to work on a temporary copy in memory.
Schema Diagrams
The magic command SCHEMA
can be used to create a simple schema diagram of the
loaded database, showing all created tables, their columns and data types, but
without any views. Primary keys are printed in bold and unique keys are
underlined. Foreign keys are also highlighted and the dependencies between the
tables are shown by arrows.
The optional flag TD
can be set to force a vertical layout. This
representation requires more space, but can improve readability.
%SCHEMA TD
Graphviz (dot
in PATH) is required to render schema diagrams.
Number of Rows
By default, only 20 rows are shown. All further lines are replaced by three
dots. When hovering over the three dots using the cursor, the number of omitted
lines is displayed. Of course, the number of lines displayed can be changed.
The magic command ALL_ROWS
and its short form ALL
can be used to display *
all* rows of the query in the same cell. Caution: With large result sets
this can lead to a frozen Jupyter instance.
%ALL_ROWS
SELECT *
FROM foo
The magic command QUERY_MAX_ROWS
followed by an integer can be used to change
the number of displayed rows for the current cell.
%QUERY_MAX_ROWS 50
SELECT *
FROM foo
The magic command MAX_ROWS
followed by an integer can be used to change the
number of displayed rows for all future queries including the current cell.
%MAX_ROWS 30
SELECT *
FROM foo
SELECT *
FROM bar
Ship Tests With Your Notebooks
Simple tests can be loaded together with the database with the help of
the WITH_TESTS
parameter. These tests are stored as a JSON file. Each test is
assigned a unique name, a result set and whether the test should check the order
of the result. A very simple test file looks like the following JSON object:
{
"task1": {
"ordered": false,
"equals": [
[
1,
"Name 1"
],
[
2,
"Name 2"
]
]
}
}
To bind a test to a cell, use the magic command TEST
in combination with a
name. After the cell is executed, the result is evaluated and then displayed
below the query result.
%TEST task1
SELECT 2, 'Name 2'
UNION
SELECT 1, 'Name 1'
Disclaimer: The integrated testing is work-in-progress and thus subject to
potentially incompatible changes and enhancements.
Relational Algebra
An interpreter for relational algebra queries is integrated in this kernel. The
magic command RA
activates the relational algebra mode for a single cell:
%RA
π a, b (σ c = 1 (R))
The supported operations are:
- Projection
π
- Selection
σ
- Rename
β
- Union
∪
- Intersection
∩
- Difference
\
- Natural Join
⋈
- Cross Product
×
- Division
÷
The Dockerfile also installs the Jupyter Lab plugin
jupyter-ra-extension. It adds
the symbols mentioned above and some other supported symbols to the toolbar for
insertion on click.
Domain Calculus
An interpreter for domain calculus queries is integrated in this kernel. The
magic command DC
activates the domain calculus mode for a single cell:
%DC
{ a, b | R(a, b, c) ∧ c = 1 }