scrapbook

The scrapbook library records a notebook’s data values and generated visual
content as "scraps". Recorded scraps can be read at a future time.
See the scrapbook documentation for
more information on how to use scrapbook.
WARNING: This is the old package name, nteract-scrapbook. Please install scrapbook in the
future, as no new releases are going out under this old package name.
Use Cases
Notebook users may wish to record data produced during a notebook's execution.
This recorded data, scraps, can be used at a later time or passed in a
workflow to another notebook as input.
Namely, scrapbook lets you:
- persist data and visual content displays in a notebook as scraps
- recall any persisted scrap of data
- summarize collections of notebooks
Python Version Support
This library's long term support target is Python 3.5+. It currently also
supports Python 2.7 until Python 2 reaches end-of-life in 2020. After this
date, Python 2 support will halt, and only 3.x versions will be maintained.
Installation
Install using pip:
pip install nteract-scrapbook
For installing optional IO dependencies, you can specify individual store bundles,
like s3 or azure:
pip install nteract-scrapbook[s3]
or use all:
pip install nteract-scrapbook[all]
Models and Terminology
Scrapbook defines the following items:
- scraps: serializable data values and visualizations such as strings, lists of
objects, pandas dataframes, charts, images, or data references.
- notebook: a wrapped nbformat notebook object with extra methods for interacting
with scraps.
- scrapbook: a collection of notebooks with an interface for asking questions of
the collection.
- encoders: a registered translator of data to/from notebook
storage formats.
scrap model
The scrap model houses a few key attributes in a tuple, including:
- name: The name of the scrap
- data: Any data captured by the scrapbook api call
- encoder: The name of the encoder used to encode/decode data to/from the notebook
- display: Any display data used by IPython to display visual content
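The four attributes above can be pictured as a named tuple. The following is a minimal sketch that assumes a plain `namedtuple` stand-in; it is not taken from the library's source, and the layout of the `display` payload is an assumption made for illustration.

```python
from collections import namedtuple

# Illustrative stand-in for the scrap model; field names follow the list above.
Scrap = namedtuple("Scrap", ["name", "data", "encoder", "display"])

# A data-only scrap: no display payload is attached.
number = Scrap(name="number", data=123, encoder="json", display=None)

# A display-only scrap: the data slot is empty and the display slot holds
# a media-type -> content mapping (this exact layout is an assumption).
greeting = Scrap(
    name="hello",
    data=None,
    encoder="display",
    display={"data": {"text/plain": "'Hello World'"}, "metadata": {}},
)

# Tuples are immutable, so a change such as a rename yields a new scrap.
renamed = number._replace(name="old_number")
print(renamed.name, renamed.data)
```

Keeping the model a tuple means a scrap can be copied or renamed without mutating the value read from the source notebook.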
API
Scrapbook adds a few basic API commands that enable saving and retrieving data,
including:
- glue: persists scraps with or without display output
- read_notebook: reads one notebook
- scraps: provides a searchable dictionary of all scraps by name
- reglue: copies a scrap from another notebook to the current notebook
- read_notebooks: reads many notebooks from a given path
- scraps_report: displays a report about collected scraps
- papermill_dataframe and papermill_metrics: provided for backward compatibility
with two deprecated papermill features
The following sections provide more detail on these api commands.
glue to persist scraps
Records a scrap (data or display value) in the given notebook cell.
The scrap (recorded value) can be retrieved during later inspection of the
output notebook.
"""glue example for recording data values"""
import scrapbook as sb
sb.glue("hello", "world")
sb.glue("number", 123)
sb.glue("some_list", [1, 3, 5])
sb.glue("some_dict", {"a": 1, "b": 2})
sb.glue("non_json", df, 'arrow')
The scrapbook library can be used later to recover scraps from the output
notebook:
nb = sb.read_notebook('notebook.ipynb')
nb.scraps
scrapbook infers the storage format from the value type, using the registered
data encoders. Alternatively, the inferred encoding format can be overridden by
setting the encoder argument to the registered name (e.g. "json") of a
particular encoder.
This data is persisted by generating a display output with a special media type
identifying the content encoding format and data. These outputs are not always
visible in notebook rendering but still exist in the document. Scrapbook can
then rehydrate the data associated with the notebook in the future by reading
these cell outputs.
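The rehydration step described above can be sketched with the standard library alone. This is a minimal illustration, not scrapbook's implementation: the media-type prefix, the payload layout, and the helper name `collect_scraps` are all assumptions made for the example.

```python
import json

# Hypothetical media-type prefix, used only for this sketch.
SCRAP_PREFIX = "application/scrapbook.scrap."

# A tiny hand-written notebook document containing two glued values.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "source": "sb.glue('number', 123)",
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        SCRAP_PREFIX + "json+json": {
                            "name": "number", "data": 123, "encoder": "json",
                        }
                    },
                }
            ],
        },
        {"cell_type": "markdown", "source": "# Notes"},
        {
            "cell_type": "code",
            "source": "sb.glue('hello', 'world')",
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        SCRAP_PREFIX + "text+json": {
                            "name": "hello", "data": "world", "encoder": "text",
                        }
                    },
                }
            ],
        },
    ],
}


def collect_scraps(nb):
    """Scan every cell output and gather payloads stored under the prefix."""
    found = {}
    for cell in nb.get("cells", []):
        # Markdown cells have no outputs, hence the default.
        for output in cell.get("outputs", []):
            for media_type, payload in output.get("data", {}).items():
                if media_type.startswith(SCRAP_PREFIX):
                    found[payload["name"]] = payload["data"]
    return found


# Round-trip through JSON to mimic reading the document from disk.
scraps = collect_scraps(json.loads(json.dumps(notebook)))
print(scraps)
```

Because the values live in ordinary cell outputs, any tool that can parse the notebook JSON can recover them, even when a renderer chooses not to show that media type.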
With display output
To display a named scrap with visible display outputs, you need to indicate that
the scrap is directly renderable.
This can be done by toggling the display argument.
sb.glue("hello", "Hello World", display=True)
The call will save the data and the display attributes of the Scrap object,
making it visible as well as encoding the original data. This leans on the
IPython.core.formatters.format_display_data function to translate the data
object into a display and metadata dict for the notebook kernel to parse.
Another pattern that can be used is to specify that only the display data
should be saved, and not the original object. This is achieved by setting
the encoder to be display.
sb.glue("sharable_png",
    IPython.display.Image(filename="sharable.png"),
    encoder='display'
)
Finally the media types that are generated can be controlled by passing
a list, tuple, or dict object as the display argument.
sb.glue("media_as_text_only",
    media_obj,
    encoder='display',
    display=('text/plain',)
)
sb.glue("media_without_text",
    media_obj,
    encoder='display',
    display={'exclude': 'text/plain'}
)
Like data scraps, these can be retrieved at a later time by accessing the scrap's
display attribute, though usually one will just use Notebook's reglue method
(described below).
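To make the list, tuple, and dict forms of the display argument more concrete, here is a rough sketch of how such a value could select media types from a display bundle. The helper `filter_media_types` is hypothetical and is not part of scrapbook or IPython; the `include` key is likewise an assumption, mirroring the `exclude` key shown in the examples above.

```python
def filter_media_types(bundle, display=True):
    """Return the entries of a media-type bundle selected by `display`.

    Hypothetical helper for illustration only.
    """
    if display is True:
        return dict(bundle)
    if not display:
        return {}
    if isinstance(display, dict):
        include = display.get("include")
        exclude = display.get("exclude", ())
        # Accept a single media type given as a bare string.
        if isinstance(include, str):
            include = (include,)
        if isinstance(exclude, str):
            exclude = (exclude,)
        return {
            media_type: content
            for media_type, content in bundle.items()
            if (include is None or media_type in include)
            and media_type not in exclude
        }
    # A list or tuple is treated as the set of media types to keep.
    return {
        media_type: content
        for media_type, content in bundle.items()
        if media_type in display
    }


# Placeholder bundle; the content strings are stand-ins, not real image data.
bundle = {"text/plain": "<Image object>", "image/png": "placeholder-bytes"}

text_only = filter_media_types(bundle, ("text/plain",))
without_text = filter_media_types(bundle, {"exclude": "text/plain"})
print(sorted(text_only), sorted(without_text))
```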
read_notebook reads one notebook
Reads a Notebook object loaded from the location specified at path.
You've already seen how this function is used in the above API call examples,
but essentially this provides a thin wrapper over nbformat's NotebookNode
with the ability to extract scrapbook scraps.
nb = sb.read_notebook('notebook.ipynb')
This Notebook object adheres to nbformat's JSON schema,
allowing for access to its required fields.
nb.cells
nb.metadata
nb.nbformat
nb.nbformat_minor
There are a few additional methods provided, most of which are outlined in more detail
below:
nb.scraps
nb.reglue
The abstraction also makes saved content available as a dataframe referencing each
key and source. More of these methods will be made available in later versions.
nb.scrap_dataframe
The Notebook object also has a few legacy functions for backwards compatibility
with papermill's Notebook object model. As a result, it can be used to read
papermill execution statistics as well as scrapbook abstractions:
nb.cell_timing
nb.execution_counts
nb.papermill_metrics
nb.papermill_record_dataframe
nb.parameter_dataframe
nb.papermill_dataframe
The notebook reader relies on papermill's registered iorw to enable access to a
variety of sources such as, but not limited to, S3, Azure, and Google Cloud.
scraps provides a name -> scrap lookup
The scraps method allows for access to all of the scraps in a particular notebook.
nb = sb.read_notebook('notebook.ipynb')
nb.scraps
This object has a few additional methods as well for convenient conversion and
execution.
nb.scraps.data_scraps
nb.scraps.data_dict
nb.scraps.display_scraps
nb.scraps.display_dict
nb.scraps.dataframe
These methods allow for simple use-cases to not require digging through model
abstractions.
reglue copies a scrap into the current notebook
Using reglue, one can take any scrap glued into one notebook and glue it into the
current one.
nb = sb.read_notebook('notebook.ipynb')
nb.reglue("table_scrap")
Any data or display information will be copied verbatim into the currently
executing notebook as though the user called glue again on the original source.
It's also possible to rename the scrap in the process.
nb.reglue("table_scrap", "old_table_scrap")
Finally, if one wishes to try to reglue without checking for existence,
raise_on_missing can be set to False to just display a message on failure.
nb.reglue("maybe_missing", raise_on_missing=False)
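The copy, rename, and missing-scrap behaviours described above can be modelled in a few lines. This is only a sketch under stated assumptions: `reglue_sketch`, the dict-based scrap stores, and the use of `KeyError` are hypothetical stand-ins, and the real method writes into the executing notebook rather than into a dict.

```python
from collections import namedtuple

# Illustrative stand-in for the scrap model.
Scrap = namedtuple("Scrap", ["name", "data", "encoder", "display"])


def reglue_sketch(source, target, name, new_name=None, raise_on_missing=True):
    """Copy the scrap called `name` from `source` into `target`."""
    if name not in source:
        message = "No scrap found with name '{}'".format(name)
        if raise_on_missing:
            # KeyError is a stand-in; the library may raise its own exception type.
            raise KeyError(message)
        print(message)
        return False
    final_name = new_name or name
    # The payload is copied unchanged; only the name may differ.
    target[final_name] = source[name]._replace(name=final_name)
    return True


source_scraps = {
    "table_scrap": Scrap("table_scrap", [[1, 2], [3, 4]], "json", None),
}
current_scraps = {}

copied = reglue_sketch(
    source_scraps, current_scraps, "table_scrap", "old_table_scrap"
)
skipped = reglue_sketch(
    source_scraps, current_scraps, "maybe_missing", raise_on_missing=False
)
```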
read_notebooks reads many notebooks
Reads all notebooks located in a given path into a Scrapbook object.
book = sb.read_notebooks('path/to/notebook/collection/')
book.notebooks
The path reuses papermill's registered iorw to list and read files from various
sources, such that non-local URLs can load data.
book = sb.read_notebooks('s3://bucket/key/prefix/to/notebook/collection/')
The Scrapbook (book in this example) can be used to recall all scraps across
the collection of notebooks:
book.notebook_scraps
book.scraps
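One way to picture the difference between a per-notebook view and a collection-wide view is the sketch below. It assumes plain dicts in place of the library's objects, and the collision rule (a later notebook overrides an earlier one for the same scrap name) is an assumption of this example rather than documented behaviour.

```python
from collections import OrderedDict

# Per-notebook view: notebook name -> {scrap name -> value}.
notebook_scraps = OrderedDict([
    ("result1", {"scrap1": 1, "shared": "from result1"}),
    ("result2", {"scrap2": 2, "shared": "from result2"}),
])


def merge_scraps(per_notebook):
    """Flatten the per-notebook view into one name -> value lookup."""
    merged = {}
    for scraps in per_notebook.values():
        # Assumed policy: later notebooks win on duplicate names.
        merged.update(scraps)
    return merged


all_scraps = merge_scraps(notebook_scraps)
print(sorted(all_scraps))
```

When scrap names may repeat across notebooks, the per-notebook view is the safer one to query, since a flattened lookup has to pick a single winner.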
scraps_report displays a report about collected scraps
The Scrapbook collection can be used to generate a scraps_report on all the
scraps from the collection as a markdown structured output.
book.scraps_report()
This display can filter on scrap and notebook names, as well as enable or disable
an overall header for the display.
book.scraps_report(
    scrap_names=["scrap1", "scrap2"],
    notebook_names=["result1"],
    header=False
)
By default the report will only populate with visual elements. To also
report on data elements set include_data.
book.scraps_report(include_data=True)
papermill support
Finally, scrapbook provides two backwards-compatible features for deprecated
papermill capabilities:
book.papermill_dataframe
book.papermill_metrics
Encoders
Encoders are accessible by key names to Encoder objects registered
against the encoders.registry object. To register new data encoders,
simply call:
from scrapbook.encoders import registry as encoder_registry
encoder_registry.register("custom_encoder_name", MyCustomEncoder())
The encoder class must implement two methods, encode and decode:
class MyCustomEncoder(object):
    def encode(self, scrap):
        pass
    def decode(self, scrap):
        pass
These methods can transform scraps into a JSON object representing their contents or
location, and load those stored values back into the original data objects.
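As a concrete illustration of the encode/decode pair, the sketch below stores raw bytes as base64 text so that the value survives JSON storage. Everything here is hypothetical: `Base64BytesEncoder`, the `Scrap` named tuple, and the plain dict standing in for the registry are assumptions made so the example runs without scrapbook installed.

```python
import base64
import json
from collections import namedtuple

# Illustrative stand-in for the scrap model.
Scrap = namedtuple("Scrap", ["name", "data", "encoder", "display"])


class Base64BytesEncoder(object):
    """Hypothetical encoder that stores raw bytes as base64 text."""

    def encode(self, scrap):
        # bytes -> ASCII text, which JSON can carry.
        text = base64.b64encode(scrap.data).decode("ascii")
        return scrap._replace(data=text)

    def decode(self, scrap):
        # ASCII text -> the original bytes.
        return scrap._replace(data=base64.b64decode(scrap.data))


# A plain dict standing in for the encoder registry in this sketch.
registry = {"bytes_b64": Base64BytesEncoder()}

original = Scrap("blob", b"\x00\x01binary", "bytes_b64", None)
stored = registry[original.encoder].encode(original)

# The encoded form is JSON-safe, unlike the raw bytes.
serialized = json.dumps(stored.data)

restored = registry[stored.encoder].decode(stored)
```

The encoder name travels with the scrap, which is what lets a reader look up the matching decoder later.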
text
A basic string storage format that saves data as python strings.
sb.glue("hello", "world", "text")
json
sb.glue("foo_json", {"foo": "bar", "baz": 1}, "json")
pandas
sb.glue("pandas_df",pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}), "pandas")
papermill's deprecated record feature
scrapbook provides a robust and flexible recording schema. This library replaces papermill's existing
record functionality.
Documentation for papermill record exists on ReadTheDocs.
In brief, the deprecated record function:
pm.record(name, value): enables values to be saved with the notebook [API documentation]
pm.record("hello", "world")
pm.record("number", 123)
pm.record("some_list", [1, 3, 5])
pm.record("some_dict", {"a": 1, "b": 2})
pm.read_notebook(notebook): pandas could be used later to recover recorded
values by reading the output notebook into a dataframe. For example:
nb = pm.read_notebook('notebook.ipynb')
nb.dataframe
Rationale for Papermill record deprecation
Papermill's record function was deprecated due to these limitations and challenges:
- The record function didn't follow papermill's pattern of linear execution
of a notebook. It was awkward to describe record as an additional
feature of papermill, and really felt like describing a second less
developed library.
- Recording / Reading required data translation to JSON for everything. This is
a tedious, painful process for dataframes.
- Reading recorded values into a dataframe would result in unintuitive dataframe
shapes.
- Less modularity and flexibility than other papermill components where custom
operators can be registered.
To overcome these limitations in Papermill, a decision was made to create
Scrapbook.