stdflow
Data flow tool that transforms your notebooks and Python files into
pipeline steps by standardizing data input / output (for data
science projects).
Create clean data flow pipelines just by replacing your pd.read_csv()
and df.to_csv() calls with sf.load() and sf.save().
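For example, a plain pandas read/write can be swapped for the stdflow equivalent. This is a minimal sketch; the paths and step names below are illustrative and not part of the demo project:

import pandas as pd
import stdflow as sf

# Before: plain pandas I/O with hand-written paths
df = pd.read_csv("data/countries/raw/countries.csv")
df.to_csv("data/countries/clean/countries.csv")

# After: the same I/O through stdflow, which standardizes where files live
df = sf.load(root="data/", attrs=['countries'], step='raw',
             file_name='countries.csv', method=pd.read_csv)
sf.save(df, root="data/", attrs=['countries'], step='clean',
        file_name='countries.csv', method=pd.DataFrame.to_csv)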
Documentation
Install
pip install stdflow
How to use
Pipelines
from stdflow import StepRunner
from stdflow.pipeline import Pipeline

dm = "../demo_project/notebooks/"

# The three forms below are equivalent ways to build the same pipeline.

# 1. From a list of steps
ingestion_ppl = Pipeline([
    StepRunner(dm + "01_ingestion/countries.ipynb"),
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
])

# 2. From steps passed as positional arguments
ingestion_ppl = Pipeline(
    StepRunner(dm + "01_ingestion/countries.ipynb"),
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
)

# 3. By adding steps one by one (a notebook path is also accepted)
ingestion_ppl = Pipeline()
ingestion_ppl.add_step(StepRunner(dm + "01_ingestion/countries.ipynb"))
ingestion_ppl.add_step(dm + "01_ingestion/world_happiness.ipynb")
ingestion_ppl
================================
PIPELINE
================================
STEP 1
path: ../demo_project/notebooks/01_ingestion/countries.ipynb
vars: {}
STEP 2
path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
vars: {}
================================
Run the pipeline
ingestion_ppl.run(verbose=True, kernel=":any_available")
=================================================================================
01. ../demo_project/notebooks/01_ingestion/countries.ipynb
=================================================================================
Variables: {}
using kernel: python3
Path: ../demo_project/notebooks/01_ingestion/countries.ipynb
Duration: 0 days 00:00:00.603051
Env: {}
Notebook executed successfully.
=================================================================================
02. ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
=================================================================================
Variables: {}
using kernel: python3
Path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
Duration: 0 days 00:00:00.644909
Env: {}
Notebook executed successfully.
Load and save data
Option 1: Specify All Parameters
import stdflow as sf
import pandas as pd

df = sf.load(
    root="../demo_project/data/",
    attrs=['countries'],
    step='created',
    version=':last',
    file_name='countries.csv',
    method=pd.read_csv,
    verbose=False,
)

sf.save(
    df,
    root="../demo_project/data/",
    attrs='countries/',
    step='loaded',
    version='%Y-03',
    file_name='countries.csv',
    method=pd.DataFrame.to_csv,
)
attrs=countries/::step_name=loaded::version=2023-03::file_name=countries.csv
Each time you perform a save, a metadata.json file is created in the
folder. This keeps track of how your data was created and other
information.
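The metadata is plain JSON, so you can also inspect it directly. A minimal sketch; the exact path depends on your save parameters and the schema is not documented here:

import json

# Hypothetical path, following the layout described under "Data Organization" below
with open("../demo_project/data/countries/step_loaded/v_2023-03/metadata.json") as f:
    metadata = json.load(f)
print(metadata)  # see what stdflow recorded about the files in this folder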
Option 2: Use default variables
import stdflow as sf
sf.reset()
sf.root = "../demo_project/data/"
sf.attrs = 'countries'
sf.step_in = 'loaded'
sf.step_out = 'formatted'
df = sf.load()
sf.save(df)
attrs=countries::step_name=formatted::version=202310101716::file_name=countries.csv
Note that everything we did at the package level can also be done with the Step
class. When you have multiple steps in one notebook, you can create one
Step object per step. stdflow (sf) at the package level is a singleton
instance of Step.
from stdflow import Step

step = Step(
    root="../demo_project/data/",
    attrs='countries',
    step_in='formatted',
    step_out='pre_processed'
)
step.root = "../demo_project/data/"

df = step.load(version=':last', file_name=":auto", verbose=True)
step.save(df, verbose=True)
INFO:stdflow.step:Loading data from ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv
INFO:stdflow.step:Data loaded from ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv
INFO:stdflow.step:Saving data to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Data saved to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Saving metadata to ../demo_project/data/countries/step_pre_processed/v_202310101716/
attrs=countries::step_name=pre_processed::version=202310101716::file_name=countries.csv
Do not
- Save in the same directory from different steps, because this will erase the metadata from the previous step. Instead, give each step its own output step, as in the sketch below.
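A minimal sketch of the safe pattern (the step names here are illustrative): each step gets its own step_out, so each one writes to its own directory and its own metadata.json.

from stdflow import Step

clean = Step(root="../demo_project/data/", attrs='countries',
             step_in='raw', step_out='cleaned')
featurize = Step(root="../demo_project/data/", attrs='countries',
                 step_in='cleaned', step_out='features')
# clean.save(...) and featurize.save(...) now target different step_ directories,
# so neither overwrites the other's metadata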
Data visualization
import stdflow as sf
step.save(df, verbose=True, export_viz_tool=True)
INFO:stdflow.step:Saving data to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Data saved to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Saving metadata to ../demo_project/data/countries/step_pre_processed/v_202310101716/
INFO:stdflow.step:Exporting viz tool to ../demo_project/data/countries/step_pre_processed/v_202310101716/
attrs=countries::step_name=pre_processed::version=202310101716::file_name=countries.csv
This command exports a metadata_viz folder in the same folder as the
data you exported. The metadata to display is saved in the metadata.json
file.
To display it, you need both the file and the folder on your local
machine (download them if you are working on a server). Then open the
HTML file from your file explorer; it should open in your browser and
let you upload the metadata.json file.
Data Organization
Format
The data folder organization is systematic and is used by the load
and save functions. It follows this format (a worked example follows the list below):
root_data_folder/attrs_1/attrs_2/…/attrs_n/step_name/version/file_name
where:
- root_data_folder: path to the root of your data folder; it is not exported in the metadata
- attrs: information used to classify your dataset (e.g. country, client, …)
- step_name: name of the step; always starts with step_
- version: version of the step; always starts with v_
- file_name: name of the file; can be anything
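For example, the path logged in the Step example above,
../demo_project/data/countries/step_formatted/v_202310101716/countries.csv,
breaks down as root_data_folder=../demo_project/data/, attrs=countries,
step_name=step_formatted, version=v_202310101716 and file_name=countries.csv.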
Each folder is the output of a step. It contains a metadata.json file
with information about all the files in the folder and how they were generated.
It can also contain an HTML page (if you set html_export=True in save())
that lets you visualize the pipeline and your metadata.
Best Practices:
- Do not use sf.reset as part of your final code
- In one step, export to only one path (apart from the version), i.e. only one combination of attrs and step_name per step
- Do not create sub-directories within the export (i.e. the version folder is the last depth). If you need similar operations for different datasets, create pipelines (see the sketch below)
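A sketch of that last point, assuming StepRunner accepts per-notebook variables (the pipeline repr above shows a vars field, but the parameter name and the notebook path below are assumptions, not part of the demo project):

from stdflow import StepRunner
from stdflow.pipeline import Pipeline

dm = "../demo_project/notebooks/"
# Hypothetical: run the same formatting notebook once per dataset
# instead of nesting sub-directories inside one export
formatting_ppl = Pipeline(
    StepRunner(dm + "02_formatting/format.ipynb", variables={"attrs": "countries"}),        # parameter name assumed
    StepRunner(dm + "02_formatting/format.ipynb", variables={"attrs": "world_happiness"}),  # parameter name assumed
)
formatting_ppl.run(verbose=True, kernel=":any_available")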