Welcome to idact!

Idact, or Interactive Data Analysis Convenience Tools, is a Python 3.5+ library
that takes care of several tedious aspects of working with big data
on an HPC cluster.
Who is it for?
Data scientists or big data enthusiasts, who:
- Perform computations in Jupyter Notebook,
using libraries such as NumPy,
pandas,
Matplotlib,
or bokeh.
- Have access to an HPC cluster with Slurm
as the job scheduler.
- Would like to parallelize their computations across many nodes using
Dask.distributed, a library
for distributed computing.
- May find that it takes too much manual effort to deploy Jupyter Notebook
and Dask on the cluster each time they need it.
Requirements
Python 3.5+.
Client
Cluster
Installation
python -m pip install idact
If you're using Conda, you may want to update
your environment first:
conda update --all
Code samples
Accessing a cluster
A cluster can be accessed over SSH with a public/private key pair.
from idact import *
cluster = add_cluster(name="short-cluster-name",
                      user="user",
                      host="login-node.cluster.example.com",
                      port=22,
                      auth=AuthMethod.PUBLIC_KEY,
                      key="~/.ssh/id_rsa",
                      install_key=False)
node = cluster.get_access_node()
node.connect()
Tutorial:
01. Connecting to a cluster
Allocating and deallocating nodes
Nodes are allocated as a Slurm job.
Afterwards, they can be used for deployments.
import bitmath
nodes = cluster.allocate_nodes(nodes=8,
                               cores=12,
                               memory_per_node=bitmath.GiB(120),
                               walltime=Walltime(hours=1, minutes=30),
                               native_args={
                                   '--partition': 'debug',
                                   '--account': 'data-analysis-group'
                               })
try:
    nodes.wait(timeout=120.0)
except TimeoutError:
    nodes.cancel()
Tutorial:
02. Allocating nodes
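To give a feel for what such an allocation means in Slurm terms, here is a minimal sketch that assembles a comparable `sbatch` option list by hand. It is not idact's implementation: `format_slurm_time` and `build_sbatch_args` are hypothetical helpers, and the mapping of the allocation parameters to `--nodes`, `--mem`, and `--time` is an assumption. The core count is left out because its Slurm equivalent depends on how the job is laid out.

```python
from datetime import timedelta


def format_slurm_time(walltime):
    """Format a duration as a Slurm time limit string, D-HH:MM:SS."""
    total_seconds = int(walltime.total_seconds())
    days, remainder = divmod(total_seconds, 24 * 3600)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{}-{:02d}:{:02d}:{:02d}".format(days, hours, minutes, seconds)


def build_sbatch_args(nodes, memory_gib, walltime, native_args):
    """Hypothetical helper: assemble sbatch options for an allocation."""
    args = [
        "sbatch",
        "--nodes={}".format(nodes),
        "--mem={}G".format(memory_gib),  # memory per node
        "--time={}".format(format_slurm_time(walltime)),
    ]
    # Native arguments are passed through unchanged, in sorted order.
    for flag, value in sorted(native_args.items()):
        args.append("{}={}".format(flag, value))
    return args


sbatch_args = build_sbatch_args(
    nodes=8,
    memory_gib=120,
    walltime=timedelta(hours=1, minutes=30),
    native_args={"--partition": "debug",
                 "--account": "data-analysis-group"},
)
print(" ".join(sbatch_args))
```

The printed line only illustrates the kind of request the scheduler receives; the library takes care of submitting and tracking the job.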
Deploying Jupyter Notebook
Jupyter Notebook is deployed on a cluster node,
and made accessible through an SSH tunnel.
nb = nodes[0].deploy_notebook()
nb.open_in_browser()
Tutorial:
03. Deploying Jupyter
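For comparison, the manual equivalent of such a tunnel is a local port forward through the login node. The sketch below only builds the OpenSSH command and does not run it. `build_tunnel_command` is a hypothetical helper, the compute node name is made up, and port 8888 is assumed because it is Jupyter's default; idact chooses and manages its own ports.

```python
def build_tunnel_command(user, login_host, node_host, remote_port, local_port):
    """Return an OpenSSH command that forwards a local port via the login node."""
    forward = "{}:{}:{}".format(local_port, node_host, remote_port)
    return [
        "ssh",
        "-N",           # do not run a remote command, only forward ports
        "-L", forward,  # local_port -> node_host:remote_port
        "{}@{}".format(user, login_host),
    ]


command = build_tunnel_command(
    user="user",
    login_host="login-node.cluster.example.com",
    node_host="compute-node-01",  # hypothetical compute node name
    remote_port=8888,             # assumed Jupyter port
    local_port=8888,
)
print(" ".join(command))
```

With a tunnel like this in place, the notebook would be reachable at localhost on the chosen local port.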
Deploying Dask.distributed
A Dask.distributed scheduler and workers are deployed
on cluster nodes, and their dashboards are made available
through SSH tunnels.
dd = deploy_dask(nodes[1:])
client = dd.get_client()
client.submit(...)
dd.diagnostics.open_all()
Tutorial:
04. Deploying Dask,
09. Demo analysis
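The Dask client follows the submit-and-wait pattern of the standard library's `concurrent.futures`: `submit` returns a future immediately, and `result()` blocks until the value is ready. The sketch below shows that pattern with a local thread pool as a stand-in, so it runs without a cluster. The `square` function is only an illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Toy task standing in for a real computation."""
    return x * x


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns immediately with a future for each task.
    futures = [executor.submit(square, n) for n in range(5)]
    # result() blocks until the corresponding task has finished.
    results = [future.result() for future in futures]

print(results)
```

With a Dask client, the same calls would run on the cluster workers instead of local threads.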
Managing cluster config
Local and remote cluster configuration can be saved, loaded,
and copied to and from the cluster.
save_environment()
load_environment()
push_environment(cluster)
pull_environment(cluster)
Tutorials:
01. Connecting to a cluster,
05. Configuring idact on a cluster
Managing deployments
Deployment objects can be serialized and copied between running program
instances, local or remote.
cluster.push_deployment(nodes)
cluster.push_deployment(nb)
cluster.push_deployment(dd)
cluster.pull_deployments()
Tutorials:
06. Working on a cluster,
07. Adjusting timeouts
Quick deployment app
The quick deployment app allocates nodes and deploys Jupyter Notebook
from the command line:
idact-notebook short-cluster-name --nodes 3 --walltime 0:20:00
Tutorial:
08. Using the quick deployment app
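The walltime in the command above reads as hours, minutes, and seconds. As a small sketch, assuming that `H:MM:SS` layout, such a string can be turned into a duration with the standard library. `parse_walltime` is a hypothetical helper, not part of the app, and the app may accept other formats as well.

```python
from datetime import timedelta


def parse_walltime(text):
    """Parse an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


walltime = parse_walltime("0:20:00")
print(walltime.total_seconds() / 60)  # duration in minutes
```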
Documentation
The documentation contains a detailed API description, tutorial notebooks,
and other helpful information.
Source code
The source code is available on GitHub.
License
MIT License.
This library was developed under the supervision of Leszek Grzanka, PhD
as a final project of the BEng in Computer Science program
at the Faculty of Computer Science, Electronics and Telecommunications
at AGH University of Science and Technology, Krakow.