A Python framework for structuring, managing and processing FAIR scientific image datasets
Overview
Marimba is a Python framework designed for the efficient processing of
FAIR (Findable, Accessible, Interoperable, and Reusable) scientific image
datasets. Developed collaboratively by CSIRO and MBARI, Marimba
provides core functionality for structuring, processing, and ensuring the FAIR compliance of scientific image data.
The framework features a Typer Command Line Interface (CLI) enhanced by
Rich for an improved user experience. Marimba offers a well-defined API (Application
Programming Interface) that enables seamless integration with external scripts and Graphical User Interfaces (GUIs).
Marimba is particularly well-suited for researchers, data scientists, and engineers working in marine science and other
fields that require large-scale and streamlined image dataset management. Typical use cases include automating the
processing of imagery from underwater vehicles, integrating multi-instrument data for comprehensive analysis, and
preparing datasets for publication in FAIR-compliant repositories.
Design
Marimba defines three core concepts:
- Project: A Marimba Project is a standardised, high-level structure designed to manage the entire processing
  workflow for producing FAIR image datasets. It serves as the primary context for importing, processing, packaging and
  distributing these datasets, with all high-level operations managed by the core Marimba system.
- Pipelines: A Marimba Pipeline encapsulates the implementation of all processing stages for a single- or
  multi-instrument system. Each Pipeline operates in isolation, containing all necessary logic to fully process image
  data, which may include multiple image or video sources, associated navigational data, and other ancillary
  information. The core Marimba system manages Pipeline execution, so developing a custom Pipeline is the only
  requirement for processing FAIR image datasets for new instruments or systems with Marimba.
- Collections: A Marimba Collection is a set of data that is imported into a Marimba Project and can include a
  diverse aggregation of data from a single- or multi-instrument system. Each Collection is isolated within the context
  of Marimba's core processing environment. During execution, Marimba Pipelines operate on each Collection in parallel,
  applying their specialised processing to the data contained within each Collection.
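The relationship between Pipelines and Collections can be pictured with a small sketch. This is not Marimba's implementation: `process_collection` and `run_pipeline` are hypothetical names, and a thread pool simply stands in for whatever parallel execution the core system actually uses. It only illustrates one processing step being applied independently to each Collection directory.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_collection(collection_dir: Path) -> tuple[str, int]:
    """Hypothetical per-Collection step: count the files in one Collection."""
    file_count = sum(1 for p in collection_dir.rglob("*") if p.is_file())
    return collection_dir.name, file_count


def run_pipeline(collection_dirs: list[Path]) -> dict[str, int]:
    """Apply the same processing step to every Collection in parallel."""
    with ThreadPoolExecutor() as pool:
        # Each Collection is handled on its own; no state is shared between them.
        return dict(pool.map(process_collection, collection_dirs))
```

Because each call receives only its own directory, a failure or slow run in one Collection does not need to affect the others, which is the point of keeping Collections isolated.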
Features
The Marimba framework offers a number of advanced features designed for the specific needs of scientific image
processing:
- Project Structuring and Management:
  - Marimba enables a systematic approach to structuring and managing scientific image data projects throughout the
    entire processing workflow
  - Core features of Marimba manage the parallelised execution of isolated Pipelines on sandboxed Collections, enabling
    full automation of the processing workflow
  - Marimba supports the use of hard links during processing to prevent data duplication and optimise storage efficiency
  - Marimba provides a unified interface for importing, processing, packaging, and distributing datasets, ensuring
    consistency and efficiency across all stages
- File and Metadata Management:
  - Custom Marimba Pipelines support the implementation of specific naming conventions to automatically rename image
    files
  - Marimba supports user prompting to manually input Pipeline-level and Collection-level metadata
  - Metadata configuration dictionaries can optionally be passed via the CLI to automate manual input stages
  - Marimba provides extensive capabilities for managing image metadata, including:
    - Complying with the iFDO (image FAIR Digital Object) standard for interoperability and reusability
    - Integrating image datasets with corresponding navigation and sensor data, when available
    - Embedding metadata directly into image EXIF tags for greater accessibility
- Standard Image and Video Library:
  - Marimba provides a comprehensive standard library of image and video processing modules that can:
    - Convert, compress and resize imagery using Pillow
    - Transcode, segment and extract frames from videos using FFmpeg (to be integrated)
    - Automatically generate thumbnails for images and videos and create composite overview images for rapid
      assessment of image datasets
    - Detect duplicate, blurry, or improperly exposed images using CleanVision (to be integrated)
- Dataset Packaging and Distribution:
  - Marimba offers a standardised approach for packaging processed FAIR image datasets, including:
    - Collating all processing logs to archive the entire dataset provenance, ensuring transparency and traceability
    - Generating file manifests to facilitate dataset validation
    - Dynamically generating summaries of image and video dataset statistics
  - Marimba also provides mechanisms for distributing packaged FAIR image datasets, including:
    - Uploading FAIR image datasets to S3 buckets
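Two of the features above, hard links and file manifests, are easy to illustrate with the Python standard library. The sketch below is an assumption-laden illustration, not Marimba's code: the function names `link_or_copy`, `sha256_of` and `build_manifest` are hypothetical, and the choice of SHA-256 and a path-to-digest mapping is an assumption about what a manifest might contain.

```python
import hashlib
import os
import shutil
from pathlib import Path


def link_or_copy(src: Path, dst: Path) -> None:
    """Place src at dst as a hard link, falling back to a copy if linking fails."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(src, dst)  # shares the same data on disk, so nothing is duplicated
    except OSError:
        shutil.copy2(src, dst)  # e.g. different filesystem, or links unsupported


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images and videos are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to dataset_dir) to its SHA-256 digest."""
    files = sorted(p for p in dataset_dir.rglob("*") if p.is_file())
    return {p.relative_to(dataset_dir).as_posix(): sha256_of(p) for p in files}
```

A manifest of this kind lets a recipient recompute the digests after transfer and compare them against the recorded values to validate the dataset.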
Installation
Marimba can be installed using the Python pip package manager. Ensure that Python version 3.10 or greater is installed
in your environment before proceeding.
To install Marimba, open your terminal or command prompt and run the following command:
pip install marimba
This will download and install the latest version of Marimba along with its required dependencies. After installation,
you can verify the installation by running Marimba and displaying the default help menu:
marimba
Marimba has minimal system-level dependencies, such as ffmpeg, which are required for its operation. On Ubuntu you can
install ffmpeg with:
sudo apt install ffmpeg
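If you want to confirm the prerequisites described above from a script, a minimal check might look like the following. The function name `check_environment` is hypothetical, and the check only tests whether executables named `ffmpeg` and `marimba` can be found on the PATH; it does not verify their versions.

```python
import shutil
import sys


def check_environment(min_python: tuple = (3, 10)) -> dict:
    """Report whether the documented prerequisites appear to be satisfied."""
    return {
        # Python 3.10 or greater is required according to the instructions above.
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on the PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "marimba": shutil.which("marimba") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```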
To set up a Marimba development environment, please refer to the Environment Setup Guide, which
provides detailed instructions and guidelines for configuring your development environment.
Getting Started
Marimba offers a streamlined CLI that encompasses the entire post-acquisition data processing workflow. Below is a
minimal demonstration of the key CLI commands required to progress through all the Marimba processing stages.
- Create a new Marimba Project:
    marimba new project MY-PROJECT
    cd MY-PROJECT
- Create a new Marimba Pipeline:
    marimba new pipeline MY-INSTRUMENT https://path.to/my-instrument-pipeline.git
- Import new Marimba Collections:
    marimba import COLLECTION-ONE '/path/to/collection/one/'
    marimba import COLLECTION-TWO '/path/to/collection/two/'
- Process the imported Collections with the installed Pipelines:
    marimba process
- Package the FAIR image dataset:
    marimba package MY-FAIR-DATASET --version 1.0 --contact-name "Keiko Abe" --contact-email "keiko.abe@email.com"
For additional details and advanced usage, please refer to the Overview and CLI Usage Guide.
Note: Keiko Abe is a renowned Japanese marimba player and composer, widely
recognised for her role in establishing the marimba as a respected concert instrument.
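The import, process and package commands shown above can also be driven from a Python script. The sketch below uses only the subcommands and options that appear in this section; the helper names `build_commands` and `run_all` are hypothetical, and it assumes the Project and Pipeline have already been created.

```python
import shutil
import subprocess


def build_commands(collections: dict, dataset_name: str, version: str,
                   contact_name: str, contact_email: str) -> list:
    """Assemble the import, process and package commands as argument lists."""
    commands = [["marimba", "import", name, path] for name, path in collections.items()]
    commands.append(["marimba", "process"])
    commands.append([
        "marimba", "package", dataset_name,
        "--version", version,
        "--contact-name", contact_name,
        "--contact-email", contact_email,
    ])
    return commands


def run_all(commands: list, project_dir: str) -> None:
    """Run each command inside the Project directory, stopping at the first failure."""
    if shutil.which("marimba") is None:
        raise RuntimeError("marimba executable not found on PATH")
    for command in commands:
        # check=True raises CalledProcessError if a command exits with a non-zero status.
        subprocess.run(command, cwd=project_dir, check=True)
```

Passing each command as a list of arguments avoids shell quoting issues with values such as contact names that contain spaces.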
Documentation
Marimba offers extensive documentation to support both users and developers:
Users
If you're interested in creating your own Pipelines to process image data, Marimba provides a comprehensive guide to
help you get started. This documentation covers everything from setting up a Pipeline git repository to implementing
custom processing pipelines.
- Overview and CLI Usage Guide: Gain an architectural understanding of Marimba and explore the CLI commands and
  options available for Pipeline management and execution.
- Pipeline Implementation Guide: This guide offers a step-by-step tutorial on how to design and
  tailor Marimba Pipelines to suit your unique data processing requirements. From initial setup to advanced
  customisation techniques, learn everything you need to efficiently use Marimba for your specific projects.
Developers
For developers who want to script Marimba using the CLI or leverage the Marimba API for more advanced integrations, we
offer detailed documentation that covers all aspects of Marimba’s capabilities.
- CLI Scripting Guide: Learn how to automate data processing workflows using Marimba's CLI. This
  guide provides detailed instructions and examples to help you streamline your data processing operations.
- API Reference: Explore the Marimba API to integrate its functionalities into your applications or
  workflows. The reference includes detailed descriptions of Python API endpoints and their usage.
These resources are designed to help you make the most of Marimba, whether you are processing large datasets or
integrating Marimba into your existing systems.
Contributing
Marimba is an open-source project, and we welcome feedback and contributions from the community. If you have ideas or
suggestions to improve Marimba, please submit them using our GitHub issue tracker. For enhancements or new features,
we encourage you to fork the repository and submit a pull request. Please refer to the Contributing Guide for
detailed guidelines on how to contribute.
License
This project is distributed under the CSIRO BSD/MIT license.
Contact
For inquiries related to this repository, please contact:
Acknowledgments
Marimba was developed as a collaborative effort between CSIRO and MBARI, two leading institutions in marine science
and technology. The conceptual foundation of Marimba was formulated at CSIRO in late 2022. Substantial elements of
its initial design and implementation were developed during the CSIRO Image Data Collection and Delivery Hackathon
in early 2023, with further collaborative advancements between CSIRO and MBARI in late 2023. Marimba was
open-sourced on GitHub and PyPI in mid-2024 and officially launched at the Marine Imaging Workshop 2024.
The development of this project has greatly benefited from the contributions of the following people:
- Chris Jackett - CSIRO Environment
- Kevin Barnard - MBARI
- Nick Mortimer - CSIRO Environment
- David Webb - CSIRO NCMI
- Aaron Tyndall - CSIRO NCMI
- Franzis Althaus - CSIRO Environment
- Candice Untiedt - CSIRO Environment
- Carlie Devine - CSIRO Environment
- Bec Gorton - CSIRO Environment
- Ben Scoulding - CSIRO Environment