articat

Minimal metadata catalog to store and retrieve metadata about data artifacts.
Getting started
At a high level, articat is simply a key-value store. The value is the Artifact metadata.
The key, a.k.a. "Artifact Spec", consists of:
- globally unique id
- optional timestamp: partition
- optional arbitrary string: version
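To make the key structure concrete, here is a minimal sketch of a key-value catalog keyed by an Artifact Spec. It uses only the standard library. The names ArtifactSpec, ToyCatalog, publish and fetch are hypothetical and chosen for illustration; this is not how articat itself is implemented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class ArtifactSpec:
    """Hypothetical key: globally unique id plus optional partition and version."""

    id: str
    partition: Optional[date] = None
    version: Optional[str] = None


class ToyCatalog:
    """Hypothetical in-memory key-value store mapping ArtifactSpec to metadata."""

    def __init__(self) -> None:
        self._store: Dict[ArtifactSpec, dict] = {}

    def publish(self, spec: ArtifactSpec, metadata: dict) -> None:
        # Refuse to overwrite an existing key, mimicking immutable metadata.
        if spec in self._store:
            raise ValueError(f"{spec} is already published")
        self._store[spec] = dict(metadata)

    def fetch(self, spec: ArtifactSpec) -> dict:
        # Return a copy so callers cannot mutate the stored metadata.
        return dict(self._store[spec])


catalog = ToyCatalog()
spec = ArtifactSpec("foo", partition=date(1643, 1, 4))
catalog.publish(spec, {"description": "the answer"})

# An equal spec built independently retrieves the same metadata.
fetched = catalog.fetch(ArtifactSpec("foo", partition=date(1643, 1, 4)))
```

Because the spec is a frozen dataclass, two specs with the same id, partition and version compare equal and hash identically, which is what lets them act as a lookup key.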
To publish a file system Artifact (FSArtifact):
from articat import FSArtifact
from pathlib import Path
from datetime import date
with FSArtifact.partitioned("foo", partition=date(1643, 1, 4)) as fsa:
    # Write some data, then stage it as part of the Artifact
    data_path = Path("/tmp/data")
    data_path.write_text("42")
    fsa.stage(data_path)
    # Attach a human-readable description to the metadata
    fsa.metadata.description = "Answer to the Ultimate Question of Life, the Universe, and Everything"
To retrieve the metadata about the Artifact above:
from articat.fs_artifact import FSArtifact
from datetime import date
from pathlib import Path
fsa = FSArtifact.partitioned("foo", partition=date(1643, 1, 4)).fetch()
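fsa.id  # "foo"
fsa.created  # creation timestamp
fsa.partition  # the partition given at publish time
fsa.metadata.description  # the description set at publish time
fsa.main_dir  # location of the staged data
Path(fsa.joinpath("data")).read_text()  # "42"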
Features
- store and retrieve metadata about your data artifacts
- no long-running services (low maintenance)
- data publishing utils built in
- IO/data format agnostic
- immutable metadata
- development mode
Artifact flavours
Currently available Artifact flavours:
- FSArtifact: metadata/utils for files or objects (supports: local FS, GCS, S3 and more)
- BQArtifact: metadata/utils for BigQuery tables
- NotebookArtifact: metadata/utils for Jupyter Notebooks
Development mode
To ease development of Artifacts, articat supports development/dev mode.
A development Artifact can be indicated by the dev parameter (preferred), or by a
_dev prefix in the Artifact id. Dev mode supports:
- overwriting Artifact metadata
- configuring separate locations (e.g. dev_prefix for FSArtifact), with potentially different retention periods etc.
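The dev mode rules above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not articat's code: is_dev, storage_prefix, ToyMetadataStore and the example paths /data/prod and /data/dev are all hypothetical.

```python
from typing import Dict, Tuple

DEV_ID_PREFIX = "_dev"  # assumption: dev ids start with this literal prefix


def is_dev(artifact_id: str, dev: bool = False) -> bool:
    """Hypothetical check: dev flag (preferred) or "_dev" prefix in the id."""
    return dev or artifact_id.startswith(DEV_ID_PREFIX)


def storage_prefix(
    artifact_id: str,
    dev: bool = False,
    prefix: str = "/data/prod",  # hypothetical location
    dev_prefix: str = "/data/dev",  # hypothetical location
) -> str:
    """Pick a separate location for dev Artifacts."""
    return dev_prefix if is_dev(artifact_id, dev) else prefix


class ToyMetadataStore:
    """Hypothetical store: metadata is immutable unless the Artifact is dev."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, bool], dict] = {}

    def publish(self, artifact_id: str, metadata: dict, dev: bool = False) -> None:
        key = (artifact_id, is_dev(artifact_id, dev))
        # Only dev Artifacts may overwrite existing metadata.
        if key in self._store and not key[1]:
            raise ValueError(f"{artifact_id!r} is already published")
        self._store[key] = dict(metadata)

    def fetch(self, artifact_id: str, dev: bool = False) -> dict:
        return dict(self._store[(artifact_id, is_dev(artifact_id, dev))])


store = ToyMetadataStore()
store.publish("foo", {"rev": 1}, dev=True)
store.publish("foo", {"rev": 2}, dev=True)  # allowed: dev overwrite
dev_rev = store.fetch("foo", dev=True)["rev"]

store.publish("bar", {"rev": 1})
try:
    store.publish("bar", {"rev": 2})  # rejected: non-dev metadata is immutable
    overwrite_rejected = False
except ValueError:
    overwrite_rejected = True
```

Keying the toy store on both the id and the dev flag keeps dev and non-dev entries apart, so iterating on a dev Artifact never touches a published one.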
Backend
- local: mostly for testing/demo, metadata is stored locally (configurable, default: ~/.config/articat/local)
- gcp_datastore: metadata is stored in the Google Cloud Datastore
Configuration
articat configuration can be provided via the API or via configuration files. By default, configuration
is loaded from ~/.config/articat/articat.cfg and from articat.cfg in the current working directory. You
can also point at a configuration file via the environment variable ARTICAT_CONFIG.
You can use local mode without a configuration file. Available options:
[main]
[fs]
[gcp]
[bq]
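As a rough illustration of how such a lookup could work, the sketch below resolves the candidate configuration files and reads them with the standard library's configparser. It is not articat's loader: candidate_config_paths, load_config and the option name example_option are hypothetical, and the precedence shown (later files override earlier ones, with ARTICAT_CONFIG last) is an assumption.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Iterable, List, Mapping


def candidate_config_paths(env: Mapping[str, str], cwd: Path, home: Path) -> List[Path]:
    """Hypothetical helper listing config files in assumed precedence order."""
    paths = [home / ".config" / "articat" / "articat.cfg", cwd / "articat.cfg"]
    override = env.get("ARTICAT_CONFIG")
    if override:
        paths.append(Path(override))
    return paths


def load_config(paths: Iterable[Path]) -> configparser.ConfigParser:
    """Read every existing file; values from later files win."""
    parser = configparser.ConfigParser()
    parser.read([str(p) for p in paths])  # missing files are skipped silently
    return parser


# Demo in a throwaway directory so no real configuration is touched.
with tempfile.TemporaryDirectory() as tmp:
    home, cwd = Path(tmp) / "home", Path(tmp) / "work"
    cwd.mkdir()
    # "example_option" is a made-up key used only for this demo.
    (cwd / "articat.cfg").write_text("[main]\nexample_option = 1\n")
    paths = candidate_config_paths({}, cwd=cwd, home=home)
    config = load_config(paths)
    option_value = config.get("main", "example_option")
```

In real use the arguments would be os.environ, Path.cwd() and Path.home(). With no configuration file present the parser is simply empty, which matches the note that local mode works without one.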
Our/example setup
Below you can see a diagram of our setup. articat is just one piece of our system and solves a specific problem. This should give you an idea of where it might fit into your environment: