ghga-datasteward-kit
Utilities for data stewards interacting with GHGA infrastructure.
This package can be installed using pip:
pip install ghga-datasteward-kit
Note that some commands can only be used by Central Data Stewards, while others address Local Data Stewards. The workflows for both roles are outlined in the following paragraphs:
In v1.1 of the Archive, Local Data Stewards are responsible for (A) preparing the metadata of a submission and (B) for encrypting and uploading the corresponding files to the Data Hub's S3-compatible Object Storage.
The data steward kit has no functionality to help with metadata preparation; however, the step is described here for completeness of the workflow.
To define metadata for a submission, you have two options:
Option 1: Use an Excel spreadsheet (please do not use Google Spreadsheets for data protection reasons). Templates can be found here. You may validate the metadata before submission.
Option 2: Directly specify the metadata using JSON compliant with our LinkML schema. Validation of the metadata can be achieved using the GHGA Metadata Validator.
Once your spreadsheet or JSON file has passed validation, you may send the metadata to the Central Data Steward.
This is achieved with the data steward kit in the following steps:
Generate credentials: The kit interacts with services at GHGA Central. To authenticate yourself against these services, you need to create a set of credentials using the ghga-datasteward-kit generate-credentials command. Please see this section for further details.
Encrypt and Upload: File encryption and upload to the S3-compatible object storage are done in one go. This is achieved using either the ghga-datasteward-kit files upload command for uploading a single file or the ghga-datasteward-kit files batch-upload command for uploading multiple files at once. Legacy versions of these subcommands, prefixed with legacy-, exist for compatibility reasons. Please see this section for further details. The upload outputs one summary JSON per uploaded file. On the normal upload path, the encryption secret is automatically transferred to GHGA Central; for the legacy commands, the encryption secret is contained in the summary JSON and will be exchanged for a secret during ingest.
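As an illustration, a single-file upload might be invoked as follows (a minimal sketch; the flag names and paths are assumptions, so consult ghga-datasteward-kit files upload --help for the authoritative options):
# A minimal sketch; --alias, --input-path, and --config-path are assumptions:
ghga-datasteward-kit files upload \
  --alias my_file_alias \
  --input-path ./data/my_file.fastq.gz \
  --config-path ./upload_config.yaml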
Once the upload of all files of a submission has completed, please notify the GHGA Central Data Steward and provide the summary JSONs obtained in step 2.
The Central Data Steward is responsible for ingesting the metadata and the upload summary files into the running system. This is performed with the following steps:
Generate credentials: To authenticate against services at GHGA Central, the ghga-datasteward-kit generate-credentials command is used. Please see this section for further details.
Transpile metadata: Convert the submission spreadsheet into JSON using the ghga-datasteward-kit metadata transpile command. Please see this section for further details.
Submit metadata: Submit the transpiled metadata using the ghga-datasteward-kit metadata submit command. Please see this section for further details.
Transform metadata: Prepare the submitted metadata for loading using the ghga-datasteward-kit metadata transform command. Please see this section for further details.
Load: To make files and metadata available in the running system, the ghga-datasteward-kit load command can be used.
Ingest upload metadata: Upload the file summary JSONs using the ghga-datasteward-kit files ingest-upload-metadata command. Please see this section for further details.
An overview of all commands is provided using:
ghga-datasteward-kit --help
The following paragraphs provide additional help for using the different commands.
To be performed by Local Data Stewards.
This command facilitates encrypting files using Crypt4GH and uploading the encrypted content to a (remote) S3-compatible object storage. This process runs through multiple steps, from encryption through upload to validation, in a single invocation.
The user needs to provide a config YAML containing information as described here.
An overview of important information about each upload is written to a file called <alias>.json in the output directory. It contains information that is needed for the later ingest into the running system.
Attention: Keep this output file in a safe, private location. If this file is lost, the uploaded file content becomes inaccessible.
In addition to the already existing batch upload command, which allows for parallel processing and transfer at the file level, v4.3.0 added an asynchronous task handler that parallelizes upload and download at the file-part level.
Since validation has moved from directly downloading the uploaded file to checking the content MD5, there is no clear preference for which mode of parallelism should be used, as no benchmarking has been done yet with the current changes.
By default, part-level parallelism is controlled by the client_max_parallel_transfers configuration option, which is set to a value of 10. If you want to disable it, the value has to be explicitly set to 1.
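As an example, part-level parallelism could be disabled by pinning the option in the upload config (a sketch; the file name upload_config.yaml is hypothetical, and the remaining required config fields are described in the section linked above):
# upload_config.yaml is a hypothetical file name; client_max_parallel_transfers
# is the option named above. Setting it to 1 disables part-level parallelism:
echo "client_max_parallel_transfers: 1" >> upload_config.yaml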
To be performed by Central Data Stewards only.
Upload all file summary JSONs (produced using the files (batch-)upload command) from the given directory to the running system and make the corresponding files available for download.
This command requires a configuration file as described here.
The following versions of the Datasteward Kit and the File Ingest Service are compatible:

| Datasteward Kit Version | File Ingest Service Version |
|---|---|
| >=4.5.0 | >=5.0.0 |
| >=4.4.0, <4.5.0 | >=4.0.0, <5 |
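A typical invocation might look like the following (a sketch only; the --config-path flag and file name are assumptions, so check the subcommand's --help output):
# Hedged sketch; the flag name and config file name are assumptions:
ghga-datasteward-kit files ingest-upload-metadata --config-path ./ingest_config.yaml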
To be performed by Central Data Stewards only.
The metadata subcommand groups the metadata-related commands.
Some of them require a configuration file as described here.
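Following the ingest workflow above, the three metadata steps might be chained like this (a minimal sketch; the arguments shown are assumptions, and each subcommand documents its actual options via --help):
# All arguments below are illustrative assumptions:
ghga-datasteward-kit metadata transpile ./submission_spreadsheet.xlsx
ghga-datasteward-kit metadata submit --config-path ./metadata_config.yaml
ghga-datasteward-kit metadata transform --config-path ./metadata_config.yaml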
To be performed by Central Data Stewards only.
The load command makes files and metadata available to users in the running system.
It needs configuration parameters as described here.
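For example (a sketch; the --config-path flag and file name are assumptions):
ghga-datasteward-kit load --config-path ./load_config.yaml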
A command to generate a token/hash pair for interacting with GHGA Central services.
The generated token file should not be moved to a different system and must never be shared with another user. The token hash (not the token itself) must be shared with the GHGA Central Operations Team. This process has to be done only once per data steward and system (if a data steward is working with multiple compute environments, one set of credentials should be created per environment).
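For example:
# Run once per data steward and per compute environment; the token hash it
# produces is what gets shared with the GHGA Central Operations Team:
ghga-datasteward-kit generate-credentials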
For setting up the development environment, we rely on the devcontainer feature of vscode in combination with Docker Compose.
To use it, you have to have Docker Compose as well as vscode with its "Remote - Containers" extension (ms-vscode-remote.remote-containers) installed.
Then open this repository in vscode and run the command
Remote-Containers: Reopen in Container
from the vscode "Command Palette".
This will give you a full-fledged, pre-configured development environment.
If you prefer not to use vscode, you could get a similar setup (without the editor specific features) by running the following commands:
# Execute in the repo's root dir:
cd ./.devcontainer
# build and run the environment with docker-compose
docker-compose up
# attach to the main container:
# (you can open multiple shell sessions like this)
docker exec -it devcontainer_app_1 /bin/bash
This repository is free to use and modify according to the Apache 2.0 License.