dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud.
The dsub user experience is modeled after traditional high-performance computing job schedulers like Grid Engine and Slurm. You write a script and then submit it to a job scheduler from a shell prompt on your local machine.
Today dsub supports Google Cloud as the backend batch job runner, along with a local provider for development and testing. With help from the community, we'd like to add other backends, such as Grid Engine, Slurm, Amazon Batch, and Azure Batch.
dsub is written in Python and requires Python 3.7 or higher.
dsub 0.4.7.
dsub 0.4.1.
dsub 0.3.10.
This is optional, but whether installing from PyPI or from GitHub, you are strongly encouraged to use a Python virtual environment. You can do this in a directory of your choosing.
python3 -m venv dsub_libs
source dsub_libs/bin/activate
Using a Python virtual environment isolates dsub library dependencies from other Python applications on your system.
Activate this virtual environment in any shell session before running dsub.
To deactivate the virtual environment in your shell, run the command:
deactivate
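A wrapper script can check whether a virtual environment is active before it calls dsub. The following is a minimal sketch, assuming the standard venv activate script, which exports VIRTUAL_ENV and has deactivate unset it. The function name venv_status is hypothetical and is not part of dsub.

```shell
# Report whether a Python virtual environment is active in this shell.
# Assumes the standard venv "activate" script, which exports VIRTUAL_ENV.
# venv_status is a hypothetical helper name, not something dsub provides.
venv_status() {
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "active: ${VIRTUAL_ENV}"
  else
    echo "inactive"
  fi
}

venv_status
```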
Alternatively, a set of convenience scripts is provided that activate the virtualenv before calling dsub, dstat, and ddel. They are in the bin directory. You can use these scripts if you don't want to activate the virtualenv explicitly in your shell.
While not used directly by dsub for the google-batch or google-cls-v2 providers, you are likely to want to install the command line tools found in the Google Cloud SDK.
If you will be using the local provider for faster job development, you will need to install the Google Cloud SDK: the local provider uses gsutil to ensure file operation semantics consistent with the Google dsub providers.
Run
gcloud init
gcloud will prompt you to set your default project and to grant credentials to the Google Cloud SDK.
Install dsub
Choose one of the following:
Install from PyPI
If necessary, install pip.
Install dsub:
pip install dsub
Install from GitHub
Be sure you have git installed. Instructions for your environment can be found on the git website.
Clone this repository:
git clone https://github.com/DataBiosphere/dsub
cd dsub
Install dsub (this will also install the dependencies):
python -m pip install .
Set up Bash tab completion (optional):
source bash_tab_complete
Minimally verify the installation by running:
dsub --help
(Optional) Install Docker.
This is necessary only if you're going to create your own Docker images or use the local provider.
After cloning the dsub repo, you can also use the Makefile by running:
make
This will create a Python virtual environment and install dsub into a directory named dsub_libs.
We think you'll find the local provider to be very helpful when building your dsub tasks. Instead of submitting a request to run your command on a cloud VM, the local provider runs your dsub tasks on your local machine.
The local provider is not designed for running at scale. It is designed to emulate running on a cloud VM such that you can rapidly iterate. You'll get quicker turnaround times and won't incur cloud charges using it.
Run a dsub job and wait for completion.
Here is a very simple "Hello World" test:
dsub \
--provider local \
--logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
--output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
--command 'echo "Hello World" > "${OUT}"' \
--wait
Note: TMPDIR is commonly set to /tmp by default on most Unix systems, although it is also often left unset. On some versions of macOS, TMPDIR is set to a location under /var/folders.
Note: The above syntax ${TMPDIR:-/tmp} is known to be supported by Bash, zsh, and ksh. The shell will expand TMPDIR, but if it is unset or empty, /tmp will be used.
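The fallback behavior of ${TMPDIR:-/tmp} can be checked without running dsub. This is a small sketch that prints the logging path from the example above for three TMPDIR states. The function name logging_path and the /var/folders value are made up for illustration.

```shell
# Print the --logging path from the "Hello World" example for a given TMPDIR state.
# The function body runs in a subshell, so the caller's TMPDIR is left untouched.
# logging_path is a hypothetical helper name used only for this illustration.
logging_path() (
  case "$1" in
    unset) unset TMPDIR ;;
    *)     TMPDIR="$1" ;;
  esac
  echo "${TMPDIR:-/tmp}/dsub-test/logging/"
)

logging_path unset                      # TMPDIR not set: falls back to /tmp
logging_path ""                         # TMPDIR empty: also falls back to /tmp
logging_path "/var/folders/xy/example"  # made-up macOS-style value: used as-is
```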
View the output file.
cat "${TMPDIR:-/tmp}/dsub-test/output/out.txt"
dsub currently supports the Cloud Life Sciences v2beta API from Google Cloud and is developing support for the Batch API from Google Cloud.
dsub supports the v2beta API with the google-cls-v2 provider. google-cls-v2 is the current default provider. dsub will be transitioning to make google-batch the default in coming releases.
The steps for getting started differ slightly as indicated in the steps below:
Sign up for a Google account and create a project.
Enable the APIs:
For the v2beta API (provider: google-cls-v2): Enable the Cloud Life Sciences, Storage, and Compute APIs
For the batch API (provider: google-batch):
Provide credentials so dsub can call Google APIs:
gcloud auth application-default login
Create a Google Cloud Storage bucket.
The dsub logs and output files will be written to a bucket. Create a bucket using the storage browser or run the command-line utility gsutil, included in the Cloud SDK.
gsutil mb gs://my-bucket
Change my-bucket to a unique name that follows the bucket-naming conventions.
(By default, the bucket will be in the US, but you can change or refine the location setting with the -l option.)
Run a very simple "Hello World" dsub job and wait for completion.
For the v2beta API (provider: google-cls-v2):
dsub \
--provider google-cls-v2 \
--project my-cloud-project \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
--output OUT=gs://my-bucket/output/out.txt \
--command 'echo "Hello World" > "${OUT}"' \
--wait
Change my-cloud-project to your Google Cloud project, and my-bucket to the bucket you created above.
For the batch API (provider: google-batch):
dsub \
--provider google-batch \
--project my-cloud-project \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
--output OUT=gs://my-bucket/output/out.txt \
--command 'echo "Hello World" > "${OUT}"' \
--wait
Change my-cloud-project to your Google Cloud project, and my-bucket to the bucket you created above.
The output of the script command will be written to the OUT file in Cloud Storage that you specify.
View the output file.
gsutil cat gs://my-bucket/output/out.txt
Where possible, dsub tries to support users being able to develop and test locally (for faster iteration) and then progressing to running at scale.
To this end, dsub provides multiple "backend providers", each of which implements a consistent runtime environment. The current providers are local, google-cls-v2, and google-batch.
More details on the runtime environment implemented by the backend providers can be found in dsub backend providers.
google-cls-v2 and google-batch
The google-cls-v2 provider is built on the Cloud Life Sciences v2beta API. This API is very similar to its predecessor, the Genomics v2alpha1 API. Details of the differences can be found in the Migration Guide.
The google-batch provider is built on the Cloud Batch API. Details of Cloud Life Sciences versus Batch can be found in this Migration Guide.
dsub largely hides the differences between the APIs, but there are a few differences to note:
google-batch requires jobs to run in one region
The --regions and --zones flags for dsub specify where the tasks should run. The google-cls-v2 provider allows you to specify a multi-region like US, multiple regions, or multiple zones across regions. With the google-batch provider, you must specify either one region or multiple zones within a single region.
dsub features
The following sections show how to run more complex jobs.
You can provide a shell command directly in the dsub command-line, as in the hello example above.
You can also save your script to a file, like hello.sh. Then you can run:
dsub \
... \
--script hello.sh
If your script has dependencies that are not stored in your Docker image, you can transfer them to the local disk. See the instructions below for working with input and output files and folders.
To get started more easily, dsub uses a stock Ubuntu Docker image. This default image may change at any time in future releases, so for reproducible production workflows, you should always specify the image explicitly.
You can change the image by passing the --image flag.
dsub \
... \
--image ubuntu:16.04 \
--script hello.sh
Note: your --image must include the Bash shell interpreter.
For more information on using the --image flag, see the image section in Scripts, Commands, and Docker.
You can pass environment variables to your script using the --env flag.
dsub \
... \
--env MESSAGE=hello \
--command 'echo ${MESSAGE}'
The environment variable MESSAGE will be assigned the value hello when your Docker container runs.
Your script or command can reference the variable like any other Linux environment variable, as ${MESSAGE}.
Be sure to enclose your command string in single quotes and not double quotes. If you use double quotes, the command will be expanded in your local shell before being passed to dsub. For more information on using the --command flag, see Scripts, Commands, and Docker.
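The quoting difference can be seen locally by capturing the string that would be handed to --command under each style. This is an illustrative sketch only: it does not call dsub, and the local value assigned to MESSAGE is made up.

```shell
# Compare the string dsub would receive as the --command value under each quoting style.
MESSAGE="local-value"   # made-up value set in the submitting shell

single_quoted='echo ${MESSAGE}'   # kept literal; the variable is resolved later in the container
double_quoted="echo ${MESSAGE}"   # expanded by the local shell before dsub ever sees it

printf '%s\n' "$single_quoted"
printf '%s\n' "$double_quoted"
```

With single quotes, the task sees the value supplied through --env. With double quotes, the task would echo whatever MESSAGE happened to be on the submitting machine.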
To set multiple environment variables, you can repeat the flag:
--env VAR1=value1 \
--env VAR2=value2
You can also set multiple variables, space-delimited, with a single flag:
--env VAR1=value1 VAR2=value2
dsub mimics the behavior of a shared file system using cloud storage bucket paths for input and output files and folders. You specify the cloud storage bucket path. Paths can be:
gs://my-bucket/my-file
gs://my-bucket/my-folder
gs://my-bucket/my-folder/*
See the inputs and outputs documentation for more details.
If your script expects to read local input files that are not already contained within your Docker image, the files must be available in Google Cloud Storage.
If your script has dependent files, you can make them available to your script by:
To upload the files to Google Cloud Storage, you can use the storage browser or gsutil. You can also run on data that’s public or shared with your service account, an email address that you can find in the Google Cloud Console.
To specify input and output files, use the --input and --output flags:
dsub \
... \
--input INPUT_FILE_1=gs://my-bucket/my-input-file-1 \
--input INPUT_FILE_2=gs://my-bucket/my-input-file-2 \
--output OUTPUT_FILE=gs://my-bucket/my-output-file \
--command 'cat "${INPUT_FILE_1}" "${INPUT_FILE_2}" > "${OUTPUT_FILE}"'
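In this example:
A file is copied from gs://my-bucket/my-input-file-1 to a path on the data disk, and that path is set in the environment variable ${INPUT_FILE_1}.
A file is copied from gs://my-bucket/my-input-file-2 to a path on the data disk, and that path is set in the environment variable ${INPUT_FILE_2}.
The --command can reference the file paths using the environment variables.
Also in this example:
A path on the data disk is set in the environment variable ${OUTPUT_FILE}.
The command writes its output file to the location given by ${OUTPUT_FILE}.
After the --command completes, the output file will be copied to the bucket path gs://my-bucket/my-output-file.
Multiple --input and --output parameters can be specified, and they can be specified in any order.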
To copy folders rather than files, use the --input-recursive and --output-recursive flags:
dsub \
... \
--input-recursive FOLDER=gs://my-bucket/my-folder \
--command 'find ${FOLDER} -name "foo*"'
Multiple --input-recursive and --output-recursive parameters can be specified, and they can be specified in any order.
While explicitly specifying inputs improves tracking provenance of your data, there are cases where you might not want to explicitly localize all inputs from Cloud Storage to your job VM.
For example, if you have:
OR
OR
then you may find it more efficient or convenient to access this data by mounting read-only:
The google-cls-v2 and google-batch providers support these methods of providing access to resource data.
The local provider supports mounting a local directory in a similar fashion to support your local development.
To have the google-cls-v2 or google-batch provider mount a Cloud Storage bucket using Cloud Storage FUSE, use the --mount command line flag:
--mount RESOURCES=gs://mybucket
The bucket will be mounted read-only into the Docker container running your --script or --command and the location made available via the environment variable ${RESOURCES}. Inside your script, you can reference the mounted path using the environment variable. Please read Key differences from a POSIX file system and Semantics before using Cloud Storage FUSE.
To have the google-cls-v2 or google-batch provider mount a persistent disk that you have pre-created and populated, use the --mount command line flag and the url of the source disk:
--mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/zones/your_disk_zone/disks/your-disk"
To have the google-cls-v2 or google-batch provider mount a persistent disk created from an image, use the --mount command line flag and the url of the source image and the size (in GB) of the disk:
--mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"
The image will be used to create a new persistent disk, which will be attached to a Compute Engine VM. The disk will be mounted into the Docker container running your --script or --command and the location made available by the environment variable ${RESOURCES}. Inside your script, you can reference the mounted path using the environment variable.
To create an image, see Creating a custom image.
(local provider)
To have the local provider mount a directory read-only, use the --mount command line flag and a file:// prefix:
--mount RESOURCES=file://path/to/my/dir
The local directory will be mounted into the Docker container running your --script or --command and the location made available via the environment variable ${RESOURCES}. Inside your script, you can reference the mounted path using the environment variable.
dsub tasks run using the local provider will use the resources available on your local machine.
dsub tasks run using the google-cls-v2 or google-batch providers can take advantage of a wide range of CPU, RAM, disk, and hardware accelerator (e.g. GPU) options.
See the Compute Resources documentation for details.
By default, dsub generates a job-id with the form job-name--userid--timestamp where the job-name is truncated at 10 characters and the timestamp is of the form YYMMDD-HHMMSS-XX, unique to hundredths of a second. If you are submitting multiple jobs concurrently, you may still run into situations where the job-id is not unique. If you require a unique job-id for this situation, you may use the --unique-job-id parameter.
If the --unique-job-id parameter is set, job-id will instead be a unique 32 character UUID created by the Python uuid module (https://docs.python.org/3/library/uuid.html). Because some providers require that the job-id begin with a letter, dsub will replace any starting digit with a letter in a manner that preserves uniqueness.
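When scripting around job-ids in the default job-name--userid--timestamp form, the three parts can be separated with plain shell parameter expansion. The sketch below uses a made-up job-id and assumes that no individual part contains a double dash; it does not apply to UUID-style ids produced with --unique-job-id.

```shell
# Split a default-format job-id (job-name--userid--timestamp) into its parts.
# The job-id value is made up for illustration; assumes no part contains "--".
job_id="my-script--alice--240131-154500-12"

job_name="${job_id%%--*}"     # text before the first "--"
rest="${job_id#*--}"          # drop the job-name and its separator
user_id="${rest%%--*}"        # text before the next "--"
timestamp="${rest#*--}"       # remaining YYMMDD-HHMMSS-XX portion

echo "job-name:  ${job_name}"
echo "user-id:   ${user_id}"
echo "timestamp: ${timestamp}"
```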
Each of the examples above has demonstrated submitting a single task with a single set of variables, inputs, and outputs. If you have a batch of inputs and you want to run the same operation over them, dsub allows you to create a batch job.
Instead of calling dsub repeatedly, you can create a tab-separated values (TSV) file containing the variables, inputs, and outputs for each task, and then call dsub once. The result will be a single job-id with multiple tasks. The tasks will be scheduled and run independently, but can be monitored and deleted as a group.
The first line of the TSV file specifies the names and types of the parameters. For example:
--env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH
Each additional line in the file should provide the variable, input, and output values for one task. Each line beyond the header represents the values for a separate task.
Multiple --env, --input, and --output parameters can be specified, and they can be specified in any order. For example:
--env SAMPLE<tab>--input A<tab>--input B<tab>--env REFNAME<tab>--output O
S1<tab>gs://path/A1.txt<tab>gs://path/B1.txt<tab>R1<tab>gs://path/O1.txt
S2<tab>gs://path/A2.txt<tab>gs://path/B2.txt<tab>R2<tab>gs://path/O2.txt
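Because editors often turn tabs into spaces, it can be safer to generate the file with printf so that the separators are real tab characters. The following is a minimal sketch that writes the three lines shown above. The output file location is an assumption, and the gs:// paths are the placeholders from the example.

```shell
# Write the example tasks file with real tab characters using printf.
# The file location is an assumption; the gs:// paths are placeholders.
tasks_file="${TMPDIR:-/tmp}/my-tasks.tsv"

# printf reuses the five-field format for each group of five arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '--env SAMPLE' '--input A' '--input B' '--env REFNAME' '--output O' \
  'S1' 'gs://path/A1.txt' 'gs://path/B1.txt' 'R1' 'gs://path/O1.txt' \
  'S2' 'gs://path/A2.txt' 'gs://path/B2.txt' 'R2' 'gs://path/O2.txt' \
  > "$tasks_file"

# Sanity check: every line should report the same number of tab-separated fields.
awk -F '\t' '{ print NF }' "$tasks_file" | sort -u
```

If the final command prints more than one number, at least one row has a different field count from the header.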
Pass the TSV file to dsub using the --tasks parameter. This parameter accepts both the file path and optionally a range of tasks to process. The file may be read from the local filesystem (on the machine you're calling dsub from), or from a bucket in Google Cloud Storage (file name starts with "gs://").
For example, suppose my-tasks.tsv contains 101 lines: a one-line header and 100 lines of parameters for tasks to run. Then:
dsub ... --tasks ./my-tasks.tsv
will create a job with 100 tasks, while:
dsub ... --tasks ./my-tasks.tsv 1-10
will create a job with 10 tasks, one for each of lines 2 through 11.
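The offset between task numbers and file lines comes from the header occupying line 1, so task k sits on line k+1. A small sketch of that arithmetic for a closed range is shown here. The function name task_range_to_lines is hypothetical, it handles only the m-n form, and it does not reproduce dsub's own range parsing.

```shell
# Map a closed task range "m-n" to the TSV file lines it selects.
# Task k is on line k+1 because line 1 is the header.
# task_range_to_lines is a hypothetical helper; it handles only the "m-n" form.
task_range_to_lines() {
  first="${1%-*}"   # part before the dash
  last="${1#*-}"    # part after the dash
  echo "$((first + 1))-$((last + 1))"
}

task_range_to_lines 1-10
```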
The task range values can take any of the following forms:
m indicates to submit task m (line m+1)
m- indicates to submit all tasks starting with task m
m-n indicates to submit all tasks from m to n (inclusive).
The --logging flag points to a location for dsub task log files. For details on how to specify your logging path, see Logging.
It's possible to wait for a job to complete before starting another. For details, see job control with dsub.
It is possible for dsub to automatically retry failed tasks. For details, see retries with dsub.
You can add custom labels to jobs and tasks, which allows you to monitor and cancel tasks using your own identifiers. In addition, with the Google providers, labeling a task will label associated compute resources such as virtual machines and disks.
For more details, see Checking Status and Troubleshooting Jobs
The dstat command displays the status of jobs:
dstat --provider google-cls-v2 --project my-cloud-project
With no additional arguments, dstat will display a list of running jobs for the current USER.
To display the status of a specific job, use the --jobs flag:
dstat --provider google-cls-v2 --project my-cloud-project --jobs job-id
For a batch job, the output will list all running tasks.
Each job submitted by dsub is given a set of metadata values that can be used for job identification and job control. The metadata associated with each job includes:
job-name: defaults to the name of your script file or the first word of your script command; it can be explicitly set with the --name parameter.
user-id: the USER environment variable value.
job-id: identifier of the job, which can be used in calls to dstat and ddel for job monitoring and canceling respectively. See Job Identifiers for more details on the job-id format.
task-id: if the job is submitted with the --tasks parameter, each task gets a sequential value of the form "task-n" where n is 1-based.
Note that the job metadata values will be modified to conform with the "Label Restrictions" listed in the Checking Status and Troubleshooting Jobs guide.
Metadata can be used to cancel a job or individual tasks within a batch job.
For more details, see Checking Status and Troubleshooting Jobs
By default, dstat outputs one line per task. If you're using a batch job with many tasks then you may benefit from --summary.
$ dstat --provider google-cls-v2 --project my-project --status '*' --summary
Job Name Status Task Count
------------- ------------- -------------
my-job-name RUNNING 2
my-job-name SUCCESS 1
In this mode, dstat prints one line per (job name, task status) pair. You can see at a glance how many tasks are finished, how many are still running, and how many are failed/canceled.
The ddel command will delete running jobs.
By default, only jobs submitted by the current user will be deleted. Use the --users flag to specify other users, or '*' for all users.
To delete a running job:
ddel --provider google-cls-v2 --project my-cloud-project --jobs job-id
If the job is a batch job, all running tasks will be deleted.
To delete specific tasks:
ddel \
--provider google-cls-v2 \
--project my-cloud-project \
--jobs job-id \
--tasks task-id1 task-id2
To delete all running jobs for the current user:
ddel --provider google-cls-v2 --project my-cloud-project --jobs '*'
When you run the dsub command with the google-cls-v2 or google-batch provider, there are two different sets of credentials to consider:
The account submitting the pipelines.run() request to run your command/script on a VM
The account used on the VM while your command/script runs
The account used to submit the pipelines.run() request is typically your end user credentials. You would have set this up by running:
gcloud auth application-default login
The account used on the VM is a service account. The image below illustrates this:
By default, dsub will use the default Compute Engine service account as the authorized service account on the VM instance. You can choose to specify the email address of another service account using --service-account.
By default, dsub will grant the following access scopes to the service account:
In addition, the API will always add this scope:
You can choose to specify scopes using --scopes.
While it is straightforward to use the default service account, this account also has broad privileges granted to it by default. Following the Principle of Least Privilege, you may want to create and use a service account that has only sufficient privileges granted in order to run your dsub command/script.
To create a new service account, follow the steps below:
Execute the gcloud iam service-accounts create command. The email address of the service account will be sa-name@project-id.iam.gserviceaccount.com.
gcloud iam service-accounts create "sa-name"
Grant IAM access on buckets, etc. to the service account.
gsutil iam ch serviceAccount:sa-name@project-id.iam.gserviceaccount.com:roles/storage.objectAdmin gs://bucket-name
Update your dsub command to include --service-account
dsub \
--service-account sa-name@project-id.iam.gserviceaccount.com \
...