s5cmd-python
python binding for using s5cmd to download and upload files to s3 efficiently
The S5CmdRunner
class provides a Python interface for interacting with s5cmd
, a command-line tool designed for efficient data transfer to and from Amazon S3.
For more information about s5cmd, please refer to the original s5cmd repository.
Features
- Check for the presence of
s5cmd
and download it if necessary. - Execute
s5cmd
commands cp
, mv
, and run
. - Handle file downloads from URLs and S3 URIs.
- Generate command files for batch operations with
s5cmd
. - Simplify operations like copying and moving files between local paths and S3 URIs.
Installation
To use S5CmdRunner
, ensure that Python 3.10 or higher is installed. The project itself can be installed from pip:
pip install s5cmdpy
or from source:
git clone https://github.com/trojblue/s5cmd-python
cd s5cmd-python
pip install -e .
Usage
Here are some examples of how to use the S5CmdRunner
class:
Initialize S5CmdRunner
from s5cmdpy import S5CmdRunner
runner = S5CmdRunner()
Run s5cmd with a Local Command File
local_txt_path = "s5cmd_test.txt"
runner.run(local_txt_path)
Run s5cmd with a Command File from S3
txt_s3_uri = "s3://dataset-artstation-uw2/s5cmd_test.txt"
runner.run(txt_s3_uri)
Without any arguments, the progress bar created by run()
assumes that each line in the txt is for downloading a single file, therefore n lines in txt will result in n lines of console output.
For a more accurate progress bar, you can pass in the actual total number of files being downloaded, using the total
argument:
s5cmdpy.run("test_run_file.txt", total=10000)
Download Multiple Files from S3
s3_uris = [
's3://dataset-artstation-uw2/artists/__andrey__/1841730##GZGgW.json',
's3://dataset-artstation-uw2/artists/__andrey__/2249992##q5Y22.json'
]
destination_dir = '/home/ubuntu/datasets/s5cmd_test'
runner.download_from_s3_list(s3_uris, destination_dir)
Download a file from internet and upload to S3
cp
command also works with a file from internet:
target_url = "https://huggingface.co/kiriyamaX/mld-caformer/resolve/main/ml_caformer_m36_dec-5-97527.onnx"
dst_s3_uri = "s3://dataset-artstation-uw2/_dev/"
runner.cp(target_url, dst_s3_uri)
List files under S3 Directory
Uses s5cmd to efficiently list files under s3. Has around twice the speed compared to boto3:
s3_uri = "s3://dataset-artstation-uw2/_dev/"
files_under_dir = runner.ls(s3_uri)
Quick use of the runner class
Common commands can be called directly, without initializing a runner first:
s5cmdpy.download_from_s3_list(...)
s5cmdpy.mv(...)
s5cmdpy.cp(...)
s5cmdpy.run(...)
s5cmdpy.sync(...)
s5cmdpy.ls(...)
import s5cmdpy
s5cmdpy.run("some_runfile.txt")
License
S5cmd itself is MIT licensed. This project is also MIT licensed.