spark-etl

Generic ETL Pipeline Framework for Apache Spark

Version 0.0.130 on PyPI (1 maintainer)

Overview

Goal

Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the supported platforms listed below.

However, the ways to deploy and launch a Spark application are incompatible across cloud Spark platforms.

spark-etl is a Python package that provides a standard way to build, deploy, and run your Spark application across various cloud Spark platforms.

Benefit

An application built with spark-etl can be deployed and launched on different cloud Spark platforms without changing its source code.

Application

An application is a Python program. It contains:

A main.py file, which contains the application entry point.
A manifest.json file, which specifies the metadata of the application.
A requirements.txt file, which specifies the application's dependencies.
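As a quick sanity check before building, the layout above can be verified with a few lines of Python. This is only a sketch: `missing_app_files` is a hypothetical helper written for illustration, not part of spark-etl, and it checks only that the three files exist, not their contents.

```python
from pathlib import Path

# File names taken from the application layout described above.
REQUIRED_FILES = ("main.py", "manifest.json", "requirements.txt")


def missing_app_files(app_dir):
    """Return the required file names that are absent from app_dir.

    Hypothetical helper for illustration; not part of spark-etl.
    """
    root = Path(app_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory has all three files; any names returned are the ones still to be created.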

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

spark is the Spark session object.
input_args is a dict containing the arguments the user specified when running this application.
sysops is the system options passed in; it is platform specific. The job submitter may inject platform-specific objects into sysops.
The main function's return value should be a JSON object; it is returned from the job submitter to the caller.

Here is an application example.
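For orientation, a minimal main.py might look like the sketch below. It assumes only the entry signature described above; the `row_count` key in `input_args` is a hypothetical argument chosen for this illustration, and the returned dict stands in for the JSON object handed back to the caller.

```python
def main(spark, input_args, sysops={}):
    # input_args is the dict supplied by the user at run time.
    # "row_count" is a hypothetical key used only in this sketch.
    row_count = int(input_args.get("row_count", 10))

    # SparkSession.range builds a DataFrame with a single "id" column
    # holding the values 0 .. row_count - 1.
    df = spark.range(row_count)

    # Return a plain dict of JSON-friendly values so the job submitter
    # can pass it back to the caller.
    return {"rows": df.count()}
```

The function never creates the Spark session itself; it relies on the `spark` object it is given, which is what lets the same code run unchanged on different platforms.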

Build your application

etl -a build -c <config-filename> -p <application-name>

Deploy your application

etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>

Run your application

etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>
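When scripting the build, deploy, and run steps together, the three commands above can be assembled programmatically. The sketch below uses only the flags shown in this section; `etl_command` is a hypothetical helper, and the file, application, and profile names are placeholders rather than values spark-etl requires.

```python
import shlex


def etl_command(action, config, app, profile=None, run_args=None):
    """Build the argument list for one etl invocation (hypothetical helper)."""
    cmd = ["etl", "-a", action, "-c", config, "-p", app]
    if profile is not None:
        cmd += ["-f", profile]
    if run_args is not None:
        cmd += ["--run-args", run_args]
    return cmd


# Placeholder names; substitute your own config, application, and profile.
steps = [
    etl_command("build", "config.json", "myapp"),
    etl_command("deploy", "config.json", "myapp", profile="main"),
    etl_command("run", "config.json", "myapp", profile="main", run_args="input.json"),
]

# Print each command as a shell-quoted string for review.
for step in steps:
    print(shlex.join(step))
```

To actually execute the steps, each list can be passed to `subprocess.run(step, check=True)`, which stops the sequence at the first failing command.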

Supported platforms

You set up your own Apache Spark cluster.
You use the PySpark package, which is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.
You host your Spark cluster in Databricks.
You host your Spark cluster in Amazon AWS EMR.
You host your Spark cluster in Google Cloud.
You host your Spark cluster in Microsoft Azure HDInsight.
You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
You host your Spark cluster in IBM Cloud.

Demos

APIs

pydocs for APIs

Job Deployer

For job deployers, please check the wiki.

Job Submitter

For job submitters, please check the wiki.
