
ocean-spark-airflow-provider
An Airflow plugin and provider to launch and monitor Spark applications on Ocean for Apache Spark.
pip install ocean-spark-airflow-provider
For general usage of Ocean for Apache Spark, refer to the official documentation.
In the connection menu, register a new connection of type Ocean for Apache Spark. The default connection name is ocean_spark_default. You will need to have:

- The cluster ID of your Ocean Spark cluster (of the format osc-e4089a00). You can find it in the Spot console in the list of clusters, or by using the Cluster List API.
- A Spot API token to authenticate against the cluster.

The Ocean for Apache Spark connection type is not available for Airflow 1. Instead, create an HTTP connection, fill in your cluster ID as the host, and your API token as the password.
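If you prefer the command line to the connection menu, the Airflow 1 fallback connection can also be registered with the Airflow CLI. A minimal sketch, assuming the Airflow 1.10 connections command and placeholder values for the cluster ID and API token:

airflow connections --add \
    --conn_id ocean_spark_default \
    --conn_type http \
    --conn_host osc-e4089a00 \
    --conn_password "$SPOT_API_TOKEN"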
You will need to create a separate connection for each Ocean Spark cluster that you want to use with Airflow. In the OceanSparkOperator, you can select which Ocean Spark connection to use with the connection_name argument (defaults to ocean_spark_default). For example, you may choose to have one Ocean Spark cluster per environment (dev, staging, prod), and you can easily target an environment by picking the correct Airflow connection, as in the sketch below.
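A minimal sketch of targeting one environment (the connection name ocean_spark_staging is an assumed example; register it yourself as described above):

from ocean_spark.operators import OceanSparkOperator

staging_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi-staging",
    connection_name="ocean_spark_staging",  # assumed connection for the staging cluster
    dag=dag,
)

A complete example DAG follows: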
from airflow import DAG, utils
from ocean_spark.operators import OceanSparkOperator

# DAG creation (same pattern as the Spark Connect example below)
args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": utils.dates.days_ago(0, second=1),
}
dag = DAG(dag_id="spark-pi", default_args=args, schedule_interval=None)

spark_pi_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi",
    dag=dag,
    config_overrides={
        "type": "Scala",
        "sparkVersion": "3.2.0",
        "image": "gcr.io/datamechanics/spark:platform-3.2-latest",
        "imagePullPolicy": "IfNotPresent",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
        "arguments": ["10000"],
        "driver": {
            "cores": 1,
            "spot": False,
        },
        "executor": {
            "cores": 4,
            "instances": 1,
            "spot": True,
            "instanceSelector": "r5",
        },
    },
)
from airflow import DAG, utils
from ocean_spark.operators import (
    OceanSparkConnectOperator,
)

args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": utils.dates.days_ago(0, second=1),
}
dag = DAG(dag_id="spark-connect-task", default_args=args, schedule_interval=None)

spark_pi_task = OceanSparkConnectOperator(
    task_id="spark-connect",
    dag=dag,
)

The SQL statement to run is passed through the DAG run configuration, for example:

{
    "sql": "select random()"
}
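A hedged sketch of triggering that DAG with such a configuration from the Airflow 2 CLI (the DAG id comes from the example above):

airflow dags trigger spark-connect-task --conf '{"sql": "select random()"}'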
More examples are available for Airflow 2.
You can test the plugin locally using the docker compose setup in this repository. Run make serve_airflow at the root of the repository to launch an instance of Airflow 2 with the provider already installed.
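A minimal sketch of that workflow, assuming the repository URL below (check the project page for the canonical location):

git clone https://github.com/spotinst/ocean-spark-airflow-provider.git
cd ocean-spark-airflow-provider
make serve_airflow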