cdk-emrserverless-with-delta-lake
| npm (JS/TS) | PyPI (Python) | Maven (Java) | Go | NuGet |
|---|---|---|---|---|
| Link | Link | Link | Link | Link |
This construct builds an EMR Studio, a cluster template for the EMR Studio, and an EMR Serverless application. Two S3 buckets will be created: one for the EMR Studio workspace and the other for EMR Serverless applications. In addition, the VPC and the subnets for the EMR Studio will be tagged with {"Key": "for-use-with-amazon-emr-managed-policies", "Value": "true"}
via a custom resource. This is necessary for the service role of EMR Studio.
This construct is for analysts, data engineers, and anyone who wants to know how to process Delta Lake data with EMR Serverless.
They can build the construct via CDK v2 and create a serverless job within the EMR Serverless application generated by the construct via the AWS CLI within a few minutes. After the EMR Serverless job is finished, they can check the processed result on an EMR notebook through the cluster template.
Before deployment, make sure you have the following:
- Your current identity has the AdministratorAccess power.
- An IAM user named Administrator with the AdministratorAccess power.

If you want a different IAM identity to be associated with the Service Catalog product, check serviceCatalogProps in the EmrServerless construct for detail; otherwise, the IAM user mentioned above will be chosen to set up with the Product. You might want to execute the following command.
PROFILE_NAME="scott.hsieh"
# If you only have one set of credentials on your local machine, just ignore `--profile`, buddy.
cdk bootstrap aws://${AWS_ACCOUNT_ID}/${AWS_REGION} --profile ${PROFILE_NAME}
#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { EmrServerless } from 'cdk-emrserverless-with-delta-lake';

class TypescriptStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    new EmrServerless(this, 'EmrServerless');
  }
}

const app = new cdk.App();
new TypescriptStack(app, 'TypescriptStack', {
  stackName: 'emr-studio',
  env: {
    region: process.env.CDK_DEFAULT_REGION,
    account: process.env.CDK_DEFAULT_ACCOUNT,
  },
});
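Then deploy the stack; a minimal sketch, assuming a single stack in the app and the profile from the bootstrap step (the cdk-outputs.json filename is just an example):
# deploy; --outputs-file additionally saves the CloudFormation outputs for the copy-paste steps below
cdk deploy --profile ${PROFILE_NAME} --outputs-file cdk-outputs.json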
Promise me, darling, take advantage of the CloudFormation outputs. All you need is copy-paste, copy-paste, copy-paste; life should always be that easy.
Define the following environment variables in your current session.
export PROFILE_NAME="${YOUR_PROFILE_NAME}"
export JOB_ROLE_ARN="${copy-paste-thank-you}"
export APPLICATION_ID="${copy-paste-thank-you}"
export SERVERLESS_BUCKET_NAME="${copy-paste-thank-you}"
export DELTA_LAKE_SCRIPT_NAME="delta-lake-demo"
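If you would rather pull these values with the CLI than copy them from the console, you can list the stack outputs directly; a sketch, assuming the emr-studio stack name from the example above:
aws cloudformation describe-stacks --stack-name emr-studio --query "Stacks[0].Outputs" --profile ${PROFILE_NAME}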
Copy partial NYC-taxi data into the EMR Serverless bucket.
aws s3 cp s3://nyc-tlc/trip\ data/ s3://${SERVERLESS_BUCKET_NAME}/nyc-taxi/ --exclude "*" --include "yellow_tripdata_2021-*.parquet" --recursive --profile ${PROFILE_NAME}
Create a Python script for processing Delta Lake.
touch ${DELTA_LAKE_SCRIPT_NAME}.py
cat << EOF > ${DELTA_LAKE_SCRIPT_NAME}.py
from pyspark.sql import SparkSession
import uuid

if __name__ == "__main__":
    """
    Delta Lake with EMR Serverless, take NYC taxi as example.
    """
    spark = SparkSession \\
        .builder \\
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \\
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \\
        .enableHiveSupport() \\
        .appName("Delta-Lake-OSS") \\
        .getOrCreate()

    url = "s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/output/1.2.1/%s/" % str(
        uuid.uuid4())
    # creates a Delta table and outputs to the target S3 bucket
    spark.range(5).write.format("delta").save(url)
    # reads the Delta table back and shows it
    spark.read.format("delta").load(url).show()
    # the source for the second Delta table
    base = spark.read.parquet(
        "s3://${SERVERLESS_BUCKET_NAME}/nyc-taxi/*.parquet")
    # the second Delta table, oh ya
    base.write.format("delta") \\
        .mode("overwrite") \\
        .save("s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/nyx-tlc-2021")
    spark.stop()
EOF
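Optionally, run a quick local syntax check before uploading; this only byte-compiles the script, so pyspark does not need to be installed on your machine:
python3 -m py_compile ${DELTA_LAKE_SCRIPT_NAME}.py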
Upload the script and the required jars into the serverless bucket.
# upload script
aws s3 cp ${DELTA_LAKE_SCRIPT_NAME}.py s3://${SERVERLESS_BUCKET_NAME}/scripts/${DELTA_LAKE_SCRIPT_NAME}.py --profile ${PROFILE_NAME}
# download jars and upload them
DELTA_VERSION="3.0.0"
DELTA_LAKE_CORE="delta-spark_2.13-${DELTA_VERSION}.jar"
DELTA_LAKE_STORAGE="delta-storage-${DELTA_VERSION}.jar"
curl https://repo1.maven.org/maven2/io/delta/delta-spark_2.13/${DELTA_VERSION}/${DELTA_LAKE_CORE} --output ${DELTA_LAKE_CORE}
curl https://repo1.maven.org/maven2/io/delta/delta-storage/${DELTA_VERSION}/${DELTA_LAKE_STORAGE} --output ${DELTA_LAKE_STORAGE}
aws s3 mv ${DELTA_LAKE_CORE} s3://${SERVERLESS_BUCKET_NAME}/jars/${DELTA_LAKE_CORE} --profile ${PROFILE_NAME}
aws s3 mv ${DELTA_LAKE_STORAGE} s3://${SERVERLESS_BUCKET_NAME}/jars/${DELTA_LAKE_STORAGE} --profile ${PROFILE_NAME}
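A quick optional check that the script and the jars landed where the job definition below expects them:
aws s3 ls s3://${SERVERLESS_BUCKET_NAME}/scripts/ --profile ${PROFILE_NAME}
aws s3 ls s3://${SERVERLESS_BUCKET_NAME}/jars/ --profile ${PROFILE_NAME}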
Remember, you have plenty of information to copy and paste from the CloudFormation outputs.
aws emr-serverless start-job-run \
    --application-id ${APPLICATION_ID} \
    --execution-role-arn ${JOB_ROLE_ARN} \
    --name 'shy-shy-first-time' \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${SERVERLESS_BUCKET_NAME}'/scripts/'${DELTA_LAKE_SCRIPT_NAME}'.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.jars=s3://'${SERVERLESS_BUCKET_NAME}'/jars/'${DELTA_LAKE_CORE}',s3://'${SERVERLESS_BUCKET_NAME}'/jars/'${DELTA_LAKE_STORAGE}'"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${SERVERLESS_BUCKET_NAME}'/serverless-log/"
            }
        }
    }' \
    --profile ${PROFILE_NAME}
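The call returns a jobRunId (see the sample response below). You can poll the run state with it; a sketch, assuming you export JOB_RUN_ID from that response:
export JOB_RUN_ID="${copy-paste-thank-you}"
aws emr-serverless get-job-run --application-id ${APPLICATION_ID} --job-run-id ${JOB_RUN_ID} --query 'jobRun.state' --output text --profile ${PROFILE_NAME}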
If the command executes successfully, you should see a response similar to the following:
{
    "applicationId": "00f1gvklchoqru25",
    "jobRunId": "00f1h0ipd2maem01",
    "arn": "arn:aws:emr-serverless:ap-northeast-1:630778274080:/applications/00f1gvklchoqru25/jobruns/00f1h0ipd2maem01"
}
After the job finishes, you will find Delta Lake data under s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/nyx-tlc-2021/.
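To peek at what the job wrote, list the output prefix; besides the parquet files you should see a _delta_log/ prefix, which is the Delta transaction log:
aws s3 ls s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/nyx-tlc-2021/ --profile ${PROFILE_NAME}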
Access the EMR Studio via the URL from the CloudFormation outputs. It should look very similar to the following URL: https://es-pilibalapilibala.emrstudio-prod.ap-northeast-1.amazonaws.com, i.e., the random string and the region won't be the same as mine.
Set spark.jars as the Spark page says for your EMR Serverless job. The UI will complain, but you can still start the job. Don't be afraid; just click it like when you were a child, facing authority fearlessly.