A powerful command-line tool and API for managing and deploying Spark jobs on Amazon EMR clusters. EMRRunner simplifies the process of submitting and managing Spark jobs while handling all the necessary environment setup.
pip install emrrunner
# Clone the repository
git clone https://github.com/yourusername/EMRRunner.git
cd EMRRunner
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
# Install the package
pip install -e .
Create a .env file in the project root with your AWS configuration:
Note: Export these variables in your terminal before running:
export AWS_ACCESS_KEY=your_access_key
export AWS_SECRET_KEY=your_secret_key
export AWS_REGION=your_region
export EMR_CLUSTER_ID=your_cluster_id
export S3_PATH=s3://your-bucket/path
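Because a missing variable tends to surface only when a job is submitted, it can help to check the environment up front. The following is a minimal sketch, not EMRRunner's own configuration code: `load_settings` and `REQUIRED_VARS` are hypothetical names, and the only facts taken from this README are the five variable names.

```python
import os
from typing import Mapping

# Variable names as listed in this README.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "AWS_REGION",
    "EMR_CLUSTER_ID",
    "S3_PATH",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, failing early if any are missing.

    Hypothetical helper for illustration only.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    settings = {name: env[name] for name in REQUIRED_VARS}
    if not settings["S3_PATH"].startswith("s3://"):
        raise ValueError("S3_PATH must start with s3://")
    return settings
```

Calling `load_settings()` with no arguments reads from the current process environment; passing a plain dictionary makes the check easy to exercise in tests.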
For EMR cluster setup with required dependencies, create a bootstrap script (bootstrap.sh):
#!/bin/bash -xe
# Example structure of a bootstrap script
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install system dependencies
sudo yum install python3-pip -y
sudo yum install -y [your-system-packages]
# Install Python packages
pip3 install [your-required-packages]
deactivate
Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
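Bootstrap actions are attached when a cluster is created. As a rough illustration, the sketch below builds one entry in the shape that boto3's EMR `run_job_flow` call accepts in its `BootstrapActions` list. The helper name `bootstrap_action` and the default action name are assumptions made for this example and are not part of EMRRunner.

```python
def bootstrap_action(script_uri: str, name: str = "Install dependencies") -> dict:
    """Build one BootstrapActions entry for an EMR cluster definition.

    Hypothetical helper; script_uri is where bootstrap.sh was uploaded.
    """
    if not script_uri.startswith("s3://"):
        raise ValueError("The bootstrap script must be referenced by an s3:// URI")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_uri,
            "Args": [],  # arguments passed to bootstrap.sh, none in this sketch
        },
    }


action = bootstrap_action("s3://your-bucket/bootstrap/bootstrap.sh")
```

The resulting dictionary would be passed as `BootstrapActions=[action]` alongside the rest of the cluster configuration.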
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration management
│ ├── emr_client.py # EMR interaction logic
│ ├── emr_job_api.py # Flask API endpoints
│ ├── run_api.py # API server runner
│ └── schema.py # Request/Response schemas
├── bootstrap/
│ └── bootstrap.sh # EMR bootstrap script
├── tests/
│ ├── __init__.py
│ ├── test_config.py
│ ├── test_emr_job_api.py
│ └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
The S3_PATH in your configuration should point to a bucket with the following structure:
s3://your-bucket/
├── jobs/
│ ├── job1/
│ │ ├── dependencies.py # Shared functions and utilities
│ │ └── job.py # Main job execution script
│ └── job2/
│   ├── dependencies.py
│   └── job.py
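Given that layout, the location of a job's two files follows from S3_PATH and the job name. The sketch below shows one way to derive them, assuming S3_PATH is the prefix that directly contains `jobs/`; `job_file_uris` is a hypothetical helper, and how EMRRunner itself resolves these paths may differ.

```python
def job_file_uris(s3_path: str, job_name: str) -> dict:
    """Return the S3 URIs of a job's entry script and its shared module.

    Assumes the layout <S3_PATH>/jobs/<job_name>/{job.py,dependencies.py}.
    """
    base = s3_path.rstrip("/")  # tolerate a trailing slash in S3_PATH
    job_prefix = f"{base}/jobs/{job_name}"
    return {
        "job": f"{job_prefix}/job.py",
        "dependencies": f"{job_prefix}/dependencies.py",
    }


uris = job_file_uris("s3://your-bucket/", "job1")
```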
Each job in the S3 bucket follows a standard structure:
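dependencies.py
def process_data(df):
    # Data processing logic goes here; this placeholder returns the input unchanged
    return df

def validate_input(data):
    # Input validation logic goes here; raise an exception on invalid input
    pass

def transform_output(result):
    # Output transformation logic goes here; this placeholder returns the input unchanged
    return result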
job.py
from pyspark.sql import SparkSession

from dependencies import process_data, validate_input, transform_output

def main():
    # Create (or reuse) the Spark session for this job
    spark = SparkSession.builder.appName("job1").getOrCreate()

    # 1. Read input data
    input_data = spark.read.parquet("s3://input-path")

    # 2. Validate input
    validate_input(input_data)

    # 3. Process data
    processed_data = process_data(input_data)

    # 4. Transform output
    final_output = transform_output(processed_data)

    # 5. Write results
    final_output.write.parquet("s3://output-path")

if __name__ == "__main__":
    main()
Start a job in client mode:
emrrunner start --job job1 --step process_daily_data
Start a job in cluster mode:
emrrunner start --job job1 --step process_daily_data --deploy-mode cluster
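For context on what the deploy mode changes, the sketch below builds an EMR step definition in the shape that boto3's `add_job_flow_steps` accepts, where the mode becomes a `spark-submit` flag. This is an illustration only and not a description of what `emrrunner start` does internally: `spark_step` is a hypothetical helper, and the choice of `ActionOnFailure` is an assumption.

```python
VALID_DEPLOY_MODES = ("client", "cluster")


def spark_step(name: str, script_uri: str, deploy_mode: str = "client") -> dict:
    """Build an EMR step that runs spark-submit via command-runner.jar.

    Hypothetical helper; script_uri is the job script passed to spark-submit.
    """
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(f"deploy_mode must be one of {VALID_DEPLOY_MODES}")
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumption: keep the cluster running on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, script_uri],
        },
    }


step = spark_step("process_daily_data", "s3://your-bucket/jobs/job1/job.py", "cluster")
```

In client mode the Spark driver runs on the node that invokes `spark-submit`; in cluster mode it runs inside a YARN container on the cluster.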
Start a job via API in client mode (default):
curl -X POST http://localhost:8000/api/v1/emr/job/start \
-H "Content-Type: application/json" \
-d '{"job_name": "job1", "step": "process_daily_data"}'
Start a job via API in cluster mode:
curl -X POST http://localhost:8000/api/v1/emr/job/start \
-H "Content-Type: application/json" \
-d '{"job_name": "job1", "step": "process_daily_data", "deploy_mode": "cluster"}'
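The same requests can be issued from Python. The sketch below uses only the standard library and mirrors the URL, header, and JSON fields from the curl commands above; `build_start_request` is a hypothetical helper, and the shape of the API's response is not documented here, so the example stops at building the request.

```python
import json
from typing import Optional
from urllib.request import Request

API_URL = "http://localhost:8000/api/v1/emr/job/start"


def build_start_request(
    job_name: str,
    step: str,
    deploy_mode: Optional[str] = None,
    url: str = API_URL,
) -> Request:
    """Build the POST request that starts a job through the API."""
    payload = {"job_name": job_name, "step": step}
    if deploy_mode is not None:
        # Omitting deploy_mode leaves the API on its default (client mode)
        payload["deploy_mode"] = deploy_mode
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_start_request("job1", "process_daily_data", deploy_mode="cluster")
```

With the API server running, `urllib.request.urlopen(request)` sends the request.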
To contribute to EMRRunner:
Bootstrap Actions
Job Dependencies
Job Organization
This project is licensed under the MIT License - see the LICENSE.md file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you discover any bugs, please create an issue on GitHub with:
Built with ❤️ using Python and AWS EMR