# EMRRunner (EMR Job Runner)
![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![Amazon EMR](https://img.shields.io/badge/Amazon%20EMR-FF9900?style=for-the-badge&logo=amazon-aws&logoColor=white)
![Flask](https://img.shields.io/badge/Flask-000000?style=for-the-badge&logo=flask&logoColor=white)
![AWS](https://img.shields.io/badge/AWS-232F3E?style=for-the-badge&logo=amazon-aws&logoColor=white)
A powerful command-line tool and API for managing and deploying Spark jobs on Amazon EMR clusters. EMRRunner simplifies the process of submitting and managing Spark jobs while handling all the necessary environment setup.
## 🚀 Features
- Command-line interface for quick job submission
- RESTful API for programmatic access
- Support for both client and cluster deploy modes
- Automatic S3 synchronization of job files
- Configurable job parameters
- Easy dependency management
- Bootstrap action support for cluster setup
## 📋 Prerequisites
- Python 3.9+
- AWS Account with EMR access
- Configured AWS credentials
- Active EMR cluster
## 🛠️ Installation
### From PyPI
```bash
pip install emrrunner
```
### From Source
```bash
# Clone the repository
git clone https://github.com/Haabiy/EMRRunner.git
cd EMRRunner
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
# Install the package
pip install -e .
```
## ⚙️ Configuration
### AWS Configuration
Create a `.env` file in the project root with your AWS configuration, or export the same variables in your terminal before running EMRRunner:
```env
export AWS_ACCESS_KEY=your_access_key
export AWS_SECRET_KEY=your_secret_key
export AWS_REGION=your_region
export EMR_CLUSTER_ID=your_cluster_id
export S3_PATH=s3://your-bucket/path
```
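As a rough illustration of how these settings could be checked before a job is submitted, here is a minimal sketch. The variable names come from the example above; `load_config` is a hypothetical helper and is not claimed to be part of EMRRunner's own `config.py`.

```python
import os

# Variable names taken from the configuration example above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "AWS_REGION",
    "EMR_CLUSTER_ID",
    "S3_PATH",
)


def load_config(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    Hypothetical helper for illustration only. `env` defaults to the
    process environment, but any mapping can be passed in for testing.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_config()` with no arguments reads from the current environment, so a missing variable fails fast with a clear message instead of surfacing later as an AWS error.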
### Bootstrap Actions
For EMR cluster setup with required dependencies, create a bootstrap script (`bootstrap.sh`):
```bash
#!/bin/bash -xe
# Example structure of a bootstrap script
# Install system dependencies first so that pip is available
sudo yum install -y python3-pip
sudo yum install -y [your-system-packages]
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install Python packages into the virtual environment
pip3 install [your-required-packages]
deactivate
```
Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
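If the cluster is created with boto3, the uploaded script is referenced through the `BootstrapActions` parameter of `run_job_flow`. The sketch below only builds that entry; `bootstrap_action`, the default name, and the example S3 location are assumptions for illustration.

```python
def bootstrap_action(script_s3_uri, name="Install job dependencies", args=None):
    """Build one entry for the `BootstrapActions` list of boto3's `run_job_flow`.

    Hypothetical helper: the default name and any URI passed in are placeholders.
    """
    if not script_s3_uri.startswith("s3://"):
        raise ValueError("Bootstrap script must be an s3:// URI")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_uri,
            "Args": list(args or []),
        },
    }


# Placeholder location; replace with wherever bootstrap.sh was uploaded.
action = bootstrap_action("s3://your-bucket/bootstrap/bootstrap.sh")
```

The resulting dict would be passed as `BootstrapActions=[action]` alongside the other cluster settings when calling `boto3.client("emr").run_job_flow(...)`.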
## 📁 Project Structure
```
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration management
│   ├── emr_client.py       # EMR interaction logic
│   ├── emr_job_api.py      # Flask API endpoints
│   ├── run_api.py          # API server runner
│   └── schema.py           # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh        # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
```
## 📦 S3 Job Structure
The `S3_PATH` in your configuration should point to a bucket with the following structure:
```
s3://your-bucket/
├── jobs/
│   ├── job1/
│   │   ├── dependencies.py   # Shared functions and utilities
│   │   └── job.py            # Main job execution script
│   └── job2/
│       ├── dependencies.py
│       └── job.py
```
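Given that layout, the two files for a job can be located from `S3_PATH` and the job name. The following is a small sketch under the assumption that jobs live under `<S3_PATH>/jobs/<job>/`, as in the tree above; `job_file_uris` is a hypothetical helper, not an EMRRunner function.

```python
def job_file_uris(s3_path, job_name):
    """Return the S3 URIs of job.py and dependencies.py for one job.

    Assumes the layout <s3_path>/jobs/<job_name>/, as in the tree above.
    """
    base = s3_path.rstrip("/")  # tolerate a trailing slash in S3_PATH
    prefix = f"{base}/jobs/{job_name}"
    return {
        "job": f"{prefix}/job.py",
        "dependencies": f"{prefix}/dependencies.py",
    }


uris = job_file_uris("s3://your-bucket/", "job1")
```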
### Job Organization
Each job in the S3 bucket follows a standard structure:
1. **dependencies.py**
   - Contains reusable functions and utilities specific to the job
   - Example functions:
   ```python
   def process_data(df):
       # Data processing logic
       pass

   def validate_input(data):
       # Input validation logic
       pass

   def transform_output(result):
       # Output transformation logic
       pass
   ```
2. **job.py**
   - Main execution script that uses functions from dependencies.py
   - Standard structure:
   ```python
   from pyspark.sql import SparkSession

   from dependencies import process_data, validate_input, transform_output

   def main():
       # Create (or reuse) the Spark session; the app name is a placeholder
       spark = SparkSession.builder.appName("job1").getOrCreate()
       # 1. Read input data
       input_data = spark.read.parquet("s3://input-path")
       # 2. Validate input
       validate_input(input_data)
       # 3. Process data
       processed_data = process_data(input_data)
       # 4. Transform output
       final_output = transform_output(processed_data)
       # 5. Write results
       final_output.write.parquet("s3://output-path")

   if __name__ == "__main__":
       main()
   ```
## 💻 Usage
### Command Line Interface
Start a job in client mode:
```bash
emrrunner start --job job1 --step process_daily_data
```
Start a job in cluster mode:
```bash
emrrunner start --job job1 --step process_daily_data --deploy-mode cluster
```
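To make the difference between the two modes concrete, the sketch below shows how a `spark-submit` command and the corresponding EMR step definition might be assembled. It is an illustration, not EMRRunner's actual implementation: `spark_submit_args` and `emr_step` are hypothetical helpers, and the paths are placeholders that are passed through unchanged. The `--deploy-mode` and `--py-files` flags are standard `spark-submit` options, and the step dict follows the shape accepted by boto3's `add_job_flow_steps`.

```python
def spark_submit_args(job_path, dependencies_path, deploy_mode="client"):
    """Build the spark-submit argument list for a job and its helper module."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--deploy-mode", deploy_mode,
        "--py-files", dependencies_path,  # ships dependencies.py with the job
        job_path,
    ]


def emr_step(name, args):
    """Wrap a command in an EMR step that runs it via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


args = spark_submit_args(
    "s3://your-bucket/jobs/job1/job.py",
    "s3://your-bucket/jobs/job1/dependencies.py",
    deploy_mode="cluster",
)
step = emr_step("process_daily_data", args)
```

In client mode the Spark driver runs in the process that launched `spark-submit`; in cluster mode it runs inside the cluster under YARN, which is why the same job can behave differently with respect to logs and local files.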
### API Endpoints
Start a job via API in client mode (default):
```bash
curl -X POST http://localhost:8000/api/v1/emr/job/start \
-H "Content-Type: application/json" \
-d '{"job_name": "job1", "step": "process_daily_data"}'
```
Start a job via API in cluster mode:
```bash
curl -X POST http://localhost:8000/api/v1/emr/job/start \
-H "Content-Type: application/json" \
-d '{"job_name": "job1", "step": "process_daily_data", "deploy_mode": "cluster"}'
```
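The same requests can be issued from Python using only the standard library. This is a minimal sketch: the URL and payload fields are taken from the `curl` examples above, while `build_start_request` and `start_job` are hypothetical helpers, and treating the response body as JSON is an assumption.

```python
import json
import urllib.request

# Endpoint from the curl examples above; adjust host and port as needed.
API_URL = "http://localhost:8000/api/v1/emr/job/start"


def build_start_request(job_name, step, deploy_mode=None, url=API_URL):
    """Build (but do not send) the POST request that starts a job."""
    payload = {"job_name": job_name, "step": step}
    if deploy_mode is not None:
        # Omitting deploy_mode leaves the server on its default (client) mode.
        payload["deploy_mode"] = deploy_mode
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_job(job_name, step, deploy_mode=None):
    """Send the request and decode the response, assumed here to be JSON."""
    request = build_start_request(job_name, step, deploy_mode)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `start_job("job1", "process_daily_data", deploy_mode="cluster")` mirrors the second `curl` command; it requires the API server to be running.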
## 🔧 Development
To contribute to EMRRunner:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request
## 💡 Best Practices
1. **Bootstrap Actions**
   - Keep bootstrap scripts modular
   - Version control your dependencies
   - Use specific package versions
   - Test bootstrap scripts locally when possible
   - Store bootstrap scripts in S3 with versioning enabled
2. **Job Dependencies**
   - Maintain a requirements.txt for each job
   - Use virtual environments
   - Document system-level dependencies
   - Test dependencies in a clean environment
3. **Job Organization**
   - Follow the standard structure for jobs
   - Keep dependencies.py focused and modular
   - Use clear naming conventions
   - Document all functions and modules
## 🔒 Security
- Supports AWS credential management
- Validates all input parameters
- Secure handling of bootstrap scripts
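
As an illustration of what input validation for a start request could look like, here is a plain-Python sketch. It is not the project's actual schema: `validate_start_request` and the naming rule (letters, digits, `_`, `-`) are assumptions chosen for the example, with field names taken from the API examples above.

```python
import re

ALLOWED_DEPLOY_MODES = ("client", "cluster")
# Assumed naming rule: letters, digits, underscores and hyphens only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_start_request(payload):
    """Return a cleaned copy of the payload, or raise ValueError with details."""
    errors = {}
    for field in ("job_name", "step"):
        value = payload.get(field)
        if not isinstance(value, str) or not NAME_PATTERN.fullmatch(value):
            errors[field] = "must be a non-empty string of letters, digits, '_' or '-'"
    # Client mode is the default when deploy_mode is omitted.
    deploy_mode = payload.get("deploy_mode", "client")
    if deploy_mode not in ALLOWED_DEPLOY_MODES:
        errors["deploy_mode"] = "must be 'client' or 'cluster'"
    if errors:
        raise ValueError(errors)
    return {
        "job_name": payload["job_name"],
        "step": payload["step"],
        "deploy_mode": deploy_mode,
    }
```

Restricting names to a small character set keeps values such as `job_name` from smuggling path separators or shell metacharacters into S3 keys and command lines.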
## 📝 License
This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
## 👥 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 🐛 Bug Reports
If you discover any bugs, please create an issue on GitHub with:
- Your operating system name and version
- Any details about your local setup that might be helpful in troubleshooting
- Detailed steps to reproduce the bug
---
Built with ❤️ using Python and AWS EMR
Raw data
{
"_id": null,
"home_page": "https://github.com/Haabiy/EMRRunner",
"name": "emrrunner",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "aws emr spark hadoop data-processing etl cli api flask",
"author": "Haabiy",
"author_email": "abiy.dema@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/26/6c/bcac006fe73fd5a327aa25d0c4d0b0b301eee602ba322b457351050970b5/emrrunner-1.0.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A powerful CLI tool and API for managing Spark jobs on Amazon EMR clusters",
"version": "1.0.9",
"project_urls": {
"Bug Reports": "https://github.com/Haabiy/EMRRunner/issues",
"Documentation": "https://github.com/Haabiy/EMRRunner#readme",
"Homepage": "https://github.com/Haabiy/EMRRunner",
"Source": "https://github.com/Haabiy/EMRRunner"
},
"split_keywords": [
"aws",
"emr",
"spark",
"hadoop",
"data-processing",
"etl",
"cli",
"api",
"flask"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "28d764a3b49b0daf9ffa6a1ec05201514031c5ed42171dbeda18cf881e5e1c49",
"md5": "d55c60574cc4e2a56f052ab44d50b02d",
"sha256": "a6459e7d2fb2a40e85751915b4a3ebe84aeaccb62a6ae4bb7a1ff01f39bf0e1a"
},
"downloads": -1,
"filename": "emrrunner-1.0.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d55c60574cc4e2a56f052ab44d50b02d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8524,
"upload_time": "2024-11-03T16:44:03",
"upload_time_iso_8601": "2024-11-03T16:44:03.031730Z",
"url": "https://files.pythonhosted.org/packages/28/d7/64a3b49b0daf9ffa6a1ec05201514031c5ed42171dbeda18cf881e5e1c49/emrrunner-1.0.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "266cbcac006fe73fd5a327aa25d0c4d0b0b301eee602ba322b457351050970b5",
"md5": "d62b2031efe51438be8ff758989b4ac3",
"sha256": "c93c7fabcbd4221c761eff5e925433c49fb9ba755ac4231911d6c441abbc25ca"
},
"downloads": -1,
"filename": "emrrunner-1.0.9.tar.gz",
"has_sig": false,
"md5_digest": "d62b2031efe51438be8ff758989b4ac3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 9253,
"upload_time": "2024-11-03T16:44:04",
"upload_time_iso_8601": "2024-11-03T16:44:04.487655Z",
"url": "https://files.pythonhosted.org/packages/26/6c/bcac006fe73fd5a327aa25d0c4d0b0b301eee602ba322b457351050970b5/emrrunner-1.0.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-03 16:44:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Haabiy",
"github_project": "EMRRunner",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "boto3",
"specs": [
[
"==",
"1.35.1"
]
]
},
{
"name": "Flask",
"specs": [
[
"==",
"3.0.3"
]
]
},
{
"name": "marshmallow",
"specs": [
[
"==",
"3.20.1"
]
]
},
{
"name": "pytest",
"specs": [
[
"==",
"8.0.2"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
"==",
"1.0.1"
]
]
},
{
"name": "setuptools",
"specs": [
[
"==",
"74.1.2"
]
]
}
],
"lcname": "emrrunner"
}