## Spark-submit
[![PyPI version](https://badge.fury.io/py/spark-submit.svg)](https://badge.fury.io/py/spark-submit)
[![Downloads](https://static.pepy.tech/personalized-badge/spark-submit?period=month&units=international_system&left_color=grey&right_color=green&left_text=total%20downloads)](https://pepy.tech/project/spark-submit)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/spark-submit)](https://pypi.org/project/spark-submit/)
[![](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![Code style: blue](https://img.shields.io/badge/code%20style-blue-blue.svg)](https://blue.readthedocs.io/)
[![License](https://img.shields.io/badge/License-MIT-blue)](#license "Go to license section")
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/PApostol/spark-submit/issues)
#### TL;DR: Python manager for spark-submit jobs
### Description
This package allows for submission and management of Spark jobs in Python scripts via [Apache Spark's](https://spark.apache.org/) `spark-submit` functionality.
### Installation
The easiest way to install is using `pip`:
`pip install spark-submit`
To install from source:
```bash
git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install
```
For usage details, check `help(spark_submit)`.
### Usage Examples
Spark arguments can either be provided as keyword arguments or as an unpacked dictionary.
##### Simple example:
```python
from spark_submit import SparkJob
app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()
print(app.get_state())
```
##### Another example:
```python
from spark_submit import SparkJob
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
}
app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))
# poll state in the background every x seconds with `poll_time=x`
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10)
print(app.get_state()) # 'SUBMITTED'
while not app.concluded:
# do other stuff...
print(app.get_state()) # 'RUNNING'
print(app.get_state()) # 'FINISHED'
```
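Once a job has concluded, its stdout and return code can also be inspected with the methods listed under "Additional methods" below. A minimal sketch (the file path and poll interval are illustrative):
```python
import time
from spark_submit import SparkJob

# hypothetical local job; the file path is illustrative
app = SparkJob('/path/some_file.py', master='local', name='output-check')
app.submit(poll_time=5)

while not app.concluded:
    time.sleep(5)  # or do other work

if app.get_code() == 0:      # spark-submit return code (0 on success)
    print(app.get_output())  # spark-submit stdout
else:
    print('Job concluded with state:', app.get_state())
```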
#### Examples of converting a `spark-submit` command to a `spark_args` dictionary:
##### A `client` example:
```bash
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark-submit-job \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
```
##### becomes
```python
spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark-submit-job',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
}
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
```
##### A `cluster` example:
```bash
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
```
##### becomes
```python
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"],  # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
}
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
```
#### Testing
You can do some simple testing with local-mode Spark after cloning the repo.
First, install the additional test requirements: `pip install -r tests/requirements.txt`
Then run the unit and integration tests:
`pytest tests/`
`python tests/run_integration_test.py`
#### Additional methods
`spark_submit.system_info()`: Collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python, and the OS
`spark_submit.SparkJob.kill()`: Kills the running Spark job (cluster mode only)
`spark_submit.SparkJob.get_code()`: Gets the spark-submit return code
`spark_submit.SparkJob.get_output()`: Gets the spark-submit stdout
`spark_submit.SparkJob.get_id()`: Gets the spark-submit submission ID
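For example, the methods above can be combined to watch a cluster-mode job and kill it if it exceeds a time budget. A minimal sketch (the master URL, file path, and timeout are illustrative):
```python
import time
from spark_submit import SparkJob, system_info

print(system_info())  # versions of spark-submit, Scala, Java, PySpark, Python and OS

# hypothetical cluster-mode job; master URL and path are illustrative
app = SparkJob('s3a://mybucket/file.jar',
               master='spark://some.spark.master:6066',
               deploy_mode='cluster',
               name='watched-job')
app.submit(poll_time=10)  # poll state in the background every 10 seconds

deadline = time.time() + 3600  # 1-hour budget (illustrative)
while not app.concluded:
    if time.time() > deadline:
        app.kill()  # cluster mode only
        break
    time.sleep(30)

print(app.get_id())    # submission ID
print(app.get_code())  # spark-submit return code
```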
### License
Released under [MIT](/LICENSE) by [@PApostol](https://github.com/PApostol).
- You can freely modify and reuse.
- The original license must be included with copies of this software.
- Please link back to this repo if you use a significant portion of the source code.