spark-submit

Name: spark-submit
Version: 1.4.0
Home page: https://github.com/PApostol/spark-submit
Summary: Python manager for spark-submit jobs
Upload time: 2023-04-19 11:28:05
Author/Maintainer: PApostol
Requires Python: ~=3.7
License: MIT
Keywords: apache, spark, submit
## Spark-submit

[![PyPI version](https://badge.fury.io/py/spark-submit.svg)](https://badge.fury.io/py/spark-submit)
[![Downloads](https://static.pepy.tech/personalized-badge/spark-submit?period=month&units=international_system&left_color=grey&right_color=green&left_text=total%20downloads)](https://pepy.tech/project/spark-submit)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/spark-submit)](https://pypi.org/project/spark-submit/)
[![](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![Code style: blue](https://img.shields.io/badge/code%20style-blue-blue.svg)](https://blue.readthedocs.io/)
[![License](https://img.shields.io/badge/License-MIT-blue)](#license "Go to license section")
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/PApostol/spark-submit/issues)

#### TL;DR: Python manager for spark-submit jobs

### Description
This package lets you submit and manage Spark jobs from Python scripts via [Apache Spark's](https://spark.apache.org/) `spark-submit` functionality.

### Installation
The easiest way to install is using `pip`:

`pip install spark-submit`

To install from source:
```
git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install
```

For usage details, check `help(spark_submit)`.

### Usage Examples
Spark arguments can be provided either as keyword arguments or as an unpacked dictionary.

##### Simple example:
```
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())
```
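The same job can also be expressed with an unpacked dictionary instead of keyword arguments; a minimal equivalent sketch (the file path is a placeholder, as above):
```
from spark_submit import SparkJob

# equivalent to the keyword-argument form above
spark_args = {'master': 'local', 'name': 'simple-test'}

app = SparkJob('/path/some_file.py', **spark_args)
app.submit()

print(app.get_state())
```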
##### Another example:
```
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
    }

app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with `poll_time=x`
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state()) # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state()) # 'RUNNING'

print(app.get_state()) # 'FINISHED'
```
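Once the job has concluded, the outcome can be inspected with `get_code()` and `get_output()` (listed under Additional methods below); a minimal sketch:
```
# after app.concluded is True
if app.get_code() != 0:        # non-zero spark-submit return code means the job failed
    print(app.get_output())    # captured spark-submit stdout, useful for debugging
```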

#### Examples of translating a `spark-submit` command to a `spark_args` dictionary
Note the mapping: each CLI flag becomes a dictionary key by dropping the leading `--` and replacing dashes with underscores (e.g. `--deploy-mode` becomes `'deploy_mode'`).
##### A `client` example:
```
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark-submit-job \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
```
##### becomes
```
spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark-submit-job',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
    }
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
```
##### A `cluster` example:
```
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
```
##### becomes
```
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"], # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
    }
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
```

#### Testing

You can run some simple tests against local-mode Spark after cloning the repo.

First install the additional test requirements with `pip install -r tests/requirements.txt`, then run:

`pytest tests/`

`python tests/run_integration_test.py`


#### Additional methods

`spark_submit.system_info()`: Collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python, and the OS

`spark_submit.SparkJob.kill()`: Kills the running Spark job (cluster mode only)

`spark_submit.SparkJob.get_code()`: Gets the spark-submit return code

`spark_submit.SparkJob.get_output()`: Gets the spark-submit stdout

`spark_submit.SparkJob.get_id()`: Gets the spark-submit submission ID
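
A minimal sketch tying these helpers together, reusing the placeholder master URL and jar path from the cluster example above (`kill()` applies to cluster mode only):
```
import spark_submit
from spark_submit import SparkJob

print(spark_submit.system_info())  # versions of spark-submit, Scala, Java, PySpark, Python and the OS

app = SparkJob('s3a://mybucket/file.jar',
               master='spark://some.spark.master:6066',
               deploy_mode='cluster')
app.submit(poll_time=10)           # poll state in the background every 10 seconds

print(app.get_id())                # submission ID reported by spark-submit
app.kill()                         # cancel the running job (cluster mode only)
```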


### License

Released under [MIT](/LICENSE) by [@PApostol](https://github.com/PApostol).

- You can freely modify and reuse.
- The original license must be included with copies of this software.
- Please link back to this repo if you use a significant portion of the source code.

            
