| Field | Value |
|---|---|
| Name | beam-pyspark-runner |
| Version | 0.0.3 |
| Summary | An Apache Beam pipeline runner built on Apache Spark's Python API |
| Homepage | https://github.com/moradology/beam-pyspark-runner |
| Repository | https://github.com/moradology/beam-pyspark-runner.git |
| Author | Nathan Zimmerman (npzimmerman@gmail.com) |
| Maintainer | None |
| Docs URL | None |
| Upload time | 2024-04-23 22:51:20 |
| Requires Python | >=3.7 |
| License | MIT |
| Keywords | virtualenv, dependencies |
| Requirements | No requirements were recorded |
| Travis-CI | No Travis |
| Coveralls test coverage | No coveralls |
# PySpark Apache Beam Runner
## Overview
(WHY? Doesn't Beam ship with a Spark runner?)
This project introduces a custom Apache Beam runner that executes pipelines on PySpark directly.
It is not a 'portability'-framework-compliant runner! It is designed for environments
where a SparkSession is available but a Spark master server is not, e.g. serverless
environments where jobs are triggered without a long-running cluster, sidestepping the
expectations of Beam's default Spark runner.

The other benefit of this strategy is that it keeps the stack as Python-centric as
possible. Pipeline compilation, optimization, and execution planning all happen in
Python (for better or worse). Depending on your needs, this can be a significant
advantage.
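To illustrate the general idea (a conceptual sketch, not this package's actual internals), a runner of this kind walks the Beam pipeline graph and maps each transform onto an equivalent Spark operation: `ParDo` behaves like `flatMap`, and `GroupByKey` like Spark's `groupByKey`. Here plain Python lists stand in for RDDs so the example is self-contained:

```python
# Conceptual sketch only: lists stand in for RDDs to show how Beam-style
# transforms map onto Spark-style operations.

def par_do(elements, fn):
    # Beam ParDo ~ Spark flatMap: fn may emit zero or more outputs per element
    out = []
    for e in elements:
        out.extend(fn(e))
    return out

def group_by_key(pairs):
    # Beam GroupByKey ~ Spark groupByKey: collect values under each key
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return sorted(groups.items())

# A tiny word-count "pipeline" expressed through those primitives
words = ["beam", "spark", "beam"]
pairs = par_do(words, lambda w: [(w, 1)])
counts = [(k, sum(vs)) for k, vs in group_by_key(pairs)]
print(counts)  # [('beam', 2), ('spark', 1)]
```

A real runner performs this translation over the whole pipeline graph before handing the resulting plan to Spark for execution.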
## Features
- **Direct Integration with PySpark**: Runs pipelines on an assumed, already-available SparkSession rather than submitting to a Spark master.
- **Serverless Compatibility**: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks.
- **Simplified Setup**: Potentially reduces the complexity of job submission by avoiding the need for port listening on a Spark master.
## Getting Started
### Prerequisites
- Apache Spark
- Apache Beam
- Python 3.8 or later
### Installation
To use this custom runner, `pip install` it as you would any other library:
```bash
pip install beam-pyspark-runner
```
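Once installed, usage presumably follows the standard Beam pattern of passing a runner instance to the pipeline. The import path and class name `PySparkRunner` below are assumptions for illustration only; check the project repository for the actual names.

```python
# Hypothetical usage sketch: the import path and runner class name are
# assumptions, not confirmed by this page.
import apache_beam as beam
from beam_pyspark_runner import PySparkRunner  # assumed name/location

with beam.Pipeline(runner=PySparkRunner()) as p:
    (p
     | "Create" >> beam.Create(["beam", "spark", "beam"])
     | "Pair" >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```

Because the runner drives an existing SparkSession, no `spark-submit` step or master URL should be required beyond whatever created that session.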