# spark-connect-proxy
A reverse proxy server which allows secure connectivity to a Spark Connect server.
[<img src="https://img.shields.io/badge/GitHub-gizmodata%2Fspark--connect--proxy-blue.svg?logo=Github">](https://github.com/gizmodata/spark-connect-proxy)
[![spark-connect-proxy-ci](https://github.com/gizmodata/spark-connect-proxy/actions/workflows/ci.yml/badge.svg)](https://github.com/gizmodata/spark-connect-proxy/actions/workflows/ci.yml)
[![Supported Python Versions](https://img.shields.io/pypi/pyversions/spark-connect-proxy)](https://pypi.org/project/spark-connect-proxy/)
[![PyPI version](https://badge.fury.io/py/spark-connect-proxy.svg)](https://badge.fury.io/py/spark-connect-proxy)
[![PyPI Downloads](https://img.shields.io/pypi/dm/spark-connect-proxy.svg)](https://pypi.org/project/spark-connect-proxy/)
# Why?
Because [Spark Connect does NOT provide authentication and/or TLS encryption out of the box](https://spark.apache.org/docs/latest/spark-connect-overview.html#client-application-authentication). This project provides a reverse proxy server which can be used to secure the connection to a Spark Connect server.
# Setup (to run locally)
## Install Python package
You can install `spark-connect-proxy` from PyPI or from source.
### Option 1 - from PyPI
```shell
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
pip install spark-connect-proxy[client]
```
### Option 2 - from source - for development
```shell
git clone https://github.com/gizmodata/spark-connect-proxy
cd spark-connect-proxy
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel
# Install Spark Connect Proxy - in editable mode with client and dev dependencies
pip install --editable .[client,dev]
```
### Note
For the following commands, if you are running from source in `--editable` mode (for development purposes), you will need to set the `PYTHONPATH` environment variable as follows:
```shell
export PYTHONPATH=$(pwd)/src
```
### Usage
This repo contains scripts that provision an AWS EMR Spark cluster fronted by a secure Spark Connect Proxy server, so you can connect to it securely from a remote machine.
First, you'll need to open a port for public access to the AWS EMR Spark cluster, in addition to the `ssh` port (`22`). Add port `50051` as shown here (or via the AWS CLI, as sketched after the note below):
![Open port 50051](images/emr-public-access.png?raw=true "Open port 50051")
> [!NOTE]
> Even though you are opening this port to the public, the Spark Connect Proxy will secure it with TLS and JWT Authentication.
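If you prefer the AWS CLI to the console, here is a minimal sketch of opening the port; the security group ID is a placeholder - substitute the one attached to your EMR primary (master) node:
```shell
# Hypothetical sketch - replace sg-0123456789abcdef0 with the security group ID
# attached to your EMR primary (master) node
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 50051 \
  --cidr 0.0.0.0/0
```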
The scripts use the AWS CLI to provision the EMR Spark cluster, so you will need to have the [AWS CLI installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and configured with your AWS credentials.
You can create a file named `.env` in your local copy of the `scripts` directory with the following contents:
```shell
export AWS_ACCESS_KEY_ID="put value from AWS here"
export AWS_SECRET_ACCESS_KEY="put value from AWS here"
export AWS_SESSION_TOKEN="put value from AWS here"
export AWS_REGION="us-east-2"
```
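The AWS CLI reads these credentials from environment variables, so if the provisioning script does not source this file for you (an assumption - check the script), you can source it yourself in the shell you run the script from:
```shell
# Load the AWS credentials into the current shell (assumes scripts/.env exists as shown above)
source scripts/.env
```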
To provision the EMR Spark cluster, run the following command from the root directory of this repo:
```shell
scripts/provision_emr_spark_cluster.sh
```
That will output several files (which are git-ignored for security reasons):
- `tls/ca.crt` - the TLS certificate generated by the EMR Spark cluster - needed for your PySpark client to trust the Spark Connect Proxy server (because it is self-signed)
- `scripts/output/instance_details.txt` - shows the `ssh` command for connecting to the master node of the EMR Spark cluster
- `scripts/output/spark_connect_proxy_details.log` - shows how to run an example PySpark Ibis client, which connects securely from your local computer to the remote EMR Spark cluster. Example command:
```shell
spark-connect-proxy-ibis-client-example \
--host ec2-01-01-01-01.us-east-2.compute.amazonaws.com \
--port 50051 \
--use-tls \
--tls-roots tls/ca.crt \
--token honey.badger.dontcare
```
> [!IMPORTANT]
> You must install the `spark-connect-proxy` package with the `[client]` extras on the client computer to run the `spark-connect-proxy-ibis-client-example` command.
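If you prefer connecting with plain PySpark rather than the bundled Ibis example, here is a minimal sketch. It assumes the `[client]` extras pull in `pyspark`, that the proxy honors the standard Spark Connect connection-string parameters (`use_ssl`, `token`), and that gRPC's `GRPC_DEFAULT_SSL_ROOTS_FILE_PATH` environment variable can be used to trust the self-signed `tls/ca.crt`:
```shell
# Trust the self-signed CA generated by the provisioning scripts
export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="$(pwd)/tls/ca.crt"

# Start a PySpark shell against the proxy, passing the JWT in the connection string
pyspark --remote "sc://ec2-01-01-01-01.us-east-2.compute.amazonaws.com:50051/;use_ssl=true;token=honey.badger.dontcare"
```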
### Handy development commands
#### Version management
##### Bump the version of the application (you must have installed from source with the `[dev]` extras)
```bash
bumpver update --patch
```