# Google Spark Connect Client
A wrapper around the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client that adds the functionality applications need to communicate with a remote Dataproc Spark cluster over the Spark Connect protocol, without additional setup steps.
## Install
.. code-block:: console
pip install google_spark_connect
## Uninstall
.. code-block:: console
pip uninstall google_spark_connect
## Setup
This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set the following environment variables (a Python sketch for setting them follows the list):
* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload
* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The location of a session config file, such as `tests/integration/resources/session.textproto`
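If you are setting these from Python (for example, in a notebook), a minimal sketch follows; every value below is a placeholder to replace with your own, and the variables must be set before the session is created:
.. code-block:: python

import os

# Placeholder values; substitute your own project, region, and key file.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
# Optional: a default session config, e.g. a textproto file.
os.environ["DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG"] = "/path/to/session.textproto"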
## Usage
1. Install the latest versions of the Dataproc Python client and the Google Spark Connect module:
.. code-block:: console
pip install google_cloud_dataproc --force-reinstall
pip install google_spark_connect --force-reinstall
2. Add the required import to your PySpark application or notebook:
.. code-block:: python
from google.cloud.spark_connect import GoogleSparkSession
3. There are two ways to create a Spark session:
1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
.. code-block:: python
spark = GoogleSparkSession.builder.getOrCreate()
2. Start a Spark session with the following code instead of using a config file:
.. code-block:: python
from google.cloud.dataproc_v1 import Session, SparkConnectConfig

# Configure a Dataproc session that uses the Spark Connect protocol.
google_session_config = Session()
google_session_config.spark_connect_session = SparkConnectConfig()
# Subnetwork for the session; replace <subnet> with your subnetwork URI.
google_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
google_session_config.runtime_config.version = "3.0"
spark = GoogleSparkSession.builder.googleSessionConfig(google_session_config).getOrCreate()
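Either way, the returned `spark` object is a regular Spark Connect `SparkSession`, so standard DataFrame code runs against the remote Dataproc session. A quick smoke test (the data and column names are arbitrary):
.. code-block:: python

# Build a tiny DataFrame and run a filter on the remote session.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(df.id > 1).show()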
## Billing
As this client runs the Spark workload on Dataproc, your project is billed according to [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing). This applies even if you run the client from outside Google Cloud, for example from a non-GCE instance.
## Contributing
### Building and Deploying SDK
1. Install the requirements in a virtual environment.
.. code-block:: console
pip install -r requirements.txt
2. Build the code.
.. code-block:: console
python setup.py sdist bdist_wheel
3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
.. code-block:: console
export VERSION=<version>
gsutil cp dist/google_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
4. In a Vertex notebook, download the new SDK, then uninstall the old version and install the new one.
.. code-block:: console
%%bash
export VERSION=<version>
gsutil cp gs://<your_bucket_name>/google_spark_connect-${VERSION}-py2.py3-none-any.whl .
yes | pip uninstall google_spark_connect
pip install google_spark_connect-${VERSION}-py2.py3-none-any.whl
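5. Optionally, verify that the freshly built version is the one now installed (a quick sanity check; `importlib.metadata` requires Python 3.8+):
.. code-block:: python

from importlib.metadata import version

# Should print the version you set in setup.py.
print(version("google-spark-connect"))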