# Dataproc Spark Connect Client
A wrapper around the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client that adds
functionality for communicating with a remote Dataproc Spark cluster over the Spark Connect
protocol without additional setup steps.
## Install
.. code-block:: console
pip install dataproc_spark_connect
## Uninstall
.. code-block:: console
pip uninstall dataproc_spark_connect
## Setup
This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set the following environment variables (see the sketch after this list):
* `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark workloads.
* `GOOGLE_CLOUD_REGION` - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
* `GOOGLE_APPLICATION_CREDENTIALS` - The path to your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc) file.
* `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` (Optional) - The location of a session config file, such as `tests/integration/resources/session.textproto`.
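If you are working in a notebook, you can also set these variables from Python before creating the session. A minimal sketch, using placeholder values that you would replace with your own project, region, and credentials path:

.. code-block:: python

    import os

    # Placeholder values for illustration; replace with your own settings.
    os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
    os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"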
## Usage
1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
.. code-block:: console
pip install google_cloud_dataproc --force-reinstall
pip install dataproc_spark_connect --force-reinstall
2. Add the required import into your PySpark application or notebook:
.. code-block:: python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
3. There are two ways to create a Spark session (a short usage sketch follows this list):
1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
.. code-block:: python
spark = DataprocSparkSession.builder.getOrCreate()
2. Start a Spark session with the following code instead of using a config file:
.. code-block:: python
from google.cloud.dataproc_v1 import SparkConnectConfig
from google.cloud.dataproc_v1 import Session
dataproc_config = Session()
dataproc_config.spark_connect_session = SparkConnectConfig()
dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
dataproc_config.runtime_config.version = '3.0'
spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
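Once created, the session behaves like a regular Spark Connect `SparkSession`. A minimal sketch of using and then releasing it (the DataFrame contents here are purely illustrative):

.. code-block:: python

    # Run a trivial query against the remote Dataproc session.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.show()

    # Stop the session when done to release the remote resources.
    spark.stop()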
## Billing
Because this client runs Spark workloads on Dataproc, your project is billed according to [Dataproc Serverless pricing](https://cloud.google.com/dataproc-serverless/pricing).
This applies even when you run the client from outside Google Cloud, for example from a machine that is not a Compute Engine instance.
## Contributing
### Building and Deploying SDK
1. Install the requirements in a virtual environment.
.. code-block:: console
pip install -r requirements.txt
2. Build the code.
.. code-block:: console
python setup.py sdist bdist_wheel
3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
.. code-block:: console
    VERSION=<version>
    gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
.. code-block:: console
%%bash
export VERSION=<version>
gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
yes | pip uninstall dataproc_spark_connect
pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
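After reinstalling, you may want to confirm which version the notebook kernel picks up (a kernel restart may be needed first). A minimal sketch using only the standard library:

.. code-block:: python

    # Print the installed package version as seen by the current kernel.
    import importlib.metadata

    print(importlib.metadata.version("dataproc-spark-connect"))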