google-spark-connect


Name: google-spark-connect
Version: 0.5.3
Home page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
Summary: Google client library for Spark Connect
Author: Google LLC
License: Apache 2.0
Upload time: 2025-02-28 22:04:45
# Google Spark Connect Client

A wrapper around the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client that adds
functionality allowing applications to communicate with a remote Dataproc
Spark cluster over the Spark Connect protocol without extra setup steps.

## Install

.. code-block:: console

      pip install google_spark_connect

## Uninstall

.. code-block:: console

      pip uninstall google_spark_connect


## Setup
This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set the following environment variables:

* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads.
* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workloads.
* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc).
* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The location of a session config file, such as `tests/integration/resources/session.textproto`.
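
For example, in a local shell (the project ID, region, and key path below are placeholders):

.. code-block:: console

      export GOOGLE_CLOUD_PROJECT=<your-project-id>
      export GOOGLE_CLOUD_REGION=us-central1
      export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
      # Optional: point at a session config file
      export DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG=tests/integration/resources/session.textproto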

## Usage

1. Install the latest versions of the Dataproc Python client and Google Spark Connect modules:

      .. code-block:: console

            pip install google_cloud_dataproc --force-reinstall
            pip install google_spark_connect --force-reinstall

2. Add the required import into your PySpark application or notebook:

      .. code-block:: python

            from google.cloud.spark_connect import GoogleSparkSession

3. There are two ways to create a Spark session (a sample config file and a short smoke test follow this list):

   1. Start a Spark session using the properties in the file referenced by `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:

      .. code-block:: python

            spark = GoogleSparkSession.builder.getOrCreate()

   2. Start a Spark session with the following code instead of using a config file:

      .. code-block:: python

            from google.cloud.dataproc_v1 import SparkConnectConfig
            from google.cloud.dataproc_v1 import Session
            google_session_config = Session()
            google_session_config.spark_connect_session = SparkConnectConfig()
            google_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
            google_session_config.runtime_config.version = '3.0'
            spark = GoogleSparkSession.builder.googleSessionConfig(google_session_config).getOrCreate()

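For reference, a session config file for option 1 might look like the following. This is only a sketch in protobuf text format, mirroring the `Session` fields set in option 2; `<subnet>` is a placeholder:

.. code-block:: proto

      # Hypothetical contents of session.textproto
      spark_connect_session {}
      environment_config {
        execution_config {
          subnetwork_uri: "<subnet>"
        }
      }
      runtime_config {
        version: "3.0"
      }

Whichever way it is created, the returned session behaves like a standard Spark Connect `SparkSession`, so a quick smoke test might be:

.. code-block:: python

      df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
      df.show()
      spark.stop()
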
## Billing
Because this client runs the Spark workload on Dataproc, your project is billed according to [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
This applies even if you run the client from outside Google Compute Engine.

## Contributing
### Building and Deploying the SDK

1. Install the requirements in a virtual environment.

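   If you do not already have a virtual environment, you can create and activate one first (a POSIX-shell sketch; the `.venv` name is just a convention):

      .. code-block:: console

            python -m venv .venv
            source .venv/bin/activate
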
      .. code-block:: console

            pip install -r requirements.txt

2. Build the code.

      .. code-block:: console

            python setup.py sdist bdist_wheel
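
   Note: `setup.py`-based builds are deprecated in recent setuptools releases. A sketch using the PyPA `build` frontend (assuming it is available) produces the same `dist/` artifacts:

      .. code-block:: console

            pip install build
            python -m build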


3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.

      .. code-block:: console

            VERSION=<version>
            gsutil cp dist/google_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>

4. In a Vertex AI notebook, download the new SDK, then uninstall the old version and install the new one.

      .. code-block:: console

            %%bash
            export VERSION=<version>
            gsutil cp gs://<your_bucket_name>/google_spark_connect-${VERSION}-py2.py3-none-any.whl .
            pip uninstall -y google_spark_connect
            pip install google_spark_connect-${VERSION}-py2.py3-none-any.whl

            
