google-spark-connect


Name: google-spark-connect
Version: 0.5.2
Home page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
Summary: Google client library for Spark Connect
Upload time: 2025-02-11 21:49:27
Maintainer: None
Docs URL: None
Author: Google LLC
Requires Python: None
License: Apache 2.0
Keywords: None
Requirements: No requirements were recorded.
# Google Spark Connect Client

A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
additional functionality that lets applications communicate with a remote Dataproc
Spark cluster over the Spark Connect protocol without any extra setup steps.

## Install

.. code-block:: console

      pip install google_spark_connect
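
To confirm the install (optional), ask pip for the package metadata:

.. code-block:: console

      pip show google_spark_connect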

## Uninstall

.. code-block:: console

      pip uninstall google_spark_connect


## Setup
This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set the following environment variables:

* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads.
* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc).
* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The location of a session config file, such as `tests/integration/resources/session.textproto`.
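
For example, when running the client locally you might export them in your shell (all values below are placeholders):

.. code-block:: console

      export GOOGLE_CLOUD_PROJECT=<your-project-id>
      export GOOGLE_CLOUD_REGION=us-central1
      export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json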

## Usage

1. Install the latest versions of the Dataproc Python client and Google Spark Connect modules:

      .. code-block:: console

            pip install google_cloud_dataproc --force-reinstall
            pip install google_spark_connect --force-reinstall

2. Add the required import into your PySpark application or notebook:

      .. code-block:: python

            from google.cloud.spark_connect import GoogleSparkSession

3. There are two ways to create a Spark session:

   1. Start a Spark session using the properties defined in the file referenced by `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` (see the sketch after this list):

      .. code-block:: python

            spark = GoogleSparkSession.builder.getOrCreate()

   2. Start a Spark session with the following code instead of using a config file:

      .. code-block:: python

            from google.cloud.dataproc_v1 import SparkConnectConfig
            from google.cloud.dataproc_v1 import Session

            # Configure a Dataproc Serverless session that uses Spark Connect.
            google_session_config = Session()
            google_session_config.spark_connect_session = SparkConnectConfig()
            google_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
            google_session_config.runtime_config.version = '3.0'
            spark = GoogleSparkSession.builder.googleSessionConfig(google_session_config).getOrCreate()
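
   For option 1, a minimal sketch of what the config file might contain, assuming it is a text-format `Session` proto carrying the same fields set in option 2 (both values are placeholders):

      .. code-block:: text

            environment_config {
              execution_config {
                subnetwork_uri: "<subnet>"
              }
            }
            runtime_config {
              version: "3.0"
            }

   Whichever way the session is created, it behaves as a standard PySpark `SparkSession` backed by Spark Connect, so a one-line query is enough to confirm the remote session works:

      .. code-block:: python

            # Smoke test: executes on the remote Dataproc session.
            spark.sql("SELECT 1 AS ok").show()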

## Billing
As this client runs the Spark workload on Dataproc, your project is billed according to [Dataproc Serverless pricing](https://cloud.google.com/dataproc-serverless/pricing).
This applies even if you run the client from outside Google Cloud, for example from a non-GCE machine.

## Contributing
### Building and Deploying SDK

1. Install the requirements in a virtual environment.

      .. code-block:: console

            pip install -r requirements.txt

2. Build the code.

      .. code-block:: console

            python setup.py sdist bdist_wheel


3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.

      .. code-block:: console

            export VERSION=<version>
            gsutil cp dist/google_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>

4. Download the new SDK on Vertex, then uninstall the old version and install the new one.

      .. code-block:: console

            %%bash
            export VERSION=<version>
            gsutil cp gs://<your_bucket_name>/google_spark_connect-${VERSION}-py2.py3-none-any.whl .
            yes | pip uninstall google_spark_connect
            pip install google_spark_connect-${VERSION}-py2.py3-none-any.whl
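
   To verify that the reinstall picked up the new version, a quick check (not part of the release flow itself) is:

      .. code-block:: console

            pip show google_spark_connect
            python -c "from google.cloud.spark_connect import GoogleSparkSession"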

            
