# Airflow connector to Ocean for Apache Spark
An Airflow plugin and provider to launch and monitor Spark
applications on [Ocean for
Apache Spark](https://spot.io/products/ocean-apache-spark/).
## Installation
```
pip install ocean-spark-airflow-provider
```
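You can check that Airflow discovered the provider with (assumes a working Airflow 2 installation):
```
airflow providers list
```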
## Usage
For general usage of Ocean for Apache Spark, refer to the [official
documentation](https://docs.spot.io/ocean-spark/getting-started/?id=get-started-with-ocean-for-apache-spark).
### Setting up the connection
In the connection menu, register a new connection of type **Ocean for
Apache Spark**. The default connection name is `ocean_spark_default`. You will
need:
 - The ID of the Ocean Spark cluster you just created (of the
   format `osc-e4089a00`). You can find it in the Spot console in the
   [list of clusters](https://docs.spot.io/ocean-spark/product-tour/manage-clusters),
   or by using the [Cluster List](https://docs.spot.io/api/#operation/OceanSparkClusterList) API.
 - A [Spot API token](https://docs.spot.io/administration/api/create-api-token?id=create-an-api-token)
   to interact with the Spot API.
![connection setup dialog](./images/connection_setup.png)
The **Ocean for Apache Spark** connection type is not available in Airflow
1. Instead, create an **HTTP** connection, fill in your cluster ID as the
**host**, and your API token as the **password**.
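Connections can also be supplied through environment variables instead of the UI. A minimal sketch, assuming the default connection name and using the HTTP form described above (URL-encode the token if it contains special characters):
```
export AIRFLOW_CONN_OCEAN_SPARK_DEFAULT='http://:<your-api-token>@osc-e4089a00'
```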
You will need to create a separate connection for each Ocean Spark
cluster that you want to use with Airflow. In the
`OceanSparkOperator`, you can select which Ocean Spark connection to
use with the `connection_name` argument (which defaults to
`ocean_spark_default`). For example, you may choose to have one
Ocean Spark cluster per environment (dev, staging, prod), and target
an environment by picking the corresponding Airflow connection, as
sketched below.
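A minimal sketch (the `ocean_spark_prod` connection name and job identifiers are hypothetical):
```python
from ocean_spark.operators import OceanSparkOperator

# Run on the production cluster by selecting its dedicated connection.
# "ocean_spark_prod" is a hypothetical connection you would register yourself.
prod_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi-prod",
    connection_name="ocean_spark_prod",
    dag=dag,  # assumes a DAG object, as defined in the examples below
)
```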
### Using the Spark operator
```python
from airflow import DAG
from airflow.utils import dates

from ocean_spark.operators import OceanSparkOperator

# DAG creation (the schedule and start date below are illustrative)
dag = DAG(
    dag_id="spark-pi",
    schedule_interval=None,
    start_date=dates.days_ago(1),
)

spark_pi_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi",
    dag=dag,
    config_overrides={
        "type": "Scala",
        "sparkVersion": "3.2.0",
        "image": "gcr.io/datamechanics/spark:platform-3.2-latest",
        "imagePullPolicy": "IfNotPresent",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
        "arguments": ["10000"],
        "driver": {
            "cores": 1,
            "spot": False,  # Python booleans, not JSON true/false
        },
        "executor": {
            "cores": 4,
            "instances": 1,
            "spot": True,
            "instanceSelector": "r5",
        },
    },
)
```
### Using the Spark Connect operator (available since Airflow 2.6.2)
```python
from airflow import DAG
from airflow.utils import dates

from ocean_spark.operators import OceanSparkConnectOperator

args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": dates.days_ago(0, second=1),
}

dag = DAG(dag_id="spark-connect-task", default_args=args, schedule_interval=None)

spark_connect_task = OceanSparkConnectOperator(
    task_id="spark-connect",
    dag=dag,
)
```
### Triggering the DAG with a config

Trigger the DAG with a run configuration such as:
```json
{
"sql": "select random()"
}
```
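For example, you can pass this configuration from the Airflow CLI when triggering the DAG defined above (a sketch; the "Trigger DAG w/ config" option in the UI works as well):
```
airflow dags trigger spark-connect-task --conf '{"sql": "select random()"}'
```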
More examples are available for [Airflow 2](./deploy/airflow2/dags).
## Test locally
You can test the plugin locally using the Docker Compose setup in this
repository. Run `make serve_airflow` at the root of the repository to
launch an instance of Airflow 2 with the provider already installed.