# Airflow connector to Ocean for Apache Spark
An Airflow plugin and provider to launch and monitor Spark
applications on [Ocean for
Apache Spark](https://spot.io/products/ocean-apache-spark/).
## Installation
```
pip install ocean-spark-airflow-provider
```
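You can check that Airflow discovered the provider with (assumes a working Airflow 2 installation):
```
airflow providers list
```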
## Usage
For general usage of Ocean for Apache Spark, refer to the [official
documentation](https://docs.spot.io/ocean-spark/getting-started/?id=get-started-with-ocean-for-apache-spark).
### Setting up the connection
In the connection menu, register a new connection of type **Ocean for
Apache Spark**. The default connection name is `ocean_spark_default`. You will
need:
 - The ID of the Ocean Spark cluster you just created (of the
   format `osc-e4089a00`). You can find it in the Spot console in the
   [list of clusters](https://docs.spot.io/ocean-spark/product-tour/manage-clusters),
   or by using the [Cluster List](https://docs.spot.io/api/#operation/OceanSparkClusterList) API.
 - A [Spot API token](https://docs.spot.io/administration/api/create-api-token?id=create-an-api-token)
   to interact with the Spot API.
![connection setup dialog](./images/connection_setup.png)
The **Ocean for Apache Spark** connection type is not available in Airflow
1. Instead, create an **HTTP** connection, fill in your cluster ID as the
**host**, and your API token as the **password**.
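Connections can also be supplied through environment variables instead of the UI. A minimal sketch, assuming the default connection name and using the HTTP form described above (URL-encode the token if it contains special characters):
```
export AIRFLOW_CONN_OCEAN_SPARK_DEFAULT='http://:<your-api-token>@osc-e4089a00'
```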
You will need to create a separate connection for each Ocean Spark
cluster that you want to use with Airflow. In the
`OceanSparkOperator`, you can select which Ocean Spark connection to
use with the `connection_name` argument (which defaults to
`ocean_spark_default`). For example, you may choose to have one
Ocean Spark cluster per environment (dev, staging, prod), and target
an environment by picking the corresponding Airflow connection, as
sketched below.
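A minimal sketch (the `ocean_spark_prod` connection name and job identifiers are hypothetical):
```python
from ocean_spark.operators import OceanSparkOperator

# Run on the production cluster by selecting its dedicated connection.
# "ocean_spark_prod" is a hypothetical connection you would register yourself.
prod_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi-prod",
    connection_name="ocean_spark_prod",
    dag=dag,  # assumes a DAG object, as defined in the examples below
)
```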
### Using the Spark operator
```python
from airflow import DAG
from airflow.utils import dates

from ocean_spark.operators import OceanSparkOperator

# DAG creation (the schedule and start date below are illustrative)
dag = DAG(
    dag_id="spark-pi",
    schedule_interval=None,
    start_date=dates.days_ago(1),
)

spark_pi_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi",
    dag=dag,
    config_overrides={
        "type": "Scala",
        "sparkVersion": "3.2.0",
        "image": "gcr.io/datamechanics/spark:platform-3.2-latest",
        "imagePullPolicy": "IfNotPresent",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
        "arguments": ["10000"],
        "driver": {
            "cores": 1,
            "spot": False,  # Python booleans, not JSON true/false
        },
        "executor": {
            "cores": 4,
            "instances": 1,
            "spot": True,
            "instanceSelector": "r5",
        },
    },
)
```
### Using the Spark Connect operator (available since Airflow 2.6.2)
```python
from airflow import DAG
from airflow.utils import dates

from ocean_spark.operators import OceanSparkConnectOperator

args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": dates.days_ago(0, second=1),
}

dag = DAG(dag_id="spark-connect-task", default_args=args, schedule_interval=None)

spark_connect_task = OceanSparkConnectOperator(
    task_id="spark-connect",
    dag=dag,
)
```
### Triggering the DAG with a config

Trigger the DAG with a run configuration such as:
```json
{
"sql": "select random()"
}
```
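For example, you can pass this configuration from the Airflow CLI when triggering the DAG defined above (a sketch; the "Trigger DAG w/ config" option in the UI works as well):
```
airflow dags trigger spark-connect-task --conf '{"sql": "select random()"}'
```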
More examples are available for [Airflow 2](./deploy/airflow2/dags).
## Test locally
You can test the plugin locally using the Docker Compose setup in this
repository. Run `make serve_airflow` at the root of the repository to
launch an instance of Airflow 2 with the provider already installed.