astro-provider-databricks


Nameastro-provider-databricks JSON
Version 0.3.0 PyPI version JSON
download
home_pageNone
SummaryAffordable Databricks Workflows in Apache Airflow
upload_time2024-08-08 09:55:08
maintainerNone
docs_urlNone
authorNone
requires_python>=3.7
licenseNone
keywords airflow apache-airflow astronomer dags
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            ## Deprecation Notice

With the release ``0.3.0`` of the ``astro-provider-databricks`` package, this provider stands deprecated and will
no longer receive updates. We recommend migrating to the official ``apache-airflow-providers-databricks>= 6.8.0`` for the latest features and support.
For the operators and sensors that are deprecated in this repository, migrating to the official Apache Airflow Databricks Provider
is as simple as changing the import path in your DAG code as per the below examples.

| Previous import path used (Deprecated now)                                    | Suggested import path to use                                                                                |
|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| `from astro_databricks.operators.notebook import DatabricksNotebookOperator`  | `from airflow.providers.databricks.operators.databricks import DatabricksNotebookOperator`                  |
| `from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup` | `from airflow.providers.databricks.operators.databricks_workflow import DatabricksWorkflowTaskGroup`        |
| `from astro_databricks.operators.common import DatabricksTaskOperator`        | `from airflow.providers.databricks.operators.databricks import DatabricksTaskOperator`                      |
| `from astro_databricks.plugins.plugin import AstroDatabricksPlugin`           | `from airflow.providers.airflow.providers.databricks.plugins.databricks_workflow import DatabricksWorkflowPlugin` |

# Archives

<h1 align="center">
  Databricks Workflows in Airflow
</h1>

The Astro Databricks Provider is an Apache Airflow provider to write [Databricks Workflows](https://docs.databricks.com/workflows/index.html) using Airflow as the authoring interface. Running your Databricks notebooks as Databricks Workflows can result in a [75% cost reduction](https://www.databricks.com/product/pricing) ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute).

While this is maintained by Astronomer, it's available to anyone using Airflow - you don't need to be an Astronomer customer to use it.

There are a few advantages to defining your Databricks Workflows in Airflow:

|                                      |        via Databricks         |      via Airflow       |
| :----------------------------------- | :---------------------------: | :--------------------: |
| Authoring interface                  | _Web-based via Databricks UI_ | _Code via Airflow DAG_ |
| Workflow compute pricing             |              ✅               |           ✅           |
| Notebook code in source control      |              ✅               |           ✅           |
| Workflow structure in source control |              ✅               |           ✅           |
| Retry from beginning                 |              ✅               |           ✅           |
| Retry single task                    |              ✅               |           ✅           |
| Task groups within Workflows         |                               |           ✅           |
| Trigger workflows from other DAGs    |                               |           ✅           |
| Workflow-level parameters            |                               |           ✅           |

## Example

The following Airflow DAG illustrates how to use the `DatabricksTaskGroup` and `DatabricksNotebookOperator` to define a Databricks Workflow in Airflow:

```python
from pendulum import datetime

from airflow.decorators import dag, task_group
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

# define your cluster spec - can have from 1 to many clusters
job_cluster_spec = [
   {
      "job_cluster_key": "astro_databricks",
      "new_cluster": {
         "cluster_name": "",
         # ...
      },
   }
]

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def databricks_workflow_example():
   # the task group is a context manager that will create a Databricks Workflow
   with DatabricksWorkflowTaskGroup(
      group_id="example_databricks_workflow",
      databricks_conn_id="databricks_default",
      job_clusters=job_cluster_spec,
      # you can specify common fields here that get shared to all notebooks
      notebook_packages=[
         { "pypi": { "package": "pandas" } },
      ],
      # notebook_params supports templating
      notebook_params={
         "start_time": "{{ ds }}",
      }
   ) as workflow:
      notebook_1 = DatabricksNotebookOperator(
         task_id="notebook_1",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_1",
         source="WORKSPACE",
         # job_cluster_key corresponds to the job_cluster_key in the job_cluster_spec
         job_cluster_key="astro_databricks",
         # you can add to packages & params at the task level
         notebook_packages=[
            { "pypi": { "package": "scikit-learn" } },
         ],
         notebook_params={
            "end_time": "{{ macros.ds_add(ds, 1) }}",
         }
      )

      # you can embed task groups for easier dependency management
      @task_group(group_id="inner_task_group")
      def inner_task_group():
         notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_2",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

         notebook_3 = DatabricksNotebookOperator(
            task_id="notebook_3",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_3",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

      notebook_4 = DatabricksNotebookOperator(
         task_id="notebook_4",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_4",
         source="WORKSPACE",
         job_cluster_key="astro_databricks",
      )

      notebook_1 >> inner_task_group() >> notebook_4

   trigger_workflow_2 = TriggerDagRunOperator(
      task_id="trigger_workflow_2",
      trigger_dag_id="workflow_2",
      execution_date="{{ next_execution_date }}",
   )

   workflow >> trigger_workflow_2

databricks_workflow_example_dag = databricks_workflow_example()
```

### Airflow UI

![Airflow UI](https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/docs/_static/screenshots/workflow_1_airflow.png)

### Databricks UI

![Databricks UI](https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/docs/_static/screenshots/workflow_1_databricks.png)

## Quickstart

Check out the following quickstart guides:

- [With the Astro CLI](quickstart/astro-cli.md)
- [Without the Astro CLI](quickstart/without-astro-cli.md)

## Documentation

The documentation is a work in progress--we aim to follow the [Diátaxis](https://diataxis.fr/) system:

- [Reference Guide](https://astronomer.github.io/astro-provider-databricks/)

## Changelog

Astro Databricks follows [semantic versioning](https://semver.org/) for releases. Read [changelog](CHANGELOG.rst) to understand more about the changes introduced to each version.

## Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the [Contribution Guidelines](docs/contributing.rst) for a detailed overview on how to contribute.

Contributors and maintainers should abide by the [Contributor Code of Conduct](CODE_OF_CONDUCT.md).

## License

[Apache Licence 2.0](LICENSE)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "astro-provider-databricks",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "airflow, apache-airflow, astronomer, dags",
    "author": null,
    "author_email": "Astronomer <humans@astronomer.io>",
    "download_url": "https://files.pythonhosted.org/packages/df/10/0ede52a8e62931c454e5759b552c3815a48637446f3e4fb893465cee102c/astro_provider_databricks-0.3.0.tar.gz",
    "platform": null,
    "description": "## Deprecation Notice\n\nWith the release ``0.3.0`` of the ``astro-provider-databricks`` package, this provider stands deprecated and will\nno longer receive updates. We recommend migrating to the official ``apache-airflow-providers-databricks>= 6.8.0`` for the latest features and support.\nFor the operators and sensors that are deprecated in this repository, migrating to the official Apache Airflow Databricks Provider\nis as simple as changing the import path in your DAG code as per the below examples.\n\n| Previous import path used (Deprecated now)                                    | Suggested import path to use                                                                                |\n|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|\n| `from astro_databricks.operators.notebook import DatabricksNotebookOperator`  | `from airflow.providers.databricks.operators.databricks import DatabricksNotebookOperator`                  |\n| `from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup` | `from airflow.providers.databricks.operators.databricks_workflow import DatabricksWorkflowTaskGroup`        |\n| `from astro_databricks.operators.common import DatabricksTaskOperator`        | `from airflow.providers.databricks.operators.databricks import DatabricksTaskOperator`                      |\n| `from astro_databricks.plugins.plugin import AstroDatabricksPlugin`           | `from airflow.providers.airflow.providers.databricks.plugins.databricks_workflow import DatabricksWorkflowPlugin` |\n\n# Archives\n\n<h1 align=\"center\">\n  Databricks Workflows in Airflow\n</h1>\n\nThe Astro Databricks Provider is an Apache Airflow provider to write [Databricks Workflows](https://docs.databricks.com/workflows/index.html) using Airflow as the authoring interface. Running your Databricks notebooks as Databricks Workflows can result in a [75% cost reduction](https://www.databricks.com/product/pricing) ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute).\n\nWhile this is maintained by Astronomer, it's available to anyone using Airflow - you don't need to be an Astronomer customer to use it.\n\nThere are a few advantages to defining your Databricks Workflows in Airflow:\n\n|                                      |        via Databricks         |      via Airflow       |\n| :----------------------------------- | :---------------------------: | :--------------------: |\n| Authoring interface                  | _Web-based via Databricks UI_ | _Code via Airflow DAG_ |\n| Workflow compute pricing             |              \u2705               |           \u2705           |\n| Notebook code in source control      |              \u2705               |           \u2705           |\n| Workflow structure in source control |              \u2705               |           \u2705           |\n| Retry from beginning                 |              \u2705               |           \u2705           |\n| Retry single task                    |              \u2705               |           \u2705           |\n| Task groups within Workflows         |                               |           \u2705           |\n| Trigger workflows from other DAGs    |                               |           \u2705           |\n| Workflow-level parameters            |                               |           \u2705           |\n\n## Example\n\nThe following Airflow DAG illustrates how to use the `DatabricksTaskGroup` and `DatabricksNotebookOperator` to define a Databricks Workflow in Airflow:\n\n```python\nfrom pendulum import datetime\n\nfrom airflow.decorators import dag, task_group\nfrom airflow.operators.trigger_dagrun import TriggerDagRunOperator\nfrom astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup\n\n# define your cluster spec - can have from 1 to many clusters\njob_cluster_spec = [\n   {\n      \"job_cluster_key\": \"astro_databricks\",\n      \"new_cluster\": {\n         \"cluster_name\": \"\",\n         # ...\n      },\n   }\n]\n\n@dag(start_date=datetime(2023, 1, 1), schedule_interval=\"@daily\", catchup=False)\ndef databricks_workflow_example():\n   # the task group is a context manager that will create a Databricks Workflow\n   with DatabricksWorkflowTaskGroup(\n      group_id=\"example_databricks_workflow\",\n      databricks_conn_id=\"databricks_default\",\n      job_clusters=job_cluster_spec,\n      # you can specify common fields here that get shared to all notebooks\n      notebook_packages=[\n         { \"pypi\": { \"package\": \"pandas\" } },\n      ],\n      # notebook_params supports templating\n      notebook_params={\n         \"start_time\": \"{{ ds }}\",\n      }\n   ) as workflow:\n      notebook_1 = DatabricksNotebookOperator(\n         task_id=\"notebook_1\",\n         databricks_conn_id=\"databricks_default\",\n         notebook_path=\"/Shared/notebook_1\",\n         source=\"WORKSPACE\",\n         # job_cluster_key corresponds to the job_cluster_key in the job_cluster_spec\n         job_cluster_key=\"astro_databricks\",\n         # you can add to packages & params at the task level\n         notebook_packages=[\n            { \"pypi\": { \"package\": \"scikit-learn\" } },\n         ],\n         notebook_params={\n            \"end_time\": \"{{ macros.ds_add(ds, 1) }}\",\n         }\n      )\n\n      # you can embed task groups for easier dependency management\n      @task_group(group_id=\"inner_task_group\")\n      def inner_task_group():\n         notebook_2 = DatabricksNotebookOperator(\n            task_id=\"notebook_2\",\n            databricks_conn_id=\"databricks_default\",\n            notebook_path=\"/Shared/notebook_2\",\n            source=\"WORKSPACE\",\n            job_cluster_key=\"astro_databricks\",\n         )\n\n         notebook_3 = DatabricksNotebookOperator(\n            task_id=\"notebook_3\",\n            databricks_conn_id=\"databricks_default\",\n            notebook_path=\"/Shared/notebook_3\",\n            source=\"WORKSPACE\",\n            job_cluster_key=\"astro_databricks\",\n         )\n\n      notebook_4 = DatabricksNotebookOperator(\n         task_id=\"notebook_4\",\n         databricks_conn_id=\"databricks_default\",\n         notebook_path=\"/Shared/notebook_4\",\n         source=\"WORKSPACE\",\n         job_cluster_key=\"astro_databricks\",\n      )\n\n      notebook_1 >> inner_task_group() >> notebook_4\n\n   trigger_workflow_2 = TriggerDagRunOperator(\n      task_id=\"trigger_workflow_2\",\n      trigger_dag_id=\"workflow_2\",\n      execution_date=\"{{ next_execution_date }}\",\n   )\n\n   workflow >> trigger_workflow_2\n\ndatabricks_workflow_example_dag = databricks_workflow_example()\n```\n\n### Airflow UI\n\n![Airflow UI](https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/docs/_static/screenshots/workflow_1_airflow.png)\n\n### Databricks UI\n\n![Databricks UI](https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/docs/_static/screenshots/workflow_1_databricks.png)\n\n## Quickstart\n\nCheck out the following quickstart guides:\n\n- [With the Astro CLI](quickstart/astro-cli.md)\n- [Without the Astro CLI](quickstart/without-astro-cli.md)\n\n## Documentation\n\nThe documentation is a work in progress--we aim to follow the [Di\u00e1taxis](https://diataxis.fr/) system:\n\n- [Reference Guide](https://astronomer.github.io/astro-provider-databricks/)\n\n## Changelog\n\nAstro Databricks follows [semantic versioning](https://semver.org/) for releases. Read [changelog](CHANGELOG.rst) to understand more about the changes introduced to each version.\n\n## Contribution guidelines\n\nAll contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.\n\nRead the [Contribution Guidelines](docs/contributing.rst) for a detailed overview on how to contribute.\n\nContributors and maintainers should abide by the [Contributor Code of Conduct](CODE_OF_CONDUCT.md).\n\n## License\n\n[Apache Licence 2.0](LICENSE)\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Affordable Databricks Workflows in Apache Airflow",
    "version": "0.3.0",
    "project_urls": {
        "Documentation": "https://github.com/astronomer/astro-provider-databricks/",
        "Homepage": "https://github.com/astronomer/astro-provider-databricks/"
    },
    "split_keywords": [
        "airflow",
        " apache-airflow",
        " astronomer",
        " dags"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "22a06344aadea13590930035bae670b1e67c21bf900f86a020e3a2733839b177",
                "md5": "774647d5d5c45deab9a172db256729d1",
                "sha256": "37e0912fef1be03aee9d4841c1d1061c223aeae739035ca9ca141a48d6457c25"
            },
            "downloads": -1,
            "filename": "astro_provider_databricks-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "774647d5d5c45deab9a172db256729d1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 11397,
            "upload_time": "2024-08-08T09:55:10",
            "upload_time_iso_8601": "2024-08-08T09:55:10.303459Z",
            "url": "https://files.pythonhosted.org/packages/22/a0/6344aadea13590930035bae670b1e67c21bf900f86a020e3a2733839b177/astro_provider_databricks-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "df100ede52a8e62931c454e5759b552c3815a48637446f3e4fb893465cee102c",
                "md5": "50d6186c2b1963a7cd4ef11a821292f9",
                "sha256": "9f31528a4b74a2f9668245a3cbd7db204662ff47a3d69a7875604bceaa55dabc"
            },
            "downloads": -1,
            "filename": "astro_provider_databricks-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "50d6186c2b1963a7cd4ef11a821292f9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 2230496,
            "upload_time": "2024-08-08T09:55:08",
            "upload_time_iso_8601": "2024-08-08T09:55:08.514617Z",
            "url": "https://files.pythonhosted.org/packages/df/10/0ede52a8e62931c454e5759b552c3815a48637446f3e4fb893465cee102c/astro_provider_databricks-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-08 09:55:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "astronomer",
    "github_project": "astro-provider-databricks",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "astro-provider-databricks"
}
        
Elapsed time: 0.28944s