astro-provider-databricks

Package metadata (PyPI):

- Name: astro-provider-databricks
- Version: 0.2.2
- Summary: Affordable Databricks Workflows in Apache Airflow
- Homepage: https://github.com/astronomer/astro-provider-databricks/
- Requires Python: >=3.7
- Keywords: airflow, apache-airflow, astronomer, dags
- Uploaded: 2024-04-16 11:26:16
            <h1 align="center">
  Databricks Workflows in Airflow
</h1>

The Astro Databricks Provider is an Apache Airflow provider for writing [Databricks Workflows](https://docs.databricks.com/workflows/index.html) with Airflow as the authoring interface. Running your Databricks notebooks as Databricks Workflows can result in a [75% cost reduction](https://www.databricks.com/product/pricing): all-purpose compute is billed at $0.40/DBU versus $0.07/DBU for Jobs compute, so at those list rates the per-DBU saving is (0.40 - 0.07) / 0.40 ≈ 82%, making 75% a conservative figure.

While the provider is maintained by Astronomer, it's available to anyone using Airflow: you don't need to be an Astronomer customer to use it.

There are a few advantages to defining your Databricks Workflows in Airflow:

|                                      |        via Databricks         |      via Airflow       |
| :----------------------------------- | :---------------------------: | :--------------------: |
| Authoring interface                  | _Web-based via Databricks UI_ | _Code via Airflow DAG_ |
| Workflow compute pricing             |              ✅               |           ✅           |
| Notebook code in source control      |              ✅               |           ✅           |
| Workflow structure in source control |                               |           ✅           |
| Retry from beginning                 |              ✅               |           ✅           |
| Retry single task                    |                               |           ✅           |
| Task groups within Workflows         |                               |           ✅           |
| Trigger workflows from other DAGs    |                               |           ✅           |
| Workflow-level parameters            |                               |           ✅           |

## Example

The following Airflow DAG illustrates how to use the `DatabricksWorkflowTaskGroup` and `DatabricksNotebookOperator` to define a Databricks Workflow in Airflow:

```python
from pendulum import datetime

from airflow.decorators import dag, task_group
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

# define your cluster spec - can have from 1 to many clusters
job_cluster_spec = [
   {
      "job_cluster_key": "astro_databricks",
      "new_cluster": {
         "cluster_name": "",
         # ...
      },
   }
]

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def databricks_workflow_example():
   # the task group is a context manager that will create a Databricks Workflow
   with DatabricksWorkflowTaskGroup(
      group_id="example_databricks_workflow",
      databricks_conn_id="databricks_default",
      job_clusters=job_cluster_spec,
      # you can specify common fields here that get shared to all notebooks
      notebook_packages=[
         { "pypi": { "package": "pandas" } },
      ],
      # notebook_params supports templating
      notebook_params={
         "start_time": "{{ ds }}",
      }
   ) as workflow:
      notebook_1 = DatabricksNotebookOperator(
         task_id="notebook_1",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_1",
         source="WORKSPACE",
         # job_cluster_key corresponds to the job_cluster_key in the job_cluster_spec
         job_cluster_key="astro_databricks",
         # you can add to packages & params at the task level
         notebook_packages=[
            { "pypi": { "package": "scikit-learn" } },
         ],
         notebook_params={
            "end_time": "{{ macros.ds_add(ds, 1) }}",
         }
      )

      # you can embed task groups for easier dependency management
      @task_group(group_id="inner_task_group")
      def inner_task_group():
         notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_2",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

         notebook_3 = DatabricksNotebookOperator(
            task_id="notebook_3",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_3",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

      notebook_4 = DatabricksNotebookOperator(
         task_id="notebook_4",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_4",
         source="WORKSPACE",
         job_cluster_key="astro_databricks",
      )

      # notebook_1 runs first, then the inner group (notebook_2 and notebook_3
      # run in parallel, as no dependency is set between them), then notebook_4
      notebook_1 >> inner_task_group() >> notebook_4

   trigger_workflow_2 = TriggerDagRunOperator(
      task_id="trigger_workflow_2",
      trigger_dag_id="workflow_2",
      execution_date="{{ next_execution_date }}",
   )

   workflow >> trigger_workflow_2

databricks_workflow_example_dag = databricks_workflow_example()
```
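
Both operators reference an Airflow connection named `databricks_default`. As a sketch of one way to supply it (assuming token-based authentication; the workspace URL and token below are placeholders), you can build a `Connection` and export it in Airflow's URI form:

```python
from airflow.models.connection import Connection

# Sketch only: a token-authenticated Databricks connection.
# Replace host and password with your workspace URL and a
# personal access token.
conn = Connection(
    conn_id="databricks_default",
    conn_type="databricks",
    host="https://dbc-123456.cloud.databricks.com",  # placeholder workspace URL
    password="<personal-access-token>",  # placeholder PAT
)

# Airflow resolves connections from AIRFLOW_CONN_<CONN_ID> environment
# variables, so this URI can be exported as AIRFLOW_CONN_DATABRICKS_DEFAULT.
print(conn.get_uri())
```

Connections can equally be created in the Airflow UI or with `airflow connections add`; the environment-variable route just keeps the example self-contained.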

### Airflow UI

![Airflow UI](https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/docs/_static/screenshots/workflow_1_airflow.png)

### Databricks UI

![Databricks UI](https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/docs/_static/screenshots/workflow_1_databricks.png)

## Quickstart

Check out the following quickstart guides:

- [With the Astro CLI](quickstart/astro-cli.md)
- [Without the Astro CLI](quickstart/without-astro-cli.md)
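
If you'd rather wire things up by hand, the package installs from PyPI as `astro-provider-databricks` (Python >= 3.7), and a quick import check confirms that the two names used in the example above are available:

```python
# Smoke test after `pip install astro-provider-databricks`:
# the operators used in the example DAG should import cleanly.
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

print(DatabricksNotebookOperator.__name__, DatabricksWorkflowTaskGroup.__name__)
```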

## Documentation

The documentation is a work in progress; we aim to follow the [Diátaxis](https://diataxis.fr/) system:

- [Reference Guide](https://astronomer.github.io/astro-provider-databricks/)

## Changelog

Astro Databricks follows [semantic versioning](https://semver.org/) for releases. Read the [changelog](CHANGELOG.rst) for the changes introduced in each version.

## Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the [Contribution Guidelines](docs/contributing.rst) for a detailed overview of how to contribute.

Contributors and maintainers should abide by the [Contributor Code of Conduct](CODE_OF_CONDUCT.md).

## License

[Apache License 2.0](LICENSE)

            
