airflow-duckdb

Name: airflow-duckdb
Version: 0.1.1
Home page: https://github.com/hussein-awala/airflow-duckdb
Summary: A package to run DuckDB queries from Apache Airflow
Upload time: 2024-02-24 00:39:11
Author: Hussein Awala
Requires Python: >=3.8,<3.12
License: Apache-2.0
Keywords: airflow, duckdb, kubernetes, airflow-duckdb
# Airflow DuckDB on Kubernetes

[DuckDB](https://duckdb.org/) is an in-process analytical database for running analytical queries on large data sets.

[Apache Airflow](https://airflow.apache.org/) is an open-source platform for developing, scheduling, and monitoring
batch-oriented workflows.

Apache Airflow is not an ETL tool, but more of a workflow scheduler that can be used to schedule and monitor ETL jobs.
Airflow users create DAGs to schedule Spark, Hive, Athena, Trino, BigQuery, and other ETL jobs to process their data.

By using DuckDB with Airflow, users can run analytical queries on large local or remote data sets and store the
results without needing any of these ETL tools.

To use DuckDB with Airflow, users can use the PythonOperator with the DuckDB Python library, the BashOperator with
the DuckDB CLI, or one of the available Airflow operators that support DuckDB (e.g.
[airflow-provider-duckdb](https://github.com/astronomer/airflow-provider-duckdb), developed by Astronomer). All of these
operators run inside the worker pod and are limited by its resources. For that reason, some users turn to the
Kubernetes Executor, which runs each task in a dedicated Kubernetes pod so that more resources can be requested when needed.

Setting up the Kubernetes Executor can be challenging for some users, especially maintaining the workers' Docker
image. This project provides an alternative: running DuckDB with Airflow through the KubernetesPodOperator.

## How to use

The operator is built on top of the KubernetesPodOperator, so it requires the cncf-kubernetes provider to be
installed in the Airflow environment (preferably the latest version, to benefit from all of its features).
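The provider ships as a separate distribution and can be installed with pip:

```shell
pip install apache-airflow-providers-cncf-kubernetes
```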

### Install the package

To use the operator, install the package in your Airflow environment with pip:

```bash
pip install airflow-duckdb
```

### Use the operator

The operator supports all the parameters of the KubernetesPodOperator, and it adds a few parameters that
simplify using DuckDB.

Here is an example of how to use the operator:

```python
from airflow import DAG
from kubernetes.client import models as k8s

# DuckDBPodOperator and S3FSConfig come from the airflow-duckdb package;
# the exact import path may differ, check the package documentation.
from airflow_duckdb.operators.duckdb import DuckDBPodOperator, S3FSConfig

with DAG("duckdb_dag", ...) as dag:
    DuckDBPodOperator(
        task_id="duckdb_task",
        query="SELECT MAX(col1) AS max_col1 FROM READ_PARQUET('s3://my_bucket/data.parquet');",
        do_xcom_push=True,
        s3_fs_config=S3FSConfig(
            access_key_id="{{ conn.duckdb_s3.login }}",
            secret_access_key="{{ conn.duckdb_s3.password }}",
        ),
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "8Gi"},
            limits={"cpu": "1", "memory": "8Gi"},
        ),
    )
```

## Features

The current version of the operator supports the following features:
- Running one or more DuckDB queries in a Kubernetes pod
- Configuring the pod resources (requests and limits) to run the queries
- Configuring the S3 credentials securely with a Kubernetes secret to read and write data from/to S3
  (AWS S3, MinIO, or GCS with S3 compatibility)
- Using Jinja templating to configure the query
- Loading the queries from a file
- Pushing the query result to XCom

The project also provides a Docker image with DuckDB CLI and some extensions to use it with Airflow.
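One of the features above is configuring S3 credentials through a Kubernetes secret. Such a secret might be created with `kubectl` as sketched below; the secret name and key names are illustrative, not ones the operator mandates, so check the operator's documentation for what it actually expects:

```shell
# Create a Kubernetes secret holding the S3 credentials
# (names here are placeholders for illustration only).
kubectl create secret generic duckdb-s3-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=my-access-key \
  --from-literal=AWS_SECRET_ACCESS_KEY=my-secret-key
```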


            
