dataverk-airflow


Name: dataverk-airflow
Version: 1.6.3
Home page: https://github.com/navikt/dataverk-airflow
Author: NAV
Requires Python: <3.12,>=3.8
Upload time: 2024-04-17 08:40:43
# Dataverk airflow

A simple wrapper library around [KubernetesPodOperator](https://airflow.apache.org/docs/stable/kubernetes.html) that creates Airflow tasks running in a Kubernetes pod.

## Our operators

All our operators let you clone a repo other than the one where the DAGs are defined; just add it with `repo="navikt/<repo>"`.

We also support installing Python packages at Airflow task startup; specify your `requirements.txt` file with `requirements_path="/path/to/requirements.txt"`.
Note that if you combine `repo` and `requirements_path`, the `requirements.txt` must be located in the repo given by `repo`, as in the sketch below.
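
A minimal sketch of combining the two, using the `python_operator` described below (the repo name and paths are hypothetical placeholders):

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import python_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="@daily") as dag:
    # Clones navikt/<repo> and installs the packages listed in the
    # requirements.txt from that same repo before the script runs.
    t1 = python_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         script_path="/path/to/script.py",
                         requirements_path="/path/to/requirements.txt")
```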

### Quarto operator (data story)

This runs `quarto render` for you, producing an HTML file that can be uploaded to Datamarkedsplassen.

We support both single files and directories: use `path` for a single file, and `folder` if you have a Quarto project in a directory.
Quarto projects are mainly used for [book](https://quarto.org/docs/books/), [website](https://quarto.org/docs/websites/), or [dashboard](https://quarto.org/docs/dashboards/) formats.
Single files are built `self-contained`, meaning the HTML file is bundled with all its external dependencies (JavaScript, CSS, and images).

To upload files to Datamarkedsplassen you need a Quarto token, which is unique per team.
You can find it under [Mine teams token](https://data.intern.nav.no/user/tokens) in the menu.

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from dataverk_airflow import quarto_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = quarto_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         quarto={
                             "path": "/path/to/index.qmd",
                             "env": "dev/prod",
                             "id":"uuid",
                             "token": Variable.get("quarto_token"),
                         },
                         slack_channel="<#slack-alarm-kanal>")
```


If you need to render something other than `html`, you can use the `format` key.
You must do this, for example, if you want to create a dashboard.

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from dataverk_airflow import quarto_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = quarto_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         quarto={
                             "folder": "/path/to/book",
                             "format": "dashboard",
                             "env": "dev/prod",
                             "id": "uuid",
                             "token": Variable.get("quarto_token"),
                         },
                         slack_channel="<#slack-alarm-kanal>")
```

In the examples above we store the token in an Airflow variable, which is then used in the DAG task.
See the official [Airflow documentation](https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html) for how to use `Variable.get()` in a task.
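
The variable has to exist before the DAG runs. A minimal sketch of creating it once with `Variable.set()`, assuming the variable name `quarto_token` used in the examples above:

```python
from airflow.models import Variable

# One-time setup: store the team's Quarto token as an Airflow variable
# named "quarto_token". This can also be done in the Airflow UI under
# Admin -> Variables.
Variable.set("quarto_token", "<ditt-quarto-token>")
```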

### Notebook operator

This lets you run a Jupyter notebook.

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import notebook_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = notebook_operator(dag=dag,
                           name="<navn-på-task>",
                           repo="navikt/<repo>",
                           nb_path="/path/to/notebook.ipynb",
                           slack_channel="<#slack-alarm-kanal>")
```

### Python operator

This lets you run arbitrary Python scripts.

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import python_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = python_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         script_path="/path/to/script.py",
                         slack_channel="<#slack-alarm-kanal>")
```

### Kubernetes operator

We also offer our own Kubernetes operator, which clones a chosen repo into the container.

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import kubernetes_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = kubernetes_operator(dag=dag,
                             name="<navn-på-task>",
                             repo="navikt/<repo>",
                             cmds=["/path/to/bin/", "script-name.sh", "argument1", "argument2"],
                             image="europe-north1-docker.pkg.dev/nais-management-233d/ditt-team/ditt-image:din-tag",
                             slack_channel="<#slack-alarm-kanal>")
```

This operator supports two extra flags that are not available in the others; a sketch follows the list below.

```
cmds: list: Command to run in pod
working_dir: str: Path to working directory
```
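
For example, a sketch that runs a script from a subdirectory, assuming `working_dir` is resolved inside the cloned repo (the paths are hypothetical):

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import kubernetes_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    # Runs ./run.sh from the scripts/ directory of the cloned repo.
    t1 = kubernetes_operator(dag=dag,
                             name="<navn-på-task>",
                             repo="navikt/<repo>",
                             cmds=["./run.sh", "argument1"],
                             working_dir="scripts",
                             image="europe-north1-docker.pkg.dev/nais-management-233d/ditt-team/ditt-image:din-tag")
```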

### Allow list

All operators support setting an allow list, but some addresses are added automatically by Dataverk Airflow; see the sketch after the lists below.

If you use the `slack_channel` argument, we add:
- hooks.slack.com

If you use the `email` argument, we add:
- The correct SMTP address

If you use the `requirements_path` argument, we add:
- pypi.org
- files.pythonhosted.org
- pypi.python.org

For `quarto_operator` we add:
- The address of the correct Datamarkedsplass
- cdnjs.cloudflare.com
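
A minimal sketch of opening for an extra host on the `host:port` format (the database hostname is a hypothetical placeholder):

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import python_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    # The script connects to an external Postgres database (hypothetical
    # hostname), so the host is added to the allow list as host:port.
    t1 = python_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         script_path="/path/to/script.py",
                         allowlist=["database.example.com:5432"])
```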

### Common arguments

All our operators support the following arguments in the function call; a combined sketch follows the list.

```
dag: DAG: The DAG the task belongs to
name: str: Name of the task
repo: str: GitHub repo
image: str: Docker image the pod should use
branch: str: Branch in repo, default "main"
email: str: Email of owner
slack_channel: str: Name of Slack channel, default None (no Slack notification)
extra_envs: dict: Dict with environment variables, example: {"key": "value", "key2": "value2"}
allowlist: list: List of hosts and ports the task needs to reach, on the format host:port
requirements_path: str: Path (including filename) to your requirements.txt
python_version: str: Desired Python version for the environment your code will be running in when using the default image. We offer only supported versions of Python, and default to the latest version if this parameter is omitted. See https://devguide.python.org/versions/ for available versions.
resources: dict: Specify required CPU and memory requirements (keys in dict: request_memory, request_cpu, limit_memory, limit_cpu), default None
startup_timeout_seconds: int: Pod startup timeout in seconds
retries: int: Number of retries for the task before the DAG fails, default 3
delete_on_finish: bool: Whether to delete the pod on completion
retry_delay: timedelta: Time between retries, default 5 seconds
do_xcom_push: bool: Enable XCom push of the content in the file "/airflow/xcom/return.json", default False
on_success_callback: func: A function to be called when a task instance of this task succeeds
```
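
A sketch combining several of these arguments (all values are placeholders):

```python
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import python_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = python_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         branch="main",
                         script_path="/path/to/script.py",
                         extra_envs={"LOG_LEVEL": "INFO"},
                         retries=2,
                         retry_delay=timedelta(minutes=5),
                         # The pod writes JSON to /airflow/xcom/return.json,
                         # which is pushed as an XCom when this is True.
                         do_xcom_push=True,
                         slack_channel="<#slack-alarm-kanal>")
```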

## Setting resource requirements

We support setting `requests` and `limits` for each operator.
Note that you do not need to set `limits` on CPU, as this is handled automatically by the platform.

By using `ephemeral-storage` you can request extra disk space for storage in a task.

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from dataverk_airflow import python_operator


with DAG('navn-dag', start_date=days_ago(1), schedule_interval="*/10 * * * *") as dag:
    t1 = python_operator(dag=dag,
                         name="<navn-på-task>",
                         repo="navikt/<repo>",
                         script_path="/path/to/script.py",
                         resources={
                             "requests": {
                                 "memory": "50Mi",
                                 "cpu": "100m",
                                 "ephemeral-storage": "1Gi"
                             },
                             "limits": {
                                 "memory": "100Mi"
                             }
                         })
```