yato-lib

Name: yato-lib
Version: 0.0.9
Summary: The smallest DuckDB SQL transformations orchestrator
Upload time: 2024-03-12 10:43:10
Author: Christophe Blefari
Requires Python: >=3.8.1,<3.13
License: MIT
Keywords: duckdb, sql, orchestrator, etl
<h1 align="center">
    <strong>yato — yet another transformation orchestrator</strong>
</h1>
<p align="center">
yato is the smallest orchestrator on Earth for SQL data transformations on top of DuckDB. You just give it a folder with SQL queries; it infers the DAG and runs the queries in the right order.
</p>

## Installation

yato works with Python 3.8.1 through 3.12.

```bash
pip install yato-lib
```

## Get Started

Create a folder named `sql` and put your SQL files in it; you can, for instance, use the two queries given in the example folder.
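For illustration, here is what two such files could look like (hypothetical contents; the actual example queries in the repository may differ):

```sql
-- sql/source_orders.sql: a source query with no upstream dependency
SELECT 1 AS order_id, 42.0 AS amount
UNION ALL
SELECT 2 AS order_id, 7.5 AS amount
```

```sql
-- sql/orders.sql: reads from source_orders, so it runs second
SELECT order_id, amount FROM source_orders WHERE amount > 10
```

Then run yato from Python: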

```python
from yato import Yato

yato = Yato(
    # The path of the DuckDB database file in which yato will run the SQL queries.
    # To run in memory, just set it to :memory:
    database_path="tmp.duckdb",
    # This is the folder where the SQL files are located.
    # The names of the files will determine the name of the table created.
    sql_folder="sql/",
    # The name of the DuckDB schema where the tables will be created.
    schema="transform",
)

# Runs yato against the DuckDB database with the queries in order.
yato.run()
```
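If you want to sanity-check the output, the standard DuckDB Python client works; the table name below assumes the hypothetical `orders.sql` file shown earlier:

```python
import duckdb

# yato created the tables in the "transform" schema of tmp.duckdb.
con = duckdb.connect("tmp.duckdb")
con.sql("SELECT * FROM transform.orders").show()
con.close()
```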

You can also run yato with the CLI:

```bash
yato run --db tmp.duckdb sql/
```

## Works with dlt

yato is designed to work in tandem with dlt: dlt handles the data loading and yato the data transformation.

```python
import dlt
from yato import Yato

yato = Yato(
    database_path="db.duckdb",
    sql_folder="sql/",
    schema="transform",
)

# Restore the database from S3 before running dlt
yato.restore()

pipeline = dlt.pipeline(
    pipeline_name="get_my_data",
    destination="duckdb",
    dataset_name="production",
    credentials="db.duckdb",
)

data = my_source()

load_info = pipeline.run(data)

# Back up the database after a successful dlt run
yato.backup()
yato.run()
```
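`my_source()` is not defined in the snippet above; a minimal, hypothetical stand-in built from a plain dlt resource could look like this:

```python
import dlt


@dlt.resource(name="orders")
def my_source():
    # Hypothetical source: yields raw rows that dlt will load
    # into the "production" dataset of db.duckdb.
    yield {"id": 1, "amount": 12.5}
    yield {"id": 2, "amount": 7.0}
```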

## Advanced usage

### Mixing SQL and Python transformations
Even if we would love to do everything in SQL, sometimes writing a transformation in Python with pandas (or another library) is faster.

This is why you can mix SQL and Python transformations in yato.

To do so, add a Python file to the transformation folder. In this file, implement a `Transformation` subclass with a `run` method. If it depends on another SQL transformation, define the source SQL query in a static method called `source_sql`.

Below is an example of such a transformation (e.g. `orders.py`). The framework will understand that `orders` needs to run after `source_orders`.
```python
from yato import Transformation


class Orders(Transformation):
    @staticmethod
    def source_sql():
        # The upstream query this transformation depends on; yato uses
        # it to place Orders after source_orders in the DAG.
        return "SELECT * FROM source_orders"

    def run(self, context, *args, **kwargs):
        # Load the result of source_sql() as a DataFrame.
        df = self.get_source(context)

        df["new_column"] = 1

        # The returned DataFrame becomes the orders table.
        return df
```

### Environment variables
yato supports environment variables in SQL queries (as in the example below). Be careful: by default, yato raises an error if a referenced environment variable is not defined.

```sql
SELECT {{ VALUE }}, {{ OTHER_VALUE }}
```
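For instance, assuming the query above sits in your SQL folder, a minimal sketch of setting the variables from Python before running yato:

```python
import os

from yato import Yato

# Define the variables before running; by default yato raises
# an error if a referenced variable is missing.
os.environ["VALUE"] = "1"
os.environ["OTHER_VALUE"] = "2"

Yato(database_path=":memory:", sql_folder="sql/", schema="transform").run()
```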

### Other features
* **Subfolders** — inside the main folder, create whatever folders you need to organise your transformations; folders have no impact on the DAG inference. Be careful not to have two transformations with the same name.
* **Multiple SQL statements** — if a file contains several statements, yato runs them in the order they appear. Warning: a file can contain only one SELECT statement; the others can be SET, etc. For the moment, the dependencies (and hence the DAG) are computed from the SELECT only (see the sketch below).
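For example, a single file could tune a session setting before its one SELECT (a hypothetical sketch):

```sql
-- orders_big.sql: the SET runs first, then the single SELECT.
-- Only the SELECT is used for dependency (DAG) inference.
SET memory_limit = '2GB';
SELECT order_id, amount
FROM orders
WHERE amount > 10;
```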


## How does it work?

yato relies on the amazing SQLGlot library to parse the SQL queries syntactically and build a DAG of their dependencies. Then, it runs the queries in the right order.
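To illustrate the idea (this is not yato's actual code), SQLGlot can list the tables a query reads from, which is the raw material for building the DAG:

```python
import sqlglot
from sqlglot import exp

# Each table referenced in the query is an upstream dependency
# of the model this query defines.
query = "SELECT o.order_id FROM source_orders o JOIN customers c ON o.customer_id = c.id"
deps = sorted({t.name for t in sqlglot.parse_one(query).find_all(exp.Table)})
print(deps)  # ['customers', 'source_orders']
```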

## FAQ

**Why choose yato over dbt Core, SQLMesh or lea?**

There is no good answer to this question, but yato has not been designed to fully replace SQL transformation orchestrators. yato is meant to be fast to set up and configure, with a small feature set. You give it a folder with a bunch of SQL (or Python) inside and it runs.

You can think of yato as black for transformation orchestration: only one parameter and here you go.

**Why only DuckDB?**

For the moment yato only supports DuckDB as its backend/dialect. The main reason is that DuckDB offers features that would be hard to implement with a client/server database. I do not rule out adding Postgres or cloud warehouses, but it would require thinking through how to do it, especially when mixing SQL and Python transformations.

**Can yato support Jinja templating?**

It does not, and I'm not sure it should. I think that when you're adding Jinja templating to your SQL queries, you're already too far gone. I would recommend not using yato for this. Still, if you really want to use yato and need Jinja support, reach out to me.

Small note: yato supports environment variables in the SQL queries.

**Can I contribute?**

Yes, obviously. Right now the project is in its early stages and I would be happy to get feedback and contributions. Keep in mind that this is a small orchestrator; covering the full gap with other orchestrators makes no sense, because if you need their features you should just use them, they are awesome.



## Limitations
* You can't have two transformations with the same name.
* There are no tests for the moment. I'm working on it.

            
