databathing

Name: databathing
Version: 0.2.3
Home page: https://github.com/jason-jz-zhu/databathing
Summary: Convert SQL queries to PySpark DataFrame operations
Upload time: 2025-08-08 20:35:53
Maintainer: Jiazhen Zhu
Author: Jiazhen Zhu
Requires Python: >=3.7
License: MIT
Keywords: sql, spark, pyspark, etl, data, parser, converter
Requirements: mo-sql-parsing
# More SQL Parsing!

[![PyPI Latest Release](https://img.shields.io/pypi/v/databathing.svg)](https://pypi.org/project/databathing/)
[![Build Status](https://circleci.com/gh/jason-jz-zhu/databathing/tree/main.svg?style=svg)](https://app.circleci.com/pipelines/github/jason-jz-zhu/databathing)


Parse SQL into JSON so we can translate it for other datastores!

[See changes](https://github.com/jason-jz-zhu/databathing#version-changes)


## Problem Statement

Converting SQL to hand-written Spark code can improve the performance of an ETL job, but it means data engineers must write Spark code for each pipeline instead of declaring it in YAML (SQL), which makes ETL development take longer than before.

That raises a question: can we have a solution that offers both good computational performance (Spark) and fast development (YAML - SQL)?

YES, we can!

## Objectives

We plan to combine the benefits of Spark and YAML (SQL) into a platform or library for developing ETL pipelines.


## Project Status

May 2022 - There are [over 900 tests](https://app.circleci.com/pipelines/github/jason-jz-zhu/databathing). This parser is good enough for basic usage, including:
* `SELECT` feature
* `FROM` feature
* `INNER JOIN` and `LEFT JOIN` feature
* `ON` feature
* `WHERE` feature
* `GROUP BY` feature
* `HAVING` feature
* `ORDER BY` feature
* `AGG` feature
* WINDOW FUNCTION feature (`SUM`, `AVG`, `MAX`, `MIN`, `MEAN`, `COUNT`)
* ALIAS NAME feature
* `WITH` STATEMENT feature

## Install

    pip install databathing


## Generating Spark Code

You may also generate PySpark code from a given SQL query. This is done by the `Pipeline`, which is in Version 1 state (May 2022).

    >>> from databathing import pipeline
    >>> pipeline = pipeline.Pipeline("SELECT * FROM Test WHERE info = 1")
    >>> pipeline.parse()
    'final_df = Test\\\n.filter("info = 1")\\\n.selectExpr("a","b","c")\n\n'
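
The generated string is ordinary Python that chains DataFrame method calls, so it can be bound to real tables and executed with `exec`. A minimal, self-contained sketch of the mechanics, using a hypothetical `StubFrame` stand-in (with PySpark installed you would bind a real DataFrame to the table name `Test` instead):

```python
class StubFrame:
    """Stand-in that records the chained calls a Spark DataFrame would receive."""
    def __init__(self, name):
        self.calls = [name]

    def filter(self, cond):
        self.calls.append(f"filter({cond!r})")
        return self  # return self so calls chain like a real DataFrame

    def selectExpr(self, *cols):
        self.calls.append(f"selectExpr{cols}")
        return self

# A string in the shape databathing emits (see the example above).
generated = 'final_df = Test\\\n.filter("info = 1")\\\n.selectExpr("a","b","c")\n\n'

# Bind the table name, then execute the generated chain.
scope = {"Test": StubFrame("Test")}
exec(generated, scope)
print(scope["final_df"].calls)
# → ['Test', "filter('info = 1')", "selectExpr('a', 'b', 'c')"]
```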

## Contributing

If databathing is not working for you, you can help make it better by simply pasting your SQL (or JSON) into a new issue. Extra points if you describe the problem. Even more points if you submit a PR with a test. If you also submit a fix, you have my gratitude.

Please follow this blog to publish a new version: https://circleci.com/blog/publishing-a-python-package/


### Run Tests

See [the tests directory](https://github.com/jason-jz-zhu/databathing/tree/develop/tests) for instructions on running tests, or on writing new ones.

## Version Changes


### Version 1

*May 2022*

Features and Functionalities - PySpark Version
* `SELECT` feature
* `FROM` feature
* `INNER JOIN` and `LEFT JOIN` feature
* `ON` feature
* `WHERE` feature
* `GROUP BY` feature
* `HAVING` feature
* `ORDER BY` feature
* `AGG` feature
* WINDOW FUNCTION feature (`SUM`, `AVG`, `MAX`, `MIN`, `MEAN`, `COUNT`)
* ALIAS NAME feature
* `WITH` STATEMENT feature





