# More SQL Parsing!
[PyPI](https://pypi.org/project/databathing/)
[Build Status (CircleCI)](https://app.circleci.com/pipelines/github/jason-jz-zhu/databathing)
Parse SQL into JSON so we can translate it for other datastores!
[See changes](https://github.com/jason-jz-zhu/databathing#version-changes)
## Problem Statement
After moving from SQL to Spark, data engineers have to write Spark code for their ETL pipelines instead of expressing them in YAML (SQL). That switch improves ETL job performance, but it also makes ETL development take longer than before.
This raises the question: can we have a solution that offers both good compute performance (Spark) and fast development (YAML - SQL)?
YES, we can!
## Objectives
We combine the benefits of Spark and YAML (SQL) into a platform/library for developing ETL pipelines.
## Project Status
August 2025 - There are [over 900 tests](https://app.circleci.com/pipelines/github/jason-jz-zhu/databathing) providing comprehensive coverage. The parser includes a **code validation system** with quality scoring and supports advanced SQL features, including:
* `SELECT` feature (including `SELECT DISTINCT`)
* `FROM` feature with table aliases
* `INNER`, `LEFT`, and `RIGHT` JOIN features
* `ON` feature for join conditions
* `WHERE` feature with complex conditions
* `GROUP BY` feature with aggregations
* `HAVING` feature for post-aggregation filtering
* `ORDER BY` feature with ASC/DESC sorting
* `LIMIT` feature for result limiting
* `AGG` feature with aggregation functions (`SUM`, `AVG`, `MAX`, `MIN`, `MEAN`, `COUNT`, `COLLECT_LIST`, `COLLECT_SET`)
* WINDOW FUNCTION feature with partitioning and ordering
* ALIAS NAME feature for tables and columns
* `WITH` STATEMENT feature (Common Table Expressions/CTEs)
* **SET OPERATIONS feature** - `UNION`, `UNION ALL`, `INTERSECT`, `EXCEPT`
* String functions (`SPLIT`, `STRUCT` creation)
* Complex subqueries and nested operations
## Install
```bash
pip install databathing
```
### Optional Dependencies
For full functionality:
```bash
pip install duckdb # For DuckDB engine support
pip install sqlparse # For code validation system
```
## Generating Spark Code
Generate PySpark Code from SQL queries using the Pipeline. Supports complex queries including set operations.
### Basic Usage
```python
>>> from databathing import Pipeline
>>> pipeline = Pipeline("SELECT * FROM Test WHERE info = 1")
>>> result = pipeline.parse()
>>> print(result)
'final_df = Test\\\n.filter("info = 1")\\\n.selectExpr("a","b","c")\n\n'
# With validation (enabled by default)
>>> pipeline = Pipeline("SELECT name FROM users WHERE age > 25", validate=True)
>>> result = pipeline.parse_with_validation()
>>> print(f"Code: {result['code'].strip()}")
>>> print(f"Quality Score: {result['score']}/100 (Grade: {result['grade']})")
'Code: final_df = users.filter("age > 25").selectExpr("name")'
'Quality Score: 97.2/100 (Grade: A)'
```
### Set Operations Example (New in v0.3.0!)
```python
>>> from databathing import Pipeline
>>> sql_query = """
SELECT customer_id, name FROM active_customers
UNION ALL
SELECT customer_id, name FROM inactive_customers
WHERE status = 'churned'
ORDER BY name
LIMIT 100
"""
>>> pipeline = Pipeline(sql_query)
>>> result = pipeline.parse()
>>> print(result)
'final_df = (active_customers\\\n.selectExpr("customer_id","name")).union((inactive_customers\\\n.filter("status = \'churned\'")\\\n.selectExpr("customer_id","name")))\n.orderBy(col("name").asc())\n.limit(100)\n\n'
```
### Supported Set Operations
- **UNION** - Combines results and removes duplicates
- **UNION ALL** - Combines results keeping all records
- **INTERSECT** - Returns only records that exist in both queries
- **EXCEPT** - Returns records from first query that don't exist in second
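The other set operations go through the same `Pipeline` API. A minimal sketch (the table names and query below are illustrative; the exact generated string depends on the query):
```python
>>> from databathing import Pipeline
>>> sql_query = """
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM churned_customers
"""
>>> pipeline = Pipeline(sql_query)
>>> print(pipeline.parse())  # prints the generated PySpark method chain for the EXCEPT query
```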
## DuckDB Engine Support (New!)
DataBathing now supports DuckDB as an alternative execution engine alongside PySpark. Generate DuckDB Python code from SQL queries for in-memory columnar processing.
### Quick Start with DuckDB
```python
>>> from databathing import Pipeline
>>>
>>> # Generate DuckDB code with validation
>>> query = "SELECT name, age FROM users WHERE age > 25 ORDER BY name LIMIT 10"
>>> pipeline = Pipeline(query, engine="duckdb", validate=True)
>>> result = pipeline.parse_with_validation()
>>> print(f"Code: {result['code']}")
>>> print(f"Quality Score: {result['score']}/100 (Grade: {result['grade']})")
'Code: result = duckdb.sql("SELECT name,age FROM users WHERE age > 25 ORDER BY name ASC LIMIT 10")'
'Quality Score: 98.8/100 (Grade: A)'
>>>
>>> # Compare with PySpark (default behavior)
>>> spark_pipeline = Pipeline(query, engine="spark") # or just Pipeline(query)
>>> spark_code = spark_pipeline.parse()
>>> print(spark_code)
'final_df = users\\.filter("age > 25")\\.selectExpr("name","age")\\.orderBy(col("name").asc())\\.limit(10)'
```
### Engine Comparison
| Feature | PySpark Engine | DuckDB Engine |
|---------|----------------|---------------|
| **Output Style** | Method chaining | SQL strings |
| **Performance** | Distributed processing | In-memory columnar |
| **Dependencies** | PySpark, Hadoop ecosystem | DuckDB only |
| **Setup Complexity** | High (cluster setup) | Low (pip install) |
| **Memory Usage** | Cluster memory | Single machine memory |
| **Best For** | Big data, distributed workloads | Analytics, fast queries, prototyping |
### Complex Query Example
```python
>>> from databathing import Pipeline
>>>
>>> complex_query = """
SELECT department, AVG(salary) as avg_salary, COUNT(*) as emp_count
FROM employees
WHERE salary > 50000
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC
LIMIT 3
"""
>>>
>>> # DuckDB Output - clean SQL
>>> duckdb_pipeline = Pipeline(complex_query, engine="duckdb")
>>> print(duckdb_pipeline.parse())
'result = duckdb.sql("SELECT department,avg(salary) AS avg_salary,count(*) AS emp_count FROM employees WHERE salary > 50000 GROUP BY department HAVING COUNT(*) > 5 ORDER BY avg_salary DESC LIMIT 3")'
>>>
>>> # PySpark Output - method chaining
>>> spark_pipeline = Pipeline(complex_query, engine="spark")
>>> print(spark_pipeline.parse())
'final_df = employees\\.filter("salary > 50000")\\.groupBy("department")\\.agg(avg(col("salary")).alias("avg_salary"),count(col("*")).alias("emp_count"))\\.filter("COUNT(*) > 5")\\.selectExpr("department","avg_salary","emp_count")\\.orderBy(col("avg_salary").desc())\\.limit(3)'
```
### Using Generated DuckDB Code
```python
import duckdb
# Execute the generated code
result = duckdb.sql("SELECT name,age FROM users WHERE age > 25 ORDER BY name ASC LIMIT 10")
# Convert to different formats
pandas_df = result.df() # Pandas DataFrame
arrow_table = result.arrow() # PyArrow Table
python_data = result.fetchall() # Python objects
```
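To execute that snippet end to end, a `users` relation needs to exist. One option (shown here with made-up sample data) relies on DuckDB's ability to query a pandas DataFrame that is in scope by its variable name:
```python
import duckdb
import pandas as pd

# Hypothetical sample data so the generated query has a `users` table to read;
# duckdb.sql() resolves the in-scope DataFrame by its variable name.
users = pd.DataFrame({"name": ["Ada", "Bo", "Cy"], "age": [36, 22, 41]})

result = duckdb.sql("SELECT name,age FROM users WHERE age > 25 ORDER BY name ASC LIMIT 10")
print(result.df())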
### DuckDB Engine Features
- ✅ **Core SQL clauses supported**: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
- ✅ **Aggregation functions**: COUNT, SUM, AVG, MIN, MAX, etc.
- ✅ **SELECT DISTINCT** support
- ✅ **WITH statements** (Common Table Expressions)
- ✅ **Complex expressions** and aliases
- ⚠️ **Set operations** (UNION, INTERSECT, EXCEPT) - basic support
- ⚠️ **JOINs** - generates SQL rather than relational API
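For example, WITH statements (CTEs) go through the same API with the DuckDB engine; a minimal sketch (table and column names are illustrative, and the generated SQL string may vary by version):
```python
from databathing import Pipeline

cte_query = """
WITH high_earners AS (
    SELECT department, salary FROM employees WHERE salary > 100000
)
SELECT department, COUNT(*) AS n FROM high_earners GROUP BY department
"""

# The DuckDB engine emits a single duckdb.sql("...") statement for the whole query
pipeline = Pipeline(cte_query, engine="duckdb")
print(pipeline.parse())
```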
### Prerequisites for DuckDB Engine
```bash
pip install duckdb # Required for DuckDB engine
pip install databathing # This package
```
## Code Validation & Quality Scoring (New!)
DataBathing now includes a comprehensive code validation system that automatically checks generated PySpark and DuckDB code for syntax correctness, performance issues, and best practices. Get instant feedback with quality scores (0-100) and actionable suggestions.
### ✅ Key Features
- **Syntax Validation**: Ensures generated code is syntactically correct before execution
- **Quality Scoring**: 0-100 scores with letter grades (A-F) based on 5 metrics
- **Performance Analysis**: Detects common bottlenecks and anti-patterns
- **Best Practices**: Suggests improvements for maintainable, efficient code
- **Multi-Engine Support**: Works with both PySpark and DuckDB engines
### Quick Start with Validation
#### Basic Usage with Pipeline
```python
from databathing import Pipeline
# Enable validation (default behavior)
pipeline = Pipeline("SELECT name, age FROM users WHERE age > 25",
engine="spark", validate=True)
# Get code with validation results
result = pipeline.parse_with_validation()
print(f"Generated Code: {result['code']}")
print(f"Quality Score: {result['score']}/100 (Grade: {result['grade']})")
print(f"Code Valid: {result['is_valid']}")
# Output:
# Generated Code: final_df = users.filter("age > 25").selectExpr("name","age")
# Quality Score: 97.2/100 (Grade: A)
# Code Valid: True
```
#### Accessing Detailed Validation Reports
```python
# Get comprehensive validation details
report = pipeline.get_validation_report()
print(f"Syntax Score: {report.metrics.syntax_score}/100")
print(f"Performance Score: {report.metrics.performance_score}/100")
print(f"Readability Score: {report.metrics.readability_score}/100")
# Show any issues or suggestions
for issue in report.issues:
    print(f"⚠️ {issue.message}")
    if issue.suggestion:
        print(f"💡 {issue.suggestion}")
```
#### Direct Code Validation
```python
from databathing import validate_code
# Validate any generated code directly
code = 'final_df = users.filter("age > 25").selectExpr("name")'
report = validate_code(code, engine="spark", original_sql="SELECT name FROM users WHERE age > 25")
print(f"Overall Score: {report.metrics.overall_score:.1f}/100")
print(f"Grade: {report.get_grade()}")
print(f"Valid: {report.is_valid}")
```
### Quality Scoring System
DataBathing evaluates code across 5 dimensions:
| Metric | Weight | Description |
|--------|--------|-------------|
| **Syntax** | 40% | Code correctness and executability |
| **Complexity** | 20% | Method chaining depth, nesting (lower is better) |
| **Readability** | 20% | Formatting, naming, code clarity |
| **Performance** | 10% | Anti-pattern detection, optimization opportunities |
| **Maintainability** | 10% | Long-term code maintenance ease |
**Grade Scale:** A (90-100) • B (80-89) • C (70-79) • D (60-69) • F (0-59)
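As a rough illustration of how the weights and grade scale combine, here is a sketch of the arithmetic using hypothetical per-metric scores (this mirrors the table above, not the library's internal implementation):
```python
# Hypothetical per-metric scores (0-100)
metrics = {
    "syntax": 100.0,
    "complexity": 90.0,
    "readability": 95.0,
    "performance": 100.0,
    "maintainability": 95.0,
}

# Weights from the table above
weights = {
    "syntax": 0.40,
    "complexity": 0.20,
    "readability": 0.20,
    "performance": 0.10,
    "maintainability": 0.10,
}

overall = sum(metrics[name] * weight for name, weight in weights.items())  # 96.5

# Map the overall score to a letter grade using the scale above
if overall >= 90:
    grade = "A"
elif overall >= 80:
    grade = "B"
elif overall >= 70:
    grade = "C"
elif overall >= 60:
    grade = "D"
else:
    grade = "F"

print(f"{overall:.1f}/100 (Grade: {grade})")  # 96.5/100 (Grade: A)
```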
### Validation Examples
#### High-Quality Code (Grade A)
```python
# Input SQL
query = "SELECT department, AVG(salary) as avg_sal FROM employees WHERE salary > 50000 GROUP BY department LIMIT 10"
# PySpark Output (Score: 94/100, Grade A)
pipeline = Pipeline(query, engine="spark", validate=True)
result = pipeline.parse_with_validation()
print(result['code'])
# final_df = employees\
# .filter("salary > 50000")\
# .groupBy("department")\
# .agg(avg(col("salary")).alias("avg_sal"))\
# .selectExpr("department","avg_sal")\
# .limit(10)
# DuckDB Output (Score: 98/100, Grade A)
pipeline = Pipeline(query, engine="duckdb", validate=True)
result = pipeline.parse_with_validation()
print(result['code'])
# result = duckdb.sql("SELECT department,avg(salary) AS avg_sal FROM employees WHERE salary > 50000 GROUP BY department LIMIT 10")
```
#### Problematic Code with Warnings (Grade D)
```python
# Query that generates performance issues
problematic_query = "SELECT * FROM large_table ORDER BY created_at"
pipeline = Pipeline(problematic_query, engine="spark", validate=True)
result = pipeline.parse_with_validation()
print(f"Score: {result['score']}/100 (Grade: {result['grade']})")
# Score: 45/100 (Grade: D)
# View specific issues
for issue in result['validation_report'].issues:
    print(f"⚠️ {issue.message}")
# ⚠️ Performance: SELECT * can be inefficient on large tables
# ⚠️ Best Practice: Consider adding LIMIT to ORDER BY queries
```
#### Engine Comparison with Validation
```python
sql = "SELECT department, COUNT(*) as total FROM employees GROUP BY department"
# Compare validation scores between engines
for engine in ["spark", "duckdb"]:
pipeline = Pipeline(sql, engine=engine, validate=True)
result = pipeline.parse_with_validation()
print(f"{engine.upper()}: {result['score']}/100 (Grade: {result['grade']})")
# Output:
# SPARK: 89/100 (Grade: B)
# DUCKDB: 95/100 (Grade: A)
```
### Installation Requirements
For full validation functionality:
```bash
pip install databathing
pip install sqlparse # Required for DuckDB SQL validation
```
### Validation Configuration
```python
# Enable/disable validation (default: True)
pipeline = Pipeline(query, engine="spark", validate=True)
# For high-frequency production use, consider disabling for performance
pipeline = Pipeline(query, engine="spark", validate=False)
```
### Performance Notes
- **Validation Overhead**: ~10-50ms per query depending on complexity
- **Recommendation**: Keep enabled for development, consider disabling for high-frequency production use
- **Memory Impact**: Minimal additional memory for validation reports
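One way to follow that recommendation is to gate the `validate` flag on an environment setting (the environment variable name here is an assumption, not something databathing reads itself):
```python
import os

from databathing import Pipeline

# Keep validation on everywhere except high-frequency production paths
enable_validation = os.getenv("APP_ENV", "dev") != "production"

pipeline = Pipeline("SELECT name FROM users WHERE age > 25",
                    engine="spark", validate=enable_validation)
code = pipeline.parse()
```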
## Contributing
If databathing is not working for you, you can help make it better by simply pasting your SQL (or JSON) into a new issue. Extra points if you describe the problem. Even more points if you submit a PR with a test. If you also submit a fix, then you also have my gratitude.
Please follow this blog post when publishing a new version: https://circleci.com/blog/publishing-a-python-package/
### Run Tests
See [the tests directory](https://github.com/jason-jz-zhu/databathing/tree/develop/tests) for instructions on running tests, or on writing new ones.
## Version Changes
### Version 0.3.0
*August 2025*
**Major New Feature: SQL Set Operations**
* **`UNION`** - Combines results from multiple queries and removes duplicates
* **`UNION ALL`** - Combines results from multiple queries keeping all records
* **`INTERSECT`** - Returns only records that exist in all queries
* **`EXCEPT`** - Returns records from first query that don't exist in second query
**Enhanced Features:**
* Complex subqueries with WHERE clauses in set operations
* ORDER BY and LIMIT support on combined results
* Table aliases and complex expressions in set operations
* Seamless integration with existing WITH statements (CTEs)
* Support for multiple chained set operations
* Comprehensive test coverage with 10 additional test cases
**Technical Improvements:**
* Enhanced parsing engine with `_set_operation_analyze()` method
* Improved FROM clause handling for nested set operations
* Maintained full backward compatibility
* Added `__version__` constant for programmatic version checking
### Version 1 (Legacy)
*May 2022*
Core Features and Functionalities - PySpark Version
* `SELECT` feature (including `SELECT DISTINCT`)
* `FROM` feature with table aliases
* `INNER`, `LEFT`, and `RIGHT` JOIN features
* `ON` feature for join conditions
* `WHERE` feature with complex conditions
* `GROUP BY` feature with aggregations
* `HAVING` feature for post-aggregation filtering
* `ORDER BY` feature with ASC/DESC sorting
* `LIMIT` feature for result limiting
* `AGG` feature with aggregation functions
* WINDOW FUNCTION feature (`SUM`, `AVG`, `MAX`, `MIN`, `MEAN`, `COUNT`)
* ALIAS NAME feature for tables and columns
* `WITH` STATEMENT feature (Common Table Expressions)