query-farm-sql-scan-planning


Namequery-farm-sql-scan-planning JSON
Version 0.1.5 PyPI version JSON
download
home_pageNone
SummaryA Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.
upload_time2025-07-10 20:39:06
maintainerNone
docs_urlNone
authorNone
requires_python>=3.11
licenseNone
keywords predicate pushdown scan planning sql sql scan planning
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # [Query.Farm](https://query.farm) SQL Scan Planning

A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.

## Overview

This module provides predicate pushdown capabilities for file-based data storage systems. By maintaining metadata about file contents (min/max values, value sets, null presence), the library can quickly determine which files contain data that could satisfy a given `SQL WHERE` clause, significantly reducing query execution time.

## Features

- **SQL Expression Parsing**: Parse and evaluate complex `SQL WHERE` clauses using [SQLGlot](https://github.com/tobymao/sqlglot)
- **Metadata-Based Filtering**: Support for both range-based (min/max) and set-based field metadata
- **Null Handling**: Comprehensive support for `NULL` value semantics in SQL expressions
- **Complex Predicates**: Handle `AND`, `OR`, `XOR`, `NOT`, `IN`, `BETWEEN`, `CASE` statements, and more
- **Multiple Data Types**: Support for integers, floats, strings, decimals, and `NULL` values
- **Dialect Support**: Configurable SQL dialect support (default: DuckDB)

## Installation

```bash
pip install query-farm-sql-scan-planning
```

Or using rye:

```bash
rye add query-farm-sql-scan-planning
```

## Quick Start

```python
from query_farm_sql_scan_planning import Planner, RangeFieldInfo, SetFieldInfo
import pyarrow as pa

# Define file metadata
files = [
    (
        "data_2023_q1.parquet",
        {
            "sales_amount": RangeFieldInfo(
                min_value=pa.scalar(100), max_value=pa.scalar(50000),
                has_nulls=False, has_non_nulls=True
            ),
            "region": SetFieldInfo[str](
                values={pa.scalar("US"), pa.scalar("CA"), pa.scalar("MX")},
                has_nulls=False, has_non_nulls=True
            ),
        }
    ),
    (
        "data_2023_q2.parquet",
        {
            "sales_amount": RangeFieldInfo[int](
                min_value=pa.scalar(200), max_value=pa.scalar(75000),
                has_nulls=False, has_non_nulls=True
            ),
            "region": SetFieldInfo[str](
                values={pa.scalar("US"), pa.scalar("EU"), pa.scalar("UK")},
                has_nulls=False, has_non_nulls=True
            ),
        }
    ),
]

# Create planner
planner = Planner(files)

# Filter files based on SQL expressions
matching_files = set(planner.get_matching_files("sales_amount > 40000 AND region = 'US'"))
print(matching_files)  # {'data_2023_q1.parquet', 'data_2023_q2.parquet'}

# More complex queries
matching_files = set(planner.get_matching_files("region IN ('EU', 'UK')"))
print(matching_files)  # {'data_2023_q2.parquet'}
```

## Field Information Types

### `RangeFieldInfo`

For fields with known minimum and maximum values:

```python
RangeFieldInfo(
    min_value=pa.scalar(0),
    max_value=pa.scalar(100),
    has_nulls=False,      # Whether the field contains NULL values
    has_non_nulls=True    # Whether the field contains non-NULL values
)
```

### `SetFieldInfo`

For fields with a known set of possible values (useful for categorical data):

```python
SetFieldInfo(
    values={pa.scalar("apple"), pa.scalar("banana"), pa.scalar("cherry")},
    has_nulls=False,
    has_non_nulls=True
)
```

**Note**: `SetFieldInfo` can produce false positives - if a value is in the set, the file *might* contain it, but the file could contain additional values not in the set.

## Supported SQL Operations

### Comparison Operators
- `=`, `!=`, `<>` (equality and inequality)
- `<`, `<=`, `>`, `>=` (range comparisons)
- `IS NULL`, `IS NOT NULL` (null checks)
- `IS DISTINCT FROM`, `IS NOT DISTINCT FROM` (null-safe comparisons)

### Logical Operators
- `AND`, `OR`, `XOR` (logical connectors)
- `NOT` (negation)

### Set Operations
- `IN`, `NOT IN` (membership tests)
- `BETWEEN`, `NOT BETWEEN` (range tests)

### Control Flow
- `CASE WHEN ... THEN ... ELSE ... END` (conditional expressions)

### Literals
- Numeric literals: `123`, `45.67`
- String literals: `'hello'`
- Boolean literals: `TRUE`, `FALSE`
- NULL literal: `NULL`

## Examples

### Range Queries
```python
# Files with sales between 1000 and 5000
planner.get_matching_files("sales_amount BETWEEN 1000 AND 5000")

# Files with any sales over 10000
planner.get_matching_files("sales_amount > 10000")
```

### Set Membership
```python
# Files containing specific regions
planner.get_matching_files("region IN ('US', 'CA')")

# Files not containing specific regions
planner.get_matching_files("region NOT IN ('UNKNOWN', 'TEST')")
```

### Complex Conditions
```python
# Combination of range and set conditions
planner.get_matching_files(
    "sales_amount > 5000 AND region IN ('US', 'EU') AND customer_id IS NOT NULL"
)

# Case expressions
planner.get_matching_files(
    "CASE WHEN region = 'US' THEN sales_amount > 1000 ELSE sales_amount > 500 END"
)
```

### Null Handling
```python
# Files that might contain null values in sales_amount
planner.get_matching_files("sales_amount IS NULL")

# Files with non-null sales amounts over 1000
planner.get_matching_files("sales_amount IS NOT NULL AND sales_amount > 1000")
```

## Performance Considerations

- **Metadata Quality**: More accurate metadata (tighter ranges, complete value sets) leads to better filtering
- **Expression Complexity**: Simple expressions evaluate faster than complex nested conditions
- **False Positives**: The library errs on the side of including files that might match rather than risk excluding files that do match

## Use Cases

- **Data Lake Query Optimization**: Skip irrelevant files in distributed query engines
- **ETL Pipeline Optimization**: Process only files containing relevant data
- **Data Catalog Integration**: Enhance metadata catalogs with query planning capabilities
- **Columnar Storage**: Optimize scans of Parquet, ORC, or similar formats

## Development

### Setup
```bash
git clone https://github.com/query-farm/python-sql-scan-planning.git
cd python-sql-scan-planning
rye sync
```

### Running Tests
```bash
rye run pytest
```

### Code Quality
```bash
rye run ruff check
rye run pytest --mypy
```

## Dependencies

- **sqlglot**: SQL parsing and AST manipulation
- **Python 3.12+**: Required for modern type hints and pattern matching

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## Author

This Python module was created by [Query.Farm](https://query.farm).

# License

MIT Licensed.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "query-farm-sql-scan-planning",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "predicate pushdown, scan planning, sql, sql scan planning",
    "author": null,
    "author_email": "Rusty Conover <rusty@conover.me>",
    "download_url": "https://files.pythonhosted.org/packages/29/9b/26fb9be68a70e30dc306b7e475cf313657fa020a8b1d77c22d12c692f614/query_farm_sql_scan_planning-0.1.5.tar.gz",
    "platform": null,
    "description": "# [Query.Farm](https://query.farm) SQL Scan Planning\n\nA Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.\n\n## Overview\n\nThis module provides predicate pushdown capabilities for file-based data storage systems. By maintaining metadata about file contents (min/max values, value sets, null presence), the library can quickly determine which files contain data that could satisfy a given `SQL WHERE` clause, significantly reducing query execution time.\n\n## Features\n\n- **SQL Expression Parsing**: Parse and evaluate complex `SQL WHERE` clauses using [SQLGlot](https://github.com/tobymao/sqlglot)\n- **Metadata-Based Filtering**: Support for both range-based (min/max) and set-based field metadata\n- **Null Handling**: Comprehensive support for `NULL` value semantics in SQL expressions\n- **Complex Predicates**: Handle `AND`, `OR`, `XOR`, `NOT`, `IN`, `BETWEEN`, `CASE` statements, and more\n- **Multiple Data Types**: Support for integers, floats, strings, decimals, and `NULL` values\n- **Dialect Support**: Configurable SQL dialect support (default: DuckDB)\n\n## Installation\n\n```bash\npip install query-farm-sql-scan-planning\n```\n\nOr using rye:\n\n```bash\nrye add query-farm-sql-scan-planning\n```\n\n## Quick Start\n\n```python\nfrom query_farm_sql_scan_planning import Planner, RangeFieldInfo, SetFieldInfo\nimport pyarrow as pa\n\n# Define file metadata\nfiles = [\n    (\n        \"data_2023_q1.parquet\",\n        {\n            \"sales_amount\": RangeFieldInfo(\n                min_value=pa.scalar(100), max_value=pa.scalar(50000),\n                has_nulls=False, has_non_nulls=True\n            ),\n            \"region\": SetFieldInfo[str](\n                values={pa.scalar(\"US\"), pa.scalar(\"CA\"), pa.scalar(\"MX\")},\n                has_nulls=False, has_non_nulls=True\n            ),\n        }\n    ),\n    (\n        \"data_2023_q2.parquet\",\n        {\n            \"sales_amount\": RangeFieldInfo[int](\n                min_value=pa.scalar(200), max_value=pa.scalar(75000),\n                has_nulls=False, has_non_nulls=True\n            ),\n            \"region\": SetFieldInfo[str](\n                values={pa.scalar(\"US\"), pa.scalar(\"EU\"), pa.scalar(\"UK\")},\n                has_nulls=False, has_non_nulls=True\n            ),\n        }\n    ),\n]\n\n# Create planner\nplanner = Planner(files)\n\n# Filter files based on SQL expressions\nmatching_files = set(planner.get_matching_files(\"sales_amount > 40000 AND region = 'US'\"))\nprint(matching_files)  # {'data_2023_q1.parquet', 'data_2023_q2.parquet'}\n\n# More complex queries\nmatching_files = set(planner.get_matching_files(\"region IN ('EU', 'UK')\"))\nprint(matching_files)  # {'data_2023_q2.parquet'}\n```\n\n## Field Information Types\n\n### `RangeFieldInfo`\n\nFor fields with known minimum and maximum values:\n\n```python\nRangeFieldInfo(\n    min_value=pa.scalar(0),\n    max_value=pa.scalar(100),\n    has_nulls=False,      # Whether the field contains NULL values\n    has_non_nulls=True    # Whether the field contains non-NULL values\n)\n```\n\n### `SetFieldInfo`\n\nFor fields with a known set of possible values (useful for categorical data):\n\n```python\nSetFieldInfo(\n    values={pa.scalar(\"apple\"), pa.scalar(\"banana\"), pa.scalar(\"cherry\")},\n    has_nulls=False,\n    has_non_nulls=True\n)\n```\n\n**Note**: `SetFieldInfo` can produce false positives - if a value is in the set, the file *might* contain it, but the file could contain additional values not in the set.\n\n## Supported SQL Operations\n\n### Comparison Operators\n- `=`, `!=`, `<>` (equality and inequality)\n- `<`, `<=`, `>`, `>=` (range comparisons)\n- `IS NULL`, `IS NOT NULL` (null checks)\n- `IS DISTINCT FROM`, `IS NOT DISTINCT FROM` (null-safe comparisons)\n\n### Logical Operators\n- `AND`, `OR`, `XOR` (logical connectors)\n- `NOT` (negation)\n\n### Set Operations\n- `IN`, `NOT IN` (membership tests)\n- `BETWEEN`, `NOT BETWEEN` (range tests)\n\n### Control Flow\n- `CASE WHEN ... THEN ... ELSE ... END` (conditional expressions)\n\n### Literals\n- Numeric literals: `123`, `45.67`\n- String literals: `'hello'`\n- Boolean literals: `TRUE`, `FALSE`\n- NULL literal: `NULL`\n\n## Examples\n\n### Range Queries\n```python\n# Files with sales between 1000 and 5000\nplanner.get_matching_files(\"sales_amount BETWEEN 1000 AND 5000\")\n\n# Files with any sales over 10000\nplanner.get_matching_files(\"sales_amount > 10000\")\n```\n\n### Set Membership\n```python\n# Files containing specific regions\nplanner.get_matching_files(\"region IN ('US', 'CA')\")\n\n# Files not containing specific regions\nplanner.get_matching_files(\"region NOT IN ('UNKNOWN', 'TEST')\")\n```\n\n### Complex Conditions\n```python\n# Combination of range and set conditions\nplanner.get_matching_files(\n    \"sales_amount > 5000 AND region IN ('US', 'EU') AND customer_id IS NOT NULL\"\n)\n\n# Case expressions\nplanner.get_matching_files(\n    \"CASE WHEN region = 'US' THEN sales_amount > 1000 ELSE sales_amount > 500 END\"\n)\n```\n\n### Null Handling\n```python\n# Files that might contain null values in sales_amount\nplanner.get_matching_files(\"sales_amount IS NULL\")\n\n# Files with non-null sales amounts over 1000\nplanner.get_matching_files(\"sales_amount IS NOT NULL AND sales_amount > 1000\")\n```\n\n## Performance Considerations\n\n- **Metadata Quality**: More accurate metadata (tighter ranges, complete value sets) leads to better filtering\n- **Expression Complexity**: Simple expressions evaluate faster than complex nested conditions\n- **False Positives**: The library errs on the side of including files that might match rather than risk excluding files that do match\n\n## Use Cases\n\n- **Data Lake Query Optimization**: Skip irrelevant files in distributed query engines\n- **ETL Pipeline Optimization**: Process only files containing relevant data\n- **Data Catalog Integration**: Enhance metadata catalogs with query planning capabilities\n- **Columnar Storage**: Optimize scans of Parquet, ORC, or similar formats\n\n## Development\n\n### Setup\n```bash\ngit clone https://github.com/query-farm/python-sql-scan-planning.git\ncd python-sql-scan-planning\nrye sync\n```\n\n### Running Tests\n```bash\nrye run pytest\n```\n\n### Code Quality\n```bash\nrye run ruff check\nrye run pytest --mypy\n```\n\n## Dependencies\n\n- **sqlglot**: SQL parsing and AST manipulation\n- **Python 3.12+**: Required for modern type hints and pattern matching\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Add tests for new functionality\n4. Ensure all tests pass\n5. Submit a pull request\n\n## Author\n\nThis Python module was created by [Query.Farm](https://query.farm).\n\n# License\n\nMIT Licensed.\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.",
    "version": "0.1.5",
    "project_urls": {
        "Issues": "https://github.com/query-farm/python-sql-scan-planning/issues",
        "Repository": "https://github.com/query-farm/python-sql-scan-planning.git"
    },
    "split_keywords": [
        "predicate pushdown",
        " scan planning",
        " sql",
        " sql scan planning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "67de56cb9dec66c6677aab658dc5739e5d45ea177ec0607d978f43f5f2f6b9e7",
                "md5": "04de92ce28d07ceae2a825594b22c2af",
                "sha256": "570afa7bd5ddf515a564923a453b5cb076c28a93279b501933711002bcf971f2"
            },
            "downloads": -1,
            "filename": "query_farm_sql_scan_planning-0.1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "04de92ce28d07ceae2a825594b22c2af",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 11476,
            "upload_time": "2025-07-10T20:39:04",
            "upload_time_iso_8601": "2025-07-10T20:39:04.823017Z",
            "url": "https://files.pythonhosted.org/packages/67/de/56cb9dec66c6677aab658dc5739e5d45ea177ec0607d978f43f5f2f6b9e7/query_farm_sql_scan_planning-0.1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "299b26fb9be68a70e30dc306b7e475cf313657fa020a8b1d77c22d12c692f614",
                "md5": "b4eb5de38cb30a4592b98f2d3727f2a3",
                "sha256": "cb52b64380e5241e4095d922f8e2d602c0ac9dc821a53b84a2362b9539ff9e1e"
            },
            "downloads": -1,
            "filename": "query_farm_sql_scan_planning-0.1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "b4eb5de38cb30a4592b98f2d3727f2a3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 10528,
            "upload_time": "2025-07-10T20:39:06",
            "upload_time_iso_8601": "2025-07-10T20:39:06.260020Z",
            "url": "https://files.pythonhosted.org/packages/29/9b/26fb9be68a70e30dc306b7e475cf313657fa020a8b1d77c22d12c692f614/query_farm_sql_scan_planning-0.1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-10 20:39:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "query-farm",
    "github_project": "python-sql-scan-planning",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "query-farm-sql-scan-planning"
}
        
Elapsed time: 0.57294s