![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)
![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)

<h1 style="display: flex; align-items: center; gap: 0.5rem;"><img src="https://raw.githubusercontent.com/maltzsama/sumeh/refs/heads/main/docs/img/sumeh.svg" alt="Logo" style="height: 40px; width: auto; vertical-align: middle;" /> <span>Sumeh DQ</span></h1>

Sumeh is a unified data quality validation framework supporting multiple backends (Pandas, PySpark, Dask, Polars, DuckDB) with centralized rule configuration.

## 🚀 Installation

```bash
# Using pip
pip install sumeh

# Or with conda-forge
conda install -c conda-forge sumeh
```

**Prerequisites:**  
- Python 3.10+  
- One or more of: `pyspark`, `dask[dataframe]`, `polars`, `duckdb`, `cuallee`

## 🔍 Core API

- **`report(df, rules, name="Quality Check")`**  
  Apply your validation rules over any DataFrame (Pandas, Spark, Dask, Polars, or DuckDB).  
- **`validate(df, rules)`** *(per-engine)*  
  Returns a DataFrame with a `dq_status` column listing violations.  
- **`summarize(qc_df, rules, total_rows)`** *(per-engine)*  
  Consolidates violations into a summary report.
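
For example, the high-level `report` entry point over a pandas DataFrame (a minimal sketch; `data.csv` and `rules.csv` are placeholder file names):

```python
import pandas as pd

from sumeh import get_rules_config, report

df = pd.read_csv("data.csv")                          # any supported DataFrame type works here
rules = get_rules_config("rules.csv", delimiter=";")  # centralized rule configuration

summary = report(df, rules, name="Quality Check")
print(summary)
```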

## ⚙️ Supported Engines

Each engine implements the `validate()` + `summarize()` pair:

| Engine                | Module                                  | Status          |
|-----------------------|-----------------------------------------|-----------------|
| PySpark               | `sumeh.engine.pyspark_engine`           | ✅ Fully implemented |
| Dask                  | `sumeh.engine.dask_engine`              | ✅ Fully implemented |
| Polars                | `sumeh.engine.polars_engine`            | ✅ Fully implemented |
| DuckDB                | `sumeh.engine.duckdb_engine`            | ✅ Fully implemented |
| Pandas                | `sumeh.engine.pandas_engine`            | ✅ Fully implemented |
| BigQuery (SQL)        | `sumeh.engine.bigquery_engine`          | 🔧 Stub implementation |

## 🏗 Configuration Sources

**Load rules from CSV**

```python
from sumeh import get_rules_config

rules = get_rules_config("rules.csv", delimiter=";")
```

**Load rules from S3**
```python
from sumeh import get_rules_config
bucket_name = "<bucket>"
path = "<path>"
file_name = "<file_name>"

rules = get_rules_config(f"s3://{bucket_name}/{path}/{file_name}", delimiter=";")

```

**Load rules from MySQL**
```python
from sumeh import get_rules_config

host = "<host>"
port = "<port>"  # optional
user = "<username>"
password = "<passwd>"
database = "<database>"
table = "<rules_table>"
query = "<select * from rules>" # optional

rules = get_rules_config(
    source="mysql", 
    host=host, 
    user=user, 
    password=password, 
    database=database, 
    table=table, 
    query=query
)

# Or using the MySQL connector
import mysql.connector
conn = mysql.connector.connect(
    host=host,
    port=port,
    database=database,
    user=user,
    password=password
)

rules = get_rules_config(source="mysql", connection=conn, query=query)

```

**Load rules from Postgres**
```python
from sumeh import get_rules_config

host = "<host>"
port = "<port>"  # optional
user = "<username>"
password = "<passwd>"
database = "<database>"
schema = "<public>"
table = "<rules_table>"
query = "<select * from rules>" # optional

rules_pgsql = get_rules_config(
    source="postgresql", 
    host=host, user=user, 
    password=password, 
    database=database, 
    schema=schema, 
    table=table, 
    query=query
)

# Or using the PostgreSQL Connector
import psycopg2

conn = psycopg2.connect(
    host=host,
    database=database,
    user=user,
    password=password
)

rules_pgsql = get_rules_config(source="postgresql", connection=conn, query=query)

```

**Load rules from AWS Glue Data Catalog**
```python
import sys

from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

from sumeh import get_rules_config

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)


database_name = "<database>"
table_name = "<table>"

rules = get_rules_config(
    source="glue",
    glue_context=glue_context,
    database_name=database_name,
    table_name=table_name
)

job.commit()

```

**Load rules from Databricks Data Catalog**
```python
from sumeh import get_rules_config

catalog = "<catalog>"
schema = "<schema>"
table = "<table>"
query = "<query>"  # optional

rules = get_rules_config(
    "databricks",
    spark=spark,
    catalog=catalog,
    schema=schema,
    table=table,
    query=query
)

```


## 🏃‍♂️ Typical Workflow

```python
from sumeh import get_rules_config
from sumeh.engine.polars_engine import validate, summarize
import polars as pl

# 1) Load data and rules
df = pl.read_csv("data.csv")
rules = get_rules_config("rules.csv", delimiter=";")

# 2) Run validation
result, result_raw = validate(df, rules)

# 3) Generate summary
total = df.height
summary = summarize(result_raw, rules, total_rows=total)
print(summary)
```

**Or simply:**

```python
from sumeh import report

summary = report(df, rules)
```

## 📋 Rule Definition Example

```json
{
  "field": "customer_id",
  "check_type": "is_complete",
  "threshold": 0.99,
  "value": null,
  "execute": true
}
```
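
The same shape can be written inline as a list of Python dicts and passed straight to `report` or `validate` (a sketch, assuming rules are plain dicts with the keys shown above):

```python
rules = [
    {"field": "customer_id", "check_type": "is_complete", "threshold": 0.99, "value": None, "execute": True},
    {"field": "amount", "check_type": "is_positive", "threshold": 1.0, "value": None, "execute": True},
]
```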

## Supported Validation Rules

### Numeric checks

| Test                 | Description                                                                                               |
| -------------------- | --------------------------------------------------------------------------------------------------------- |
| **is_in_millions**   | Retains rows where the column value is **less than** 1,000,000 (fails the "in millions" criterion).       |
| **is_in_billions**   | Retains rows where the column value is **less than** 1,000,000,000 (fails the "in billions" criterion).   |

---

### Completeness & Uniqueness

| Test                   | Description                                                                 |
| ---------------------- | --------------------------------------------------------------------------- |
| **is_complete**        | Filters rows where the column value is null.                                |
| **are_complete**       | Filters rows where **any** of the specified columns are null.               |
| **is_unique**          | Identifies rows with duplicate values in the specified column.              |
| **are_unique**         | Identifies rows with duplicate combinations of the specified columns.       |
| **is_primary_key**     | Alias for `is_unique` (checks uniqueness of a single column).               |
| **is_composite_key**   | Alias for `are_unique` (checks combined uniqueness of multiple columns).    |

---

### Comparison & Range

| Test                             | Description                                                                             |
| -------------------------------- | --------------------------------------------------------------------------------------- |
| **is_equal**                     | Filters rows where the column is not equal to the provided value (null-safe).           |
| **is_equal_than**                | Alias for `is_equal`.                                                                   |
| **is_between**                   | Filters rows where the column value is **outside** the numeric range `[min, max]`.       |
| **is_greater_than**              | Filters rows where the column value is **≤** the threshold (fails "greater than").       |
| **is_greater_or_equal_than**     | Filters rows where the column value is **<** the threshold (fails "greater or equal").   |
| **is_less_than**                 | Filters rows where the column value is **≥** the threshold (fails "less than").          |
| **is_less_or_equal_than**        | Filters rows where the column value is **>** the threshold (fails "less or equal").      |
| **is_positive**                  | Filters rows where the column value is **< 0** (fails "positive").                       |
| **is_negative**                  | Filters rows where the column value is **≥ 0** (fails "negative").                       |

---

### Membership & Pattern

| Test                   | Description                                                                               |
| ---------------------- | ----------------------------------------------------------------------------------------- |
| **is_contained_in**    | Filters rows where the column value is **not** in the provided list.                       |
| **not_contained_in**   | Filters rows where the column value **is** in the provided list.                           |
| **has_pattern**        | Filters rows where the column value does **not** match the specified regex.                |
| **is_legit**           | Filters rows where the column value is null or contains whitespace (i.e., not `\S+`).      |

---

### Aggregate checks

| Test                 | Description                                                                                                      |
| -------------------- | ---------------------------------------------------------------------------------------------------------------- |
| **has_min**          | Returns all rows if the column's minimum value **causes failure** (value < threshold); otherwise returns empty.   |
| **has_max**          | Returns all rows if the column's maximum value **causes failure** (value > threshold); otherwise returns empty.   |
| **has_sum**          | Returns all rows if the column's sum **causes failure** (sum > threshold); otherwise returns empty.               |
| **has_mean**         | Returns all rows if the column's mean **causes failure** (mean > threshold); otherwise returns empty.             |
| **has_std**          | Returns all rows if the column's standard deviation **causes failure** (std > threshold); otherwise returns empty.|
| **has_cardinality**  | Returns all rows if the number of distinct values **causes failure** (count > threshold); otherwise returns empty.|
| **has_infogain**     | Same logic as `has_cardinality` (proxy for information gain).                                                    |
| **has_entropy**      | Same logic as `has_cardinality` (proxy for entropy).                                                             |

---

### SQL & Schema

| Test                 | Description                                                                                                     |
| -------------------- | --------------------------------------------------------------------------------------------------------------- |
| **satisfies**        | Filters rows where the SQL expression (based on `rule["value"]`) is **not** satisfied.                          |
| **validate_schema**  | Compares the DataFrame's actual schema against the expected one and returns a match flag + error list.          |
| **validate**         | Executes a list of named rules and returns two DataFrames: one with aggregated status and one with raw violations. |
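
For instance, a `satisfies` rule carries its SQL expression in `value` (a hypothetical rule; the column names are illustrative):

```python
satisfies_rule = {
    "field": "amount",
    "check_type": "satisfies",
    "value": "amount >= 0 AND currency IS NOT NULL",  # SQL expression checked per row
    "execute": True,
}
```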

---

### Date-related checks

| Test                       | Description                                                                                     |
| -------------------------- | ----------------------------------------------------------------------------------------------- |
| **is_t_minus_1**           | Retains rows where the date in the column is **not** equal to yesterday (T–1).                       |
| **is_t_minus_2**           | Retains rows where the date in the column is **not** equal to two days ago (T–2).                    |
| **is_t_minus_3**           | Retains rows where the date in the column is **not** equal to three days ago (T–3).                  |
| **is_today**               | Retains rows where the date in the column is **not** equal to today.                                 |
| **is_yesterday**           | Retains rows where the date in the column is **not** equal to yesterday.                             |
| **is_on_weekday**          | Retains rows where the date in the column falls on a **weekend** (fails "weekday").                  |
| **is_on_weekend**          | Retains rows where the date in the column falls on a **weekday** (fails "weekend").                  |
| **is_on_monday**           | Retains rows where the date in the column is **not** Monday.                                         |
| **is_on_tuesday**          | Retains rows where the date in the column is **not** Tuesday.                                        |
| **is_on_wednesday**        | Retains rows where the date in the column is **not** Wednesday.                                      |
| **is_on_thursday**         | Retains rows where the date in the column is **not** Thursday.                                       |
| **is_on_friday**           | Retains rows where the date in the column is **not** Friday.                                         |
| **is_on_saturday**         | Retains rows where the date in the column is **not** Saturday.                                       |
| **is_on_sunday**           | Retains rows where the date in the column is **not** Sunday.                                         |
| **validate_date_format**   | Filters rows where the date doesn't match the expected format or is null.                        |
| **is_future_date**         | Filters rows where the date in the column is **not** after today.                                    |
| **is_past_date**           | Filters rows where the date in the column is **not** before today.                                   |
| **is_date_after**          | Filters rows where the date in the column is **not** after the date provided in the rule.            |
| **is_date_before**         | Filters rows where the date in the column is **not** before the date provided in the rule.           |
| **is_date_between**        | Filters rows where the date in the column is **outside** the range `[start, end]`.                   |
| **all_date_checks**        | Alias for `is_past_date` (same logic: date before today).                                        |



## Schema Validation

Sumeh allows you to validate your DataFrame schemas against a schema registry stored in various data sources (BigQuery, MySQL, PostgreSQL, DuckDB, Databricks, Glue, CSV, S3).

### Step 1: Store Your Schema Registry

First, create a `schema_registry` table in your data source with the following structure:

| Column | Type | Description |
|--------|------|-------------|
| id | int | Auto-increment ID |
| environment | string | Environment (e.g., 'prod', 'staging', 'dev') |
| source_type | string | Source type (e.g., 'bigquery', 'mysql') |
| database_name | string | Database/project name |
| catalog_name | string | Catalog name (for Databricks) |
| schema_name | string | Schema name (for PostgreSQL) |
| table_name | string | Table name |
| field | string | Column name |
| data_type | string | Data type |
| nullable | boolean | Whether column can be null |
| max_length | int | Maximum length for strings |
| comment | string | Description/comment |
| created_at | datetime | Creation timestamp |
| updated_at | datetime | Last update timestamp |
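
As a sketch, the registry can be bootstrapped in DuckDB (only the column names come from the spec above; the SQL types and sample rows are assumptions to adapt to your warehouse):

```python
import duckdb

conn = duckdb.connect("my_database.db")

# Assumed DDL -- column names follow the registry table above
conn.execute("""
    CREATE TABLE IF NOT EXISTS schema_registry (
        id            INTEGER,
        environment   VARCHAR,
        source_type   VARCHAR,
        database_name VARCHAR,
        catalog_name  VARCHAR,   -- used by Databricks
        schema_name   VARCHAR,   -- used by PostgreSQL
        table_name    VARCHAR,
        field         VARCHAR,
        data_type     VARCHAR,
        nullable      BOOLEAN,
        max_length    INTEGER,
        "comment"     VARCHAR,
        created_at    TIMESTAMP,
        updated_at    TIMESTAMP
    )
""")

# Hypothetical rows describing a pandas-flavored `users` table
conn.execute("""
    INSERT INTO schema_registry
        (id, environment, source_type, database_name, table_name,
         field, data_type, nullable, max_length)
    VALUES
        (1, 'prod', 'duckdb', 'my_database', 'users', 'id',    'int64',  FALSE, NULL),
        (2, 'prod', 'duckdb', 'my_database', 'users', 'email', 'object', FALSE, 255)
""")
```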

### Step 2: Get Schema Configuration

Use `get_schema_config()` to retrieve the expected schema from your registry:

#### BigQuery

```python
from sumeh.services.config import get_schema_config

schema = get_schema_config(
    source="bigquery",
    project_id="my-project",
    dataset_id="my-dataset",
    table_id="users",
    environment="prod"  # optional, defaults to 'prod'
)
```

#### MySQL

```python
# Option 1: Create connection internally
schema = get_schema_config(
    source="mysql",
    host="localhost",
    user="root",
    password="secret",
    database="mydb",
    table="users",
    environment="prod"
)

# Option 2: Reuse existing connection
import mysql.connector
conn = mysql.connector.connect(host="localhost", user="root", password="secret", database="mydb")

schema = get_schema_config(
    source="mysql",
    conn=conn,
    table="users",
    environment="prod"
)
```

#### PostgreSQL

```python
# Option 1: Create connection internally
schema = get_schema_config(
    source="postgresql",
    host="localhost",
    user="postgres",
    password="secret",
    database="mydb",
    schema="public",
    table="users",
    environment="prod"
)

# Option 2: Reuse existing connection
import psycopg2
conn = psycopg2.connect(host="localhost", user="postgres", password="secret", dbname="mydb")

schema = get_schema_config(
    source="postgresql",
    conn=conn,
    schema="public",
    table="users",
    environment="prod"
)
```

#### DuckDB

```python
import duckdb

conn = duckdb.connect("my_database.db")

schema = get_schema_config(
    source="duckdb",
    conn=conn,
    table="users",
    environment="prod"
)
```

#### Databricks

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

schema = get_schema_config(
    source="databricks",
    spark=spark,
    catalog="main",
    schema="default",
    table="users",
    environment="prod"
)
```

#### AWS Glue

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

schema = get_schema_config(
    source="glue",
    glue_context=glueContext,
    database_name="my_database",
    table_name="users",
    environment="prod"
)
```

#### CSV

```python
schema = get_schema_config(
    source="schema_registry.csv",
    table="users",
    environment="prod"
)
```

#### S3

```python
schema = get_schema_config(
    source="s3://my-bucket/path/schema_registry.csv",
    table="users",
    environment="prod"
)
```

### Step 3: Validate DataFrame Schema

Once you have the expected schema, validate your DataFrame against it:

```python
from sumeh.core import validate_schema

# Load your DataFrame (example with pandas)
import pandas as pd
df = pd.read_csv("users.csv")

# Validate
is_valid, errors = validate_schema(
    df_or_conn=df,
    expected=schema
)

if is_valid:
    print("✅ Schema is valid!")
else:
    print("❌ Schema validation failed:")
    for field, error in errors:
        print(f"  - {field}: {error}")
```

#### Example Output

```
❌ Schema validation failed:
  - email: missing
  - age: type mismatch (got 'object', expected 'int64')
  - created_at: nullable but expected non-nullable
  - extra_field: extra column
```

### Advanced: Custom Filters

You can add custom WHERE clauses to filter the schema registry:

```python
schema = get_schema_config(
    source="bigquery",
    project_id="my-project",
    dataset_id="my-dataset",
    table_id="users",
    environment="prod",
    query="source_type = 'bigquery' AND catalog_name IS NOT NULL"
)
```

**Note:** The `query` parameter adds additional filters to the base filter (`table_name` and `environment`).

### Supported Engines

Schema validation works with all supported DataFrame engines:
- Dask
- DuckDB
- Pandas
- Polars
- PySpark

**Important:** Make sure the `data_type` values in your `schema_registry` match the type strings your DataFrame engine reports (e.g., `int64` for pandas, `string` for PySpark). Comparisons are case-insensitive.
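
For example, the same logical columns may need different `data_type` entries depending on which engine reads the table (illustrative dtype strings only; verify against what your engine actually reports):

```python
# Hypothetical per-engine dtype strings for an integer id and a text email column
EXPECTED_DATA_TYPES = {
    "pandas":  {"id": "int64",  "email": "object"},
    "pyspark": {"id": "bigint", "email": "string"},
}
```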


## 📂 Project Layout

```
sumeh/
├── poetry.lock
├── pyproject.toml
├── README.md
└── sumeh
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── engine
    │   ├── __init__.py
    │   ├── bigquery_engine.py
    │   ├── dask_engine.py
    │   ├── duckdb_engine.py
    │   ├── pandas_engine.py
    │   ├── polars_engine.py
    │   └── pyspark_engine.py
    └── services
        ├── __init__.py
        ├── config.py
        ├── index.html
        └── utils.py

```

## 📈 Roadmap

- [ ] Complete BigQuery engine implementation
- [x] Complete Pandas engine implementation
- [x] Enhanced documentation
- [x] More validation rule types
- [ ] Performance optimizations

## 🤝 Contributing

1. Fork & create a feature branch  
2. Implement new checks or engines, following existing signatures  
3. Add tests under `tests/`  
4. Open a PR and ensure CI passes

## 📜 License

Licensed under the [Apache License 2.0](LICENSE).

            
