

# <h1 style="display: flex; align-items: center; gap: 0.5rem;"><img src="https://raw.githubusercontent.com/maltzsama/sumeh/refs/heads/main/docs/img/sumeh.svg" alt="Logo" style="height: 40px; width: auto; vertical-align: middle;" /> <span>Sumeh DQ</span> </h1>
Sumeh is a unified data quality validation framework supporting multiple backends (Pandas, PySpark, Dask, Polars, and DuckDB) with centralized rule configuration.
## 🚀 Installation
```bash
# Using pip
pip install sumeh
# Or with conda-forge
conda install -c conda-forge sumeh
```
**Prerequisites:**
- Python 3.10+
- One or more of: `pyspark`, `dask[dataframe]`, `polars`, `duckdb`, `cuallee`
## 🔍 Core API
- **`report(df, rules, name="Quality Check")`**
Apply your validation rules over any DataFrame (Pandas, Spark, Dask, Polars, or DuckDB).
- **`validate(df, rules)`** *(per-engine)*
Returns a DataFrame with a `dq_status` column listing violations.
- **`summarize(qc_df, rules, total_rows)`** *(per-engine)*
Consolidates violations into a summary report.
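
A minimal end-to-end sketch of the core API (this assumes rules can be passed as a plain list of dicts shaped like the rule-definition example later in this README):

```python
# Minimal sketch: one inline rule run through report() on a pandas DataFrame.
# Assumes rules may be passed as a list of dicts shaped like the
# "Rule Definition Example" further down in this README.
import pandas as pd

from sumeh import report

df = pd.DataFrame({"customer_id": [1, 2, None]})

rules = [
    {
        "field": "customer_id",
        "check_type": "is_complete",
        "threshold": 0.99,
        "value": None,
        "execute": True,
    }
]

summary = report(df, rules, name="Quality Check")
print(summary)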
## ⚙️ Supported Engines
Each engine implements the `validate()` + `summarize()` pair:
| Engine | Module | Status |
|-----------------------|-----------------------------------------|-----------------|
| PySpark | `sumeh.engine.pyspark_engine` | ✅ Fully implemented |
| Dask | `sumeh.engine.dask_engine` | ✅ Fully implemented |
| Polars | `sumeh.engine.polars_engine` | ✅ Fully implemented |
| DuckDB | `sumeh.engine.duckdb_engine` | ✅ Fully implemented |
| Pandas | `sumeh.engine.pandas_engine` | ✅ Fully implemented |
| BigQuery (SQL) | `sumeh.engine.bigquery_engine` | 🔧 Stub implementation |
## 🏗 Configuration Sources
**Load rules from CSV**
```python
from sumeh import get_rules_config
rules = get_rules_config("rules.csv", delimiter=";")
```
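As a sketch of what such a file might contain (the header columns below are an assumption mirroring the rule-definition example further down; adjust them to your own registry):

```python
# Hypothetical rules.csv: the header columns mirror the rule-definition
# example in this README, and the exact "value" formats are assumptions.
from pathlib import Path

from sumeh import get_rules_config

Path("rules.csv").write_text(
    "field;check_type;value;threshold;execute\n"
    "customer_id;is_complete;;0.99;true\n"
    "status;is_contained_in;[active,inactive];1.0;true\n"
)

rules = get_rules_config("rules.csv", delimiter=";")
print(rules)
```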
**Load rules from S3**
```python
from sumeh import get_rules_config
bucket_name = "<bucket>"
path = "<path>"
file_name = "<file_name>"
rules = get_rules_config(f"s3://{bucket_name}/{path}/{file_name}", delimiter=";")
```
**Load rules from MySQL**
```python
from sumeh import get_rules_config
host = "<host>"
port = "<port>" #optional
user = "<username>"
password = "<passwd>"
database = "<database>"
table = "<rules_table>"
query = "<select * from rules>" # optional
rules = get_rules_config(
    source="mysql",
    host=host,
    user=user,
    password=password,
    database=database,
    table=table,
    query=query
)

# Or using the MySQL connector
import mysql.connector
conn = mysql.connector.connect(
    host=host,
    port=port,
    database=database,
    user=user,
    password=password
)

rules = get_rules_config(source="mysql", connection=conn, query=query)
```
**Load rules from Postgres**
```python
from sumeh import get_rules_config
host = "<host>"
port = "<port>" #optional
user = "<username>"
password = "<passwd>"
database = "<database>"
schema = "<public>"
table = "<rules_table>"
query = "<select * from rules>" # optional
rules_pgsql = get_rules_config(
    source="postgresql",
    host=host,
    user=user,
    password=password,
    database=database,
    schema=schema,
    table=table,
    query=query
)

# Or using the PostgreSQL connector
import psycopg2

conn = psycopg2.connect(
    host=host,
    database=database,
    user=user,
    password=password
)
rules_pgsql = get_rules_config(source="postgresql", connection=conn, query=query)
```
**Load rules from AWS Glue Data Catalog**
```python
import sys

from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

from sumeh import get_rules_config

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

database_name = "<database>"
table_name = "<table>"

rules = get_rules_config(
    source="glue",
    glue_context=glue_context,
    database_name=database_name,
    table_name=table_name
)
job.commit()
```
**Load rules from Databricks Data Catalog**
```python
from sumeh import get_rules_config
catalog = "<catalog>"
schema = "<schema>"
table = "<table>"
query = "<query>"  # optional

rules = get_rules_config(
    "databricks",
    spark=spark,
    catalog=catalog,
    schema=schema,
    table=table,
    query=query
)
```
## 🏃‍♂️ Typical Workflow
```python
import polars as pl

from sumeh import get_rules_config
from sumeh.engine.polars_engine import validate, summarize

# 1) Load data and rules
df = pl.read_csv("data.csv")
rules = get_rules_config("rules.csv", delimiter=";")

# 2) Run validation
result, result_raw = validate(df, rules)

# 3) Generate summary
total = df.height
summary = summarize(result_raw, rules, total_rows=total)
print(summary)
```
**Or simply:**
```python
from sumeh import report

summary = report(df, rules)
```
## 📋 Rule Definition Example
```json
{
  "field": "customer_id",
  "check_type": "is_complete",
  "threshold": 0.99,
  "value": null,
  "execute": true
}
```
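In practice you usually pass several rules at once. A rough sketch (the `value` formats shown here are assumptions and depend on the check type and your configuration source):

```python
# Hypothetical batch of rules covering several check categories.
# The "value" formats are illustrative only; consult your rules source
# for the exact representation each check expects.
rules = [
    {"field": "customer_id", "check_type": "is_complete",  "value": None,         "threshold": 0.99, "execute": True},
    {"field": "age",         "check_type": "is_between",   "value": "[18,120]",   "threshold": 1.0,  "execute": True},
    {"field": "email",       "check_type": "has_pattern",  "value": r"^\S+@\S+$", "threshold": 1.0,  "execute": True},
    {"field": "created_at",  "check_type": "is_past_date", "value": None,         "threshold": 1.0,  "execute": True},
]
```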
## Supported Validation Rules
### Numeric checks
| Test | Description |
| -------------------- | --------------------------------------------------------------------------------------------------------- |
| **is_in_millions** | Retains rows where the column value is **less than** 1,000,000 (fails the "in millions" criteria). |
| **is_in_billions** | Retains rows where the column value is **less than** 1,000,000,000 (fails the "in billions" criteria). |
---
### Completeness & Uniqueness
| Test | Description |
| ---------------------- | --------------------------------------------------------------------------- |
| **is_complete** | Filters rows where the column value is null. |
| **are_complete** | Filters rows where **any** of the specified columns are null. |
| **is_unique** | Identifies rows with duplicate values in the specified column. |
| **are_unique** | Identifies rows with duplicate combinations of the specified columns. |
| **is_primary_key** | Alias for `is_unique` (checks uniqueness of a single column). |
| **is_composite_key** | Alias for `are_unique` (checks combined uniqueness of multiple columns). |
---
### Comparison & Range
| Test | Description |
| -------------------------------- | --------------------------------------------------------------------------------------- |
| **is_equal** | Filters rows where the column is not equal to the provided value (null-safe). |
| **is_equal_than** | Alias for `is_equal`. |
| **is_between** | Filters rows where the column value is **outside** the numeric range `[min, max]`. |
| **is_greater_than** | Filters rows where the column value is **≤** the threshold (fails "greater than"). |
| **is_greater_or_equal_than** | Filters rows where the column value is **<** the threshold (fails "greater or equal"). |
| **is_less_than** | Filters rows where the column value is **≥** the threshold (fails "less than"). |
| **is_less_or_equal_than** | Filters rows where the column value is **>** the threshold (fails "less or equal"). |
| **is_positive** | Filters rows where the column value is **< 0** (fails "positive"). |
| **is_negative** | Filters rows where the column value is **≥ 0** (fails "negative"). |
---
### Membership & Pattern
| Test | Description |
| ---------------------- | ----------------------------------------------------------------------------------------- |
| **is_contained_in** | Filters rows where the column value is **not** in the provided list. |
| **not_contained_in** | Filters rows where the column value **is** in the provided list. |
| **has_pattern** | Filters rows where the column value does **not** match the specified regex. |
| **is_legit** | Filters rows where the column value is null or contains whitespace (i.e., not `\S+`). |
---
### Aggregate checks
| Test | Description |
| -------------------- | ---------------------------------------------------------------------------------------------------------------- |
| **has_min** | Returns all rows if the column's minimum value **causes failure** (value < threshold); otherwise returns empty. |
| **has_max** | Returns all rows if the column's maximum value **causes failure** (value > threshold); otherwise returns empty. |
| **has_sum** | Returns all rows if the column's sum **causes failure** (sum > threshold); otherwise returns empty. |
| **has_mean** | Returns all rows if the column's mean **causes failure** (mean > threshold); otherwise returns empty. |
| **has_std** | Returns all rows if the column's standard deviation **causes failure** (std > threshold); otherwise returns empty.|
| **has_cardinality** | Returns all rows if the number of distinct values **causes failure** (count > threshold); otherwise returns empty.|
| **has_infogain** | Same logic as `has_cardinality` (proxy for information gain). |
| **has_entropy** | Same logic as `has_cardinality` (proxy for entropy). |
---
### SQL & Schema
| Test | Description |
| -------------------- | --------------------------------------------------------------------------------------------------------------- |
| **satisfies** | Filters rows where the SQL expression (based on `rule["value"]`) is **not** satisfied. |
| **validate_schema** | Compares the DataFrame's actual schema against the expected one and returns a match flag + error list. |
| **validate** | Executes a list of named rules and returns two DataFrames: one with aggregated status and one with raw violations. |
---
### Date-related checks
| Test | Description |
| -------------------------- | ----------------------------------------------------------------------------------------------- |
| **is_t_minus_1** | Retains rows where the date in the column is **not** equal to yesterday (T–1). |
| **is_t_minus_2** | Retains rows where the date in the column is **not** equal to two days ago (T–2). |
| **is_t_minus_3** | Retains rows where the date in the column is **not** equal to three days ago (T–3). |
| **is_today** | Retains rows where the date in the column is **not** equal to today. |
| **is_yesterday** | Retains rows where the date in the column is **not** equal to yesterday. |
| **is_on_weekday**          | Retains rows where the date in the column falls on a **weekend** (fails "weekday").              |
| **is_on_weekend**          | Retains rows where the date in the column falls on a **weekday** (fails "weekend").              |
| **is_on_monday** | Retains rows where the date in the column is **not** Monday. |
| **is_on_tuesday** | Retains rows where the date in the column is **not** Tuesday. |
| **is_on_wednesday** | Retains rows where the date in the column is **not** Wednesday. |
| **is_on_thursday** | Retains rows where the date in the column is **not** Thursday. |
| **is_on_friday** | Retains rows where the date in the column is **not** Friday. |
| **is_on_saturday** | Retains rows where the date in the column is **not** Saturday. |
| **is_on_sunday** | Retains rows where the date in the column is **not** Sunday. |
| **validate_date_format** | Filters rows where the date doesn't match the expected format or is null. |
| **is_future_date** | Filters rows where the date in the column is **not** after today. |
| **is_past_date** | Filters rows where the date in the column is **not** before today. |
| **is_date_after**          | Filters rows where the date in the column is **not** after the date provided in the rule.        |
| **is_date_before**         | Filters rows where the date in the column is **not** before the date provided in the rule.       |
| **is_date_between**        | Filters rows where the date in the column is **outside** the range `[start, end]`.               |
| **all_date_checks** | Alias for `is_past_date` (same logic: date before today). |
## Schema Validation
Sumeh allows you to validate your DataFrame schemas against a schema registry stored in various data sources (BigQuery, MySQL, PostgreSQL, DuckDB, Databricks, Glue, CSV, S3).
### Step 1: Store Your Schema Registry
First, create a `schema_registry` table in your data source with the following structure:
| Column | Type | Description |
|--------|------|-------------|
| id | int | Auto-increment ID |
| environment | string | Environment (e.g., 'prod', 'staging', 'dev') |
| source_type | string | Source type (e.g., 'bigquery', 'mysql') |
| database_name | string | Database/project name |
| catalog_name | string | Catalog name (for Databricks) |
| schema_name | string | Schema name (for PostgreSQL) |
| table_name | string | Table name |
| field | string | Column name |
| data_type | string | Data type |
| nullable | boolean | Whether column can be null |
| max_length | int | Maximum length for strings |
| comment | string | Description/comment |
| created_at | datetime | Creation timestamp |
| updated_at | datetime | Last update timestamp |
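For reference, here is a sketch of creating and populating such a registry in DuckDB. The SQL types are assumptions mapped from the table above; adapt them to your own data source.

```python
# Sketch only: create a schema_registry table in DuckDB with the columns
# listed above. SQL types are assumptions; adapt to your data source.
import duckdb

conn = duckdb.connect("schema_registry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS schema_registry (
        id            INTEGER,      -- auto-increment handling is source-specific
        environment   VARCHAR,      -- 'prod', 'staging', 'dev'
        source_type   VARCHAR,
        database_name VARCHAR,
        catalog_name  VARCHAR,
        schema_name   VARCHAR,
        table_name    VARCHAR,
        field         VARCHAR,
        data_type     VARCHAR,
        nullable      BOOLEAN,
        max_length    INTEGER,
        "comment"     VARCHAR,
        created_at    TIMESTAMP,
        updated_at    TIMESTAMP
    )
""")

# One row per expected column of the target table
conn.execute("""
    INSERT INTO schema_registry
        (id, environment, source_type, database_name, catalog_name, schema_name,
         table_name, field, data_type, nullable, max_length, "comment", created_at, updated_at)
    VALUES
        (1, 'prod', 'duckdb', 'mydb', NULL, NULL, 'users', 'id',    'int64',  FALSE, NULL, 'primary key', now(), now()),
        (2, 'prod', 'duckdb', 'mydb', NULL, NULL, 'users', 'email', 'object', TRUE,  255,  'user e-mail', now(), now())
""")
```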
### Step 2: Get Schema Configuration
Use `get_schema_config()` to retrieve the expected schema from your registry:
#### BigQuery
```python
from sumeh.services.config import get_schema_config
schema = get_schema_config(
    source="bigquery",
    project_id="my-project",
    dataset_id="my-dataset",
    table_id="users",
    environment="prod"  # optional, defaults to 'prod'
)
```
#### MySQL
```python
# Option 1: Create connection internally
schema = get_schema_config(
    source="mysql",
    host="localhost",
    user="root",
    password="secret",
    database="mydb",
    table="users",
    environment="prod"
)

# Option 2: Reuse existing connection
import mysql.connector
conn = mysql.connector.connect(host="localhost", user="root", password="secret", database="mydb")

schema = get_schema_config(
    source="mysql",
    conn=conn,
    table="users",
    environment="prod"
)
```
#### PostgreSQL
```python
# Option 1: Create connection internally
schema = get_schema_config(
    source="postgresql",
    host="localhost",
    user="postgres",
    password="secret",
    database="mydb",
    schema="public",
    table="users",
    environment="prod"
)

# Option 2: Reuse existing connection
import psycopg2
conn = psycopg2.connect(host="localhost", user="postgres", password="secret", dbname="mydb")

schema = get_schema_config(
    source="postgresql",
    conn=conn,
    schema="public",
    table="users",
    environment="prod"
)
```
#### DuckDB
```python
import duckdb
conn = duckdb.connect("my_database.db")
schema = get_schema_config(
    source="duckdb",
    conn=conn,
    table="users",
    environment="prod"
)
```
#### Databricks
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
schema = get_schema_config(
    source="databricks",
    spark=spark,
    catalog="main",
    schema="default",
    table="users",
    environment="prod"
)
```
#### AWS Glue
```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
schema = get_schema_config(
    source="glue",
    glue_context=glueContext,
    database_name="my_database",
    table_name="users",
    environment="prod"
)
```
#### CSV
```python
schema = get_schema_config(
    source="schema_registry.csv",
    table="users",
    environment="prod"
)
```
#### S3
```python
schema = get_schema_config(
    source="s3://my-bucket/path/schema_registry.csv",
    table="users",
    environment="prod"
)
```
### Step 3: Validate DataFrame Schema
Once you have the expected schema, validate your DataFrame against it:
```python
from sumeh.core import validate_schema
# Load your DataFrame (example with pandas)
import pandas as pd
df = pd.read_csv("users.csv")
# Validate
is_valid, errors = validate_schema(
    df_or_conn=df,
    expected=schema
)

if is_valid:
    print("✅ Schema is valid!")
else:
    print("❌ Schema validation failed:")
    for field, error in errors:
        print(f"  - {field}: {error}")
```
#### Example Output
```
❌ Schema validation failed:
  - email: missing
  - age: type mismatch (got 'object', expected 'int64')
  - created_at: nullable but expected non-nullable
  - extra_field: extra column
```
### Advanced: Custom Filters
You can add custom WHERE clauses to filter the schema registry:
```python
schema = get_schema_config(
    source="bigquery",
    project_id="my-project",
    dataset_id="my-dataset",
    table_id="users",
    environment="prod",
    query="source_type = 'bigquery' AND catalog_name IS NOT NULL"
)
```
**Note:** The `query` parameter adds additional filters to the base filter (`table_name` and `environment`).
### Supported Engines
Schema validation works with all supported DataFrame engines:
- Dask
- DuckDB
- Pandas
- Polars
- PySpark
**Important:** Make sure the `data_type` values in your `schema_registry` match the exact format returned by your DataFrame engine (e.g., `int64` for pandas, `string` for PySpark). Comparisons are case-insensitive.
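To see exactly which type names your engine reports, inspect the DataFrame directly (a small sketch for pandas and Polars):

```python
# Print the type names your engine reports so the registry's data_type
# values can match them (comparison is case-insensitive).
import pandas as pd
import polars as pl

pdf = pd.read_csv("users.csv")
print(pdf.dtypes)    # e.g. int64, object, datetime64[ns]

pldf = pl.read_csv("users.csv")
print(pldf.schema)   # e.g. Int64, String, Datetime
```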
## 📂 Project Layout
```
sumeh/
├── poetry.lock
├── pyproject.toml
├── README.md
└── sumeh
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── engine
    │   ├── __init__.py
    │   ├── bigquery_engine.py
    │   ├── dask_engine.py
    │   ├── duckdb_engine.py
    │   ├── pandas_engine.py
    │   ├── polars_engine.py
    │   └── pyspark_engine.py
    └── services
        ├── __init__.py
        ├── config.py
        ├── index.html
        └── utils.py
```
## 📈 Roadmap
- [ ] Complete BigQuery engine implementation
- ✅ Complete Pandas engine implementation
- ✅ Enhanced documentation
- ✅ More validation rule types
- [ ] Performance optimizations
## 🤝 Contributing
1. Fork & create a feature branch
2. Implement new checks or engines, following existing signatures
3. Add tests under `tests/`
4. Open a PR and ensure CI passes
## 📜 License
Licensed under the [Apache License 2.0](LICENSE).