# HDF DQ Framework
A data quality framework for PySpark DataFrames that applies Great Expectations validation rules. It is designed for the HDF Data Pipeline ecosystem.
## Overview
The DQ Framework provides a simple and efficient way to filter DataFrames based on data quality rules. It separates qualified data from bad data, allowing you to handle data quality issues systematically in your data pipelines.
### Key Features
- **Easy Integration**: Simple API that works with existing PySpark workflows
- **Great Expectations**: Leverages the power of Great Expectations for data validation
- **Flexible Rules**: Support for JSON string, dictionary, or list-based rule configuration
- **Dual Output**: Returns both qualified and bad rows as separate DataFrames
- **Detailed Validation**: Optional validation details for debugging and monitoring
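The rule formats listed under **Flexible Rules** can be illustrated with the standard library alone. This is a minimal sketch: the dictionary shape is taken from the examples in this README, but the exact JSON layout the framework accepts (here, a serialized list of rule objects) is an assumption.

```python
import json

# A single rule as a dictionary (shape taken from this README's examples).
rule = {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {"column": "name"},
}

# The same rule wrapped in a list, the form used throughout this README.
rules_as_list = [rule]

# Assumption: the JSON-string form is simply the serialized list.
rules_as_json = json.dumps(rules_as_list)

# Round-tripping shows that both forms carry the same information.
assert json.loads(rules_as_json) == rules_as_list
```

Keeping rules as JSON strings is convenient when they live in a config file or a database column rather than in code.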
## Quick Start
```python
from pyspark.sql import SparkSession
from dq_framework import DQFramework

# Initialize Spark session
spark = SparkSession.builder.appName("DQ_Example").getOrCreate()

# Create sample data
data = [
    (1, "John", 25, "john@email.com"),
    (2, "Jane", -5, "invalid-email"),  # Bad data: negative age, invalid email
    (3, "Bob", 30, "bob@email.com"),
    (4, None, 35, "alice@email.com"),  # Bad data: null name
]
columns = ["id", "name", "age", "email"]
df = spark.createDataFrame(data, columns)

# Define quality rules
quality_rules = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "age", "min_value": 0, "max_value": 120}
    },
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {"column": "email", "regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"}
    }
]

# Initialize DQ Framework
dq = DQFramework()

# Filter data
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    include_validation_details=True
)

# Show results
print("Qualified Data:")
qualified_df.show()

print("Bad Data:")
bad_df.show()
```
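To make the intended split concrete without a Spark session, the three Quick Start rules can be re-expressed as plain Python predicates over the same sample rows. This is only an illustrative sketch: `failed_checks` and the check names are hypothetical, and nothing here reflects how the framework evaluates rules internally.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Same sample rows as the Quick Start: (id, name, age, email).
rows = [
    (1, "John", 25, "john@email.com"),
    (2, "Jane", -5, "invalid-email"),
    (3, "Bob", 30, "bob@email.com"),
    (4, None, 35, "alice@email.com"),
]


def failed_checks(row):
    """Return the names of the Quick Start rules this row violates."""
    _id, name, age, email = row
    failures = []
    if name is None:
        failures.append("name_not_null")
    if not 0 <= age <= 120:
        failures.append("age_between_0_and_120")
    if EMAIL_RE.search(email) is None:
        failures.append("email_matches_regex")
    return failures


# Rows with no failures are "qualified"; the rest are "bad".
qualified = [row for row in rows if not failed_checks(row)]
bad = [(row, failed_checks(row)) for row in rows if failed_checks(row)]

print("qualified ids:", [row[0] for row in qualified])
for row, failures in bad:
    print("bad id", row[0], "->", failures)
```

Under these predicates, ids 1 and 3 pass every check, id 2 fails the age and email checks, and id 4 fails the null check, which matches the inline comments on the sample data.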
## API Reference
### DQFramework
The main class for data quality processing.
#### Methods
- **`filter_dataframe(dataframe, quality_rules, columns=None, include_validation_details=False)`**
  - Filters a DataFrame based on quality rules
  - Returns a tuple of `(qualified_df, bad_df)`
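Since `quality_rules` is plain data, it can be sanity-checked before it is handed to `filter_dataframe`. The helper below is a hypothetical pre-flight check, not part of the framework: it only verifies the two keys (`expectation_type` and `kwargs`) that every rule in this README uses.

```python
def check_rule_shape(rules):
    """Raise ValueError if any rule lacks the keys used in this README."""
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            raise ValueError(f"rule {index} is not a dictionary")
        missing = {"expectation_type", "kwargs"} - rule.keys()
        if missing:
            raise ValueError(f"rule {index} is missing keys: {sorted(missing)}")
        if not isinstance(rule["kwargs"], dict):
            raise ValueError(f"rule {index}: 'kwargs' must be a dictionary")
    return rules


# Usage: validate first, then pass the same list on to the framework.
rules = check_rule_shape([
    {
        "expectation_type": "expect_column_to_exist",
        "kwargs": {"column": "customer_id"},
    }
])
```

Failing early with a clear message is usually cheaper than discovering a malformed rule in the middle of a Spark job.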
### RuleProcessor
Handles the processing of Great Expectations rules.
## Dependencies
### Core Dependencies
- **PySpark** ^3.0.0: For DataFrame operations
- **Great Expectations** ^0.15.0: For validation logic
- **typing-extensions** ^4.0.0: For enhanced type hints
## Supported Expectations
The DQ Framework supports a comprehensive set of Great Expectations validation rules. Below are all available expectations organized by category, with examples and descriptions.
### 1. Basic Column Existence and Null Checks
#### `expect_column_to_exist`
Validates that a specified column exists in the DataFrame.
```python
{
"expectation_type": "expect_column_to_exist",
"kwargs": {"column": "customer_id"}
}
```
#### `expect_column_values_to_not_be_null`
Validates that column values are not null.
```python
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {"column": "email"}
}
```
#### `expect_column_values_to_be_null`
Validates that column values are null (useful for optional fields in specific contexts).
```python
{
"expectation_type": "expect_column_values_to_be_null",
"kwargs": {"column": "middle_name"}
}
```
### 2. Uniqueness Expectations
#### `expect_column_values_to_be_unique`
Validates that all values in a column are unique.
```python
{
"expectation_type": "expect_column_values_to_be_unique",
"kwargs": {"column": "user_id"}
}
```
#### `expect_compound_columns_to_be_unique`
Validates that combinations of multiple column values are unique.
```python
{
"expectation_type": "expect_compound_columns_to_be_unique",
"kwargs": {"column_list": ["user_id", "transaction_date", "amount"]}
}
```
#### `expect_select_column_values_to_be_unique_within_record`
Validates that values are unique within each record across specified columns.
```python
{
"expectation_type": "expect_select_column_values_to_be_unique_within_record",
"kwargs": {"column_list": ["phone_home", "phone_work", "phone_mobile"]}
}
```
#### `expect_multicolumn_values_to_be_unique`
Validates that combinations of multiple column values are unique (alias for `expect_compound_columns_to_be_unique`).
```python
{
"expectation_type": "expect_multicolumn_values_to_be_unique",
"kwargs": {"column_list": ["order_id", "product_id"]}
}
```
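The difference between uniqueness across rows (compound columns) and uniqueness within a single record is easy to confuse. The sketch below shows both on plain dictionaries; the helper names and sample records are hypothetical, and the code illustrates the semantics described above rather than the framework's implementation.

```python
from collections import Counter

records = [
    {"order_id": 1, "product_id": "A", "phone_home": "555-0100", "phone_work": "555-0101"},
    {"order_id": 1, "product_id": "B", "phone_home": "555-0102", "phone_work": "555-0102"},
    {"order_id": 1, "product_id": "A", "phone_home": "555-0103", "phone_work": "555-0104"},
]


def compound_duplicates(rows, column_list):
    """Rows whose combination of values in column_list occurs more than once."""
    keys = [tuple(row[c] for c in column_list) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] > 1]


def repeats_within_record(rows, column_list):
    """Rows where two or more of the listed columns hold the same value."""
    return [
        row for row in rows
        if len({row[c] for c in column_list}) < len(column_list)
    ]


# Compares rows with each other: the first and third share (1, "A").
across_rows = compound_duplicates(records, ["order_id", "product_id"])

# Compares columns inside one row: the second row repeats a phone number.
within_row = repeats_within_record(records, ["phone_home", "phone_work"])
```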
### 3. Range and Value Expectations
#### `expect_column_values_to_be_between`
Validates that column values are within a specified numeric range. As the last example below shows, `max_value` can be omitted to leave the range unbounded above.
```python
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {"column": "age", "min_value": 0, "max_value": 120}
}
# Age must be between 18 and 65
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {"column": "age", "min_value": 18, "max_value": 65}
}
# Price must be at least 0 (no maximum)
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {"column": "price", "min_value": 0}
}
```
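A range check with an optional bound can be sketched as follows. The helper `is_between` is hypothetical, and treating both bounds as inclusive is an assumption made for this illustration.

```python
def is_between(value, min_value=None, max_value=None):
    """Range check where a bound of None means 'unbounded on that side'."""
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


# Age rule from above: both bounds set (assumed inclusive).
age_ok = is_between(120, min_value=0, max_value=120)

# Price rule from above: only a minimum, so any large price passes.
price_ok = is_between(1_000_000, min_value=0)
negative_price_ok = is_between(-1, min_value=0)
```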
#### `expect_column_values_to_be_in_set`
Validates that column values are within a specified set of allowed values.
```python
{
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "status",
"value_set": ["active", "inactive", "suspended", "pending"]
}
}
# Gender validation
{
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "gender",
"value_set": ["M", "F", "Other", "Prefer not to say"]
}
}
```
#### `expect_column_values_to_not_be_in_set`
Validates that column values are NOT in a specified set of disallowed values.
```python
{
"expectation_type": "expect_column_values_to_not_be_in_set",
"kwargs": {
"column": "username",
"value_set": ["admin", "root", "test", "guest"]
}
}
```
#### `expect_column_distinct_values_to_be_in_set`
Validates that all distinct values in a column are within a specified set.
```python
{
"expectation_type": "expect_column_distinct_values_to_be_in_set",
"kwargs": {
"column": "department",
"value_set": ["HR", "Finance", "Engineering", "Marketing", "Sales"]
}
}
```
#### `expect_column_distinct_values_to_contain_set`
Validates that the column contains all values from a specified set.
```python
{
"expectation_type": "expect_column_distinct_values_to_contain_set",
"kwargs": {
"column": "required_skills",
"value_set": ["Python", "SQL"]
}
}
```
#### `expect_column_distinct_values_to_equal_set`
Validates that distinct values in a column exactly match a specified set.
```python
{
"expectation_type": "expect_column_distinct_values_to_equal_set",
"kwargs": {
"column": "grade",
"value_set": ["A", "B", "C", "D", "F"]
}
}
```
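The three `distinct_values` expectations correspond to three different set relationships. A minimal sketch with Python sets, using hypothetical observed values:

```python
# Distinct values actually present in the column (hypothetical data).
observed = {"HR", "Finance", "Engineering"}

allowed = {"HR", "Finance", "Engineering", "Marketing", "Sales"}
required = {"HR", "Finance"}

# ..._to_be_in_set: every observed value is allowed (observed is a subset).
in_set = observed <= allowed

# ..._to_contain_set: every required value is observed (observed is a superset).
contains_set = required <= observed

# ..._to_equal_set: observed and expected values are exactly the same.
equals_set = observed == allowed
```

Here the first two relationships hold, while the third does not because "Marketing" and "Sales" never appear in the column.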
### 4. Pattern Matching Expectations
#### `expect_column_values_to_match_regex`
Validates that column values match a specified regular expression pattern.
```python
# Email validation
{
"expectation_type": "expect_column_values_to_match_regex",
"kwargs": {
"column": "email",
"regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
}
}
# Phone number validation (US format)
{
"expectation_type": "expect_column_values_to_match_regex",
"kwargs": {
"column": "phone",
"regex": r"^\(\d{3}\) \d{3}-\d{4}$"
}
}
# Product code validation (3 letters + 4 digits)
{
"expectation_type": "expect_column_values_to_match_regex",
"kwargs": {
"column": "product_code",
"regex": r"^[A-Z]{3}\d{4}$"
}
}
```
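Before a pattern goes into a rule, it can be tried against a few sample strings with Python's `re` module. This sketch reuses the three patterns above; the sample values and the `matches` helper are hypothetical. The engine that evaluates the rule inside Spark may differ from `re` in edge cases, so treat this as a quick local check only.

```python
import re

patterns = {
    "email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "phone": r"^\(\d{3}\) \d{3}-\d{4}$",
    "product_code": r"^[A-Z]{3}\d{4}$",
}


def matches(column, value):
    """True if value satisfies the pattern registered for column."""
    return re.search(patterns[column], value) is not None


samples = {
    "email": ["john@email.com", "invalid-email"],
    "phone": ["(555) 123-4567", "555-123-4567"],
    "product_code": ["ABC1234", "abc1234"],
}

for column, values in samples.items():
    for value in values:
        print(f"{column:13} {value!r:18} {matches(column, value)}")
```

Because every pattern is anchored with `^` and `$`, the whole value must match; a valid fragment inside a longer string is not enough.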
#### `expect_column_values_to_not_match_regex`
Validates that column values do NOT match a specified regular expression pattern.
```python
# Ensure no special characters in username
{
"expectation_type": "expect_column_values_to_not_match_regex",
"kwargs": {
"column": "username",
"regex": r"[!@#$%^&*()+=\[\]{};':\"\\|,.<>/?]"
}
}
```
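A negative pattern is worth testing locally too, because a character class can exclude more than intended. The sketch below applies the username pattern above with Python's `re`; `username_is_clean` is a hypothetical helper.

```python
import re

# Same character class as the rule above.
SPECIAL_CHARS = re.compile(r"[!@#$%^&*()+=\[\]{};':\"\\|,.<>/?]")


def username_is_clean(username):
    """True if the username contains none of the listed special characters."""
    return SPECIAL_CHARS.search(username) is None


print(username_is_clean("john_doe"))  # underscore is not in the class
print(username_is_clean("john.doe"))  # '.' is in the class, so this fails
print(username_is_clean("admin!"))
```

Note that the class includes `.`, so dotted usernames such as `john.doe` are rejected by this rule.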
#### `expect_column_values_to_match_strftime_format`
Validates that column values match a specific date/time format.
```python
# Date in YYYY-MM-DD format
{
"expectation_type": "expect_column_values_to_match_strftime_format",
"kwargs": {
"column": "birth_date",
"strftime_format": "%Y-%m-%d"
}
}
# Timestamp in YYYY-MM-DD HH:MM:SS format
{
"expectation_type": "expect_column_values_to_match_strftime_format",
"kwargs": {
"column": "created_at",
"strftime_format": "%Y-%m-%d %H:%M:%S"
}
}
```
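A format string can be checked locally with `datetime.strptime`, which uses the same directive syntax. This is a sketch of the idea only; `matches_strftime` is a hypothetical helper, and whether the framework accepts exactly the same values is an assumption.

```python
from datetime import datetime


def matches_strftime(value, strftime_format):
    """True if value can be parsed with the given strftime/strptime format."""
    try:
        datetime.strptime(value, strftime_format)
    except (TypeError, ValueError):
        return False
    return True


print(matches_strftime("2023-01-15", "%Y-%m-%d"))
print(matches_strftime("2023-02-30", "%Y-%m-%d"))  # not a real calendar date
print(matches_strftime("invalid-date", "%Y-%m-%d"))
print(matches_strftime("2023-01-15 08:30:00", "%Y-%m-%d %H:%M:%S"))
```

Parsing checks more than the shape of the string: a well-formed but impossible date such as February 30 is rejected as well.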
### 5. String Length Expectations
#### `expect_column_value_lengths_to_be_between`
Validates that string column value lengths are within a specified range.
```python
# Password length validation
{
"expectation_type": "expect_column_value_lengths_to_be_between",
"kwargs": {"column": "password", "min_value": 8, "max_value": 128}
}
# Comment length validation
{
"expectation_type": "expect_column_value_lengths_to_be_between",
"kwargs": {"column": "comment", "min_value": 1, "max_value": 500}
}
```
#### `expect_column_value_lengths_to_equal`
Validates that string column value lengths equal a specific value.
```python
# Country code (ISO 3166-1 alpha-2)
{
"expectation_type": "expect_column_value_lengths_to_equal",
"kwargs": {"column": "country_code", "value": 2}
}
# SSN format (XXX-XX-XXXX = 11 characters)
{
"expectation_type": "expect_column_value_lengths_to_equal",
"kwargs": {"column": "ssn", "value": 11}
}
```
### 6. Type Expectations
#### `expect_column_values_to_be_of_type`
Validates that column values are of a specified data type.
```python
{
"expectation_type": "expect_column_values_to_be_of_type",
"kwargs": {"column": "age", "type_": "int"}
}
{
"expectation_type": "expect_column_values_to_be_of_type",
"kwargs": {"column": "price", "type_": "float"}
}
{
"expectation_type": "expect_column_values_to_be_of_type",
"kwargs": {"column": "name", "type_": "string"}
}
```
#### `expect_column_values_to_be_in_type_list`
Validates that column values are of one of the specified data types.
```python
{
"expectation_type": "expect_column_values_to_be_in_type_list",
"kwargs": {
"column": "numeric_value",
"type_list": ["int", "float", "double"]
}
}
```
### 7. Date and Time Expectations
#### `expect_column_values_to_be_dateutil_parseable`
Validates that column values can be parsed as valid dates.
```python
{
"expectation_type": "expect_column_values_to_be_dateutil_parseable",
"kwargs": {"column": "event_date"}
}
```
### 8. JSON Expectations
#### `expect_column_values_to_be_json_parseable`
Validates that column values are valid JSON strings.
```python
{
"expectation_type": "expect_column_values_to_be_json_parseable",
"kwargs": {"column": "metadata"}
}
```
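"JSON parseable" can be read as "the standard parser accepts it". A minimal sketch with the `json` module follows; `is_json_parseable` is a hypothetical helper and not the framework's own check.

```python
import json


def is_json_parseable(value):
    """True if value is a string that json.loads accepts."""
    try:
        json.loads(value)
    except (TypeError, ValueError):
        return False
    return True


print(is_json_parseable('{"theme": "dark", "notifications": true}'))
print(is_json_parseable("{theme: dark}"))  # keys and strings must be quoted
print(is_json_parseable(None))
```

Bare scalars such as `"42"` are valid JSON documents too, so a parse check alone does not guarantee an object; that is what a schema rule is for.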
#### `expect_column_values_to_match_json_schema`
Validates that column values match a specified JSON schema.
```python
{
    "expectation_type": "expect_column_values_to_match_json_schema",
    "kwargs": {
        "column": "user_preferences",
        "json_schema": {
            "type": "object",
            "properties": {
                "theme": {"type": "string"},
                "notifications": {"type": "boolean"}
            }
        }
    }
}
```
### 9. Ordering Expectations
#### `expect_column_values_to_be_increasing`
Validates that column values are in increasing order.
```python
# Values must be increasing (non-strict)
{
"expectation_type": "expect_column_values_to_be_increasing",
"kwargs": {"column": "timestamp"}
}
# Values must be strictly increasing
{
"expectation_type": "expect_column_values_to_be_increasing",
"kwargs": {"column": "sequence_number", "strictly": True}
}
```
#### `expect_column_values_to_be_decreasing`
Validates that column values are in decreasing order.
```python
# Values must be decreasing (non-strict)
{
"expectation_type": "expect_column_values_to_be_decreasing",
"kwargs": {"column": "priority_score"}
}
# Values must be strictly decreasing
{
"expectation_type": "expect_column_values_to_be_decreasing",
"kwargs": {"column": "countdown", "strictly": True}
}
```
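The effect of `strictly` is easiest to see on a small list. The helpers below are hypothetical and compare each value with its successor; they assume the values are already in a meaningful order, which on a distributed DataFrame has to be established separately.

```python
def is_increasing(values, strictly=False):
    """True if each value is >= (or > when strictly=True) the one before it."""
    pairs = list(zip(values, values[1:]))
    if strictly:
        return all(a < b for a, b in pairs)
    return all(a <= b for a, b in pairs)


def is_decreasing(values, strictly=False):
    """True if each value is <= (or < when strictly=True) the one before it."""
    pairs = list(zip(values, values[1:]))
    if strictly:
        return all(a > b for a, b in pairs)
    return all(a >= b for a, b in pairs)


sequence = [1, 2, 2, 3]
print(is_increasing(sequence))                 # repeated 2 is allowed
print(is_increasing(sequence, strictly=True))  # repeated 2 is not allowed
```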
### 10. Statistical Expectations
#### `expect_column_mean_to_be_between`
Validates that the column mean is within a specified range.
```python
{
"expectation_type": "expect_column_mean_to_be_between",
"kwargs": {"column": "test_scores", "min_value": 70, "max_value": 90}
}
```
#### `expect_column_median_to_be_between`
Validates that the column median is within a specified range.
```python
{
"expectation_type": "expect_column_median_to_be_between",
"kwargs": {"column": "response_time", "min_value": 100, "max_value": 500}
}
```
#### `expect_column_stdev_to_be_between`
Validates that the column standard deviation is within a specified range.
```python
{
"expectation_type": "expect_column_stdev_to_be_between",
"kwargs": {"column": "measurements", "min_value": 0.5, "max_value": 2.0}
}
```
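When choosing bounds for the mean, median, and standard deviation rules, it helps to compute the statistics on a small sample first. The sketch below uses Python's `statistics` module on hypothetical scores. Whether the framework uses the sample or the population standard deviation is not stated in this README, so both are shown.

```python
import statistics

test_scores = [70, 80, 90]  # hypothetical sample

mean = statistics.mean(test_scores)
median = statistics.median(test_scores)
sample_stdev = statistics.stdev(test_scores)       # divides by n - 1
population_stdev = statistics.pstdev(test_scores)  # divides by n

# The mean rule above (min_value=70, max_value=90) would accept this sample.
mean_in_range = 70 <= mean <= 90

print(mean, median, sample_stdev, round(population_stdev, 3))
```

The two standard deviations differ noticeably on small samples, so bounds tuned against one definition may not hold for the other.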
#### `expect_column_unique_value_count_to_be_between`
Validates that the count of unique values is within a specified range.
```python
{
"expectation_type": "expect_column_unique_value_count_to_be_between",
"kwargs": {"column": "category", "min_value": 5, "max_value": 20}
}
```
#### `expect_column_proportion_of_unique_values_to_be_between`
Validates that the proportion of unique values is within a specified range.
```python
{
"expectation_type": "expect_column_proportion_of_unique_values_to_be_between",
"kwargs": {"column": "user_id", "min_value": 0.95, "max_value": 1.0}
}
```
#### `expect_column_most_common_value_to_be_in_set`
Validates that the most common value is within a specified set.
```python
{
"expectation_type": "expect_column_most_common_value_to_be_in_set",
"kwargs": {
"column": "preferred_language",
"value_set": ["English", "Spanish", "French"]
}
}
```
#### `expect_column_max_to_be_between`
Validates that the column maximum value is within a specified range.
```python
{
"expectation_type": "expect_column_max_to_be_between",
"kwargs": {"column": "temperature", "min_value": -50, "max_value": 50}
}
```
#### `expect_column_min_to_be_between`
Validates that the column minimum value is within a specified range.
```python
{
"expectation_type": "expect_column_min_to_be_between",
"kwargs": {"column": "price", "min_value": 0, "max_value": 10}
}
```
#### `expect_column_sum_to_be_between`
Validates that the column sum is within a specified range.
```python
{
"expectation_type": "expect_column_sum_to_be_between",
"kwargs": {"column": "order_amount", "min_value": 1000, "max_value": 100000}
}
```
#### `expect_column_quantile_values_to_be_between`
Validates that column quantile values are within specified ranges.
```python
{
    "expectation_type": "expect_column_quantile_values_to_be_between",
    "kwargs": {
        "column": "response_time",
        "quantile_ranges": {
            "quantiles": [0.25, 0.5, 0.75],
            "value_ranges": [[50, 100], [100, 200], [200, 400]]
        }
    }
}
```
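In `quantile_ranges`, each entry of `quantiles` is paired by position with one `[low, high]` entry of `value_ranges`. The sketch below makes that pairing explicit. The observed quantile values are hypothetical, `quantile_failures` is an illustrative helper, and inclusive bounds are an assumption.

```python
quantile_ranges = {
    "quantiles": [0.25, 0.5, 0.75],
    "value_ranges": [[50, 100], [100, 200], [200, 400]],
}

# Hypothetical quantile values measured on the column.
observed = {0.25: 80, 0.5: 150, 0.75: 450}


def quantile_failures(ranges, observed_values):
    """Return (quantile, value, (low, high)) for each value outside its range."""
    failures = []
    for quantile, (low, high) in zip(ranges["quantiles"], ranges["value_ranges"]):
        value = observed_values[quantile]
        if not low <= value <= high:
            failures.append((quantile, value, (low, high)))
    return failures


failures = quantile_failures(quantile_ranges, observed)
print(failures)
```

With these numbers only the 0.75 quantile falls outside its range, because 450 exceeds the upper bound of 400.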
### 11. Column Pair Expectations
#### `expect_column_pair_values_to_be_equal`
Validates that values in two columns are equal.
```python
{
"expectation_type": "expect_column_pair_values_to_be_equal",
"kwargs": {"column_A": "password", "column_B": "password_confirm"}
}
```
#### `expect_column_pair_values_A_to_be_greater_than_B`
Validates that values in column A are greater than values in column B.
```python
# End date must be after start date
{
"expectation_type": "expect_column_pair_values_A_to_be_greater_than_B",
"kwargs": {"column_A": "end_date", "column_B": "start_date"}
}
# Final score must be greater than or equal to initial score
{
"expectation_type": "expect_column_pair_values_A_to_be_greater_than_B",
"kwargs": {"column_A": "final_score", "column_B": "initial_score", "or_equal": True}
}
```
#### `expect_column_pair_values_to_be_in_set`
Validates that pairs of values are within a specified set of valid combinations.
```python
{
    "expectation_type": "expect_column_pair_values_to_be_in_set",
    "kwargs": {
        "column_A": "state",
        "column_B": "country",
        "value_pairs_set": [
            ["CA", "USA"],
            ["NY", "USA"],
            ["TX", "USA"],
            ["ON", "Canada"],
            ["BC", "Canada"]
        ]
    }
}
```
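A pair check validates the combination, not each column on its own. The sketch below shows the idea with a set of tuples; the helper and the sample values are hypothetical. The pairs are converted from lists to tuples because lists cannot be stored in a Python set.

```python
value_pairs_set = [
    ["CA", "USA"],
    ["NY", "USA"],
    ["TX", "USA"],
    ["ON", "Canada"],
    ["BC", "Canada"],
]

# Lists are unhashable, so convert each pair to a tuple for fast lookup.
VALID_PAIRS = {tuple(pair) for pair in value_pairs_set}


def pair_is_valid(state, country):
    """True if the (state, country) combination is explicitly allowed."""
    return (state, country) in VALID_PAIRS


print(pair_is_valid("CA", "USA"))
print(pair_is_valid("CA", "Canada"))  # both values are known, but not together
```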
### 12. Table-level Expectations
#### `expect_table_row_count_to_be_between`
Validates that the total number of rows in the table is within a specified range.
```python
{
"expectation_type": "expect_table_row_count_to_be_between",
"kwargs": {"min_value": 100, "max_value": 10000}
}
```
#### `expect_table_row_count_to_equal`
Validates that the total number of rows equals a specific value.
```python
{
"expectation_type": "expect_table_row_count_to_equal",
"kwargs": {"value": 1000}
}
```
#### `expect_table_column_count_to_be_between`
Validates that the number of columns is within a specified range.
```python
{
"expectation_type": "expect_table_column_count_to_be_between",
"kwargs": {"min_value": 5, "max_value": 50}
}
```
#### `expect_table_column_count_to_equal`
Validates that the number of columns equals a specific value.
```python
{
"expectation_type": "expect_table_column_count_to_equal",
"kwargs": {"value": 12}
}
```
#### `expect_table_columns_to_match_ordered_list`
Validates that table columns match an ordered list exactly.
```python
{
"expectation_type": "expect_table_columns_to_match_ordered_list",
"kwargs": {
"column_list": ["id", "name", "email", "created_at", "updated_at"]
}
}
```
#### `expect_table_columns_to_match_set`
Validates that table columns match a set (order doesn't matter).
```python
{
"expectation_type": "expect_table_columns_to_match_set",
"kwargs": {
"column_set": ["user_id", "product_id", "quantity", "price", "order_date"]
}
}
```
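The ordered-list and set variants differ only in whether column order matters. A minimal sketch with plain Python lists, where `actual_columns` is a hypothetical stand-in for a DataFrame's column names:

```python
expected = ["id", "name", "email", "created_at", "updated_at"]

# Hypothetical column list: same names, but two of them swapped.
actual_columns = ["id", "email", "name", "created_at", "updated_at"]

# Ordered comparison: position matters.
matches_ordered_list = actual_columns == expected

# Set comparison: only membership matters.
matches_set = set(actual_columns) == set(expected)

print(matches_ordered_list, matches_set)
```

Use the ordered variant when downstream code reads columns by position, and the set variant when it reads them by name.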
## Complete Example: E-commerce Order Validation
Here's a comprehensive example showing multiple expectations for validating e-commerce order data:
```python
from pyspark.sql import SparkSession
from dq_framework import DQFramework

# Initialize Spark session
spark = SparkSession.builder.appName("ECommerce_DQ").getOrCreate()

# Sample e-commerce order data
data = [
    (1, "ORD001", "user123", "PROD-A001", 2, 29.99, "2023-01-15", "completed"),
    (2, "ORD002", "user456", "PROD-B002", 1, 15.50, "2023-01-16", "pending"),
    (3, "ORD003", "user789", "PROD-C003", 0, 45.00, "2023-01-17", "cancelled"),  # Bad: quantity = 0
    (4, "ORD004", "", "PROD-D004", 1, -10.00, "2023-01-18", "processing"),  # Bad: empty user_id, negative price
    (5, "ORD005", "user123", "INVALID", 3, 75.25, "invalid-date", "unknown")  # Bad: invalid product code, date, status
]
columns = ["id", "order_id", "user_id", "product_code", "quantity", "price", "order_date", "status"]
df = spark.createDataFrame(data, columns)

# Comprehensive quality rules
quality_rules = [
    # Basic existence and null checks
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "order_id"}
    },
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "user_id"}
    },
    # Uniqueness
    {
        "expectation_type": "expect_column_values_to_be_unique",
        "kwargs": {"column": "order_id"}
    },
    # Range validation
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "quantity", "min_value": 1, "max_value": 100}
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "price", "min_value": 0.01, "max_value": 10000}
    },
    # Set validation
    {
        "expectation_type": "expect_column_values_to_be_in_set",
        "kwargs": {
            "column": "status",
            "value_set": ["pending", "processing", "completed", "cancelled"]
        }
    },
    # Pattern matching
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {
            "column": "product_code",
            "regex": r"^PROD-[A-Z]\d{3}$"
        }
    },
    {
        "expectation_type": "expect_column_values_to_match_strftime_format",
        "kwargs": {
            "column": "order_date",
            "strftime_format": "%Y-%m-%d"
        }
    },
    # String length
    {
        "expectation_type": "expect_column_value_lengths_to_be_between",
        "kwargs": {"column": "user_id", "min_value": 3, "max_value": 20}
    }
]

# Initialize DQ Framework and filter data
dq = DQFramework()
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    include_validation_details=True
)

print("Qualified Orders:")
qualified_df.show()

print("Bad Orders:")
bad_df.show()
```
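This example shows how null, uniqueness, range, set, pattern, and length checks combine in a single rule list. Per the inline comments on the sample data, orders 3, 4, and 5 each violate at least one rule, so they are the rows expected in `bad_df`.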
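To see which rule catches which order, the same nine checks can be written as plain Python predicates over the sample rows. This is an illustrative sketch only: the check names and `failed_checks` are hypothetical, and the framework's own evaluation (for example, how it treats nulls) may differ.

```python
import re
from collections import Counter
from datetime import datetime

orders = [
    (1, "ORD001", "user123", "PROD-A001", 2, 29.99, "2023-01-15", "completed"),
    (2, "ORD002", "user456", "PROD-B002", 1, 15.50, "2023-01-16", "pending"),
    (3, "ORD003", "user789", "PROD-C003", 0, 45.00, "2023-01-17", "cancelled"),
    (4, "ORD004", "", "PROD-D004", 1, -10.00, "2023-01-18", "processing"),
    (5, "ORD005", "user123", "INVALID", 3, 75.25, "invalid-date", "unknown"),
]

PRODUCT_RE = re.compile(r"^PROD-[A-Z]\d{3}$")
VALID_STATUS = {"pending", "processing", "completed", "cancelled"}
ORDER_ID_COUNTS = Counter(order[1] for order in orders)


def is_iso_date(value):
    """True if value parses with the %Y-%m-%d format."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def failed_checks(order):
    """Return the names of the checks this order violates, in rule order."""
    _id, order_id, user_id, product_code, quantity, price, order_date, status = order
    checks = {
        "order_id_not_null": order_id is not None,
        "user_id_not_null": user_id is not None,
        "order_id_unique": ORDER_ID_COUNTS[order_id] == 1,
        "quantity_1_to_100": 1 <= quantity <= 100,
        "price_0.01_to_10000": 0.01 <= price <= 10000,
        "status_in_set": status in VALID_STATUS,
        "product_code_regex": PRODUCT_RE.search(product_code) is not None,
        "order_date_format": is_iso_date(order_date),
        "user_id_length_3_to_20": 3 <= len(user_id) <= 20,
    }
    return [name for name, passed in checks.items() if not passed]


for order in orders:
    print(order[1], failed_checks(order) or "OK")
```

One detail worth noticing: the empty `user_id` in order 4 is not null, so in this sketch it passes the not-null check and is caught by the length check instead.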
## Package Metadata
- **Name**: hdf-dq-framework
- **Home page**: https://github.com/your-org/hdf-data-pipeline
- **Requires Python**: >=3.8, <4.0
- **Keywords**: data-quality, pyspark, great-expectations, dataframe, validation
- **Author**: HDF Data Pipeline Team (nengkhoiba.chungkham@iqvia.com)
- **Download URL**: https://files.pythonhosted.org/packages/d6/6f/48f59d0f890a1bc3f67157b074f70c3ad93ff67206a4e5ab5ecc52d20f45/hdf_dq_framework-0.4.0.tar.gz
\"value_ranges\": [[50, 100], [100, 200], [200, 400]]\n }\n }\n}\n```\n\n### 11. Column Pair Expectations\n\n#### `expect_column_pair_values_to_be_equal`\n\nValidates that values in two columns are equal.\n\n```python\n{\n \"expectation_type\": \"expect_column_pair_values_to_be_equal\",\n \"kwargs\": {\"column_A\": \"password\", \"column_B\": \"password_confirm\"}\n}\n```\n\n#### `expect_column_pair_values_A_to_be_greater_than_B`\n\nValidates that values in column A are greater than values in column B.\n\n```python\n# End date must be after start date\n{\n \"expectation_type\": \"expect_column_pair_values_A_to_be_greater_than_B\",\n \"kwargs\": {\"column_A\": \"end_date\", \"column_B\": \"start_date\"}\n}\n\n# Final score must be greater than or equal to initial score\n{\n \"expectation_type\": \"expect_column_pair_values_A_to_be_greater_than_B\",\n \"kwargs\": {\"column_A\": \"final_score\", \"column_B\": \"initial_score\", \"or_equal\": True}\n}\n```\n\n#### `expect_column_pair_values_to_be_in_set`\n\nValidates that pairs of values are within a specified set of valid combinations.\n\n```python\n{\n \"expectation_type\": \"expect_column_pair_values_to_be_in_set\",\n \"kwargs\": {\n \"column_A\": \"state\",\n \"column_B\": \"country\",\n \"value_pairs_set\": [\n [\"CA\", \"USA\"],\n [\"NY\", \"USA\"],\n [\"TX\", \"USA\"],\n [\"ON\", \"Canada\"],\n [\"BC\", \"Canada\"]\n ]\n }\n}\n```\n\n### 12. 
Table-level Expectations\n\n#### `expect_table_row_count_to_be_between`\n\nValidates that the total number of rows in the table is within a specified range.\n\n```python\n{\n \"expectation_type\": \"expect_table_row_count_to_be_between\",\n \"kwargs\": {\"min_value\": 100, \"max_value\": 10000}\n}\n```\n\n#### `expect_table_row_count_to_equal`\n\nValidates that the total number of rows equals a specific value.\n\n```python\n{\n \"expectation_type\": \"expect_table_row_count_to_equal\",\n \"kwargs\": {\"value\": 1000}\n}\n```\n\n#### `expect_table_column_count_to_be_between`\n\nValidates that the number of columns is within a specified range.\n\n```python\n{\n \"expectation_type\": \"expect_table_column_count_to_be_between\",\n \"kwargs\": {\"min_value\": 5, \"max_value\": 50}\n}\n```\n\n#### `expect_table_column_count_to_equal`\n\nValidates that the number of columns equals a specific value.\n\n```python\n{\n \"expectation_type\": \"expect_table_column_count_to_equal\",\n \"kwargs\": {\"value\": 12}\n}\n```\n\n#### `expect_table_columns_to_match_ordered_list`\n\nValidates that table columns match an ordered list exactly.\n\n```python\n{\n \"expectation_type\": \"expect_table_columns_to_match_ordered_list\",\n \"kwargs\": {\n \"column_list\": [\"id\", \"name\", \"email\", \"created_at\", \"updated_at\"]\n }\n}\n```\n\n#### `expect_table_columns_to_match_set`\n\nValidates that table columns match a set (order doesn't matter).\n\n```python\n{\n \"expectation_type\": \"expect_table_columns_to_match_set\",\n \"kwargs\": {\n \"column_set\": [\"user_id\", \"product_id\", \"quantity\", \"price\", \"order_date\"]\n }\n}\n```\n\n## Complete Example: E-commerce Order Validation\n\nHere's a comprehensive example showing multiple expectations for validating e-commerce order data:\n\n```python\nfrom pyspark.sql import SparkSession\nfrom dq_framework import DQFramework\n\n# Initialize Spark session\nspark = SparkSession.builder.appName(\"ECommerce_DQ\").getOrCreate()\n\n# Sample 
e-commerce order data\ndata = [\n (1, \"ORD001\", \"user123\", \"PROD-A001\", 2, 29.99, \"2023-01-15\", \"completed\"),\n (2, \"ORD002\", \"user456\", \"PROD-B002\", 1, 15.50, \"2023-01-16\", \"pending\"),\n (3, \"ORD003\", \"user789\", \"PROD-C003\", 0, 45.00, \"2023-01-17\", \"cancelled\"), # Bad: quantity = 0\n (4, \"ORD004\", \"\", \"PROD-D004\", 1, -10.00, \"2023-01-18\", \"processing\"), # Bad: empty user_id, negative price\n (5, \"ORD005\", \"user123\", \"INVALID\", 3, 75.25, \"invalid-date\", \"unknown\") # Bad: invalid product code, date, status\n]\n\ncolumns = [\"id\", \"order_id\", \"user_id\", \"product_code\", \"quantity\", \"price\", \"order_date\", \"status\"]\ndf = spark.createDataFrame(data, columns)\n\n# Comprehensive quality rules\nquality_rules = [\n # Basic existence and null checks\n {\n \"expectation_type\": \"expect_column_values_to_not_be_null\",\n \"kwargs\": {\"column\": \"order_id\"}\n },\n {\n \"expectation_type\": \"expect_column_values_to_not_be_null\",\n \"kwargs\": {\"column\": \"user_id\"}\n },\n\n # Uniqueness\n {\n \"expectation_type\": \"expect_column_values_to_be_unique\",\n \"kwargs\": {\"column\": \"order_id\"}\n },\n\n # Range validation\n {\n \"expectation_type\": \"expect_column_values_to_be_between\",\n \"kwargs\": {\"column\": \"quantity\", \"min_value\": 1, \"max_value\": 100}\n },\n {\n \"expectation_type\": \"expect_column_values_to_be_between\",\n \"kwargs\": {\"column\": \"price\", \"min_value\": 0.01, \"max_value\": 10000}\n },\n\n # Set validation\n {\n \"expectation_type\": \"expect_column_values_to_be_in_set\",\n \"kwargs\": {\n \"column\": \"status\",\n \"value_set\": [\"pending\", \"processing\", \"completed\", \"cancelled\"]\n }\n },\n\n # Pattern matching\n {\n \"expectation_type\": \"expect_column_values_to_match_regex\",\n \"kwargs\": {\n \"column\": \"product_code\",\n \"regex\": r\"^PROD-[A-Z]\\d{3}$\"\n }\n },\n {\n \"expectation_type\": \"expect_column_values_to_match_strftime_format\",\n \"kwargs\": 
{\n \"column\": \"order_date\",\n \"strftime_format\": \"%Y-%m-%d\"\n }\n },\n\n # String length\n {\n \"expectation_type\": \"expect_column_value_lengths_to_be_between\",\n \"kwargs\": {\"column\": \"user_id\", \"min_value\": 3, \"max_value\": 20}\n }\n]\n\n# Initialize DQ Framework and filter data\ndq = DQFramework()\nqualified_df, bad_df = dq.filter_dataframe(\n dataframe=df,\n quality_rules=quality_rules,\n include_validation_details=True\n)\n\nprint(\"Qualified Orders:\")\nqualified_df.show()\n\nprint(\"Bad Orders:\")\nbad_df.show()\n```\n\nThis example shows how several expectations combine to validate a realistic dataset. In this sample, ORD001 and ORD002 satisfy every rule and should land in `qualified_df`, while ORD003 (zero quantity), ORD004 (empty user_id, negative price), and ORD005 (invalid product code, date, and status) should land in `bad_df`.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "HDF Data Quality Framework for PySpark DataFrames using Great Expectations",
"version": "0.4.0",
"project_urls": {
"Documentation": "https://github.com/your-org/hdf-data-pipeline",
"Homepage": "https://github.com/your-org/hdf-data-pipeline",
"Repository": "https://github.com/your-org/hdf-data-pipeline"
},
"split_keywords": [
"data-quality",
" pyspark",
" great-expectations",
" dataframe",
" validation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bf096ae5bf93c51fea0ca522ce45807600b36b8d54f6f9573ff04a255bc79826",
"md5": "b1d097a341a5b0472c691329042466f7",
"sha256": "d5a5f66c3270f8427116fdb8b64d5d5841a1a13951ff3d3dfada750bd2dd23c6"
},
"downloads": -1,
"filename": "hdf_dq_framework-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b1d097a341a5b0472c691329042466f7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.8",
"size": 1083321,
"upload_time": "2025-08-19T05:09:01",
"upload_time_iso_8601": "2025-08-19T05:09:01.164470Z",
"url": "https://files.pythonhosted.org/packages/bf/09/6ae5bf93c51fea0ca522ce45807600b36b8d54f6f9573ff04a255bc79826/hdf_dq_framework-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d66f48f59d0f890a1bc3f67157b074f70c3ad93ff67206a4e5ab5ecc52d20f45",
"md5": "98f140da57ae14eadaa1be1497e5a117",
"sha256": "bd0f334b9d706dd1df0db57ce7d259b61974e148c3b88c25cf0772a82c418769"
},
"downloads": -1,
"filename": "hdf_dq_framework-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "98f140da57ae14eadaa1be1497e5a117",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.8",
"size": 708898,
"upload_time": "2025-08-19T05:09:04",
"upload_time_iso_8601": "2025-08-19T05:09:04.223710Z",
"url": "https://files.pythonhosted.org/packages/d6/6f/48f59d0f890a1bc3f67157b074f70c3ad93ff67206a4e5ab5ecc52d20f45/hdf_dq_framework-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-19 05:09:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "your-org",
"github_project": "hdf-data-pipeline",
"github_not_found": true,
"lcname": "hdf-dq-framework"
}