# etl-utilities

- **Name:** etl-utilities
- **Version:** 0.9.0
- **Summary:** This repository provides a collection of utility functions and classes for data cleaning, SQL query generation, and data analysis. The code is written in Python and uses libraries such as `pandas`, `numpy`, and `dateutil`.
- **Upload time:** 2024-10-21 23:11:01
- **Requires Python:** >=3.8
- **Keywords:** cleaning, database, dataframe, etl, etl-utilities, pandas, utilities
- **Source:** https://github.com/magicjedi90/etl_utilities
- **Issues:** https://github.com/magicjedi90/etl_utilities/issues
# Data Cleaning and SQL Query Generation Utilities

This repository provides a collection of utility functions and classes for data cleaning, SQL query generation, and data analysis. The code is written in Python and uses libraries such as `pandas`, `numpy`, and `dateutil`.

## Table of Contents


- [Usage](#usage)
  - [Parser Class](#parser-class)
  - [Cleaner Class](#cleaner-class)
  - [Analyzer Class](#analyzer-class)
  - [Creator Class](#creator-class)
  - [Updater Class](#updater-class)
  - [Validator Class](#validator-class)
- [Contributing](#contributing)
- [License](#license)


## Usage

### `Parser` Class

#### `Parser.parse_boolean(value)`

Parses boolean values from various possible string representations.

- **Parameters:** 
  - `value` (*any*): The value to be parsed.
- **Returns:** 
  - `True` or `False` or `None`
- **Raises:** 
  - `ValueError`: If the value is not a recognized boolean representation.

```python
Parser.parse_boolean('yes')   # True
Parser.parse_boolean('no')    # False
```

#### `Parser.parse_float(value)`

Parses a float value, stripping common formatting characters such as commas, dollar signs, and percent signs.

- **Parameters:** 
  - `value` (*any*): The value to be parsed.
- **Returns:** 
  - `float` or `None`
  
```python
Parser.parse_float('$1,234.56')    # 1234.56
Parser.parse_float('45%')          # 45.0
```

#### `Parser.parse_date(value)`

Parses a value into a `datetime` object.

- **Parameters:** 
  - `value` (*any*): The value to be parsed.
- **Returns:** 
  - `datetime` or `None`
  
```python
Parser.parse_date('2023-10-04')    # datetime object
```

#### `Parser.parse_integer(value)`

Parses integer values.

- **Parameters:** 
  - `value` (*any*): The value to be parsed.
- **Returns:** 
  - `int` or `None`
- **Raises:** 
  - `ValueError`: If the value cannot be converted to an integer.
  
```python
Parser.parse_integer('123')   # 123
```

### `Cleaner` Class

This class provides static methods to clean and manipulate pandas DataFrame columns.

#### `compute_hash(value)`

Computes a SHA-1 hash of the input value.

- **Parameters:** 
  - `value` (*any*): The value to be hashed.
- **Returns:** 
  - `str`: SHA-1 hash of the input value.
  
```python
compute_hash('test')    # 'a94a8fe5ccb19ba61c4c0873d391e987982fbbd3'
```

#### `standardize_column_name(name)`

Standardizes column names into snake_case.

- **Parameters:** 
  - `name` (*str*): The column name to be standardized.
- **Returns:** 
  - `str`: The standardized column name.
  
```python
standardize_column_name('Date of Birth')    # 'date_of_birth'
standardize_column_name('Employee ID#')     # 'employee_id_num'
```

#### `Cleaner.column_names_to_snake_case(df)`

Converts all column names in the DataFrame to snake_case.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame whose column names need to be converted.
- **Returns:** 
  - None
  
```python
Cleaner.column_names_to_snake_case(df)
```

#### `Cleaner.column_names_to_pascal_case(df)`

Converts all column names in the DataFrame to PascalCase.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame whose column names need to be converted.
- **Returns:** 
  - None
  
```python
Cleaner.column_names_to_pascal_case(df)
```

#### `Cleaner.clean_series(series, clean_function)`

Cleans a series using a provided function, and attempts to cast the series to the dtype implied by the clean function's return type.

- **Parameters:** 
  - `series` (*pd.Series*): The series to be cleaned.
  - `clean_function` (*function*): The function used to clean each value in the series.
- **Returns:** 
  - The cleaned series.
  
```python
Cleaner.clean_series(df['salary'], Parser.parse_float)
```

#### `Cleaner.clean_numbers(df)`

Cleans columns in the DataFrame that contain numerical values.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be cleaned.
- **Returns:** 
  - `pd.DataFrame`: The cleaned DataFrame.
  
```python
Cleaner.clean_numbers(df)
```

#### `Cleaner.clean_dates(df)`

Cleans columns in the DataFrame that contain date values.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be cleaned.
- **Returns:** 
  - `pd.DataFrame`: The cleaned DataFrame.
  
```python
Cleaner.clean_dates(df)
```

#### `Cleaner.clean_bools(df)`

Cleans columns in the DataFrame that contain boolean values.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be cleaned.
- **Returns:** 
  - `pd.DataFrame`: The cleaned DataFrame.
  
```python
Cleaner.clean_bools(df)
```

#### `Cleaner.clean_all_types(df)`

Attempts to clean all column types in the DataFrame.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be cleaned.
- **Returns:** 
  - `pd.DataFrame`: The cleaned DataFrame.

```python
Cleaner.clean_all_types(df)
```

#### `Cleaner.clean_df(df)`

Removes empty columns and rows and attempts to clean all column types in the DataFrame.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be cleaned.
- **Returns:** 
  - `pd.DataFrame`: The cleaned DataFrame.

```python
Cleaner.clean_df(df)
```

#### `Cleaner.generate_hash_column(df, columns_to_hash, new_column_name)`

Generates a new column that contains the SHA-1 hash of specified columns.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be processed.
  - `columns_to_hash` (*list*): The list of column names to hash.
  - `new_column_name` (*str*): The name of the new column.
- **Returns:** 
  - `pd.DataFrame`: The DataFrame with the new hash column.
  
```python
Cleaner.generate_hash_column(df, ['name', 'birth_date'], 'identity_hash')
```

#### `Cleaner.coalesce_columns(df, columns_to_coalesce, new_column_name, drop=False)`

Combines multiple columns into a single column by taking, for each row, the first non-null value across the listed columns.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be processed.
  - `columns_to_coalesce` (*list*): The list of columns to coalesce.
  - `new_column_name` (*str*): The name of the new coalesced column.
  - `drop` (*bool*): Whether to drop the original columns after coalescing.
- **Returns:** 
  - `pd.DataFrame`: The DataFrame with the coalesced column.
  
```python
Cleaner.coalesce_columns(df, ['phone_home', 'phone_mobile', 'phone_work'], 'phone', drop=True)
```

#### Example Usage

Here's an example of how you can use these utilities:

```python
import pandas as pd
# Import path assumed by analogy with the Analyzer import used later in this README
from etl.dataframe.cleaner import Cleaner

data = {
    'name': ['Alice', 'Bob'],
    'birth_date': ['1990-01-01', '1985-05-12'],
    'is_active': ['yes', 'no'],
    'salary': ['$1,200.50', '$2,500.75'],
}

df = pd.DataFrame(data)

# Clean the DataFrame
Cleaner.column_names_to_snake_case(df)
df = Cleaner.clean_dates(df)
df = Cleaner.clean_bools(df)
df = Cleaner.clean_numbers(df)

# Generate hash column
df = Cleaner.generate_hash_column(df, ['name', 'birth_date'], 'identity_hash')

print(df)
```

The above script converts column names to snake_case, cleanses date, boolean, and numeric values, and generates a hash column from the specified columns.

### `Analyzer` Class

Provides utilities for analyzing DataFrames.

#### `Analyzer.find_unique_columns(df)`

Finds columns that can serve as unique identifiers.

```python
import pandas as pd
from etl.dataframe.analyzer import Analyzer

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Alice']})
candidates = Analyzer.find_unique_columns(df)
print(candidates)  # Output: ['id']
```

#### `Analyzer.find_unique_column_pairs(df)`

Finds pairs of columns that can serve as unique identifiers.

```python
import pandas as pd
from etl.dataframe.analyzer import Analyzer

df = pd.DataFrame({'first': [1, 2, 2], 'second': [3, 3, 4]})
candidates = Analyzer.find_unique_column_pairs(df)
print(candidates)  # Output: [('first', 'second')]
```

#### `Analyzer.find_empty_columns(df)`

Finds columns in a DataFrame that contain only null values.

```python
import pandas as pd
from etl.dataframe.analyzer import Analyzer

df = pd.DataFrame({'first': [1, 2, 2], 'second': [None, None, None]})
candidates = Analyzer.find_empty_columns(df)
print(candidates)  # Output: ['second']
```

#### `Analyzer.find_categorical_columns(df, unique_threshold=1)`

Identifies and returns the list of columns in a Pandas DataFrame that can be considered categorical based on the proportion of unique values in each column.

**Parameters**

- **df** (`pd.DataFrame`): 
  The DataFrame to inspect for categorical columns.

- **unique_threshold** (`float`, optional, default=`1`): 
  The threshold ratio of unique values to total non-null values in the column. Columns with a unique value ratio less than or equal to this threshold are classified as categorical. The value must be between 0 and 1.

**Returns**

- **categorical_columns** (`list` of `str`): 
  A list of column names in the DataFrame that are classified as categorical.

**Raises**

- **ValueError**: 
  If `unique_threshold` is not between 0 and 1.

**Example**

```python
import pandas as pd
from etl.dataframe.analyzer import Analyzer

# Sample DataFrame
data = {
    'A': [1, 2, 2, 3, 4],
    'B': ['cat', 'dog', 'cat', 'dog', 'cat'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5]
}

df = pd.DataFrame(data)

# Find categorical columns with the default unique threshold of 1
categorical = Analyzer.find_categorical_columns(df)
print(categorical)  # Output: ['A', 'B', 'C']

# Find categorical columns with a unique threshold of 80 percent
categorical = Analyzer.find_categorical_columns(df, .8)
print(categorical)  # Output: ['A', 'B']

# Find categorical columns with a unique threshold of 40 percent
categorical = Analyzer.find_categorical_columns(df, .4)
print(categorical)  # Output: ['B']
```

**Notes**

- The function evaluates each column in the DataFrame to check if it has a proportion of unique values below or equal to the specified `unique_threshold`.
- It skips columns that have only null values.
- This function considers columns whose unique-value ratio is at most `unique_threshold` to be categorical. Adjust the threshold to the needs of your analysis.

Using this function can help identify columns that are good candidates for categorical analysis, such as encoding for machine learning tasks.

### `Creator` Class

Generates SQL queries for creating tables.

#### `Creator.create_mssql_table(df, schema, table, primary_key=None, unique_columns=None, history=False, varchar_padding=20, float_precision=10, decimal_places=2, generate_id=False)`

Generates a SQL CREATE TABLE statement based on a DataFrame.
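
A minimal usage sketch (the import path is an assumption, and the exact DDL emitted depends on the library's type mapping; only the signature above is from the docs):

```python
import pandas as pd
from etl.query.creator import Creator  # hypothetical import path

df = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob'],
    'salary': [1200.50, 2500.75],
})

# Build a CREATE TABLE statement for dbo.employees with id as the primary key
ddl = Creator.create_mssql_table(df, 'dbo', 'employees', primary_key='id')
print(ddl)
```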
### `Updater` Class

Generates SQL queries for inserting and merging data.

#### `Updater.merge_mssql(source_schema, source_table, target_schema, target_table, columns, id_column, delete_unmatched=True)`

Generates a SQL MERGE statement.

#### `Updater.upsert_mssql(source_schema, source_table, target_schema, target_table, columns, id_column)`

Generates a SQL UPSERT statement.

#### `Updater.append_mssql(source_schema, source_table, target_schema, target_table, columns)`

Generates a SQL INSERT statement that uses EXCEPT to avoid inserting duplicate rows.
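
A minimal usage sketch for the three generators (the import path is an assumption; only the signatures above are from the docs, and the comments describe the apparent intent of each statement):

```python
from etl.query.updater import Updater  # hypothetical import path

columns = ['id', 'name', 'salary']

# MERGE staging.employees into dbo.employees, matching rows on id
merge_sql = Updater.merge_mssql(
    'staging', 'employees', 'dbo', 'employees', columns, 'id', delete_unmatched=True
)

# Upsert: update matching rows, insert new ones
upsert_sql = Updater.upsert_mssql(
    'staging', 'employees', 'dbo', 'employees', columns, 'id'
)

# Append: insert rows from the source that the target does not already contain
append_sql = Updater.append_mssql('staging', 'employees', 'dbo', 'employees', columns)

print(merge_sql)
```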

### `Validator` Class

This class provides static methods to validate the structure and data of a DataFrame before uploading it to a database.

#### `validate_upload(connection, df: pd.DataFrame, schema: str, table: str)`

Validates the DataFrame against the specified database schema and table.

- **Parameters:**
  - `connection` (*sqlalchemy.engine.Connection*): Database connection.
  - `df` (*pd.DataFrame*): The DataFrame to be validated.
  - `schema` (*str*): The schema name.
  - `table` (*str*): The table name.
- **Returns:** 
  - None

```python
Validator.validate_upload(connection, df, 'public', 'my_table')
```

#### `_fetch_column_info(connection, df, schema, table)`

Fetches column information from the database for the specified schema and table.

- **Parameters:** 
  - `connection` (*sqlalchemy.engine.Connection*): Database connection.
  - `df` (*pd.DataFrame*): The DataFrame to be validated.
  - `schema` (*str*): The schema name.
  - `table` (*str*): The table name.
- **Returns:** 
  - `Tuple[List[str], pd.DataFrame]`: The list of DataFrame column names and a DataFrame of database column information.

#### `_check_extra_columns(df_columns, column_info_df, schema, table)`

Checks for extra columns in the DataFrame that are not in the database table schema.

- **Parameters:** 
  - `df_columns` (*List[str]*): List of DataFrame columns.
  - `column_info_df` (*pd.DataFrame*): DataFrame containing column information.
  - `schema` (*str*): The schema name.
  - `table` (*str*): The table name.
- **Raises:** 
  - `ExtraColumnsException`: If any extra columns are found.

#### `_validate_column_types(df, df_columns, column_info_df)`

Validates the data types of DataFrame columns against the database schema.

- **Parameters:** 
  - `df` (*pd.DataFrame*): The DataFrame to be validated.
  - `df_columns` (*List[str]*): List of DataFrame columns.
  - `column_info_df` (*pd.DataFrame*): DataFrame containing column information.
- **Raises:** 
  - `ColumnDataException`: If any column data type mismatches or truncation issues are found.

#### `_is_type_mismatch(df_column_data_type, db_column_data_type)`

Checks if there is a type mismatch between the DataFrame and database column data types.

- **Parameters:** 
  - `df_column_data_type` (*numpy.dtype*): Data type of the DataFrame column.
  - `db_column_data_type` (*str*): Data type of the database column.
- **Returns:** 
  - `bool`: `True` if there is a type mismatch, `False` otherwise.

#### `_check_numeric_truncation(column, df, db_column_info)`

Checks for numeric truncation issues.

- **Parameters:** 
  - `column` (*str*): The column name.
  - `df` (*pd.DataFrame*): The DataFrame containing the column.
  - `db_column_info` (*pd.Series*): Series containing information about the database column.
- **Returns:** 
  - `str` or `None`: Error message if truncation is detected, `None` otherwise.

#### `_check_string_or_date_truncation(column, df, db_column_info)`

Checks for string or date truncation issues.

- **Parameters:** 
  - `column` (*str*): The column name.
  - `df` (*pd.DataFrame*): The DataFrame containing the column.
  - `db_column_info` (*pd.Series*): Series containing information about the database column.
- **Returns:** 
  - `str` or `None`: Error message if truncation is detected, `None` otherwise.

#### Custom Exceptions

##### `ExtraColumnsException`

This exception is raised when extra columns are found in the DataFrame that are not present in the database table schema.

```python
class ExtraColumnsException(Exception):
    pass
```

##### `ColumnDataException`

This exception is raised when there are data type mismatches or truncation issues in the DataFrame columns.

```python
class ColumnDataException(Exception):
    pass
```

#### Example Usage

Here's an example of how you can use the `Validator` class to validate a DataFrame before uploading it to a database:

```python
import pandas as pd
from sqlalchemy import create_engine

# Import path assumed; adjust to where Validator and its exceptions live in this package
from etl.database.validator import Validator, ExtraColumnsException, ColumnDataException

data = {
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'join_date': ['2020-01-01', '2019-06-15'],
}

df = pd.DataFrame(data)

# Create a connection to the database
engine = create_engine('sqlite:///example.db')
connection = engine.connect()

# Validate upload
try:
    Validator.validate_upload(connection, df, 'public', 'employees')
    print('DataFrame is valid for upload.')
except (ExtraColumnsException, ColumnDataException) as e:
    print(f'Validation failed: {e}')

connection.close()
```

The above script validates the DataFrame against the specified database schema and table, ensuring that column types match and no extra columns are present.

## Contributing

Contributions are welcome! Please create a pull request or open an issue to discuss your ideas.

## License

This project is licensed under the MIT License.
            
