sanitipy

Name	sanitipy
Version	1.1.0
home_page	https://github.com/yestab335/sanitipy
Summary	Sanitipy is a user-friendly Python library designed for data cleaning and preprocessing. It provides essential utilities to streamline the process of preparing datasets for analysis or modeling. With features such as duplicate removal, handling missing values, and automatic data type inference, sanitipy simplifies the data cleaning workflow, making it a useful tool for data scientists and analysts.
upload_time	2025-08-16 03:33:59
maintainer	None
docs_url	None
author	Adam Ben-Aamr
requires_python	None
license	MIT
keywords	data, data cleaning, missing data, data science, pandas, python
VCS
bugtrack_url
requirements	pandas, matplotlib, yapf
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# SanitiPy - Automatic Data Cleaner
<!-- Badges go here -->
![PyPI - Version](https://img.shields.io/pypi/v/sanitipy?style=for-the-badge)
![PyPI status](https://img.shields.io/pypi/status/sanitipy?style=for-the-badge)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sanitipy?style=for-the-badge)
![PyPI - License](https://img.shields.io/pypi/l/sanitipy?style=for-the-badge)

**SanitiPy automates the data cleaning process for your data science projects using Python.**

## Overview
SanitiPy is a user-friendly Python library designed to streamline the data cleaning and preprocessing workflow. It provides essential utilities to prepare datasets for analysis or modeling by handling common data quality issues such as duplicate entries, missing values, and inconsistent data types.

## Features
- **Remove Duplicates:** Easily eliminate duplicate rows from your DataFrame to ensure data integrity.

- **Handle Missing Values:** Automatically identify and remove rows containing `NaN` (Not a Number) values.

- **Infer Data Types:** Intelligently detect and convert column data types, including:
  - Converting potential datetime columns based on a configurable ratio of valid dates.
  
  - Converting numeric-like values to proper numeric types.
  
  - Falling back to string type when type inference is unsuccessful.

- **Automated Cleaning Process:** The `DataCleaner` class orchestrates the cleaning steps, ensuring your data is ready for further analysis.

## Installation
You can install SanitiPy using pip:

```zsh
pip install sanitipy
```

## Usage
A quick example of using the package with a pandas DataFrame:

```python
import pandas as pd
from sanitipy import DataCleaner

# Create a sample DataFrame with some common data issues
data = {
  'ID': [1, 2, 3, 1, 4, 5],
  'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve'],
  'Value': [100, 200, None, 100, 400, 500],
  'Date': ['2023/01/01', '2023/01/02', '2023/01/03', '2023/01/01', 'invalid-date', '2023/01/05'],
  'Category': ['A', 'B', 'C', 'A', 'D', 'E']
}
df = pd.DataFrame(data)

# Initialize the DataCleaner
cleaner = DataCleaner(df)

# Clean the data
cleaned_df = cleaner.clean_data()
```

## API Reference
### `DataCleaner` Class
The main class for orchestrating the data cleaning process.

- `__init__(self, data_frame: pd.DataFrame)`:
  - Initializes the `DataCleaner` with a pandas DataFrame.

- `clean_data(self) -> pd.DataFrame`:
  - Performs a sequence of cleaning operations:
    - Removes duplicate rows.
    
    - Removes rows with missing values (if any are detected). Raises a `ValueError` if missing values persist after removal.

    - Infers and converts data types for columns with inconsistent types.

    - Resets the DataFrame index.
  - Returns the cleaned pandas DataFrame.
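
The sequence above can be approximated with plain pandas calls. This is only a sketch of the documented steps, not SanitiPy's actual implementation: the name `clean_sketch` is hypothetical, the type inference step is left out, and dropping the old index on reset is an assumption.

```python
import pandas as pd


def clean_sketch(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Plain-pandas approximation of the documented clean_data steps."""
    # Step 1: remove duplicate rows.
    cleaned = data_frame.drop_duplicates()

    # Step 2: remove rows with missing values, only if any are detected.
    if cleaned.isna().any().any():
        cleaned = cleaned.dropna()

    # Guard described in the docs: fail if missing values persist.
    if cleaned.isna().any().any():
        raise ValueError("Missing values persist after removal.")

    # Step 3 (type inference) is omitted in this sketch.

    # Step 4: reset the index (assumption: the old index is discarded).
    return cleaned.reset_index(drop=True)


df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 4, 5],
    "Value": [100, 200, None, 100, 400, 500],
})
result = clean_sketch(df)
```

Here the fourth row duplicates the first and the third row has a missing `Value`, so both are dropped and the remaining four rows are re-indexed from 0.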

### `Preprocessor` Class
Provides individual data transformation and cleaning utilities.

- `remove_duplicates(self, data: pd.DataFrame) -> pd.DataFrame`:
  - Removes duplicate rows from the input DataFrame.

- `remove_na(self, data: pd.DataFrame) -> pd.DataFrame`:
  - Removes rows containing any `NaN` values from the input DataFrame.

- `infer_data_types(self, data_frame: pd.DataFrame, date_time_ratio: float = 0.5) -> pd.DataFrame`:
  - Infers and converts data types for columns.
  - `date_time_ratio`: The threshold (0-1) for treating an object column as datetime. Default is 0.5 (50% valid date values required).
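
To illustrate how a ratio threshold like `date_time_ratio` can work, here is a minimal sketch in plain pandas. The helper `maybe_to_datetime`, its explicit `fmt` parameter, and the `>=` comparison are assumptions made for illustration; SanitiPy's own logic may differ.

```python
import pandas as pd


def maybe_to_datetime(column: pd.Series,
                      date_time_ratio: float = 0.5,
                      fmt: str = "%Y/%m/%d") -> pd.Series:
    """Convert a column to datetime only if enough values parse as dates."""
    # Unparseable values become NaT instead of raising an error.
    parsed = pd.to_datetime(column, format=fmt, errors="coerce")

    # Fraction of values that parsed successfully.
    valid_ratio = parsed.notna().mean()

    # Assumption: the threshold is inclusive.
    return parsed if valid_ratio >= date_time_ratio else column


dates = pd.Series(["2023/01/01", "2023/01/02", "invalid-date", "2023/01/05"])

# 3 of 4 values parse (ratio 0.75), which meets the default threshold of 0.5.
converted = maybe_to_datetime(dates)

# With a stricter threshold of 0.9 the column is returned unchanged.
unchanged = maybe_to_datetime(dates, date_time_ratio=0.9)
```

In the converted column, `invalid-date` becomes `NaT`, so a later missing-value check would still flag that row.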

### `Validator` Class
Used internally by `DataCleaner` to check data quality.

- `check_missing_values(self, data: pd.DataFrame) -> int`:
  - Returns the total count of missing values in the DataFrame.

- `validate_data_types(self) -> bool`:
  - Checks if all columns in the DataFrame have consistent data types. Returns `True` if all columns have the same data type or if the DataFrame is empty, `False` otherwise.
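
The two checks described above could be expressed in plain pandas roughly as follows. This is a sketch under assumptions: the function names `count_missing` and `has_single_dtype` are hypothetical, and both take the DataFrame as an argument rather than mirroring the `Validator` signatures.

```python
import pandas as pd


def count_missing(data: pd.DataFrame) -> int:
    """Total number of missing cells across all columns."""
    return int(data.isna().sum().sum())


def has_single_dtype(data: pd.DataFrame) -> bool:
    """True if the frame is empty or every column shares one dtype."""
    return bool(data.empty or data.dtypes.nunique() <= 1)


uniform = pd.DataFrame({"a": [1.0, None], "b": [2.0, 3.0]})
mixed = pd.DataFrame({"a": [1], "b": ["x"]})

missing_in_uniform = count_missing(uniform)   # one missing cell in column "a"
uniform_ok = has_single_dtype(uniform)        # both columns are float64
mixed_ok = has_single_dtype(mixed)            # integer and text columns differ
empty_ok = has_single_dtype(pd.DataFrame())   # empty frame counts as consistent
```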

## Contributing
Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.

## License
This project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for details.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/yestab335/sanitipy",
    "name": "sanitipy",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "data, data cleaning, missing data, data science, pandas, python",
    "author": "Adam Ben-Aamr",
    "author_email": "adambenaamr@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/6a/2e/0a12d5450bbd9b65f576f526dd31bbea6917325117da0d8548e3bd60c841/sanitipy-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Sanitipy is a user-friendly Python library designed for data cleaning and preprocessing. It provides essential utilities to streamline the process of preparing datasets for analysis or modeling. With features such as duplicate removal, handling missing values, and automatic data type inference, sanitipy simplifies the data cleaning workflow, making it a useful tool for data scientists and analysts.",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://github.com/yestab335/sanitipy"
    },
    "split_keywords": [
        "data",
        " data cleaning",
        " missing data",
        " data science",
        " pandas",
        " python"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ddb48eae498fbe185222ebf391002cf8cd161327539026ad86e508d01b9771c4",
                "md5": "70cc4fa90f1f4e6ae46f1624250e23c7",
                "sha256": "249274eda0497b8b4c7545247d539d82aa5fe298146c64985e34989e6e4bcd94"
            },
            "downloads": -1,
            "filename": "sanitipy-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "70cc4fa90f1f4e6ae46f1624250e23c7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 18714,
            "upload_time": "2025-08-16T03:33:58",
            "upload_time_iso_8601": "2025-08-16T03:33:58.065586Z",
            "url": "https://files.pythonhosted.org/packages/dd/b4/8eae498fbe185222ebf391002cf8cd161327539026ad86e508d01b9771c4/sanitipy-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6a2e0a12d5450bbd9b65f576f526dd31bbea6917325117da0d8548e3bd60c841",
                "md5": "bb8b16e91da575c0631774ed85067bca",
                "sha256": "300fec2c23d64b2c1cf797e78563be9e7ded65a22c81099fcb392d7b17d27bbe"
            },
            "downloads": -1,
            "filename": "sanitipy-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "bb8b16e91da575c0631774ed85067bca",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 19125,
            "upload_time": "2025-08-16T03:33:59",
            "upload_time_iso_8601": "2025-08-16T03:33:59.226870Z",
            "url": "https://files.pythonhosted.org/packages/6a/2e/0a12d5450bbd9b65f576f526dd31bbea6917325117da0d8548e3bd60c841/sanitipy-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-16 03:33:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yestab335",
    "github_project": "sanitipy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "matplotlib",
            "specs": []
        },
        {
            "name": "yapf",
            "specs": []
        }
    ],
    "lcname": "sanitipy"
}

Adam Ben-Aamr