categorical-finder


Name: categorical-finder
Version: 0.0.1
Home page: None
Summary: A Python utility for analyzing and suggesting appropriate encoding methods for categorical columns in CSV datasets.
Upload time: 2024-12-31 11:20:00
Maintainer: None
Docs URL: None
Author: Sri Jaya Karti
Requires Python: None
License: MIT
Keywords: categorical
Requirements: No requirements were recorded.

# Categorical Data Analysis Tool

A Python utility for analyzing and suggesting appropriate encoding methods for categorical columns in CSV datasets. This tool helps data scientists and analysts make informed decisions about how to handle categorical variables in their machine learning pipelines.

## Features

- Automatic detection of categorical columns
- Analysis of cardinality and unique value distributions
- Intelligent encoding suggestions based on data characteristics
- Detection of ordinal categorical variables
- Identification of potential data quality issues
- Comprehensive reporting of analysis results

## Prerequisites

```text
pandas>=1.0.0
numpy>=1.18.0
```

## Installation

1. Clone this repository or copy the script to your local machine
2. Install the required dependencies:
   ```bash
   pip install pandas numpy
   ```

## Usage

Run the script directly from the command line:

```bash
python categorical_analysis.py
```

When prompted, provide the path to your CSV file.

### Function Parameters

The main analysis function `analyze_categorical_columns()` accepts the following parameters:

- `csv_file_path` (str): Path to the input CSV file
- `max_unique_values` (int, default=50): Maximum number of unique values for a column to be considered categorical
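- `unique_ratio_threshold` (float, default=0.1): Maximum ratio of unique values to total rows for a column to be considered categorical
- `cardinality_threshold` (int, default=10): Number of unique values below which one-hot encoding is suggested; columns at or above it get a frequency, target, or hashing suggestion instead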
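
To make the two uniqueness parameters concrete, here is a minimal sketch of how they might be applied to a single column. The helper name `passes_uniqueness_criteria` is hypothetical, and the way the two limits are combined (either one is enough) is an assumption; the script itself may combine them differently. The dtype check described under Notes is left out so the sketch runs without pandas.

```python
def passes_uniqueness_criteria(values, max_unique_values=50,
                               unique_ratio_threshold=0.1):
    """Hypothetical helper: apply the two uniqueness parameters to one column."""
    total_rows = len(values)
    if total_rows == 0:
        return False
    # Assumption: None marks a missing cell and is not counted as a value.
    n_unique = len({v for v in values if v is not None})
    unique_ratio = n_unique / total_rows
    # Assumption: satisfying either limit is enough to qualify.
    return (n_unique <= max_unique_values
            or unique_ratio <= unique_ratio_threshold)


status = ["low", "medium", "high", "low"] * 25     # 3 unique values in 100 rows
record_ids = [f"id{i}" for i in range(100)]        # every row is unique

print(passes_uniqueness_criteria(status))      # few unique values
print(passes_uniqueness_criteria(record_ids))  # identifier-like column
```

Under these assumptions, a column of free-form identifiers fails both limits, while a short list of repeated labels passes.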

### Encoding Suggestions

The tool suggests one of three encoding methods based on the column characteristics:

1. **Ordinal Encoding**: Suggested for categorical variables with natural ordering (detected through keywords like 'low', 'medium', 'high', etc.)
2. **One-Hot Encoding**: Recommended for categorical variables with low cardinality (fewer unique values than the cardinality threshold)
3. **Frequency/Target/Hashing Encoding**: Suggested for high-cardinality categorical variables
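
The three rules above can be sketched as a small decision function. This is an illustration only: `suggest_encoding` and `ORDINAL_KEYWORDS` are hypothetical names, the keyword set contains just the three words this README mentions, and exact, case-insensitive matching is an assumption about how keywords are detected.

```python
# Assumption: illustrative keyword set; the README names only these three.
ORDINAL_KEYWORDS = {"low", "medium", "high"}


def suggest_encoding(unique_values, cardinality_threshold=10):
    """Hypothetical helper: map a column's unique values to a suggestion."""
    distinct = set(unique_values)
    normalized = {str(v).strip().lower() for v in distinct}
    # Rule 1: an ordering keyword among the values hints at an ordinal scale.
    if normalized & ORDINAL_KEYWORDS:
        return "Ordinal Encoding"
    # Rule 2: low cardinality, i.e. fewer unique values than the threshold.
    if len(distinct) < cardinality_threshold:
        return "One-Hot Encoding"
    # Rule 3: everything else is treated as high cardinality.
    return "Frequency/Target/Hashing Encoding"


print(suggest_encoding(["Low", "Medium", "High"]))
print(suggest_encoding(["Electronics", "Clothing", "Books"]))
print(suggest_encoding([f"city_{i}" for i in range(25)]))
```

Checking the ordinal rule first means a column such as `low`/`medium`/`high` is not sent to one-hot encoding merely because it has few values.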

## Output Format

The tool provides two types of output:

1. **Column Classification**: Lists all columns in the dataset and whether they are identified as categorical
2. **Detailed Analysis**: For each categorical column, provides:
   - Number of unique values
   - List of unique values (up to 10 shown)
   - Suggested encoding method
   - Additional notes (missing values, unique value counts, etc.)
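
As a rough illustration of the detailed analysis, the sketch below collects the same pieces of information for one column into a dictionary. The function name `summarize_column`, the dictionary keys, and the use of `None` for missing cells are all assumptions made for this example; the script's internal data structures may look different.

```python
def summarize_column(name, values, max_shown=10):
    """Hypothetical helper: gather the per-column details listed above."""
    # Assumption: None marks a missing cell.
    present = [v for v in values if v is not None]
    # dict.fromkeys keeps one copy of each value in first-seen order.
    unique_values = list(dict.fromkeys(present))
    notes = []
    if len(present) < len(values):
        notes.append("Contains missing values")
    return {
        "column": name,
        "unique_count": len(unique_values),
        "unique_values_shown": unique_values[:max_shown],  # at most 10 listed
        "notes": notes,
    }


report = summarize_column(
    "product_category",
    ["Electronics", "Clothing", None, "Books", "Clothing"],
)
print(report)
```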

## Example Output

```
--------------------------------------------------
Column 'product_category' is categorical: True
Column 'price' is categorical: False
Column 'status' is categorical: True

Categorical Columns Analysis:

Column: product_category
Unique Values: 8
Encoding Suggestion: One-Hot Encoding

Unique Values:
  - Electronics
  - Clothing
  - Books
  ...

Additional Notes:
  - Contains missing values
```

## Notes

- The tool automatically detects categorical columns based on data type ('object' or 'category') and uniqueness criteria
- Columns with numeric data types are excluded from the analysis
- The tool provides warnings about potential data quality issues like missing values
- Analysis results can be used to inform feature engineering decisions in machine learning pipelines

## Limitations

- Only processes CSV files
- Assumes categorical data is stored as string or category dtype
- May not detect numeric categorical variables
- Limited to basic encoding suggestions without considering specific use cases
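
One possible workaround for numeric categorical variables, sketched under the assumption that the CSV is prepared with pandas before the tool reads it: relabel the numeric codes as strings so the column no longer loads as a numeric dtype. The column names and the `level_` prefix are made up for this example, and whether the tool then classifies the column as categorical still depends on its uniqueness criteria.

```python
import io

import pandas as pd

# Hypothetical data: 'rating' holds numeric codes that really name categories.
df = pd.DataFrame({
    "rating": [1, 3, 2, 3, 1],
    "price": [9.5, 20.0, 15.25, 18.0, 7.0],
})

# Turn each code into a string label so it survives the CSV round trip as text.
df["rating"] = df["rating"].map(lambda code: f"level_{code}")

# Round-trip through an in-memory CSV to mimic what a CSV reader would see.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```

A plain `astype("category")` would not be enough here, because a CSV file does not store dtypes and the codes would load as integers again.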

## Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "categorical-finder",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "categorical",
    "author": "Sri Jaya Karti",
    "author_email": "srijayakarti@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/01/af/0986bfc3bfa95ff7a3b27d040ed6c260e963b2bd8a86d71f0e42feb12b68/categorical_finder-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python utility for analyzing and suggesting appropriate encoding methods for categorical columns in CSV datasets.",
    "version": "0.0.1",
    "project_urls": null,
    "split_keywords": [
        "categorical"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "01af0986bfc3bfa95ff7a3b27d040ed6c260e963b2bd8a86d71f0e42feb12b68",
                "md5": "c3cd8ee4f515bae62694d0227956facd",
                "sha256": "b2c23d3a0263df6ed337a089250b4fdf6995d2a8acf5ad87f1320bcbc69a5b2b"
            },
            "downloads": -1,
            "filename": "categorical_finder-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "c3cd8ee4f515bae62694d0227956facd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4555,
            "upload_time": "2024-12-31T11:20:00",
            "upload_time_iso_8601": "2024-12-31T11:20:00.030248Z",
            "url": "https://files.pythonhosted.org/packages/01/af/0986bfc3bfa95ff7a3b27d040ed6c260e963b2bd8a86d71f0e42feb12b68/categorical_finder-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-31 11:20:00",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "categorical-finder"
}
        