Name | categorical-finder |
Version | 0.0.1 |
home_page | None |
Summary | A Python utility for analyzing and suggesting appropriate encoding methods for categorical columns in CSV datasets. |
upload_time | 2024-12-31 11:20:00 |
maintainer | None |
docs_url | None |
author | Sri Jaya Karti |
requires_python | None |
license | MIT |
keywords | categorical |
# Categorical Data Analysis Tool
A Python utility for analyzing and suggesting appropriate encoding methods for categorical columns in CSV datasets. This tool helps data scientists and analysts make informed decisions about how to handle categorical variables in their machine learning pipelines.
## Features
- Automatic detection of categorical columns
- Analysis of cardinality and unique value distributions
- Intelligent encoding suggestions based on data characteristics
- Detection of ordinal categorical variables
- Identification of potential data quality issues
- Comprehensive reporting of analysis results
## Prerequisites
```text
pandas>=1.0.0
numpy>=1.18.0
```
## Installation
1. Clone this repository or copy the script to your local machine
2. Install the required dependencies:
```bash
pip install pandas numpy
```
## Usage
Run the script directly from the command line:
```bash
python categorical_analysis.py
```
When prompted, provide the path to your CSV file.
### Function Parameters
The main analysis function `analyze_categorical_columns()` accepts the following parameters:
- `csv_file_path` (str): Path to the input CSV file
- `max_unique_values` (int, default=50): Maximum number of unique values a column may have and still be considered categorical
- `unique_ratio_threshold` (float, default=0.1): Maximum ratio of unique values to total rows for a column to be considered categorical
- `cardinality_threshold` (int, default=10): Number of unique values below which one-hot encoding is suggested; columns at or above it receive a high-cardinality suggestion instead
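
The detection parameters above can be illustrated with a minimal sketch. It is not the tool's actual code: the helper name `looks_categorical` is hypothetical, and requiring both uniqueness criteria at once is an assumption, since this README does not say how they are combined.

```python
import pandas as pd


def looks_categorical(series, max_unique_values=50, unique_ratio_threshold=0.1):
    """Hypothetical helper: guess whether a column is categorical."""
    # Only text-like ('object'/string) or 'category' columns are considered;
    # numeric columns are rejected here.
    text_like = (
        pd.api.types.is_object_dtype(series)
        or pd.api.types.is_string_dtype(series)
        or isinstance(series.dtype, pd.CategoricalDtype)
    )
    if not text_like or len(series) == 0:
        return False

    n_unique = series.nunique(dropna=True)
    unique_ratio = n_unique / len(series)

    # Assumption: both limits must hold; the real tool may combine them differently.
    return n_unique <= max_unique_values and unique_ratio <= unique_ratio_threshold


status = pd.Series(["open", "closed"] * 20)       # 2 unique values in 40 rows
price = pd.Series(range(40))                      # numeric, so not considered
ids = pd.Series([f"id{i}" for i in range(40)])    # every value is unique

print(looks_categorical(status), looks_categorical(price), looks_categorical(ids))
```

Under these assumptions, lowering `unique_ratio_threshold` makes detection stricter for small files, where even a handful of distinct values is a large share of the rows.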
### Encoding Suggestions
The tool suggests one of three encoding methods based on the column characteristics:
1. **Ordinal Encoding**: Suggested for categorical variables with a natural ordering (detected through keywords such as 'low', 'medium', and 'high')
2. **One-Hot Encoding**: Recommended for categorical variables with low cardinality (fewer unique values than the cardinality threshold)
3. **Frequency/Target/Hashing Encoding**: Suggested for high-cardinality categorical variables
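
The three rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: `suggest_encoding` and `ORDINAL_KEYWORDS` are hypothetical names, the keyword set contains only the three examples this README mentions, and checking the ordinal rule first is a guess.

```python
import pandas as pd

# Assumed keyword list: only the three keywords named in this README.
ORDINAL_KEYWORDS = {"low", "medium", "high"}


def suggest_encoding(series, cardinality_threshold=10):
    """Hypothetical helper: return an encoding suggestion for one column."""
    # Normalise values so that 'Low' and ' low ' both match the keyword 'low'.
    values = {str(v).strip().lower() for v in series.dropna().unique()}

    # Rule 1 (assumed to take priority): any ordinal keyword present.
    if values & ORDINAL_KEYWORDS:
        return "Ordinal Encoding"

    # Rule 2: fewer unique values than the cardinality threshold.
    if series.nunique(dropna=True) < cardinality_threshold:
        return "One-Hot Encoding"

    # Rule 3: everything else is treated as high cardinality.
    return "Frequency/Target/Hashing Encoding"


print(suggest_encoding(pd.Series(["Low", "High", "Medium"])))
print(suggest_encoding(pd.Series(["Books", "Clothing", "Electronics"])))
print(suggest_encoding(pd.Series([f"city_{i}" for i in range(25)])))
```

A keyword match of this kind is only a heuristic: a column containing "High Street" would not match here, while a column with a single "low" value would, so the suggestion is worth reviewing by hand.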
## Output Format
The tool provides two types of output:
1. **Column Classification**: Lists all columns in the dataset and whether they are identified as categorical
2. **Detailed Analysis**: For each categorical column, provides:
   - Number of unique values
   - List of unique values (up to 10 shown)
   - Suggested encoding method
   - Additional notes (missing values, unique value counts, etc.)
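
As a rough illustration of how the detailed analysis for one column might be assembled, here is a small sketch. The function `summarize_column`, its dictionary keys, and the wording of the truncation note are all hypothetical; only the 10-value display limit and the missing-value note come from this README.

```python
import pandas as pd


def summarize_column(series, max_shown=10):
    """Hypothetical helper: collect the per-column details listed above."""
    # Unique non-missing values, in order of first appearance.
    unique_values = list(series.dropna().unique())

    notes = []
    if series.isna().any():
        notes.append("Contains missing values")
    if len(unique_values) > max_shown:
        # Assumed wording for the truncation note.
        notes.append(f"Showing {max_shown} of {len(unique_values)} unique values")

    return {
        "column": series.name,
        "n_unique": len(unique_values),
        "shown_values": unique_values[:max_shown],
        "notes": notes,
    }


column = pd.Series(
    ["Electronics", "Clothing", None, "Books", "Clothing"],
    name="product_category",
)
report = summarize_column(column)
print(report)
```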
## Example Output
```
--------------------------------------------------
Column 'product_category' is categorical: True
Column 'price' is categorical: False
Column 'status' is categorical: True

Categorical Columns Analysis:

Column: product_category
Unique Values: 8
Encoding Suggestion: One-Hot Encoding

Unique Values:
   - Electronics
   - Clothing
   - Books
   ...

Additional Notes:
   - Contains missing values
```
## Notes
- The tool automatically detects categorical columns based on data type ('object' or 'category') and uniqueness criteria
- Columns with numeric data types are excluded from the analysis
- The tool provides warnings about potential data quality issues like missing values
- Analysis results can be used to inform feature engineering decisions in machine learning pipelines
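
Once a suggestion has been reviewed, the encodings themselves can be applied with plain pandas. The sketch below is separate from this tool and uses made-up data and column names; the category order for the ordinal column is supplied by hand, because the tool only suggests an encoding and does not report an order.

```python
import pandas as pd

# Made-up example data; none of these columns come from the tool.
df = pd.DataFrame({
    "priority": ["low", "high", "medium", "low"],
    "category": ["Books", "Clothing", "Books", "Electronics"],
    "city": ["Pune", "Oslo", "Pune", "Pune"],
})

# Ordinal encoding: map each level to its position in a hand-written order.
order = ["low", "medium", "high"]
df["priority_code"] = pd.Categorical(
    df["priority"], categories=order, ordered=True
).codes

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(df["category"], prefix="category")

# Frequency encoding: replace each value with its share of the rows.
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

print(df[["priority_code", "city_freq"]])
print(one_hot.columns.tolist())
```

Target and hashing encoding are omitted here because they depend on a label column or an external library, neither of which this README describes.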
## Limitations
- Only processes CSV files
- Assumes categorical data is stored as string or category dtype
- Does not detect categorical variables stored as numbers, because columns with numeric data types are excluded from the analysis
- Limited to basic encoding suggestions without considering specific use cases
## Contributing
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
Raw data
{
"_id": null,
"home_page": null,
"name": "categorical-finder",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "categorical",
"author": "Sri Jaya Karti",
"author_email": "srijayakarti@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/01/af/0986bfc3bfa95ff7a3b27d040ed6c260e963b2bd8a86d71f0e42feb12b68/categorical_finder-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python utility for analyzing and suggesting appropriate encoding methods for categorical columns in CSV datasets.",
"version": "0.0.1",
"project_urls": null,
"split_keywords": [
"categorical"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "01af0986bfc3bfa95ff7a3b27d040ed6c260e963b2bd8a86d71f0e42feb12b68",
"md5": "c3cd8ee4f515bae62694d0227956facd",
"sha256": "b2c23d3a0263df6ed337a089250b4fdf6995d2a8acf5ad87f1320bcbc69a5b2b"
},
"downloads": -1,
"filename": "categorical_finder-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "c3cd8ee4f515bae62694d0227956facd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4555,
"upload_time": "2024-12-31T11:20:00",
"upload_time_iso_8601": "2024-12-31T11:20:00.030248Z",
"url": "https://files.pythonhosted.org/packages/01/af/0986bfc3bfa95ff7a3b27d040ed6c260e963b2bd8a86d71f0e42feb12b68/categorical_finder-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-31 11:20:00",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "categorical-finder"
}