# phenome-outlier-analysis
# OutlierDetector Class Documentation
## Overview
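The `OutlierDetector` class detects outliers in tabular datasets by normalizing each analyte column with one of two methods, double Median Absolute Deviation (`double_mad`) or z-score (`zscore`), and then applying percentile cutoffs. It supports both context-specific (segmented) and global detection strategies, so the same data can be screened within subgroups and across the whole dataset.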
## Class Initialization
```python
OutlierDetector(df, analyte_columns, segment_columns=['sex'])
```
### Parameters:
- `df` (pandas.DataFrame): The input DataFrame containing the data to be analyzed.
- `analyte_columns` (list): A list of column names to be analyzed for outliers.
- `segment_columns` (list, optional): A list of column names used for segmentation in context-specific outlier detection. Defaults to ['sex'].
## Main Methods
### 1. perform_outlier_detection
```python
perform_outlier_detection(lower_percentile=0.01, upper_percentile=0.99, method='double_mad', take_log=False)
```
This is the primary entry point. It runs outlier detection on the DataFrame passed to the constructor and returns both the context-specific and the global results.
#### Parameters:
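- `lower_percentile` (float): Lower percentile for cutoff calculation, given as a fraction (0.01 is the 1st percentile). Default is 0.01.
- `upper_percentile` (float): Upper percentile for cutoff calculation, given as a fraction (0.99 is the 99th percentile). Default is 0.99.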
- `method` (str): Normalization method. Can be 'double_mad' or 'zscore'. Default is 'double_mad'.
- `take_log` (bool): Whether to apply log transformation before normalization. Default is False.
#### Returns:
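A tuple containing two dictionaries:
1. Context-specific results, keyed by `(segment, value)` tuples. Each value is a dictionary that includes a `binary_matrix` entry.
2. Super-global results, stored under the key `('global', 'global')` and likewise including a `binary_matrix` entry.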
### 2. context_specific_outlier_detection
```python
context_specific_outlier_detection(method='double_mad', take_log=False)
```
Performs context-specific outlier detection by segmenting the DataFrame based on the `segment_columns`.
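As a rough illustration of the segmentation step, the sketch below groups rows by each segment column and value using only the standard library. The helper `split_by_segments` and the row layout are hypothetical; the `(segment, value)` key shape mirrors the usage example in this document, but whether the package treats each segment column separately (as assumed here) or combines them is not stated.

```python
from collections import defaultdict

def split_by_segments(rows, segment_columns):
    """Group rows under (segment_column, value) keys.

    Hypothetical helper: assumes each segment column is handled on its own,
    so a row appears once per segment column.
    """
    groups = defaultdict(list)
    for column in segment_columns:
        for row in rows:
            groups[(column, row[column])].append(row)
    return dict(groups)

# Illustrative rows; column names follow the usage example in this document.
rows = [
    {"sex": "F", "age_group": "young", "column1": 1.2},
    {"sex": "M", "age_group": "young", "column1": 0.8},
    {"sex": "F", "age_group": "old", "column1": 1.9},
]

groups = split_by_segments(rows, ["sex", "age_group"])
for (segment, value), members in groups.items():
    print(f"{segment}={value}: {len(members)} rows")
```

Each group would then be normalized and screened on its own, so a value is judged against its own subgroup rather than the whole dataset.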
### 3. super_global_outlier_detection
```python
super_global_outlier_detection(method='double_mad', take_log=False)
```
Evaluates outliers on a global scale, considering all data points together.
## Helper Methods
### calculate_double_mad
Calculates left and right Median Absolute Deviations (MADs) from the median.
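The idea can be sketched in plain Python as follows. This is a minimal illustration of the double MAD technique, not the package's implementation: the function name `double_mad` and the choice to include points equal to the median on both sides are assumptions.

```python
from statistics import median

def double_mad(values):
    """Return (left_mad, right_mad) around the median of `values`.

    Assumption: points equal to the median count toward both sides.
    """
    m = median(values)
    # Absolute deviations of the points at or below / at or above the median.
    left_dev = [m - v for v in values if v <= m]
    right_dev = [v - m for v in values if v >= m]
    return median(left_dev), median(right_dev)

# A right-skewed sample: the upper half is far more spread out than the lower half.
sample = [1, 2, 3, 4, 5, 6, 9, 14, 30]
left_mad, right_mad = double_mad(sample)
print(left_mad, right_mad)  # → 2 4
```

Using separate spreads for each side keeps a long tail on one side from inflating the scale used for the other.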
### normalize_series
Normalizes a series using the specified method (double_mad or zscore).
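To make the two options concrete, here is a small standard-library sketch. The function `normalize` is hypothetical, and several details are assumptions: the sample standard deviation for `zscore`, no consistency constant on the MAD, and non-zero MADs on both sides.

```python
from statistics import mean, median, stdev

def normalize(values, method="double_mad"):
    """Return normalized scores for `values` (hypothetical sketch)."""
    if method == "zscore":
        mu, sigma = mean(values), stdev(values)  # sample std dev (assumption)
        return [(v - mu) / sigma for v in values]
    if method == "double_mad":
        m = median(values)
        left = median([m - v for v in values if v <= m])
        right = median([v - m for v in values if v >= m])
        # Scale each point by the MAD on its own side of the median.
        # Assumes both MADs are non-zero; a real implementation must guard this.
        return [(v - m) / (left if v < m else right) for v in values]
    raise ValueError(f"unknown method: {method!r}")

sample = [1, 2, 3, 4, 5, 6, 9, 14, 30]
mad_scores = normalize(sample, method="double_mad")
z_scores = normalize(sample, method="zscore")
print(mad_scores[0], mad_scores[-1])  # → -2.0 6.25
```

In this sample the largest value scores 6.25 under double MAD, while the z-score for the same point is smaller because that point inflates the standard deviation it is measured against.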
### calculate_percentile_cutoffs
Calculates global percentile cutoffs based on the specified columns of a DataFrame.
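One plausible reading of "global" here is that the values of all specified columns are pooled before the percentiles are taken. The sketch below assumes exactly that, along with linear interpolation between neighboring values; the helpers `quantile` and `percentile_cutoffs` are hypothetical names, not part of the package.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)  # floor, since pos is non-negative
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def percentile_cutoffs(columns, lower=0.01, upper=0.99):
    """Pool every column's values and return (lower_cutoff, upper_cutoff)."""
    pooled = sorted(v for values in columns.values() for v in values)
    return quantile(pooled, lower), quantile(pooled, upper)

# Illustrative normalized scores for two hypothetical columns.
columns = {"column1": [0, 1, 2, 3, 4], "column2": [5, 6, 7, 8, 9, 10]}
cutoffs = percentile_cutoffs(columns, lower=0.25, upper=0.75)
print(cutoffs)  # → (2.5, 7.5)
```

Pooling gives every column the same thresholds, which only makes sense once the columns have been normalized to a common scale.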
### create_binary_matrix
Creates a binary matrix indicating outliers based on specified cutoffs.
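A minimal sketch of such a matrix, using plain dictionaries in place of a DataFrame, is shown below. The function `binary_matrix`, the column names, and the use of strict inequalities at the cutoffs are all assumptions made for illustration.

```python
def binary_matrix(normalized, lower_cutoff, upper_cutoff):
    """Map each column to a list of 0/1 flags, where 1 marks an outlier.

    Assumption: a value is an outlier only if it lies strictly outside the cutoffs.
    """
    return {
        column: [int(v < lower_cutoff or v > upper_cutoff) for v in values]
        for column, values in normalized.items()
    }

# Illustrative normalized scores; the column names are hypothetical.
normalized = {
    "glucose": [-0.5, 0.2, 7.1, -3.4],
    "insulin": [0.0, 1.1, -0.3, 0.9],
}
flags = binary_matrix(normalized, lower_cutoff=-3.0, upper_cutoff=3.0)
counts = {column: sum(values) for column, values in flags.items()}
print(counts)  # → {'glucose': 2, 'insulin': 0}
```

Summing each column of flags gives a per-analyte outlier count, which is what the `.sum()` calls in the usage example report.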
### normalize_dataframe
Normalizes specified columns in a DataFrame.
### detect_outliers
Detects outliers in the specified columns of a DataFrame.
### get_global_cutoffs
Gets global cutoffs for outlier detection.
## Usage Example
```python
import pandas as pd
from outlier_detection import OutlierDetector

# Load your data
df = pd.read_csv('your_data.csv')

# Define columns
analyte_columns = ['column1', 'column2', 'column3']
segment_columns = ['sex', 'age_group']

# Create OutlierDetector instance
detector = OutlierDetector(df, analyte_columns, segment_columns)

# Perform outlier detection
context_results, global_results = detector.perform_outlier_detection(
    lower_percentile=0.01,
    upper_percentile=0.99,
    method='double_mad',
    take_log=True
)

# Analyze results: outlier counts per column for each segment value
for (segment, value), result in context_results.items():
    print(f"Outliers for {segment}={value}:")
    print(result['binary_matrix'].sum())

# Outlier counts per column across the whole dataset
print("Global outliers:")
print(global_results[('global', 'global')]['binary_matrix'].sum())
```
## Notes
- The class uses logging to provide information and warnings during the outlier detection process.
- The `tqdm` library is used to show progress bars for long-running operations.
- The class can handle both context-specific (segmented) and global outlier detection.
- Two normalization methods are supported: 'double_mad' (double Median Absolute Deviation) and 'zscore'.
- Log transformation can be applied before normalization if needed.