## README.md
describr is a Python library that provides functionality for descriptive statistics and outlier detection in pandas DataFrames.
**Installation**
You can install describr using pip:
```bash
pip install describr
```
#### Example usage
```python
import pandas as pd
import numpy as np
from describr import FindOutliers, DescriptiveStats
```
#### Create a sample dataframe
```python
np.random.seed(0)
n = 500
data = {
'MCID': ['MCID_' + str(i) for i in range(1, n + 1)],
'Age': np.random.randint(18, 90, size=n),
'Race': np.random.choice(['White', 'Black', 'Asian', 'Hispanic',''], size=n),
'Educational_Status': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD',''], size=n),
'Gender': np.random.choice(['Male', 'Female', ''], size=n),
'ER_COST': np.random.uniform(500, 5000, size=n),
'ER_VISITS': np.random.randint(0, 10, size=n),
'IP_COST': np.random.uniform(5000, 20000, size=n),
'IP_ADMITS': np.random.randint(0, 5, size=n),
'CHF': np.random.choice([0, 1], size=n),
'COPD': np.random.choice([0, 1], size=n),
'DM': np.random.choice([0, 1], size=n),
'ASTHMA': np.random.choice([0, 1], size=n),
'HYPERTENSION': np.random.choice([0, 1], size=n),
'SCHIZOPHRENIA': np.random.choice([0, 1], size=n),
'MOOD_DEPRESSED': np.random.choice([0, 1], size=n),
'MOOD_BIPOLAR': np.random.choice([0, 1], size=n),
'TREATMENT': np.random.choice(['Yes', 'No'], size=n)
}
df = pd.DataFrame(data)
```
#### Parameters
**df**: The pandas DataFrame to analyze.

**id_col**: Primary key of the dataframe. Accepts a string, integer, or float.

**group_col**: The column to group by. It must be a binary column; strings or integers are acceptable.

**positive_class**: The response value that marks the primary outcome of interest. For instance, the positive value for a Treatment cohort is 'Yes' or 1, as opposed to 'No' or 0, respectively. Strings or integers are acceptable.

**continuous_var_summary**: The measure of central tendency used to summarize continuous variables. Only mean and median are acceptable. This parameter is case-insensitive.
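
The constraints listed above (a unique key, a binary grouping column, a positive class that actually occurs) can be checked with plain pandas before constructing either class. The sketch below is only an illustration: `check_inputs` is a hypothetical helper and not part of describr, and the toy frame is made up.

```python
import pandas as pd

# Small made-up frame; column names mirror the sample data in this README.
toy = pd.DataFrame({
    "MCID": ["MCID_1", "MCID_2", "MCID_3", "MCID_4"],
    "TREATMENT": ["Yes", "No", "Yes", "No"],
})

def check_inputs(frame, id_col, group_col, positive_class):
    """Hypothetical helper (not a describr function): list any input problems."""
    problems = []
    # The id column should identify each row uniquely.
    if not frame[id_col].is_unique:
        problems.append(f"{id_col} is not a unique key")
    # The grouping column should have exactly two distinct non-missing values.
    levels = frame[group_col].dropna().unique().tolist()
    if len(levels) != 2:
        problems.append(f"{group_col} has {len(levels)} levels, expected 2")
    # The positive class should be one of the observed values.
    if positive_class not in levels:
        problems.append(f"{positive_class!r} does not occur in {group_col}")
    return problems

problems = check_inputs(toy, "MCID", "TREATMENT", "Yes")
print(problems)  # [] when all checks pass
```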
#### Example usage of FindOutliers Class
This returns a dataframe (`outliers_flag_df`) with an `outlier_flag` column (`outlier_flag = 1`: the record contains one or more outliers). Tukey's IQR method is used to detect outliers in the data.
```python
outliers_flag = FindOutliers(df=df, id_col='MCID', group_col='TREATMENT')
outliers_flag_df = outliers_flag.flag_outliers()
```
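
For readers unfamiliar with Tukey's IQR method, here is a minimal pandas sketch of the general technique. It is not describr's internal code: the helper name `tukey_outlier_mask`, the conventional multiplier `k=1.5`, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def tukey_outlier_mask(values, k=1.5):
    """Illustrative helper: True where a value lies outside Tukey's fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up data: only the last ER_COST value is extreme.
toy = pd.DataFrame({
    "ER_COST": [10, 12, 11, 13, 12, 11, 100],
    "IP_COST": [50, 52, 51, 53, 52, 51, 52],
})

# Test every column, then flag a row if any of its values is an outlier.
per_column = toy.apply(tukey_outlier_mask)
toy["outlier_flag"] = per_column.any(axis=1).astype(int)
print(toy["outlier_flag"].tolist())  # [0, 0, 0, 0, 0, 0, 1]
```

A row-level flag like this matches the description above, where `outlier_flag = 1` means the record contains at least one outlying value.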
#### This example counts the number of rows with outliers, stratified by the defined grouping variable
```python
outliers_flag.count_outliers()
```
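
As a rough picture of what a stratified count involves, the same kind of tally can be produced with a pandas `groupby` on a flagged frame. This sketch uses a made-up frame and does not show the actual output format of `count_outliers()`, which may differ.

```python
import pandas as pd

# Made-up flagged data: 1 marks a row that contains at least one outlier.
flagged = pd.DataFrame({
    "TREATMENT": ["Yes", "Yes", "No", "No", "No"],
    "outlier_flag": [1, 0, 0, 1, 1],
})

# Summing a 0/1 flag within each group counts the flagged rows.
counts = flagged.groupby("TREATMENT")["outlier_flag"].sum()
print({group: int(n) for group, n in counts.items()})  # {'No': 2, 'Yes': 1}
```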
#### This example removes all outliers
```python
df2 = outliers_flag.remove_outliers()
df2.shape
```
#### Example usage of DescriptiveStats Class
```python
descriptive_stats = DescriptiveStats(df=df, id_col='MCID', group_col='TREATMENT', positive_class='Yes', continuous_var_summary='median')
```
#### Gets statistics for binary and categorical variables and returns a dataframe.
```python
binary_stats_df = descriptive_stats.get_binary_stats()
```
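
To give a sense of the kind of statistics involved, the sketch below computes a count and percentage of a 0/1 indicator per group with plain pandas. It is an assumption-laden illustration on made-up data; the columns and layout returned by `get_binary_stats()` are not documented here and may differ.

```python
import pandas as pd

# Made-up data: CHF is a 0/1 indicator, TREATMENT is the grouping column.
toy = pd.DataFrame({
    "TREATMENT": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "CHF": [1, 0, 1, 0, 0, 1],
})

grouped = toy.groupby("TREATMENT")["CHF"]
summary = pd.DataFrame({
    "n": grouped.sum(),                          # rows with CHF == 1
    "percent": (grouped.mean() * 100).round(1),  # share of the group, in %
})
print(summary)
```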
#### Gets mean and standard deviation for continuous variables and returns a dataframe.
```python
continuous_stats_mean_df = descriptive_stats.get_continuous_mean_stats()
```
#### Gets median and interquartile range for continuous variables and returns a dataframe.
```python
continuous_stats_median_df = descriptive_stats.get_continuous_median_stats()
```
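
For reference, a median and interquartile range per group can be computed directly in pandas as sketched below. The data are made up, and the output layout is an assumption rather than what `get_continuous_median_stats()` returns.

```python
import pandas as pd

# Made-up data: five cost values per treatment group.
toy = pd.DataFrame({
    "TREATMENT": ["Yes"] * 5 + ["No"] * 5,
    "ER_COST": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

grouped = toy.groupby("TREATMENT")["ER_COST"]
q1 = grouped.quantile(0.25)
q3 = grouped.quantile(0.75)
summary = pd.DataFrame({
    "median": grouped.median(),
    "q1": q1,
    "q3": q3,
    "iqr": q3 - q1,  # interquartile range: spread of the middle 50%
})
print(summary)
```

The median and IQR are usually preferred over the mean and standard deviation when a variable is skewed, which is common for cost data.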
#### Computes summary statistics for binary and continuous variables based on defined measure of central tendency. Method returns a dataframe.
```python
descriptive_stats.compute_descriptive_stats()
summary_stats = descriptive_stats.summary_stats()
```
Raw data
{
"_id": null,
"home_page": "https://github.com/famutimine/describr",
"name": "describr",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "descriptive statistics",
"author": "Daniel Famutimi MD, MPH",
"author_email": "danielfamutimi@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/79/b6/f8508682d3f88a1732de3cfbcc8d1ff889174c4e90551374e2252569b549/describr-0.0.31.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Describr is a Python library that provides a convenient way to generate descriptive statistics for datasets.",
"version": "0.0.31",
"project_urls": {
"Homepage": "https://github.com/famutimine/describr"
},
"split_keywords": [
"descriptive",
"statistics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2d73b33cf46cec122fca1c8083d8862bec632e7543d237119575410d8b3c5c2b",
"md5": "506b70bce3887597bed8ba704759bfff",
"sha256": "8057ee6a95c04af49b233266d9f7814b67cd1ea179171856edd2c5167a0c91d6"
},
"downloads": -1,
"filename": "describr-0.0.31-py3-none-any.whl",
"has_sig": false,
"md5_digest": "506b70bce3887597bed8ba704759bfff",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 6252,
"upload_time": "2024-02-07T04:34:41",
"upload_time_iso_8601": "2024-02-07T04:34:41.483848Z",
"url": "https://files.pythonhosted.org/packages/2d/73/b33cf46cec122fca1c8083d8862bec632e7543d237119575410d8b3c5c2b/describr-0.0.31-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "79b6f8508682d3f88a1732de3cfbcc8d1ff889174c4e90551374e2252569b549",
"md5": "bc84962c350601498a782a1a12194611",
"sha256": "1a64fd7e36f6709944a4f88d63635f83613f343663cddb7b2b7e41fba140d1c9"
},
"downloads": -1,
"filename": "describr-0.0.31.tar.gz",
"has_sig": false,
"md5_digest": "bc84962c350601498a782a1a12194611",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6578,
"upload_time": "2024-02-07T04:34:42",
"upload_time_iso_8601": "2024-02-07T04:34:42.597895Z",
"url": "https://files.pythonhosted.org/packages/79/b6/f8508682d3f88a1732de3cfbcc8d1ff889174c4e90551374e2252569b549/describr-0.0.31.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-07 04:34:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "famutimine",
"github_project": "describr",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "describr"
}