# 📚 Python Script Documentation for `main.py`
Welcome to the documentation for the `main.py` file. This file contains a series of utility functions designed to manipulate, transform, and analyze pandas DataFrames. The main modules used in this script are pandas, numpy, itertools, and matplotlib.
## Installation
```
pip install phenome-utils
```
OR
```
pip install git+https://git.phenome.health/trent.leslie/phenome-utils
```
## 📑 Index
1. [sum_and_sort_columns](#1-sum_and_sort_columns)
2. [binary_threshold_matrix_by_col](#2-binary_threshold_matrix_by_col)
3. [binary_threshold_matrix_by_row](#3-binary_threshold_matrix_by_row)
4. [concatenate_csvs_in_directory](#4-concatenate_csvs_in_directory)
5. [aggregate_function](#5-aggregate_function)
6. [aggregate_duplicates](#6-aggregate_duplicates)
7. [generate_subcategories](#7-generate_subcategories)
8. [generate_subcategories_with_proportions](#8-generate_subcategories_with_proportions)
9. [bin_continuous](#9-bin_continuous)
10. [load_latest_yyyymmdd_file](#10-load_latest_yyyymmdd_file)
11. [remove_rows_with_na_threshold](#11-remove_rows_with_na_threshold)
12. [impute_na_in_columns](#12-impute_na_in_columns)
---
## 1. `sum_and_sort_columns`
### Description
Sum the numerical columns of a DataFrame, remove columns with a sum of zero, and sort the columns in descending order based on their sum. Optionally, plot a histogram of the non-zero column sums.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `DataFrame` | The input DataFrame. |
| `plot_histogram`| `bool` | Whether to plot a histogram of the non-zero column sums. Defaults to False. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` | A DataFrame with columns sorted in descending order based on their sum, zero-sum columns removed, and non-numeric columns preserved as the first columns. |
### Example Usage
```python
# Example code demonstrating usage
sorted_df = sum_and_sort_columns(df, plot_histogram=True)
```
### 📊 Visualization
If `plot_histogram` is set to True, a histogram of the non-zero column sums will be displayed.
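The described behavior can be sketched in a few lines of pandas. This is an illustrative approximation, not the package's implementation; the `_sketch` name is hypothetical.

```python
import pandas as pd

def sum_and_sort_columns_sketch(df, plot_histogram=False):
    """Sketch: drop zero-sum numeric columns, sort the rest by descending sum,
    and keep non-numeric columns first."""
    numeric = df.select_dtypes(include="number")
    non_numeric = df.select_dtypes(exclude="number")
    sums = numeric.sum()
    kept = sums[sums != 0].sort_values(ascending=False).index
    result = pd.concat([non_numeric, numeric[kept]], axis=1)
    if plot_histogram:
        # The real function uses matplotlib for this step.
        sums[sums != 0].plot.hist()
    return result

df = pd.DataFrame({"A": [1, 2], "B": [0, 0], "C": [5, 5], "D": ["x", "y"]})
print(sum_and_sort_columns_sketch(df).columns.tolist())  # ['D', 'C', 'A']
```

Column `B` is dropped because its sum is zero, and the string column `D` is moved to the front.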
---
## 2. `binary_threshold_matrix_by_col`
### Description
Convert a DataFrame's numerical columns to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
### Parameters
| Parameter | Type | Description |
|-------------------------------|---------------|------------------------------------------------------------------|
| `df` | `DataFrame` | The input DataFrame. |
| `lower_threshold` | `int` | The lower percentile threshold. Defaults to 1. |
| `upper_threshold` | `int` | The upper percentile threshold. Defaults to 99. |
| `second_df` | `DataFrame` | A second DataFrame with 'subcategory' and 'decimal_proportion' columns. |
| `decimal_proportion_threshold`| `float` | Threshold for filtering the second DataFrame. Defaults to 0.1. |
| `filter_column` | `str` | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` | A binary matrix where numerical column values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved. |
### Example Usage
```python
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
```
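The core percentile-flagging step can be sketched as follows. This is a simplified illustration that omits the `second_df` filtering; the helper name is hypothetical and the real implementation may differ.

```python
import pandas as pd

def binary_threshold_by_col_sketch(df, lower_threshold=1, upper_threshold=99):
    """Sketch: per numeric column, mark values outside the percentile
    bounds as 1 and values within them as 0; object columns pass through."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        lo = df[col].quantile(lower_threshold / 100)
        hi = df[col].quantile(upper_threshold / 100)
        out[col] = ((df[col] < lo) | (df[col] > hi)).astype(int)
    return out

df = pd.DataFrame({"v": [1, 50, 50, 50, 100], "g": list("abcde")})
res = binary_threshold_by_col_sketch(df, lower_threshold=10, upper_threshold=90)
print(res["v"].tolist())  # [1, 0, 0, 0, 1]
```

Only the extreme values (1 and 100) fall outside the 10th–90th percentile band, so only they are flagged.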
---
## 3. `binary_threshold_matrix_by_row`
### Description
Convert a DataFrame's numerical rows to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
### Parameters
| Parameter | Type | Description |
|-------------------------------|---------------|------------------------------------------------------------------|
| `df` | `DataFrame` | The input DataFrame. |
| `lower_threshold` | `int` | The lower percentile threshold. Defaults to 1. |
| `upper_threshold` | `int` | The upper percentile threshold. Defaults to 99. |
| `second_df` | `DataFrame` | A second DataFrame with 'subcategory' and 'decimal_proportion' columns. |
| `decimal_proportion_threshold`| `float` | Threshold for filtering the second DataFrame. Defaults to 0.1. |
| `filter_column` | `str` | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` | A binary matrix where numerical row values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved. |
### Example Usage
```python
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_row(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
```
---
## 4. `concatenate_csvs_in_directory`
### Description
Concatenate CSV files from a root directory and its subdirectories.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `root_dir` | `str` | The root directory to start the search. |
| `filter_string` | `str` | A string that must be in the filename to be included. Defaults to None. |
| `file_extension`| `str` | The file extension to search for. Defaults to "csv". |
| `csv_filename` | `str` | The filename to save the concatenated CSV. If not provided, returns the DataFrame. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` or `None` | A concatenated DataFrame of all the CSVs if `csv_filename` is not provided. Otherwise, saves the DataFrame and returns None. |
### Example Usage
```python
# Example code demonstrating usage
concatenated_df = concatenate_csvs_in_directory('/path/to/directory', filter_string='data', csv_filename='output.csv')
```
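A minimal sketch of the recursive concatenation, assuming `os.walk` plus `pd.concat` (the helper name is hypothetical; the package's implementation may differ):

```python
import os
import pandas as pd

def concatenate_csvs_sketch(root_dir, filter_string=None,
                            file_extension="csv", csv_filename=None):
    """Sketch: walk root_dir recursively, read matching CSVs, concatenate."""
    frames = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith("." + file_extension):
                continue
            if filter_string is not None and filter_string not in name:
                continue
            frames.append(pd.read_csv(os.path.join(dirpath, name)))
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    if csv_filename:
        combined.to_csv(csv_filename, index=False)
        return None
    return combined
```

With `csv_filename` omitted, the combined DataFrame is returned directly, matching the documented return behavior.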
---
## 5. `aggregate_function`
### Description
General-purpose aggregation function.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `x` | `pd.Series` | Input series |
| `numeric_method`| `str` | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'. |
| `substitute` | `any` | Value to substitute when all values are NaN or mode is empty. Default is np.nan. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `any` | Aggregated value |
### Example Usage
```python
# Example code demonstrating usage
aggregated_value = aggregate_function(pd.Series([1, 2, 3, np.nan]), numeric_method='mean')
```
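The fallback logic described above (aggregate numerics by the chosen method, substitute when everything is NaN or the mode is empty) can be sketched like this; the `_sketch` helper is illustrative, not the package's code:

```python
import numpy as np
import pandas as pd

def aggregate_function_sketch(x, numeric_method="median", substitute=np.nan):
    """Sketch: aggregate a Series by median/mean/mode, falling back to
    `substitute` when no usable values remain."""
    if pd.api.types.is_numeric_dtype(x):
        x = x.dropna()
        if x.empty:
            return substitute
        if numeric_method == "mean":
            return x.mean()
        if numeric_method == "mode":
            mode = x.mode()
            return mode.iloc[0] if not mode.empty else substitute
        return x.median()
    # Non-numeric data: fall back to the most common value.
    mode = x.mode()
    return mode.iloc[0] if not mode.empty else substitute

print(aggregate_function_sketch(pd.Series([1, 2, 3, np.nan]), numeric_method="mean"))  # 2.0
```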
---
## 6. `aggregate_duplicates`
### Description
Aggregates duplicates in a dataframe based on specified grouping columns and a chosen aggregation method for numeric types.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| Input dataframe |
| `group_columns` | `list` | List of column names to group by |
| `numeric_method`| `str` | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'. |
| `substitute` | `any` | Value to substitute when all values are NaN or mode is empty. Default is np.nan. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| Dataframe with aggregated duplicates |
### Example Usage
```python
# Example code demonstrating usage
aggregated_df = aggregate_duplicates(df, group_columns=['category'], numeric_method='mean')
```
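Conceptually this is a `groupby` whose per-column aggregator mirrors `aggregate_function`. A minimal sketch (illustrative only; the real function presumably delegates to `aggregate_function`):

```python
import numpy as np
import pandas as pd

def aggregate_duplicates_sketch(df, group_columns, numeric_method="median",
                                substitute=np.nan):
    """Sketch: collapse rows sharing the same group key, aggregating numeric
    columns by the chosen method and taking the first value otherwise."""
    def agg(col):
        if pd.api.types.is_numeric_dtype(col):
            col = col.dropna()
            if col.empty:
                return substitute
            if numeric_method == "mode":
                return col.mode().iloc[0]
            return getattr(col, numeric_method)()
        return col.iloc[0]
    return df.groupby(group_columns, as_index=False).agg(agg)

df = pd.DataFrame({"category": ["a", "a", "b"], "v": [1.0, 3.0, 5.0]})
print(aggregate_duplicates_sketch(df, ["category"], numeric_method="mean"))
```

The two `"a"` rows collapse into one row with `v = 2.0`.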
---
## 7. `generate_subcategories`
### Description
Generate subcategories by combining values from specified columns.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `columns` | `list` | List of columns to generate subcategories from. |
| `col_separator` | `str` | Separator to use between column names for new columns. Defaults to '_'. |
| `val_separator` | `str` | Separator to use between values when combining. Defaults to ' '. |
| `missing_val` | `str` | Value to replace missing data in specified columns. Defaults to 'NA'. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| DataFrame with new subcategory columns. |
| `list` | List of all column names (original + generated). |
### Example Usage
```python
# Example code demonstrating usage
df, all_columns = generate_subcategories(df, columns=['col1', 'col2'])
```
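One plausible reading of "combining values from specified columns" is a pairwise combination of the listed columns; the sketch below assumes that scheme (the actual combination logic may differ), using `itertools` as the script does:

```python
from itertools import combinations

import pandas as pd

def generate_subcategories_sketch(df, columns, col_separator="_",
                                  val_separator=" ", missing_val="NA"):
    """Sketch: add one new column per pair of input columns, joining the
    (missing-filled) values with val_separator."""
    df = df.copy()
    for a, b in combinations(columns, 2):
        name = a + col_separator + b
        df[name] = (df[a].fillna(missing_val).astype(str)
                    + val_separator
                    + df[b].fillna(missing_val).astype(str))
    return df, list(df.columns)

df = pd.DataFrame({"col1": ["x", "y"], "col2": ["1", None]})
out, cols = generate_subcategories_sketch(df, ["col1", "col2"])
print(out["col1_col2"].tolist())  # ['x 1', 'y NA']
```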
---
## 8. `generate_subcategories_with_proportions`
### Description
Generate subcategories by combining values from specified columns and calculate their proportions.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `columns` | `list` | List of columns to generate subcategories from. |
| `solo_columns` | `list` | List of columns to be considered on their own. |
| `col_separator` | `str` | Separator to use between column names for new columns. Defaults to '_'. |
| `val_separator` | `str` | Separator to use between values when combining. Defaults to ' '. |
| `missing_val` | `str` | Value to replace missing data in specified columns. Defaults to 'NA'. |
| `overall_category_name` | `str` | Name for the overall category. Defaults to 'overall'. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| DataFrame with new subcategory columns. |
| `list` | List of all column names (original + generated). |
| `pd.DataFrame`| DataFrame with subcategory and its decimal proportion. |
### Example Usage
```python
# Example code demonstrating usage
df, all_columns, proportions_df = generate_subcategories_with_proportions(df, columns=['col1', 'col2'], solo_columns=['col3'])
```
---
## 9. `bin_continuous`
### Description
Bins continuous data in a specified column of a dataframe.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `dataframe` | `pd.DataFrame`| The input dataframe. |
| `column_name` | `str` | The name of the column containing continuous data to be binned. |
| `bin_size` | `int` | The size of each bin. Default is 10. |
| `range_start` | `int` | The starting value of the range for binning. Default is 0. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The dataframe with an additional column for binned data. |
### Example Usage
```python
# Example code demonstrating usage
binned_df = bin_continuous(df, column_name='age', bin_size=5)
```
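Fixed-width binning of this kind maps naturally onto `pd.cut`. A minimal sketch under that assumption (hypothetical helper name; bin labels may differ from the package's):

```python
import numpy as np
import pandas as pd

def bin_continuous_sketch(dataframe, column_name, bin_size=10, range_start=0):
    """Sketch: cut a continuous column into fixed-width, left-closed bins
    starting at range_start, storing the result in a new *_binned column."""
    df = dataframe.copy()
    # Extend the edges past the column maximum so every value falls in a bin.
    edges = np.arange(range_start,
                      df[column_name].max() + 2 * bin_size,
                      bin_size)
    df[column_name + "_binned"] = pd.cut(df[column_name], bins=edges, right=False)
    return df

df = pd.DataFrame({"age": [3, 7, 12, 18]})
binned = bin_continuous_sketch(df, "age", bin_size=5)
```

With `bin_size=5`, age 3 lands in `[0, 5)` and age 18 in `[15, 20)`.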
---
## 10. `load_latest_yyyymmdd_file`
### Description
Load the latest file from the specified directory based on its date.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `directory` | `str` | Path to the directory containing the files. |
| `base_filename` | `str` | Base name of the file. |
| `file_extension`| `str` | File extension including the dot (e.g., '.csv'). |
| `na_values` | `list or dict`| Additional strings to recognize as NA/NaN. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The loaded data. |
### Example Usage
```python
# Example code demonstrating usage
df = load_latest_yyyymmdd_file("/path/to/directory", "data_", ".csv")
print(df.head())
```
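The function name implies filenames of the form `<base_filename>YYYYMMDD<file_extension>`; under that assumption, selecting the latest file reduces to taking the lexicographic maximum of the date stamps. A sketch (not the package's implementation):

```python
import os
import re

import pandas as pd

def load_latest_yyyymmdd_sketch(directory, base_filename, file_extension,
                                na_values=None):
    """Sketch: find files named <base_filename>YYYYMMDD<file_extension> and
    load the one with the most recent date stamp."""
    pattern = re.compile(re.escape(base_filename) + r"(\d{8})"
                         + re.escape(file_extension) + "$")
    dated = [(m.group(1), name)
             for name in os.listdir(directory)
             if (m := pattern.match(name))]
    if not dated:
        raise FileNotFoundError(
            f"no {base_filename}YYYYMMDD{file_extension} files in {directory}")
    # YYYYMMDD sorts correctly as a string, so max() picks the latest date.
    latest = max(dated)[1]
    return pd.read_csv(os.path.join(directory, latest), na_values=na_values)

# df = load_latest_yyyymmdd_sketch("/path/to/directory", "data_", ".csv")
```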
---
## 11. `remove_rows_with_na_threshold`
### Description
Removes rows from the dataframe that have a fraction of NA values greater than the specified threshold.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `stringified_ids`| `list` | List of column names to consider for NA value calculation. |
| `threshold` | `float` | The maximum allowed fraction of NA values; rows exceeding it are removed. Default is 0.5. |
| `save_starting_df`| `bool` | Whether to save the initial dataframe to a CSV file. Default is False. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The dataframe with rows removed based on the threshold. |
### Example Usage
```python
# Example code demonstrating usage
cleaned_df = remove_rows_with_na_threshold(df, stringified_ids=['col1', 'col2'], threshold=0.3)
```
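The row-filtering rule is fully specified, so it can be sketched directly (illustrative only; the CSV-saving side effect is omitted):

```python
import pandas as pd

def remove_rows_na_sketch(df, stringified_ids, threshold=0.5):
    """Sketch: drop rows whose NA fraction over the given columns
    exceeds the threshold."""
    na_fraction = df[stringified_ids].isna().mean(axis=1)
    return df[na_fraction <= threshold]
```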
---
## 12. `impute_na_in_columns`
### Description
Imputes NA values in columns with either the minimum or median value of the column.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `method` | `str` | Method for imputation. Either 'min' or 'median'. Default is 'median'. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The dataframe with NA values imputed. |
### Example Usage
```python
# Example code demonstrating usage
imputed_df = impute_na_in_columns(df, method='median')
```
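Column-wise imputation with the column minimum or median is a one-liner per column with `fillna`; a minimal sketch (hypothetical helper name):

```python
import pandas as pd

def impute_na_sketch(df, method="median"):
    """Sketch: fill NAs in each numeric column with that column's
    min or median; non-numeric columns are left untouched."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        fill = df[col].min() if method == "min" else df[col].median()
        df[col] = df[col].fillna(fill)
    return df
```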
---
Each function is meticulously crafted to handle specific tasks related to data manipulation, transformation, and analysis. This documentation provides a comprehensive understanding of the capabilities and usage of each function within the `main.py` script. Happy coding! 🎉
# phenome-utils
phenome-utils (imported as `phenome_utils`) is a Python package that provides utility functions for data manipulation and analysis, particularly focused on working with pandas DataFrames.
## Features
- Sum and sort DataFrame columns
- Generate binary threshold matrices
- Concatenate CSV files from directories
- Aggregate duplicates in DataFrames
- Generate subcategories from DataFrame columns
- Bin continuous data
- Load latest files based on date in filename
- Remove rows with NA values above a threshold
- Impute NA values in DataFrame columns
## Installation
You can install phenome-utils using pip:
```
pip install phenome-utils
```
## Usage
Here are some examples of how to use phenome-utils:
```python
import pandas as pd
from phenome_utils import sum_and_sort_columns, binary_threshold_matrix_by_col, aggregate_duplicates
# Example 1: Sum and sort columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [0, 0, 0],
    'C': [4, 5, 6],
    'D': ['x', 'y', 'z']
})
result = sum_and_sort_columns(df)
print(result)
# Example 2: Create a binary threshold matrix
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=25, upper_threshold=75)
print(binary_df)
# Example 3: Aggregate duplicates
aggregated_df = aggregate_duplicates(df, group_columns=['D'], numeric_method='mean')
print(aggregated_df)
```
For more detailed information on each function, please refer to the function docstrings in the source code.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.