# 📚 Python Script Documentation for `main.py`
Welcome to the documentation for the `main.py` file. This file contains a series of utility functions designed to manipulate, transform, and analyze pandas DataFrames. The main modules used in this script are pandas, numpy, itertools, and matplotlib.
## Installation
```
pip install phenome-utils
```
OR
```
pip install git+https://git.phenome.health/trent.leslie/phenome-utils
```
## 📑 Index
1. [sum_and_sort_columns](#1-sum_and_sort_columns)
2. [binary_threshold_matrix_by_col](#2-binary_threshold_matrix_by_col)
3. [binary_threshold_matrix_by_row](#3-binary_threshold_matrix_by_row)
4. [concatenate_csvs_in_directory](#4-concatenate_csvs_in_directory)
5. [aggregate_function](#5-aggregate_function)
6. [aggregate_duplicates](#6-aggregate_duplicates)
7. [generate_subcategories](#7-generate_subcategories)
8. [generate_subcategories_with_proportions](#8-generate_subcategories_with_proportions)
9. [bin_continuous](#9-bin_continuous)
10. [load_latest_yyyymmdd_file](#10-load_latest_yyyymmdd_file)
11. [remove_rows_with_na_threshold](#11-remove_rows_with_na_threshold)
12. [impute_na_in_columns](#12-impute_na_in_columns)
---
## 1. `sum_and_sort_columns`
### Description
Sum the numerical columns of a DataFrame, remove columns with a sum of zero, and sort the columns in descending order based on their sum. Optionally, plot a histogram of the non-zero column sums.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `DataFrame` | The input DataFrame. |
| `plot_histogram`| `bool` | Whether to plot a histogram of the non-zero column sums. Defaults to False. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` | A DataFrame with columns sorted in descending order based on their sum, zero-sum columns removed, and non-numeric columns preserved as the first columns. |
### Example Usage
```python
# Example code demonstrating usage
sorted_df = sum_and_sort_columns(df, plot_histogram=True)
```
### 📊 Visualization
If `plot_histogram` is set to True, a histogram of the non-zero column sums will be displayed.
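The described behavior can be sketched in a few lines of pandas. This is an illustrative approximation, not the package's implementation; the `_sketch` name is hypothetical.

```python
import pandas as pd

def sum_and_sort_columns_sketch(df, plot_histogram=False):
    """Sketch: drop zero-sum numeric columns, sort the rest by descending sum,
    and keep non-numeric columns first."""
    numeric = df.select_dtypes(include="number")
    non_numeric = df.select_dtypes(exclude="number")
    sums = numeric.sum()
    kept = sums[sums != 0].sort_values(ascending=False).index
    result = pd.concat([non_numeric, numeric[kept]], axis=1)
    if plot_histogram:
        # The real function uses matplotlib for this step.
        sums[sums != 0].plot.hist()
    return result

df = pd.DataFrame({"A": [1, 2], "B": [0, 0], "C": [5, 5], "D": ["x", "y"]})
print(sum_and_sort_columns_sketch(df).columns.tolist())  # ['D', 'C', 'A']
```

Column `B` is dropped because its sum is zero, and the string column `D` is moved to the front.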
---
## 2. `binary_threshold_matrix_by_col`
### Description
Convert a DataFrame's numerical columns to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
### Parameters
| Parameter | Type | Description |
|-------------------------------|---------------|------------------------------------------------------------------|
| `df` | `DataFrame` | The input DataFrame. |
| `lower_threshold` | `int` | The lower percentile threshold. Defaults to 1. |
| `upper_threshold` | `int` | The upper percentile threshold. Defaults to 99. |
| `second_df` | `DataFrame` | A second DataFrame with 'subcategory' and 'decimal_proportion' columns. |
| `decimal_proportion_threshold`| `float` | Threshold for filtering the second DataFrame. Defaults to 0.1. |
| `filter_column` | `str` | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` | A binary matrix where numerical column values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved. |
### Example Usage
```python
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
```
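The core percentile-flagging step can be sketched as follows. This is a simplified illustration that omits the `second_df` filtering; the helper name is hypothetical and the real implementation may differ.

```python
import pandas as pd

def binary_threshold_by_col_sketch(df, lower_threshold=1, upper_threshold=99):
    """Sketch: per numeric column, mark values outside the percentile
    bounds as 1 and values within them as 0; object columns pass through."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        lo = df[col].quantile(lower_threshold / 100)
        hi = df[col].quantile(upper_threshold / 100)
        out[col] = ((df[col] < lo) | (df[col] > hi)).astype(int)
    return out

df = pd.DataFrame({"v": [1, 50, 50, 50, 100], "g": list("abcde")})
res = binary_threshold_by_col_sketch(df, lower_threshold=10, upper_threshold=90)
print(res["v"].tolist())  # [1, 0, 0, 0, 1]
```

Only the extreme values (1 and 100) fall outside the 10th–90th percentile band, so only they are flagged.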
---
## 3. `binary_threshold_matrix_by_row`
### Description
Convert a DataFrame's numerical rows to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
### Parameters
| Parameter | Type | Description |
|-------------------------------|---------------|------------------------------------------------------------------|
| `df` | `DataFrame` | The input DataFrame. |
| `lower_threshold` | `int` | The lower percentile threshold. Defaults to 1. |
| `upper_threshold` | `int` | The upper percentile threshold. Defaults to 99. |
| `second_df` | `DataFrame` | A second DataFrame with 'subcategory' and 'decimal_proportion' columns. |
| `decimal_proportion_threshold`| `float` | Threshold for filtering the second DataFrame. Defaults to 0.1. |
| `filter_column` | `str` | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` | A binary matrix where numerical row values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved. |
### Example Usage
```python
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_row(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
```
---
## 4. `concatenate_csvs_in_directory`
### Description
Concatenate CSV files from a root directory and its subdirectories.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `root_dir` | `str` | The root directory to start the search. |
| `filter_string` | `str` | A string that must be in the filename to be included. Defaults to None. |
| `file_extension`| `str` | The file extension to search for. Defaults to "csv". |
| `csv_filename` | `str` | The filename to save the concatenated CSV. If not provided, returns the DataFrame. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `DataFrame` or `None` | A concatenated DataFrame of all the CSVs if `csv_filename` is not provided. Otherwise, saves the DataFrame and returns None. |
### Example Usage
```python
# Example code demonstrating usage
concatenated_df = concatenate_csvs_in_directory('/path/to/directory', filter_string='data', csv_filename='output.csv')
```
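A minimal sketch of the recursive concatenation, assuming `os.walk` plus `pd.concat` (the helper name is hypothetical; the package's implementation may differ):

```python
import os
import pandas as pd

def concatenate_csvs_sketch(root_dir, filter_string=None,
                            file_extension="csv", csv_filename=None):
    """Sketch: walk root_dir recursively, read matching CSVs, concatenate."""
    frames = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith("." + file_extension):
                continue
            if filter_string is not None and filter_string not in name:
                continue
            frames.append(pd.read_csv(os.path.join(dirpath, name)))
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    if csv_filename:
        combined.to_csv(csv_filename, index=False)
        return None
    return combined
```

With `csv_filename` omitted, the combined DataFrame is returned directly, matching the documented return behavior.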
---
## 5. `aggregate_function`
### Description
General-purpose aggregation function.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `x` | `pd.Series` | Input series |
| `numeric_method`| `str` | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'. |
| `substitute` | `any` | Value to substitute when all values are NaN or mode is empty. Default is np.nan. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `any` | Aggregated value |
### Example Usage
```python
# Example code demonstrating usage
aggregated_value = aggregate_function(pd.Series([1, 2, 3, np.nan]), numeric_method='mean')
```
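The fallback logic described above (aggregate numerics by the chosen method, substitute when everything is NaN or the mode is empty) can be sketched like this; the `_sketch` helper is illustrative, not the package's code:

```python
import numpy as np
import pandas as pd

def aggregate_function_sketch(x, numeric_method="median", substitute=np.nan):
    """Sketch: aggregate a Series by median/mean/mode, falling back to
    `substitute` when no usable values remain."""
    if pd.api.types.is_numeric_dtype(x):
        x = x.dropna()
        if x.empty:
            return substitute
        if numeric_method == "mean":
            return x.mean()
        if numeric_method == "mode":
            mode = x.mode()
            return mode.iloc[0] if not mode.empty else substitute
        return x.median()
    # Non-numeric data: fall back to the most common value.
    mode = x.mode()
    return mode.iloc[0] if not mode.empty else substitute

print(aggregate_function_sketch(pd.Series([1, 2, 3, np.nan]), numeric_method="mean"))  # 2.0
```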
---
## 6. `aggregate_duplicates`
### Description
Aggregates duplicates in a dataframe based on specified grouping columns and a chosen aggregation method for numeric types.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| Input dataframe |
| `group_columns` | `list` | List of column names to group by |
| `numeric_method`| `str` | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'. |
| `substitute` | `any` | Value to substitute when all values are NaN or mode is empty. Default is np.nan. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| Dataframe with aggregated duplicates |
### Example Usage
```python
# Example code demonstrating usage
aggregated_df = aggregate_duplicates(df, group_columns=['category'], numeric_method='mean')
```
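Conceptually this is a `groupby` whose per-column aggregator mirrors `aggregate_function`. A minimal sketch (illustrative only; the real function presumably delegates to `aggregate_function`):

```python
import numpy as np
import pandas as pd

def aggregate_duplicates_sketch(df, group_columns, numeric_method="median",
                                substitute=np.nan):
    """Sketch: collapse rows sharing the same group key, aggregating numeric
    columns by the chosen method and taking the first value otherwise."""
    def agg(col):
        if pd.api.types.is_numeric_dtype(col):
            col = col.dropna()
            if col.empty:
                return substitute
            if numeric_method == "mode":
                return col.mode().iloc[0]
            return getattr(col, numeric_method)()
        return col.iloc[0]
    return df.groupby(group_columns, as_index=False).agg(agg)

df = pd.DataFrame({"category": ["a", "a", "b"], "v": [1.0, 3.0, 5.0]})
print(aggregate_duplicates_sketch(df, ["category"], numeric_method="mean"))
```

The two `"a"` rows collapse into one row with `v = 2.0`.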
---
## 7. `generate_subcategories`
### Description
Generate subcategories by combining values from specified columns.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `columns` | `list` | List of columns to generate subcategories from. |
| `col_separator` | `str` | Separator to use between column names for new columns. Defaults to '_'. |
| `val_separator` | `str` | Separator to use between values when combining. Defaults to ' '. |
| `missing_val` | `str` | Value to replace missing data in specified columns. Defaults to 'NA'. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| DataFrame with new subcategory columns. |
| `list` | List of all column names (original + generated). |
### Example Usage
```python
# Example code demonstrating usage
df, all_columns = generate_subcategories(df, columns=['col1', 'col2'])
```
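One plausible reading of "combining values from specified columns" is a pairwise combination of the listed columns; the sketch below assumes that scheme (the actual combination logic may differ), using `itertools` as the script does:

```python
from itertools import combinations

import pandas as pd

def generate_subcategories_sketch(df, columns, col_separator="_",
                                  val_separator=" ", missing_val="NA"):
    """Sketch: add one new column per pair of input columns, joining the
    (missing-filled) values with val_separator."""
    df = df.copy()
    for a, b in combinations(columns, 2):
        name = a + col_separator + b
        df[name] = (df[a].fillna(missing_val).astype(str)
                    + val_separator
                    + df[b].fillna(missing_val).astype(str))
    return df, list(df.columns)

df = pd.DataFrame({"col1": ["x", "y"], "col2": ["1", None]})
out, cols = generate_subcategories_sketch(df, ["col1", "col2"])
print(out["col1_col2"].tolist())  # ['x 1', 'y NA']
```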
---
## 8. `generate_subcategories_with_proportions`
### Description
Generate subcategories by combining values from specified columns and calculate their proportions.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `columns` | `list` | List of columns to generate subcategories from. |
| `solo_columns` | `list` | List of columns to be considered on their own. |
| `col_separator` | `str` | Separator to use between column names for new columns. Defaults to '_'. |
| `val_separator` | `str` | Separator to use between values when combining. Defaults to ' '. |
| `missing_val` | `str` | Value to replace missing data in specified columns. Defaults to 'NA'. |
| `overall_category_name` | `str` | Name for the overall category. Defaults to 'overall'. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| DataFrame with new subcategory columns. |
| `list` | List of all column names (original + generated). |
| `pd.DataFrame`| DataFrame with subcategory and its decimal proportion. |
### Example Usage
```python
# Example code demonstrating usage
df, all_columns, proportions_df = generate_subcategories_with_proportions(df, columns=['col1', 'col2'], solo_columns=['col3'])
```
---
## 9. `bin_continuous`
### Description
Bins continuous data in a specified column of a dataframe.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `dataframe` | `pd.DataFrame`| The input dataframe. |
| `column_name` | `str` | The name of the column containing continuous data to be binned. |
| `bin_size` | `int` | The size of each bin. Default is 10. |
| `range_start` | `int` | The starting value of the range for binning. Default is 0. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The dataframe with an additional column for binned data. |
### Example Usage
```python
# Example code demonstrating usage
binned_df = bin_continuous(df, column_name='age', bin_size=5)
```
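Fixed-width binning of this kind maps naturally onto `pd.cut`. A minimal sketch under that assumption (hypothetical helper name; bin labels may differ from the package's):

```python
import numpy as np
import pandas as pd

def bin_continuous_sketch(dataframe, column_name, bin_size=10, range_start=0):
    """Sketch: cut a continuous column into fixed-width, left-closed bins
    starting at range_start, storing the result in a new *_binned column."""
    df = dataframe.copy()
    # Extend the edges past the column maximum so every value falls in a bin.
    edges = np.arange(range_start,
                      df[column_name].max() + 2 * bin_size,
                      bin_size)
    df[column_name + "_binned"] = pd.cut(df[column_name], bins=edges, right=False)
    return df

df = pd.DataFrame({"age": [3, 7, 12, 18]})
binned = bin_continuous_sketch(df, "age", bin_size=5)
```

With `bin_size=5`, age 3 lands in `[0, 5)` and age 18 in `[15, 20)`.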
---
## 10. `load_latest_yyyymmdd_file`
### Description
Load the latest file from the specified directory based on its date.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `directory` | `str` | Path to the directory containing the files. |
| `base_filename` | `str` | Base name of the file. |
| `file_extension`| `str` | File extension including the dot (e.g., '.csv'). |
| `na_values` | `list or dict`| Additional strings to recognize as NA/NaN. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The loaded data. |
### Example Usage
```python
# Example code demonstrating usage
df = load_latest_yyyymmdd_file("/path/to/directory", "data_", ".csv")
print(df.head())
```
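The function name implies filenames of the form `<base_filename>YYYYMMDD<file_extension>`; under that assumption, selecting the latest file reduces to taking the lexicographic maximum of the date stamps. A sketch (not the package's implementation):

```python
import os
import re

import pandas as pd

def load_latest_yyyymmdd_sketch(directory, base_filename, file_extension,
                                na_values=None):
    """Sketch: find files named <base_filename>YYYYMMDD<file_extension> and
    load the one with the most recent date stamp."""
    pattern = re.compile(re.escape(base_filename) + r"(\d{8})"
                         + re.escape(file_extension) + "$")
    dated = [(m.group(1), name)
             for name in os.listdir(directory)
             if (m := pattern.match(name))]
    if not dated:
        raise FileNotFoundError(
            f"no {base_filename}YYYYMMDD{file_extension} files in {directory}")
    # YYYYMMDD sorts correctly as a string, so max() picks the latest date.
    latest = max(dated)[1]
    return pd.read_csv(os.path.join(directory, latest), na_values=na_values)

# df = load_latest_yyyymmdd_sketch("/path/to/directory", "data_", ".csv")
```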
---
## 11. `remove_rows_with_na_threshold`
### Description
Removes rows from the dataframe that have a fraction of NA values greater than the specified threshold.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `stringified_ids`| `list` | List of column names to consider for NA value calculation. |
| `threshold` | `float` | The maximum allowed fraction of NA values; rows exceeding it are removed. Default is 0.5. |
| `save_starting_df`| `bool` | Whether to save the initial dataframe to a CSV file. Default is False. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The dataframe with rows removed based on the threshold. |
### Example Usage
```python
# Example code demonstrating usage
cleaned_df = remove_rows_with_na_threshold(df, stringified_ids=['col1', 'col2'], threshold=0.3)
```
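The row-filtering rule is fully specified, so it can be sketched directly (illustrative only; the CSV-saving side effect is omitted):

```python
import pandas as pd

def remove_rows_na_sketch(df, stringified_ids, threshold=0.5):
    """Sketch: drop rows whose NA fraction over the given columns
    exceeds the threshold."""
    na_fraction = df[stringified_ids].isna().mean(axis=1)
    return df[na_fraction <= threshold]
```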
---
## 12. `impute_na_in_columns`
### Description
Imputes NA values in columns with either the minimum or median value of the column.
### Parameters
| Parameter | Type | Description |
|-----------------|---------------|------------------------------------------------------------------|
| `df` | `pd.DataFrame`| The input dataframe. |
| `method` | `str` | Method for imputation. Either 'min' or 'median'. Default is 'median'. |
### Returns
| Type | Description |
|---------------|------------------------------------------------------------------|
| `pd.DataFrame`| The dataframe with NA values imputed. |
### Example Usage
```python
# Example code demonstrating usage
imputed_df = impute_na_in_columns(df, method='median')
```
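Column-wise imputation with the column minimum or median is a one-liner per column with `fillna`; a minimal sketch (hypothetical helper name):

```python
import pandas as pd

def impute_na_sketch(df, method="median"):
    """Sketch: fill NAs in each numeric column with that column's
    min or median; non-numeric columns are left untouched."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        fill = df[col].min() if method == "min" else df[col].median()
        df[col] = df[col].fillna(fill)
    return df
```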
---
Each function is meticulously crafted to handle specific tasks related to data manipulation, transformation, and analysis. This documentation provides a comprehensive understanding of the capabilities and usage of each function within the `main.py` script. Happy coding! 🎉
# phenome-utils
phenome-utils (imported as `phenome_utils`) is a Python package that provides utility functions for data manipulation and analysis, particularly focused on working with pandas DataFrames.
## Features
- Sum and sort DataFrame columns
- Generate binary threshold matrices
- Concatenate CSV files from directories
- Aggregate duplicates in DataFrames
- Generate subcategories from DataFrame columns
- Bin continuous data
- Load latest files based on date in filename
- Remove rows with NA values above a threshold
- Impute NA values in DataFrame columns
## Installation
You can install phenome-utils using pip:
```
pip install phenome-utils
```
## Usage
Here are some examples of how to use phenome-utils:
```python
import pandas as pd
from phenome_utils import sum_and_sort_columns, binary_threshold_matrix_by_col, aggregate_duplicates
# Example 1: Sum and sort columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [0, 0, 0],
    'C': [4, 5, 6],
    'D': ['x', 'y', 'z']
})
result = sum_and_sort_columns(df)
print(result)
# Example 2: Create a binary threshold matrix
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=25, upper_threshold=75)
print(binary_df)
# Example 3: Aggregate duplicates
aggregated_df = aggregate_duplicates(df, group_columns=['D'], numeric_method='mean')
print(aggregated_df)
```
For more detailed information on each function, please refer to the function docstrings in the source code.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.