[![PyPI version](https://badge.fury.io/py/gitlabds.svg)](https://badge.fury.io/py/gitlabds)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# gitlabds

## What is it?
gitlabds is a Python toolkit that streamlines machine learning workflows with specialized functions for data preparation, feature engineering, model evaluation, and deployment. It helps data scientists focus on modeling by providing consistent patterns for both experimentation and production pipelines.

## Installation

```bash
pip install gitlabds
```

### Requirements

- Python 3.10 or later
- Core dependencies:
   - pandas>=1.5.3
   - numpy>=1.23.5
   - scipy>=1.13.1
   - scikit-learn>=1.1.1
   - imbalanced-learn>=0.9.1
   - seaborn>=0.13.2
   - shap>=0.41.0
   - tqdm>=4.66.2

## Main Features by Category

### Data Preparation

#### Outlier Detection and Treatment
<details><summary> MAD Outliers </summary>

#### Description
Median Absolute Deviation for outlier detection and correction. By default, it will winsorize all numeric values in your dataframe that are more than 4 standard deviations above or below the median (the `threshold` parameter).

`gitlabds.mad_outliers(df, dv=None, min_levels=10, columns='all', threshold=4.0, auto_adjust_skew=False, verbose=True, windsor_threshold=0.01):`

#### Parameters:
- **_df_** : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being winsorized. May be left blank if there is no outcome variable.
- **_min_levels_** : Only include columns that have at least the number of levels specified.
- **_columns_** : Will examine all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' parameters.
- **_threshold_** : Winsorize values greater than this number of standard deviations from the median.
- **_auto_adjust_skew_** : Whether to adjust thresholds based on column skewness.
- **_verbose_** : Set to `True` to print outputs of winsorizing being done. Set to `False` to suppress.
- **_windsor_threshold_** : Only winsorize values that affect less than this percentage of the population.

#### Returns
- Tuple containing:
  - The transformed DataFrame with outliers winsorized
  - Dictionary of outlier limits that can be used with apply_outliers()
	
#### Examples:
		
```python
# Create a new df; only winsorize selected columns; suppress verbose
import gitlabds
new_df, outlier_limits = gitlabds.mad_outliers(df=my_df, dv='my_outcome', columns=['colA', 'colB', 'colC'], verbose=False)
```
```python
# Winsorize values with skew adjustment for highly skewed data
new_df, outlier_limits = gitlabds.mad_outliers(df=my_df, threshold=3.0, auto_adjust_skew=True)
```
</details>

<details><summary> Apply Outliers </summary>

#### Description
Apply previously determined outlier limits to a dataframe. This is typically used to apply the same outlier treatment to new data that was applied during model training.

`gitlabds.apply_outliers(df, outlier_limits):`

#### Parameters:
- **_df_** : The dataframe to transform
- **_outlier_limits_** : dictionary of outlier limits previously generated by mad_outliers()

#### Returns
- DataFrame with outlier limits applied.
	
#### Examples:
		
```python
# Find outliers in training data
train_df, outlier_limits = gitlabds.mad_outliers(df=train_data, dv='target', threshold=3.0)

# Apply same outlier limits to test data
test_df_transformed = gitlabds.apply_outliers(df=test_data, outlier_limits=outlier_limits)
```
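
Because the returned limits are a plain Python dictionary, they can be persisted alongside the model and reloaded at scoring time. A minimal sketch, assuming the dictionary contains only JSON-serializable values (`new_data` is a hypothetical scoring dataframe):

```python
import json

# Save limits produced during training (assumes JSON-serializable contents)
with open("outlier_limits.json", "w") as f:
    json.dump(outlier_limits, f)

# Reload at scoring time and apply to incoming data
with open("outlier_limits.json") as f:
    loaded_limits = json.load(f)
scored_df = gitlabds.apply_outliers(df=new_data, outlier_limits=loaded_limits)
```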
</details>

#### Missing Value Handling
<details><summary> Missing Values </summary>

#### Description
Detect and optionally fill missing values in a DataFrame, with support for various filling methods and detailed reporting.

`gitlabds.missing_values(df, threshold=0.0, method=None, columns="all", constant_value=None, verbose=True, operation="both")`

#### Parameters:
- **_df_** : Your pandas dataframe
- **_threshold_** : The percent of missing values at which a column is considered for processing. For example, threshold=0.10 will only process columns with more than 10% missing values.
- **_method_** : Method to fill missing values or dictionary mapping columns to methods. Options:
  - "mean": Fill with column mean (numeric only)
  - "median": Fill with column median (numeric only)
  - "zero": Fill with 0
  - "constant": Fill with the value specified in constant_value
  - "random": Fill with random values sampled from the column's distribution
  - "drop_column": Remove columns with missing values
  - "drop_row": Remove rows with any missing values in specified columns
- **_columns_** : Columns to check and/or fill. If "all", processes all columns with missing values.
- **_constant_value_** : Value to use when method="constant" or when specified columns use the constant method.
- **_verbose_** : Whether to print detailed information about missing values and filling operations.
- **_operation_** : Operation mode:
  - "check": Only check for missing values, don't fill
  - "fill": Fill missing values and return filled dataframe
  - "both": Check and fill missing values (default)

#### Returns
- If operation="check": List of column names with missing values (or None)
- If operation="fill" or "both": Tuple containing:
  - DataFrame with missing values handled
  - Dictionary with missing value information that can be used with apply_missing_fill()
    
#### Examples:
```python
# Just check for missing values
missing_columns = gitlabds.missing_values(df, threshold=0.05, operation="check")

# Fill all columns with mean value
df_filled, missing_info = gitlabds.missing_values(df, method="mean")

# Fill different columns with different methods
df_filled, missing_info = gitlabds.missing_values(
    df, 
    method={"numeric_col": "median", "string_col": "constant"},
    constant_value="Unknown",
    verbose=True
)
```
</details>

<details><summary> Apply Missing Values </summary>

#### Description
Apply previously determined missing value handling to a dataframe.

`gitlabds.apply_missing_values(df, missing_info):`

#### Parameters:
- **_df_** : The dataframe to transform
- **_missing_info_** : Dictionary of missing value information previously generated by `missing_values()`

#### Returns
- DataFrame with missing values handled according to the provided information.
   
#### Examples:
```python
# Generate missing value info from training data
_, missing_info = gitlabds.missing_values(train_df, method="mean")
   
# Apply to test data
test_df_filled = gitlabds.apply_missing_values(test_df, missing_info)
```
</details>

#### Feature Engineering
<details><summary> Dummy Code </summary>

#### Description
Dummy code (AKA "one-hot encode") categorical and numeric columns based on the parameters specified below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not.

`gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels=20, numeric_max_levels=10, dummy_na=False, prefix_sep="_dummy_", verbose=True):`

#### Parameters:
- **_df_** : Your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- **_columns_** : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names. 
- **_categorical_** : Set to `True` to attempt to dummy code any categorical column passed via the `columns` parameter.
- **_numeric_** : Set to `True` to attempt to dummy code any numeric column passed via the `columns` parameter.
- **_categorical_max_levels_** : Maximum number of levels a categorical column can have to be eligible for dummy coding.
- **_numeric_max_levels_** : Maximum number of levels a numeric column can have to be eligible for dummy coding.
- **_dummy_na_** : Set to `True` to create a dummy coded column for missing values.
- **_prefix_sep_** : String to use as separator between column name and value in dummy column names. Default is "_dummy_".
- **_verbose_** : Set to `True` to print outputs of dummy coding being done. Set to `False` to suppress.

#### Returns
- A tuple containing:
  - The transformed DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.
  - A dictionary containing information about dummy coding that can be used with `apply_dummy()` to transform new data consistently.

#### Examples:
		
```python
# Dummy code only categorical columns with a maximum of 30 levels; suppress verbose output
import gitlabds
new_df, dummy_dict = gitlabds.dummy_code(
    df=my_df, 
    dv='my_outcome', 
    columns='all', 
    categorical=True, 
    numeric=False, 
    categorical_max_levels=30, 
    verbose=False
)
```

```python
# Dummy code with custom separator
new_df, dummy_dict = gitlabds.dummy_code(
    df=my_df, 
    columns=['colA', 'colB', 'colC'], 
    categorical=True, 
    numeric=True, 
    prefix_sep="_is_"
)
```
</details>

<details><summary> Dummy Top </summary>

#### Description
Dummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.

`gitlabds.dummy_top(df, dv=None, columns='all', min_threshold=0.05, drop_categorical=True, prefix_sep="_dummy_", verbose=True):`

#### Parameters:
- **_df_** : Your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- **_columns_** : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names. 
- **_min_threshold_**: The threshold at which levels will be dummy coded. For example, the default value of `0.05` will dummy code any categorical level that is in at least 5% of all rows.
- **_drop_categorical_**: Set to `True` to drop categorical columns after they are considered for dummy coding. Set to `False` to keep the original categorical columns in the dataframe.
- **_prefix_sep_** : String to use as separator between column name and value in dummy column names. Default is "_dummy_".
- **_verbose_** : Set to `True` to print detailed list of all dummy columns being created. Set to `False` to suppress.

#### Returns
- A tuple containing:
  - The transformed DataFrame with dummy-coded columns for high-frequency values.
  - A dictionary containing information about dummy coding that can be used with `apply_dummy()` to transform new data consistently.

#### Examples:
		
```python
# Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows
import gitlabds
new_df, dummy_top_dict = gitlabds.dummy_top(
    df=my_df, 
    dv='my_outcome', 
    columns='all', 
    min_threshold=0.05, 
    drop_categorical=True, 
    verbose=True
)
```

```python
# Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; 
# suppress verbose printout and retain original categorical columns
new_df, dummy_top_dict = gitlabds.dummy_top(
    df=my_df, 
    dv='my_outcome', 
    columns=['colA', 'colB', 'colC'], 
    min_threshold=0.10, 
    drop_categorical=False, 
    verbose=False
)
```
</details>

<details><summary> Apply Dummy </summary>

#### Description
Apply previously determined dummy coding to a new dataframe. This is typically used to apply the same dummy coding to new data that was created during model training.

`gitlabds.apply_dummy(df, dummy_info, drop_original=False):`

#### Parameters:
- **_df_** : The dataframe to transform
- **_dummy_info_** : Dictionary of dummy coding information previously generated by `dummy_code()` or `dummy_top()`
- **_drop_original_** : Whether to drop the original columns after dummy coding. Default is `False`.

#### Returns
- DataFrame with dummy coding applied according to the provided information.
	
#### Examples:
		
```python
# Generate dummy coding information from training data
train_df, dummy_info = gitlabds.dummy_code(df=train_data, dv='target')

# Apply to test data
test_df_transformed = gitlabds.apply_dummy(
    df=test_data, 
    dummy_info=dummy_info
)
```
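
The same call accepts the dictionary produced by `dummy_top()`. For example, applying top-level dummy coding to new data while keeping the original categorical columns (`new_data` is a hypothetical scoring dataframe):

```python
# Apply dummy coding captured by dummy_top() to new data, retaining originals
new_df_transformed = gitlabds.apply_dummy(
    df=new_data,
    dummy_info=dummy_top_dict,
    drop_original=False
)
```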
</details>

#### Feature Selection
<details><summary> Remove Low Variation </summary>

#### Description
Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.

`gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, verbose=True):`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
- **_columns_** : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- **_threshold_**: The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of `0.98` will drop any column where one value is present in more than 98% of rows.
- **_verbose_** : Set to `True` to print outputs of columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with low variation columns dropped.

#### Examples:
```python
# Drop any columns (except for the outcome) where one value is present in more than 95% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
```
```python
# Drop any of the selected columns where one value is present in more than 99% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv=None, columns=['colA', 'colB', 'colC'], threshold=.99)
```
</details>

<details><summary> Correlation Reduction </summary>

#### Description
Reduce the number of columns on a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped.

`gitlabds.correlation_reduction(df=None, dv=None, threshold=0.9, method="pearson", verbose=True):`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dropped. If provided, when choosing between correlated features, the one with higher correlation to the target will be kept.
- **_threshold_**: The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. For example, the default value of `0.90` will identify columns that have correlations greater than 90% to each other and drop one of those columns.
- **_method_**: The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
- **_verbose_** : Set to `True` to print outputs of columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with redundant correlated columns dropped.

#### Examples:
```python
# Perform column reduction via correlation using a threshold of 95%, excluding the outcome column.
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold=0.95, method="pearson")
```
```python
# Perform column reduction using Spearman rank correlation with a threshold of 90%.
new_df = gitlabds.correlation_reduction(df=my_df, dv=None, threshold=0.90, method="spearman")
```
</details>

<details><summary> Remove Outcome Proxies </summary>

#### Description
Remove columns that are highly correlated with the outcome (target) column.

`gitlabds.remove_outcome_proxies(df, dv, threshold=.8, method="pearson", verbose=True):`

#### Parameters:
- _**df**_ : your pandas dataframe
- _**dv**_ : The column name of your outcome.    
- _**threshold**_ : The correlation value to the outcome above which columns will be dropped. For example, the default value of `0.80` will identify and drop columns that have correlations greater than 80% to the outcome.
- **_method_**: The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
- **_verbose_** : Set to `True` to print outputs of columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with outcome proxy columns dropped.

#### Examples:
```python
# Drop columns with correlations to the outcome greater than 70%
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.7)    
```
```python
# Drop columns with correlations to the outcome greater than 80% using Spearman correlation
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.8, method="spearman")        
```
</details>

<details><summary> Drop Categorical </summary>

#### Description
Drop all categorical columns from the dataframe. A useful step before regression modeling, as categorical variables cannot be used directly.

`gitlabds.drop_categorical(df):`

#### Parameters:
- _**df**_ : your pandas dataframe

#### Returns
- DataFrame with categorical columns dropped.

#### Examples:
```python
# Dropping categorical columns
new_df = gitlabds.drop_categorical(df=my_df) 
```
</details>

#### Memory Optimization
<details><summary> Memory Optimization </summary>

#### Description
Apply multiple memory optimization techniques to dramatically reduce DataFrame memory usage.

`gitlabds.memory_optimization(df, apply_numeric_downcasting=True, apply_categorical=True, apply_sparse=True, precision_mode='balanced', verbose=True, exclude_columns=None, **kwargs):`

#### Parameters:
- **_df_** : Input pandas dataframe to optimize
- **_apply_numeric_downcasting_** : Whether to downcast numeric columns to smaller data types. Defaults to `True`.
- **_apply_categorical_** : Whether to convert string columns to categorical when beneficial. Defaults to `True`.
- **_apply_sparse_** : Whether to apply sparse encoding for columns with many repeated values. Defaults to `True`.
- **_precision_mode_**: str, default="balanced"
        Controls aggressiveness of numeric downcasting:
        - "aggressive": Maximum memory savings, may affect precision
        - "balanced": Good memory savings while preserving most precision
        - "safe": Conservative downcasting to preserve numeric precision
- **_verbose_** : Whether to print progress and memory statistics. Defaults to `True`.
- **_exclude_columns_** : List of columns to exclude from optimization. Defaults to `None`.
- **_**kwargs_** : Additional arguments for optimization techniques

#### Returns
- Memory-optimized pandas DataFrame.

#### Examples:
        
```python
# Basic optimization with default settings
import gitlabds
df_optimized = gitlabds.memory_optimization(df)
```
```python
# Customize optimization approach
df_optimized = gitlabds.memory_optimization(
    df,
    apply_numeric_downcasting=True,
    apply_categorical=True,
    apply_sparse=False,  # Skip sparse encoding
    precision_mode='safe',
    exclude_columns=['id', 'timestamp'],
    verbose=True
)
```
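
To verify the savings, compare pandas' own memory accounting before and after optimization; this check uses only standard pandas:

```python
# Measure memory footprint before and after optimization
before_mb = df.memory_usage(deep=True).sum() / 1024**2
df_optimized = gitlabds.memory_optimization(df, verbose=False)
after_mb = df_optimized.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {before_mb:.1f} MB -> {after_mb:.1f} MB")
```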
</details>

### Model Development

#### Data Splitting and Sampling
<details><summary> Split Data </summary>

#### Description
This function splits your data into train and test datasets, separating the outcome from the rest of the file. It supports stratified sampling, balanced upsampling for imbalanced datasets, and provides model weights for compensating sampling adjustments.

`gitlabds.split_data(df, train_pct=0.7, dv=None, dv_threshold=0.0, random_state=5435, stratify=True, sampling_strategy=None, shuffle=True, verbose=True):`

#### Parameters:
- **_df_** : your pandas dataframe
- **_train_pct_** : The percentage of rows randomly assigned to the training dataset. Defaults to 0.7 (70% train, 30% test).
- **_dv_** : The column name of your outcome. If None, the function will return the entire dataframe split without separating features and target.
- **_dv_threshold_** : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
- **_random_state_** : Random seed to use for splitting dataframe and for up-sampling (if needed).
- **_stratify_** : Controls stratified sampling. If True and dv is provided, stratifies by the outcome variable. If a list of column names, stratifies by those columns. If False, does not use stratified sampling.
- **_sampling_strategy_** : Sampling strategy for imbalanced data. If None, will use dv_threshold. See imblearn documentation for more details on acceptable values.
- **_shuffle_** : Whether to shuffle the data before splitting.
- **_verbose_** : Whether to print information about the splitting process.

#### Returns
- A tuple containing:
  - x_train: Training features DataFrame
  - y_train: Training target Series (if dv is provided, otherwise empty Series)
  - x_test: Testing features DataFrame 
  - y_test: Testing target Series (if dv is provided, otherwise empty Series)
  - model_weights: List of weights to use for modeling [negative_class_weight, positive_class_weight]
    
#### Examples:
        
```python
# Basic split with default parameters (70% train, 30% test)
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df, 
    dv='my_outcome'
)
```

```python
# Split with 80% training data and balancing for imbalanced target
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df, 
    dv='my_outcome', 
    train_pct=0.80, 
    dv_threshold=0.3
)
```

```python
# Split with stratification on multiple variables
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df, 
    dv='my_outcome',
    stratify=['my_outcome', 'region', 'customer_segment']
)
```

```python
# Split entire dataframe without separating target
train_df, _, test_df, _, _ = gitlabds.split_data(
    df=my_df, 
    dv=None, 
    train_pct=0.75
)
```
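
The returned `model_weights` list is ordered `[negative_class_weight, positive_class_weight]`, so it can be mapped onto estimators that accept per-class weights. A hedged sketch, assuming those weights translate directly to scikit-learn's `class_weight`:

```python
from sklearn.linear_model import LogisticRegression

# Assumption: model_weights = [negative_class_weight, positive_class_weight]
clf = LogisticRegression(class_weight={0: model_weights[0], 1: model_weights[1]})
clf.fit(x_train, y_train)
```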
</details>

#### Model Configuration
<details><summary> ConfigGenerator </summary>

#### Description
A simple, flexible configuration builder for creating YAML files with any structure. This utility allows you to build complex, nested configuration files programmatically without being constrained to a predefined structure.

`gitlabds.ConfigGenerator(**kwargs):`

#### Parameters:
- **_**kwargs_** : Initial configuration values to populate the configuration object with

#### Methods:

##### `add(path, value)`
Add or update a value at a specific path in the configuration.

- **_path_**: String using dot-notation to specify the location (e.g., 'model.parameters.learning_rate')
- **_value_**: Any value to set at the specified path

##### `to_yaml(file_path)`
Write the configuration to a YAML file.

- **_file_path_**: Path to the output YAML file

#### Returns
- ConfigGenerator object for method chaining

#### Examples:
```python
from gitlabds import ConfigGenerator

# Initialize with some top-level parameters
config = ConfigGenerator(
    model_name="churn_prediction",
    version="1.0.0",
    unique_id="customer_id"
)

# Add nested model parameters
config.add("model.file", "xgboost_model.pkl")
config.add("model.parameters.learning_rate", 0.01)
config.add("model.parameters.max_depth", 6)

# Add preprocessing information from outlier detection and dummy coding
config.add("preprocessing.outliers", outlier_info)
config.add("preprocessing.dummy_coding", dummy_info)

# Add query information
config.add("query_parameters.query_file", "customer_data.sql")
config.add("query_parameters.lookback_months", 12)

# Save to YAML
config.to_yaml("churn_model_config.yaml")
```
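
If `add()` returns the ConfigGenerator object for method chaining, as the Returns note suggests, the same configuration can be built more compactly:

```python
# Equivalent construction using method chaining (assumes add() returns self)
(
    ConfigGenerator(model_name="churn_prediction", version="1.0.0")
    .add("model.file", "xgboost_model.pkl")
    .add("model.parameters.learning_rate", 0.01)
    .to_yaml("churn_model_config.yaml")
)
```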
</details>

### Model Evaluation

<details><summary> ModelEvaluator </summary>

#### Description
A comprehensive framework for evaluating machine learning models, supporting both classification (binary and multi-class) and regression models. It provides extensive evaluation metrics, visualizations, and feature importance analysis.

`gitlabds.ModelEvaluator(model, x_train, y_train, x_test, y_test, x_oot=None, y_oot=None, classification=True, algo=None, f1_threshold=0.50, decile_n=10, top_features_n=20, show_all_classes=True, show_plots=True, save_plots=True, plot_dir='plots', plot_save_format='png', plot_save_dpi=300)`

#### Parameters:
- _**model**_ : The trained model to evaluate. Must have a `predict` method for regression or a `predict_proba` method for classification.
- _**x_train**_ : Training features DataFrame.
- _**y_train**_ : Training labels (Series or DataFrame).
- _**x_test**_ : Test features DataFrame.
- _**y_test**_ : Test labels (Series or DataFrame).
- _**x_oot**_ : Optional out-of-time validation features.
- _**y_oot**_ : Optional out-of-time validation labels.
- _**classification**_ : Whether this is a classification model. If False, regression metrics will be used.
- _**algo**_ : Algorithm type for feature importance calculation. Options: 'xgb', 'rf', 'mars'. For other algorithms, use `None`.
- _**f1_threshold**_ : Probability threshold for classifying positive cases in binary classification. Defaults to 0.50.
- _**decile_n**_ : Number of n-tiles for lift calculation. Defaults to 10 (deciles).
- _**top_features_n**_ : Number of top features to display in visualizations.
- _**show_all_classes**_ : Whether to show metrics for all classes in multi-class classification.
- _**show_plots**_ : Whether to display plots.
- _**save_plots**_ : Whether to save plots locally.
- _**plot_dir**_ : Directory to save plots to. Defaults to 'plots'.
- _**plot_save_format**_ : File format for saved plots. Defaults to 'png'.
- _**plot_save_dpi**_ : Resolution (DPI) for saved plots. Defaults to 300.

#### Returns
- ModelMetricsResult object containing all evaluation metrics and results.

#### Key Methods:
- **evaluate()** - Compute and return all metrics
- **evaluate_custom_metrics(custom_metrics)** - Evaluate with additional custom metrics
- **display_metrics(results=None)** - Display evaluation results in a formatted way
- **calibration_assessment()** - Assess model calibration for classification models
- **get_feature_descriptives(display_results=False)** - Generate descriptive statistics for features
- **plot_feature_importance(feature_importance, n_features=20)** - Plot feature importance
- **plot_shap_beeswarm(n_features=20, plot_type="beeswarm")** - Create SHAP visualization
- **plot_score_distribution(bins=None)** - Plot distribution of predicted values
- **plot_feature_interactions(feature_pairs=None, n_top_pairs=5)** - Plot feature interactions
- **plot_confusion_matrix()** - Plot confusion matrix for classification models
- **plot_lift_analysis()** - Plot comprehensive lift analysis
- **plot_performance_curves()** - Plot ROC and precision-recall curves
- **plot_learning_history()** - Plot learning curves for iterative models
- **plot_performance_comparison()** - Plot model performance for out-of-time validation

#### Examples:

```python
# Create an evaluator for a classification model
from gitlabds import ModelEvaluator

evaluator = ModelEvaluator(
    model=my_model,
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    y_test=y_test,
    classification=True,
    algo='xgb'
)

# Get all evaluation metrics
results = evaluator.evaluate()

# Display metrics in a formatted way
evaluator.display_metrics(results)

# Create visualizations
evaluator.plot_feature_importance(results.feature_importance)
evaluator.plot_confusion_matrix()
evaluator.plot_performance_curves()

# Save results to file
results.metrics_df.to_csv("metrics.csv")
results.classification_metrics_df.to_csv("classification_metrics.csv")
results.feature_importance.to_csv("feature_importance.csv")
```
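
The calibration and custom-metric methods follow the same pattern. The exact input `evaluate_custom_metrics` expects is not documented here; the sketch below is an assumption that it takes a mapping of metric names to callables with scikit-learn's `(y_true, y_pred)` signature:

```python
# Assess calibration for classification models (documented, no arguments)
calibration = evaluator.calibration_assessment()

# Hypothetical usage of evaluate_custom_metrics: name -> callable mapping (assumption)
from sklearn.metrics import balanced_accuracy_score
custom_results = evaluator.evaluate_custom_metrics(
    {"balanced_accuracy": balanced_accuracy_score}
)
```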
</details>


### Insight Generation
<details><summary> Marginal Effects </summary>

#### Description
Calculates and returns the marginal effects at the mean (MEM) for predictor fields.

`gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):`

#### Parameters:
- _**model**_ : model file from training
- _**x_test**_ : test "predictors" dataframe.
- _**dv_description**_ : Description of the outcome field to be used in text-based insights. 
- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name will be used.

#### Returns
- Dataframe of marginal effects.

#### Examples:
```python
# Calculate marginal effects for a trained model
import gitlabds
effects_df = gitlabds.marginal_effects(
    model=trained_model,
    x_test=test_features,
    dv_description="probability of churn",
    field_labels={
        "tenure": "Customer tenure in months",
        "monthly_charges": "Average monthly bill amount",
        "total_charges": "Total amount charged to customer"
    }
)

# Display the marginal effects
display(effects_df)
```
</details>

<details><summary> Prescriptions </summary>

#### Description
Return "actionable" prescriptions and explanatory insights for each scored record. Insights first list actionable prescriptions follow by explainatory insights. This approach is recommended or linear/logistic methodologies only. Caution should be used if using a black box approach, as manpulating more than one prescription at a time could change a record's model score in unintended ways.  

`gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all'):`

#### Parameters:
- _**model**_ : model file from training
- _**input_df**_ : train "predictors" dataframe. 
- _**scored_df**_ : dataframe containing model scores.
- _**actionable_fields**_ : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: `Increasing` for prescriptions only when the field increases; `Decreasing` for prescriptions only when the field decreases; `Both` for when the field either increases or decreases.   
- _**dv_description**_ : Description of the outcome field to be used in text-based insights.
- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name will be used.
- _**returned_insights**_ : Number of insights per record to return. Defaults to 5.
- _**only_actionable**_ : Only return actionable prescriptions.
- _**explanation_fields**_ : List of explanatory (non-actionable) fields to return insights for. Defaults to 'all'.

#### Returns
- Dataframe of prescriptive actions. One row per record input.

#### Examples:
```python
# Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
results = gitlabds.prescriptions(
    model=model, 
    input_df=my_df, 
    scored_df=my_scores, 
    actionable_fields={
        'spend': 'Increasing', 
        'returns': 'Decreasing', 
        'emails_sent': 'Both'
    }, 
    dv_description='likelihood to churn', 
    field_labels={
        'spend': 'Dollars spent in last 6 months', 
        'returns': 'Item returns in last 3 months', 
        'emails_sent': 'Marketing emails sent in last month'
    }, 
    returned_insights=5, 
    only_actionable=True, 
    explanation_fields=['spend', 'returns']
)
```
</details>

### Model Monitoring

<details><summary> Generate Baseline Features </summary>

#### Description
Generate baseline feature distributions, importance scores, and drift thresholds in a single comprehensive artifact for model monitoring.

`gitlabds.generate_baseline_features(training_data, feature_importance_df, importance_method="shapley_values", n_bins=10, psi_warning=0.1, psi_critical=0.2, ks_warning=0.2, ks_critical=0.3, js_warning=0.1, js_critical=0.2, output_path="baseline_features.json"):`

#### Parameters:
- **_training_data_** : Training feature data DataFrame
- **_feature_importance_df_** : DataFrame with columns: feature, importance
- **_importance_method_** : Method used to calculate importance (default: "shapley_values")
- **_n_bins_** : Number of bins for numerical features (default: 10)
- **_psi_warning, psi_critical_** : PSI thresholds for warning and critical drift detection
- **_ks_warning, ks_critical_** : KS statistic thresholds for warning and critical drift detection
- **_js_warning, js_critical_** : JS divergence thresholds for warning and critical drift detection
- **_output_path_** : Path to save the JSON file

#### Returns
- None (saves baseline artifact to JSON file)

#### Examples:
```python
# Generate baseline features with default thresholds
import gitlabds
gitlabds.generate_baseline_features(
    training_data=train_df,
    feature_importance_df=importance_df,
    importance_method="shapley_values",
    output_path="model_baseline_features.json"
)
```

```python
# Generate with custom drift thresholds
gitlabds.generate_baseline_features(
    training_data=train_df,
    feature_importance_df=importance_df,
    n_bins=15,
    psi_warning=0.15,
    psi_critical=0.25,
    output_path="custom_baseline_features.json"
)
```
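
`feature_importance_df` only needs `feature` and `importance` columns. One way to construct it, assuming a fitted tree-based model (`my_model` is hypothetical) whose features align with the training columns:

```python
import pandas as pd

# Build the expected two-column importance frame from a fitted model (assumption:
# my_model exposes feature_importances_ aligned with train_df's columns)
importance_df = pd.DataFrame({
    "feature": train_df.columns,
    "importance": my_model.feature_importances_,
}).sort_values("importance", ascending=False)
```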
</details>

<details><summary> Generate Baseline Calibration </summary>

#### Description
Generate baseline calibration data with train + test curves, prediction statistics, and model configuration for monitoring model calibration drift.

`gitlabds.generate_baseline_calibration(train_predictions, train_actuals, test_predictions, test_actuals, model_configuration, n_bins=10, prediction_drift_warning=0.10, prediction_drift_critical=0.20, output_path="baseline_calibration.json"):`

#### Parameters:
- **_train_predictions_** : Training set predicted probabilities
- **_train_actuals_** : Training set actual binary outcomes (0/1)
- **_test_predictions_** : Test set predicted probabilities
- **_test_actuals_** : Test set actual binary outcomes (0/1)
- **_model_configuration_** : Dictionary of model configuration parameters
- **_n_bins_** : Number of bins for calibration curve (default: 10)
- **_prediction_drift_warning_** : Warning threshold for prediction score drift (default: 0.10)
- **_prediction_drift_critical_** : Critical threshold for prediction score drift (default: 0.20)
- **_output_path_** : Path to save the JSON file
        
#### Returns
- None (saves baseline calibration to JSON file)

#### Examples:
```python
# Generate baseline calibration
import gitlabds
model_config = {'f1_threshold': 0.15, 'random_state': 42}

gitlabds.generate_baseline_calibration(
    train_predictions=y_train_pred,
    train_actuals=y_train,
    test_predictions=y_test_pred,
    test_actuals=y_test,
    model_configuration=model_config,
    prediction_drift_warning=0.15,
    prediction_drift_critical=0.25,
    output_path="model_baseline_calibration.json"
)
```
</details>

<details><summary> Calculate Monitoring Metrics </summary>

#### Description
Calculate all monitoring metrics including feature drift, prediction drift, and health status for a model scoring run.

`gitlabds.calculate_monitoring_metrics(run_id, model_name, sub_model, model_version, scoring_date, feature_df, predictions, baseline_metrics, importance_threshold_pct=0.05):`

#### Parameters:
- **_run_id_** : Unique identifier for this scoring run
- **_model_name_** : Model name (e.g., "propensity_model")
- **_sub_model_** : Sub model identifier
- **_model_version_** : Model version (e.g., "2.1")
- **_scoring_date_** : Date of scoring
- **_feature_df_** : Feature data used for scoring
- **_predictions_** : Model predictions (probabilities 0-1)
- **_baseline_metrics_** : Dictionary containing baseline_features and baseline_calibration JSON data
- **_importance_threshold_pct_** : Percentage of total importance required for a feature to be considered "important"

#### Returns
- Dictionary containing DataFrames for each table:
  - 'scoring_summary': model_scoring_summary table
  - 'feature_drift': model_feature_drift table

#### Examples:
```python
# Calculate monitoring metrics for a scoring run
import gitlabds
import json
from datetime import datetime

# Load baseline metrics
with open('baseline_features.json', 'r') as f:
    baseline_features = json.load(f)
with open('baseline_calibration.json', 'r') as f:
    baseline_calibration = json.load(f)

baseline_metrics = {
    'baseline_features': baseline_features,
    'baseline_calibration': baseline_calibration
}

# Calculate metrics
results = gitlabds.calculate_monitoring_metrics(
    run_id="scoring_run_123",
    model_name="churn_prediction",
    sub_model="high_value_customers",
    model_version="2.1",
    scoring_date=datetime.now(),
    feature_df=current_features,
    predictions=model_predictions,
    baseline_metrics=baseline_metrics,
    importance_threshold_pct=0.05
)

# Access results
scoring_summary = results['scoring_summary']
feature_drift = results['feature_drift']
```
</details>

### SQL and Trend Analysis

<details><summary> SQL Trend Query Generator </summary>

#### Description
Generate SQL for trend analysis across time periods. The generated SQL transforms regular data into a time-series format with columns for each time period, allowing for easy trend detection.

`gitlabds.generate_sql_trend_query(snapshot_date, date_field, date_unit='MONTH', periods=12, table_name=None, group_by_fields=None, metrics=None, filters=None, output_file=None):`

#### Parameters:
- **_snapshot_date_** : Reference date for analysis (e.g., '2025-04-08')
- **_date_field_** : Field name in the table that contains the date to analyze
- **_date_unit_** : Time unit for analysis: 'DAY', 'WEEK', 'MONTH', 'QUARTER', 'YEAR'
- **_periods_** : Number of time periods to analyze
- **_table_name_** : Table to query data from
- **_group_by_fields_** : Fields to group by (entity identifiers)
- **_metrics_** : Metrics to include in analysis with their properties. Each metric is a dict with:
  - name: output column name prefix
  - source: field name in the source table
  - aggregation: function to apply (AVG, SUM, MAX, etc.)
  - condition: optional WHERE condition
  - cumulative: if True, calculate period-over-period differences
  - is_case_expression: if True, the source is already a CASE WHEN expression
  - is_expression: if True, the source is a complex expression
- **_filters_** : SQL WHERE clause conditions as a string
- **_output_file_** : If provided, save the generated SQL to this file
        
#### Returns:
- The generated SQL query as a string

#### Examples:
```python
# Generate SQL for monthly trend analysis
import gitlabds

# Define metrics
metrics = [
    {"name": "active_users", "source": "monthly_active_users", "aggregation": "AVG"},
    {"name": "revenue", "source": "monthly_revenue", "aggregation": "SUM"},
    {"name": "projects", "source": "projects_created", "aggregation": "MAX", "cumulative": True}
]

# Generate SQL query
sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='transaction_date',
    date_unit='MONTH',
    periods=12,
    table_name='analytics.user_metrics',
    group_by_fields=['account_id'],
    metrics=metrics,
    filters="is_active = TRUE",
    output_file='trend_query.sql'
)
```

```python
# Generate SQL for daily trend analysis with custom conditions
metrics = [
    {"name": "logins", "source": "user_logins", "aggregation": "SUM"},
    {"name": "premium_logins", "source": "user_logins", "aggregation": "SUM", 
     "condition": "subscription_tier = 'premium'"}
]

sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='login_date',
    date_unit='DAY',
    periods=30,
    table_name='analytics.daily_logins',
    metrics=metrics
)
```
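
The `is_case_expression` property from the metrics list above is not exercised in these examples; a hedged sketch of how such a metric might be declared, based on the parameter description (table and field names are hypothetical):

```python
# Metric whose source is already a CASE WHEN expression
metrics = [
    {"name": "paid_user_days",
     "source": "CASE WHEN plan_tier != 'free' THEN 1 ELSE 0 END",
     "aggregation": "SUM",
     "is_case_expression": True}
]

sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='activity_date',
    date_unit='WEEK',
    periods=8,
    table_name='analytics.user_activity',
    metrics=metrics
)
```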
</details>

<details><summary> Trend Analysis </summary>

#### Description
Calculate trend metrics for a dataframe produced by the SQL trend generator. This function analyzes time-series data to identify patterns like consecutive increases or decreases, proportion of periods with growth or decline, and average percentage changes.

`gitlabds.trend_analysis(df, metric_list=None, time_unit='month', periods=6, include_cumulative=True, exclude_fields=None, verbose=False):`

#### Parameters:
- **_df_** : Dataframe containing trend data with time-based columns
- **_metric_list_** : List of metric names to analyze. If None, auto-detects metrics from columns
- **_time_unit_** : Time unit used in the column names (month, day, week, etc.)
- **_periods_** : Number of time periods to analyze
- **_include_cumulative_** : Whether to use cumulative (event) metrics when available
- **_exclude_fields_** : List of fields to exclude from auto-detection
- **_verbose_** : Whether to display intermediate output
        
#### Returns:
- A dataframe containing trend metrics for each specified metric, including:
  - Count of periods with decreases/increases
  - Count of consecutive decreases/increases
  - Average percentage change across periods

#### Examples:
```python
# Run trend analysis on data from SQL trend query
import gitlabds

# Run the SQL query to get trend data
trend_data = run_sql_query(trend_sql)  # Your function to execute SQL
trend_data.set_index('account_id', inplace=True)

# Analyze trends for all metrics
trends_df = gitlabds.trend_analysis(
    df=trend_data,
    time_unit='month',
    periods=12,
    verbose=True
)
```

```python
# Analyze trends for specific metrics
trends_df = gitlabds.trend_analysis(
    df=trend_data,
    metric_list=['active_users', 'revenue'],
    time_unit='month',
    periods=6,
    include_cumulative=True,
    exclude_fields=['has_data']
)

# Use trend metrics for customer health scoring
account_data['declining_usage'] = trends_df['consecutive_drop_active_users_period_6_months_cnt'] > 0
account_data['growth_score'] = trends_df['avg_perc_change_revenue_period_6_months'] * 100
```
</details>

## GitLab Data Science

The [handbook](https://handbook.gitlab.com/handbook/enterprise-data/organization/data-science/) is the single source of truth for all of our documentation. 

## Contributing

We welcome contributions and improvements, please see the [contribution guidelines](CONTRIBUTING.md).

## License

This code is distributed under the MIT license, please see the [LICENSE](LICENSE) file.

            

Raw data

            {
    "_id": null,
    "home_page": "https://gitlab.com/gitlab-data/gitlabds",
    "name": "gitlabds",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Kevin Dietz",
    "author_email": "kdietz@gitlab.com",
    "download_url": "https://files.pythonhosted.org/packages/c6/0d/e47744a8511b3bea53c3ddb3fdb4af2861e113be58accea1111d67043c4e/gitlabds-2.1.3.tar.gz",
    "platform": null,
    "description": "[![PyPI version](https://badge.fury.io/py/gitlabds.svg)](https://badge.fury.io/py/gitlabds)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n# gitlabds\n\n## What is it?\ngitlabds is a Python toolkit that streamlines the machine learning workflows with specialized functions for data preparation, feature engineering, model evaluation, and deployment. It helps data scientists focus on providing consistent patterns for both experimentation and production pipelines.\n\n## Installation\n\n```bash\npip install gitlabds\n```\n\n### Requirements\n\n- Python 3.10 or later\n- Core dependencies:\n   - pandas>=1.5.3\n   - numpy>=1.23.5\n   - scipy>=1.13.1\n   - scikit-learn>=1.1.1\n   - imbalanced-learn>=0.9.1\n   - seaborn>=0.13.2\n   - shap>=0.41.0\n   - tqdm>=4.66.2\n\n## Main Features by Category\n\n### Data Preparation\n\n#### Outlier Detection and Treatment\n<details><summary> MAD Outliers </summary>\n\n#### Description\nMedian Absolute Deviation for outlier detection and correction. By default will windsor all numeric values in your dataframe that are more than 4 standard deviations above or below the median ('threshold').\n\n`gitlabds.mad_outliers(df, dv=None, min_levels=10, columns='all', threshold=4.0, auto_adjust_skew=False, verbose=True, windsor_threshold=0.01):`\n\n#### Parameters:\n- **_df_** : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable in will prevent it from being windsored. May be left blank there is no outcome variable.\n- **_min_levels_** : Only include columns that have at least the number of levels specified. \n- **_columns_** : Will examine at all numeric columns by default. To limit to just  a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' paramaters. \n- **_threshold_** : Windsor values greater than this number of standard deviations from the median.\n- **_auto_adjust_skew_** : Whether to adjust thresholds based on column skewness\n- **_verbose_** : Set to `True` to print outputs of windsoring being done. Set to `False` to suppress.\n- **_windsor_threshold_** : Only windsor values that affect less than this percentage of the population.  \n\n#### Returns\n- Tuple containing:\n  - The transformed DataFrame by windsoring outliers\n  - Dictionary of outlier limits that can be used with apply_outliers()\n\t\n#### Examples:\n\t\t\n```python\n# Create a new df; only windsor selected columns; suppress verbose\nimport gitlabds\nnew_df, outlier_limits = gitlabds.mad_outliers(df=my_df, dv='my_outcome', columns=['colA', 'colB', 'colC'], verbose=False)\n```\n```python\n# Windsor values with skew adjustment for highly skewed data\nnew_df, outlier_limits = gitlabds.mad_outliers(df=my_df, threshold=3.0, auto_adjust_skew=True)\n```\n</details>\n\n<details><summary> Apply Outliers </summary>\n\n#### Description\nApply previously determined outlier limits to a dataframe. 
This is typically used to apply the same outlier treatment to new data that was applied during model training.\n\n`gitlabds.apply_outliers(df, outlier_limits):`\n\n#### Parameters:\n- **_df_** : The dataframe to transform\n- **_outlier_limits_** : dictionary of outlier limits previously generated by mad_outliers()\n\n#### Returns\n- DataFrame with outlier limits applied.\n\t\n#### Examples:\n\t\t\n```python\n# Find outliers in training data\ntrain_df, outlier_limits = gitlabds.mad_outliers(df=train_data, dv='target', threshold=3.0)\n\n# Apply same outlier limits to test data\ntest_df_transformed = gitlabds.apply_outliers(df=test_data, outlier_limits=outlier_limits)\n```\n</details>\n\n#### Missing Value Handling\n<details><summary> Missing Values </summary>\n\n#### Description\nDetect and optionally fill missing values in a DataFrame, with support for various filling methods and detailed reporting.\n\n`gitlabds.missing_values(df, threshold=0.0, method=None, columns=\"all\", constant_value=None, verbose=True, operation=\"both\")`\n\n#### Parameters:\n- **_df_** : Your pandas dataframe\n- **_threshold_** : The percent of missing values at which a column is considered for processing. For example, threshold=0.10 will only process columns with more than 10% missing values.\n- **_method_** : Method to fill missing values or dictionary mapping columns to methods. Options:\n  - \"mean\": Fill with column mean (numeric only)\n  - \"median\": Fill with column median (numeric only)\n  - \"zero\": Fill with 0\n  - \"constant\": Fill with the value specified in constant_value\n  - \"random\": Fill with random values sampled from the column's distribution\n  - \"drop_column\": Remove columns with missing values\n  - \"drop_row\": Remove rows with any missing values in specified columns\n- **_columns_** : Columns to check and/or fill. 
If \"all\", processes all columns with missing values.\n- **_constant_value_** : Value to use when method=\"constant\" or when specified columns use the constant method.\n- **_verbose_** : Whether to print detailed information about missing values and filling operations.\n- **_operation_** : Operation mode:\n  - \"check\": Only check for missing values, don't fill\n  - \"fill\": Fill missing values and return filled dataframe\n  - \"both\": Check and fill missing values (default)\n\n#### Returns\n- If operation=\"check\": List of column names with missing values (or None)\n- If operation=\"fill\" or \"both\": Tuple containing:\n  - DataFrame with missing values handled\n  - Dictionary with missing value information that can be used with apply_missing_fill()\n    \n#### Examples:\n```python\n# Just check for missing values\nmissing_columns = gitlabds.missing_values(df, threshold=0.05, operation=\"check\")\n\n# Fill all columns with mean value\ndf_filled, missing_info = gitlabds.missing_values(df, method=\"mean\")\n\n# Fill different columns with different methods\ndf_filled, missing_info = gitlabds.missing_values(\n    df, \n    method={\"numeric_col\": \"median\", \"string_col\": \"constant\"},\n    constant_value=\"Unknown\",\n    verbose=True\n)\n```\n</details>\n\n<details><summary> Apply Missing Values </summary>\n\n#### Description\nApply previously determined missing value handling to a dataframe.\n\n`gitlabds.apply_missing_values(df, missing_info):`\n\n#### Parameters:\n- **_df_** : The dataframe to transform\n- **_missing_info_** : Dictionary of missing value information previously generated by `missing_values()`\n\n#### Returns\n- DataFrame with missing values handled according to the provided information.\n   \n#### Examples:\n```python\n# Generate missing value info from training data\n_, missing_info = gitlabds.missing_values(train_df, method=\"mean\")\n   \n# Apply to test data\ntest_df_filled = gitlabds.apply_missing_values(test_df, missing_info)\n```\n</details>\n\n#### Feature Engineering\n<details><summary> Dummy Code </summary>\n\n#### Description\nDummy code (AKA \"one-hot encode\") categorical and numeric columns based on the paremeters specificed below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not\n\n`gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels=20, numeric_max_levels=10, dummy_na=False, prefix_sep=\"_dummy_\", verbose=True):`\n\n#### Parameters:\n- **_df_** : Your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.\n- **_columns_** : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names. \n- **_categorical_** : Set to `True` to attempt to dummy code any categorical column passed via the `columns` parameter.\n- **_numeric_** : Set to `True` to attempt to dummy code any numeric column passed via the `columns` parameter.\n- **_categorical_max_levels_** : Maximum number of levels a categorical column can have to be eligible for dummy coding.\n- **_numeric_max_levels_** : Maximum number of levels a numeric column can have to be eligible for dummy coding.\n- **_dummy_na_** : Set to `True` to create a dummy coded column for missing values.\n- **_prefix_sep_** : String to use as separator between column name and value in dummy column names. 
Default is \"_dummy_\".\n- **_verbose_** : Set to `True` to print outputs of dummy coding being done. Set to `False` to suppress.\n\n#### Returns\n- A tuple containing:\n  - The transformed DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.\n  - A dictionary containing information about dummy coding that can be used with `apply_dummy()` to transform new data consistently.\n\n#### Examples:\n\t\t\n```python\n# Dummy code only categorical columns with a maximum of 30 levels; suppress verbose output\nimport gitlabds\nnew_df, dummy_dict = gitlabds.dummy_code(\n    df=my_df, \n    dv='my_outcome', \n    columns='all', \n    categorical=True, \n    numeric=False, \n    categorical_max_levels=30, \n    verbose=False\n)\n```\n\n```python\n# Dummy code with custom separator\nnew_df, dummy_dict = gitlabds.dummy_code(\n    df=my_df, \n    columns=['colA', 'colB', 'colC'], \n    categorical=True, \n    numeric=True, \n    prefix_sep=\"_is_\"\n)\n```\n</details>\n\n<details><summary> Dummy Top </summary>\n\n#### Description\nDummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.\n\n`gitlabds.dummy_top(df, dv=None, columns='all', min_threshold=0.05, drop_categorical=True, prefix_sep=\"_dummy_\", verbose=True):`\n\n#### Parameters:\n- **_df_** : Your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.\n- **_columns_** : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names. \n- **_min_threshold_**: The threshold at which levels will be dummy coded. For example, the default value of `0.05` will dummy code any categorical level that is in at least 5% of all rows.\n- **_drop_categorical_**: Set to `True` to drop categorical columns after they are considered for dummy coding. Set to `False` to keep the original categorical columns in the dataframe.\n- **_prefix_sep_** : String to use as separator between column name and value in dummy column names. Default is \"_dummy_\".\n- **_verbose_** : Set to `True` to print detailed list of all dummy columns being created. 
Set to `False` to suppress.\n\n#### Returns\n- A tuple containing:\n  - The transformed DataFrame with dummy-coded columns for high-frequency values.\n  - A dictionary containing information about dummy coding that can be used with `apply_dummy()` to transform new data consistently.\n\n#### Examples:\n\t\t\n```python\n# Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows\nimport gitlabds\nnew_df, dummy_top_dict = gitlabds.dummy_top(\n    df=my_df, \n    dv='my_outcome', \n    columns='all', \n    min_threshold=0.05, \n    drop_categorical=True, \n    verbose=True\n)\n```\n\n```python\n# Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; \n# suppress verbose printout and retain original categorical columns\nnew_df, dummy_top_dict = gitlabds.dummy_top(\n    df=my_df, \n    dv='my_outcome', \n    columns=['colA', 'colB', 'colC'], \n    min_threshold=0.10, \n    drop_categorical=False, \n    verbose=False\n)\n```\n</details>\n\n<details><summary> Apply Dummy </summary>\n\n#### Description\nApply previously determined dummy coding to a new dataframe. This is typically used to apply the same dummy coding to new data that was created during model training.\n\n`gitlabds.apply_dummy(df, dummy_info, drop_original=False):`\n\n#### Parameters:\n- **_df_** : The dataframe to transform\n- **_dummy_info_** : Dictionary of dummy coding information previously generated by `dummy_code()` or `dummy_top()`\n- **_drop_original_** : Whether to drop the original columns after dummy coding. Default is `False`.\n\n#### Returns\n- DataFrame with dummy coding applied according to the provided information.\n\t\n#### Examples:\n\t\t\n```python\n# Generate dummy coding information from training data\ntrain_df, dummy_info = gitlabds.dummy_code(df=train_data, dv='target')\n\n# Apply to test data\ntest_df_transformed = gitlabds.apply_dummy(\n    df=test_data, \n    dummy_info=dummy_info\n)\n```\n</details>\n\n#### Feature Selection\n<details><summary> Remove Low Variation </summary>\n\n#### Description\nRemove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.\n\n`gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, verbose=True):`\n\n#### Parameters:\n- _**df**_ : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable in will prevent it from being removed due to low variation. May be left blank there is no outcome variable.\n- **_columns_** : Will examine at all columns by default. To limit to just a subset of columns, pass a list of column names. \n- **_threshold_**: The maximum percentage one value in a column can represent. columns that exceed this threshold will be dropped. For example, the default value of `0.98` will drop any column where one value is present in more than 98% of rows.\n- **_verbose_** : Set to `True` to print outputs of columns being dropped. 
</details>

#### Feature Selection
<details><summary> Remove Low Variation </summary>

#### Description
Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.

`gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, verbose=True):`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
- **_columns_** : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- **_threshold_** : The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of `0.98` will drop any column where one value is present in more than 98% of rows.
- **_verbose_** : Set to `True` to print outputs of columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with low variation columns dropped.

#### Examples:
```python
# Drop any columns (except for the outcome) where one value is present in more than 95% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
```
```python
# Drop any of the selected columns where one value is present in more than 99% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv=None, columns=['colA', 'colB', 'colC'], threshold=.99)
```
</details>

<details><summary> Correlation Reduction </summary>

#### Description
Reduce the number of columns in a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped.

`gitlabds.correlation_reduction(df=None, dv=None, threshold=0.9, method="pearson", verbose=True):`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dropped. If provided, when choosing between correlated features, the one with higher correlation to the target will be kept.
- **_threshold_** : The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. For example, the default value of `0.90` will identify columns that have correlations greater than 90% to each other and drop one of those columns.
- **_method_** : The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
- **_verbose_** : Set to `True` to print outputs of columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with redundant correlated columns dropped.

#### Examples:
```python
# Perform column reduction via correlation using a threshold of 95%, excluding the outcome column.
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold=0.95, method="pearson")
```
```python
# Perform column reduction using Spearman rank correlation with a threshold of 90%.
new_df = gitlabds.correlation_reduction(df=my_df, dv=None, threshold=0.90, method="spearman")
```
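For relationships that are neither linear nor monotonic, the documented `mutual_info` option can catch redundancy that Pearson and Spearman miss; it can be slower to compute, so a sketch like this is best reserved for smaller feature sets:

```python
# Drop one of any pair of columns whose mutual information exceeds the threshold,
# keeping the feature more related to the outcome when dv is provided
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold=0.85, method="mutual_info")
```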
</details>

<details><summary> Remove Outcome Proxies </summary>

#### Description
Remove columns that are highly correlated with the outcome (target) column.

`gitlabds.remove_outcome_proxies(df, dv, threshold=.8, method="pearson", verbose=True):`

#### Parameters:
- _**df**_ : your pandas dataframe
- _**dv**_ : The column name of your outcome.
- _**threshold**_ : The correlation value to the outcome above which columns will be dropped. For example, the default value of `0.80` will identify and drop columns that have correlations greater than 80% to the outcome.
- **_method_** : The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
- **_verbose_** : Set to `True` to print outputs of columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with outcome proxy columns dropped.

#### Examples:
```python
# Drop columns with correlations to the outcome greater than 70%
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.7)
```
```python
# Drop columns with correlations to the outcome greater than 80% using Spearman correlation
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.8, method="spearman")
```
</details>

<details><summary> Drop Categorical </summary>

#### Description
Drop all categorical columns from the dataframe. A useful step before regression modeling, as categorical variables cannot be used directly.

`gitlabds.drop_categorical(df):`

#### Parameters:
- _**df**_ : your pandas dataframe

#### Returns
- DataFrame with categorical columns dropped.

#### Examples:
```python
# Dropping categorical columns
new_df = gitlabds.drop_categorical(df=my_df)
```
</details>

#### Memory Optimization
<details><summary> Memory Optimization </summary>

#### Description
Apply multiple memory optimization techniques to dramatically reduce DataFrame memory usage.

`gitlabds.memory_optimization(df, apply_numeric_downcasting=True, apply_categorical=True, apply_sparse=True, precision_mode='balanced', verbose=True, exclude_columns=None, **kwargs):`

#### Parameters:
- **_df_** : Input pandas dataframe to optimize
- **_apply_numeric_downcasting_** : Whether to downcast numeric columns to smaller data types. Defaults to `True`.
- **_apply_categorical_** : Whether to convert string columns to categorical when beneficial. Defaults to `True`.
- **_apply_sparse_** : Whether to apply sparse encoding for columns with many repeated values. Defaults to `True`.
- **_precision_mode_** : Controls aggressiveness of numeric downcasting. Defaults to `"balanced"`.
  - "aggressive": Maximum memory savings, may affect precision
  - "balanced": Good memory savings while preserving most precision
  - "safe": Conservative downcasting to preserve numeric precision
- **_verbose_** : Whether to print progress and memory statistics. Defaults to `True`.
- **_exclude_columns_** : List of columns to exclude from optimization. Defaults to `None`.
- **_**kwargs_** : Additional arguments for optimization techniques

#### Returns
- Memory-optimized pandas DataFrame.

#### Examples:

```python
# Basic optimization with default settings
import gitlabds
df_optimized = gitlabds.memory_optimization(df)
```
```python
# Customize optimization approach
df_optimized = gitlabds.memory_optimization(
    df,
    apply_numeric_downcasting=True,
    apply_categorical=True,
    apply_sparse=False,  # Skip sparse encoding
    precision_mode='safe',
    exclude_columns=['id', 'timestamp'],
    verbose=True
)
```
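To confirm the savings yourself, compare deep memory usage before and after. This sketch uses only standard pandas and makes no assumptions about gitlabds internals:

```python
# Measure memory before and after optimization (deep=True counts object contents)
before_mb = df.memory_usage(deep=True).sum() / 1024**2
df_optimized = gitlabds.memory_optimization(df, verbose=False)
after_mb = df_optimized.memory_usage(deep=True).sum() / 1024**2

print(f"{before_mb:.1f} MB -> {after_mb:.1f} MB ({1 - after_mb / before_mb:.0%} reduction)")
```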
</details>

### Model Development

#### Data Splitting and Sampling
<details><summary> Split Data </summary>

#### Description
This function splits your data into train and test datasets, separating the outcome from the rest of the file. It supports stratified sampling, balanced upsampling for imbalanced datasets, and provides model weights to compensate for sampling adjustments.

`gitlabds.split_data(df, train_pct=0.7, dv=None, dv_threshold=0.0, random_state=5435, stratify=True, sampling_strategy=None, shuffle=True, verbose=True):`

#### Parameters:
- **_df_** : your pandas dataframe
- **_train_pct_** : The percentage of rows randomly assigned to the training dataset. Defaults to 0.7 (70% train, 30% test).
- **_dv_** : The column name of your outcome. If None, the function will return the entire dataframe split without separating features and target.
- **_dv_threshold_** : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
- **_random_state_** : Random seed to use for splitting the dataframe and for up-sampling (if needed).
- **_stratify_** : Controls stratified sampling. If True and dv is provided, stratifies by the outcome variable. If a list of column names, stratifies by those columns. If False, does not use stratified sampling.
- **_sampling_strategy_** : Sampling strategy for imbalanced data. If None, will use dv_threshold. See the imblearn documentation for more details on acceptable values.
- **_shuffle_** : Whether to shuffle the data before splitting.
- **_verbose_** : Whether to print information about the splitting process.

#### Returns
- A tuple containing:
  - x_train: Training features DataFrame
  - y_train: Training target Series (if dv is provided, otherwise empty Series)
  - x_test: Testing features DataFrame
  - y_test: Testing target Series (if dv is provided, otherwise empty Series)
  - model_weights: List of weights to use for modeling [negative_class_weight, positive_class_weight]

#### Examples:

```python
# Basic split with default parameters (70% train, 30% test)
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df,
    dv='my_outcome'
)
```

```python
# Split with 80% training data and balancing for imbalanced target
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df,
    dv='my_outcome',
    train_pct=0.80,
    dv_threshold=0.3
)
```

```python
# Split with stratification on multiple variables
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df,
    dv='my_outcome',
    stratify=['my_outcome', 'region', 'customer_segment']
)
```

```python
# Split entire dataframe without separating target
train_df, _, test_df, _, _ = gitlabds.split_data(
    df=my_df,
    dv=None,
    train_pct=0.75
)
```
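The returned `model_weights` pair ([negative_class_weight, positive_class_weight]) can be expanded into per-row weights when fitting a model. A minimal sketch with scikit-learn, assuming a binary 0/1 outcome:

```python
from sklearn.linear_model import LogisticRegression

# Map the class-level weights onto each training row
sample_weight = y_train.map({0: model_weights[0], 1: model_weights[1]})

clf = LogisticRegression(max_iter=1000)
clf.fit(x_train, y_train, sample_weight=sample_weight)
```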
</details>

#### Model Configuration
<details><summary> ConfigGenerator </summary>

#### Description
A simple, flexible configuration builder for creating YAML files with any structure. This utility allows you to build complex, nested configuration files programmatically without being constrained to a predefined structure.

`gitlabds.ConfigGenerator(**kwargs):`

#### Parameters:
- **_**kwargs_** : Initial configuration values to populate the configuration object with

#### Methods:

##### `add(path, value)`
Add or update a value at a specific path in the configuration.

- **_path_** : String using dot-notation to specify the location (e.g., 'model.parameters.learning_rate')
- **_value_** : Any value to set at the specified path

##### `to_yaml(file_path)`
Write the configuration to a YAML file.

- **_file_path_** : Path to the output YAML file

#### Returns
- ConfigGenerator object for method chaining

#### Examples:
```python
from gitlabds import ConfigGenerator

# Initialize with some top-level parameters
config = ConfigGenerator(
    model_name="churn_prediction",
    version="1.0.0",
    unique_id="customer_id"
)

# Add nested model parameters
config.add("model.file", "xgboost_model.pkl")
config.add("model.parameters.learning_rate", 0.01)
config.add("model.parameters.max_depth", 6)

# Add preprocessing information from outlier detection and dummy coding
config.add("preprocessing.outliers", outlier_info)
config.add("preprocessing.dummy_coding", dummy_info)

# Add query information
config.add("query_parameters.query_file", "customer_data.sql")
config.add("query_parameters.lookback_months", 12)

# Save to YAML
config.to_yaml("churn_model_config.yaml")
```
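Since the output is plain YAML, downstream jobs can read it back with any YAML parser. A sketch using PyYAML (assumed to be installed) to verify the nested structure round-trips:

```python
import yaml

# Reload the generated config in a scoring or monitoring job
with open("churn_model_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["parameters"]["learning_rate"])  # 0.01
```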
</details>

### Model Evaluation

<details><summary> ModelEvaluator </summary>

#### Description
A comprehensive framework for evaluating machine learning models, supporting both classification (binary and multi-class) and regression models. It provides extensive evaluation metrics, visualizations, and feature importance analysis.

`gitlabds.ModelEvaluator(model, x_train, y_train, x_test, y_test, x_oot=None, y_oot=None, classification=True, algo=None, f1_threshold=0.50, decile_n=10, top_features_n=20, show_all_classes=True, show_plots=True, save_plots=True, plot_dir='plots', plot_save_format='png', plot_save_dpi=300)`

#### Parameters:
- _**model**_ : The trained model to evaluate. Must have a `predict` method for regression and a `predict_proba` method for classification.
- _**x_train**_ : Training features DataFrame.
- _**y_train**_ : Training labels (Series or DataFrame).
- _**x_test**_ : Test features DataFrame.
- _**y_test**_ : Test labels (Series or DataFrame).
- _**x_oot**_ : Optional out-of-time validation features.
- _**y_oot**_ : Optional out-of-time validation labels.
- _**classification**_ : Whether this is a classification model. If False, regression metrics will be used.
- _**algo**_ : Algorithm type for feature importance calculation. Options: 'xgb', 'rf', 'mars'. For other algorithms, use `None`.
- _**f1_threshold**_ : Threshold for binary classification.
- _**decile_n**_ : Number of n-tiles for lift calculation. Defaults to 10 for deciles.
- _**top_features_n**_ : Number of top features to display in visualizations.
- _**show_all_classes**_ : Whether to show metrics for all classes in multi-class classification.
- _**show_plots**_ : Whether to display plots.
- _**save_plots**_ : Whether to save plots locally.
- _**plot_dir**_ : Directory to save plots.
- _**plot_save_format**_ : Plot format.
- _**plot_save_dpi**_ : Plot resolution.

#### Returns
- ModelMetricsResult object containing all evaluation metrics and results.

#### Key Methods:
- **evaluate()** - Compute and return all metrics
- **evaluate_custom_metrics(custom_metrics)** - Evaluate with additional custom metrics
- **display_metrics(results=None)** - Display evaluation results in a formatted way
- **calibration_assessment()** - Assess model calibration for classification models
- **get_feature_descriptives(display_results=False)** - Generate descriptive statistics for features
- **plot_feature_importance(feature_importance, n_features=20)** - Plot feature importance
- **plot_shap_beeswarm(n_features=20, plot_type="beeswarm")** - Create SHAP visualization
- **plot_score_distribution(bins=None)** - Plot distribution of predicted values
- **plot_feature_interactions(feature_pairs=None, n_top_pairs=5)** - Plot feature interactions
- **plot_confusion_matrix()** - Plot confusion matrix for classification models
- **plot_lift_analysis()** - Plot comprehensive lift analysis
- **plot_performance_curves()** - Plot ROC and precision-recall curves
- **plot_learning_history()** - Plot learning curves for iterative models
- **plot_performance_comparison()** - Plot model performance for out-of-time validation

#### Examples:

```python
# Create an evaluator for a classification model
from gitlabds import ModelEvaluator

evaluator = ModelEvaluator(
    model=my_model,
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    y_test=y_test,
    classification=True,
    algo='xgb'
)

# Get all evaluation metrics
results = evaluator.evaluate()

# Display metrics in a formatted way
evaluator.display_metrics(results)

# Create visualizations
evaluator.plot_feature_importance(results.feature_importance)
evaluator.plot_confusion_matrix()
evaluator.plot_performance_curves()

# Save results to file
results.metrics_df.to_csv("metrics.csv")
results.classification_metrics_df.to_csv("classification_metrics.csv")
results.feature_importance.to_csv("feature_importance.csv")
```
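When out-of-time data is available, passing `x_oot`/`y_oot` unlocks the `plot_performance_comparison()` method listed above. A sketch, assuming `x_oot` and `y_oot` were held out from a later time window:

```python
# Evaluate against an out-of-time window to check for performance decay
evaluator_oot = ModelEvaluator(
    model=my_model,
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    x_oot=x_oot, y_oot=y_oot,
    classification=True,
    algo='xgb'
)

results_oot = evaluator_oot.evaluate()
evaluator_oot.plot_performance_comparison()  # train vs. test vs. out-of-time
```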
</details>

### Insight Generation
<details><summary> Marginal Effects </summary>

#### Description
Calculates and returns the marginal effects at the mean (MEM) for predictor fields.

`gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):`

#### Parameters:
- _**model**_ : model file from training
- _**x_test**_ : test "predictors" dataframe.
- _**dv_description**_ : Description of the outcome field to be used in text-based insights.
- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name will be used.

#### Returns
- Dataframe of marginal effects.

#### Examples:
```python
# Calculate marginal effects for a trained model
import gitlabds
effects_df = gitlabds.marginal_effects(
    model=trained_model,
    x_test=test_features,
    dv_description="probability of churn",
    field_labels={
        "tenure": "Customer tenure in months",
        "monthly_charges": "Average monthly bill amount",
        "total_charges": "Total amount charged to customer"
    }
)

# Display the marginal effects
display(effects_df)
```
</details>

<details><summary> Prescriptions </summary>

#### Description
Return "actionable" prescriptions and explanatory insights for each scored record. Insights list actionable prescriptions first, followed by explanatory insights. This approach is recommended for linear/logistic methodologies only. Caution should be used with black box approaches, as manipulating more than one prescription at a time could change a record's model score in unintended ways.

`gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all'):`

#### Parameters:
- _**model**_ : model file from training
- _**input_df**_ : train "predictors" dataframe.
- _**scored_df**_ : dataframe containing model scores.
- _**actionable_fields**_ : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: `Increasing` for prescriptions only when the field increases; `Decreasing` for prescriptions only when the field decreases; `Both` for when the field either increases or decreases.
- _**dv_description**_ : Description of the outcome field to be used in text-based insights.
- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name will be used.
- _**returned_insights**_ : Number of insights per record to return. Defaults to 5.
- _**only_actionable**_ : Only return actionable prescriptions.
- _**explanation_fields**_ : List of explainable (non-actionable insight) fields to return insights for. Defaults to 'all'.

#### Returns
- Dataframe of prescriptive actions. One row per record input.

#### Examples:
```python
# Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
results = gitlabds.prescriptions(
    model=model,
    input_df=my_df,
    scored_df=my_scores,
    actionable_fields={
        'spend': 'Increasing',
        'returns': 'Decreasing',
        'emails_sent': 'Both'
    },
    dv_description='likelihood to churn',
    field_labels={
        'spend': 'Dollars spent in last 6 months',
        'returns': 'Item returns in last 3 months',
        'emails_sent': 'Marketing emails sent in last month'
    },
    returned_insights=5,
    only_actionable=True,
    explanation_fields=['spend', 'returns']
)
```
</details>

### Model Monitoring

<details><summary> Generate Baseline Features </summary>

#### Description
Generate baseline feature distributions, importance scores, and drift thresholds in a single comprehensive artifact for model monitoring.

`gitlabds.generate_baseline_features(training_data, feature_importance_df, importance_method="shapley_values", n_bins=10, psi_warning=0.1, psi_critical=0.2, ks_warning=0.2, ks_critical=0.3, js_warning=0.1, js_critical=0.2, output_path="baseline_features.json"):`

#### Parameters:
- **_training_data_** : Training feature data DataFrame
- **_feature_importance_df_** : DataFrame with columns: feature, importance
- **_importance_method_** : Method used to calculate importance (default: "shapley_values")
- **_n_bins_** : Number of bins for numerical features (default: 10)
- **_psi_warning, psi_critical_** : PSI thresholds for warning and critical drift detection
- **_ks_warning, ks_critical_** : KS statistic thresholds for warning and critical drift detection
- **_js_warning, js_critical_** : JS divergence thresholds for warning and critical drift detection
- **_output_path_** : Path to save the JSON file

#### Returns
- None (saves baseline artifact to JSON file)

#### Examples:
```python
# Generate baseline features with default thresholds
import gitlabds
gitlabds.generate_baseline_features(
    training_data=train_df,
    feature_importance_df=importance_df,
    importance_method="shapley_values",
    output_path="model_baseline_features.json"
)
```

```python
# Generate with custom drift thresholds
gitlabds.generate_baseline_features(
    training_data=train_df,
    feature_importance_df=importance_df,
    n_bins=15,
    psi_warning=0.15,
    psi_critical=0.25,
    output_path="custom_baseline_features.json"
)
```
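One way to build the expected `feature_importance_df` (columns `feature` and `importance`) is mean absolute SHAP values. A sketch for a tree-based model using the shap dependency (`my_model` is a placeholder):

```python
import numpy as np
import pandas as pd
import shap

# Mean absolute SHAP value per feature as a global importance score
explainer = shap.TreeExplainer(my_model)
shap_values = explainer.shap_values(train_df)
if isinstance(shap_values, list):  # some classifiers return one array per class
    shap_values = shap_values[1]

importance_df = pd.DataFrame({
    "feature": train_df.columns,
    "importance": np.abs(shap_values).mean(axis=0),
})
```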
</details>

<details><summary> Generate Baseline Calibration </summary>

#### Description
Generate baseline calibration data with train + test curves, prediction statistics, and model configuration for monitoring model calibration drift.

`gitlabds.generate_baseline_calibration(train_predictions, train_actuals, test_predictions, test_actuals, model_configuration, n_bins=10, prediction_drift_warning=0.10, prediction_drift_critical=0.20, output_path="baseline_calibration.json"):`

#### Parameters:
- **_train_predictions_** : Training set predicted probabilities
- **_train_actuals_** : Training set actual binary outcomes (0/1)
- **_test_predictions_** : Test set predicted probabilities
- **_test_actuals_** : Test set actual binary outcomes (0/1)
- **_model_configuration_** : Dictionary of model configuration parameters
- **_n_bins_** : Number of bins for calibration curve (default: 10)
- **_prediction_drift_warning_** : Warning level for prediction score drift (default: 0.10)
- **_prediction_drift_critical_** : Critical level for prediction score drift (default: 0.20)
- **_output_path_** : Path to save the JSON file

#### Returns
- None (saves baseline calibration to JSON file)

#### Examples:
```python
# Generate baseline calibration
import gitlabds
model_config = {'f1_threshold': 0.15, 'random_state': 42}

gitlabds.generate_baseline_calibration(
    train_predictions=y_train_pred,
    train_actuals=y_train,
    test_predictions=y_test_pred,
    test_actuals=y_test,
    model_configuration=model_config,
    prediction_drift_warning=0.15,
    prediction_drift_critical=0.25,
    output_path="model_baseline_calibration.json"
)
```
</details>

<details><summary> Calculate Monitoring Metrics </summary>

#### Description
Calculate all monitoring metrics including feature drift, prediction drift, and health status for a model scoring run.

`gitlabds.calculate_monitoring_metrics(run_id, model_name, sub_model, model_version, scoring_date, feature_df, predictions, baseline_metrics, importance_threshold_pct=0.05):`

#### Parameters:
- **_run_id_** : Unique identifier for this scoring run
- **_model_name_** : Model name (e.g., "propensity_model")
- **_sub_model_** : Sub model identifier
- **_model_version_** : Model version (e.g., "2.1")
- **_scoring_date_** : Date of scoring
- **_feature_df_** : Feature data used for scoring
- **_predictions_** : Model predictions (probabilities 0-1)
- **_baseline_metrics_** : Dictionary containing baseline_features and baseline_calibration JSON data
- **_importance_threshold_pct_** : Percentage of total importance required for a feature to be considered "important"

#### Returns
- Dictionary containing DataFrames for each table:
  - 'scoring_summary': model_scoring_summary table
  - 'feature_drift': model_feature_drift table

#### Examples:
```python
# Calculate monitoring metrics for a scoring run
import json
import gitlabds
from datetime import datetime

# Load baseline metrics
with open('baseline_features.json', 'r') as f:
    baseline_features = json.load(f)
with open('baseline_calibration.json', 'r') as f:
    baseline_calibration = json.load(f)

baseline_metrics = {
    'baseline_features': baseline_features,
    'baseline_calibration': baseline_calibration
}

# Calculate metrics
results = gitlabds.calculate_monitoring_metrics(
    run_id="scoring_run_123",
    model_name="churn_prediction",
    sub_model="high_value_customers",
    model_version="2.1",
    scoring_date=datetime.now(),
    feature_df=current_features,
    predictions=model_predictions,
    baseline_metrics=baseline_metrics,
    importance_threshold_pct=0.05
)

# Access results
scoring_summary = results['scoring_summary']
feature_drift = results['feature_drift']
```
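Because both returned values are plain DataFrames, they can be appended to whatever monitoring store you use. A minimal sketch writing them out as dated CSVs (paths are placeholders):

```python
# Persist each monitoring table, stamped with the scoring date
stamp = datetime.now().strftime("%Y%m%d")
results['scoring_summary'].to_csv(f"monitoring/scoring_summary_{stamp}.csv", index=False)
results['feature_drift'].to_csv(f"monitoring/feature_drift_{stamp}.csv", index=False)
```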
</details>

### SQL and Trend Analysis

<details><summary> SQL Trend Query Generator </summary>

#### Description
Generate SQL for trend analysis across time periods. The generated SQL transforms regular data into a time-series format with columns for each time period, allowing for easy trend detection.

`gitlabds.generate_sql_trend_query(snapshot_date, date_field, date_unit='MONTH', periods=12, table_name=None, group_by_fields=None, metrics=None, filters=None, output_file=None):`

#### Parameters:
- **_snapshot_date_** : Reference date for analysis (e.g., '2025-04-08')
- **_date_field_** : Field name in the table that contains the date to analyze
- **_date_unit_** : Time unit for analysis: 'DAY', 'WEEK', 'MONTH', 'QUARTER', 'YEAR'
- **_periods_** : Number of time periods to analyze
- **_table_name_** : Table to query data from
- **_group_by_fields_** : Fields to group by (entity identifiers)
- **_metrics_** : Metrics to include in analysis with their properties. Each metric is a dict with:
  - name: output column name prefix
  - source: field name in the source table
  - aggregation: function to apply (AVG, SUM, MAX, etc.)
  - condition: optional WHERE condition
  - cumulative: if True, calculate period-over-period differences
  - is_case_expression: if True, the source is already a CASE WHEN expression
  - is_expression: if True, the source is a complex expression
- **_filters_** : SQL WHERE clause conditions as a string
- **_output_file_** : If provided, save the generated SQL to this file

#### Returns:
- The generated SQL query as a string

#### Examples:
```python
# Generate SQL for monthly trend analysis
import gitlabds

# Define metrics
metrics = [
    {"name": "active_users", "source": "monthly_active_users", "aggregation": "AVG"},
    {"name": "revenue", "source": "monthly_revenue", "aggregation": "SUM"},
    {"name": "projects", "source": "projects_created", "aggregation": "MAX", "cumulative": True}
]

# Generate SQL query
sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='transaction_date',
    date_unit='MONTH',
    periods=12,
    table_name='analytics.user_metrics',
    group_by_fields=['account_id'],
    metrics=metrics,
    filters="is_active = TRUE",
    output_file='trend_query.sql'
)
```

```python
# Generate SQL for daily trend analysis with custom conditions
metrics = [
    {"name": "logins", "source": "user_logins", "aggregation": "SUM"},
    {"name": "premium_logins", "source": "user_logins", "aggregation": "SUM",
     "condition": "subscription_tier = 'premium'"}
]

sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='login_date',
    date_unit='DAY',
    periods=30,
    table_name='analytics.daily_logins',
    metrics=metrics
)
```
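The returned string is plain SQL, so it can be executed with any warehouse client. A sketch using pandas with a SQLAlchemy engine; the `engine` and its connection string are placeholder assumptions, not part of gitlabds:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection; substitute your warehouse's SQLAlchemy URL and driver
engine = create_engine("snowflake://user:pass@account/db/schema")

# Execute the generated trend SQL and index by the grouping field
trend_data = pd.read_sql(sql, con=engine)
trend_data.set_index('account_id', inplace=True)
```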
</details>

<details><summary> Trend Analysis </summary>

#### Description
Calculate trend metrics for a dataframe produced by the SQL trend generator. This function analyzes time-series data to identify patterns like consecutive increases or decreases, the proportion of periods with growth or decline, and average percentage changes.

`gitlabds.trend_analysis(df, metric_list=None, time_unit='month', periods=6, include_cumulative=True, exclude_fields=None, verbose=False):`

#### Parameters:
- **_df_** : Dataframe containing trend data with time-based columns
- **_metric_list_** : List of metric names to analyze. If None, auto-detects metrics from columns
- **_time_unit_** : Time unit used in the column names (month, day, week, etc.)
- **_periods_** : Number of time periods to analyze
- **_include_cumulative_** : Whether to use cumulative (event) metrics when available
- **_exclude_fields_** : List of fields to exclude from auto-detection
- **_verbose_** : Whether to display intermediate output

#### Returns:
- A dataframe containing trend metrics for each specified metric, including:
  - Count of periods with decreases/increases
  - Count of consecutive decreases/increases
  - Average percentage change across periods

#### Examples:
```python
# Run trend analysis on data from SQL trend query
import gitlabds

# Run the SQL query to get trend data
trend_data = run_sql_query(trend_sql)  # Your function to execute SQL
trend_data.set_index('account_id', inplace=True)

# Analyze trends for all metrics
trends_df = gitlabds.trend_analysis(
    df=trend_data,
    time_unit='month',
    periods=12,
    verbose=True
)
```

```python
# Analyze trends for specific metrics
trends_df = gitlabds.trend_analysis(
    df=trend_data,
    metric_list=['active_users', 'revenue'],
    time_unit='month',
    periods=6,
    include_cumulative=True,
    exclude_fields=['has_data']
)

# Use trend metrics for customer health scoring
account_data['declining_usage'] = trends_df['consecutive_drop_active_users_period_6_months_cnt'] > 0
account_data['growth_score'] = trends_df['avg_perc_change_revenue_period_6_months'] * 100
```
</details>

## Gitlab Data Science

The [handbook](https://handbook.gitlab.com/handbook/enterprise-data/organization/data-science/) is the single source of truth for all of our documentation.

## Contributing

We welcome contributions and improvements; please see the [contribution guidelines](CONTRIBUTING.md).

## License

This code is distributed under the MIT license; please see the [LICENSE](LICENSE) file.
    "bugtrack_url": null,
    "license": null,
    "summary": "Gitlab Data Science and Modeling Tools",
    "version": "2.1.3",
    "project_urls": {
        "Homepage": "https://gitlab.com/gitlab-data/gitlabds"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b28c48bfcc460eae5475b4f25e8e25aed22529f18244e4a9571380fef66a7b7b",
                "md5": "72520a26174c5aefbcfb40b2109842b9",
                "sha256": "cab929325f4f3ac4eefc169554fe8ad8d835d0a33f752ef9064c7a646100a5f1"
            },
            "downloads": -1,
            "filename": "gitlabds-2.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "72520a26174c5aefbcfb40b2109842b9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 87612,
            "upload_time": "2025-07-11T22:28:48",
            "upload_time_iso_8601": "2025-07-11T22:28:48.248908Z",
            "url": "https://files.pythonhosted.org/packages/b2/8c/48bfcc460eae5475b4f25e8e25aed22529f18244e4a9571380fef66a7b7b/gitlabds-2.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c60de47744a8511b3bea53c3ddb3fdb4af2861e113be58accea1111d67043c4e",
                "md5": "59486f4f22404ad01250a8da64a767dd",
                "sha256": "72458030055b3b936ec3dcde04bad7cb2e9e41889ddf73a13f716ef8b6b774da"
            },
            "downloads": -1,
            "filename": "gitlabds-2.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "59486f4f22404ad01250a8da64a767dd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 101977,
            "upload_time": "2025-07-11T22:28:49",
            "upload_time_iso_8601": "2025-07-11T22:28:49.519040Z",
            "url": "https://files.pythonhosted.org/packages/c6/0d/e47744a8511b3bea53c3ddb3fdb4af2861e113be58accea1111d67043c4e/gitlabds-2.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-11 22:28:49",
    "github": false,
    "gitlab": true,
    "bitbucket": false,
    "codeberg": false,
    "gitlab_user": "gitlab-data",
    "gitlab_project": "gitlabds",
    "lcname": "gitlabds"
}
        
Elapsed time: 0.83566s