# gitlabds

- **Version:** 1.1.0
- **Summary:** Gitlab Data Science and Modeling Tools
- **Home page:** https://gitlab.com/gitlab-data/gitlabds
- **Author:** Kevin Dietz (kdietz@gitlab.com)
- **Requires Python:** >=3.10
- **Uploaded:** 2024-08-28
### What is it?
gitlabds is a set of tools designed to make it quicker and easier to build predictive models.


### Where to get it?
gitlabds can be installed directly via pip: `pip install gitlabds`.

Alternatively, you can download the source code from Gitlab at https://gitlab.com/gitlab-data/gitlabds and install it locally.


### Main Features	
- **Data prep tools:**
	- Treat outliers
	- Dummy code
	- Miss fill
	- Reduce feature space
	- Split and sample data into train/test
- **Modeling tools:**
	- Easily produce model metrics, feature importance, performance graphs, and lift/gains charts 
    - Generate model insights and prescriptions 

### References and Examples
<details><summary> MAD Outliers </summary>

#### Description
Median Absolute Deviation for outlier detection and correction. By default, winsorizes all numeric values in your dataframe that are more than 4 standard deviations above or below the median (the 'threshold' parameter).

`gitlabds.mad_outliers(df, dv=None, min_levels=10, columns='all', threshold=4, inplace=False, verbose=True, windsor_threshold=0.01, output_file=None, output_method='a')`

#### Parameters:
- **_df_** : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being winsorized. May be left blank if there is no outcome variable.
- **_min_levels_** : Only include columns that have at least the number of levels specified.
- **_columns_** : All numeric columns are examined by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' parameters.
- **_threshold_** : Winsorize values greater than this number of standard deviations from the median.
- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- **_verbose_** : Set to `True` to print outputs of the winsorizing being done. Set to `False` to suppress.
- **_windsor_threshold_** : Only winsorize values that affect less than this percentage of the population.
- **_output_file_** : Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.
- **_output_method_** : Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

#### Returns
- DataFrame with windsored values or None if `inplace=True`.
	
#### Examples:
		
```
#Create a new df; only winsorize selected columns; suppress verbose output
import gitlabds
new_df = gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], verbose=False)
```
```
#In-place outliers. Will winsorize values by altering the current dataframe
import gitlabds
gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = 'all', inplace=True)
```
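For intuition, the winsorization step is roughly equivalent to this standalone pandas/numpy sketch (illustrative only, not the package's actual implementation; `mad_winsorize` is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

def mad_winsorize(s: pd.Series, threshold: float = 4) -> pd.Series:
    # Hypothetical helper: cap values more than `threshold` scaled MADs from the median
    med = s.median()
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    mad = 1.4826 * (s - med).abs().median()
    return s.clip(lower=med - threshold * mad, upper=med + threshold * mad)

s = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0])  # 100.0 is an extreme value
capped = mad_winsorize(s, threshold=4)       # 100.0 is pulled in toward the median
```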
</details>
   
<details><summary> Missing Values Check </summary>

#### Description
Check for missing values.

`gitlabds.missing_check(df=None, threshold=0, by='column_name', ascending=True, return_missing_cols=False)`

#### Parameters:
- **_df_** : your pandas dataframe
- **_threshold_** : The percent of missing values at which a column is considered to have missing values. For example, threshold = .10 will only display columns with more than 10% of their values missing. Defaults to 0.
- **_by_** : Columns to sort by. Defaults to `column_name`. Also accepts `percent_missing`, `total_missing`, or a list.
- **_ascending_** : Sort ascending vs. descending. Defaults to ascending (ascending=True).
- **_return_missing_cols_** : Set to `True` to return a list of column names that meet the threshold criteria for missing. 

#### Returns
- List of columns that meet the missing-value threshold criteria, or None if `return_missing_cols=False`.

#### Examples:
		
```
#Check for missing values using default settings
gitlabds.missing_check(df=my_df, threshold = 0, by='column_name', ascending=True, return_missing_cols = False)
```
```
#Check for columns with more than 5% missing values and return a list of those columns
missing_list = gitlabds.missing_check(df=my_df, threshold = 0.05, by='column_name', ascending=True, return_missing_cols = True) 
```
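The underlying percent-missing calculation amounts to a single pandas expression (a sketch with illustrative column names, not the package's code):

```python
import pandas as pd

df = pd.DataFrame({"colA": [1, None, 3], "colB": [None, None, 6]})

percent_missing = df.isna().mean()  # fraction of missing values per column
# Columns exceeding a 50% missing threshold
over_threshold = percent_missing[percent_missing > 0.5].index.tolist()
```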
</details>

<details><summary> Missing Values Fill </summary>

#### Description
Fill missing values using a range of different options.

`gitlabds.missing_fill(df=None, columns='all', method='zero', inplace=False, output_file=None, output_method='a')`

#### Parameters:
- **_df_** : your pandas dataframe
- **_columns_** : Columns to miss fill. Defaults to `all`, which will miss fill all columns with missing values.
- **_method_** : Options are `zero`, `median`, `mean`, `drop_column`, and `drop_row`. Defaults to `zero`.
- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.
- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

#### Returns
- DataFrame with missing values filled or None if `inplace=True`.

#### Examples:
		
```
#Miss fill specified columns with the mean value into a new dataframe
new_df = gitlabds.missing_fill(df=my_df, columns=['colA', 'colB', 'colC'], method='mean', inplace=False)
```
```
#Miss fill all values with zero in place.
gitlabds.missing_fill(df=my_df, columns='all', method='zero', inplace=True)   
```
</details>

<details><summary> Dummy Code </summary>

#### Description
Dummy code (AKA "one-hot encode") categorical and numeric columns based on the parameters specified below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not.

`gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels = 20, numeric_max_levels = 10, dummy_na=False, output_file=None, output_method='a')`

#### Parameters:
- **_df_** : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- **_columns_** : All columns are examined by default. To limit to just a subset of columns, pass a list of column names.
- **_categorical_** : Set to `True` to attempt to dummy code any categorical column passed via the `columns` parameter.
- **_numeric_** : Set to `True` to attempt to dummy code any numeric column passed via the `columns` parameter.
- **_categorical_max_levels_** : Maximum number of levels a categorical column can have to be eligible for dummy coding.
- **_numeric_max_levels_** : Maximum number of levels a numeric column can have to be eligible for dummy coding.
- **_dummy_na_** : Set to `True` to create a dummy coded column for missing values.
- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.
- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

#### Returns
- DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.

#### Examples:
		
```
#Dummy code only categorical columns with a maximum of 30 levels. Do not dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns='all', categorical=True, numeric=False, categorical_max_levels = 30, dummy_na=False)
```
```
#Dummy code only the columns specified in the `columns` parameter, with a maximum of 10 levels for categorical and numeric. Also dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns=['colA', 'colB', 'colC'], categorical=True, numeric=True, categorical_max_levels = 10, numeric_max_levels = 10, dummy_na=True)
```
</details>

<details><summary> Top Dummies </summary>

#### Description
Dummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.

`gitlabds.dummy_top(df=None, dv=None, columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True, output_file=None, output_method='a')`

#### Parameters:
- **_df_** : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- **_columns_** : All columns are examined by default. To limit to just a subset of columns, pass a list of column names.
- **_min_threshold_** : The threshold at which levels will be dummy coded. For example, the default value of `0.05` will dummy code any categorical level that is in at least 5% of all rows.
- **_drop_categorial_** : Set to `True` to drop categorical columns after they are considered for dummy coding. Set to `False` to keep the original categorical columns in the dataframe.
- **_verbose_** : Set to `True` to print detailed list of all dummy columns being created. Set to `False` to suppress.
- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.
- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

#### Returns
- DataFrame with dummy coded columns.

#### Examples:
		
```
#Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True)
```
```
#Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; suppress verbose printout and retain original categorical columns.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], min_threshold = 0.10, drop_categorial=False, verbose=False)
```
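Conceptually, thresholded dummy coding keeps only the frequent levels; a minimal pandas sketch (illustrative data and names, not the package's implementation):

```python
import pandas as pd

df = pd.DataFrame({"plan": ["free"] * 12 + ["premium"] * 4 + ["trial"] * 4})

freq = df["plan"].value_counts(normalize=True)
keep = freq[freq >= 0.25].index  # only levels present in at least 25% of rows
for level in keep:
    df[f"plan_{level}"] = (df["plan"] == level).astype(int)
```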
</details>




<details><summary> Remove Low Variation columns </summary>

#### Description
Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.

`gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, inplace=False, verbose=True)`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
- **_columns_** : All columns are examined by default. To limit to just a subset of columns, pass a list of column names.
- **_threshold_** : The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of `0.98` will drop any column where one value is present in more than 98% of rows.
- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- **_verbose_** : Set to `True` to print the columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with low variation columns dropped or None if `inplace=True`.

#### Examples:
		
```
#Drop any columns (except for the outcome) where one value is present in more than 95% of rows. A new dataframe will be created.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
```
```
#Drop any of the selected columns where one value is present in more than 99% of rows. Operation will be done in place on the existing dataframe.
gitlabds.remove_low_variation(df=my_df, dv=None, columns = ['colA', 'colB', 'colC'], threshold=.99, inplace=True)
```
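The variation check boils down to each column's modal share; a hedged pandas sketch (illustrative data, not the package's code):

```python
import pandas as pd

df = pd.DataFrame({"flat": [1] * 99 + [2], "varied": list(range(100))})

# Drop columns where the single most common value exceeds 98% of rows
to_drop = [c for c in df.columns
           if df[c].value_counts(normalize=True).iloc[0] > 0.98]
reduced = df.drop(columns=to_drop)
```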
</details>

<details><summary> Correlation Reduction </summary>

#### Description
Reduce the number of columns in a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped. Uses Pearson's correlation coefficient.

`gitlabds.correlation_reduction(df=None, dv=None, threshold = 0.90, inplace=False, verbose=True)`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dropped. May be left blank if there is no outcome variable.
- **_threshold_**: The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. For example, the default value of `0.90` will identify columns that have correlations greater than 90% to each other and drop one of those columns.
- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- **_verbose_** : Set to `True` to print the columns being dropped. Set to `False` to suppress.

#### Returns
- DataFrame with half of highly correlated columns dropped or None if `inplace=True`.

#### Examples:
		
```
#Perform column reduction via correlation using a threshold of 95%. A new dataframe will be created.
new_df = gitlabds.correlation_reduction(df=my_df, dv=None, threshold = 0.95, verbose=True)
```
```
#Perform column reduction via correlation using a threshold of 90%. Operation will be done in place on the existing dataframe.
gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold = 0.90, inplace=True, verbose=True)
```
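Dropping one column from each highly correlated pair can be sketched with pandas alone (illustrative; the package's internals may differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a, "a_scaled": a * 2 + 0.01, "b": rng.normal(size=200)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]
reduced = df.drop(columns=to_drop)
```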
</details>

<details><summary> Drop Categorical columns </summary>

#### Description
Drop all categorical columns from the dataframe. A useful step before regression modeling, as unencoded categorical variables cannot be used directly.

`gitlabds.drop_categorical(df, inplace=False)`

#### Parameters:
- _**df**_ : your pandas dataframe
- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.

#### Returns
- DataFrame with categorical columns dropped or None if `inplace=True`.

#### Examples:
		
```
#Dropping categorical columns and creating a new dataframe
new_df = gitlabds.drop_categorical(df=my_df) 
```
```
#Dropping categorical columns in place
gitlabds.drop_categorical(df=my_df, inplace=True) 
```
</details>


<details><summary> Remove Outcome Proxies </summary>

#### Description
Remove columns that are highly correlated with the outcome (target) column.

`gitlabds.dv_proxies(df, dv, threshold=.8, inplace=False)`

#### Parameters:
- _**df**_ : your pandas dataframe
- _**dv**_ : The column name of your outcome.    
- _**threshold**_ : The Pearson's correlation value to the outcome above which columns will be dropped. For example, the default value of `0.80` will identify and drop columns that have correlations greater than 80% to the outcome.    
- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.

#### Returns
- DataFrame with outcome proxy columns dropped or None if `inplace=True`.

#### Examples:
		
```
#Drop columns with correlations to the outcome greater than 70% and create a new dataframe
new_df = gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.7)    
```
```
#Drop columns with correlations to the outcome greater than 80% in place
gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.8, inplace=True)        
```
</details>


<details><summary> Split and Sample Data </summary>

#### Description
This function will split your data into train and test datasets, separating the outcome from the rest of the file. The resulting datasets will be named x_train, y_train, x_test, and y_test.

`gitlabds.split_data(df, train_pct=.7, dv=None, dv_threshold=.0, random_state = 5435)`

#### Parameters:
- _**df**_ : your pandas dataframe
- _**train_pct**_ : The percentage of rows randomly assigned to the training dataset.
- _**dv**_ : The column name of your outcome.
- _**dv_threshold**_ : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to up-sample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
- _**random_state**_ : Random seed to use for splitting the dataframe and for up-sampling (if needed).

#### Returns
- Four dataframes (x_train, y_train, x_test, y_test) and a list of model weights.

#### Examples:
		
```
#Split into train and test datasets with 70% of rows in train and 30% in test and change random seed.
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.70, dv_threshold=0, random_state = 64522)
```
```
#Split into train and test datasets with 80% of rows in train and 20% in test; Up-sample if needed to hit 10% threshold.
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.80, dv_threshold=0.1)
```
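The core split (without the SMOTE up-sampling step) is conceptually just a seeded random partition; a minimal pandas sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"x1": range(100), "x2": range(100, 200), "y": [0, 1] * 50})

train = df.sample(frac=0.7, random_state=5435)  # 70% of rows to train
test = df.drop(train.index)                     # remaining 30% to test
x_train, y_train = train.drop(columns="y"), train["y"]
x_test, y_test = test.drop(columns="y"), test["y"]
```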
</details>

<details><summary> Model Metrics </summary>

#### Description
Display a variety of model metrics for linear and logistic predictive models.

`gitlabds.model_metrics(model, x_train, y_train, x_test, y_test, show_graphs=True, f_score = 0.50, classification = True, algo=None, decile_n=10, top_features_n=20)`

#### Parameters:
- _**model**_ : model file from training
- _**x_train**_ : train "predictors" dataframe. 
- _**y_train**_ : train outcome/dv/target dataframe
- _**x_test**_ : test "predictors" dataframe. 
- _**y_test**_ : test outcome/dv/target dataframe
- _**show_graphs**_ : Set to `True` to show visualizations
- _**f_score**_ : Cut point for determining a correct classification. Must also set `classification` to `True` to enable.
- _**classification**_ : Set to `True` to show classification model metrics (accuracy, precision, recall, F1). Set `show_graphs` to `True` to display the confusion matrix.
- _**algo**_ : Select the algorithm used to display additional model metrics. Supports `rf`, `xgb`, `logistic`, `elasticnet`, and `None`. If your model type is not listed, try `None` and some model metrics should still generate.
- _**top_features_n**_ : Print a list of the top n features present in the model.
- _**decile_n**_ : Specify the number of groups to create to calculate lift. Defaults to `10` (deciles).



#### Returns
- Separate dataframes for `model_metrics`, `lift`, and (optionally) `class_model_metrics`; lists for `top_features` and `decile_breaks`.

#### Examples:
		
```
#Display model metrics from an XGBoost model. Return classification metrics using a cut point of 0.30 F-Score
model_metrics, lift, class_metrics, top_features, decile_breaks = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=True, f_score = 0.3, classification=True, algo='xgb', top_features_n=20, decile_n=10)
```

```
#Display model metrics from a logistic model. Do not return classification metrics and suppress visualizations
model_metrics, lift, top_features, decile_breaks = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=False, classification=False, algo='logistic',top_features_n=20, decile_n=10)
```
</details>

<details><summary> Marginal Effects </summary>

#### Description
Calculates and returns the marginal effects at the mean (MEM) for predictor fields.

`gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None)`

#### Parameters:
- _**model**_ : model file from training
- _**x_test**_ : test "predictors" dataframe.
- _**dv_description**_ : Description of the outcome field to be used in text-based insights. 
- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name will be used.


#### Returns
- Dataframe of marginal effects.

#### Examples:
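A usage sketch in the style of the other entries (the model objects and field names are illustrative):

```
#Return marginal effects at the mean, with descriptive labels for two predictors
mem = gitlabds.marginal_effects(model=model, x_test=x_test, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months'})
```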
		
</details>

<details><summary> Prescriptions </summary>

#### Description
Return "actionable" prescriptions and explanatory insights for each scored record. Insights first list actionable prescriptions, followed by explanatory insights. This approach is recommended for linear/logistic methodologies only. Caution should be used with a black-box approach, as manipulating more than one prescription at a time could change a record's model score in unintended ways.

`gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all')`

#### Parameters:
- _**model**_ : model file from training
- _**input_df**_ : train "predictors" dataframe. 
- _**scored_df**_ : dataframe containing model scores.
- _**actionable_fields**_ : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: `Increasing` for prescriptions only when the field increases; `Decreasing` for prescriptions only when the field decreases; `Both` for when the field either increases or decreases.   
- _**dv_description**_ : Description of the outcome field to be used in text-based insights.
- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name will be used.
- _**returned_insights**_ : Number of insights per record to return. Defaults to 5.
- _**only_actionable**_ : Only return actionable prescriptions.
- _**explanation_fields**_ : List of explainable (non-actionable insight) fields to return insights for. Defaults to 'all'.

#### Returns
- Dataframe of prescriptive actions. One row per record input.

#### Examples:
		
```
#Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
gitlabds.prescriptions(model=model, input_df=my_df, scored_df=my_scores, actionable_fields={'spend':'Increasing', 'returns':'Decreasing', 'emails_sent':'Both'}, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months', 'emails_sent':'Marketing emails sent in last month'}, returned_insights=5, only_actionable=True, explanation_fields=['spend', 'returns'])
```
</details>


## Gitlab Data Science

The [handbook](https://about.gitlab.com/handbook/business-technology/data-team/organization/data-science/) is the single source of truth for all of our documentation. 

### Contributing

We welcome contributions and improvements, please see the [contribution guidelines](CONTRIBUTING.md).

### License

This code is distributed under the MIT license, please see the [LICENSE](LICENSE) file.




            

Raw data

            {
    "_id": null,
    "home_page": "https://gitlab.com/gitlab-data/gitlabds",
    "name": "gitlabds",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Kevin Dietz",
    "author_email": "kdietz@gitlab.com",
    "download_url": "https://files.pythonhosted.org/packages/3d/0b/0e836e30aabcd75f90b4c6bc094a828a6adb07df5e54a57a039cf378a842/gitlabds-1.1.0.tar.gz",
    "platform": null,
    "description": "### What is it?\ngitlabds is a set of tools designed make it quicker and easier to build predictive models.\n\n\n### Where to get it?\ngitlabds can be installed directly via pip: `pip install gitlabds`.\n\nAlternatively, you can download the source code from Gitlab at https://gitlab.com/gitlab-data/gitlabds and compile locally.\n\n\n### Main Features\t\n- **Data prep tools:**\n\t- Treat outliers\n\t- Dummy code\n\t- Miss fill\n\t- Reduce feature space\n\t- Split and sample data into train/test\n- **Modeling tools:**\n\t- Easily produce model metrics, feature importance, performance graphs, and lift/gains charts \n    - Generate model insights and prescriptions \n\n### References and Examples\n<details><summary> MAD Outliers </summary>\n\n#### Description\nMedian Absoutely Deviation for outlier detection and correction. By default will windsor all numeric values in your dataframe that are more than 4 standard deviations above or below the median ('threshold').\n\n`gitlabds.mad_outliers(df, dv=None, min_levels=10, columns = 'all', threshold=4, inplace=False, verbose=True, windsor_threshold=0.01, output_file=None, output_method='a'):`\n\n#### Parameters:\n- **_df_** : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable in will prevent it from being windsored. May be left blank there is no outcome variable.\n- **_min_levels_** : Only include columns that have at least the number of levels specified. \n- **_columns_** : Will examine at all numeric columns by default. To limit to just  a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' paramaters. \n- **_threshold_** : Windsor values greater than this number of standard deviations from the median.\n- **_inplace_** : Set to `True` to replace existing dataframe. Set to false to create a new one. 
Set to `False` to suppress\n- **_verbose_** : Set to `True` to print outputs of windsoring being done. Set to `False` to suppress.\n- **_windsor_threshold_** : Only windsor values that affect less than this percentage of the population.  \n- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.\n- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.\n\n#### Returns\n- DataFrame with windsored values or None if `inplace=True`.\n\t\n#### Examples:\n\t\t\n```\n#Create a new df; only windsor selected columns; suppress verbose\nimport gitlabds\nnew_df = gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], verbose=False)\n```\n```\n#Inplace outliers. Will windsor values by altering the current dataframe\nimport gitlabds\ngitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = 'all', inplace=True)\n```\n</details>\n   \n<details><summary> Missing Values Check </summary>\n\n#### Description\nCheck for missing values.\n\n`gitlabds.missing_check(df=None, threshold = 0, by='column_name', ascending=True, return_missing_cols = False):`\n\n#### Parameters:\n- **_df_** : your pandas dataframe\n- **_threshold_** : The percent of missing values at which a column is considered to have missing values. For example, threshold = .10 will only display columns with more than 10% of its values missing. Defaults to 0.\n- **_by_** : Columns to sort by. Defaults to `column_name`. Also accepts `percent_missing`, `total_missing`, or a list.\n- **_ascending_** : Sort ascending vs. descending. Defaults to ascending (ascending=True).\n- **_return_missing_cols_** : Set to `True` to return a list of column names that meet the threshold criteria for missing. 
\n\n#### Returns\n- List of columns with missing values filled or None if `return_missing_cols=False`.\n\n#### Examples:\n\t\t\n```\n#Check for missing values using default settings\ngitlabds.missing_check(df=my_df, threshold = 0, by='column_name', ascending=True, return_missing_cols = False)\n```\n```\n#Check for columns with more than 5% missing values and return a list of those columns\nmissing_list = gitlabds.missing_check(df=my_df, threshold = 0.05, by='column_name', ascending=True, return_missing_cols = True) \n```\n</details>\n\n<details><summary> Missing Values Fill </summary>\n\n#### Description\nFill missing values using a range of different options.\n\n`gitlabds.missing_fill(df=None, columns='all', method='zero', inplace=False, output_file=None, output_method='a'):`\n\n#### Parameters:\n- **_df_** : your pandas dataframe\n- **_columns_** : Columns which to miss fill. Defaults to `all` which will miss fill all columns with missing values.\n- **_method_** : Options are `zero`, `median`, `mean`, `drop_column`, and `drop_row`. Defaults to `zero`.\n- **_inplace_** : Set to `True` to replace existing dataframe. Set to false to create a new one. Set to `False` to suppress\n- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.\n- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. 
Defaults to 'a'.\n\n#### Returns\n- DataFrame with missing values filled or None if `inplace=True`.\n\n#### Examples:\n\t\t\n```\n#Miss fill specificied columns with the mean value into a new dataframe\nnew_df = gitlabds,missing_fill(df=my_df, columns=['colA', 'colB', 'colC'], method='mean', inplace=False):\n```\n```\n#Miss fill all values with zero in place.\ngitlabds.missing_fill(df=my_df, columns='all', method='zero', inplace=True)   \n```\n</details>\n\n<details><summary> Dummy Code </summary>\n\n#### Description\nDummy code (AKA \"one-hot encode\") categorical and numeric columns based on the paremeters specificed below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not\n\n`gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels = 20, numeric_max_levels = 10, dummy_na=False, output_file=None, output_method='a'):`\n\n#### Parameters:\n- **_df_** : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable in will prevent it from being dummy coded. May be left blank there is no outcome variable.\n- **_columns_** : Will examine at all columns by default. To limit to just  a subset of columns, pass a list of column names. \n- **_categorical_** : Set to `True` to attempt to dummy code any categorical column passed via the `columns` parameter.\n- **_numeric_** : Set to `True` to attempt to dummy code any numeric column passed via the `columns` parameter.\n- **_categorical_max_levels_** : Maximum number of levels a categorical column can have to be eligable for dummy coding.\n- **_categorical_max_levels_** : Maximum number of levels a numeric column can have to be eligable for dummy coding.\n- **_dummy_na_** : Set to `True` to create a dummy coded column for missing values.\n- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. 
Defaults to `None`.\n- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.\n\n#### Returns\n- DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.\n\n#### Examples:\n\t\t\n```\n#Dummy code only categorical columns with a maxinum of 30 levels. Do not dummy code missing values\nnew_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns='all', categorical=True, numeric=False, categorical_max_levels = 30, dummy_na=False)\n```\n```\n#Dummy code only columns specified in the `columns` parameter with a maxinum of 10 levels for categorical and numeric. Also dummy code missing values\nnew_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns= ['colA', colB', 'colC'], categorical=True, numeric=True, categorical_max_levels = 10, numeric_max_levels = 10,  dummy_na=True)\n```\n</details>\n\n<details><summary> Top Dummies </summary>\n\n#### Description\nDummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.\n\n`gitlabds.dummy_top(df=None, dv=None, columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True, output_file=None, output_method='a'):`\n\n#### Parameters:\n- **_df_** : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable in will prevent it from being dummy coded. May be left blank there is no outcome variable.\n- **_columns_** : Will examine at all columns by default. To limit to just  a subset of columns, pass a list of column names. \n- **_min_threshold_**: The threshold at which levels will be dummy coded. 
For example, the default value of `0.05` will dummy code any categorical level that is in at least 5% of all rows.\n- **_drop_categorical_**: Set to `True` to drop categorical columns after they are considered for dummy coding. Set to `False` to keep the original categorical columns in the dataframe.\n- **_verbose_** : Set to `True` to print a detailed list of all dummy columns being created. Set to `False` to suppress.\n- **_output_file_**: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to `None`.\n- **_output_method_**: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.\n\n#### Returns\n- DataFrame with dummy coded columns.\n\n#### Examples:\n\t\t\n```\n#Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows.\nnew_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = 'all', min_threshold = 0.05, drop_categorical=True, verbose=True)\n```\n```\n#Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; suppress verbose printout and retain original categorical columns.\nnew_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], min_threshold = 0.10, drop_categorical=False, verbose=False)\n```\n</details>\n\n<details><summary> Remove Low Variation columns </summary>\n\n#### Description\nRemove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.\n\n`gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, inplace=False, verbose=True):`\n\n#### Parameters:\n- _**df**_ : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.\n- **_columns_** : Will examine all columns by default. 
To limit to just a subset of columns, pass a list of column names.\n- **_threshold_**: The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of `0.98` will drop any column where one value is present in more than 98% of rows.\n- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.\n- **_verbose_** : Set to `True` to print the columns being dropped. Set to `False` to suppress.\n\n#### Returns\n- DataFrame with low variation columns dropped or None if `inplace=True`.\n\n#### Examples:\n\t\t\n```\n#Drop any columns (except for the outcome) where one value is present in more than 95% of rows. A new dataframe will be created.\nnew_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)\n```\n```\n#Drop any of the selected columns where one value is present in more than 99% of rows. Operation will be done in place on the existing dataframe.\ngitlabds.remove_low_variation(df=my_df, dv=None, columns = ['colA', 'colB', 'colC'], threshold=.99, inplace=True)\n```\n</details>\n\n<details><summary> Correlation Reduction </summary>\n\n#### Description\nReduce the number of columns in a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped. Uses Pearson's correlation coefficient.\n\n`gitlabds.correlation_reduction(df=None, dv=None, threshold = 0.90, inplace=False, verbose=True):`\n\n#### Parameters:\n- _**df**_ : your pandas dataframe\n- **_dv_** : The column name of your outcome. Entering your outcome variable will prevent it from being dropped. May be left blank if there is no outcome variable.\n- **_threshold_**: The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. 
For example, the default value of `0.90` will identify columns that have correlations greater than 90% to each other and drop one of those columns.\n- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.\n- **_verbose_** : Set to `True` to print the columns being dropped. Set to `False` to suppress.\n\n#### Returns\n- DataFrame with half of highly correlated columns dropped or None if `inplace=True`.\n\n#### Examples:\n\t\t\n```\n#Perform column reduction via correlation using a threshold of 95%, excluding the outcome column. A new dataframe will be created.\nnew_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold = 0.95, verbose=True)\n```\n```\n#Perform column reduction via correlation using a threshold of 90%. Operation will be done in place on the existing dataframe.\ngitlabds.correlation_reduction(df=my_df, dv=None, threshold = 0.90, inplace=True, verbose=True)\n```\n</details>\n\n<details><summary> Drop Categorical columns </summary>\n\n#### Description\nDrop all categorical columns from the dataframe. A useful step before regression modeling, as categorical variables are not used.\n\n`gitlabds.drop_categorical(df, inplace=False):`\n\n#### Parameters:\n- _**df**_ : your pandas dataframe\n- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one. 
\n\n#### Returns\n- DataFrame with categorical columns dropped or None if `inplace=True`.\n\n#### Examples:\n\t\t\n```\n#Dropping categorical columns and creating a new dataframe\nnew_df = gitlabds.drop_categorical(df=my_df)\n```\n```\n#Dropping categorical columns in place\ngitlabds.drop_categorical(df=my_df, inplace=True)\n```\n</details>\n\n<details><summary> Remove Outcome Proxies </summary>\n\n#### Description\nRemove columns that are highly correlated with the outcome (target) column.\n\n`gitlabds.dv_proxies(df, dv, threshold=.8, inplace=False):`\n\n#### Parameters:\n- _**df**_ : your pandas dataframe\n- _**dv**_ : The column name of your outcome.\n- _**threshold**_ : The Pearson's correlation value to the outcome above which columns will be dropped. For example, the default value of `0.80` will identify and drop columns that have correlations greater than 80% to the outcome.\n- **_inplace_** : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.\n\n#### Returns\n- DataFrame with outcome proxy columns dropped or None if `inplace=True`.\n\n#### Examples:\n\t\t\n```\n#Drop columns with correlations to the outcome greater than 70% and create a new dataframe\nnew_df = gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.7)\n```\n```\n#Drop columns with correlations to the outcome greater than 80% in place\ngitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.8, inplace=True)\n```\n</details>\n\n<details><summary> Split and Sample Data </summary>\n\n#### Description\nThis function will split your data into train and test datasets, separating the outcome from the rest of the file. 
The resultant datasets will be named x_train, y_train, x_test, and y_test.\n\n`gitlabds.split_data(df, train_pct=.7, dv=None, dv_threshold=.0, random_state = 5435):`\n\n#### Parameters:\n- _**df**_ : your pandas dataframe\n- _**train_pct**_ : The percentage of rows randomly assigned to the training dataset.\n- _**dv**_ : The column name of your outcome.\n- _**dv_threshold**_ : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.\n- _**random_state**_ : Random seed to use for splitting the dataframe and for up-sampling (if needed).\n\n#### Returns\n- Four dataframes for train and test and a list of model weights.\n\n#### Examples:\n\t\t\n```\n#Split into train and test datasets with 70% of rows in train and 30% in test and change random seed.\nx_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.70, dv_threshold=0, random_state = 64522)\n```\n```\n#Split into train and test datasets with 80% of rows in train and 20% in test; up-sample if needed to hit the 10% threshold.\nx_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.80, dv_threshold=0.1)\n```\n</details>\n\n<details><summary> Model Metrics </summary>\n\n#### Description\nDisplay a variety of model metrics for linear and logistic predictive models.\n\n`gitlabds.model_metrics(model, x_train, y_train, x_test, y_test, show_graphs=True, f_score = 0.50, classification = True, algo=None, decile_n=10, top_features_n=20):`\n\n#### Parameters:\n- _**model**_ : model file from training\n- _**x_train**_ : train \"predictors\" dataframe.\n- _**y_train**_ : train outcome/dv/target dataframe\n- _**x_test**_ : test \"predictors\" dataframe. 
\n- _**y_test**_ : test outcome/dv/target dataframe\n- _**show_graphs**_ : Set to `True` to show visualizations.\n- _**f_score**_ : Cut point for determining a correct classification. Must also set classification to `True` to enable.\n- _**classification**_ : Set to `True` to show classification model metrics (accuracy, precision, recall, F1). Set show_graphs to `True` to display the confusion matrix.\n- _**algo**_ : Select the algorithm used to display additional model metrics. Supports `rf`, `xgb`, `logistic`, `elasticnet`, and `None`. If your model type is not listed, try `None` and some model metrics should still generate.\n- _**top_features_n**_ : Print a list of the top n features present in the model.\n- _**decile_n**_ : Specify the number of groups to create to calculate lift. Defaults to `10` (deciles).\n\n#### Returns\n- Separate dataframes for `model_metrics`, `lift`, and `class_model_metrics` (optional). Lists for `top_features` and `decile_breaks`.\n\n#### Examples:\n\t\t\n```\n#Display model metrics from an XGBoost model. Return classification metrics using a cut point of 0.30 F-Score\nmodel_metrics, lift, class_metrics, top_features, decile_breaks = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=True, f_score = 0.3, classification=True, algo='xgb', top_features_n=20, decile_n=10)\n```\n\n```\n#Display model metrics from a logistic model. 
Do not return classification metrics and suppress visualizations\nmodel_metrics, lift, top_features, decile_breaks = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=False, classification=False, algo='logistic', top_features_n=20, decile_n=10)\n```\n</details>\n\n<details><summary> Marginal Effects </summary>\n\n#### Description\nCalculates and returns the marginal effects at the mean (MEM) for predictor fields.\n\n`gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):`\n\n#### Parameters:\n- _**model**_ : model file from training\n- _**x_test**_ : test \"predictors\" dataframe.\n- _**dv_description**_ : Description of the outcome field to be used in text-based insights.\n- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This field is optional and by default will use the field name.\n\n#### Returns\n- Dataframe of marginal effects.\n\n</details>\n\n<details><summary> Prescriptions </summary>\n\n#### Description\nReturn \"actionable\" prescriptions and explanatory insights for each scored record. Insights first list actionable prescriptions followed by explanatory insights. This approach is recommended for linear/logistic methodologies only. Caution should be used with a black box approach, as manipulating more than one prescription at a time could change a record's model score in unintended ways.\n\n`gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all'):`\n\n#### Parameters:\n- _**model**_ : model file from training\n- _**input_df**_ : train \"predictors\" dataframe.\n- _**scored_df**_ : dataframe containing model scores.\n- _**actionable_fields**_ : Dict of actionable fields. The key is the field/feature/predictor name. 
The value accepts one of three values: `Increasing` for prescriptions only when the field increases; `Decreasing` for prescriptions only when the field decreases; `Both` for when the field either increases or decreases.\n- _**dv_description**_ : Description of the outcome field to be used in text-based insights.\n- _**field_labels**_ : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This field is optional and by default will use the field name.\n- _**returned_insights**_ : Number of insights per record to return. Defaults to 5.\n- _**only_actionable**_ : Only return actionable prescriptions.\n- _**explanation_fields**_ : List of explainable (non-actionable insight) fields to return insights for. Defaults to 'all'.\n\n#### Returns\n- Dataframe of prescriptive actions. One row per record input.\n\n#### Examples:\n\t\t\n```\n#Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':\ngitlabds.prescriptions(model=model, input_df=my_df, scored_df=my_scores, actionable_fields={'spend':'Increasing', 'returns':'Decreasing', 'emails_sent':'Both'}, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months', 'emails_sent':'Marketing emails sent in last month'}, returned_insights=5, only_actionable=True, explanation_fields=['spend', 'returns'])\n```\n</details>\n\n\n## Gitlab Data Science\n\nThe [handbook](https://about.gitlab.com/handbook/business-technology/data-team/organization/data-science/) is the single source of truth for all of our documentation.\n\n### Contributing\n\nWe welcome contributions and improvements; please see the [contribution guidelines](CONTRIBUTING.md).\n\n### License\n\nThis code is distributed under the MIT license; please see the [LICENSE](LICENSE) file.\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Gitlab Data Science and Modeling Tools",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://gitlab.com/gitlab-data/gitlabds"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3ed4ce29dc280a15a6d136220a2b40c5fc1ecd96fb3ecd64de70c5ce0bf93402",
                "md5": "dd9caaa87313ec64339646c1cd4aebab",
                "sha256": "a8a8c504babac283639206d0b4886b9e10cf29ee8e8cdd9aa057e979751a1c5e"
            },
            "downloads": -1,
            "filename": "gitlabds-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dd9caaa87313ec64339646c1cd4aebab",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 29349,
            "upload_time": "2024-08-28T17:02:36",
            "upload_time_iso_8601": "2024-08-28T17:02:36.642142Z",
            "url": "https://files.pythonhosted.org/packages/3e/d4/ce29dc280a15a6d136220a2b40c5fc1ecd96fb3ecd64de70c5ce0bf93402/gitlabds-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3d0b0e836e30aabcd75f90b4c6bc094a828a6adb07df5e54a57a039cf378a842",
                "md5": "1a08a5ae71fee56b7c7d31fb7153136d",
                "sha256": "6c2afdc54e93ffc3dbadc310d3d5cb2095f8f2977750c4c1b45fe130cdf1146c"
            },
            "downloads": -1,
            "filename": "gitlabds-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1a08a5ae71fee56b7c7d31fb7153136d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 29300,
            "upload_time": "2024-08-28T17:02:38",
            "upload_time_iso_8601": "2024-08-28T17:02:38.093659Z",
            "url": "https://files.pythonhosted.org/packages/3d/0b/0e836e30aabcd75f90b4c6bc094a828a6adb07df5e54a57a039cf378a842/gitlabds-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-28 17:02:38",
    "github": false,
    "gitlab": true,
    "bitbucket": false,
    "codeberg": false,
    "gitlab_user": "gitlab-data",
    "gitlab_project": "gitlabds",
    "lcname": "gitlabds"
}
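As a plain-pandas reference for the behavior documented above, here is a minimal sketch of the low-variation rule that `gitlabds.remove_low_variation` describes (drop any column where a single value accounts for more than `threshold` of rows, never dropping the outcome). This is a hypothetical re-implementation for illustration, not gitlabds's actual code; the function name `remove_low_variation_sketch` is invented.

```python
import pandas as pd

def remove_low_variation_sketch(df, dv=None, threshold=0.98):
    """Drop columns dominated by a single value, per the rule described above."""
    keep = []
    for col in df.columns:
        if col == dv:  # never drop the outcome column
            keep.append(col)
            continue
        # Share of rows taken by the most common value (NaN counted as a value)
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share <= threshold:  # enough variation -> keep the column
            keep.append(col)
    return df[keep]

df = pd.DataFrame({
    "constant": [1] * 99 + [2],  # one value in 99% of rows -> dropped at 0.95
    "varied": list(range(100)),  # all values distinct -> kept
    "target": [0, 1] * 50,       # outcome column -> always kept
})
reduced = remove_low_variation_sketch(df, dv="target", threshold=0.95)
print(list(reduced.columns))  # ['varied', 'target']
```

The real gitlabds function additionally supports `columns`, `inplace`, and `verbose`; this sketch only shows the core thresholding logic.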
        