# cheutils
A package of basic reusable utilities and tools to help you get up and running quickly on any machine learning project.
## Features
- Managing properties files or project configuration, based on jproperties. The application configuration is expected to be available in a properties file named `app-config.properties`, which can be placed anywhere in the project root or any project subfolder.
- Convenience methods such as `get_estimator()` for getting a handle on any configured estimator with a specified hyperparameters dictionary, and `get_params_grid()` or `get_param_defaults()` for obtaining the model hyperparameters configured in the `app-config.properties` file.
- Convenience methods for conducting hyperparameter optimization, such as `params_optimization()` and `promising_params_grid()`; the latter obtains a set of promising hyperparameters using `RandomizedSearchCV` and the broadly specified or configured hyperparameters in `app-config.properties`. A combination of `promising_params_grid()` followed by `params_optimization()` constitutes a coarse-to-fine search.
- Convenience methods for accessing the project tree folders - e.g., `get_data_dir()` for the configured data folder and `get_output_dir()` for the output folder, `load_dataset()` for loading datasets, and `save_excel()` and `save_csv()` for saving Excel and CSV files, respectively, to the project output folder; you can also save any plotted figure using `save_current_fig()` (note that this must be called before `plt.show()`).
- Convenience methods to support common programming tasks, such as renaming or tagging file names - e.g., `label(file_name, label='some_label')` - or tagging and date-stamping files (e.g., `datestamp(file_name, fmt='%Y-%m-%d')`); see the short sketch after this list.
- Debug or logging, timer, and singleton decorators - for enabling logging and method timing, as well as creating singleton instances.
- Convenience methods available via the `DSWrapper` for managing datasource configuration or properties files - e.g., `ds-config.properties` - offering a set of generic datasource access methods such as `apply_to_datasource()` to persist data to any configured datasource or `read_from_datasource()` to read data from any configured datasource.
- A set of custom `scikit-learn` transformers for preprocessing data, such as `DataPrepTransformer`, which can be added to a data pipeline to preprocess a dataset - e.g., handling date conversions, type casting of columns, clipping data, generating special features from rows of text strings, generating calculated features, masking columns, dropping correlated or potential data leakage columns, and generating target variables from other features as needed (separate from target encoding); a `GeospatialTransformer` for generating geohash features from latitudes and longitudes; a `SelectiveFunctionTransformer` and `SelectiveColumnTransformer` for selectively transforming dataframe columns; a `DateFeaturesTransformer` for generating date-related features for feature engineering; and a `FeatureSelectionTransformer` for feature selection using configured estimators such as `Lasso` or `LinearRegression`.
- A set of generic or common utilities for summarizing dataframes and more - e.g., `summarize()` for dataframe summaries and `winsorize_it()` for winsorizing values.
- A set of convenience properties handlers for accessing generic configured properties relating to the project tree, data preparation, or model development and execution, such as `ProjectTreeProperties`, `DataPrepProperties`, and `ModelProperties`. These handlers offer a convenient feature for reloading properties as needed, thereby refreshing properties without having to restart the running VM (really only useful in development). However, you may access any configured properties in the usual way via the `AppProperties` object.
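
For example, the file-naming helpers can be used to tag and date-stamp output file names. A minimal sketch, assuming both helpers return the modified file name as a string (the results shown in the comments are illustrative only):
```python
from cheutils import label, datestamp

file_name = 'results.csv'
labelled = label(file_name, label='baseline')    # assumed to return a tagged name, e.g., something like 'results_baseline.csv'
stamped = datestamp(file_name, fmt='%Y-%m-%d')   # assumed to return a date-stamped name, e.g., something like 'results_2024-12-12.csv'
print(labelled, stamped)
```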
## Usage
You can install this module as follows:
```commandline
pip install cheutils
```
OPTIONAL: to upgrade to the latest release:
```commandline
pip install --upgrade cheutils
```
## Get started using `cheutils`
The module supports application configuration via a properties file. As such, you can include a project configuration file - the default properties file expected is `app-config.properties`, which you can place anywhere in your project root or any project subfolder. You can also include a special properties file called `ds-config.properties` with the configuration of your data sources; this is also loaded automatically. A sample application properties file may contain entries such as the following:
```properties
##
# Sample application properties file
##
project.namespace=proj_namespace
project.root.dir=./
project.data.dir=./data/
project.output.dir=./output/
# properties handlers
project.properties.prop_handler={'name': 'ProjectTreeProperties', 'package': 'cheutils', }
project.properties.data_handler={'name': 'DataPrepProperties', 'package': 'cheutils', }
project.properties.model_handler={'name': 'ModelProperties', 'package': 'cheutils', }
# SQLite DB - used for selected caching for efficiency
project.sqlite3.db=proj_sqlite.db
project.dataset.list=[X_train.csv, X_test.csv, y_train.csv, y_test.csv]
# estimator configuration: default parameters are those not necessarily included for any tuning or optimization
# but are useful for instantiating instances of the estimator; all others in the estimator params_grid are
# candidates for any optimization. If no default parameters are needed, simply ignore this or set the default_params value to None
project.models.supported={'xgb_boost': {'name': 'XGBRegressor', 'package': 'xgboost', 'default_params': None, }, \
'random_forest': {'name': 'RandomForestRegressor', 'package': 'sklearn.ensemble', 'default_params': None, }, \
'lasso': {'name': 'Lasso', 'package': 'sklearn.linear_model', 'default_params': None, }, }
# selected estimator parameter grid options - these are included in any tuning or model optimization
model.params_grid.xgb_boost={'learning_rate': {'type': float, 'start': 0.0, 'end': 1.0, 'num': 10}, 'subsample': {'type': float, 'start': 0.0, 'end': 1.0, 'num': 10}, 'min_child_weight': {'type': float, 'start': 0.1, 'end': 1.0, 'num': 10}, 'n_estimators': {'type': int, 'start': 10, 'end': 400, 'num': 10}, 'max_depth': {'type': int, 'start': 3, 'end': 17, 'num': 5}, 'colsample_bytree': {'type': float, 'start': 0.0, 'end': 1.0, 'num': 5}, 'gamma': {'type': float, 'start': 0.0, 'end': 1.0, 'num': 5}, 'reg_alpha': {'type': float, 'start': 0.0, 'end': 1.0, 'num': 5}, }
model.params_grid.random_forest={'min_samples_leaf': {'type': int, 'start': 1, 'end': 60, 'num': 5}, 'max_features': {'type': int, 'start': 5, 'end': 1001, 'num': 10}, 'max_depth': {'type': int, 'start': 5, 'end': 31, 'num': 6}, 'n_estimators': {'type': int, 'start': 5, 'end': 201, 'num': 10}, 'min_samples_split': {'type': int, 'start': 2, 'end': 21, 'num': 5}, 'max_leaf_nodes': {'type': int, 'start': 5, 'end': 401, 'num': 10}, }
model.baseline.model_option=lasso
model.active.model_option=xgb_boost
# hyperparameter search algorithm options supported: hyperopt with cross-validation and Scikit-Optimize
project.hyperparam.searches=['hyperoptcv', 'skoptimize']
model.active.n_iters=200
model.active.n_trials=10
model.narrow_grid.scaling_factor=0.20
model.narrow_grid.scaling_factors={'start': 0.1, 'end': 1.0, 'steps': 10}
model.find_optimal.grid_resolution=False
model.find_optimal.grid_resolution.with_cv=False
model.grid_resolutions.sample={'start': 1, 'end': 21, 'step': 1}
model.active.grid_resolution=7
model.cross_val.num_folds=3
model.active.n_jobs=-1
model.cross_val.scoring=neg_mean_squared_error
model.active.random_seed=100
model.active.trial_timeout=60
model.hyperopt.algos={'rand.suggest': 0.05, 'tpe.suggest': 0.75, 'anneal.suggest': 0.20, }
# transformers - defined as a dictionary of pipelines containing dictionaries of transformers
# note that each pipeline is mapped to a set of columns, and all transformers in a pipeline act on the set of columns
model.selective_column.transformers=[{'pipeline_name': 'scalers_pipeline', 'transformers': [{'name': 'scaler_tf', 'module': 'StandardScaler', 'package': 'sklearn.preprocessing', 'params': None, }, ], 'columns': ['col1_label', 'col2_label']}, ]
# global winsorize default limits or specify desired property and use accordingly
func.winsorize.limits=[0.05, 0.05]
```
A sample datasource configuration properties file may contain something like the following:
```properties
##
# Sample datasource configuration properties file
##
# datasources supported
project.ds.supported=[{'mysql_local': {'db_driver': 'MySQL ODBC 8.1 ANSI Driver', 'drivername': 'mysql+pyodbc', 'db_server': 'host.domain.com', 'db_port': 3306, 'db_name': 'mysql_db_name', 'username': 'db_username', 'password': 'db_user_passwd', 'direct_conn': 0, 'timeout': 0, 'verbose': True, 'encoding': 'utf8', }, }, ]
# database tables and interactions
db.rel_cols.db_namespace.some_table_name=['some_prim_key', 'name', 'iso_2code', 'iso_3code', 'gps_lat', 'gps_lon', 'is_active']
db.unique_key.db_namespace.some_table_name=['some_prim_key']
db.to_tables.replace.db_namespace=[some_table_name=False, ]
db.to_table.delete.db_namespace.some_table_name=[some_prim_key=120]
```
You import the `cheutils` module as usual:
```python
from cheutils import AppProperties, get_data_dir
# The following provides access to the properties file, usually expected to be named "app-config.properties" and
# typically found in the project data folder or anywhere in the project root or any other subfolder
APP_PROPS = AppProperties() # this automatically searches for the app-config.properties file and loads it
# During development, you may find it convenient to reload properties file changes without restarting the
# VM - NB: not recommended for production. You can achieve that by adding the following somewhere at the top of your Jupyter notebook, for example.
APP_PROPS.reload() # this automatically notifies all registered properties handlers to reload
# You can access any properties using various methods such as:
data_dir = APP_PROPS.get('project.data.dir')
# You can also retrieve the path to the data folder (see app-config.properties), which is under the project root as follows:
data_dir = get_data_dir() # also returns the path to the project data folder, which is always interpreted relative to the project root
# You can also retrieve other properties as follows:
datasets = APP_PROPS.get_list('project.dataset.list') # e.g., some.configured.list=[1, 2, 3] or ['1', '2', '3']; see the datasets configured in app-config.properties
hyperopt_algos = APP_PROPS.get_dic_properties('model.hyperopt.algos') # e.g., some.configured.dict={'val1': 10, 'val2': 'value'}
sel_transformers = APP_PROPS.get_list_properties('model.selective_column.transformers') # e.g., configured pipelines of transformers in the sample properties file above
find_opt_grid_res = APP_PROPS.get_bol('model.find_optimal.grid_resolution') # e.g., some.configured.bol=True
```
You can access the LOGGER instance and use it much as you would a logging module like `loguru` or standard logging:
```python
from cheutils import LoguruWrapper
LOGGER = LoguruWrapper().get_logger()
# You may also wish to change the logging context from the default, which is usually set to the configured project namespace property, by calling `set_prefix()`
# to ensure the log messages are scoped to that context thereafter - which can be helpful when reviewing the generated log file (`app-log.log`) - the default
# prefix is "app-log". You can set the logger prefix as follows:
LoguruWrapper().set_prefix(prefix='some_namespace')
some_val = 100
LOGGER.info('Some info you wish to log some value: {}', some_val) # or debug() etc.
```
The `cheutils` module currently supports any configured estimator (see the xgb_boost example in the sample properties file for how to configure any estimator).
You can configure the active or main estimator for your project with an entry in `app-config.properties` as in the sample above, and you can add your own estimator properties as well,
provided each estimator has been fully configured as in the sample application properties file:
```python
from cheutils import get_estimator, get_params_grid, AppProperties, load_dataset
# You can get a handle to the corresponding estimator in your code as follows:
estimator = get_estimator(model_option='xgb_boost') # the appropriate property can be seen in the sample app-config.properties
# You can do the following as well, to get a non-default instance, with appropriately configured hyperparameters:
estimator = get_estimator(**get_params_grid(model_option='xgb_boost'))
# You can fit the estimator as follows per usual:
datasets = AppProperties().get_list('project.dataset.list')
X_train, X_test, y_train, y_test = [load_dataset(file_name=file_name, is_csv=True) for file_name in datasets]
estimator.fit(X_train, y_train)
```
Given a default broad estimator hyperparameter configuration (usually in the properties file), you can generate a promising parameter
grid using `RandomizedSearchCV` as in the following lines. Note that the pipeline can be either an sklearn pipeline or an estimator.
The general idea is that, to avoid worrying about figuring out the optimal set of hyperparameter values for a given estimator, you can do that automatically by
adopting a two-step coarse-to-fine search: configure a broad hyperparameter space or grid based on the estimator's most important or impactful hyperparameters, then use a random search to find a set of promising hyperparameters that
you can use to conduct a finer hyperparameter search using other algorithms such as Bayesian optimization (e.g., hyperopt or Scikit-Optimize).
```python
from cheutils import promising_params_grid, params_optimization, AppProperties, load_dataset
from sklearn.pipeline import Pipeline
datasets = AppProperties().get_list('project.dataset.list') # AppProperties is a singleton
X_train, X_test, y_train, y_test = [load_dataset(file_name=file_name, is_csv=True) for file_name in datasets]
pipeline = Pipeline(steps=['some previously defined pipeline steps'])
promising_grid = promising_params_grid(pipeline, X_train, y_train, grid_resolution=3, prefix='baseline_model') # the prefix is not needed if not part of a model pipeline
# thereafter, you can run hyperparameter optimization or tuning as follows (assuming you enabled cross-validation in your configuration or app-config.properties - e.g., with an entry such as `model.cross_val.num_folds=3`),
# if using hyperopt - i.e., 'hyperoptcv' indicates using hyperopt optimization with cross-validation
best_estimator, best_score, best_params, cv_results = params_optimization(pipeline, X_train, y_train, promising_params_grid=promising_grid, with_narrower_grid=True, fine_search='hyperoptcv', prefix='model_prefix')
# if you are running the optimization as part of an MLflow experiment with logging, you could also pass an optional parameter in the optimization call:
mlflow_exp = {'log': True, 'uri': 'http://<mlflow_tracking_server>:<port>', } # ensures MLflow logging is done as well; you should also have the appropriate MLflow tracking server running
best_estimator, best_score, best_params, cv_results = params_optimization(pipeline, X_train, y_train, promising_params_grid=promising_grid, with_narrower_grid=True, fine_search='hyperoptcv', prefix='model_prefix', mlflow_exp=mlflow_exp)
```
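The returned `best_estimator` can then be evaluated on held-out data in the usual scikit-learn way; a minimal sketch, assuming `best_estimator` is returned already refit on the training data (otherwise call `fit()` first):
```python
from sklearn.metrics import mean_squared_error

# a minimal sketch: evaluate the optimized estimator on the held-out test split loaded earlier;
# this assumes best_estimator has already been refit - otherwise call best_estimator.fit(X_train, y_train) first
y_test_pred = best_estimator.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f'Test MSE = {test_mse:.2f}')
```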
If you have also configured some datasources (i.e., using the `ds-config.properties`), you can get a handle to the datasource wrapper as follows:
```python
import os
from cheutils import DSWrapper, get_data_dir
ds = DSWrapper() # it is a singleton
# You can then read a large CSV file, leveraging `dask` as follows:
data_df = ds.read_large_csv(path_to_data_file=os.path.join(get_data_dir(), 'some_large_file.csv')) # where the data file is expected to be in the data sub folder of the project tree
# Assuming you previously defined a datasource configuration such as `ds-config.properties` somewhere in the project tree or a subfolder,
# you can then simply read from a configured datasource (DB) as below. Note that ds_params lets you prescribe how DSWrapper behaves in
# the current interaction; the data_file attribute in ds_params MUST be set to None or left unset (i.e., left out entirely)
# if you wish to read from a configured DB resource - i.e., a datasource other than an Excel or CSV file. Set the attribute to signal to DSWrapper to
# read from either an Excel or CSV file, and additionally provide another attribute, is_csv=False, if reading an Excel file. Note that the db_key matches
# the entry in the sample ds-config.properties. DSWrapper expects the data_file to be in the data subfolder of the project.
ds_params = {'db_key': 'mysql_local', 'ds_namespace': 'test', 'db_table': 'some_table', 'data_file': None}
data_df = ds.read_from_datasource(ds_config=ds_params, chunksize=5000)
```
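To persist data back to a configured datasource, `DSWrapper` also offers `apply_to_datasource()` (see the feature list above). The following is only a sketch - the parameter names are assumptions modelled on `read_from_datasource()` above and may differ from the actual signature:
```python
# a sketch only: the ds_config-style parameters below are assumptions based on read_from_datasource() above
write_params = {'db_key': 'mysql_local', 'ds_namespace': 'test', 'db_table': 'some_table'}
ds.apply_to_datasource(data_df, ds_config=write_params)
```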
The `cheutils` module comes with custom transformers for some preprocessing - e.g., some basic data cleaning and formatting, handling date conversions, type casting of columns, clipping data, generating special features, calculating new features, masking columns, dropping correlated and potential leakage columns, and generating target variables if needed.
You can add a data preprocessing transformer to your pipeline as follows:
```python
from cheutils import DataPrepTransformer
date_cols = ['rental_date']
int_cols = ['release_year', 'length', 'NC-17', 'PG', 'PG-13', 'R',
'trailers', 'deleted_scenes', 'behind_scenes', 'commentaries', 'extra_fees']
correlated_cols = ['rental_rate_2', 'length_2', 'amount_2']
drop_missing = True # drop rows with missing data
clip_data = None # no data clipping; but you could clip outliers based on category aggregates with something like clip_data = {'rel_cols': ['col1', 'col2'], 'filterby': 'cat_col', }
exp_tf = DataPrepTransformer(date_cols=date_cols,
int_cols=int_cols,
drop_missing=drop_missing,
clip_data=clip_data,
correlated_cols=correlated_cols,)
data_prep_pipeline_steps = [('data_prep_step', exp_tf)] # this can be added to a model pipeline
```
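Since `DataPrepTransformer` follows the scikit-learn transformer API, you can also apply it outside a pipeline; a minimal sketch, where `raw_df` is assumed to be a previously loaded dataframe containing the columns referenced above:
```python
# a minimal sketch: apply the transformer directly (raw_df is an assumed, previously loaded dataframe)
prepped_df = exp_tf.fit_transform(raw_df)
```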
You can also include feature selection by adding the following to the pipeline:
```python
from cheutils import FeatureSelectionTransformer, get_estimator, AppProperties, ModelProperties, SelectiveColumnTransformer
standard_pipeline_steps = ['some previously defined pipeline steps']
model_handler: ModelProperties = AppProperties().get_subscriber('model_handler')
feat_sel_tf = FeatureSelectionTransformer(estimator=get_estimator(model_option='xgb_boost'),
                                          random_state=model_handler.get_random_seed())
# add feature selection to pipeline
standard_pipeline_steps.append(('feat_selection_step', feat_sel_tf))
# You can also add a configured selective column transformer.
# e.g., if you already have configured a list of column transformers in the `app-config.properties` such as in the sample properties file above,
# you can add it to the pipeline as below. The `SelectiveColumnTransformer` uses the configured property to determine
# the transformer(s), and the corresponding columns affected, to add to the pipeline.
# Each configured transformer only applies any transformations to the specified columns and others are simply passed through.
scaler_tf = SelectiveColumnTransformer()
standard_pipeline_steps.append(('scale_feats_step', scaler_tf))
```
Ultimately, you may create a model pipeline and execute it using steps similar to the following:
```python
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import TransformedTargetRegressor
from cheutils import get_estimator, winsorize_it, AppProperties, LoguruWrapper
LOGGER = LoguruWrapper().get_logger()
# assuming any previous necessary steps
standard_pipeline_steps = ['some previously defined pipeline steps']
# ...
baseline_model = get_estimator(model_option=AppProperties().get('model.baseline.model_option'))
baseline_pipeline_steps = standard_pipeline_steps.copy()
baseline_pipeline_steps.append(('baseline_mdl', baseline_model))
baseline_pipeline = Pipeline(steps=baseline_pipeline_steps, verbose=True)
# you could even wrap the pipeline with scikit-learn's TransformedTargetRegressor to transform the target, for argument's sake
# here the target is winsorized, but you could apply other target transformations as you wish
baseline_est = TransformedTargetRegressor(regressor=baseline_pipeline,
func=winsorize_it,
inverse_func=winsorize_it,
check_inverse=False, )
X_train = None # ignore the None value - assume previously defined and gone through an appropriate train_test_split
y_train = None # ditto what is said on X_train above
baseline_est.fit(X_train, y_train)
y_train_pred = baseline_est.predict(X_train)
mse_score = mean_squared_error(y_train, y_train_pred)
r_squared = r2_score(y_train, y_train_pred)
LOGGER.debug('Training baseline mse = {:.2f}'.format(mse_score))
LOGGER.debug('Training baseline r_squared = {:.2f}'.format(r_squared))
```
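`winsorize_it()` can also be applied directly to an array of values; a minimal sketch, assuming the limits default to the configured `func.winsorize.limits` property when not supplied explicitly:
```python
import numpy as np
from cheutils import winsorize_it

values = np.append(np.linspace(1.0, 100.0, 100), 1000.0)  # sample values with an extreme outlier
clipped = winsorize_it(values)  # limits assumed to default to func.winsorize.limits in app-config.properties
print(clipped.max())  # the extreme value is expected to be clipped toward the configured upper limit
```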
## Community
Contributions are welcome from contributors of all experience levels and from anyone looking to collaborate to improve the package or to be helpful.
We rely on scikit-learn's [`Development Guide`](https://scikit-learn.org/stable/developers/index.html), which contains lots of best practices and detailed information about contributing code, documentation, tests, and more.
### Source code
You can check the latest sources with the command:
```commandline
git clone https://github.com/chewitty/cheutils.git
```
### Communication
- Author email: ferdinand.che@gmail.com
### Citation
If you use `cheutils` in a media/research publication, we would appreciate citations to this repository.