MetaHeuristicsFS


 - Name: MetaHeuristicsFS
 - Version: 0.0.8
 - Home page: https://github.com/StatguyUser/MetaHeuristicsFS
 - Summary: Implementation of metaheuristic algorithms for machine learning feature selection. Companion library for the book `Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists`
 - Upload time: 2023-08-06 13:55:52
 - Author: StatguyUser

What is it?
===========

Companion library for the machine learning book [Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists](https://statguyuser.github.io/feature-engg-selection-for-explainable-models.github.io/index.html).

The MetaHeuristicsFS module helps identify the combination of features that gives the best result. The process of searching for the best combination is called 'feature selection'. This library uses metaheuristic algorithms such as genetic algorithm, simulated annealing, ant colony optimization, and particle swarm optimization to perform feature selection.


Input parameters
================

  - **Machine Learning Parameters: These are common to all algorithms**

    `columns_list` : Column names present in x_train_dataframe and x_test_dataframe. This is the candidate list from which the best list of features is searched.

    `data_dict` : X and Y training and test data provided in dictionary format. Below is an example of 5-fold cross-validation data, with one key per fold.
        {0:{'x_train':x_train_dataframe,'y_train':y_train_array,'x_test':x_test_dataframe,'y_test':y_test_array},
        1:{'x_train':x_train_dataframe,'y_train':y_train_array,'x_test':x_test_dataframe,'y_test':y_test_array},
        2:{'x_train':x_train_dataframe,'y_train':y_train_array,'x_test':x_test_dataframe,'y_test':y_test_array},
        3:{'x_train':x_train_dataframe,'y_train':y_train_array,'x_test':x_test_dataframe,'y_test':y_test_array},
        4:{'x_train':x_train_dataframe,'y_train':y_train_array,'x_test':x_test_dataframe,'y_test':y_test_array}}

    If you only have train and test data and do not wish to do cross-validation, use the above dictionary format with only one key.

    `use_validation_data` : Boolean (True or False) indicating whether to use validation data. Default is True. If False, the user need not provide x_validation_dataframe and y_validation_dataframe.

    `x_validation_dataframe` : Dataframe containing the features of the validation dataset. Default is an empty pandas dataframe.

    `y_validation_dataframe` : Dataframe containing the dependent variable of the validation dataset. Default is an empty pandas dataframe.

    `model` : Model object. It should have .fit and .predict methods.

    `cost_function_improvement` : Whether the objective is to increase or decrease the cost during subsequent iterations.
        For regression it should be 'decrease' and for classification it should be 'increase'.

    `cost_function` : Cost function for computing the cost between actual and predicted values, depending on whether the problem is regression or classification.
        The cost function should accept 'actual' and 'predicted' as arrays and return the cost computed from both.

    `average` : Averaging to be used. This is useful for classification metrics such as 'f1_score', 'jaccard_score', 'fbeta_score', 'precision_score',
        'recall_score' and 'roc_auc_score' when the dependent variable is multi-class.
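As a minimal sketch of how these common inputs might be prepared, the block below builds a 5-fold `data_dict` with scikit-learn's `KFold` and defines a cost function that takes 'actual' and 'predicted' arrays. The helper names `build_data_dict` and `rmse` and the toy data are assumptions made for illustration; only the dictionary keys and the parameter names come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def build_data_dict(x_dataframe, y_array, n_splits=5, seed=0):
    """Return {fold: {'x_train', 'y_train', 'x_test', 'y_test'}} for K folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    data_dict = {}
    for fold, (train_idx, test_idx) in enumerate(kfold.split(x_dataframe)):
        data_dict[fold] = {
            "x_train": x_dataframe.iloc[train_idx],
            "y_train": y_array[train_idx],
            "x_test": x_dataframe.iloc[test_idx],
            "y_test": y_array[test_idx],
        }
    return data_dict


def rmse(actual, predicted):
    """Hypothetical cost function: root mean squared error, lower is better."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
x_dataframe = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
y_array = rng.normal(size=50)

columns_list = list(x_dataframe.columns)
data_dict = build_data_dict(x_dataframe, y_array, n_splits=5)
cost_function = rmse
cost_function_improvement = "decrease"  # RMSE is an error, so smaller is better
```

For a classification score where higher is better, such as scikit-learn's `f1_score`, the direction would be 'increase' instead.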

  - **Genetic Algorithm Feature Selection (GeneticAlgorithmFS) Parameters**

    `generations` : Number of generations to run the genetic algorithm. Default is 100.

    `population` : Number of individual chromosomes. Default is 50. Keep it low if the number of possible feature-set combinations is small.

    `prob_crossover` : Probability of crossover. Default is 0.9.

    `prob_mutation` : Probability of mutation. Default is 0.1.

    `run_time` : Number of minutes to run the algorithm. The elapsed time is checked at the start of each generation.
        If the run time has exceeded the allotted limit, the best result from the generations executed so far is returned.
        Default is 120 minutes (2 hours).
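To make the roles of these parameters concrete, here is a generic genetic algorithm over feature bit masks. It is a sketch under stated assumptions, not the library's implementation: the function name `genetic_feature_search`, the tournament selection, the single-point crossover, the one-bit mutation and the toy fitness are all chosen for illustration. In practice the fitness would be the model's cost averaged over the folds in `data_dict`.

```python
import random
import time


def genetic_feature_search(columns, fitness, generations=100, population=50,
                           prob_crossover=0.9, prob_mutation=0.1,
                           run_time=120, seed=0):
    """Generic GA over bit masks; higher fitness is better. Illustration only."""
    rng = random.Random(seed)
    n = len(columns)
    deadline = time.monotonic() + run_time * 60  # run_time is in minutes

    def decode(mask):
        return [c for c, bit in zip(columns, mask) if bit]

    def score(mask):
        # An empty feature set cannot be modelled, so it gets the worst score.
        return fitness(decode(mask)) if any(mask) else float("-inf")

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(population)]
    best = max(pop, key=score)
    for _ in range(generations):
        if time.monotonic() > deadline:  # checked at the start of each generation
            break
        children = []
        while len(children) < population:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 2), key=score)
            p2 = max(rng.sample(pop, 2), key=score)
            c1, c2 = p1[:], p2[:]
            if n > 1 and rng.random() < prob_crossover:  # single-point crossover
                cut = rng.randint(1, n - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                if rng.random() < prob_mutation:  # flip one random bit
                    i = rng.randrange(n)
                    child[i] = 1 - child[i]
                children.append(child)
        pop = children[:population]
        best = max(pop + [best], key=score)  # keep the best mask seen so far
    return decode(best)


# Toy fitness (hypothetical): reward the useful columns, penalise the rest.
useful = {"a", "c"}
columns = ["a", "b", "c", "d", "e"]


def toy_fitness(cols):
    chosen = set(cols)
    return len(chosen & useful) - 0.5 * len(chosen - useful)


best_columns = genetic_feature_search(columns, toy_fitness,
                                      generations=30, population=20)
```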

  - **Simulated Annealing Feature Selection (SimulatedAnnealingFS) Parameters**

    `temperature` : Initial temperature for annealing. Default is 1500.

    `iterations` : Number of times simulated annealing will search for solutions. Default is 100.

    `n_perturb` : Number of times the feature set will be perturbed in an iteration. Default is 1.

    `n_features_percent_perturb` : Percentage of features that will be perturbed during each perturbation. Values are between 1 and 100.

    `alpha` : Temperature reduction factor. Default is 0.9.

    `run_time` : Number of minutes to run the algorithm. The elapsed time is checked at the start of each iteration.
        If the run time has exceeded the allotted limit, the best result from the iterations executed so far is returned.
        Default is 120 minutes (2 hours).
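The interplay of `temperature`, `alpha`, `n_perturb` and `n_features_percent_perturb` can be sketched with a generic simulated annealing loop. This is an illustration under assumptions rather than the library's code: the function name, the choice to start from the full feature set, the Metropolis acceptance rule, the geometric cooling and the value 20 used for `n_features_percent_perturb` (no default is stated above) are all hypothetical.

```python
import math
import random


def simulated_annealing_search(columns, cost, temperature=1500, iterations=100,
                               n_perturb=1, n_features_percent_perturb=20,
                               alpha=0.9, seed=0):
    """Generic SA over bit masks; lower cost is better. Illustration only."""
    rng = random.Random(seed)
    n = len(columns)
    # Number of bits flipped per perturbation, at least one.
    n_flip = max(1, round(n * n_features_percent_perturb / 100))

    def decode(mask):
        return [c for c, bit in zip(columns, mask) if bit]

    current = [1] * n  # assumption: start from the full feature set
    current_cost = cost(decode(current))
    best, best_cost = current[:], current_cost
    for _ in range(iterations):
        for _ in range(n_perturb):
            candidate = current[:]
            for i in rng.sample(range(n), n_flip):
                candidate[i] = 1 - candidate[i]
            if not any(candidate):
                continue  # skip empty feature sets
            candidate_cost = cost(decode(candidate))
            delta = candidate_cost - current_cost
            # Metropolis rule: always accept improvements, sometimes accept worse.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current[:], current_cost
        temperature *= alpha  # geometric cooling
    return decode(best), best_cost


# Toy cost (hypothetical): number of columns that differ from the useful set.
useful = {"a", "c"}
columns = ["a", "b", "c", "d", "e"]


def toy_cost(cols):
    return len(set(cols) ^ useful)


best_columns, best_cost = simulated_annealing_search(columns, toy_cost)
```

A high initial temperature makes the early iterations accept almost any move, and each multiplication by `alpha` makes worse moves less likely to be accepted.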

  - **Ant Colony Optimization Feature Selection (AntColonyOptimizationFS) Parameters**

    `iterations` : Number of times ant colony optimization will search for solutions. Default is 100.

    `N_ants` : Number of ants in each iteration. Default is 100.

    `run_time` : Number of minutes to run the algorithm. The elapsed time is checked at the start of each iteration.
        If the run time has exceeded the allotted limit, the best result from the iterations executed so far is returned.
        Default is 120 minutes (2 hours).

    `evaporation_rate` : Evaporation rate. Values are between 0 and 1. If it is large, the chances of finding the global optimum are higher, but the search is computationally more expensive. If it is low, the chances of finding the global optimum are lower. Default is 0.9.

    `Q` : Pheromone update coefficient. Values are between 0 and 1. It affects the convergence speed. If it is large, ACO can get stuck at a local optimum. Default is 0.2.
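A small sketch can show how `evaporation_rate` and `Q` typically enter a pheromone update. It assumes the common textbook form, in which every trail is first multiplied by (1 - evaporation_rate) and each ant then deposits Q times its score on the features it chose. Whether the library uses exactly this form is not stated here, and both function names and the sampling rule are hypothetical simplifications.

```python
import random


def update_pheromone(pheromone, ant_subsets, ant_scores,
                     evaporation_rate=0.9, Q=0.2):
    """Evaporate every trail, then let each ant deposit Q * score on its features."""
    updated = {f: (1.0 - evaporation_rate) * tau for f, tau in pheromone.items()}
    for subset, score in zip(ant_subsets, ant_scores):
        for feature in subset:
            updated[feature] += Q * score
    return updated


def sample_subset(pheromone, rng):
    """Include each feature with probability tau / max(tau) (a simplification)."""
    top = max(pheromone.values())
    chosen = [f for f, tau in pheromone.items() if rng.random() < tau / top]
    # Never return an empty feature set: fall back to the strongest trail.
    return chosen or [max(pheromone, key=pheromone.get)]


# Two ants: one chose features a and c (score 0.8), the other only a (score 0.5).
pheromone = {"a": 1.0, "b": 1.0, "c": 1.0}
pheromone = update_pheromone(pheromone, [["a", "c"], ["a"]], [0.8, 0.5])
next_subset = sample_subset(pheromone, random.Random(0))
```

With a high evaporation rate the old trails fade quickly, so the next ants depend mostly on the latest deposits. A larger `Q` makes those deposits dominate sooner.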

  - **Particle Swarm Optimization Feature Selection (ParticleSwarmOptimizationFS) Parameters**

    `iterations` : Number of times particle swarm optimization will search for solutions. Default is 100.

    `swarmSize` : Size of the swarm in each iteration. Default is 100.

    `run_time` : Number of minutes to run the algorithm. The elapsed time is checked at the start of each iteration.
        If the run time has exceeded the allotted limit, the best result from the iterations executed so far is returned.
        Default is 120 minutes (2 hours).
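For orientation, the following is a generic binary particle swarm search over feature bit masks, showing what `iterations` and `swarmSize` control. It is a sketch, not the library's implementation: the function name, the inertia weight `w`, the coefficients `c1` and `c2`, and the sigmoid rule for turning velocities into bits are assumptions taken from the usual binary PSO formulation.

```python
import math
import random


def binary_pso_search(columns, fitness, iterations=100, swarmSize=100,
                      w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO over bit masks; higher fitness is better. Illustration only."""
    rng = random.Random(seed)
    n = len(columns)

    def decode(mask):
        return [c for c, bit in zip(columns, mask) if bit]

    def score(mask):
        return fitness(decode(mask)) if any(mask) else float("-inf")

    positions = [[rng.randint(0, 1) for _ in range(n)] for _ in range(swarmSize)]
    velocities = [[0.0] * n for _ in range(swarmSize)]
    personal_best = [p[:] for p in positions]
    personal_score = [score(p) for p in positions]
    g = max(range(swarmSize), key=lambda k: personal_score[k])
    global_best, global_score = personal_best[g][:], personal_score[g]

    for _ in range(iterations):
        for k in range(swarmSize):
            for i in range(n):
                # Pull each bit's velocity toward the personal and global bests.
                velocities[k][i] = (
                    w * velocities[k][i]
                    + c1 * rng.random() * (personal_best[k][i] - positions[k][i])
                    + c2 * rng.random() * (global_best[i] - positions[k][i])
                )
                # The sigmoid of the velocity is the probability that the bit is 1.
                prob_one = 1.0 / (1.0 + math.exp(-velocities[k][i]))
                positions[k][i] = 1 if rng.random() < prob_one else 0
            s = score(positions[k])
            if s > personal_score[k]:
                personal_best[k], personal_score[k] = positions[k][:], s
            if s > global_score:
                global_best, global_score = positions[k][:], s
    return decode(global_best), global_score


# Toy fitness (hypothetical): reward the useful columns, penalise the rest.
useful = {"a", "c"}
columns = ["a", "b", "c", "d", "e"]


def toy_fitness(cols):
    chosen = set(cols)
    return len(chosen & useful) - 0.5 * len(chosen - useful)


best_columns, best_score = binary_pso_search(columns, toy_fitness,
                                             iterations=20, swarmSize=30)
```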

Output
================

  - **best_columns** : List of column names that gives the best performance for the model. The user can use these features to train and save models separately.
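Once a list such as `best_columns` is available, training a final model on it is ordinary scikit-learn usage. The snippet below is a minimal sketch with toy data. The value assigned to `best_columns` is a hypothetical stand-in for the output of a feature selection run, and `LinearRegression` is just one model with .fit and .predict.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data in which only columns a and c drive the target.
rng = np.random.default_rng(0)
x_train = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
y_train = 2.0 * x_train["a"].to_numpy() - x_train["c"].to_numpy()

best_columns = ["a", "c"]  # hypothetical output of a feature selection run

# Fit the final model on the selected columns only.
final_model = LinearRegression()
final_model.fit(x_train[best_columns], y_train)
predictions = final_model.predict(x_train[best_columns])
```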

Examples
================

 - [Example 1 - Regression](https://github.com/StatguyUser/feature_engineering_and_selection_for_explanable_models/blob/main/Chapter%208%20-%20Predicting%20Room%20Bookings%20-%20More%20Genetic%20Algorithm%20Iterations.ipynb)
 - [Example 2 - Classification](https://github.com/StatguyUser/feature_engineering_and_selection_for_explanable_models/blob/37ba0d2921fbabbb83df44c6eb7a1242b19a637f/Chapter%208%20-%20Hotel%20Cancelation%20.ipynb)

How to cite
================
Md Azimul Haque (2022). Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists. Lulu Press, Inc.

Where to get it?
================

`pip install MetaHeuristicsFS`

Dependencies
============

 - [numpy](https://numpy.org/)
 - [scikit-learn](https://scikit-learn.org/)




            
