dswizard

Name	dswizard
Version	0.2.6
home_page	https://github.com/Ennosigaeon/dswizard
Summary	DataScience Wizard for automatic assembly of machine learning pipelines
upload_time	2023-07-18 11:49:30
author	Marc Zoeller
requires_python	>=3.7
license	MIT
keywords	automl machine learning pipeline synthesis

# dswizard

_dswizard_ is an efficient solver for machine learning (ML) pipeline synthesis inspired by human behaviour. It
automatically derives a pipeline structure, selects algorithms and performs hyperparameter optimization. This repository
contains the source code and data used in our publication [Iterative Search Space Construction for Pipeline Structure Search](https://arxiv.org/).

## How to install

The code has only been tested with Python 3.8, but any version supporting type hints should work. We recommend using a
virtual environment.
```
python3 -m virtualenv venv
source venv/bin/activate
```

_dswizard_ is available on PyPI, so you can install it via
```
pip install dswizard
```

Alternatively, you can check out the source code and install it directly via
```
pip install -e dswizard
```

Now you are ready to go.

### Visualization
`dswizard` contains an optional pipeline search space visualization functionality intended for debugging and
explainability. If you don't need this feature, you can skip this step. To use the visualization you have to install
[Graphviz](https://graphviz.org/) manually and add the additional visualization libraries using
```
pip install "dswizard[visualization]"
```
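
Before using the visualization, it can help to confirm that Graphviz is actually reachable from Python. The following is a minimal sketch and not part of _dswizard_: the helper name `graphviz_available` is hypothetical, and the check only assumes that a Graphviz installation places its `dot` executable on `PATH`.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable can be found on PATH.

    Hypothetical helper for illustration; dswizard does not provide it.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if graphviz_available():
        print("Graphviz found, the visualization extra can be used.")
    else:
        print("Graphviz not found on PATH, install it before using the visualization.")
```

If the check fails even though Graphviz is installed, the installation directory is probably missing from `PATH`.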

## Usage

In the `scripts` folder, we have provided scripts to showcase usage of _dswizard_. The most important script is
`scripts/1_optimize.py`. This script solves the pipeline synthesis for a given task. To get usage information use
```
python dswizard/scripts/1_optimize.py --help
```
yielding an output similar to

    usage: 1_optimize.py [-h] [--wallclock_limit WALLCLOCK_LIMIT] [--cutoff CUTOFF] [--log_dir LOG_DIR] task

    Example 1 - dswizard optimization.

    positional arguments:
      task                  OpenML task id

    optional arguments:
      -h, --help            show this help message and exit
      --wallclock_limit WALLCLOCK_LIMIT
                            Maximum optimization time for in seconds
      --cutoff CUTOFF       Maximum cutoff time for a single evaluation in seconds
      --log_dir LOG_DIR     Directory used for logging
      --fold FOLD           Fold of OpenML task to optimize

You have to pass an [OpenML](https://www.openml.org/) task id. For example, to create pipelines for the _kc2_ data set
use `python dswizard/scripts/1_optimize.py 3913`. Via the optional parameters you can change the total optimization time
(default 300 seconds), the maximum evaluation time for a single configuration (default 60 seconds), the directory used
to store optimization artifacts (default _run/{TASK}_) and the fold to evaluate (default 0).
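
To make the options and their defaults concrete, here is a minimal `argparse` sketch that mirrors the help text and the defaults described above. It is an illustration, not the actual source of `scripts/1_optimize.py`: the argument types, the function names `build_parser` and `parse`, and the way the `{TASK}` placeholder is expanded are assumptions.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options listed in the help output above."""
    parser = argparse.ArgumentParser(
        prog="1_optimize.py", description="Example 1 - dswizard optimization."
    )
    # Argument types are assumptions; only names and defaults come from the README.
    parser.add_argument("task", type=int, help="OpenML task id")
    parser.add_argument("--wallclock_limit", type=float, default=300,
                        help="Maximum optimization time in seconds")
    parser.add_argument("--cutoff", type=float, default=60,
                        help="Maximum cutoff time for a single evaluation in seconds")
    parser.add_argument("--log_dir", type=str, default="run/{TASK}",
                        help="Directory used for logging")
    parser.add_argument("--fold", type=int, default=0,
                        help="Fold of OpenML task to optimize")
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    """Parse arguments and expand the {TASK} placeholder (assumed behaviour)."""
    args = build_parser().parse_args(argv)
    args.log_dir = args.log_dir.format(TASK=args.task)
    return args


if __name__ == "__main__":
    # Equivalent to: 1_optimize.py 3913 --wallclock_limit 600 --fold 1
    print(parse(["3913", "--wallclock_limit", "600", "--fold", "1"]))
```

With this sketch, passing only the task id `3913` resolves to a 300 second budget, a 60 second cutoff, the log directory `run/3913` and fold 0.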

The optimization procedure prints the best pipeline structure found, together with its configuration and test
performance, to the console, similar to

    2020-11-13 16:45:55,312 INFO     root            MainThread Best found configuration: [('22', KBinsDiscretizer), ('26', PCAComponent), ('28', AdaBoostingClassifier)]
    Configuration:
      22:encode, Value: 'ordinal'
      22:n_bins, Value: 32
      22:strategy, Value: 'kmeans'
      26:keep_variance, Value: 0.9145797030897109
      26:whiten, Value: True
      28:algorithm, Value: 'SAMME'
      28:learning_rate, Value: 0.039407336108331845
      28:n_estimators, Value: 138
     with loss -0.8401893431635389
    2020-11-13 16:45:55,312 INFO     root            MainThread A total of 20 unique structures where sampled.
    2020-11-13 16:45:55,312 INFO     root            MainThread A total of 58 runs where executed.
    2020-11-13 16:45:55,316 INFO     root            MainThread Final pipeline:
    FlexiblePipeline(configuration={'22:encode': 'ordinal', '22:n_bins': 32,
                                    '22:strategy': 'kmeans',
                                    '26:keep_variance': 0.9145797030897109,
                                    '26:whiten': True, '28:algorithm': 'SAMME',
                                    '28:learning_rate': 0.039407336108331845,
                                    '28:n_estimators': 138},
                     steps=[('22', KBinsDiscretizer(encode='ordinal', n_bins=32, strategy='kmeans')),
                            ('26', PCAComponent(keep_variance=0.9145797030897109, whiten=True)),
                            ('28', AdaBoostingClassifier(algorithm='SAMME', learning_rate=0.039407336108331845, n_estimators=138))])
    2020-11-13 16:45:55,828 INFO     root            MainThread Final test performance -0.8430735930735931
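
In the output above, every hyperparameter key is prefixed with the id of the pipeline step it belongs to (for example `22:n_bins` belongs to step `22`). The following sketch regroups such a flat configuration by step, which can make long configurations easier to read. The helper `group_by_step` is hypothetical and not part of _dswizard_; the sample values are copied from the log above.

```python
from collections import defaultdict
from typing import Any, Dict


def group_by_step(configuration: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group flat '<step>:<parameter>' keys into one dictionary per step."""
    grouped: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for key, value in configuration.items():
        # Split only at the first colon: '22:n_bins' -> ('22', 'n_bins')
        step, _, parameter = key.partition(":")
        grouped[step][parameter] = value
    return dict(grouped)


# Configuration values taken from the example log output
configuration = {
    "22:encode": "ordinal",
    "22:n_bins": 32,
    "22:strategy": "kmeans",
    "26:keep_variance": 0.9145797030897109,
    "26:whiten": True,
    "28:algorithm": "SAMME",
    "28:learning_rate": 0.039407336108331845,
    "28:n_estimators": 138,
}

for step, parameters in group_by_step(configuration).items():
    print(step, parameters)
```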

Additionally, an ensemble of the evaluated pipeline candidates is constructed.

    2020-11-13 16:46:06,371 DEBUG    Ensemble        MainThread Building bagged ensemble
    2020-11-13 16:46:09,606 DEBUG    Ensemble        MainThread Ensemble constructed
    2020-11-13 16:46:10,472 INFO     root            MainThread Final ensemble performance -0.8528138528138528 based on 11 pipelines

In the log directory (default _run/{TASK}_) four files are stored:

1. _log.txt_ contains the complete logging output.
2. _results.json_ contains detailed information about all evaluated hyperparameter configurations.
3. _search_graph.pdf_ is a visual representation of the internal pipeline structure graph.
4. _structures.json_ contains all tested pipeline structures including the list of algorithms and the complete configuration space.
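
After a run, the log directory can be inspected programmatically. The sketch below checks which of the four files listed above are present and loads a JSON artifact without assuming anything about its internal schema, which is not documented here. The helper names `check_artifacts` and `load_json_artifact` are hypothetical and not part of _dswizard_.

```python
import json
from pathlib import Path
from typing import Any, Dict

# File names as listed in the README
EXPECTED_FILES = ("log.txt", "results.json", "search_graph.pdf", "structures.json")


def check_artifacts(log_dir: str) -> Dict[str, bool]:
    """Report for each expected artifact whether it exists in log_dir."""
    base = Path(log_dir)
    return {name: (base / name).is_file() for name in EXPECTED_FILES}


def load_json_artifact(log_dir: str, name: str) -> Any:
    """Load one of the JSON artifacts; the returned structure is not assumed."""
    with open(Path(log_dir) / name, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # 'run/3913' corresponds to the default log directory for task 3913
    for name, present in check_artifacts("run/3913").items():
        print(f"{name}: {'found' if present else 'missing'}")
```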

## Benchmarking

To assess the performance of _dswizard_ we have implemented an adapter for the OpenML [automlbenchmark](https://github.com/openml/automlbenchmark) available
[here](https://github.com/Ennosigaeon/automlbenchmark). Please refer to that repository for benchmarking _dswizard_. The
files `scripts/2_load_pipelines.py`, `scripts/3_load_performance.py` and `scripts/4_load_trajectories.py` are used to
compare _dswizard_ with _autosklearn_ and _tpot_, both also evaluated via _automlbenchmark_.

## Meta-Learning

The meta-learning base used in this repository is created using [meta-learning-base](https://github.com/Ennosigaeon/meta-learning-base).
Please see that repository for instructions on creating the required meta-learning models.

For simplicity, we directly provide a random and sgd forest regression model trained on all available data ready to use.
It is available in `dswizard/assets/`. `scripts/1_optimize.py` is already configured to use this model.

The data used to train the regression model is also available [online](https://github.com/Ennosigaeon/meta-learning-base/tree/master/assets/defaults).
Please refer to [meta-learning-base](https://github.com/Ennosigaeon/meta-learning-base) to see how to train the model
from the raw data.

## dswizard-components

This repository only contains the optimization logic. The actual basic ML components to be optimized are available in
[_dswizard-components_](https://github.com/Ennosigaeon/dswizard-components). Currently, only _sklearn_ components are
supported.

## Release New Version

Increase the version number in `setup.py` and build a new release with `python setup.py sdist`. Finally, upload the
new version using `twine upload dist/dswizard-<VERSION>.tar.gz`.
