evoltree


Nameevoltree JSON
Version 0.0.8 PyPI version JSON
download
home_pagehttps://github.com/p-pereira/evoltree
Summaryevoltree - Evolutionary Decision Trees
upload_time2023-01-02 15:03:24
maintainer
docs_urlNone
authorPedro José Pereira, Paulo Cortez, Rui Mendes
requires_python>=3.5
licenseMIT
keywords evolutionary decision trees grammatical evolution lamarckian evolution
VCS
bugtrack_url
requirements matplotlib numpy pandas scikit-learn ply tqdm
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # **evoltree**: Evolutionary Decision Trees

[![Downloads](https://pepy.tech/badge/evoltree)](https://pepy.tech/project/evoltree) [![Downloads](https://pepy.tech/badge/evoltree/month)](https://pepy.tech/project/evoltree) [![Downloads](https://pepy.tech/badge/evoltree/week)](https://pepy.tech/project/evoltree)


## Overview

*evoltree* is a novel **Multi-objective Optimization (MO)** approach to evolve **Decision Trees (DT)** using **Grammatical Evolution (GE)**, under two main variants: a pure GE method (**EDT**) and a GE with Lamarckian Evolution (**EDTL**).
Both variants evolve variable-length DTs and perform a simultaneous optimization of the predictive performance (measured in terms of AUC) and model complexity (measured in terms of GE tree nodes). To handle big data, the GE methods include a **training sampling** and **parallelism evaluation mechanism**.
Both variants both use [PonyGE2](https://github.com/PonyGE/PonyGE2) as GE engine, while EDTL uses [sklearn DT](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).


Solutions are represented as a **numpy.where** expression in the format bellow, with *x* being a pandas dataframe with data; *idx* a column from the dataset; *comparison* a comparison signal (e.g., <, ==); *value* being a numerical value; and *result* can be another **numpy.where** expression (creating chained expressions), or a class probability (numeric value from 0 to 1).
Due to this representation, current evoltree implementation only allows numerical attributes.

```python3
numpy.where(x[<idx>]<comparison><value>,<result>,<result>) 
```
Example of a generated Python code (left) and the corresponding DT (right):
![DT and code](https://raw.githubusercontent.com/p-pereira/evoltree/main/imgs/dt_code.png)

More details about this work can be found at: https://doi.org/10.1016/j.eswa.2020.114287.

# Install

**Using `pip`:**

```bash
pip install evoltree
```

# Quick Start

This short tutorial contains a set of steps that will help you getting started with **evoltree**.

## Load Example Data

evoltree package includes two example datasets for testing purposes only. Data is ordered in time and the second dataset contains events collected after the first one.
To load the datasets, already divided into train, validation and test sets, two functions were created:
- **load_offline_data** - returns the training, validation and test sets from first dataset, used for static environments;
- **load_online_data** - returns the training, validation and test sets from both datasets, used for online learning scenarios.

Next steps present how to load data in the two different modes (online and offline). Due to privacy issues, all data is anonymized.

```python3
# Import package
from evoltree import evoltree
# Create evoltree object
edt = evoltree()
# Loading first example dataset, already divided into train, validation and test sets.
X, y, X_val, y_val, X_ts, y_ts = edt.load_offline_data()
# Loading both example datasets, already divided into train, validation and test sets.
X1, y1, X_val1, y_val1, X_ts1, y_ts1, X2, y2, X_val2, y_val2, X_ts2, y_ts2 = edt.load_online_data()
print(X)
print(y)
```

```
# Outputs
## X (train)
        col1      col2      col3      col4      col5      col6      col7      col8      col9      col10
0       0.755951  4.653432  2.767041  0.739897  0.101081  2.401580  2.890712  3.157321  5.554321  0.865465
1       5.624979  4.782823  0.416159  0.823179  0.101081  2.446735  0.739841  2.564807  4.552277  0.991334
2       0.755951  5.365407  3.081596  0.739897  0.101081  3.174113  1.986985  3.193263  3.378832  0.865465
3       4.436114  6.393779  0.416159  0.823179  0.101081  2.238128  2.066007  2.564807  3.436435  0.865465
4       7.071106  6.069200  0.416159  0.739897  0.101081  5.100103  0.739841  3.551982  4.886248  0.991334
...          ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
708937  0.755951  4.804125  0.416159  0.823179  0.101081  2.807017  2.066007  2.564807  5.802875  0.865465
708938  0.755951  2.558737  0.416159  0.823179  0.101081  2.558737  2.066007  2.564807  5.802875  2.173501
708939  0.755951  2.875355  0.416159  0.823179  0.101081  2.572643  0.739841  2.455024  2.947305  0.991334
708940  0.755951  4.694221  0.416159  0.823179  0.101081  3.568007  0.739841  2.455024  6.279122  0.991334
708941  6.631839  2.553551  0.416159  0.823179  0.101081  2.446735  0.739841  2.564807  5.266229  0.991334

[708942 rows x 10 columns]

[708942 rows x 10 columns]

## y (train)
0         NoSale
1         NoSale
2         NoSale
3         NoSale
4         NoSale
           ...
708937    NoSale
708938    NoSale
708939    NoSale
708940    NoSale
708941    NoSale
Name: target, Length: 708942, dtype: object
```

## Offline Learning: Fit DT, EDT and EDTL models

Next steps present the basic usage of both variants (EDT and EDTL) for modeling the previously loaded data in an offline environment.

```python3
# Imports
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from evoltree import evoltree
import matplotlib.pyplot as plt

# Create two evoltree objects, one for each variant
edt = evoltree()
edtl = evoltree()
# Load dataset
X, y, X_val, y_val, X_ts, y_ts = edt.load_offline_data()

# Fit both versions on train data
## Normal variant:
edt.fit(
    X,
    y,
    "Sale",
    X_val,
    y_val,
    pop=100,
    gen=10,
    lamarck=False,
    experiment_name="test",
)
y_pred1 = edt.predict(X_ts, mode="best")

## Lamarckian variant, doesn't need as much iterations (gen)
edtl.fit(
    X,
    y,
    "Sale",
    X_val,
    y_val,
    pop=100,
    gen=5,
    lamarck=True,
    experiment_name="testLamarck",
)
# Predict on test data, using the solution with better predictive performance on validation data
y_predL1 = edtl.predict(X_ts, mode="best")

# Fit a traditional Decision Tree for comparison
dt = DecisionTreeClassifier(random_state=1234).fit(X, y)
prob = dt.predict_proba(X_ts)

# Compute AUC on test data
fpr, tpr, th = metrics.roc_curve(y_ts, y_pred1, pos_label="Sale")
fprL1, tprL1, thL1 = metrics.roc_curve(y_ts, y_predL1, pos_label="Sale")
fprdt, tprdt, thdt = metrics.roc_curve(y_ts, prob[:, 1], pos_label="Sale")
auc1 = metrics.auc(fpr, tpr)
aucL1 = metrics.auc(fprL1, tprL1)
aucdt = metrics.auc(fprdt, tprdt)
# Plot results
plt.rcParams["font.family"] = "sans-serif"
fig, ax = plt.subplots(1, 1, figsize=(5.5, 5))
plt.plot(
    fpr,
    tpr,
    color="red",
    ls="--",
    lw=2,
    label="EDT={}%".format(round(auc1 * 100, 2)),
)
plt.plot(
    fprL1,
    tprL1,
    color="royalblue",
    ls="--",
    lw=2,
    label="EDTL={}%".format(round(aucL1 * 100, 2)),
)
plt.plot(
    fprdt,
    tprdt,
    color="darkgreen",
    ls=":",
    lw=2,
    label="DT={}%".format(round(aucdt * 100, 2)),
)
plt.plot([0, 1], [0, 1], color="black", ls="--", label="baseline=50%")
plt.legend(loc=4)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()
plt.savefig("results.png")

```
Result:

![Results.](https://raw.githubusercontent.com/p-pereira/evoltree/main/imgs/results.png)

# Citation

If you use **evoltree** for your research, please cite the following paper:


Pedro Jose Pereira, Paulo Cortez, Rui Mendes:

[**Multi-objective Grammatical Evolution of Decision Trees for Mobile Marketing User Conversion Prediction.**](https://doi.org/10.1016/j.eswa.2020.114287)

Expert Systems with Applications 168: 114287 (2021).


```
@article{DBLP:journals/eswa/PereiraCM21,
	author    = {Pedro Jos{\'{e}} Pereira and Paulo Cortez and Rui Mendes},
	title     = {Multi-objective Grammatical Evolution of Decision Trees for Mobile Marketing user conversion prediction},
	journal   = {Expert Systems with Applications},
	volume    = {168},
	pages     = {114287},
	year      = {2021},
	doi       = {10.1016/j.eswa.2020.114287},
}
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/p-pereira/evoltree",
    "name": "evoltree",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": "",
    "keywords": "Evolutionary Decision Trees,Grammatical Evolution,Lamarckian Evolution",
    "author": "Pedro Jos\u00e9 Pereira, Paulo Cortez, Rui Mendes",
    "author_email": "pedro.pereira@dsi.uminho.pt",
    "download_url": "https://files.pythonhosted.org/packages/82/ee/ff15f39951f6a066ce88cea5022ecd23516876e330337d141169ace303fb/evoltree-0.0.8.tar.gz",
    "platform": null,
    "description": "# **evoltree**: Evolutionary Decision Trees\n\n[![Downloads](https://pepy.tech/badge/evoltree)](https://pepy.tech/project/evoltree) [![Downloads](https://pepy.tech/badge/evoltree/month)](https://pepy.tech/project/evoltree) [![Downloads](https://pepy.tech/badge/evoltree/week)](https://pepy.tech/project/evoltree)\n\n\n## Overview\n\n*evoltree* is a novel **Multi-objective Optimization (MO)** approach to evolve **Decision Trees (DT)** using **Grammatical Evolution (GE)**, under two main variants: a pure GE method (**EDT**) and a GE with Lamarckian Evolution (**EDTL**).\nBoth variants evolve variable-length DTs and perform a simultaneous optimization of the predictive performance (measured in terms of AUC) and model complexity (measured in terms of GE tree nodes). To handle big data, the GE methods include a **training sampling** and **parallelism evaluation mechanism**.\nBoth variants both use [PonyGE2](https://github.com/PonyGE/PonyGE2) as GE engine, while EDTL uses [sklearn DT](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).\n\n\nSolutions are represented as a **numpy.where** expression in the format bellow, with *x* being a pandas dataframe with data; *idx* a column from the dataset; *comparison* a comparison signal (e.g., <, ==); *value* being a numerical value; and *result* can be another **numpy.where** expression (creating chained expressions), or a class probability (numeric value from 0 to 1).\nDue to this representation, current evoltree implementation only allows numerical attributes.\n\n```python3\nnumpy.where(x[<idx>]<comparison><value>,<result>,<result>) \n```\nExample of a generated Python code (left) and the corresponding DT (right):\n![DT and code](https://raw.githubusercontent.com/p-pereira/evoltree/main/imgs/dt_code.png)\n\nMore details about this work can be found at: https://doi.org/10.1016/j.eswa.2020.114287.\n\n# Install\n\n**Using `pip`:**\n\n```bash\npip install evoltree\n```\n\n# Quick Start\n\nThis short tutorial contains a set of steps that will help you getting started with **evoltree**.\n\n## Load Example Data\n\nevoltree package includes two example datasets for testing purposes only. Data is ordered in time and the second dataset contains events collected after the first one.\nTo load the datasets, already divided into train, validation and test sets, two functions were created:\n- **load_offline_data** - returns the training, validation and test sets from first dataset, used for static environments;\n- **load_online_data** - returns the training, validation and test sets from both datasets, used for online learning scenarios.\n\nNext steps present how to load data in the two different modes (online and offline). Due to privacy issues, all data is anonymized.\n\n```python3\n# Import package\nfrom evoltree import evoltree\n# Create evoltree object\nedt = evoltree()\n# Loading first example dataset, already divided into train, validation and test sets.\nX, y, X_val, y_val, X_ts, y_ts = edt.load_offline_data()\n# Loading both example datasets, already divided into train, validation and test sets.\nX1, y1, X_val1, y_val1, X_ts1, y_ts1, X2, y2, X_val2, y_val2, X_ts2, y_ts2 = edt.load_online_data()\nprint(X)\nprint(y)\n```\n\n```\n# Outputs\n## X (train)\n        col1      col2      col3      col4      col5      col6      col7      col8      col9      col10\n0       0.755951  4.653432  2.767041  0.739897  0.101081  2.401580  2.890712  3.157321  5.554321  0.865465\n1       5.624979  4.782823  0.416159  0.823179  0.101081  2.446735  0.739841  2.564807  4.552277  0.991334\n2       0.755951  5.365407  3.081596  0.739897  0.101081  3.174113  1.986985  3.193263  3.378832  0.865465\n3       4.436114  6.393779  0.416159  0.823179  0.101081  2.238128  2.066007  2.564807  3.436435  0.865465\n4       7.071106  6.069200  0.416159  0.739897  0.101081  5.100103  0.739841  3.551982  4.886248  0.991334\n...          ...       ...       ...       ...       ...       ...       ...       ...       ...       ...\n708937  0.755951  4.804125  0.416159  0.823179  0.101081  2.807017  2.066007  2.564807  5.802875  0.865465\n708938  0.755951  2.558737  0.416159  0.823179  0.101081  2.558737  2.066007  2.564807  5.802875  2.173501\n708939  0.755951  2.875355  0.416159  0.823179  0.101081  2.572643  0.739841  2.455024  2.947305  0.991334\n708940  0.755951  4.694221  0.416159  0.823179  0.101081  3.568007  0.739841  2.455024  6.279122  0.991334\n708941  6.631839  2.553551  0.416159  0.823179  0.101081  2.446735  0.739841  2.564807  5.266229  0.991334\n\n[708942 rows x 10 columns]\n\n[708942 rows x 10 columns]\n\n## y (train)\n0         NoSale\n1         NoSale\n2         NoSale\n3         NoSale\n4         NoSale\n           ...\n708937    NoSale\n708938    NoSale\n708939    NoSale\n708940    NoSale\n708941    NoSale\nName: target, Length: 708942, dtype: object\n```\n\n## Offline Learning: Fit DT, EDT and EDTL models\n\nNext steps present the basic usage of both variants (EDT and EDTL) for modeling the previously loaded data in an offline environment.\n\n```python3\n# Imports\nfrom sklearn import metrics\nfrom sklearn.tree import DecisionTreeClassifier\nfrom evoltree import evoltree\nimport matplotlib.pyplot as plt\n\n# Create two evoltree objects, one for each variant\nedt = evoltree()\nedtl = evoltree()\n# Load dataset\nX, y, X_val, y_val, X_ts, y_ts = edt.load_offline_data()\n\n# Fit both versions on train data\n## Normal variant:\nedt.fit(\n    X,\n    y,\n    \"Sale\",\n    X_val,\n    y_val,\n    pop=100,\n    gen=10,\n    lamarck=False,\n    experiment_name=\"test\",\n)\ny_pred1 = edt.predict(X_ts, mode=\"best\")\n\n## Lamarckian variant, doesn't need as much iterations (gen)\nedtl.fit(\n    X,\n    y,\n    \"Sale\",\n    X_val,\n    y_val,\n    pop=100,\n    gen=5,\n    lamarck=True,\n    experiment_name=\"testLamarck\",\n)\n# Predict on test data, using the solution with better predictive performance on validation data\ny_predL1 = edtl.predict(X_ts, mode=\"best\")\n\n# Fit a traditional Decision Tree for comparison\ndt = DecisionTreeClassifier(random_state=1234).fit(X, y)\nprob = dt.predict_proba(X_ts)\n\n# Compute AUC on test data\nfpr, tpr, th = metrics.roc_curve(y_ts, y_pred1, pos_label=\"Sale\")\nfprL1, tprL1, thL1 = metrics.roc_curve(y_ts, y_predL1, pos_label=\"Sale\")\nfprdt, tprdt, thdt = metrics.roc_curve(y_ts, prob[:, 1], pos_label=\"Sale\")\nauc1 = metrics.auc(fpr, tpr)\naucL1 = metrics.auc(fprL1, tprL1)\naucdt = metrics.auc(fprdt, tprdt)\n# Plot results\nplt.rcParams[\"font.family\"] = \"sans-serif\"\nfig, ax = plt.subplots(1, 1, figsize=(5.5, 5))\nplt.plot(\n    fpr,\n    tpr,\n    color=\"red\",\n    ls=\"--\",\n    lw=2,\n    label=\"EDT={}%\".format(round(auc1 * 100, 2)),\n)\nplt.plot(\n    fprL1,\n    tprL1,\n    color=\"royalblue\",\n    ls=\"--\",\n    lw=2,\n    label=\"EDTL={}%\".format(round(aucL1 * 100, 2)),\n)\nplt.plot(\n    fprdt,\n    tprdt,\n    color=\"darkgreen\",\n    ls=\":\",\n    lw=2,\n    label=\"DT={}%\".format(round(aucdt * 100, 2)),\n)\nplt.plot([0, 1], [0, 1], color=\"black\", ls=\"--\", label=\"baseline=50%\")\nplt.legend(loc=4)\nplt.xlabel(\"FPR\")\nplt.ylabel(\"TPR\")\nplt.show()\nplt.savefig(\"results.png\")\n\n```\nResult:\n\n![Results.](https://raw.githubusercontent.com/p-pereira/evoltree/main/imgs/results.png)\n\n# Citation\n\nIf you use **evoltree** for your research, please cite the following paper:\n\n\nPedro Jose Pereira, Paulo Cortez, Rui Mendes:\n\n[**Multi-objective Grammatical Evolution of Decision Trees for Mobile Marketing User Conversion Prediction.**](https://doi.org/10.1016/j.eswa.2020.114287)\n\nExpert Systems with Applications 168: 114287 (2021).\n\n\n```\n@article{DBLP:journals/eswa/PereiraCM21,\n\tauthor    = {Pedro Jos{\\'{e}} Pereira and Paulo Cortez and Rui Mendes},\n\ttitle     = {Multi-objective Grammatical Evolution of Decision Trees for Mobile Marketing user conversion prediction},\n\tjournal   = {Expert Systems with Applications},\n\tvolume    = {168},\n\tpages     = {114287},\n\tyear      = {2021},\n\tdoi       = {10.1016/j.eswa.2020.114287},\n}\n```\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "evoltree - Evolutionary Decision Trees",
    "version": "0.0.8",
    "split_keywords": [
        "evolutionary decision trees",
        "grammatical evolution",
        "lamarckian evolution"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "003d679ae6793cd6d5995858a7b2b6c4",
                "sha256": "ac0e510ca078a7bfbca4e2fc2f62c0c241fa5e8ce689f09694cbfc8935faa457"
            },
            "downloads": -1,
            "filename": "evoltree-0.0.8.tar.gz",
            "has_sig": false,
            "md5_digest": "003d679ae6793cd6d5995858a7b2b6c4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.5",
            "size": 29866761,
            "upload_time": "2023-01-02T15:03:24",
            "upload_time_iso_8601": "2023-01-02T15:03:24.318018Z",
            "url": "https://files.pythonhosted.org/packages/82/ee/ff15f39951f6a066ce88cea5022ecd23516876e330337d141169ace303fb/evoltree-0.0.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-02 15:03:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "p-pereira",
    "github_project": "evoltree",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "matplotlib",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "scikit-learn",
            "specs": []
        },
        {
            "name": "ply",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        }
    ],
    "lcname": "evoltree"
}
        
Elapsed time: 0.61179s