sapientml


| Field           | Value                              |
| --------------- | ---------------------------------- |
| Name            | sapientml                          |
| Version         | 0.4.12.post0                       |
| Home page       | https://sapientml.io/              |
| Summary         | Generative AutoML for Tabular Data |
| Upload time     | 2023-12-04 05:32:17                |
| Maintainer      | Kosaku Kimura                      |
| Docs URL        | None                               |
| Author          | The SapientML Authors              |
| Requires Python | `>=3.10,<3.13`                     |
| License         | Apache-2.0                         |
| Keywords        | automl                             |
| Requirements    | No requirements were recorded.     |

![SapientML](https://raw.githubusercontent.com/sapientml/sapientml/main/static/SapientML_positive_logo.svg#gh-light-mode-only)
![](./static/SapientML_negative_logo.svg#gh-dark-mode-only)
<h1 align="center">Generative AutoML for Tabular Data</h1>
<p align='center'>
SapientML is an AutoML technology that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.
</p>
<p align='center'>
<a href="https://pypi.org/project/sapientml/"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/sapientml"></a>
<a href="https://github.com/sapientml/sapientml/actions/workflows/release.yml"><img alt="Release" src="https://github.com/sapientml/sapientml/actions/workflows/release.yml/badge.svg"></a>
<a href="https://conventionalcommits.org"><img alt="Conventional Commits" src="https://img.shields.io/badge/Conventional%20Commits-1.0.0-%23FE5196?logo=conventionalcommits&logoColor=white"></a>
<a href="https://www.bestpractices.dev/projects/7781"><img alt="OpenSSF Best Practices" src="https://www.bestpractices.dev/projects/7781/badge"></a>
<a href="https://codecov.io/gh/sapientml/sapientml" ><img src="https://codecov.io/gh/sapientml/sapientml/graph/badge.svg?token=STVPNF5X25"/></a>
<a href="https://pepy.tech/project/sapientml"><img src="https://static.pepy.tech/badge/sapientml"/></a>
<a href="https://pepy.tech/project/sapientml"><img src="https://static.pepy.tech/badge/sapientml/month"/></a>
</p>

# Installation

From the PyPI repository:

```
pip install sapientml
```

From source code:

```
git clone https://github.com/sapientml/sapientml.git
cd sapientml
pip install poetry
poetry install
```
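The package metadata for version 0.4.12.post0 declares `requires_python` as `>=3.10,<3.13`, so installation can fail on interpreters outside that range. The following is a minimal sketch of a pre-install check using only the standard library; the function name `python_is_supported` and the two constants are illustrative and not part of SapientML.

```python
import sys

# Bounds copied from the package metadata: >=3.10,<3.13
MIN_VERSION = (3, 10)  # inclusive lower bound
MAX_VERSION = (3, 13)  # exclusive upper bound


def python_is_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    `version` is any tuple-like starting with (major, minor); it defaults to
    the running interpreter. This helper is hypothetical, not a SapientML API.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} supported: {python_is_supported()}")
```

If the check prints `False`, switching to a supported interpreter before running `pip install sapientml` is likely the simplest fix.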
# Getting Started

Please see our [Documentation](https://sapientml.readthedocs.io/en/latest/user/usage.html) for further details.

## Run AutoML

```py
import pandas as pd
from sapientml import SapientML
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the Titanic dataset and hold out part of it for testing
data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(data)

# Keep the true labels aside, then remove the target column from the test set
y_true = test_data["survived"].reset_index(drop=True)
test_data = test_data.drop(columns=["survived"])

# Generate a pipeline for the target column, train it, and predict
cls = SapientML(["survived"])
cls.fit(train_data)
y_pred = cls.predict(test_data)

y_pred = y_pred["survived"].rename("survived_pred")
print(f"F1 score: {f1_score(y_true, y_pred)}")
```
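The example reports an F1 score, which for a binary target is the harmonic mean of precision and recall on the positive class (label `1` by default in scikit-learn's `f1_score`). As a rough illustration of what that number means, here is a pure-Python sketch; `f1_binary` and the toy label lists are made up for this illustration and are not part of SapientML or scikit-learn.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for a single positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision or recall is zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f"F1 score: {f1_binary(y_true, y_pred):.3f}")
```

Here precision and recall are both 2/3, so the F1 score is also about 0.667. In practice `sklearn.metrics.f1_score` is the better choice, since it also handles multiclass averaging and edge cases.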

## Obtain and Run Generated Code

After calling `fit`, the `model` field holds a model built from the generated code.
`model` provides `fit`, `predict`, and `save` methods, which train a model with the generated code, predict on test data with the generated code, and save the generated code to a designated folder.

```py
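# `cls` is the SapientML instance and `train_data` the DataFrame from the previous example
model = cls.fit(train_data, codegen_only=True).model

# X_train, y_train, and X_test are placeholders for your own features and targets
model.fit(X_train, y_train)  # build a model from other data with the same generated code

y_pred = model.predict(X_test)  # predict with the generated code

model.save("/path/to/output")  # save the generated code to `/path/to/output`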
```
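After `model.save(...)`, it can be handy to see what ended up in the output folder. This README does not say which files `save` writes, so the sketch below assumes nothing about their names and simply lists whatever is there. The helper `list_saved_files` is hypothetical and uses only the standard library.

```python
from pathlib import Path


def list_saved_files(output_dir):
    """Return the sorted relative paths of all files under `output_dir`.

    Hypothetical helper for inspecting a folder; not a SapientML API.
    """
    root = Path(output_dir)
    return sorted(
        p.relative_to(root).as_posix()  # POSIX-style separators on every platform
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # "/path/to/output" is the placeholder folder used in the example above
    for name in list_saved_files("/path/to/output"):
        print(name)
```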

# Examples

| Dataset                                                                                                            | Task             | Target      | Code                                                                                                                                                                                                                                                       |
| ------------------------------------------------------------------------------------------------------------------ | ---------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Titanic Dataset](https://www.openml.org/d/40945)                                                                  | `classification` | `survived`  | <a target="_blank" href="https://colab.research.google.com/github/sapientml/sapientml/blob/main/static/sapientml-example-titanic.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>                      |
| Hotel Cancellation                                                                                                 | `classification` | `Status`    | <a target="_blank" href="https://colab.research.google.com/github/sapientml/sapientml/blob/main/static/sapientml-example-hotel-candel-prediction.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>      |
| Housing Prices                                                                                                     | `regression`     | `SalePrice` | <a target="_blank" href="https://colab.research.google.com/github/sapientml/sapientml/blob/main/static/sapientml-example-housing-prices.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>               |
| [Medical Insurance Charges](https://www.kaggle.com/datasets/harishkumardatalab/medical-insurance-price-prediction) | `regression`     | `charges`   | <a target="_blank" href="https://colab.research.google.com/github/sapientml/sapientml/blob/main/static/sapientml-example-medical-insurance-prediction.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |

# Publications

The technology behind this software originates from the following research paper, published at the International Conference on Software Engineering (ICSE), one of the premier conferences on software engineering.

**Ripon K. Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul R. Prasad (2022, May). [SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions](https://arxiv.org/abs/2202.10451). In *[Proceedings of the 44th International Conference on Software Engineering](https://conf.researchr.org/home/icse-2022)* (pp. 1932-1944).**

```bibtex
@inproceedings{10.1145/3510003.3510226,
author = {Saha, Ripon K. and Ura, Akira and Mahajan, Sonal and Zhu, Chenguang and Li, Linyi and Hu, Yang and Yoshida, Hiroaki and Khurshid, Sarfraz and Prasad, Mukul R.},
title = {SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions},
year = {2022},
isbn = {9781450392211},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3510003.3510226},
doi = {10.1145/3510003.3510226},
abstract = {Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses meta-learning to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using a pipeline dataflow model derived from the corpus. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1,094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 4 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances. 
This difference is amplified on the 10 most challenging benchmarks, where SapientML wins on 9 instances with the other tools failing to produce pipelines on 4 or more benchmarks.},
booktitle = {Proceedings of the 44th International Conference on Software Engineering},
pages = {1932–1944},
numpages = {13},
keywords = {AutoML, program synthesis, program analysis, machine learning},
location = {Pittsburgh, Pennsylvania},
series = {ICSE '22}
}
```

            

        