mlmbench


- Name: mlmbench
- Version: 1.0.3
- Home page: https://github.com/gmrandazzo/mlmbench
- Summary: MLM MachineLearning Molecular Benchmark
- Upload time: 2023-03-12 11:06:27
- Author: Giuseppe Marco Randazzo
- Requires Python: >=3.0
- License: GPLv3
- Requirements: none recorded

# MLMBench - MachineLearning Molecular Benchmark

![Page views](https://visitor-badge.glitch.me/badge?page_id=gmrandazzo.mlmbench)
[![Licence: GPL v3](https://img.shields.io/github/license/gmrandazzo/mlmbench)](https://github.com/gmrandazzo/mlmbench/blob/master/LICENSE)

MLMBench collects datasets and splits them to support FAIR ML benchmarks.
MLMBench can be used with different ML algorithms and data representations
for molecular property/activity predictions and more.

The goals of this code are:
- keep the API simple
- require no other libraries
- keep the datasets offline, represented as CSV files (RFC 4180 standard) or SMILES string lists.


Splits are made using well-known rational approaches such as:

- random split
- meaningful split for model target extrapolation
- meaningful split for chemical diversity extrapolation
- literature published split
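
As an illustration of the simplest approach in the list above, here is a minimal sketch of a random split over compound names. The function `random_split`, its `fractions` and `seed` parameters, and the 60/20/20 proportions are assumptions made for this example; they are not taken from the mlmbench code.

```python
import random


def random_split(names, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle names and cut them into train/test/validation groups.

    Hypothetical helper: the proportions and the seed are illustrative only.
    """
    shuffled = list(names)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # whatever is left over
    return train, test, val


names = [f"mol{i}" for i in range(10)]
train, test, val = random_split(names)
print(len(train), len(test), len(val))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same partition.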

The datasets are stored in the "data" directory in subfolders.
Every subfolder needs the following files with the following names:

- Readme.txt: describes the dataset (provenance, type of data, descriptor versions, and so on)
- cv.splits: the splits required for fair training, testing, and validation with any ML algorithm
- dataset.csv: the matrix of features
- target.csv: the matrix of one or more targets
- dataset.smi: the SMILES list
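
The layout above can be checked before using or submitting a dataset. The sketch below is a hypothetical helper (`missing_files` is not part of mlmbench) that reports which of the file names listed above are absent from a folder. The CSV contents written in the demo are placeholders, not the real column layout.

```python
from pathlib import Path
import tempfile

# File names as listed above.
REQUIRED_FILES = ("Readme.txt", "cv.splits", "dataset.csv", "target.csv", "dataset.smi")


def missing_files(dataset_dir):
    """Return the required file names that are absent from dataset_dir."""
    folder = Path(dataset_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Demo on a throwaway folder that contains only two of the five files.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "my-dataset"
    demo.mkdir()
    (demo / "dataset.csv").write_text("name,feat1\nmol1,0.5\n")  # placeholder content
    (demo / "target.csv").write_text("name,y\nmol1,1.2\n")       # placeholder content
    report = missing_files(demo)

print(report)  # ['Readme.txt', 'cv.splits', 'dataset.smi']
```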

Install
-------

```
pip3 install mlmbench
```

Split types per dataset
-----------------------
mlmbench includes two different splits for every dataset, plus a literature split where one is available:
- random split, made with "mkrndsplits.py" starting from a list of names
- target extrapolation split, made with "mktgtextrapsplits.py" starting from the target file.
  In this case, the algorithm first imports the target file and then, for every column,
  ranks the values from minimum to maximum and divides the ordered targets
  into "N" splits, where N is chosen by the user. This split aims to check for "extrapolation."
- literature split (if available). In this case, we try to preserve the particular splits as they were published.
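
The ranking step of the target extrapolation split can be sketched as follows for a single target column. This is not the code of "mktgtextrapsplits.py": the function `ranked_groups` and the way leftover compounds are spread over the first groups are assumptions made for illustration. How the resulting groups are assigned to train, test, and validation is not covered here.

```python
def ranked_groups(targets, n_groups):
    """Sort compound names by target value and cut them into contiguous groups.

    targets maps a compound name to its value for one target column.
    Hypothetical helper: group sizes differ by at most one compound.
    """
    ordered = sorted(targets, key=targets.get)  # min to max
    base, extra = divmod(len(ordered), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)
        groups.append(ordered[start:start + size])
        start += size
    return groups


targets = {"mol1": 2.5, "mol2": -0.3, "mol3": 1.1, "mol4": 4.0, "mol5": 0.7}
groups = ranked_groups(targets, 3)
print(groups)  # [['mol2', 'mol5'], ['mol3', 'mol1'], ['mol4']]
```

Because every group covers a contiguous range of target values, a model trained on some groups and evaluated on another has to predict values outside the range it has seen.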

Available datasets
------------------

- BACE-moleculenet
- BACE-random
- BACE-tgt_extrapolation
- FU-random
- FU-tgt_extrapolation
- HLMCLint-random
- HLMCLint-tgt_extrapolation
- MeltingPoint-random
- MeltingPoint-tgt_extrapolation
- NIR_Gasoline-random
- NIR_Gasoline-tgt_extrapolation
- SteroidsLSS-isomers
- SteroidsLSS-random
- SteroidsLSS-tgt_extrapolation
- esol-chemdiversity
- esol-random
- esol-tgt_extrapolation
- logDpH7.4-random
- logDpH7.4-tgt_extrapolation

How to use
----------

```
#!/usr/bin/env python3

from mlmbench.data import Datasets

ds = Datasets()
print(ds.get_available_datasets())
print(f'Dataset info: {ds.get_info("esol-random")}')
for train_data, test_data, val_data in ds.ttv_generator("esol-random"):
    print("train ", train_data["xdata"].shape, train_data["target"].shape, len(train_data["smi"]))
    print("test ", test_data["xdata"].shape, test_data["target"].shape, len(test_data["smi"]))
    print("val ", val_data["xdata"].shape, val_data["target"].shape, len(val_data["smi"]))
    
    # Do ml training/test validation, collect the results and store it in your 
    # appropriate format to do your analysis.

    print("-"*40)

```
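
One possible way to collect results per split is sketched below with a trivial baseline that always predicts the mean of the training targets. The helpers `rmse` and `mean_baseline_scores` are hypothetical, and the folds are stand-in dictionaries with made-up numbers. The sketch assumes each "target" is a flat list of floats; the objects yielded by `ttv_generator` expose `.shape`, so they would likely need converting first.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_baseline_scores(folds):
    """For each (train, test, val) triple, predict the training mean everywhere."""
    results = []
    for i, (train, test, val) in enumerate(folds):
        mean = sum(train["target"]) / len(train["target"])
        results.append({
            "split": i,
            "test_rmse": rmse(test["target"], [mean] * len(test["target"])),
            "val_rmse": rmse(val["target"], [mean] * len(val["target"])),
        })
    return results


# Stand-in folds with made-up numbers; only the "target" key is used.
toy_folds = [
    ({"target": [1.0, 2.0, 3.0]}, {"target": [2.0, 4.0]}, {"target": [2.0]}),
]
scores = mean_baseline_scores(toy_folds)
print(scores)
```

A baseline like this gives a reference score that any real model should beat on the same split.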

Submit new dataset
------------------

1) Fork the project.
2) Clone the forked project.
3) Add the dataset in this form:
    dataset.csv: tabular data for any kind of descriptors
    target.csv: tabular data for one or multiple targets
    dataset.smi: SMILES of the molecules in the appropriate format, e.g. "c1ccccc1 benzene"
    cv.split: the split you like. This file needs to follow this
              standard: every line represents one model and consists of
              groups separated by the ";" character; every group is a list
              of compound names separated by the "," character.
    For example:
            train keys          test keys           validation keys
    line 1  mol1,mol2,mol3,.. ; mol200,mol201,... ; mol400,mol401,...
    line 2  ...
    line 3  ..

    Readme.md: information about the dataset (e.g. source and so on)
4) Create a pull request; 99.9% of them will be merged.
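
The split file format described in step 3 can be read with a few lines of Python. The sketch below is not the parser used by mlmbench: `parse_split_line` and `parse_splits` are hypothetical helpers, and the compound names in the example are made up. It assumes exactly three groups per line (train, test, validation), as in the example layout above.

```python
def parse_split_line(line):
    """Split one line into (train, test, validation) lists of compound names."""
    groups = [
        [name.strip() for name in group.split(",") if name.strip()]
        for group in line.strip().split(";")
    ]
    if len(groups) != 3:
        raise ValueError(f"expected 3 groups separated by ';', got {len(groups)}")
    return tuple(groups)


def parse_splits(text):
    """Parse every non-empty line of a split file; one line per model."""
    return [parse_split_line(line) for line in text.splitlines() if line.strip()]


example = (
    "mol1,mol2,mol3 ; mol200,mol201 ; mol400,mol401\n"
    "mol2,mol3,mol200 ; mol1,mol401 ; mol201,mol400\n"
)
splits = parse_splits(example)
train, test, val = splits[0]
print(train, test, val)
```

Running a file through a parser like this before opening a pull request is a quick way to catch a missing ";" or an empty group.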


            
