# MLMBench - Machine Learning Molecular Benchmark

[License](https://github.com/gmrandazzo/mlmbench/blob/master/LICENSE)
MLMBench collects datasets and predefined splits for FAIR ML benchmarking.
It can be used with different ML algorithms and data representations
for molecular property/activity prediction and more.
The scope of this code is to:
- keep the API simple
- require no additional libraries
- keep the datasets offline, represented as CSV files (RFC 4180 standard) or SMILES string lists.
Splits are made using well-known rational approaches such as:
- random split
- meaningful split for model target extrapolation
- meaningful split for chemical diversity extrapolation
- literature published split
The datasets are stored in subfolders of the "data" directory.
Every subfolder needs the following files with the following names:
- Readme.txt: describes the dataset (provenance, type of data, descriptor versions, and so on)
- cv.splits: the splits required for fair training, test, and validation with any ML algorithm
- dataset.csv: the matrix of features
- target.csv: the matrix of target(s)
- dataset.smi: the SMILES list
Install
-------
```
pip3 install mlmbench
```
Split types per dataset
-----------------------
mlmbench includes at least two different splits for every dataset:
- random split, generated with "mkrndsplits.py" from a list of names
- target-extrapolation split, generated with "mktgtextrapsplits.py" from the target file.
  In this case, the algorithm first imports the target file; then, for every column,
  it ranks the values from min to max and cuts the ordered targets
  into "N" splits selected by the user. This split aims to test "extrapolation."
- literature split (if available). In this case, we try to preserve particular splits published in the literature.
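The target-extrapolation procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the idea (rank by target value, then cut into contiguous chunks), not the actual `mktgtextrapsplits.py` implementation; the function name and details are assumptions:

```python
def target_extrapolation_splits(names, target_values, n_splits):
    """Order compounds by target value and slice the ordered list into
    n_splits contiguous chunks. Training on the low-value chunks and
    validating on the high-value ones probes extrapolation ability."""
    # Rank compound names from min to max target value.
    ordered = [name for name, _ in sorted(zip(names, target_values),
                                          key=lambda pair: pair[1])]
    size = len(ordered) // n_splits
    splits = []
    for i in range(n_splits):
        start = i * size
        # Last chunk absorbs any remainder.
        end = (i + 1) * size if i < n_splits - 1 else len(ordered)
        splits.append(ordered[start:end])
    return splits

names = ["mol1", "mol2", "mol3", "mol4", "mol5", "mol6"]
targets = [3.2, 0.1, 5.7, 1.4, 4.9, 2.0]
print(target_extrapolation_splits(names, targets, 3))
# lowest-target chunk first: [['mol2', 'mol4'], ['mol6', 'mol1'], ['mol5', 'mol3']]
```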
Available datasets
------------------
- BACE-moleculenet
- BACE-random
- BACE-tgt_extrapolation
- FU-random
- FU-tgt_extrapolation
- HLMCLint-random
- HLMCLint-tgt_extrapolation
- MeltingPoint-random
- MeltingPoint-tgt_extrapolation
- NIR_Gasoline-random
- NIR_Gasoline-tgt_extrapolation
- SteroidsLSS-isomers
- SteroidsLSS-random
- SteroidsLSS-tgt_extrapolation
- esol-chemdiversity
- esol-random
- esol-tgt_extrapolation
- logDpH7.4-random
- logDpH7.4-tgt_extrapolation
How to use
----------
```
#!/usr/bin/env python3
from mlmbench.data import Datasets

ds = Datasets()
print(ds.get_available_datasets())
print(f'Dataset info: {ds.get_info("esol-random")}')
for train_data, test_data, val_data in ds.ttv_generator("esol-random"):
    print("train", train_data["xdata"].shape, train_data["target"].shape, len(train_data["smi"]))
    print("test ", test_data["xdata"].shape, test_data["target"].shape, len(test_data["smi"]))
    print("val  ", val_data["xdata"].shape, val_data["target"].shape, len(val_data["smi"]))

    # Do ML training/test/validation, collect the results, and store them
    # in an appropriate format for your analysis.
    print("-" * 40)
```
Submit new dataset
------------------
1) Fork the project!
2) Clone the forked project.
3) Add the dataset in this form:
   - dataset.csv: tabular data for any kind of descriptors
   - target.csv: tabular data for one or multiple targets
   - dataset.smi: SMILES of the molecules in the appropriate format, e.g. "c1ccccc1 benzene"
   - cv.split: the split you like. Each line of this file represents one model;
     the three groups (train, test, and validation keys) are separated by the ";"
     character, and the compound names within a group are separated by the ","
     character, i.e.:

                 train keys          test keys           validation keys
      line 1     mol1,mol2,mol3,.. ; mol200,mol201,... ; mol400,mol401,...
      line 2     ...
      line 3     ...

   - Readme.md: info regarding the dataset (i.e. source and so on)
4) Create a pull request; 99.9% of the time it will be merged.
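A line in the cv.split format above can be parsed with a few lines of Python. This is a hedged sketch of the described format, not code from mlmbench itself; the function name is illustrative:

```python
def parse_cv_split_line(line):
    """Parse one cv.split line of the form
    'train1,train2,... ; test1,test2,... ; val1,val2,...'
    into three lists of compound names."""
    groups = line.strip().split(";")
    if len(groups) != 3:
        raise ValueError("expected three ';'-separated groups "
                         "(train, test, validation)")
    # Split each group on ',' and drop surrounding whitespace / empty names.
    return [[name.strip() for name in group.split(",") if name.strip()]
            for group in groups]

train, test, val = parse_cv_split_line(
    "mol1,mol2,mol3 ; mol200,mol201 ; mol400,mol401")
print(train)  # ['mol1', 'mol2', 'mol3']
print(val)    # ['mol400', 'mol401']
```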