|made-with-python| |python-version| |ruff|
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |python-version| image:: https://img.shields.io/badge/python-3.8%20|%203.9%20|%203.10%20|%203.11%20|%203.12-blue
:target: https://www.python.org/
.. |ruff| image:: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json
:target: https://github.com/astral-sh/ruff
.. |pypi| image:: https://img.shields.io/pypi/v/chem_mat_data.svg
:target: https://pypi.org/project/chem_mat_data/
=================
⚗️ ChemMatData
=================
.. image:: chem_mat_data/ChemMatData_logo_final.png
:alt: ChemMatData Logo
:align: center
The ``chem_mat_data`` package provides easy access to a large range of property prediction datasets from chemistry and material science.
The aim of this package is to provide these datasets in a unified format suitable for *machine learning* applications and, in particular, for training
*graph neural networks (GNNs)*.
To that end, ``chem_mat_data`` provides simple, single-line command line (CLI) and programming (API) interfaces to download
datasets either in *raw* or in *processed* (graph) format.
Features:
- 🐍 Easily installable via ``pip``
- 📦 Instant access to a collection of datasets across the domains of *chemistry* and *material science*
- 🤖 Direct support of popular graph deep learning libraries like `Torch/PyG <https://pytorch-geometric.readthedocs.io/en/latest/>`_ and `Jax/Jraph <https://jraph.readthedocs.io/en/latest/>`_
- 🤝 Broad Python version compatibility (3.8 to 3.12)
- ⌨️ Comprehensive command line interface (CLI)
- 📖 Documentation: https://the16thpythonist.github.io/chem_mat_data
Getting ready to train a PyTorch Geometric model can be as easy as this:
.. code-block:: python
from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')
# Convert the graph dicts into PyG Data objects
data_list: list[Data] = pyg_data_list_from_graphs(graphs)
data_loader: DataLoader = DataLoader(data_list, batch_size=32, shuffle=True)
# Network training...
📦 Pip Installation
===================
Install the latest stable release using ``pip`` from the Python Package Index (PyPI):
.. code-block:: console
pip install chem_mat_data
Or install the latest development version directly from the GitHub repository:
.. code-block:: console
pip install git+https://github.com/the16thpythonist/chem_mat_data.git
⌨️ Command Line Interface (CLI)
===============================
The package provides the ``cmdata`` command line interface (CLI) to interact with the remote database.
To see the list of all available commands, simply use the ``--help`` flag:
.. code-block:: bash
cmdata --help
Listing Available Datasets
--------------------------
To see which datasets are available for download from the remote file share server, use the ``list`` command:
.. code-block:: bash
cmdata list
This will print a table of all datasets that are currently available for download from the database. Each row of the
table represents one dataset and contains the name of the dataset, the number of molecules in the dataset and the number of
target properties.
Downloading Datasets
--------------------
To download one of the listed datasets, pass its name to the ``download`` command:
.. code-block:: bash
cmdata download "clintox"
This will download the ``clintox.csv`` dataset file to your current working directory.
You can also specify the path to which the dataset should be downloaded as follows:
.. code-block:: bash
cmdata download --path="/tmp" "clintox"
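
Once the file is on disk, it can be inspected with nothing but the Python standard library. The following is a minimal sketch, assuming that the downloaded CSV has a header row with a ``smiles`` column like the raw format described in the Quickstart section. The helper ``read_smiles_csv`` and the stand-in rows are hypothetical and not part of ``chem_mat_data``.

.. code-block:: python

    import csv
    import tempfile
    from pathlib import Path


    def read_smiles_csv(path):
        """Read a dataset CSV and return its rows as a list of dicts."""
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        # Assumption: the file provides a "smiles" column
        if rows and "smiles" not in rows[0]:
            raise ValueError(f'no "smiles" column found in {path}')
        return rows


    # Stand-in file so that the sketch runs without a download. With a real
    # download this would be e.g. Path("/tmp") / "clintox.csv".
    with tempfile.TemporaryDirectory() as folder:
        csv_path = Path(folder) / "example.csv"
        csv_path.write_text("smiles,target\nCCO,0\nc1ccccc1,1\n", encoding="utf-8")
        rows = read_smiles_csv(csv_path)

    print(len(rows), rows[0]["smiles"])

Note that ``csv.DictReader`` returns every value as a string, so target columns need to be converted before numerical use.
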
🚀 Quickstart
=============
Alternatively, the ``chem_mat_data`` functionality can be used programmatically as part of Python code. The
package provides each dataset either in **raw** or **processed/graph** format (for further information on the
distinction, visit the `Documentation <https://the16thpythonist.github.io/chem_mat_data/api_datasets/>`__).
Raw Datasets
------------
You can use the ``load_smiles_dataset`` function to download a dataset in its raw format. This function
returns the dataset as a ``pandas.DataFrame`` object which contains a "smiles" column along with the
target value annotations as separate data frame columns.
.. code-block:: python
import pandas as pd
from chem_mat_data import load_smiles_dataset
df: pd.DataFrame = load_smiles_dataset('clintox')
print(df.head())
Graph Datasets
--------------
You can also use the ``load_graph_dataset`` function to download the same dataset in the *pre-processed* graph
representation. This function will return a list of ``dict`` objects which contain the full graph representation
of the corresponding molecules.
.. code-block:: python
from rich.pretty import pprint
from chem_mat_data import load_graph_dataset
graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)
For further information on the graph representation, visit the `Documentation <https://the16thpythonist.github.io/chem_mat_data/graph_representation/>`__.
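
The two keys used in the training example of this README, ``node_attributes`` and ``graph_labels``, are enough to derive the input and output sizes of a model. The following is a minimal sketch, assuming that ``node_attributes`` holds one row per node and ``graph_labels`` one entry per target. The helper ``summarize_graph`` and the stand-in graph are hypothetical; real graphs contain arrays and further keys.

.. code-block:: python

    def summarize_graph(graph):
        """Return basic size information for a single graph dict."""
        node_attributes = graph["node_attributes"]
        return {
            "num_nodes": len(node_attributes),
            "num_node_features": len(node_attributes[0]),
            "num_targets": len(graph["graph_labels"]),
        }


    # Stand-in graph with 3 nodes, 2 features per node and 1 target value.
    # Nested lists are used here only to keep the sketch dependency-free.
    example_graph = {
        "node_attributes": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
        "graph_labels": [1.0],
    }

    summary = summarize_graph(example_graph)
    print(summary)

Because the helper only relies on ``len``, it should work the same way on two-dimensional arrays as on the nested lists used here.
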
Training Graph Neural Networks
------------------------------
Finally, the following code snippet demonstrates how to set up a graph neural network (GNN) model using the
PyTorch Geometric library and run a forward pass on a batch of graphs loaded from the ``chem_mat_data`` package.
.. code-block:: python
from torch import Tensor
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn.models import GIN
from rich.pretty import pprint
from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)
# Convert the graph dicts into PyG Data objects
data_list = pyg_data_list_from_graphs(graphs)
data_loader = DataLoader(data_list, batch_size=32, shuffle=True)
# Construct a GNN model
model = GIN(
in_channels=example_graph['node_attributes'].shape[1],
out_channels=example_graph['graph_labels'].shape[0],
hidden_channels=32,
num_layers=3,
)
# Perform model forward pass with a batch of graphs
data: Data = next(iter(data_loader))
out_pred: Tensor = model(
x=data.x,
edge_index=data.edge_index,
batch=data.batch
)
pprint(out_pred)
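
As far as we are aware, the ``GIN`` model from ``torch_geometric.nn.models`` returns one output row per node rather than one per graph, so graph-level targets require a pooling step over ``data.batch``. The sketch below only illustrates the arithmetic of such a sum pooling in plain Python; the function ``sum_pool`` and the toy values are hypothetical and not part of either library.

.. code-block:: python

    def sum_pool(node_outputs, batch):
        """Sum node-level output rows into one row per graph.

        node_outputs: list of per-node output rows (lists of floats)
        batch: list assigning each node to the index of its graph
        """
        num_graphs = max(batch) + 1
        num_outputs = len(node_outputs[0])
        pooled = [[0.0] * num_outputs for _ in range(num_graphs)]
        for row, graph_index in zip(node_outputs, batch):
            for column, value in enumerate(row):
                pooled[graph_index][column] += value
        return pooled


    # Toy batch: nodes 0 and 1 belong to graph 0, node 2 belongs to graph 1
    node_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    batch = [0, 0, 1]

    graph_outputs = sum_pool(node_outputs, batch)
    print(graph_outputs)

With PyTorch Geometric tensors, the equivalent step would presumably be ``global_add_pool(out_pred, data.batch)`` from ``torch_geometric.nn``.
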
🤝 Credits
===========
We thank the following packages, institutions and individuals for their significant impact on this package.
* PyComex_ is a micro framework which simplifies the setup, processing and management of computational
experiments. It is also used to auto-generate the command line interface that can be used to interact
with these experiments.
.. _PyComex: https://github.com/the16thpythonist/pycomex.git
.. _Cookiecutter: https://github.com/cookiecutter/cookiecutter