chem-mat-database


:Name: chem-mat-database
:Version: 1.0.0
:Summary: Command Line Interface for projects
:Authors: Jonas Teufel, Mohit Singh
:Upload time: 2025-01-13 14:42:32
:Requires Python: >=3.8, <=3.12
:License: MIT License
:Keywords: chemistry, data, data management, dataset, graph neural network, graph representation, machine learning

|made-with-python| |python-version| |ruff|


.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
   :target: https://www.python.org/

.. |python-version| image:: https://img.shields.io/badge/python-3.8%20|%203.9%20|%203.10%20|%203.11%20|%203.12-blue
   :target: https://www.python.org/

.. |ruff| image:: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json
   :target: https://github.com/astral-sh/ruff

.. |pypi| image:: https://img.shields.io/pypi/v/chem_mat_data.svg
   :target: https://pypi.org/project/chem_mat_data/

=================
⚗️ ChemMatData
=================

.. image:: chem_mat_data/ChemMatData_logo_final.png
   :alt: ChemMatData Logo
   :align: center

The ``chem_mat_data`` package provides easy access to a large range of property prediction datasets from chemistry and materials science.
The aim of this package is to provide these datasets in a unified format suitable for *machine learning* applications, specifically for training
*graph neural networks (GNNs)*.

To this end, ``chem_mat_data`` provides simple, single-line command line (CLI) and programming (API) interfaces to download
datasets either in *raw* or in *processed* (graph) format.

Features:

- 🐍 Easily installable via ``pip``
- 📦 Instant access to a collection of datasets across the domains of *chemistry* and *materials science*
- 🤖 Direct support of popular graph deep learning libraries like `Torch/PyG <https://pytorch-geometric.readthedocs.io/en/latest/>`_ and `Jax/Jraph <https://jraph.readthedocs.io/en/latest/>`_
- 🤝 Broad Python version compatibility (3.8 to 3.12)
- ⌨️ Comprehensive command line interface (CLI)
- 📖 Documentation: https://the16thpythonist.github.io/chem_mat_data 

Getting ready to train a PyTorch Geometric model can be as easy as this:

.. code-block:: python

    from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
    from torch_geometric.data import Data
    from torch_geometric.loader import DataLoader
    
    # Load the dataset of graphs
    graphs: list[dict] = load_graph_dataset('clintox')
    
    # Convert the graph dicts into PyG Data objects
    data_list: list[Data] = pyg_data_list_from_graphs(graphs)
    data_loader: DataLoader = DataLoader(data_list, batch_size=32, shuffle=True)
    
    # Network training...


📦 Pip Installation
===================

Install the latest stable release using ``pip`` from the Python Package Index (PyPI):

.. code-block:: console

    pip install chem_mat_data

Or install the latest development version directly from the GitHub repository:

.. code-block:: console

    pip install git+https://github.com/the16thpythonist/chem_mat_data.git


⌨️ Command Line Interface (CLI)
===============================

The package provides the ``cmdata`` command line interface (CLI) to interact with the remote database.

To see the list of all available commands, simply use the ``--help`` flag:

.. code-block:: bash

    cmdata --help

Listing Available Datasets
--------------------------

To see which datasets are available for download from the remote file share server, use the ``list`` command:

.. code-block:: bash

    cmdata list

This prints a table of all datasets that are currently available for download from the database. Each row
represents one dataset and lists its name, the number of molecules it contains and the number of
target properties.


Downloading Datasets
--------------------

To download a dataset, for example ``clintox``, use the ``download`` command:

.. code-block:: bash

    cmdata download "clintox"

This will download the ``clintox.csv`` dataset file to your current working directory.

You can also specify the path to which the dataset should be downloaded as follows:

.. code-block:: bash

    cmdata download --path="/tmp" "clintox"
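
Once downloaded, the CSV file can be inspected with standard tools. The following is a minimal sketch using only
the Python standard library. It assumes that the file contains a ``smiles`` column next to the target columns; the
helper name ``read_smiles_csv`` is hypothetical and not part of ``chem_mat_data``, and the in-memory sample with its
``label`` column is made up for illustration.

.. code-block:: python

    import csv
    import io


    def read_smiles_csv(file_obj):
        """Return (smiles, targets) from a CSV file with a 'smiles' column.

        Hypothetical helper: every column other than 'smiles' is treated
        as a target annotation and kept as a string.
        """
        reader = csv.DictReader(file_obj)
        smiles, targets = [], []
        for row in reader:
            smiles.append(row.pop("smiles"))
            targets.append(row)
        return smiles, targets


    # Made-up sample standing in for a downloaded dataset file
    sample = io.StringIO("smiles,label\nCCO,0\nc1ccccc1,1\n")
    smiles, targets = read_smiles_csv(sample)
    print(smiles)
    print(targets)

For a real file, pass an opened file object instead of the sample, for example
``open("/tmp/clintox.csv", newline="")`` if the dataset was downloaded to ``/tmp``.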


🚀 Quickstart
=============

Alternatively, the ``chem_mat_data`` functionality can be used programmatically from Python code. The
package provides each dataset in either **raw** or **processed/graph** format (for further information on the
distinction, visit the `Documentation <https://the16thpythonist.github.io/chem_mat_data/api_datasets/>`__).

Raw Datasets
------------

You can use the ``load_smiles_dataset`` function to download the raw dataset format. This function will 
return the dataset as a ``pandas.DataFrame`` object which contains a "smiles" column along with the specific 
target value annotations as separate data frame columns.

.. code-block:: python

    import pandas as pd
    from chem_mat_data import load_smiles_dataset

    df: pd.DataFrame = load_smiles_dataset('clintox')
    print(df.head())


Graph Datasets
--------------

You can also use the ``load_graph_dataset`` function to download the same dataset in the *pre-processed* graph 
representation. This function will return a list of ``dict`` objects which contain the full graph representation 
of the corresponding molecules.

.. code-block:: python

    from rich.pretty import pprint
    from chem_mat_data import load_graph_dataset

    graphs: list[dict] = load_graph_dataset('clintox')
    example_graph = graphs[0]
    pprint(example_graph)


For further information on the graph representation, visit the `Documentation <https://the16thpythonist.github.io/chem_mat_data/graph_representation/>`__.
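
To get a quick feel for a loaded graph, its sizes can be summarized. The sketch below is an illustration only: it
relies on the ``node_attributes`` and ``graph_labels`` keys that the training example in this README also uses, and
the toy graph built from nested lists is a made-up stand-in for a real graph dict (which is assumed to hold
array-like values). The helper ``summarize_graph`` is hypothetical and not part of ``chem_mat_data``.

.. code-block:: python

    def summarize_graph(graph):
        """Return node count, node feature size and number of targets."""
        node_attributes = graph["node_attributes"]
        return {
            "num_nodes": len(node_attributes),
            "num_node_features": len(node_attributes[0]),
            "num_targets": len(graph["graph_labels"]),
        }


    # Made-up toy graph: 3 nodes with 2 features each and 1 target value
    toy_graph = {
        "node_attributes": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
        "graph_labels": [1.0],
    }
    summary = summarize_graph(toy_graph)
    print(summary)

Because the helper only uses ``len`` and indexing, it should work the same way on nested lists and on
two-dimensional arrays.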


Training Graph Neural Networks
------------------------------

Finally, the following code snippet demonstrates how to set up a graph neural network (GNN) model with the
PyTorch Geometric library and run a forward pass on a batch of graphs loaded from the ``chem_mat_data`` package.

.. code-block:: python

    from torch import Tensor
    from torch_geometric.data import Data
    from torch_geometric.loader import DataLoader
    from torch_geometric.nn.models import GIN
    from rich.pretty import pprint
    
    from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
    
    # Load the dataset of graphs
    graphs: list[dict] = load_graph_dataset('clintox')
    example_graph = graphs[0]
    pprint(example_graph)
    
    # Convert the graph dicts into PyG Data objects
    data_list = pyg_data_list_from_graphs(graphs)
    data_loader = DataLoader(data_list, batch_size=32, shuffle=True)
    
    # Construct a GNN model
    model = GIN(
        in_channels=example_graph['node_attributes'].shape[1],
        out_channels=example_graph['graph_labels'].shape[0],
        hidden_channels=32,
        num_layers=3,  
    )
    
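    # Perform model forward pass with a batch of graphs
    data: Data = next(iter(data_loader))
    # Note: GIN returns one output row per node; apply a pooling step over
    # data.batch if graph-level predictions are needed.
    out_pred: Tensor = model(
        x=data.x,
        edge_index=data.edge_index,
        batch=data.batch,
    )
    pprint(out_pred)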
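
Before actual training, the graphs are usually split into a training and a test set. The following is a minimal
sketch of such a random split using only the standard library. ``split_graphs`` is a hypothetical helper rather than
part of ``chem_mat_data``, and the integer list merely stands in for the list of graph dicts.

.. code-block:: python

    import random


    def split_graphs(graphs, test_fraction=0.2, seed=42):
        """Shuffle a copy of the list and split it into (train, test)."""
        shuffled = list(graphs)
        random.Random(seed).shuffle(shuffled)
        num_test = int(len(shuffled) * test_fraction)
        return shuffled[num_test:], shuffled[:num_test]


    # Stand-in for the list returned by load_graph_dataset
    graphs = list(range(10))
    train_graphs, test_graphs = split_graphs(graphs, test_fraction=0.2)
    print(len(train_graphs), len(test_graphs))

Each part can then be converted with ``pyg_data_list_from_graphs`` and wrapped in its own ``DataLoader``.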


🤝 Credits
===========

We thank the following packages, institutions and individuals for their significant impact on this package.

* PyComex_ is a micro framework which simplifies the setup, processing and management of computational
  experiments. It is also used to auto-generate the command line interface that can be used to interact
  with these experiments.

.. _PyComex: https://github.com/the16thpythonist/pycomex.git
.. _Cookiecutter: https://github.com/cookiecutter/cookiecutter

            
