PyTDC

Name	PyTDC
Version	1.1.12
home_page	https://github.com/mims-harvard/TDC
Summary	Therapeutics Data Commons
upload_time	2025-01-21 13:11:55
maintainer	None
docs_url	None
author	TDC Team
requires_python	None
license	MIT
keywords
VCS
bugtrack_url
requirements	accelerate dataclasses datasets evaluate fuzzywuzzy huggingface_hub numpy openpyxl pandas requests scikit-learn seaborn tqdm transformers cellxgene-census gget pydantic rdkit tiledbsoma yapf

<p align="center"><img src="https://raw.githubusercontent.com/mims-harvard/TDC/master/fig/logo.png" alt="logo" width="600px" /></p>

----

[![website](https://img.shields.io/badge/website-live-brightgreen)](https://tdcommons.ai)
[![PyPI version](https://badge.fury.io/py/PyTDC.svg)](https://badge.fury.io/py/PyTDC)
[![Downloads](https://pepy.tech/badge/pytdc/month)](https://pepy.tech/project/pytdc)
[![Downloads](https://pepy.tech/badge/pytdc)](https://pepy.tech/project/pytdc)
[![GitHub Repo stars](https://img.shields.io/github/stars/mims-harvard/TDC)](https://github.com/mims-harvard/TDC/stargazers)
[![GitHub Repo forks](https://img.shields.io/github/forks/mims-harvard/TDC)](https://github.com/mims-harvard/TDC/network/members)

[![TDC CircleCI](https://circleci.com/gh/mims-harvard/TDC.svg?style=svg)](https://app.circleci.com/pipelines/github/mims-harvard/TDC)
![Conda Github Actions Build](https://github.com/mims-harvard/TDC/actions/workflows/conda-tests.yml/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/tdc/badge/?version=latest)](http://tdc.readthedocs.io/?badge=latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/cloudposse.svg?style=social&label=Follow%20%40ProjectTDC)](https://twitter.com/ProjectTDC)


[**Website**](https://tdcommons.ai) | [**NeurIPS 2024 AIDrugX Paper**](https://openreview.net/forum?id=kL8dlYp6IM) | [**Nature Chemical Biology 2022 Paper**](https://www.nature.com/articles/s41589-022-01131-2) | [**NeurIPS 2021 Paper**](https://openreview.net/pdf?id=8nvgnORnoWr) | [**Long Paper**](https://arxiv.org/abs/2102.09548) | [**Slack**](https://join.slack.com/t/pytdc/shared_invite/zt-x0ujg5v6-zwtQZt83fhRdgrYjXRFz5g) | [**TDC Mailing List**](https://groups.io/g/tdc) | [**TDC Documentation**](https://tdc.readthedocs.io/) | [**Contribution Guidelines**](CONTRIBUTE.md)

Artificial intelligence is poised to reshape therapeutic science. **Therapeutics Data Commons** is a coordinated initiative to access and evaluate artificial intelligence capability across therapeutic modalities and stages of discovery. It supports the development of AI methods and aims to establish which AI methods are most suitable for drug discovery applications and why.

Researchers across disciplines can use TDC for numerous applications. AI-solvable tasks, AI-ready datasets, and curated benchmarks in TDC serve as a meeting point between biochemical and AI scientists. TDC facilitates algorithmic and scientific advances and accelerates machine learning method development, validation, and transition into biomedical and clinical implementation.

TDC is an open-science initiative. We welcome [contributions from the community.](CONTRIBUTE.md)

## Key TDC Presentations and Publications

[1] Velez-Arce, Huang, Li, Lin, et al., TDC-2: Multimodal Foundation for Therapeutic Science, bioRxiv, 2024 [**\[Paper\]**](https://www.biorxiv.org/content/10.1101/2024.06.12.598655v2)

[2] Huang, Fu, Gao, et al., Artificial Intelligence Foundation for Therapeutic Science, Nature Chemical Biology, 2022 [**\[Paper\]**](https://www.nature.com/articles/s41589-022-01131-2)

[3] Huang, Fu, Gao, et al., Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, NeurIPS 2021 [**\[Paper\]**](https://openreview.net/forum?id=8nvgnORnoWr) [**\[Poster\]**](https://drive.google.com/file/d/1LfF8mfPLUqAVEzH3KPBxDO_VF7nLFtiJ/view?usp=sharing) 

[4] Huang et al., Benchmarking Molecular Machine Learning in Therapeutics Data Commons, ELLIS ML4Molecules 2021 [**\[Paper\]**](https://cloud.ml.jku.at/s/54pB5Eqf6ftX7qA) [**\[Slides\]**](https://drive.google.com/file/d/1iOSW_5eruca4vdygDxS1H64c49oQuH40/view?usp=sharing) 

[5] Huang et al., Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, Baylearn 2021 [**\[Slides\]**](https://drive.google.com/file/d/1BNpk3dOdqE3ksgyVV-V3xySdBMq-8cXL/view?usp=sharing) [**\[Poster\]**](https://drive.google.com/file/d/1LfF8mfPLUqAVEzH3KPBxDO_VF7nLFtiJ/view?usp=sharing)

[6] Huang, Fu, Gao et al., Therapeutics Data Commons, NSF-Harvard Symposium on Drugs for Future Pandemics 2020 [**\[#futuretx20\]**](https://www.drugsymposium.org/) [**\[Slides\]**](https://drive.google.com/file/d/11eTrh_lsqPcwu3RZRYjJGNpJ3s18YlBS/view) [**\[Video\]**](https://youtu.be/ZuCOhEZtaOw)

[7] [TDC User Group Meetup, Jan 2022](https://harvard.zoom.us/rec/share/HO0TjRPs56YG-Fu3i033izaTwebB4KwUhPeNURkWSI-anrH9su03lCtUlHeZG-WP.67ZJmAIHsD7Q_2GQ) [**\[Agenda\]**](https://shoutout.wix.com/so/d1Nv1pC2d#/main)

[8] Zitnik, Machine Learning to Translate the Cancer Genome and Epigenome Session, [AACR Annual Meeting 2022, Apr 2022](https://www.aacr.org/meeting/aacr-annual-meeting-2022/)

[9] Zitnik, Few-Shot Learning for Network Biology, [Keynote at KDD Workshop on Data Mining in Bioinformatics](https://biokdd.org/biokdd21/keynote.html)

[10] Zitnik, Actionable machine learning for drug discovery and development, [Broad Institute, Models, Inference & Algorithms Seminar, 2021](https://www.broadinstitute.org/talks/actionable-machine-learning-drug-discovery-and-development)

[11] Zitnik, Graph Neural Networks for Biomedical Data, [Machine Learning in Computational Biology, 2020](https://sites.google.com/cs.washington.edu/mlcb2020/schedule?authuser=0)

[12] Zitnik, Graph Neural Networks for Identifying COVID-19 Drug Repurposing Opportunities, [MIT AI Cures, 2020](https://www.aicures.mit.edu/drugdiscoveryconference)


## Unique Features of TDC

- *Diverse areas of therapeutics development*: TDC covers a wide range of learning tasks, including target discovery, activity screening, efficacy, safety, and manufacturing across biomedical products, including small molecules, antibodies, and vaccines.
- *Ready-to-use datasets*: TDC is minimally dependent on external packages. Any TDC dataset can be retrieved using only three lines of code.
- *Data functions*: TDC provides extensive data functions, including data evaluators, meaningful data splits, data processors, and molecule generation oracles. 
- *Leaderboards*: TDC provides benchmarks for fair model comparison and systematic model development and evaluation.
- *Open-source initiative*: TDC is an open-source initiative. If you'd like to get involved, please don't hesitate to let us know. 

<p align="center"><img src="https://raw.githubusercontent.com/mims-harvard/TDC/master/fig/tdc_overview.png" alt="overview" width="600px" /></p>

See [here](https://tdcommons.ai/news/) for the latest updates in TDC!

## Installation

### Using `pip`

To install the core environment dependencies of TDC, use `pip`:

```bash
pip install PyTDC
```

**Note**: TDC is in beta release. Please update your local copy regularly with:

```bash
pip install PyTDC --upgrade
```

The core data loaders are lightweight with minimum dependency on external packages:

```
numpy, pandas, tqdm, scikit-learn, fuzzywuzzy, seaborn
```

## Tutorials

We provide tutorials to get started with TDC:

| Name  | Description                                             |
|-------|---------------------------------------------------------|
| [101](tutorials/TDC_101_Data_Loader.ipynb)   | Introduce TDC Data Loaders                              |
| [102](tutorials/TDC_102_Data_Functions.ipynb)   | Introduce TDC Data Functions                            |
| [103.1](tutorials/TDC_103.1_Datasets_Small_Molecules.ipynb) | Walk through TDC Small Molecule Datasets                |
| [103.2](tutorials/TDC_103.2_Datasets_Biologics.ipynb) | Walk through TDC Biologics Datasets                     |
| [104](tutorials/TDC_104_ML_Model_DeepPurpose.ipynb)   | Generate 21 ADME ML Predictors with 15 Lines of Code |
| [105](tutorials/TDC_105_Oracle.ipynb)   | Molecule Generation Oracles                             |
| [106](tutorials/TDC_106_BenchmarkGroup_Submission_Demo.ipynb)   | Benchmark submission                             |
| [DGL](tutorials/DGL_User_Group_Demo.ipynb)   | Demo presented at DGL GNN User Group Meeting                             |
| [U1.1](tutorials/User_Group/UserGroupMeeting_Tianfan.ipynb)   | Demo presented at first TDC User Group Meetup                             |
| [U1.2](tutorials/User_Group/UserGroupMeeting_Wenhao.ipynb)   | Demo presented at first TDC User Group Meetup                             |
| [201](https://colab.research.google.com/drive/1xTgBwKUfP2b8j6Fqh28M2GUp2ScfENMX?usp=sharing) | TDC-2 Resource and Multi-modal Single-Cell API |
| [202](https://colab.research.google.com/drive/1kYH8nt3nW7tXYBPNcfYuDbWxGTqOEnWg?usp=sharing) | TDC-2 Resource and PrimeKG |
| [203](https://colab.research.google.com/drive/13MYlg5tWpywWbKYsJQXafKAlVF2hz-sP?usp=sharing) | TDC-2 Resource and External APIs |
| [204](https://colab.research.google.com/drive/17Pd328W27mn-iBCRkHIa78L3pukKcfW1?usp=sharing) | TDC-2 Model Hub |
| [205](https://colab.research.google.com/drive/1kHdFG4gUic5nmiul7b1hUh0HLCxLQnw_?usp=sharing) | TDC-2 Molecular Property Cliff Prediction Task |


## Design of TDC

TDC has a unique three-tiered hierarchical structure, which, to our knowledge, is the first attempt at systematically organizing machine learning for therapeutics. We organize TDC into three distinct *problems*. For each problem, we provide a collection of *learning tasks*. Finally, for each task, we provide a series of *datasets*.

In the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major areas (i.e., problems) where machine learning can facilitate scientific advances, namely, single-instance prediction, multi-instance prediction, and generation:

* Single-instance prediction `single_pred`: Prediction of a property given an individual biomedical entity.
* Multi-instance prediction `multi_pred`: Prediction of a property given multiple biomedical entities.
* Generation `generation`: Generation of new, desirable biomedical entities.

<p align="center"><img src="https://raw.githubusercontent.com/mims-harvard/TDC/master/fig/tdc_problems.png" alt="problems" width="500px" /></p>

The second tier in the TDC structure is organized into learning tasks. Improvement in these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.

Finally, in the third tier of TDC, each task is instantiated via multiple datasets. For each dataset, we provide several splits into training, validation, and test sets to simulate the type of understanding and generalization (e.g., the model's ability to generalize to entirely unseen compounds or to granularly resolve patient response to a polytherapy) needed for transition into production and clinical implementation.
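
As a rough illustration of the three tiers, the sketch below models a small slice of the hierarchy as a nested dictionary. The problem and dataset names are taken from the examples in this README; the `TDC_HIERARCHY` dictionary and the `locate` helper are hypothetical and are not part of the `tdc` package.

```python
# Illustrative only: a hand-written slice of the problem -> task -> dataset
# hierarchy. The dict and the helper are hypothetical, not tdc APIs.
TDC_HIERARCHY = {
    "single_pred": {"ADME": ["HIA_Hou", "Caco2_Wang"]},
    "multi_pred": {},
    "generation": {},
}


def locate(dataset_name, hierarchy=TDC_HIERARCHY):
    """Return (problem, task) for a dataset name, or None if it is not listed."""
    for problem, tasks in hierarchy.items():
        for task, datasets in tasks.items():
            if dataset_name in datasets:
                return problem, task
    return None


print(locate("HIA_Hou"))  # ('single_pred', 'ADME')
```

The two returned names correspond to the module and class used in an import such as `from tdc.single_pred import ADME`.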


## TDC Data Loaders

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create machine learning models in Python. Building off the modularized "Problem -- Learning Task -- Data Set" structure (see above) in TDC, we provide a three-layer API to access any learning task and dataset. This hierarchical API design allows us to easily incorporate new tasks and datasets.

For a concrete example, to obtain the HIA dataset from the ADME therapeutic learning task in the single-instance prediction problem:

```python
from tdc.single_pred import ADME
data = ADME(name = 'HIA_Hou')
# split into train/val/test with scaffold split methods
split = data.get_split(method = 'scaffold')
# get the entire dataset in a chosen format (here, a pandas DataFrame)
data.get_data(format = 'df')
```
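
To give some intuition for what a scaffold split aims at, here is a minimal sketch of a group-aware split in plain Python. It assumes each molecule already carries a scaffold label (a hypothetical, precomputed input) and keeps every scaffold group inside a single subset, so that test molecules come from scaffolds unseen during training. It is not TDC's implementation; the function name, the greedy largest-group-first rule, and the rounding of subset sizes are all assumptions made for illustration.

```python
from collections import defaultdict


def group_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Split item indices so that no scaffold group spans two subsets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so big scaffold families land in train.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = round(frac[0] * n)              # max size of train
    valid_cap = train_cap + round(frac[1] * n)  # max size of train + valid

    split = {"train": [], "valid": [], "test": []}
    for members in ordered:
        if len(split["train"]) + len(members) <= train_cap:
            split["train"].extend(members)
        elif len(split["train"]) + len(split["valid"]) + len(members) <= valid_cap:
            split["valid"].extend(members)
        else:
            split["test"].extend(members)
    return split


# Ten molecules with hypothetical scaffold labels A..E.
print(group_split(list("AAAAABBCDE")))
# {'train': [0, 1, 2, 3, 4, 5, 6], 'valid': [7], 'test': [8, 9]}
```

Because whole groups are assigned at once, the realized subset sizes only approximate the requested fractions.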

You can see all the datasets that belong to a task as follows:

```python
from tdc.utils import retrieve_dataset_names
retrieve_dataset_names('ADME')
```

See all therapeutic tasks and datasets on the [TDC website](https://zitniklab.hms.harvard.edu/TDC/overview/)!

## TDC Data Functions

#### Dataset Splits

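To retrieve the training/validation/test dataset split from any data loader (here, the ADME loader from the example above), you could type:

```python
from tdc.single_pred import ADME
data = ADME(name = 'HIA_Hou')
split = data.get_split(seed = 42)
# {'train': df_train, 'valid': df_val, 'test': df_test}
```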
You can specify the function's splitting method, random seed, and split fractions by, e.g., `data.get_split(method = 'scaffold', seed = 1, frac = [0.7, 0.1, 0.2])`. Check the [data split page](https://zitniklab.hms.harvard.edu/TDC/functions/data_split/) for details.
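
For readers who want to see how a seed and a list of fractions interact, the following is a minimal sketch of a seeded random split over item indices. It only illustrates the idea and is not how TDC implements `get_split`; the function name `random_split` and the rounding rule are assumptions.

```python
import random


def random_split(n_items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle indices with a fixed seed and cut them by the given fractions."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_train = round(frac[0] * n_items)
    n_valid = round(frac[1] * n_items)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }


split = random_split(10, frac=(0.7, 0.1, 0.2), seed=1)
print({name: len(idx) for name, idx in split.items()})
# {'train': 7, 'valid': 1, 'test': 2}
```

Fixing the seed makes the split reproducible, which is what allows results from different runs or different users to be compared.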

#### Strategies for Model Evaluation

We provide various evaluation metrics for the tasks in TDC, described on the [model evaluation page](https://zitniklab.hms.harvard.edu/TDC/functions/data_evaluation/) of the website. For example, to use the ROC-AUC metric, you could type

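```python
from tdc import Evaluator
evaluator = Evaluator(name = 'ROC-AUC')
# y_true: ground-truth binary labels; y_pred: predicted scores (toy values)
y_true = [0, 1, 1, 0]
y_pred = [0.1, 0.8, 0.6, 0.3]
score = evaluator(y_true, y_pred)
```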
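
As a reminder of what ROC-AUC measures, the sketch below computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. This is a plain-Python illustration of the metric, not the code behind `Evaluator`; the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of examples, so it is suitable only for small inputs.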

#### Data Processing 

TDC provides numerous data processing functions, including label transformation, data balancing, pairing data to PyG/DGL graphs, negative sampling, database querying, and so on. For function usage, see our [data processing page](https://zitniklab.hms.harvard.edu/TDC/functions/data_process/) on the TDC website.

#### Molecule Generation Oracles

For molecule generation tasks, we provide 10+ oracles for both goal-oriented and distribution learning. For detailed usage of each oracle, please have a look at the [oracle page](https://zitniklab.hms.harvard.edu/TDC/functions/oracles/) on the website. For example, to retrieve the GSK3Beta oracle and score a list of SMILES strings:

```python
from tdc import Oracle
oracle = Oracle(name = 'GSK3B')
oracle(['CC(C)(C)....',
  'C[C@@H]1....',
  'CCNC(=O)....',
  'C[C@@H]1....'])

# [0.03, 0.02, 0.0, 0.1]
```

## TDC Leaderboards

Every dataset in TDC is a benchmark, and we provide training/validation and test sets for it, together with data splits and performance evaluation metrics. To participate in the leaderboard for a specific benchmark, follow these steps:

* Use the TDC benchmark data loader to retrieve the benchmark.

* Use the training and/or validation set to train your model.

* Use the TDC model evaluator to calculate your model's performance on the test set.

* Submit the test set performance to a TDC leaderboard.

As many datasets share a therapeutics theme, we organize benchmarks into meaningfully defined groups, which we refer to as benchmark groups. Datasets and tasks within a benchmark group are carefully curated and centered around a theme (for example, TDC contains a benchmark group to support ML predictions of the ADMET properties). While every benchmark group consists of multiple benchmarks, it is possible to separately submit results for each benchmark. Here is the code framework to access the benchmarks:

```python
from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang') 
    # all benchmark names in a benchmark group are stored in group.dataset_names
    predictions = {}
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark = name, split_type = 'default', seed = seed)
    
    # ----------------------------------------------------- #
    #  Train your model on train and valid, then predict    #
    #  on test; store the predictions in y_pred_test        #
    # ----------------------------------------------------- #

    predictions[name] = y_pred_test
    predictions_list.append(predictions)

results = group.evaluate_many(predictions_list)
# {'caco2_wang': [6.328, 0.101]}
```
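
The two numbers reported per benchmark in the example output appear to summarize performance across the five seeds. Assuming they are the mean and standard deviation of the per-seed scores (an assumption, as is the choice of population rather than sample standard deviation), the aggregation could be sketched as follows. The helper name `summarize_runs` and the scores are made up for illustration.

```python
from statistics import fmean, pstdev


def summarize_runs(per_seed_scores):
    """Map each benchmark name to [mean, std] over its per-seed scores."""
    return {
        name: [round(fmean(scores), 3), round(pstdev(scores), 3)]
        for name, scores in per_seed_scores.items()
    }


# Hypothetical test-set scores from five seeds for one benchmark.
runs = {"caco2_wang": [0.30, 0.32, 0.31, 0.29, 0.33]}
print(summarize_runs(runs))  # {'caco2_wang': [0.31, 0.014]}
```

Reporting a spread alongside the mean makes it easier to judge whether a difference between two models exceeds seed-to-seed variation.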

For more information, visit [here](https://tdcommons.ai/benchmark/overview/).


## Cite Us

If you find Therapeutics Data Commons useful, please cite our [NeurIPS 2024 AIDrugX paper](https://openreview.net/pdf?id=kL8dlYp6IM), our [NeurIPS 2021 paper](https://openreview.net/pdf?id=8nvgnORnoWr), and our [Nature Chemical Biology paper](https://www.nature.com/articles/s41589-022-01131-2):

```
@inproceedings{
velez-arce2024signals,
title={Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics},
author={Alejandro Velez-Arce and Xiang Lin and Kexin Huang and Michelle M Li and Wenhao Gao and Bradley Pentelute and Tianfan Fu and Manolis Kellis and Marinka Zitnik},
booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
year={2024},
url={https://openreview.net/forum?id=kL8dlYp6IM}
}
```

```
@article{Huang2021tdc,
  title={Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, 
          Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={Proceedings of Neural Information Processing Systems, NeurIPS Datasets and Benchmarks},
  year={2021}
}
```

```
@article{Huang2022artificial,
  title={Artificial intelligence foundation for therapeutic science},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, 
          Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={Nature Chemical Biology},
  year={2022}
}
```

TDC is built on top of other open-source projects. If you use a dataset or function from TDC in your research, please also cite the original work; the original paper for each function and dataset is listed on the website.

## Contribute

TDC is a community-driven and open-science initiative. To get involved, join our [Slack Workspace](https://join.slack.com/t/pytdc/shared_invite/zt-x0ujg5v6-zwtQZt83fhRdgrYjXRFz5g) and check out the [contribution guide](CONTRIBUTE.md)!

## Contact

Reach us at [contact@tdcommons.ai](mailto:contact@tdcommons.ai) or open a GitHub issue.

## Data Server

Many TDC datasets are hosted on [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/21LKWG) under the persistent identifier [https://doi.org/10.7910/DVN/21LKWG](https://doi.org/10.7910/DVN/21LKWG). TDC datasets cannot be retrieved while Dataverse is under maintenance. This happens rarely; please check the status on [the Dataverse website](https://dataverse.harvard.edu/).

## License
The TDC codebase is licensed under the MIT license. For individual dataset usage, please refer to the dataset license on the website.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mims-harvard/TDC",
    "name": "PyTDC",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "TDC Team",
    "author_email": "alejandro_velez-arce@hms.harvard.edu",
    "download_url": "https://files.pythonhosted.org/packages/5f/15/56451d7f4643f1ce34a86cb90bdca53cdc480433deb4c67d6617720b2d49/pytdc-1.1.12.tar.gz",
    "platform": null,
    "description": "<p align=\"center\"><img src=\"https://raw.githubusercontent.com/mims-harvard/TDC/master/fig/logo.png\" alt=\"logo\" width=\"600px\" /></p>\n\n----\n\n[![website](https://img.shields.io/badge/website-live-brightgreen)](https://tdcommons.ai)\n[![PyPI version](https://badge.fury.io/py/PyTDC.svg)](https://badge.fury.io/py/PyTDC)\n[![Downloads](https://pepy.tech/badge/pytdc/month)](https://pepy.tech/project/pytdc)\n[![Downloads](https://pepy.tech/badge/pytdc)](https://pepy.tech/project/pytdc)\n[![GitHub Repo stars](https://img.shields.io/github/stars/mims-harvard/TDC)](https://github.com/mims-harvard/TDC/stargazers)\n[![GitHub Repo stars](https://img.shields.io/github/forks/mims-harvard/TDC)](https://github.com/mims-harvard/TDC/network/members)\n\n[![TDC CircleCI](https://circleci.com/gh/mims-harvard/TDC.svg?style=svg)](https://app.circleci.com/pipelines/github/mims-harvard/TDC)\n![Conda Github Actions Build](https://github.com/mims-harvard/TDC/actions/workflows/conda-tests.yml/badge.svg)\n[![Documentation Status](https://readthedocs.org/projects/tdc/badge/?version=latest)](http://tdc.readthedocs.io/?badge=latest)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/cloudposse.svg?style=social&label=Follow%20%40ProjectTDC)](https://twitter.com/ProjectTDC)\n\n\n[**Website**](https://tdcommons.ai) | [**NeurIPS 2024 AIDrugX Paper**](https://openreview.net/forum?id=kL8dlYp6IM) | [**Nature Chemical Biology 2022 Paper**](https://www.nature.com/articles/s41589-022-01131-2) | [**NeurIPS 2021 Paper**](https://openreview.net/pdf?id=8nvgnORnoWr) | [**Long Paper**](https://arxiv.org/abs/2102.09548) | [**Slack**](https://join.slack.com/t/pytdc/shared_invite/zt-x0ujg5v6-zwtQZt83fhRdgrYjXRFz5g) | [**TDC Mailing List**](https://groups.io/g/tdc) | [**TDC Documentation**](https://tdc.readthedocs.io/) | [**Contribution 
Guidelines**](CONTRIBUTE.md)\n\nArtificial intelligence is poised to reshape therapeutic science. **Therapeutics Data Commons** is a coordinated initiative to access and evaluate artificial intelligence capability across therapeutic modalities and stages of discovery. It supports the development of AI methods and aims to establish the foundation of which AI methods are most suitable for drug discovery applications and why.\n\nResearchers across disciplines can use TDC for numerous applications. AI-solvable tasks, AI-ready datasets, and curated benchmarks in TDC serve as a meeting point between biochemical and AI scientists. TDC facilitates algorithmic and scientific advances and accelerates machine learning method development, validation, and transition into biomedical and clinical implementation.\n\nTDC is an open-science initiative. We welcome [contributions from the community.](CONTRIBUTE.md)\n\n## Key TDC Presentations and Publications\n\n[1] Velez-Arce, Huang, Li, Lin, et al., TDC-2: Multimodal Foundation for Therapeutic Science, bioRxiv, 2024 [**\\[Paper\\]**](https://www.biorxiv.org/content/10.1101/2024.06.12.598655v2)\n\n[2] Huang, Fu, Gao, et al., Artificial Intelligence Foundation for Therapeutic Science, Nature Chemical Biology, 2022 [**\\[Paper\\]**](https://www.nature.com/articles/s41589-022-01131-2)\n\n[3] Huang, Fu, Gao, et al., Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, NeurIPS 2021 [**\\[Paper\\]**](https://openreview.net/forum?id=8nvgnORnoWr) [**\\[Poster\\]**](https://drive.google.com/file/d/1LfF8mfPLUqAVEzH3KPBxDO_VF7nLFtiJ/view?usp=sharing) \n\n[4] Huang et al., Benchmarking Molecular Machine Learning in Therapeutics Data Commons, ELLIS ML4Molecules 2021 [**\\[Paper\\]**](https://cloud.ml.jku.at/s/54pB5Eqf6ftX7qA) [**\\[Slides\\]**](https://drive.google.com/file/d/1iOSW_5eruca4vdygDxS1H64c49oQuH40/view?usp=sharing) \n\n[5] Huang et al., Therapeutics Data Commons: Machine Learning Datasets 
and Tasks for Drug Discovery and Development, Baylearn 2021 [**\\[Slides\\]**](https://drive.google.com/file/d/1BNpk3dOdqE3ksgyVV-V3xySdBMq-8cXL/view?usp=sharing) [**\\[Poster\\]**](https://drive.google.com/file/d/1LfF8mfPLUqAVEzH3KPBxDO_VF7nLFtiJ/view?usp=sharing)\n\n[6] Huang, Fu, Gao et al., Therapeutics Data Commons, NSF-Harvard Symposium on Drugs for Future Pandemics 2020 [**\\[#futuretx20\\]**](https://www.drugsymposium.org/) [**\\[Slides\\]**](https://drive.google.com/file/d/11eTrh_lsqPcwu3RZRYjJGNpJ3s18YlBS/view) [**\\[Video\\]**](https://youtu.be/ZuCOhEZtaOw)\n\n[7] [TDC User Group Meetup, Jan 2022](https://harvard.zoom.us/rec/share/HO0TjRPs56YG-Fu3i033izaTwebB4KwUhPeNURkWSI-anrH9su03lCtUlHeZG-WP.67ZJmAIHsD7Q_2GQ) [**\\[Agenda\\]**](https://shoutout.wix.com/so/d1Nv1pC2d#/main)\n\n[8] Zitnik, Machine Learning to Translate the Cancer Genome and Epigenome Session, [AACR Annual Meeting 2022, Apr 2022](https://www.aacr.org/meeting/aacr-annual-meeting-2022/)\n\n[9] Zitnik, Few-Shot Learning for Network Biology, [Keynote at KDD Workshop on Data Mining in Bioinformatics](https://biokdd.org/biokdd21/keynote.html)\n\n[10] Zitnik, Actionable machine learning for drug discovery and development, [Broad Institute, Models, Inference & Algorithms Seminar, 2021](https://www.broadinstitute.org/talks/actionable-machine-learning-drug-discovery-and-development)\n\n[11] Zitnik, Graph Neural Networks for Biomedical Data, [Machine Learning in Computational Biology, 2020](https://sites.google.com/cs.washington.edu/mlcb2020/schedule?authuser=0)\n\n[12] Zitnik, Graph Neural Networks for Identifying COVID-19 Drug Repurposing Opportunities, [MIT AI Cures, 2020](https://www.aicures.mit.edu/drugdiscoveryconference)\n\n\n## Unique Features of TDC\n\n- *Diverse areas of therapeutics development*: TDC covers a wide range of learning tasks, including target discovery, activity screening, efficacy, safety, and manufacturing across biomedical products, including small molecules, antibodies, 
and vaccines.\n- *Ready-to-use datasets*: TDC is minimally dependent on external packages. Any TDC dataset can be retrieved using only three lines of code.\n- *Data functions*: TDC provides extensive data functions, including data evaluators, meaningful data splits, data processors, and molecule generation oracles. \n- *Leaderboards*: TDC provides benchmarks for fair model comparison and systematic model development and evaluation.\n- *Open-source initiative*: TDC is an open-source initiative. If you'd like to get involved, please don't hesitate to let us know. \n\n<p align=\"center\"><img src=\"https://raw.githubusercontent.com/mims-harvard/TDC/master/fig/tdc_overview.png\" alt=\"overview\" width=\"600px\" /></p>\n\nSee [here](https://tdcommons.ai/news/) for the latest updates in TDC!\n\n## Installation\n\n### Using `pip`\n\nTo install the core environment dependencies of TDC, use `pip`:\n\n```bash\npip install PyTDC\n```\n\n**Note**: TDC is in the beta release. Please update your local copy regularly by\n\n```bash\npip install PyTDC --upgrade\n```\n\nThe core data loaders are lightweight with minimum dependency on external packages:\n\n```bash\nnumpy, pandas, tqdm, scikit-learn, fuzzywuzzy, seaborn\n```\n\n## Tutorials\n\nWe provide  tutorials to get started with TDC:\n\n| Name  | Description                                             |\n|-------|---------------------------------------------------------|\n| [101](tutorials/TDC_101_Data_Loader.ipynb)   | Introduce TDC Data Loaders                              |\n| [102](tutorials/TDC_102_Data_Functions.ipynb)   | Introduce TDC Data Functions                            |\n| [103.1](tutorials/TDC_103.1_Datasets_Small_Molecules.ipynb) | Walk through TDC Small Molecule Datasets                |\n| [103.2](tutorials/TDC_103.2_Datasets_Biologics.ipynb) | Walk through TDC Biologics Datasets                     |\n| [104](tutorials/TDC_104_ML_Model_DeepPurpose.ipynb)   | Generate 21 ADME ML Predictors with 15 Lines of 
Code |\n| [105](tutorials/TDC_105_Oracle.ipynb)   | Molecule Generation Oracles                             |\n| [106](tutorials/TDC_106_BenchmarkGroup_Submission_Demo.ipynb)   | Benchmark submission                             |\n| [DGL](tutorials/DGL_User_Group_Demo.ipynb)   | Demo presented at DGL GNN User Group Meeting                             |\n| [U1.1](tutorials/User_Group/UserGroupMeeting_Tianfan.ipynb)   | Demo presented at first TDC User Group Meetup                             |\n| [U1.2](tutorials/User_Group/UserGroupMeeting_Wenhao.ipynb)   | Demo presented at first TDC User Group Meetup                             |\n| [201](https://colab.research.google.com/drive/1xTgBwKUfP2b8j6Fqh28M2GUp2ScfENMX?usp=sharing) | TDC-2 Resource and Multi-modal Single-Cell API |\n| [202](https://colab.research.google.com/drive/1kYH8nt3nW7tXYBPNcfYuDbWxGTqOEnWg?usp=sharing) | TDC-2 Resource and PrimeKG |\n| [203](https://colab.research.google.com/drive/13MYlg5tWpywWbKYsJQXafKAlVF2hz-sP?usp=sharing) | TDC-2 Resource and External APIs |\n| [204](https://colab.research.google.com/drive/17Pd328W27mn-iBCRkHIa78L3pukKcfW1?usp=sharing) | TDC-2 Model Hub |\n| [205](https://colab.research.google.com/drive/1kHdFG4gUic5nmiul7b1hUh0HLCxLQnw_?usp=sharing) | TDC-2 Molecular Property Cliff Prediction Task |\n\n\n## Design of TDC\n\nTDC has a unique three-tiered hierarchical structure, which to our knowledge, is the first attempt at systematically organizing machine learning for therapeutics. We organize TDC into three distinct *problems*. For each problem, we provide a collection of *learning tasks*. 
Finally, for each task, we provide a series of *datasets*.\n\nIn the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major areas (i.e., problems) where machine learning can facilitate scientific advances, namely, single-instance prediction, multi-instance prediction, and generation:\n\n* Single-instance prediction `single_pred`: Prediction of property given individual biomedical entity.\n* Multi-instance prediction `multi_pred`: Prediction of property given multiple biomedical entities. \n* Generation `generation`: Generation of new desirable biomedical entities.\n\n<p align=\"center\"><img src=\"https://raw.githubusercontent.com/mims-harvard/TDC/master/fig/tdc_problems.png\" alt=\"problems\" width=\"500px\" /></p>\n\nThe second tier in the TDC structure is organized into learning tasks. Improvement in these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.\n\nFinally, in the third tier of TDC, each task is instantiated via multiple datasets. For each dataset, we provide several splits into training, validation, and test sets to simulate the type of understanding and generalization (e.g., the model's ability to generalize to entirely unseen compounds or to granularly resolve patient response to a polytherapy) needed for transition into production and clinical implementation.\n\n\n## TDC Data Loaders\n\nTDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create machine learning models in Python. Building off the modularized \"Problem -- Learning Task -- Data Set\" structure (see above) in TDC, we provide a three-layer API to access any learning task and dataset. 
This hierarchical API design allows us to easily incorporate new tasks and datasets.\n\nFor a concrete example, to obtain the HIA dataset from the ADME therapeutic learning task in the single-instance prediction problem:\n\n```python\nfrom tdc.single_pred import ADME\ndata = ADME(name = 'HIA_Hou')\n# split into train/val/test with scaffold split methods\nsplit = data.get_split(method = 'scaffold')\n# get the entire data in the various formats\ndata.get_data(format = 'df')\n```\n\nYou can see all the datasets that belong to a task as follows:\n\n```python\nfrom tdc.utils import retrieve_dataset_names\nretrieve_dataset_names('ADME')\n```\n\nSee all therapeutic tasks and datasets on the [TDC website](https://zitniklab.hms.harvard.edu/TDC/overview/)!\n\n## TDC Data Functions\n\n#### Dataset Splits\n\nTo retrieve the training/validation/test dataset split, you could type\n```python \ndata = X(name = Y)\ndata.get_split(seed = 42)\n# {'train': df_train, 'val': df_val, 'test': df_test}\n```\nYou can specify the function's splitting method, random seed, and split fractions by, e.g., `data.get_split(method = 'scaffold', seed = 1, frac = [0.7, 0.1, 0.2])`. Check the [data split page](https://zitniklab.hms.harvard.edu/TDC/functions/data_split/) for details.\n\n#### Strategies for Model Evaluation\n\nWe provide various evaluation metrics for the tasks in TDC, described in [model evaluation page](https://zitniklab.hms.harvard.edu/TDC/functions/data_evaluation/) on the website. For example, to use metric ROC-AUC, you could type\n\n```python\nfrom tdc import Evaluator\nevaluator = Evaluator(name = 'ROC-AUC')\nscore = evaluator(y_true, y_pred)\n```\n\n#### Data Processing \n\nTDC provides numerous data processing functions, including label transformation, data balancing, pairing data to PyG/DGL graphs, negative sampling, database querying, and so on. 
For function usage, see our [data processing page](https://zitniklab.hms.harvard.edu/TDC/functions/data_process/) on the TDC website.\n\n#### Molecule Generation Oracles\n\nFor molecule generation tasks, we provide 10+ oracles for both goal-oriented and distribution learning. For detailed usage of each oracle, see the [oracle page](https://zitniklab.hms.harvard.edu/TDC/functions/oracles/) on the website. For example, to retrieve the GSK3B oracle and score a list of SMILES strings:\n\n```python\nfrom tdc import Oracle\noracle = Oracle(name = 'GSK3B')\noracle(['CC(C)(C)....',\n  'C[C@@H]1....',\n  'CCNC(=O)....',\n  'C[C@@H]1....'])\n\n# [0.03, 0.02, 0.0, 0.1]\n```\n\n## TDC Leaderboards\n\nEvery dataset in TDC is a benchmark, and we provide training/validation and test sets for it, together with data splits and performance evaluation metrics. To participate in the leaderboard for a specific benchmark, follow these steps:\n\n* Use the TDC benchmark data loader to retrieve the benchmark.\n\n* Use the training and/or validation set to train your model.\n\n* Use the TDC model evaluator to calculate your model's performance on the test set.\n\n* Submit the test set performance to a TDC leaderboard.\n\nAs many datasets share a therapeutics theme, we organize benchmarks into meaningfully defined groups, which we refer to as benchmark groups. Datasets and tasks within a benchmark group are carefully curated and centered around a theme (for example, TDC contains a benchmark group to support ML predictions of the ADMET properties). While every benchmark group consists of multiple benchmarks, it is possible to separately submit results for each benchmark. 
Here is the code framework to access the benchmarks:\n\n```python\nfrom tdc import BenchmarkGroup\ngroup = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')\npredictions_list = []\n\nfor seed in [1, 2, 3, 4, 5]:\n    benchmark = group.get('Caco2_Wang')\n    # all benchmark names in a benchmark group are stored in group.dataset_names\n    predictions = {}\n    name = benchmark['name']\n    train_val, test = benchmark['train_val'], benchmark['test']\n    train, valid = group.get_train_valid_split(benchmark = name, split_type = 'default', seed = seed)\n\n    # ----------------------------------------------- #\n    #  Train your model using train and valid only    #\n    #  Predict on test and save the predictions in    #\n    #  the y_pred_test variable                       #\n    # ----------------------------------------------- #\n\n    predictions[name] = y_pred_test\n    predictions_list.append(predictions)\n\nresults = group.evaluate_many(predictions_list)\n# {'caco2_wang': [6.328, 0.101]}\n```\n\nFor more information, see the [benchmark overview](https://tdcommons.ai/benchmark/overview/).\n\n\n## Cite Us\n\nIf you find Therapeutics Data Commons useful, cite our [NeurIPS'24 AIDrugX paper](https://openreview.net/pdf?id=kL8dlYp6IM), our [NeurIPS paper](https://openreview.net/pdf?id=8nvgnORnoWr), and our [Nature Chemical Biology paper](https://www.nature.com/articles/s41589-022-01131-2):\n\n```\n@inproceedings{\nvelez-arce2024signals,\ntitle={Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics},\nauthor={Alejandro Velez-Arce and Xiang Lin and Kexin Huang and Michelle M Li and Wenhao Gao and Bradley Pentelute and Tianfan Fu and Manolis Kellis and Marinka Zitnik},\nbooktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},\nyear={2024},\nurl={https://openreview.net/forum?id=kL8dlYp6IM}\n}\n```\n\n```\n@article{Huang2021tdc,\n  title={Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development},\n  author={Huang, Kexin and Fu, 
Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, \n          Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},\n  journal={Proceedings of Neural Information Processing Systems, NeurIPS Datasets and Benchmarks},\n  year={2021}\n}\n```\n\n```\n@article{Huang2022artificial,\n  title={Artificial intelligence foundation for therapeutic science},\n  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, \n          Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},\n  journal={Nature Chemical Biology},\n  year={2022}\n}\n```\n\nTDC is built on top of other open-sourced projects. Additionally, please cite the original work if you used these datasets/functions in your research. You can find the original paper for the function/dataset on the website.\n\n## Contribute\n\nTDC is a community-driven and open-science initiative. To get involved, join our [Slack Workspace](https://join.slack.com/t/pytdc/shared_invite/zt-x0ujg5v6-zwtQZt83fhRdgrYjXRFz5g) and check out the [contribution guide](CONTRIBUTE.md)!\n\n## Contact\n\nReach us at [contact@tdcommons.ai](mailto:contact@tdcommons.ai) or open a GitHub issue.\n\n## Data Server\n\nMany TDC datasets are hosted on [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/21LKWG) with the following persistent identifier [https://doi.org/10.7910/DVN/21LKWG](https://doi.org/10.7910/DVN/21LKWG). When Dataverse is under maintenance, TDC datasets cannot be retrieved. That happens rarely; please check the status on [the Dataverse website](https://dataverse.harvard.edu/).\n\n## License\nThe TDC codebase is licensed under the MIT license. For individual dataset usage, please refer to the dataset license on the website.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Therapeutics Data Commons",
    "version": "1.1.12",
    "project_urls": {
        "Homepage": "https://github.com/mims-harvard/TDC"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5f1556451d7f4643f1ce34a86cb90bdca53cdc480433deb4c67d6617720b2d49",
                "md5": "e314ab99bbd042c1e1f45895c9d822c4",
                "sha256": "47788329b79737ec2e4e1838fd0b0df0aef0f0b7607a3a562387eb9ce8f97c8f"
            },
            "downloads": -1,
            "filename": "pytdc-1.1.12.tar.gz",
            "has_sig": false,
            "md5_digest": "e314ab99bbd042c1e1f45895c9d822c4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 151167,
            "upload_time": "2025-01-21T13:11:55",
            "upload_time_iso_8601": "2025-01-21T13:11:55.951976Z",
            "url": "https://files.pythonhosted.org/packages/5f/15/56451d7f4643f1ce34a86cb90bdca53cdc480433deb4c67d6617720b2d49/pytdc-1.1.12.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-21 13:11:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mims-harvard",
    "github_project": "TDC",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "circle": true,
    "requirements": [
        {
            "name": "accelerate",
            "specs": [
                [
                    "==",
                    "0.33.0"
                ]
            ]
        },
        {
            "name": "dataclasses",
            "specs": [
                [
                    "<",
                    "1.0"
                ],
                [
                    ">=",
                    "0.6"
                ]
            ]
        },
        {
            "name": "datasets",
            "specs": [
                [
                    "<",
                    "2.20.0"
                ]
            ]
        },
        {
            "name": "evaluate",
            "specs": [
                [
                    "==",
                    "0.4.2"
                ]
            ]
        },
        {
            "name": "fuzzywuzzy",
            "specs": [
                [
                    "<",
                    "1.0"
                ],
                [
                    ">=",
                    "0.18.0"
                ]
            ]
        },
        {
            "name": "huggingface_hub",
            "specs": [
                [
                    "<",
                    "1.0"
                ],
                [
                    ">=",
                    "0.20.3"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.26.4"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    "<",
                    "4.0.0"
                ],
                [
                    ">=",
                    "3.0.10"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "2.1.4"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.31.0"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "1.2.2"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.12.2"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.65.0"
                ],
                [
                    "<",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "==",
                    "4.43.4"
                ]
            ]
        },
        {
            "name": "cellxgene-census",
            "specs": [
                [
                    "==",
                    "1.15.0"
                ]
            ]
        },
        {
            "name": "gget",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.28.4"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "2.6.3"
                ]
            ]
        },
        {
            "name": "rdkit",
            "specs": [
                [
                    "<",
                    "2024.3.1"
                ],
                [
                    ">=",
                    "2023.9.5"
                ]
            ]
        },
        {
            "name": "tiledbsoma",
            "specs": [
                [
                    ">=",
                    "1.7.2"
                ],
                [
                    "<",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "yapf",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.40.2"
                ]
            ]
        }
    ],
    "lcname": "pytdc"
}

TDC Team