qc-B3DB


| Field | Value |
| --- | --- |
| Name | qc-B3DB |
| Version | 1.1.0 |
| Summary | A rich molecule dataset for Blood-Brain Barrier (BBB) permeability. |
| Upload time | 2025-08-17 01:38:25 |
| Requires Python | >=3.9 |
| Keywords | blood-brain barrier, BBB, drug discovery, BBB permeability, CNS drug delivery, molecule database |
| Requirements | pandas (>=1.2.1) |
| Repository | https://github.com/theochem/B3DB |

# About *B3DB*

In this repo, we present a large benchmark dataset, [Blood-Brain Barrier Database (B3DB)](https://www.nature.com/articles/s41597-021-01069-5), compiled
from 50 published resources (summarized in
[raw_data/raw_data_summary.tsv](raw_data/raw_data_summary.tsv)) and categorized based on
the consistency between different experimental references/measurements. This dataset was [published in Scientific Data](https://www.nature.com/articles/s41597-021-01069-5), and this repository is occasionally updated with new experimental data. Scientists who would like to contribute data should contact the database's maintainers (e.g., by creating a new Issue in this repository).

A subset of the
molecules in B3DB has numerical `logBB` values (1058 compounds), while the whole dataset
has categorical (BBB+ or BBB-) BBB permeability labels (7807 compounds prior to v1.0.0 and 7982 compounds after). Some physicochemical properties
of the molecules are also provided.

## Citation

Please use the following citations in any publication using our *B3DB* dataset:

```bibtex
@article{Meng_A_curated_diverse_2021,
author = {Meng, Fanwang and Xi, Yang and Huang, Jinfeng and Ayers, Paul W.},
doi = {10.1038/s41597-021-01069-5},
journal = {Scientific Data},
number = {289},
title = {A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors},
volume = {8},
year = {2021},
url = {https://www.nature.com/articles/s41597-021-01069-5},
publisher = {Springer Nature}
}

@article{Meng_B3clf_2025,
author = {Meng, Fanwang and Chen, Jitian and Collins-Ramirez, Juan Samuel and Ayers, Paul W.},
doi = {xxx},
journal = {xxx},
number = {xxx},
title = {B3clf: A Resampling-Integrated Machine Learning Framework to Predict Blood-Brain Barrier Permeability},
volume = {x},
year = {xxx},
url = {xxx},
publisher = {xxx}
}
```

## Features of *B3DB*

1. The largest dataset with numerical and categorical values for Blood-Brain Barrier small molecules
    (to the best of our knowledge, as of February 25, 2021).

2. Inclusion of stereochemistry information with isomeric SMILES with chiral specifications if
    available. Otherwise, canonical SMILES are used.

3. Characterization of uncertainty of experimental measurements by grouping the collected molecular
    data records.

4. Extended datasets for numerical and categorical data with precomputed physicochemical properties
    using [mordred](https://github.com/mordred-descriptor/mordred).
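
The grouping mentioned in item 3 surfaces as the `group` column that appears in the `df.head()` output under Usage. The following is a minimal sketch of how one might tally records per group and keep only selected groups. The toy frame and the helper names `count_by_group` and `select_groups` are illustrative assumptions, not part of the package, and the meaning of each group label should be checked against the data curation notes.

```python
import pandas as pd


def count_by_group(df: pd.DataFrame, group_col: str = "group") -> pd.Series:
    """Return the number of records per group label, sorted by label."""
    return df[group_col].value_counts().sort_index()


def select_groups(df: pd.DataFrame, groups, group_col: str = "group") -> pd.DataFrame:
    """Keep only rows whose group label is in `groups`."""
    return df[df[group_col].isin(list(groups))].reset_index(drop=True)


# Toy stand-in for a B3DB table; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "compound_name": ["cmpd_1", "cmpd_2", "cmpd_3", "cmpd_4"],
        "group": ["A", "B", "A", "C"],
    }
)

counts = count_by_group(toy)
subset = select_groups(toy, ["A"])
print(counts.to_dict())
print(subset["compound_name"].tolist())
```

With the real data, the same helpers could be applied to any of the loaded dataframes that carry a `group` column.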

## Usage

### Via PyPI

The B3DB dataset is available on [PyPI](https://pypi.org/project/qc-B3DB/). One can install it using pip:

```bash
pip install qc-B3DB
```

Then load the data (a dictionary of `pandas` dataframes) with the following code snippet:

```python
from B3DB import B3DB_DATA_DICT

# access the data via dictionary keys:
# "B3DB_regression"
# "B3DB_regression_extended"
# "B3DB_classification"
# "B3DB_classification_extended"
# "B3DB_classification_external"
df_b3db_reg = B3DB_DATA_DICT["B3DB_regression"]
df_b3db_reg.head()
#    NO.                                      compound_name  ... group comments
# 0    1                                         moxalactam  ...     A      NaN
# 1    2                                      schembl614298  ...     A      NaN
# 2    3                             morphine-6-glucuronide  ...     A      NaN
# 3    4  2-[4-(5-bromo-3-methylpyridin-2-yl)butylamino]...  ...     A      NaN
# 4    5                                                NaN  ...     A      NaN

# [5 rows x 10 columns]

```
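
To see what each entry holds before picking one, the dictionary can be summarized by shape. The sketch below assumes only that `B3DB_DATA_DICT` behaves like a mapping from names to objects with a `.shape` attribute (as `pandas` dataframes have). `summarize_tables` is a hypothetical helper, and the stand-in tables are placeholders with made-up shapes so that the snippet runs without the package installed.

```python
from types import SimpleNamespace


def summarize_tables(tables):
    """Map each table name to its (rows, columns) shape."""
    return {name: tuple(table.shape) for name, table in tables.items()}


# Stand-ins with made-up shapes; with the package installed, use instead:
#   from B3DB import B3DB_DATA_DICT
#   summary = summarize_tables(B3DB_DATA_DICT)
fake_tables = {
    "table_a": SimpleNamespace(shape=(5, 10)),
    "table_b": SimpleNamespace(shape=(3, 12)),
}

summary = summarize_tables(fake_tables)
for name, (n_rows, n_cols) in summary.items():
    print(f"{name}: {n_rows} rows x {n_cols} columns")
```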

### Manually Download the Data

There are two types of datasets in [B3DB](B3DB), [regression data](B3DB/B3DB_regression.tsv)
and [classification data](B3DB/B3DB_classification.tsv), and both can be loaded with *pandas*. For example:

```python
import pandas as pd

# load regression dataset
regression_data = pd.read_csv("B3DB/B3DB_regression.tsv",
                              sep="\t")

# load classification dataset
classification_data = pd.read_csv("B3DB/B3DB_classification.tsv",
                                  sep="\t")

# load extended regression dataset
regression_data_extended = pd.read_csv("B3DB/B3DB_regression_extended.tsv.gz",
                                       sep="\t", compression="gzip")

# load extended classification dataset
classification_data_extended = pd.read_csv("B3DB/B3DB_classification_extended.tsv.gz",
                                           sep="\t", compression="gzip")

```
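
`pandas.read_csv` infers gzip compression from a `.gz` suffix by default, so `compression="gzip"` above is explicit rather than required. If only row-by-row access is needed, the same files can also be read with the standard library. The following is a sketch, not part of the package: `iter_tsv_rows` is a hypothetical helper, and the demo writes a tiny made-up file to a temporary directory instead of touching the real data.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def iter_tsv_rows(path):
    """Yield each row of a (possibly gzipped) TSV file as a dict keyed by header."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# Demo on a tiny made-up file; point the helper at, e.g.,
# "B3DB/B3DB_regression_extended.tsv.gz" to read the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tsv.gz"
    with gzip.open(demo_path, mode="wt", encoding="utf-8", newline="") as out:
        out.write("compound_name\tgroup\n")
        out.write("cmpd_1\tA\n")
        out.write("cmpd_2\tB\n")
    rows = list(iter_tsv_rows(demo_path))

print(rows)
```

Every value arrives as a string in this form, so numeric columns would need explicit conversion; for analysis work the `pandas` loaders above remain the more convenient route.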

### Examples in Jupyter Notebooks

We also provide three example notebooks that show how to use the dataset:
[numerical_data_analysis.ipynb](notebooks/numerical_data_analysis.ipynb),
[PCA_projection_fingerprint.ipynb](notebooks/PCA_projection_fingerprint.ipynb), and
[PCA_projection_descriptors.ipynb](notebooks/PCA_projection_descriptors.ipynb).
[PCA_projection_descriptors.ipynb](notebooks/PCA_projection_descriptors.ipynb) uses precomputed
chemical descriptors to visualize the chemical space of `B3DB` and can be run directly
in *MyBinder*:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/theochem/B3DB/main?filepath=notebooks%2FPCA_projection_descriptors.ipynb).
Due to the difficulty of installing `RDKit` in *MyBinder*, only
`PCA_projection_descriptors.ipynb` is set up there.
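
As a rough idea of what a PCA projection of descriptors involves, the sketch below centers a feature matrix and projects it onto its top two principal axes with a singular value decomposition. This is not the notebook's code: `pca_project` is a hypothetical helper, and the random matrix merely stands in for descriptor columns. Real descriptors would typically need scaling and missing-value handling first.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project rows of `features` onto the top principal components via SVD."""
    x = np.asarray(features, dtype=float)
    x_centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Random stand-in for a descriptor matrix (rows: molecules, columns: descriptors).
descriptors = rng.normal(size=(50, 6))
coords = pca_project(descriptors)
print(coords.shape)
```

The two columns of `coords` can then be passed to any scatter-plot routine, colored by the BBB label or by `logBB`.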

## Data Curation

Detailed procedures for data curation can be found in the [data curation section](data_curation/) of this repository.

The materials and data under this repo are distributed under the
[CC0 Licence](http://creativecommons.org/publicdomain/zero/1.0/).

## ChangeLog

- 2025-08-16: The B3DB dataset is available via PyPI.
- 2025-08-16: A new set of 171 BBB+ and 4 BBB- compounds has been included in the dataset since
  version 1.1.0.

            
