papyrus-scripts

Name	papyrus-scripts
Version	1.0.2
home_page	https://github.com/OlivierBeq/Papyrus-scripts
Summary	A collection of scripts to handle the Papyrus bioactivity dataset
upload_time	2023-05-16 10:09:52
maintainer	Olivier J. M. Béquignon
docs_url	None
author	Olivier J. M. Béquignon - Brandon J. Bongers - Willem Jespers
requires_python
license
keywords	bioactivity data qsar proteochemometrics cheminformatics modelling machine learning
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Papyrus-scripts

Collection of scripts to interact with the Papyrus bioactivity dataset.

![alt text](figures/papyrus_workflow.svg)

<br/>

**Associated Preprint:** <a href="https://doi.org/10.33774/chemrxiv-2021-1rxhk">10.33774/chemrxiv-2021-1rxhk</a>
```
Béquignon OJM, Bongers BJ, Jespers W, IJzerman AP, van de Water B, van Westen GJP.
Papyrus - A large scale curated dataset aimed at bioactivity predictions.
ChemRxiv. Cambridge: Cambridge Open Engage; 2021;
This content is a preprint and has not been peer-reviewed.
```

## Installation

```bash
pip install papyrus-scripts
``` 

:warning: If pip prints the following message and the installation then results in import errors:
```bash
Defaulting to user installation because normal site-packages is not writeable
```
then uninstall and reinstall the library with the following commands:
```bash
pip uninstall -y papyrus-scripts
python -m pip install papyrus-scripts
```
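
To check whether the reinstall helped, one can inspect where Python finds the package. The snippet below is a minimal sketch using only the standard library; the helper name `describe_install` is hypothetical, and `papyrus_scripts` is the import name used in the API examples of this README.

```python
import importlib.util
import site
from pathlib import Path


def describe_install(module_name: str) -> dict:
    """Report whether a module is importable and whether it sits in the user site-packages."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        # Not importable at all, or a namespace package without a single location.
        return {"found": spec is not None, "location": None, "in_user_site": False}
    location = Path(spec.origin).resolve()
    user_site = Path(site.getusersitepackages()).resolve()
    return {
        "found": True,
        "location": str(location),
        # True when the module file lives somewhere below the per-user site-packages folder.
        "in_user_site": user_site in location.parents,
    }


if __name__ == "__main__":
    print(describe_install("papyrus_scripts"))
```

If `found` is `False`, or the location is not the environment one expects, the interpreter running the code is probably not the one pip installed into.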

Additional dependencies can be installed to allow:
- similarity and substructure searches:
    ```bash
    conda install FPSim2 openbabel h5py cupy -c conda-forge
    ```

- training DNN models:
    ```bash
    conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
    ```

## Download the dataset

The Papyrus data can be downloaded in three different ways.<br/>
**The use of the command line interface is strongly recommended to download the data.**

### - Using the command line interface (CLI)

Once the library is installed (see [*Installation*](https://github.com/OlivierBeq/Papyrus-scripts#installation)),
one can easily download the data.

- The following command will download the Papyrus++ bioactivities and protein targets (high-quality Ki and KD data as well as IC50 and EC50 of reproducible assays) for the latest version.
```bash
papyrus download -V latest
```

- The following command will download the entire set of high-, medium-, and low-quality bioactivities and protein targets along with all precomputed molecular and protein descriptors for version 05.5.
```bash
papyrus download -V 05.5 --more --d all 
```

- The following command will download Papyrus++ bioactivities, protein targets and compound structures for both version 05.4 and 05.5.
```bash
papyrus download -V 05.5 -V 05.4 -S 
```

More options can be found using 
```bash
papyrus download --help 
```
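
When the download has to be scripted, the same command line can be assembled from Python. This is only a sketch: `build_download_command` is a hypothetical helper, and it uses only the `-V` and `-S` switches shown in the examples above.

```python
import shlex
from typing import Iterable, List


def build_download_command(versions: Iterable[str], structures: bool = False) -> List[str]:
    """Assemble a `papyrus download` invocation from the switches shown above."""
    command = ["papyrus", "download"]
    for version in versions:
        command += ["-V", version]  # one -V switch per requested version
    if structures:
        command.append("-S")  # also fetch compound structures
    return command


command = build_download_command(["05.5", "05.4"], structures=True)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues.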

By default, the data is downloaded to [pystow](https://github.com/cthoyt/pystow)'s default directory.<br/>
One can override the folder path by specifying the `-o` switch in the above commands.
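
To find out where the files end up when `-o` is not given, the sketch below reproduces pystow's documented convention as we understand it: a `~/.data` base folder that can be overridden with the `PYSTOW_HOME` or `PYSTOW_NAME` environment variables. Both helper names are hypothetical, and the `papyrus` sub-folder name is an assumption that should be checked against an actual download.

```python
import os
from pathlib import Path


def pystow_base_dir() -> Path:
    """Base folder pystow is expected to use (assumption: documented defaults)."""
    explicit_home = os.environ.get("PYSTOW_HOME")
    if explicit_home:
        return Path(explicit_home).expanduser()
    return Path.home() / os.environ.get("PYSTOW_NAME", ".data")


def papyrus_default_dir(module: str = "papyrus") -> Path:
    """Expected Papyrus data folder; the module name is an assumption."""
    return pystow_base_dir() / module


print(papyrus_default_dir())
```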

### - Using the application programming interface (API)

```python

from papyrus_scripts import download_papyrus

# Download the latest version of the entire dataset with all precomputed descriptors
download_papyrus(version='latest', only_pp=False, structures=True, descriptors='all')
```

### - Directly from online archives 

Different online servers host the Papyrus data based on release and ChEMBL version (table below).

 
| Papyrus version | ChEMBL version |                                Zenodo                                |                            4TU                            |                                                Google Drive                                                 |
|:---------------:|:--------------:|:--------------------------------------------------------------------:|:---------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------:|
|      05.4       |       29       |                                 :x:                                  | [:heavy_check_mark:](https://doi.org/10.4121/16896406.v2) | [:heavy_check_mark:](https://drive.google.com/drive/folders/1Lhw5G6gu_nLzHQoGmnl02uhFsmOgEZ5a?usp=sharing)  | 
|      05.5       |       30       | [:heavy_check_mark:](https://zenodo.org/record/7019874#.Y2lECL3MKUk) |                            :x:                            | [:heavy_check_mark:](https://drive.google.com/drive/folders/1BrCx0lN1YVvjgXOOaJZHJ7DBrLqFAbWV?usp=sharing)  |
|      05.6       |       31       | [:heavy_check_mark:](https://zenodo.org/record/7377161#.Y5BvrHbMKUk) |                            :x:                            |                                                     :x:                                                     |

Precomputed molecular and protein descriptors along with molecular structures (2D for default set and 3D for low quality set with stereochemistry) are not available for version 05.4 from 4TU but are from Google Drive.
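
For scripted lookups, the availability table can be written as a small dictionary. This is a sketch that only restates the table; the names `RELEASES`, `hosts_for` and `versions_on` are hypothetical.

```python
from typing import Dict, List

# Papyrus version -> ChEMBL version and hosting servers, copied from the table above.
RELEASES: Dict[str, dict] = {
    "05.4": {"chembl": 29, "hosts": ["4TU", "Google Drive"]},
    "05.5": {"chembl": 30, "hosts": ["Zenodo", "Google Drive"]},
    "05.6": {"chembl": 31, "hosts": ["Zenodo"]},
}


def hosts_for(version: str) -> List[str]:
    """Servers hosting a given Papyrus version (raises KeyError if unknown)."""
    return list(RELEASES[version]["hosts"])


def versions_on(host: str) -> List[str]:
    """Papyrus versions available from a given server, in ascending order."""
    return sorted(v for v, info in RELEASES.items() if host in info["hosts"])


print(hosts_for("05.5"))
print(versions_on("Zenodo"))
```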

As stated in the pre-print, **we strongly encourage** the use of the dataset in which stereochemistry was not considered.
This corresponds to files whose names contain "2D" and/or "without_stereochemistry".
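
When many files have been downloaded by hand, the recommended ones can be picked out by name. The sketch below applies the rule just stated; the helper name and the file names are hypothetical and serve only as illustration.

```python
import re

# "2D" as a standalone token, so that it does not match inside longer words.
_TOKEN_2D = re.compile(r"(?<![A-Za-z0-9])2D(?![A-Za-z0-9])")


def is_without_stereochemistry(filename: str) -> bool:
    """True if the file name carries the "2D" or "without_stereochemistry" mention."""
    return bool(_TOKEN_2D.search(filename)) or "without_stereochemistry" in filename


# Hypothetical file names, for illustration only.
candidates = [
    "example_set_without_stereochemistry.tsv.xz",
    "example_2D_structures.sd.xz",
    "example_set_with_stereochemistry.tsv.xz",
]
recommended = [name for name in candidates if is_without_stereochemistry(name)]
print(recommended)
```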

## Interconversion of the compressed files

The available LZMA-compressed files (*.xz*) may not be supported by some software (e.g. Pipeline Pilot).
<br/>**Decompressing the data is strongly discouraged!**<br/>
Though Gzip files were made available at 4TU for version 05.4, we now provide a CLI option to locally interconvert from LZMA to Gzip and vice-versa.

To convert from LZMA to Gzip (or vice-versa) use the following command:
```bash
papyrus convert -v latest 
```
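
For reference, the same kind of interconversion can be done with the Python standard library alone. The sketch below is not the implementation behind `papyrus convert`; `interconvert` is a hypothetical helper that streams from one compressed format to the other, so the uncompressed data is never written to disk.

```python
import gzip
import lzma
import shutil
from pathlib import Path


def interconvert(source) -> Path:
    """Recompress a .xz file as .gz, or a .gz file as .xz, next to the original."""
    source = Path(source)
    if source.suffix == ".xz":
        open_in, open_out, new_suffix = lzma.open, gzip.open, ".gz"
    elif source.suffix == ".gz":
        open_in, open_out, new_suffix = gzip.open, lzma.open, ".xz"
    else:
        raise ValueError(f"Expected a .xz or .gz file, got: {source.name}")
    target = source.with_suffix(new_suffix)
    # Copy chunk by chunk: decompress from the source, recompress into the target.
    with open_in(source, "rb") as src, open_out(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

The original file is left in place; remove it afterwards if disk space is a concern.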

## Removal of the data

One can remove the Papyrus data using either the CLI or the API.

The following excerpts illustrate the removal of all Papyrus data files, including the utility files of all versions.
```bash
papyrus clean --remove_root
```

```python
from papyrus_scripts import remove_papyrus

remove_papyrus(papyrus_root=True)
```


## Easy handling of the dataset

Once installed, Papyrus-scripts allows for easy filtering of the data.<br/>
- Simple examples can be found in the <a href="notebook_examples/simple_examples.ipynb">simple_examples.ipynb</a> notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/OlivierBeq/Papyrus-scripts/blob/master/notebook_examples/simple_examples.ipynb)
- An example of matching data with the Protein Data Bank can be found in the <a href="notebook_examples/matchRCSB.ipynb">matchRCSB.ipynb</a> notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/OlivierBeq/Papyrus-scripts/blob/master/notebook_examples/matchRCSB.ipynb)
- More advanced examples will be added to the <a href="notebook_examples/advanced_querying.ipynb">advanced_querying.ipynb</a> notebook.

## Reproducing results of the pre-print

The scripts used to extract subsets, generate models and obtain visualizations can be found <a href="https://github.com/OlivierBeq/Papyrus-modelling">here</a>.

## Features to come

- [x] Substructure and similarity molecular searches
- [x] Ability to use DNN models
- [ ] Adapt models to QSPRpred
- [x] Ability to repeat model training over multiple seeds
- [ ] Y-scrambling
 
## Examples to come

- Use of custom grouping schemes for training/test set splitting and cross-validation
- Use of custom molecular and protein descriptors (either a Python function or a file on disk)


## Logos

Logos can be found under <a href="figures/logo">**figures/logo**</a>.
Two versions exist, depending on the background used.

:warning: GitHub does not render the white logo properly in the table below, but this should not deter you from using it!

<div class="colored-table">

|                                                              On white background                                                              |                     On colored background                     |
|:---------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------:|
|                                            <img src="figures/logo/Papyrus_trnsp-bg.svg" width=200>                                            | <img src="figures/logo/Papyrus_trnsp-bg-white.svg" width=200> |

</div>

Raw data

{
    "_id": null,
    "home_page": "https://github.com/OlivierBeq/Papyrus-scripts",
    "name": "papyrus-scripts",
    "maintainer": "Olivier J. M. B\u00e9quignon",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "\"olivier.bequignon.maintainer@gmail.com\"",
    "keywords": "bioactivity data,QSAR,proteochemometrics,cheminformatics,modelling,machine learning",
    "author": "Olivier J. M. B\u00e9quignon - Brandon J. Bongers - Willem Jespers",
    "author_email": "\"olivier.bequignon.maintainer@gmail.com\"",
    "download_url": "https://files.pythonhosted.org/packages/6b/54/6ae4f489632303221c4e7e55bf7277cd1c8bbc6061531e898dcee8383ada/papyrus_scripts-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "A collection of scripts to handle the Papyrus bioactivity dataset",
    "version": "1.0.2",
    "project_urls": {
        "Homepage": "https://github.com/OlivierBeq/Papyrus-scripts"
    },
    "split_keywords": [
        "bioactivity data",
        "qsar",
        "proteochemometrics",
        "cheminformatics",
        "modelling",
        "machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "37e6ce2206c4a5147ad1ce46002653ebea21ec6a11d1e14c8efb8ab1ce937503",
                "md5": "7ada4ea2b4c328d8b822557ba169902c",
                "sha256": "29e52798c965167b8971e5c644121d3cb8f8d34bc2f1497e759e913c08ef39f7"
            },
            "downloads": -1,
            "filename": "papyrus_scripts-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7ada4ea2b4c328d8b822557ba169902c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 72043,
            "upload_time": "2023-05-16T10:09:45",
            "upload_time_iso_8601": "2023-05-16T10:09:45.610114Z",
            "url": "https://files.pythonhosted.org/packages/37/e6/ce2206c4a5147ad1ce46002653ebea21ec6a11d1e14c8efb8ab1ce937503/papyrus_scripts-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6b546ae4f489632303221c4e7e55bf7277cd1c8bbc6061531e898dcee8383ada",
                "md5": "c4ccd6d6099822832c4196b967850503",
                "sha256": "77e4afd27a7b4bb9f7675a89a6dd351df15d3a36c22a7bd2415ae2d18177c690"
            },
            "downloads": -1,
            "filename": "papyrus_scripts-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "c4ccd6d6099822832c4196b967850503",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 69041,
            "upload_time": "2023-05-16T10:09:52",
            "upload_time_iso_8601": "2023-05-16T10:09:52.521319Z",
            "url": "https://files.pythonhosted.org/packages/6b/54/6ae4f489632303221c4e7e55bf7277cd1c8bbc6061531e898dcee8383ada/papyrus_scripts-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-16 10:09:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "OlivierBeq",
    "github_project": "Papyrus-scripts",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "tox": true,
    "lcname": "papyrus-scripts"
}
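
The `sha256` digests listed in the metadata above can be used to check a manually downloaded distribution file. The following is a minimal sketch using only the standard library; `sha256_matches` is a hypothetical helper, not part of papyrus-scripts.

```python
import hashlib
from pathlib import Path


def sha256_matches(path, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Compare the SHA-256 digest of a file with an expected hexadecimal digest."""
    digest = hashlib.sha256()
    with open(Path(path), "rb") as handle:
        # Read in chunks so that large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Example call, using the digest recorded above for the source distribution:
# sha256_matches(
#     "papyrus_scripts-1.0.2.tar.gz",
#     "77e4afd27a7b4bb9f7675a89a6dd351df15d3a36c22a7bd2415ae2d18177c690",
# )
```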
