cheminftools


Name: cheminftools
Version: 0.1.4
Home page: https://github.com/marcossantanaioc/cheminftools
Summary: A collection of tools for daily cheminformatics tasks.
Upload time: 2023-06-08 22:06:26
Author: marcossantanaioc
Requires Python: >=3.9
License: Apache Software License 2.0
Keywords: cheminformatics computational chemistry rdkit

cheminftools
================

## Installation

> From GitHub repo:
    ```pip install git+https://github.com/marcossantanaioc/cheminftools.git```

> From PyPI:
    ```pip install cheminftools```



## How to use

cheminftools offers a collection of cheminformatics scripts for daily tasks.
Currently supported tasks include:

    1 - Standardization of chemical structures

    2 - Calculation of molecular descriptors

    3 - Filtering datasets using your own or predefined alerts (e.g. PAINS, Dundee, Glaxo)

# Standardization

A dataset of molecules can be standardized in a single line of code. First, import the tools and load the data:

``` python
import pandas as pd
import numpy as np
from cheminftools.tools.sanitizer import MolCleaner
from cheminftools.tools.featurizer import MolFeaturizer
from cheminftools.tools.filtering import MolFilter
from rdkit import Chem
import json
```

``` python
data = pd.read_csv('../data/example_data.csv')
```

# Sanitizing

The `MolCleaner` class performs sanitization tasks, including:

        1. Standardize unknown stereochemistry (handled by the RDKit Mol file parser)
            i) Fix wiggly bonds on sp3 carbons - set atoms and bonds marked as unknown stereo to no stereo
            ii) Fix wiggly bonds on double bonds - set the double bond to a crossed bond
        2. Clear S group data from the mol file
        3. Kekulize the structure
        4. Remove H atoms (see the page on explicit Hs for more details)
        5. Normalization:
            Fix hypervalent nitro groups
            Fix KO to K+ O- and NaO to Na+ O- (also add Li+ to this)
            Correct amides with N=COH
            Standardise sulphoxides to the charge-separated form
            Standardize diazonium N (atom :2 here: [*:1]-[N;X2:2]#[N;X1:3]>>[*:1]) to N+
            Ensure quaternary N is charged
            Ensure trivalent O ([*:1]=[O;X2;v3;+0:2]-[#6:3]) is charged
            Ensure trivalent S ([O:1]=[S;D2;+0:2]-[#6:3]) is charged
            Ensure halogen with no neighbors ([F,Cl,Br,I;X0;+0:1]) is charged
        6. Neutralize the molecule, if possible (see the page on neutralization rules for more details)
        7. Remove stereo from tartrate to simplify salt matching
        8. Normalise (straighten) triple bonds and allenes

        The curation steps of the ChEMBL structure pipeline were augmented with additional steps to identify duplicated entries:
        9. Find stereo centers
        10. Generate InChIKeys
        11. Find duplicated SMILES. If the same SMILES is present multiple times, two outcomes are possible:
            i. The same compound (e.g. same ID and same SMILES)
            ii. Isomers with different SMILES, IDs and/or activities

            In case i), the compounds are merged by taking the median of all numeric columns in the dataframe.
            In case ii), the compounds are further classified as 'to merge' or 'to keep' depending on the activity values:
                a) Compounds are merged ('to merge') if the difference in activities is less than 1 log unit.
                b) Compounds are kept as individual entries ('to keep') if the difference in activities is larger than 1 log unit. In this case, the user can
                select which compound to keep - the one with the highest or the lowest activity.

``` python
processed_data = MolCleaner.from_df(data, smiles_col='smiles', act_col='pIC50', id_col='molecule_chembl_id')
```

    +-------------------------------------------------------------+-------------------------------------------------------------+
    |                      processed_smiles                       |                           smiles                            |
    +=============================================================+=============================================================+
    |       N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12       |       N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12       |
    +-------------------------------------------------------------+-------------------------------------------------------------+
    |       COc1cccc(-c2cn(-c3ccc(CNCCO)cc3)c3ncnc(N)c23)c1       |       COc1cccc(-c2cn(-c3ccc(CNCCO)cc3)c3ncnc(N)c23)c1       |
    +-------------------------------------------------------------+-------------------------------------------------------------+
    | Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 | Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 |
    +-------------------------------------------------------------+-------------------------------------------------------------+
    |               C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1                |               C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1                |
    +-------------------------------------------------------------+-------------------------------------------------------------+
    | Cc1cc2cc(Nc3ccnc4cc(-c5ccc(CNCCN6CCNCC6)cc5)sc34)ccc2[nH]1  | Cc1cc2cc(Nc3ccnc4cc(-c5ccc(CNCCN6CCNCC6)cc5)sc34)ccc2[nH]1  |
    +-------------------------------------------------------------+-------------------------------------------------------------+
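
The duplicate-handling rule described in step 11 can be illustrated with a small, library-independent sketch. This is not the `MolCleaner` implementation: the function name `resolve_duplicates`, its parameters, the use of the median for the 'to merge' case, and the treatment of a difference of exactly 1 log unit (kept here) are all assumptions made for illustration.

``` python
from statistics import median


def resolve_duplicates(activities, threshold=1.0, keep="highest"):
    """Decide how to handle activity values recorded for duplicated entries.

    Hypothetical helper: returns a (decision, value) tuple.
    """
    spread = max(activities) - min(activities)
    if spread < threshold:
        # Values agree within the threshold: merge them (median is an assumption here)
        return "to merge", median(activities)
    # Values disagree: keep a single entry chosen by the user
    chosen = max(activities) if keep == "highest" else min(activities)
    return "to keep", chosen


print(resolve_duplicates([6.1, 6.4, 6.3]))            # ('to merge', 6.3)
print(resolve_duplicates([5.0, 7.2], keep="lowest"))  # ('to keep', 5.0)
```

The threshold of 1 log unit follows the text above; making it a parameter simply shows where a stricter or looser cutoff would plug in.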

# Filtering

The `MolFilter` class is responsible for removing compounds that match defined
substructural alerts. The `AlertMatcher` class can be used to generate your own catalog of alerts
from a dictionary. You can also use catalogs from RDKit, such as the [PAINS catalog](http://rdkit.org/docs/source/rdkit.Chem.rdfiltercatalog.html).

The example below shows how to create an alerts catalog starting from a CSV file of alerts, keeping only the Glaxo rule set.

### Load CSV and prepare dictionary
``` python
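# Load the alert collection and keep only the Glaxo rule set
alerts_df = pd.read_csv('../data/libraries/alert_collection.csv')
alerts_df = alerts_df[alerts_df['rule_set_name'] == 'Glaxo']
# Rename without inplace=True to avoid modifying a filtered slice
alerts_df = alerts_df.rename(columns={'smarts': 'SMARTS'})
# Index by alert description so each description maps to its properties
alerts_df_reindex = alerts_df[['description', 'SMARTS', 'rule_set_name', 'priority', 'max_matches']].set_index('description')
alerts_dict = alerts_df_reindex.to_dict(orient='index')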

```
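
To make the shape of the resulting dictionary concrete, here is a minimal sketch that builds an equivalent structure with the standard library only. The two alerts are taken from the output table in this README; the `priority` and `max_matches` values are placeholders, and the CSV layout is assumed from the column names used above.

``` python
import csv
import io

# Hypothetical two-row extract; priority and max_matches are placeholder values
csv_text = """description,smarts,rule_set_name,priority,max_matches
R17 acylhydrazide,[N;R0][N;R0]C(=O),Glaxo,1,0
R21 Nitroso,[N&D2](=O),Glaxo,1,0
"""

alerts_dict = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    if row["rule_set_name"] != "Glaxo":
        continue  # keep only the Glaxo rule set
    description = row.pop("description")  # the description becomes the key
    row["SMARTS"] = row.pop("smarts")     # rename the column as in the pandas code
    alerts_dict[description] = row

print(alerts_dict["R21 Nitroso"]["SMARTS"])  # [N&D2](=O)
```

Each alert description maps to a dictionary of its properties, which is the layout `to_dict(orient='index')` produces. Unlike pandas, `csv.DictReader` leaves numeric fields as strings.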
### Create matcher object from dict
``` python
# AlertMatcher must be imported first; its module path is not shown in this README
matcher = AlertMatcher(alerts_dict)
catalog = matcher.create_matcher()
```
### Run filtering
``` python
alerts_data = MolFilter.from_df(df=processed_data, smiles_column='processed_smiles', catalog=catalog)
```

    +----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
    |                                  smiles                                    |         SMARTS        |          alert_name       |   rule_set_name  |
    +============================================================================+=======================+===========================+==================+
    |        Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1         |   [N;R0][N;R0]C(=O)   |     R17 acylhydrazide     |      Glaxo       |
    +----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
    |               O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1                | [Br,Cl,I][CX4;CH,CH2] | R1 Reactive alkyl halides |      Glaxo       |
    +----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
    |               O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1                |   [N;R0][N;R0]C(=O)   |     R17 acylhydrazide     |      Glaxo       |
    +----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
    |               O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1                |      [N&D2](=O)       |        R21 Nitroso        |      Glaxo       |
    +----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
    | CS(=O)(=O)O[C@H]1CN[C@H](C#Cc2cc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3s2)C1 |   COS(=O)(=O)[C,c]    |      R5 Sulphonates       |      Glaxo       |
    +----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
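
A table like the one above lists one row per (compound, alert) match, so a compound can appear several times. The sketch below shows one way to group matches per compound and drop flagged compounds from a dataset. It uses plain Python data structures rather than the `MolFilter` output object, whose exact type is not documented here; the variable names are illustrative.

``` python
from collections import defaultdict

# A few rows copied from the match table above
matches = [
    {"smiles": "Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1",
     "alert_name": "R17 acylhydrazide"},
    {"smiles": "O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1",
     "alert_name": "R1 Reactive alkyl halides"},
    {"smiles": "O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1",
     "alert_name": "R21 Nitroso"},
]

# SMILES taken from the sanitized dataset shown earlier
dataset = [
    "N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12",
    "Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1",
    "C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1",
]

# Group alert names by compound
alerts_by_smiles = defaultdict(list)
for row in matches:
    alerts_by_smiles[row["smiles"]].append(row["alert_name"])

# Keep only compounds that triggered no alert
clean = [smi for smi in dataset if smi not in alerts_by_smiles]
print(len(clean))  # 2
```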


# Featurization

The `MolFeaturizer` class converts SMILES into molecular descriptors. The current version
supports Morgan fingerprints, atom pairs, torsion fingerprints, RDKit
fingerprints, MACCS keys and 200 constitutional descriptors.

``` python
fingerprinter = MolFeaturizer('rdkit2d')
```

``` python
X = fingerprinter.transform(processed_data['processed_smiles'])
```

``` python
X[:5, :5]
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=uint8)

```

# Collecting data from ChEMBL
The current version of cheminftools supports queries to ChEMBL based on UniProt accession codes.
The `ChemblFetcher` class retrieves activity data for one or more targets from a ChEMBL database installation.
Users can find the latest version (and also older ones) of ChEMBL on the official [page](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/).
Installation instructions are included with each ChEMBL release.

`ChemblFetcher` expects a configuration file for the database. This file includes information such as
the host, user, password and port used to connect to the database.
An example is shown below:

```ini
[postgresql]
host = localhost
database = customer
user = postgres
password = admindb
port = 5432
```
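
A file in this format can be checked with Python's standard `configparser` before handing it to `ChemblFetcher`. The sketch below is not how `ChemblFetcher` reads the file internally (that is not documented here); `read_db_config` is a hypothetical helper, and the configuration is parsed from a string so the example is self-contained.

```python
from configparser import ConfigParser

config_text = """
[postgresql]
host = localhost
database = customer
user = postgres
password = admindb
port = 5432
"""


def read_db_config(text, section="postgresql"):
    """Parse connection settings from INI-formatted text (hypothetical helper)."""
    parser = ConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] not found in configuration")
    params = dict(parser.items(section))
    params["port"] = int(params["port"])  # INI values are strings by default
    return params


params = read_db_config(config_text)
print(params["host"], params["port"])  # localhost 5432
```

For a file on disk, `parser.read('database.ini')` would replace `parser.read_string(text)`.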

```python
from cheminftools.data.data_gather import ChemblFetcher

target_uniprot = ['P00742', 'P50613']
chembl = ChemblFetcher(database_config_filename='database.ini',  # Path to configuration file. You can find an example in the cheminftools.examples folder
                       database_name='chembl',  # Name of database
                       version='32')  # ChEMBL version to use
df = chembl.query_target_uniprot(target_uniprot=target_uniprot)
```
The output is a pandas DataFrame with the desired activity types (e.g. IC50, Kd, Ki)
for each target in `target_uniprot`.

            

        