  * Name: descriptastorus
  * Version: 2.6.1
  * Home page: https://github.com/bp-kelley/descriptastorus
  * Summary: Descriptor creation, storage and molecular file indexing
  * Upload time: 2023-07-28 15:36:59
  * Author: Brian Kelley
  * Requires Python: >=3.6

DescriptaStorus
===============

DescriptaStorus provides

  1. Fast random access to rows of properties suitable for machine learning
  2. Fast random access to indexed molecule files
  3. A mechanism for generating new descriptors for new molecules
  4. A mechanism for validating that you can recreate the same storage in different software/hardware environments
  5. An easy script for making your own descriptor files from raw data

[n.b.] kyotocabinet is required to read/write the inchiKey and name indices.
  It should be installed in your environment.

There are three basic ways to use DescriptaStorus:
  
  1. Make a DescriptaStore using a script
  2. Append new data to the store
  3. Use a DescriptaStore to access properties

Installing
==========

```
# 1. install rdkit
# 2. install scikit-learn
pip install git+https://github.com/bp-kelley/descriptastorus
```

Requirements are in the setup.py file, but essentially:

 1. python (>=3.6 according to the package metadata)
 2. rdkit
 3. [optional but highly recommended] kyotocabinet

Using RDKit descriptors
=======================
Grab a descriptor generator from the registry.

Currently registered descriptors:

	* atompaircounts
	* morgan3counts
	* morganchiral3counts
	* morganfeature3counts
	* rdkit2d
	* rdkit2dnormalized
	* rdkitfpbits

Descriptors are input as a tuple or list to the generator.

```
from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
generator = MakeGenerator(("RDKit2D",))
for name, numpy_type in generator.GetColumns():
  print("name: {} data type: {}".format(name, numpy_type))
```

The resulting columns and datatypes look like:
```
name: RDKit2D_calculated data type: <class 'bool'>
name: BalabanJ data type: <class 'numpy.float64'>
name: BertzCT data type: <class 'numpy.float64'>
name: Chi0 data type: <class 'numpy.float64'>
name: Chi0n data type: <class 'numpy.float64'>
name: Chi0v data type: <class 'numpy.float64'>
name: Chi1 data type: <class 'numpy.float64'>

```

Note: RDKit2D_calculated is just a flag for the store to indicate that the
RDKit2D features were successfully calculated.

To combine multiple generators, simply add them to the list
of desired datatypes:

```
from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
generator = MakeGenerator(("RDKit2D", "Morgan3Counts"))
smiles = "c1ccccc1"
data = generator.process(smiles)
assert data[0] is True
```

The first element is True if the molecule was successfully processed.  This is used
in the descriptastore to indicate that the row is valid.

If a molecule cannot be processed, None is returned:

```
data = generator.process("not a smiles")
assert data is None
```
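
A minimal sketch of applying these two conventions (a `None` result for unparsable input, and a leading success flag) over a batch of smiles. The helper name `featurize_all` is hypothetical and not part of descriptastorus; it accepts any callable that follows the same return convention as `generator.process`, so a stand-in function is used here to keep the example self-contained.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def featurize_all(
    process: Callable[[str], Optional[Sequence]],
    smiles_list: Sequence[str],
) -> Tuple[List[str], List[list], List[str]]:
    """Split inputs into successfully featurized rows and failures."""
    kept, rows, failed = [], [], []
    for smiles in smiles_list:
        result = process(smiles)
        # None means the input could not be processed at all;
        # a falsy first element means the row is not valid.
        if result is None or not result[0]:
            failed.append(smiles)
            continue
        kept.append(smiles)
        rows.append(list(result[1:]))  # drop the leading flag
    return kept, rows, failed


# Stand-in for generator.process, used only for illustration.
def fake_process(smiles: str):
    if " " in smiles:
        return None
    return [True, float(len(smiles))]


kept, rows, failed = featurize_all(fake_process, ["c1ccccc1", "not a smiles"])
print(kept, rows, failed)
```

With the real library, `generator.process` would be passed in place of `fake_process`.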

Individual descriptor sets can also be used outside of the
generator.

```
from descriptastorus.descriptors import rdNormalizedDescriptors
from rdkit import Chem
import logging

# make the normalized descriptor generator
generator = rdNormalizedDescriptors.RDKit2DNormalized()
generator.columns # list of tuples:  (descriptor_name, numpytype) ...

# features = generator.process(smiles)
# features[0] is True/False if the smiles could be processed correctly
# features[1:] are the normalized descriptors as described in generator.columns

# example for converting a smiles string into the values
def rdkit_2d_normalized_features(smiles: str):
    # n.b. the first element is True/False if the descriptors were properly computed
    results = generator.process(smiles)
    if results is None:
        # assumption: like the combined generator above, None signals a failure
        logging.warning("Unable to process smiles %s", smiles)
        return None
    processed, features = results[0], results[1:]
    if not processed:
        logging.warning("Unable to process smiles %s", smiles)
    # if processed is False, the features are default values for the type
    return features
```

Making a DescriptaStore
=======================

See scripts/storus.py for more details:

```
usage: storus.py [-h] [--hasHeader] [--index-inchikey]
                 [--smilesColumn SMILESCOLUMN] [--nameColumn NAMECOLUMN]
                 [--seperator SEPERATOR]
                 smilesfile storage

positional arguments:
  smilesfile            file containing smiles strings
  storage               directory in which to store the descriptors

optional arguments:
  -h, --help            show this help message and exit
  --hasHeader           Indicate whether the smiles file has a header row
  --index-inchikey      Optionally index the descriptors with inchi keys
  --smilesColumn SMILESCOLUMN
                        Row index (or header name if the file has a header)
                        for the smiles column
  --nameColumn NAMECOLUMN
                        Row index (or header name if the file has a header)
                        for the name column
  --seperator SEPERATOR
                        Row index (or header name if the file has a header)
                        for the name column

```

Example:

Suppose you have a smiles file like the following:

```
SMILES STRU_ID
c1ccccc1 NAME
```

This is a whitespace-separated file with a header.  To make the standard
storage and also index the inchikey:

```
python scripts/storus.py --smilesColumn=SMILES --nameColumn=STRU_ID --hasHeader --index-inchikey \
  --seperator=" " \
  smiles.txt mysmiles-store
```

Note that smiles files are very separator dependent.  If the smiles or name column
can't be found, it might be because the separator (the `--seperator` option) is misspecified.
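
A quick way to check a separator before building a store is to split the header row yourself and confirm both column names appear. The sketch below is a standalone illustration using only the standard library; `check_columns` is a hypothetical helper, not part of storus.py.

```python
import io


def check_columns(handle, separator, smiles_column, name_column):
    """Return the positions of the smiles and name columns in the header row."""
    header = handle.readline().rstrip("\r\n").split(separator)
    missing = [c for c in (smiles_column, name_column) if c not in header]
    if missing:
        raise ValueError(
            "columns {} not found in header {}; is the separator right?".format(
                missing, header
            )
        )
    return header.index(smiles_column), header.index(name_column)


sample = "SMILES STRU_ID\nc1ccccc1 NAME\n"
print(check_columns(io.StringIO(sample), " ", "SMILES", "STRU_ID"))  # prints (0, 1)
```

Passing a tab as the separator for the same sample leaves the header as a single field, so the helper raises `ValueError`, which is the symptom described above.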

The default properties created are 'Morgan3Counts,RDKit2D'.

Using a DescriptaStore
======================

Using the descriptastore (the descriptastore is a directory of files):

```
from descriptastorus import DescriptaStore
d = DescriptaStore("/db/cix/descriptastorus/store")

# print out the column names
print(d.descriptors().colnames)

# this will take a while!
for moldata, descriptors in d:
    smiles, name = moldata
    descriptors # is a numpy array of data morgan3 counts + rdkit descriptors
```

Note that the descriptors may contain status flags named "X_calculated", where X
is one of the descriptor sets, such as RDKit2D.

These are not returned by the iterator, or through the following api points:

```
colnames = d.getDescriptorNames()
descriptors = d.getDescriptors(index)
for moldata, descriptors in d:
  ...
```

To obtain these flags, you can either set the keepCalculatedFlags option

```
colnames = d.getDescriptorNames(keepCalculatedFlags=True)
descriptors = d.getDescriptors(index, keepCalculatedFlags=True)
```

or use the direct descriptor interface:

```
# to iterate through only the descriptors:
for descriptors in d.descriptors():
    ...
```
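
When working with rows from the direct descriptor interface, the flag columns are still present. As a rough illustration of what dropping them amounts to, the sketch below filters a list of column names and a matching row by a `_calculated` suffix. The case-insensitive suffix test, the helper name `drop_calculated_flags`, and the sample values are assumptions made for illustration; they are not taken from the library.

```python
def drop_calculated_flags(colnames, row):
    """Remove columns whose name ends in '_calculated' (case-insensitive)."""
    keep = [
        i for i, name in enumerate(colnames)
        if not name.lower().endswith("_calculated")
    ]
    return [colnames[i] for i in keep], [row[i] for i in keep]


colnames = ["RDKit2D_calculated", "BalabanJ", "BertzCT"]
row = [True, 1.5, 2.5]  # made-up values, for illustration only
names, values = drop_calculated_flags(colnames, row)
print(names, values)
```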

To look up by name (requires kyotocabinet):

```
rows = []
for name in names:
    rows.extend( d.lookupName(name) )

# sorting the rows helps with disk seeking
rows.sort()
for row in rows:
    descriptors = d.getDescriptors(row)
    ...
```

To look up by inchikey (requires kyotocabinet):

```
rows = []
for key in inchiKeys:
    rows.extend( d.lookupInchiKey(key) )

rows.sort()
for row in rows:
    descriptors = d.descriptors().get(row)
    smiles, name = d.molIndex().get(row)
    ...
```

Doing things yourself
=====================
    
Creating a Raw store
--------------------

The storage system is quite simple.  It is made by specifying the column names and
numpy types to store and also the number of rows to initialize.

Example:

```
 >>> from descriptastorus import raw
 >>> import numpy
 >>> columns = [('exactmw', numpy.float64), ('numRotatableBonds', numpy.int32)]
 >>> r = raw.MakeStore( columns, 2, "store")
 >>> r.putRow(0, [45.223, 3])
```
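
To illustrate why a store defined by fixed column types and a preallocated row count allows fast random access, here is a minimal standard-library sketch. It is not how descriptastorus is implemented; the `TinyStore` class and its file layout are assumptions made for illustration. Because every row has the same byte size, row `i` starts at offset `i * row_size` and can be read with a single seek.

```python
import os
import struct
import tempfile


class TinyStore:
    """Fixed-width binary rows: one float64 and one int32 per row."""

    fmt = "<di"  # little-endian float64 + int32, no padding

    def __init__(self, path, nrows):
        self.path = path
        self.row_size = struct.calcsize(self.fmt)
        # Preallocate zero-filled space for all rows.
        with open(path, "wb") as f:
            f.write(b"\0" * (self.row_size * nrows))

    def put_row(self, index, values):
        with open(self.path, "r+b") as f:
            f.seek(index * self.row_size)
            f.write(struct.pack(self.fmt, *values))

    def get(self, index):
        with open(self.path, "rb") as f:
            f.seek(index * self.row_size)
            return list(struct.unpack(self.fmt, f.read(self.row_size)))


path = os.path.join(tempfile.mkdtemp(), "tiny.bin")
store = TinyStore(path, nrows=2)
store.put_row(0, [45.223, 3])
print(store.get(0))
print(store.get(1))  # a row that was never written reads back as zeros
```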

Using an existing store
-----------------------

After creation, to open the read-only storage:

```
 >>> r = raw.RawStore("store")
```

Get the number of rows:

```
 >>> r.N
 2
```

Get the column names:

```
 >>> r.colnames
 ['exactmw', 'numRotatableBonds']
```

Extract a row:

```
>>> r.get(0)
[45.223, 3]
```

Make a MolFileIndex
===================

If the smiles file has a header:

```
>>> from descriptastorus import MolFileIndex
>>> index = MolFileIndex.MakeSmilesIndex("data/test1.smi", "test1", hasHeader=True,
...                                      smilesColumn="smiles", nameColumn="name")
>>> index.N
13
>>> index.getMol(12)
'c1ccccc1CCCCCCCCCCCC'
>>> index.getName(12)
13
```

If the smiles file has no header:

```
>>> from descriptastorus import MolFileIndex
>>> index = MolFileIndex.MakeSmilesIndex("data/test2.smi", "test2", hasHeader=False,
...                                      smilesColumn=1, nameColumn=0)
>>> index.N
13
>>> index.getMol(12)
'c1ccccc1CCCCCCCCCCCC'
>>> index.getName(12)
13
```

Use a MolFileIndex
==================

Using a molfile index is fairly simple:

```
>>> from descriptastorus import MolFileIndex
>>> idx = MolFileIndex("/db/cix/descriptastorus/test")
>>> idx.get(1000)
['CC(C)(O)c1ccc(nc1)c4ccc3C=CN(Cc2ccc(F)cc2)c3c4', 'XX-AAAA']
>>> idx.getName(1000)
'XX-AAAA'
>>> idx.getMol(1000)
'CC(C)(O)c1ccc(nc1)c4ccc3C=CN(Cc2ccc(F)cc2)c3c4'
```
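
The general idea behind random access to a line-oriented molecule file can be sketched by recording the byte offset of each line once, then seeking directly to the requested line. This illustrates the technique only and makes no claim about MolFileIndex internals; `build_offsets` and `get_line` are hypothetical helpers.

```python
import os
import tempfile


def build_offsets(path):
    """Record the byte offset at which each line starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        line = f.readline()
        while line:
            offsets.append(pos)
            pos = f.tell()
            line = f.readline()
    return offsets


def get_line(path, offsets, index, separator=" "):
    """Seek to line `index` and split it into fields."""
    with open(path, "rb") as f:
        f.seek(offsets[index])
        return f.readline().decode("utf-8").rstrip("\r\n").split(separator)


path = os.path.join(tempfile.mkdtemp(), "tiny.smi")
with open(path, "w", newline="\n") as f:
    f.write("c1ccccc1 benzene\nCCO ethanol\n")

offsets = build_offsets(path)
print(get_line(path, offsets, 1))
```

Building the offsets requires one pass over the file; after that, each lookup costs a single seek and read regardless of the file size.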

Installation
============

```
  git clone https://bitbucket.org/novartisnibr/rdkit-descriptastorus.git
  cd rdkit-descriptastorus
  python setup.py install
```


TODO:

  * fast forwards iteration (fast now, but could be faster)
  * faster append-only store creation
  * Fast molecule indexing/lookup (almost done)
  * Output to bcolz pandas backend

            
