chemicalspace


- Name: chemicalspace
- Version: 0.1.1
- Summary: An Object-oriented Representation for Chemical Spaces
- Upload time: 2024-06-03 15:04:46
- Requires Python: >=3.9
- License: MIT License. Copyright (c) 2024 Giulio Mattedi. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Keywords: chemistry, cheminformatics, rdkit, machine learning, ml
![](.badges/coverage.svg)  
![](.badges/tests.svg)

# ChemicalSpace

An Object-Oriented Representation for Chemical Spaces

`chemicalspace` is a Python package that provides an object-oriented
representation for chemical spaces. It is designed to be used in conjunction
with the `RDKit` package, which provides the underlying cheminformatics functionality.

While the awesome `RDKit` takes single molecules as its main frame of reference, `chemicalspace` focuses on operations on whole chemical spaces.

## Installation

To install `chemicalspace` you can use `pip`:

```bash
pip install chemicalspace
```

# Usage

The main class in `chemicalspace` is `ChemicalSpace`.
The class provides a number of methods for working with chemical spaces,
including reading and writing, filtering, clustering and
picking from chemical spaces.

## Basics

### Initialization

A `ChemicalSpace` can be initialized from SMILES strings or `RDKit` molecules.
It optionally takes molecule indices and scores as arguments.

```python
from chemicalspace import ChemicalSpace

smiles = ('CCO', 'CCN', 'CCl')
indices = ("mol1", "mol2", "mol3")
scores = (0.1, 0.2, 0.3)

space = ChemicalSpace(mols=smiles, indices=indices, scores=scores)

print(space)
```

```text
<ChemicalSpace: 3 molecules | 3 indices | 3 scores>
```

### Reading and Writing

A `ChemicalSpace` can be read from and written to SMI and SDF files.

```python
from chemicalspace import ChemicalSpace

# Load from SMI file
space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space.to_smi("outputs1.smi")

# Load from SDF file
space = ChemicalSpace.from_sdf("tests/data/inputs1.sdf")
space.to_sdf("outputs1.sdf")

print(space)
```

```text
<ChemicalSpace: 10 molecules | 10 indices | No scores>
```

### Indexing, Slicing and Masking

Indexing, slicing and masking a `ChemicalSpace` object each return a new `ChemicalSpace` object.

#### Indexing

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

print(space[0])
```

```text
<ChemicalSpace: 1 molecules | 1 indices | No scores>
```

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
idx = [1, 2, 4]

print(space[idx])
```

```text
<ChemicalSpace: 3 molecules | 3 indices | No scores>
```

#### Slicing

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

print(space[:2])
```

```text
<ChemicalSpace: 2 molecules | 2 indices | No scores>
```

#### Masking

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
mask = [True, False, True, False, True, False, True, False, True, False]

print(space[mask])
```

```text
<ChemicalSpace: 5 molecules | 5 indices | No scores>
```
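The three access patterns can be told apart by the type of the key. The toy container below is a sketch of that dispatch on a plain tuple of SMILES strings. `ToySpace` and its internals are hypothetical and are not the `chemicalspace` implementation.

```python
from typing import Iterable, Sequence, Union

Key = Union[int, slice, Sequence[int], Sequence[bool]]


class ToySpace:
    """Hypothetical stand-in for a chemical space: a tuple of SMILES strings."""

    def __init__(self, mols: Iterable[str]) -> None:
        self.mols = tuple(mols)

    def __len__(self) -> int:
        return len(self.mols)

    def __getitem__(self, key: Key) -> "ToySpace":
        # bool is a subclass of int, so reject a lone bool before the int check
        if isinstance(key, bool):
            raise TypeError("a single bool is not a valid key")
        if isinstance(key, int):
            # Single index: wrap the item so the result is still a space
            return ToySpace((self.mols[key],))
        if isinstance(key, slice):
            return ToySpace(self.mols[key])
        key = list(key)
        if key and all(isinstance(k, bool) for k in key):
            # Boolean mask: must match the length of the space
            if len(key) != len(self):
                raise ValueError("mask length must match the space length")
            return ToySpace(m for m, keep in zip(self.mols, key) if keep)
        # Otherwise treat the key as a list of integer positions
        return ToySpace(self.mols[i] for i in key)


space = ToySpace(["CCO", "CCN", "CCCl", "c1ccccc1", "CC(=O)O"])
print(len(space[0]), len(space[[1, 2, 4]]), len(space[:2]))
print(len(space[[True, False, True, False, True]]))
```

Returning a container in every branch, including the single-index case, mirrors the behaviour shown in the examples above, where `space[0]` is still a space with one molecule.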

### Deduplicating

Deduplicating a `ChemicalSpace` object removes duplicate molecules.  
See [Hashing and Identity](#hashing-and-identity) for details on molecule identity.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space_twice = space + space  # 20 molecules
space_deduplicated = space_twice.deduplicate()  # 10 molecules

print(space_deduplicated)
```

```text
<ChemicalSpace: 10 molecules | 10 indices | No scores>
```
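Conceptually, deduplication keeps the first entry seen for each identity key. The sketch below illustrates that idea on plain tuples. The `deduplicate` helper and the placeholder keys are assumptions made for illustration, with the key playing the role that the molecule hash plays in `chemicalspace`; it is not the library's code.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def deduplicate(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item for each key, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Hypothetical (identity key, label) pairs; the keys are placeholders, not real InChI Keys
entries = [("KEY-A", "mol1"), ("KEY-B", "mol2"), ("KEY-A", "mol3")]
unique_entries = deduplicate(entries, key=lambda entry: entry[0])
print(unique_entries)
```

Whether the library keeps the first or the last duplicate is not stated in this README, so the "first wins" rule here is only one reasonable choice.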

### Chunking

A `ChemicalSpace` object can be chunked into smaller `ChemicalSpace` objects.   
The `.chunks` method returns a generator of `ChemicalSpace` objects.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
chunks = space.chunks(chunk_size=3)

for chunk in chunks:
    print(chunk)
```

```text
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 1 molecules | 1 indices | No scores>
```
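The chunk sizes above follow from slicing the space in steps of `chunk_size`, with a shorter final chunk when the length is not a multiple of it. A minimal sketch of such a generator on a plain list (the standalone `chunks` function is hypothetical and only mirrors the method's name):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Ten items in chunks of three: three full chunks and one leftover item
chunk_sizes = [len(chunk) for chunk in chunks(list(range(10)), chunk_size=3)]
print(chunk_sizes)
```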

### Drawing

A `ChemicalSpace` object can be rendered as a grid of molecules.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space.draw()
```
![draw](.media/sample.png)

### Featurizing

#### Features

A `ChemicalSpace` object can be featurized as a `numpy` array of features.
By default, ECFP4/Morgan2 fingerprints are used.
The features are cached for subsequent calls,
and spaces generated by a `ChemicalSpace` object (e.g. by slicing, masking, chunking)
inherit the respective features.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space_slice = space[:6:2]

# Default ECFP4 features
print(space.features.shape)
print(space_slice.features.shape)
```

```text
(10, 1024)
(3, 1024)
```

#### Custom featurizer

A custom featurizer can be passed via the `featurizer` argument. It should take in an `rdkit.Chem.Mol` molecule, and its numerical
return value should be castable to a NumPy array (see `chemicalspace.utils.MolFeaturizerType`).

```python
from chemicalspace import ChemicalSpace
from chemicalspace.utils import maccs_featurizer

space = ChemicalSpace.from_smi("tests/data/inputs1.smi", featurizer=maccs_featurizer)
space_slice = space[:6:2]

# Custom MACCS features
print(space.features.shape)
print(space_slice.features.shape)
```

```text
(10, 167)
(3, 167)
```

#### Metrics

A distance metric on the feature space is necessary for clustering, calculating diversity, and
identifying neighbors. By default, the `jaccard` (a.k.a. Tanimoto) distance is used.
`ChemicalSpace` takes a `metric` string argument that specifies which `sklearn` metric to use.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi", metric='euclidean')
```
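For binary fingerprints, the Jaccard distance is one minus the Tanimoto similarity, i.e. one minus the ratio of shared on-bits to the on-bits present in either fingerprint. The sketch below computes it on fingerprints represented as sets of on-bit positions. The `jaccard_distance` helper and the toy fingerprints are made up for illustration and do not come from the package.

```python
from typing import Set


def jaccard_distance(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Jaccard distance between two fingerprints given as sets of on-bit positions."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen here: two empty fingerprints are treated as identical
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)


fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
# Two shared bits out of five distinct bits: similarity 2/5
distance = jaccard_distance(fp1, fp2)
print(distance)
```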

### Binary Operations

#### Single entries

Single entries as SMILES strings or `RDKit` molecules
can be added to a `ChemicalSpace` object.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space.add("CCO", "mol11")

print(space)
```

```text
<ChemicalSpace: 11 molecules | 11 indices | No scores>
```

#### Chemical spaces

Two `ChemicalSpace` objects can be added together.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi")

space = space1 + space2

print(space)
```

```text
<ChemicalSpace: 25 molecules | 25 indices | No scores>
```

Two `ChemicalSpace` objects can also be subtracted, which returns only the molecules in `space1`
that are not in `space2`.   
See [Hashing and Identity](#hashing-and-identity) for more details.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi")

space = space1 - space2

print(space)
```

```text
<ChemicalSpace: 5 molecules | 5 indices | No scores>
```

### Hashing and Identity

Individual molecules in a chemical space are hashed by their InChI Keys only (by default), or by InChI Keys and index.
Scores **do not** affect the hashing process.

```python
from chemicalspace import ChemicalSpace

smiles = ('CCO', 'CCN', 'CCl')
indices = ("mol1", "mol2", "mol3")

# Two spaces with the same molecules, and indices
# But one space includes the indices in the hashing process
space_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=True)
space_no_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=False)

print(space_indices == space_indices)
print(space_indices == space_no_indices)
print(space_no_indices == space_no_indices)
```

```text
True
False
True
```

`ChemicalSpace` objects are hashed by their molecular hashes, in an **order-independent** manner.

```python
from rdkit import Chem
from rdkit.Chem import inchi
from chemicalspace import ChemicalSpace

mol = Chem.MolFromSmiles("c1ccccc1")
inchi_key = inchi.MolToInchiKey(mol)

space = ChemicalSpace(mols=(mol,))

assert hash(space) == hash(frozenset((inchi_key,)))
```
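The assertion above relies on a property of `frozenset`: its hash and equality ignore element order. The standalone sketch below shows this with placeholder strings in place of real InChI Keys. The last part, pairing each key with an index, is only an assumption about how index-aware hashing could work, not a description of the library's internals.

```python
# Placeholder identity keys; real keys would be InChI Keys computed by RDKit
keys_a = ("KEY-1", "KEY-2", "KEY-3")
keys_b = ("KEY-3", "KEY-1", "KEY-2")  # same keys, different order
keys_c = ("KEY-1", "KEY-2")

same_hash = hash(frozenset(keys_a)) == hash(frozenset(keys_b))
same_set = frozenset(keys_a) == frozenset(keys_b)
differs = frozenset(keys_a) != frozenset(keys_c)

# Assumption: including the index in the per-molecule key makes identity index-sensitive
with_idx_a = frozenset(zip(keys_a, ("mol1", "mol2", "mol3")))
with_idx_b = frozenset(zip(keys_a, ("molA", "molB", "molC")))
index_sensitive = with_idx_a != with_idx_b

print(same_hash, same_set, differs, index_sensitive)
```

Note that a set-based hash also collapses duplicates, so a space and its doubled copy would share the same set of keys.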

The identity of a `ChemicalSpace` is evaluated on its hashed representation.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space1_again = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi.gz")

print(space1 == space1)
print(space1 == space1_again)
print(space1 == space2)
```

```text
True
True
False
```

### Copy

`ChemicalSpace` supports copy and deepcopy operations.
A deep copy fully unlinks the copied object from the original one, including the `RDKit` molecules.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

# Shallow copy
space_copy = space.copy()
assert id(space.mols[0]) == id(space_copy.mols[0])

# Deep copy
space_deepcopy = space.copy(deep=True)
assert id(space.mols[0]) != id(space_deepcopy.mols[0])
```

## Clustering

### Labels

A `ChemicalSpace` can be clustered by its molecular features.
`kmedoids`, `agglomerative-clustering`, `sphere-exclusion` and `scaffold`
are the available clustering methods.
Refer to the respective methods in [`chemicalspace.layers.clustering`](chemicalspace/layers/clustering.py)
for more details.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
cluster_labels = space.cluster(n_clusters=3)

print(cluster_labels)
```

```text
[0 1 2 1 1 0 0 0 0 0]
```

### Clusters

`ChemicalSpace.yield_clusters` can be used to iterate clusters as `ChemicalSpace` objects.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
clusters = space.yield_clusters(n_clusters=3)

for cluster in clusters:
    print(cluster)
```

```text
<ChemicalSpace: 6 molecules | 6 indices | No scores>
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 1 molecules | 1 indices | No scores>
```

### KFold Clustering

`ChemicalSpace.split` can be used to iterate train/test cluster splits for ML training.
At each iteration, one cluster is used as the test set and the rest as the training set.
Note that there is no guarantee on the size of the clusters.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

for train, test in space.split(n_splits=3):
    print(train, test)
```

```text
<ChemicalSpace: 4 molecules | 4 indices | No scores> <ChemicalSpace: 6 molecules | 6 indices | No scores>
<ChemicalSpace: 7 molecules | 7 indices | No scores> <ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 9 molecules | 9 indices | No scores> <ChemicalSpace: 1 molecules | 1 indices | No scores>
```
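The split sizes above are what a leave-one-cluster-out scheme produces from the cluster labels. The sketch below reproduces the bookkeeping on a plain list of labels. The `cluster_splits` function is a hypothetical helper and the ordering of the held-out clusters is an assumption.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def cluster_splits(labels: Sequence[int]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) positions, holding out one cluster at a time."""
    members: Dict[int, List[int]] = {}
    for position, label in enumerate(labels):
        members.setdefault(label, []).append(position)
    for label in sorted(members):
        held_out = set(members[label])
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, members[label]


# Labels copied from the clustering example in the Labels section
labels = [0, 1, 2, 1, 1, 0, 0, 0, 0, 0]
split_sizes = [(len(train), len(test)) for train, test in cluster_splits(labels)]
print(split_sizes)
```

Because the test set is a whole cluster, its size varies from split to split, which is why no guarantee on split sizes can be given.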

## Overlap

`ChemicalSpace` implements methods for calculating the overlap with another space.

### Overlap

The molecules of a `ChemicalSpace` that are similar to another space can be flagged.
The similarity between two molecules is calculated by the Tanimoto similarity of their
ECFP4/Morgan2 fingerprints.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi.gz")

# Indices of `space1` that are similar to `space2`
overlap = space1.find_overlap(space2, radius=0.6)

print(overlap)
```

```text
[0 1 2 3 4]
```

### Carving

The overlap between two `ChemicalSpace` objects can be carved out from one of the objects,
so as to ensure that the two spaces are disjoint for a given similarity radius.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi.gz")

# Carve out the overlap from `space1`
space1_carved = space1.carve(space2, radius=0.6)

print(space1_carved)
```

```text
<ChemicalSpace: 5 molecules | 5 indices | No scores>
```
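Flagging and carving can be pictured as a nearest-neighbor check against the other space. The sketch below works on toy fingerprints stored as sets of on-bits and assumes that `radius` is a maximum Jaccard distance, with the boundary included. The README does not say how `radius` is interpreted, so treat this reading, and the free functions `find_overlap` and `carve` defined here, as illustrative only.

```python
from typing import List, Sequence, Set

Fingerprint = Set[int]


def jaccard_distance(a: Fingerprint, b: Fingerprint) -> float:
    """One minus the Tanimoto similarity of two sets of on-bits."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def find_overlap(
    query: Sequence[Fingerprint], reference: Sequence[Fingerprint], radius: float
) -> List[int]:
    """Positions in `query` with at least one `reference` entry within `radius`."""
    return [
        i
        for i, fp in enumerate(query)
        if any(jaccard_distance(fp, ref) <= radius for ref in reference)
    ]


def carve(
    query: Sequence[Fingerprint], reference: Sequence[Fingerprint], radius: float
) -> List[Fingerprint]:
    """Entries of `query` that have no `reference` entry within `radius`."""
    flagged = set(find_overlap(query, reference, radius))
    return [fp for i, fp in enumerate(query) if i not in flagged]


query = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
reference = [{1, 2, 3, 4}]
overlap = find_overlap(query, reference, radius=0.6)
carved = carve(query, reference, radius=0.6)
print(overlap, carved)
```

The first two toy fingerprints share three of four bits with the reference (distance 0.25) and are flagged, while the third shares none and survives carving.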

## Dimensionality Reduction

`ChemicalSpace` implements methods for dimensionality reduction by
`pca`, `tsne` or `umap` projection of its features.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
proj = space.project(method='pca')

print(proj.shape)
```

```text
(10, 2)
```

## Picking

A subset of a `ChemicalSpace` can be picked by a number of acquisition strategies.  
See [`chemicalspace.layers.acquisition`](chemicalspace/layers/acquisition.py) for details.

```python
from chemicalspace import ChemicalSpace
import numpy as np

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

space_pick_random = space.pick(n=3, strategy='random')
print(space_pick_random)

space_pick_diverse = space.pick(n=3, strategy='maxmin')
print(space_pick_diverse)

space.scores = np.array(range(len(space)))  # Assign dummy scores
space_pick_greedy = space.pick(n=3, strategy='greedy')
print(space_pick_greedy)
```

```text
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 3 molecules | 3 indices | 3 scores>
```
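MaxMin picking is commonly implemented as a greedy farthest-point selection: start from one item, then repeatedly add the item whose distance to its nearest already-picked item is largest. Assuming the `maxmin` strategy follows that general idea (the README does not spell out the details), a minimal sketch looks like this. The `maxmin_pick` function, the seed choice `first` and the one-dimensional toy data are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def maxmin_pick(
    items: Sequence[T], n: int, distance: Callable[[T, T], float], first: int = 0
) -> List[int]:
    """Greedy MaxMin: repeatedly add the item farthest from the current picks."""
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    picked = [first]
    # Distance from every item to its nearest picked item
    nearest = [distance(item, items[first]) for item in items]
    while len(picked) < n:
        candidate = max(
            (i for i in range(len(items)) if i not in picked),
            key=lambda i: nearest[i],
        )
        picked.append(candidate)
        # Update the nearest-pick distances with the newly added item
        nearest = [
            min(d, distance(item, items[candidate]))
            for d, item in zip(nearest, items)
        ]
    return picked


# One-dimensional toy "features"; absolute difference as the distance
points = [0.0, 0.1, 0.2, 5.0, 10.0]
picked = maxmin_pick(points, n=3, distance=lambda a, b: abs(a - b))
print(picked)
```

On the toy data the picks spread across the range instead of clustering near the seed, which is the behaviour a diversity-driven strategy is after.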

## Uniqueness and Diversity

### Uniqueness

The uniqueness of a `ChemicalSpace` object is calculated as the fraction of unique molecules in the space.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space_twice = space + space  # 20 molecules
uniqueness = space_twice.uniqueness()

print(uniqueness)
```

```text
0.5
```

### Diversity

The diversity of a `ChemicalSpace` object can be calculated as:

- The average of the pairwise distance matrix
- The normalized [Vendi score](https://arxiv.org/abs/2210.02410) of the same distance matrix.

The Vendi score can be interpreted as the effective number of molecules in the space.
Here it is normalized by the number of molecules in the space, so that it takes values in the range `[0, 1]`.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

diversity_int = space.diversity(method='internal-distance')
diversity_vendi = space.diversity(method='vendi')
print(diversity_int)
print(diversity_vendi)

# Dummy space with the same molecule len(space) times
space_redundant = ChemicalSpace(mols=tuple([space.mols[0]] * len(space)))

diversity_int_redundant = space_redundant.diversity(method='internal-distance')
diversity_vendi_redundant = space_redundant.diversity(method='vendi')

print(diversity_int_redundant)
print(diversity_vendi_redundant)
```

```text
0.7730273985449335
0.12200482273434754
0.0
0.1
```
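The two redundant-space values can be checked by hand. The Vendi score is the exponential of the Shannon entropy of the eigenvalues of `K / n`, where `K` is the similarity matrix and `n` the number of molecules. The sketch below assumes that the similarity is one minus the distance and that the internal distance averages over distinct pairs only. Both are assumptions, and the eigenvalues are supplied by hand for two closed-form cases rather than computed from a matrix.

```python
import math
from itertools import combinations
from typing import Sequence


def internal_distance(dist: Sequence[Sequence[float]]) -> float:
    """Mean distance over all distinct pairs (assumes at least two items)."""
    pairs = list(combinations(range(len(dist)), 2))
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def normalized_vendi(eigenvalues: Sequence[float]) -> float:
    """exp(Shannon entropy of the eigenvalues of K / n), divided by n."""
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 0)
    return math.exp(entropy) / len(eigenvalues)


n = 10
# n identical molecules: all distances are 0, the similarity matrix K is all ones,
# and K / n has one eigenvalue equal to 1 and n - 1 eigenvalues equal to 0
redundant_dist = [[0.0] * n for _ in range(n)]
redundant = (
    internal_distance(redundant_dist),
    normalized_vendi([1.0] + [0.0] * (n - 1)),
)

# n mutually dissimilar molecules: K is the identity and every eigenvalue is 1 / n
distinct = normalized_vendi([1.0 / n] * n)
print(redundant, distinct)
```

Under these assumptions the normalized score of a fully redundant space is `1 / n`, which is consistent with the `0.1` printed for ten copies of the same molecule, and it approaches `1` when all molecules are mutually dissimilar.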

# Advanced

## Layers

`ChemicalSpace` is implemented as a series of *layers* that provide the functionality of the class. As can be seen in the [source code](chemicalspace/space.py), the class simply combines the layers.

If only a subset of the functionality of `ChemicalSpace` is necessary, and lean objects are a priority, one can combine only the required layers:

```python
from chemicalspace.layers.clustering import ChemicalSpaceClusteringLayer
from chemicalspace.layers.neighbors import ChemicalSpaceNeighborsLayer


class MyCustomSpace(ChemicalSpaceClusteringLayer, ChemicalSpaceNeighborsLayer):
    pass


space = MyCustomSpace(mols=["c1ccccc1"])
print(space)
```
```text
<MyCustomSpace: 1 molecules | No indices | No scores>
```

# Development

## Installation

Install the development dependencies with `pip`:

```bash
pip install -e .[dev]
```

## Hooks

The project uses `pre-commit` for code formatting, linting and testing.
Install the hooks with:

```bash
pre-commit install
```

## Documentation

The documentation can be built by running:
```bash
cd docs
./rebuild.sh
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "chemicalspace",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Giulio Mattedi <giulio.mattedi@gmail.com>",
    "keywords": "chemistry, cheminformatics, rdkit, machine learning, ml",
    "author": null,
    "author_email": "Giulio Mattedi <giulio.mattedi@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/de/89/67e0fadfd289772c886bfeb5a4d5f903e4c1dc7e5e2d59b34d0c98736ee5/chemicalspace-0.1.1.tar.gz",
    "platform": null,
    "description": "![](.badges/coverage.svg)&nbsp;&nbsp;\n![](.badges/tests.svg)\n\n# ChemicalSpace\n\nAn Object-Oriented Representation for Chemical Spaces\n\n`chemicalspace` is a Python package that provides an object-oriented\nrepresentation for chemical spaces. It is designed to be used in conjunction\nwith the `RDKit` package, which provides the underlying cheminformatics functionality.\n\nWhile in the awesome `RDKit`, the main frame of reference is that of single molecules, here the main focus is on operations on chemical spaces.\n\n## Installation\n\nTo install `chemicalspace` you can use `pip`:\n\n```bash\npip install chemicalspace\n```\n\n# Usage\n\nThe main class in `chemicalspace` is `ChemicalSpace`.\nThe class provides a number of methods for working with chemical spaces,\nincluding reading and writing, filtering, clustering and\npicking from chemical spaces.\n\n## Basics\n\n### Initialization\n\nA `ChemicalSpace` can be initialized from SMILES strings or `RDKit` molecules.\nIt optionally takes molecule indices and scores as arguments.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nsmiles = ('CCO', 'CCN', 'CCl')\nindices = (\"mol1\", \"mol2\", \"mol3\")\nscores = (0.1, 0.2, 0.3)\n\nspace = ChemicalSpace(mols=smiles, indices=indices, scores=scores)\n\nprint(space)\n```\n\n```text\n<ChemicalSpace: 3 molecules | 3 indices | 3 scores>\n```\n\n### Reading and Writing\n\nA `ChemicalSpace` can be read from and written to SMI and SDF files.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\n# Load from SMI file\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace.to_smi(\"outputs1.smi\")\n\n# Load from SDF file\nspace = ChemicalSpace.from_sdf(\"tests/data/inputs1.sdf\")\nspace.to_sdf(\"outputs1.sdf\")\n\nprint(space)\n```\n\n```text\n<ChemicalSpace: 10 molecules | 10 indices | No scores>\n```\n\n### Indexing, Slicing and Masking\n\nIndexing, slicing and masking a `ChemicalSpace` object returns a new `ChemicalSpace` object.\n\n#### 
Indexing\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nprint(space[0])\n```\n\n```text\n<ChemicalSpace: 1 molecules | 1 indices | No scores>\n```\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nidx = [1, 2, 4]\n\nprint(space[idx])\n```\n\n```text\n<ChemicalSpace: 3 molecules | 3 indices | No scores>\n```\n\n#### Slicing\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nprint(space[:2])\n```\n\n```text\n<ChemicalSpace: 2 molecules | 2 indices | No scores>\n```\n\n#### Masking\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nmask = [True, False, True, False, True, False, True, False, True, False]\n\nprint(space[mask])\n```\n\n```text\n<ChemicalSpace: 5 molecules | 5 indices | No scores>\n```\n\n### Deduplicating\n\nDeduplicating a `ChemicalSpace` object removes duplicate molecules.  \nSee [Hashing and Identity](#hashing-and-identity) for details on molecule identity.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace_twice = space + space  # 20 molecules\nspace_deduplicated = space_twice.deduplicate()  # 10 molecules\n\nprint(space_deduplicated)\n```\n\n```text\n<ChemicalSpace: 10 molecules | 10 indices | No scores>\n```\n\n### Chunking\n\nA `ChemicalSpace` object can be chunked into smaller `ChemicalSpace` objects.   
\nThe `.chunks` method returns a generator of `ChemicalSpace` objects.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nchunks = space.chunks(chunk_size=3)\n\nfor chunk in chunks:\n    print(chunk)\n```\n\n```text\n<ChemicalSpace: 3 molecules | 3 indices | No scores>\n<ChemicalSpace: 3 molecules | 3 indices | No scores>\n<ChemicalSpace: 3 molecules | 3 indices | No scores>\n<ChemicalSpace: 1 molecules | 1 indices | No scores>\n```\n\n### Drawing\n\nA `ChemicalSpace` object can be rendered as a grid of molecules.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace.draw()\n```\n![draw](.media/sample.png)\n\n### Featurizing\n\n#### Features\n\nA `ChemicalSpace` object can be featurized as a `numpy` array of features.\nBy default, ECFP4/Morgan2 fingerprints are used.\nThe features are cached for subsequent calls,\nand spaces generated by a `ChemicalSpace` object (e.g. 
by slicing, masking, chunking)\ninherit the respective features.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace_slice = space[:6:2]\n\n# Custom ECFP4 features\nprint(space.features.shape)\nprint(space_slice.features.shape)\n```\n\n```text\n(10, 1024)\n(3, 1024)\n```\n\n#### Custom featurizer\n\nThis should take in a `rdkit.Chem.Mol` molecule, and the numerical\nreturn value should be castable to NumPy array (see `chemicalspace.utils.MolFeaturizerType`).\n\n```python\nfrom chemicalspace import ChemicalSpace\nfrom chemicalspace.utils import maccs_featurizer\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\", featurizer=maccs_featurizer)\nspace_slice = space[:6:2]\n\n# Custom ECFP4 features\nprint(space.features.shape)\nprint(space_slice.features.shape)\n```\n\n```text\n(10, 167)\n(3, 167)\n```\n\n#### Metrics\n\nA distance metric on the feature space is necessary for clustering, calculating diversity, and\nidentifying neighbors. 
By default, the `jaccard` (a.k.a Tanimoto) distance is used.\n`ChemicalSpace` takes a `metric` string argument that allows to specify a `sklearn` metric.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\", metric='euclidean')\n```\n\n### Binary Operations\n\n#### Single entries\n\nSingle entries as SMILES strings or `RDKit` molecules\ncan be added to a `ChemicalSpace` object.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace.add(\"CCO\", \"mol11\")\n\nprint(space)\n```\n\n```text\n<ChemicalSpace: 11 molecules | 11 indices | No scores>\n```\n\n#### Chemical spaces\n\nTwo `ChemicalSpace` objects can be added together.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi\")\n\nspace = space1 + space2\n\nprint(space)\n```\n\n```text\n<ChemicalSpace: 25 molecules | 25 indices | No scores>\n```\n\nAnd subtracted from each other to return only molecules in `space1`\nthat are not in `space2`.   
\nSee [Hashing and Identity](#hashing-and-identity) for more details.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi\")\n\nspace = space1 - space2\n\nprint(space)\n```\n\n```text\n<ChemicalSpace: 5 molecules | 5 indices | No scores>\n```\n\n### Hashing and Identity\n\nIndividual molecules in a chemical space are hashed by their InChI Keys only (by default), or by InChI Keys and index.\nScores **do not** affect the hashing process.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nsmiles = ('CCO', 'CCN', 'CCl')\nindices = (\"mol1\", \"mol2\", \"mol3\")\n\n# Two spaces with the same molecules, and indices\n# But one space includes the indices in the hashing process\nspace_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=True)\nspace_no_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=False)\n\nprint(space_indices == space_indices)\nprint(space_indices == space_no_indices)\nprint(space_no_indices == space_no_indices)\n```\n\n```text\nTrue\nFalse\nTrue\n```\n\n`ChemicalSpace` objects are hashed by their molecular hashes, in an **order-independent** manner.\n\n```python\nfrom rdkit import Chem\nfrom rdkit.Chem import inchi\nfrom chemicalspace import ChemicalSpace\n\nmol = Chem.MolFromSmiles(\"c1ccccc1\")\ninchi_key = inchi.MolToInchiKey(mol)\n\nspace = ChemicalSpace(mols=(mol,))\n\nassert hash(space) == hash(frozenset((inchi_key,)))\n```\n\nThe identity of a `ChemicalSpace` is evaluated on its hashed representation.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace1_again = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi.gz\")\n\nprint(space1 == space1)\nprint(space1 == space1_again)\nprint(space1 == space2)\n```\n\n```text\nTrue\nTrue\nFalse\n```\n\n### 
Copy\n\n`ChemicalSpace` supports copy and deepcopy operations.\nDeepcopy allows to fully unlink the copied object from the original one, including the `RDKit` molecules.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\n# Shallow copy\nspace_copy = space.copy()\nassert id(space.mols[0]) == id(space_copy.mols[0])\n\n# Deep copy\nspace_deepcopy = space.copy(deep=True)\nassert id(space.mols[0]) != id(space_deepcopy.mols[0])\n```\n\n## Clustering\n\n### Labels\n\nA `ChemicalSpace` can be clustered using by its molecular features.\n`kmedoids`, `agglomerative-clustering`, `sphere-exclusion` and `scaffold`\nare the available clustering methods.\nRefer to the respective methods in [`chemicalspace.layers.clustering`](chemicalspace/layers/clustering.py)\nfor more details.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\ncluster_labels = space.cluster(n_clusters=3)\n\nprint(cluster_labels)\n```\n\n```text\n[0 1 2 1 1 0 0 0 0 0]\n```\n\n### Clusters\n\n`ChemicalSpace.yield_clusters` can be used to iterate clusters as `ChemicalSpace` objects.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nclusters = space.yield_clusters(n_clusters=3)\n\nfor cluster in clusters:\n    print(cluster)\n```\n\n```text\n<ChemicalSpace: 6 molecules | 6 indices | No scores>\n<ChemicalSpace: 3 molecules | 3 indices | No scores>\n<ChemicalSpace: 1 molecules | 1 indices | No scores>\n```\n\n### KFold Clustering\n\n`ChemicalSpace.splits` can be used to iterate train/test cluster splits for ML training.\nAt each iteration, one cluster is used as the test set and the rest as the training set.\nNote that there is no guarantee on the size of the clusters.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nfor train, test in 
space.split(n_splits=3):\n    print(train, test)\n```\n\n```text\n<ChemicalSpace: 4 molecules | 4 indices | No scores> <ChemicalSpace: 6 molecules | 6 indices | No scores>\n<ChemicalSpace: 7 molecules | 7 indices | No scores> <ChemicalSpace: 3 molecules | 3 indices | No scores>\n<ChemicalSpace: 9 molecules | 9 indices | No scores> <ChemicalSpace: 1 molecules | 1 indices | No scores>\n```\n\n## Overlap\n\n`ChemicalSpace` implements methods for calculating the overlap with another space.\n\n### Overlap\n\nThe molecules of a `ChemicalSpace` that are similar to another space can be flagged.\nThe similarity between two molecules is calculated by the Tanimoto similarity of their\nECFP4/Morgan2 fingerprints.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi.gz\")\n\n# Indices of `space1` that are similar to `space2`\noverlap = space1.find_overlap(space2, radius=0.6)\n\nprint(overlap)\n```\n\n```text\n[0 1 2 3 4]\n```\n\n### Carving\n\nThe overlap between two `ChemicalSpace` objects can be carved out from one of the objects,\nso to ensure that the two spaces are disjoint for a given similarity radius.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi.gz\")\n\n# Carve out the overlap from `space1`\nspace1_carved = space1.carve(space2, radius=0.6)\n\nprint(space1_carved)\n```\n\n```text\n<ChemicalSpace: 5 molecules | 5 indices | No scores>\n```\n\n## Dimensionality Reduction\n\n`ChemicalSpace` implements methods for dimensionality reduction by\n`pca`, `tsne` or `umap` projection of its features.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nproj = space.project(method='pca')\n\nprint(proj.shape)\n```\n\n```text\n(10, 2)\n```\n\n## Picking\n\nA subset of a 
`ChemicalSpace` can be picked using one of several acquisition strategies, such as `random`, `maxmin` and `greedy`.  \nSee [`chemicalspace.layers.acquisition`](chemicalspace/layers/acquisition.py) for details.\n\n```python\nfrom chemicalspace import ChemicalSpace\nimport numpy as np\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nspace_pick_random = space.pick(n=3, strategy='random')\nprint(space_pick_random)\n\nspace_pick_diverse = space.pick(n=3, strategy='maxmin')\nprint(space_pick_diverse)\n\nspace.scores = np.array(range(len(space)))  # Assign dummy scores\nspace_pick_greedy = space.pick(n=3, strategy='greedy')\nprint(space_pick_greedy)\n```\n\n```text\n<ChemicalSpace: 3 molecules | 3 indices | No scores>\n<ChemicalSpace: 3 molecules | 3 indices | 3 scores>\n```\n\n## Uniqueness and Diversity\n\n### Uniqueness\n\nThe uniqueness of a `ChemicalSpace` object is calculated as the fraction of unique molecules in the space.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace_twice = space + space  # 20 molecules\nuniqueness = space_twice.uniqueness()\n\nprint(uniqueness)\n```\n\n```text\n0.5\n```\n\n### Diversity\n\nThe diversity of a `ChemicalSpace` object can be calculated as:\n\n- The average of the pairwise distance matrix\n- The normalized [Vendi score](https://arxiv.org/abs/2210.02410) of the same distance matrix.\n\nThe Vendi score can be interpreted as the effective number of molecules in the space.\nHere it is normalized by the number of molecules in the space, so it takes values in the range `[0, 1]`.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\ndiversity_int = space.diversity(method='internal-distance')\ndiversity_vendi = space.diversity(method='vendi')\nprint(diversity_int)\nprint(diversity_vendi)\n\n# Dummy space with the same molecule len(space) times\nspace_redundant = ChemicalSpace(mols=tuple([space.mols[0]] * len(space)))\n\ndiversity_int_redundant = space_redundant.diversity(method='internal-distance')\ndiversity_vendi_redundant = space_redundant.diversity(method='vendi')\n\nprint(diversity_int_redundant)\nprint(diversity_vendi_redundant)\n```\n\n```text\n0.7730273985449335\n0.12200482273434754\n0.0\n0.1\n```\n\n# Advanced\n\n## Layers\n\n`ChemicalSpace` is implemented as a series of *layers* that provide the functionality of the class. As can be seen in the [source code](chemicalspace/space.py), the class simply combines the layers.\n\nIf only a subset of the functionality of `ChemicalSpace` is necessary, and lean objects are a priority, one can combine only the required layers:\n\n```python\nfrom chemicalspace.layers.clustering import ChemicalSpaceClusteringLayer\nfrom chemicalspace.layers.neighbors import ChemicalSpaceNeighborsLayer\n\n\nclass MyCustomSpace(ChemicalSpaceClusteringLayer, ChemicalSpaceNeighborsLayer):\n    pass\n\n\nspace = MyCustomSpace(mols=[\"c1ccccc1\"])\nspace\n```\n\n```text\n<MyCustomSpace: 1 molecules | No indices | No scores>\n```\n\n# Development\n\n## Installation\n\nInstall the development dependencies with `pip`:\n\n```bash\npip install -e .[dev]\n```\n\n## Hooks\n\nThe project uses `pre-commit` for code formatting, linting and testing.\nInstall the hooks with:\n\n```bash\npre-commit install\n```\n\n## Documentation\n\nThe documentation can be built by running:\n\n```bash\ncd docs\n./rebuild.sh\n```\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Giulio Mattedi  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "An Object-oriented Representation for Chemical Spaces",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/gmattedi/chemicalspace",
        "Issues": "https://github.com/gmattedi/chemicalspace/issues"
    },
    "split_keywords": [
        "chemistry",
        " cheminformatics",
        " rdkit",
        " machine learning",
        " ml"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "38fc34c532e3b34fc4e3ca24bb093d1b38858efb1bc031be8453a610ec5221ff",
                "md5": "602b7aa9f22bd384c1fc16ef9303640a",
                "sha256": "5e07e3231dbf6348bdc855f0dd9d5925c87ee51772991f11f4a36fb5f86883f5"
            },
            "downloads": -1,
            "filename": "chemicalspace-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "602b7aa9f22bd384c1fc16ef9303640a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 24476,
            "upload_time": "2024-06-03T15:04:45",
            "upload_time_iso_8601": "2024-06-03T15:04:45.203055Z",
            "url": "https://files.pythonhosted.org/packages/38/fc/34c532e3b34fc4e3ca24bb093d1b38858efb1bc031be8453a610ec5221ff/chemicalspace-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "de8967e0fadfd289772c886bfeb5a4d5f903e4c1dc7e5e2d59b34d0c98736ee5",
                "md5": "4e7fa3f2c7e331a39262286fcea88987",
                "sha256": "68e6c9e2197a62bb52a6e51669072b780a3a20d10dbb3390a09083699d5d1d08"
            },
            "downloads": -1,
            "filename": "chemicalspace-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "4e7fa3f2c7e331a39262286fcea88987",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 29746,
            "upload_time": "2024-06-03T15:04:46",
            "upload_time_iso_8601": "2024-06-03T15:04:46.646460Z",
            "url": "https://files.pythonhosted.org/packages/de/89/67e0fadfd289772c886bfeb5a4d5f903e4c1dc7e5e2d59b34d0c98736ee5/chemicalspace-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-03 15:04:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "gmattedi",
    "github_project": "chemicalspace",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "chemicalspace"
}
        