- Name: diffpass
- Version: 0.1.0
- Home page: https://github.com/Bitbol-Lab/DiffPaSS
- Summary: Differentiable Pairing using Soft Scores
- Upload time: 2024-05-10 11:24:55
- Authors: Umberto Lupo and Damiano Sgarbossa
- Requires Python: >=3.7
- License: Apache Software License 2.0
- Keywords: nbdev, jupyter, notebook, python

# DiffPaSS – Differentiable Pairing using Soft Scores

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Overview

DiffPaSS is a family of high-performance and scalable PyTorch modules
for finding optimal one-to-one pairings between two collections of
biological sequences, and for performing general graph alignment.

### Pairing multiple-sequence alignments (MSAs)

A typical example of the problem DiffPaSS is designed to solve is the
following: given two multiple sequence alignments (MSAs) A and B,
containing interacting biological sequences, find the optimal one-to-one
pairing between the sequences in A and B.

<figure>
<img src="https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg" alt="MSA pairing problem" />
<figcaption>
Pairing problem for two multiple sequence alignments, where pairings are
restricted to be within the same species
</figcaption>
</figure>

To find an optimal pairing, we can maximize the average mutual
information between columns of the two paired MSAs
([`InformationPairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#informationpairing)),
or we can maximize the similarity between distance-based
([`MirrortreePairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#mirrortreepairing))
or orthology
([`BestHitsPairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#besthitspairing))
networks constructed from the two MSAs.
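
To make the first objective concrete, here is a minimal pure-Python sketch of the quantity being maximized: the mutual information between a column of A and a column of B under a given pairing, averaged over all such column pairs. The function names are hypothetical and this is not how DiffPaSS computes its scores (DiffPaSS uses differentiable "soft" versions of them in PyTorch); the sketch only illustrates why a good pairing scores higher than a scrambled one.

```python
from collections import Counter
from math import log


def column_mutual_information(col_a, col_b):
    """Mutual information (in nats) between two equally long columns of symbols."""
    n = len(col_a)
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    counts_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in counts_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (n_ab / n) * log(n_ab * n / (counts_a[a] * counts_b[b]))
    return mi


def average_inter_msa_mi(msa_a, msa_b, pairing):
    """Average MI over all (column of A, column of B) pairs.

    pairing[i] = j pairs row i of msa_a with row j of msa_b.
    """
    rows_b = [msa_b[j] for j in pairing]
    cols_a = list(zip(*msa_a))
    cols_b = list(zip(*rows_b))
    total = sum(column_mutual_information(ca, cb) for ca in cols_a for cb in cols_b)
    return total / (len(cols_a) * len(cols_b))


# Toy data: under the identity pairing, B's column is fully determined by A's columns
msa_a = ["AC", "AC", "GD", "GD"]
msa_b = ["K", "K", "R", "R"]
mi_identity = average_inter_msa_mi(msa_a, msa_b, [0, 1, 2, 3])
mi_shuffled = average_inter_msa_mi(msa_a, msa_b, [0, 2, 1, 3])
print(mi_identity, mi_shuffled)
```

In this toy case the identity pairing attains the larger score, while the shuffled pairing destroys the co-variation between the columns.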

### Graph alignment and pairing unaligned sequence collections

DiffPaSS can be used for general graph alignment problems
([`GraphAlignment`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#graphalignment)),
where the goal is to find the one-to-one pairing between the nodes of
two weighted graphs that maximizes the similarity between the two
graphs. The user can specify the (dis-)similarity measure to be
optimized, as an arbitrary differentiable function of the adjacency
matrices of the two graphs.

Using this capability, DiffPaSS can be used for finding the optimal
one-to-one pairing between two unaligned collections of sequences, if
weighted graphs are built in advance from the two collections (for
example, using the pairwise Levenshtein distance). This is useful when
alignments are not available or reliable.
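
As a rough illustration of building such a graph in advance, the sketch below computes a pairwise Levenshtein distance matrix for a collection of unaligned sequences using only the standard library. The helper names are hypothetical, and how the resulting matrices are passed to `GraphAlignment` is not shown here; see the linked documentation for that.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Symmetric matrix of pairwise edit distances, with zeros on the diagonal."""
    n = len(sequences)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = levenshtein(sequences[i], sequences[j])
    return dist


graph_a = distance_matrix(["AC", "ACG", "TT"])
print(graph_a)
```

One such matrix per collection plays the role of a weighted adjacency matrix. For large collections, a dedicated edit-distance library would be considerably faster than this pure-Python loop.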

### Can I pair two collections with a different number of sequences?

DiffPaSS optimizes and returns permutation matrices. Hence, its inputs
are required to have the same number of sequences. However, DiffPaSS can
be used to pair two collections (e.g. MSAs) containing a different
number of sequences, by padding the smaller collection with dummy
sequences. For multiple sequence alignments, a simple choice is to add
dummy sequences consisting entirely of gap symbols. For general graphs,
dummy nodes, connected to the other nodes with arbitrary edge weights,
can be added to the smaller graph.
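
For the graph case, padding can be as simple as appending rows and columns to the smaller adjacency matrix. The helper below is a hypothetical sketch (not part of DiffPaSS) that works on nested lists; the default edge weight of `0.0` is an arbitrary choice, in line with the remark above that the weights of dummy edges are arbitrary.

```python
def pad_adjacency(adj, target_size, fill_weight=0.0):
    """Return a copy of a square adjacency matrix padded to target_size x target_size.

    Dummy nodes are connected to every node (and to each other) with fill_weight.
    """
    n = len(adj)
    if target_size < n:
        raise ValueError("target_size must be at least the current number of nodes")
    extra = target_size - n
    padded = [list(row) + [fill_weight] * extra for row in adj]
    padded += [[fill_weight] * target_size for _ in range(extra)]
    return padded


small_graph = [[0, 1], [1, 0]]
padded_graph = pad_adjacency(small_graph, 3)
print(padded_graph)
```

After pairing, any node matched to a dummy node can be treated as unpaired.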

### How DiffPaSS works: soft scores, differentiable optimization, bootstrap

Check [our paper](https://openreview.net/forum?id=n5hO5seROB) for
details of the DiffPaSS and DiffPaSS-IPA algorithms. Briefly, the main
ingredients are as follows:

1.  Using “soft” scores that differentiably extend information-theoretic
    scores between two paired multiple sequence alignments (MSAs), or
    scores based on sequence similarity or graph similarity measures.

2.  The (truncated) Sinkhorn operator for smoothly parametrizing “soft
    permutations”, and the matching operator for parametrizing real
    permutations [\[Mena et al.,
    2018\]](https://openreview.net/forum?id=Byt3oJ-0W).

3.  A novel and efficient bootstrap technique, motivated by mathematical
    results and heuristic insights into this smooth optimization
    process. See the animation below for an illustration.

4.  A notion of “robust pairs” that can be used to identify pairs that
    are consistently found throughout a DiffPaSS bootstrap. These pairs
    can be used as ground truths in another DiffPaSS run, giving rise to
    the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).
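
To give a feel for the second ingredient, here is a small pure-Python sketch of a truncated Sinkhorn operator: starting from a square matrix of real parameters, it alternately normalizes rows and columns of the exponentiated matrix for a fixed number of iterations, yielding an approximately doubly stochastic "soft permutation". The function names, the default number of iterations and the temperature parameter `tau` are assumptions made for illustration; DiffPaSS's own implementation is in PyTorch and may differ in details.

```python
from math import exp, log


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + log(sum(exp(v - m) for v in values))


def truncated_sinkhorn(log_alpha, n_iter=20, tau=1.0):
    """Alternately normalize rows and columns of exp(log_alpha / tau), in log space."""
    n = len(log_alpha)
    la = [[x / tau for x in row] for row in log_alpha]
    for _ in range(n_iter):
        # Row normalization: each row of exp(la) sums to 1
        for i in range(n):
            lse = logsumexp(la[i])
            la[i] = [x - lse for x in la[i]]
        # Column normalization: each column of exp(la) sums to 1
        for j in range(n):
            lse = logsumexp([la[i][j] for i in range(n)])
            for i in range(n):
                la[i][j] -= lse
    return [[exp(x) for x in row] for row in la]


params = [[2.0, 0.0, 1.0], [0.5, 3.0, 0.0], [1.0, 0.2, 2.5]]
soft = truncated_sinkhorn(params, n_iter=200)
sharp = truncated_sinkhorn(params, n_iter=200, tau=0.1)
for row in sharp:
    print([round(x, 3) for x in row])
```

Lowering `tau` pushes the output towards a hard permutation matrix, which is the regime in which soft and hard scores agree most closely.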

<figure>
<video src="https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/DiffPaSS_bootstrap.mp4" width="432" height="243" controls>
</video>
<figcaption>
The DiffPaSS bootstrap technique and robust pairs
</figcaption>
</figure>
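
The idea of robust pairs can be sketched in a few lines: collect the permutations found during a bootstrap and keep the pairs that recur. The helper below is hypothetical and uses a simple frequency threshold (`min_fraction`) as a stand-in; the precise criterion used by DiffPaSS is described in the paper.

```python
from collections import Counter


def robust_pairs(permutations, min_fraction=1.0):
    """Pairs (i, perm[i]) that appear in at least min_fraction of the permutations."""
    counts = Counter()
    for perm in permutations:
        counts.update(enumerate(perm))
    threshold = min_fraction * len(permutations)
    return sorted(pair for pair, count in counts.items() if count >= threshold)


# Three toy permutations of three items, e.g. from three bootstrap steps
perms = [[0, 1, 2], [0, 2, 1], [0, 1, 2]]
print(robust_pairs(perms))       # pairs found every time
print(robust_pairs(perms, 0.6))  # pairs found in at least 60% of the steps
```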

## Installation

### From PyPI

DiffPaSS requires Python 3.7 or later. It is available on PyPI and can
be installed with `pip`:

``` sh
python -m pip install diffpass
```

### From source

Clone this repository on your local machine by running

``` sh
git clone git@github.com:Bitbol-Lab/DiffPaSS.git
```

and move inside the root folder. We recommend creating and activating a
dedicated conda or virtualenv Python virtual environment. Then, make an
editable install of the package:

``` sh
python -m pip install -e .
```

## Quickstart

### Input data preprocessing

First, parse your multiple sequence alignments (MSAs) in FASTA format
into a list of tuples `(header, sequence)` using
[`read_msa`](https://Bitbol-Lab.github.io/DiffPaSS/msa_parsing.html#read_msa).

``` python
from diffpass.msa_parsing import read_msa

# Parse the MSAs into lists of (header, sequence) tuples
msa_data_A = read_msa("path/to/msa_A.fasta")
msa_data_B = read_msa("path/to/msa_B.fasta")
```

We assume that the MSAs contain species information in the headers,
which will be used to restrict the pairings to be within the same
species (more generally, “groups”). We need a simple function to extract
the species information from the headers. For instance, if the headers
are in the format `>sequence_id|species_name|...`, we can use:

``` python
def species_name_func(header):
    return header.split("|")[1]
```

This function will be used to group the sequences by species:

``` python
from diffpass.data_utils import create_groupwise_seq_records

msa_data_A_species_by_species = create_groupwise_seq_records(msa_data_A, species_name_func)
msa_data_B_species_by_species = create_groupwise_seq_records(msa_data_B, species_name_func)
```
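
Conceptually, this grouping step turns a flat list of records into a dictionary keyed by species. The following stand-alone sketch mimics that behaviour with the standard library, so that the shape of the data is clear; `group_records_by_species` is a hypothetical helper, not the implementation of `create_groupwise_seq_records`, and the toy headers are written without a leading `>` on the assumption that the parser strips it.

```python
from collections import defaultdict


def species_name_func(header):
    return header.split("|")[1]


def group_records_by_species(records, name_func):
    """Group (header, sequence) tuples into a dict keyed by species name."""
    grouped = defaultdict(list)
    for header, sequence in records:
        grouped[name_func(header)].append((header, sequence))
    return dict(grouped)


records = [
    ("seq1|E_coli|x", "MK-L"),
    ("seq2|B_subtilis|x", "MR-L"),
    ("seq3|E_coli|x", "MKAL"),
]
grouped = group_records_by_species(records, species_name_func)
for species_name, species_records in grouped.items():
    print(species_name, len(species_records))
```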

If one of the MSAs contains sequences from species not present in the
other MSA, we can remove these species from both MSAs:

``` python
from diffpass.data_utils import remove_groups_not_in_both

msa_data_A_species_by_species, msa_data_B_species_by_species = remove_groups_not_in_both(
    msa_data_A_species_by_species, msa_data_B_species_by_species
)
```

If a species has a different number of sequences in the two MSAs, we can
pad the smaller of its two sequence sets with dummy sequences until the
counts match. For example, we can add dummy sequences consisting entirely
of gap symbols:

``` python
from diffpass.data_utils import pad_msas_with_dummy_sequences

msa_data_A_species_by_species_padded, msa_data_B_species_by_species_padded = pad_msas_with_dummy_sequences(
    msa_data_A_species_by_species, msa_data_B_species_by_species
)

species = list(msa_data_A_species_by_species_padded.keys())
species_sizes = list(map(len, msa_data_A_species_by_species_padded.values()))
```
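
Because `species_sizes` is computed from MSA A only, and the two dictionaries are later flattened independently, it can be worth checking that both list the same species in the same order with equal sizes. The check below is a hypothetical helper working on plain dictionaries of lists; it is not part of DiffPaSS.

```python
def check_group_sizes_match(groups_a, groups_b):
    """Return the common group sizes, or raise ValueError if the groupings disagree."""
    if list(groups_a) != list(groups_b):
        raise ValueError("The two collections do not list the same species in the same order")
    mismatched = {
        name: (len(groups_a[name]), len(groups_b[name]))
        for name in groups_a
        if len(groups_a[name]) != len(groups_b[name])
    }
    if mismatched:
        raise ValueError(f"Species with unequal sizes (A, B): {mismatched}")
    return [len(group) for group in groups_a.values()]


# Toy stand-ins for the padded, species-by-species dictionaries
toy_a = {"E_coli": [("a1", "MK"), ("a2", "MR")], "B_subtilis": [("a3", "MK")]}
toy_b = {"E_coli": [("b1", "LL"), ("b2", "--")], "B_subtilis": [("b3", "LV")]}
print(check_group_sizes_match(toy_a, toy_b))
```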

Next, one-hot encode the MSAs using the
[`one_hot_encode_msa`](https://Bitbol-Lab.github.io/DiffPaSS/data_utils.html#one_hot_encode_msa)
function.

``` python
import torch

from diffpass.data_utils import one_hot_encode_msa

device = "cuda" if torch.cuda.is_available() else "cpu"

# Unpack the padded MSAs into a list of records
msa_data_A_for_pairing = [record for records_this_species in msa_data_A_species_by_species_padded.values() for record in records_this_species]
msa_data_B_for_pairing = [record for records_this_species in msa_data_B_species_by_species_padded.values() for record in records_this_species]

# One-hot encode the MSAs and load them to a device
msa_A_oh = one_hot_encode_msa(msa_data_A_for_pairing, device=device)
msa_B_oh = one_hot_encode_msa(msa_data_B_for_pairing, device=device)
```

### Pairing optimization

Finally, we can instantiate an
[`InformationPairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#informationpairing)
object and optimize the mutual information between the paired MSAs using
the DiffPaSS bootstrap algorithm. The results are stored in a
[`DiffPaSSResults`](https://Bitbol-Lab.github.io/DiffPaSS/base.html#diffpassresults)
container. The lists of (hard) losses and permutations found can be
accessed as attributes of the container.

``` python
from diffpass.train import InformationPairing

information_pairing = InformationPairing(group_sizes=species_sizes).to(device)
bootstrap_results = information_pairing.fit_bootstrap(msa_A_oh, msa_B_oh)

print(f"Final hard loss: {bootstrap_results.hard_losses[-1].item()}")
print(f"Final hard permutations (one permutation per species): {bootstrap_results.hard_perms[-1]}")
```

For more details and examples, including the DiffPaSS-IPA variant, see
the tutorials.

## Tutorials

See the
[`mutual_information_msa_pairing.ipynb`](https://github.com/Bitbol-Lab/DiffPaSS/blob/main/nbs/tutorials/mutual_information_msa_pairing.ipynb)
notebook for an example of paired MSA optimization in the case of
well-known prokaryotic datasets, for which ground truth pairings are
given by genome proximity.

## Documentation

The full documentation is available at
[https://Bitbol-Lab.github.io/DiffPaSS/](https://bitbol-lab.github.io/DiffPaSS/).

## Citation

To cite this work, please refer to the following publication:

``` bibtex
@inproceedings{
  lupo2024diffpass,
  title={DiffPa{SS} {\textendash} Differentiable and scalable pairing of biological sequences using soft scores},
  author={Umberto Lupo and Damiano Sgarbossa and Martina Milighetti and Anne-Florence Bitbol},
  booktitle={ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design},
  year={2024},
  url={https://openreview.net/forum?id=n5hO5seROB}
}
```

## nbdev

Project developed using [nbdev](https://nbdev.fast.ai/).

            
