corigami


| Field | Value |
| - | - |
| Name | corigami |
| Version | 0.0.1 |
| Summary | C.Origami: cell type-specific chromatin structure prediction. |
| Upload time | 2022-11-30 21:14:56 |
| Requires Python | >=3.9 |
| Keywords | deep learning, chromatin, prediction, cell type-specific, epigenetics |
# C.Origami

[Models](#download-dataset-files-and-pretrained-model-weights) |
[GitHub](https://github.com/tanjimin/C.Origami) |
[Publications](#list-of-papers)

C.Origami is a deep neural network model that enables *de novo* cell type-specific chromatin architecture predictions. C.Origami originates from the Origami architecture, which incorporates DNA sequence and cell type-specific features for downstream tasks. It can predict the effect of aberrant genome reorganization such as translocations. In addition, it can be used to perform high-throughput *in silico* genetic screening to identify chromatin-related *trans*-acting factors.

## Dependencies and Installation

### Create a new conda environment for C.Origami and install dependencies
```bash
conda create -n corigami python=3.9
conda activate corigami

pip install torch==1.12.0 torchvision==0.13.0 pandas==1.3.0 matplotlib==3.3.2 pybigwig==0.3.18 omegaconf==2.1.1 tqdm==4.64.0
```
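
After installing, it can help to confirm that the environment really contains the pinned versions. The following is a minimal sketch and not part of C.Origami: the helper names `installed_version` and `check_pins` are hypothetical, and the pins are copied from the `pip install` line above.

```python
from importlib import metadata

# Pins copied from the pip install command above.
PINS = {
    "torch": "1.12.0",
    "torchvision": "0.13.0",
    "pandas": "1.3.0",
    "matplotlib": "3.3.2",
    "pybigwig": "0.3.18",
    "omegaconf": "2.1.1",
    "tqdm": "4.64.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Hypothetical helper: map each mismatching package to (expected, found)."""
    problems = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Compare only the public version, ignoring local tags such as "+cu113".
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems
```

Calling `check_pins(PINS)` inside the `corigami` environment returns an empty dictionary when every pin matches, and otherwise lists the expected and installed version for each mismatch.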

## Download dataset files and pretrained model weights

| File name | Content | 
| - | - | 
| [corigami_data.tar.gz](https://zenodo.org/record/7226561/files/corigami_data.tar.gz?download=1) | DNA reference sequence, CTCF ChIP-seq (IMR-90), ATAC-seq (IMR-90), Hi-C matrix (IMR-90), pretrained model weights | 
| [corigami_data_gm12878_add_on.tar.gz](https://zenodo.org/record/7226561/files/corigami_data_gm12878_add_on.tar.gz?download=1) | CTCF ChIP-seq (GM12878), ATAC-seq (GM12878), Hi-C matrix (GM12878) | 

To run inference or training, download [corigami_data.tar.gz](https://zenodo.org/record/7226561/files/corigami_data.tar.gz?download=1), which contains the training data from the IMR-90 cell line and the pretrained model weights. 
To test *de novo* prediction performance on GM12878, additionally download the add-on data file [corigami_data_gm12878_add_on.tar.gz](https://zenodo.org/record/7226561/files/corigami_data_gm12878_add_on.tar.gz?download=1) and extract it under `corigami_data/data/hg38`.
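
The extraction step can also be scripted. The sketch below uses only the Python standard library; `extract_archive` is a hypothetical helper, and it assumes the archives were already downloaded and that the main archive unpacks into a top-level `corigami_data` directory, as the target path above suggests.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_archive(archive, dest):
    """Extract a .tar.gz archive into dest and return its top-level entry names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    top_level = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if parts:  # skip a bare "." entry, which has no parts
            top_level.add(parts[0])
    return sorted(top_level)


# Usage, assuming both files are in the current directory:
# extract_archive("corigami_data.tar.gz", ".")
# extract_archive("corigami_data_gm12878_add_on.tar.gz", "corigami_data/data/hg38")
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.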

### Prediction with DNA sequence, CTCF ChIP-seq, and ATAC-seq data 
To use the pipeline, the sequencing data must be pre-processed. The reference DNA sequence is included in the corigami_data file provided above. The CTCF and ATAC inputs should each be a bigWig (bw) file, normalized to the total number of reads. Data quality can be inspected using an application such as [IGV](https://igv.org).
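
As an illustration of what normalizing to the total number of reads means, the sketch below scales raw per-bin read counts by library size. The per-million scale and the function name `normalize_to_total_reads` are assumptions made for this example; the README does not state the exact convention, so follow whatever your bigWig-generation tool uses.

```python
def normalize_to_total_reads(bin_counts, total_reads, scale=1_000_000):
    """Scale raw per-bin read counts to counts per `scale` sequenced reads.

    Assumption: a reads-per-million style scaling; other conventions exist.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    factor = scale / total_reads
    return [count * factor for count in bin_counts]


# A bin with 50 reads in a 25-million-read library and a bin with 100 reads
# in a 50-million-read library carry the same relative signal.
shallow = normalize_to_total_reads([50], total_reads=25_000_000)
deep = normalize_to_total_reads([100], total_reads=50_000_000)
```

After scaling, tracks from libraries sequenced to different depths become directly comparable, which matters when a model trained on one cell type is applied to another.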


# Inference

C.Origami can perform *de novo* prediction of cell type-specific chromatin architecture using both DNA sequence features and cell type-specific genomic information.

For any inference application, download one of our pre-trained models or use your own model. C.Origami is pre-trained on the human IMR-90 cell line (hg38 assembly). Before running inference, download the dataset and set the file paths according to the instructions above.

Inference allows you to pick between 3 tasks: **predict**, **perturbation**, or **screening**. Examples for each one and the required parameters are under the `examples` folder. 

## Prediction

Prediction produces an image of the 2 Mb window as well as a NumPy matrix for downstream analysis. 

```docs

    Usage:
    python corigami/inference/prediction.py [options] 

    Options:
    -h --help       Show this screen.
    --out           Output path for storing results
    --celltype      Sample cell type for prediction, used for output separation
    --chr           Chromosome for prediction
    --start         Starting point for prediction (width defaults to 2097152 bp which is the input window size)
    --model         Path to the model checkpoint
    --seq           Path to the folder where the sequence .fa.gz files are stored
    --ctcf          Path to the folder where the CTCF ChIP-seq .bw files are stored
    --atac          Path to the folder where the ATAC-seq .bw files are stored
```

`An example of a C.Origami predicted (2MB window) Hi-C matrix for the IMR-90 cell line at chromosome 2 with start position 500,000:`
<p align="center">
  <img  src="https://github.com/tanjimin/C.Origami/blob/dev/src/corigami/examples/imgs/chr2_500000.png">
  </p>
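
The options above can be assembled programmatically, which is convenient when predicting many windows. This is a sketch, not part of the package: `build_prediction_command` is a hypothetical helper, every path is a placeholder, and the values `imr90` and `chr2` are assumed formats for `--celltype` and `--chr`.

```python
import shlex
import sys


def build_prediction_command(out, celltype, chrom, start, model, seq, ctcf, atac):
    """Assemble the argv list for corigami/inference/prediction.py."""
    return [
        sys.executable, "corigami/inference/prediction.py",
        "--out", str(out),
        "--celltype", celltype,
        "--chr", chrom,
        "--start", str(start),
        "--model", str(model),
        "--seq", str(seq),
        "--ctcf", str(ctcf),
        "--atac", str(atac),
    ]


# All paths below are placeholders; point them at your own files.
cmd = build_prediction_command(
    out="outputs",
    celltype="imr90",
    chrom="chr2",
    start=500000,
    model="path/to/model_checkpoint",
    seq="path/to/dna_sequence",
    ctcf="path/to/ctcf",
    atac="path/to/atac",
)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.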

## Editing/Perturbation

Deletion is currently the only implemented perturbation. Specify the same parameters as for prediction, along with the deletion-specific parameters. To make multiple deletions, specify additional start and end positions in the config. 

```docs
  Usage:
    python corigami/inference/editing.py [options] 

    Options:
    -h --help       Show this screen.
    --out           Output path for storing results
    --celltype      Sample cell type for prediction, used for output separation
    --chr           Chromosome for prediction
    --start         Starting point for prediction (width defaults to 2097152 bp which is the input window size)
    --model         Path to the model checkpoint
    --seq           Path to the folder where the sequence .fa.gz files are stored
    --ctcf          Path to the folder where the CTCF ChIP-seq .bw files are stored
    --atac          Path to the folder where the ATAC-seq .bw files are stored

    --del-start     Starting point for deletion
    --del-width     Width for deletion
    --padding       Padding type, either zero or follow. Using zero: the missing region at the end will be padded with zero for CTCF and ATAC-seq, while sequence will be padded with N (unknown nucleotide). Using follow: the end will be padded with features in the following region
    --hide-line     Remove the line showing deletion site
```

`An example of a C.Origami predicted (2MB window) Hi-C matrix for the IMR-90 cell line at chromosome 2 with start position 500,000 and a deletion from 1.5MB to 1.6MB (100,000 basepairs deleted):`
<p align="center">
  <img  src="https://github.com/tanjimin/C.Origami/blob/dev/src/corigami/examples/imgs/chr2_500000_del_1500000_100000_padding_zero.png">
  </p>
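
To make the two `--padding` modes concrete, here is a toy sketch on a short numeric track. It is an illustration of the behavior described in the option text, not the package's implementation: `delete_and_pad` is a hypothetical function, it handles only numeric signal (the DNA sequence would be padded with N instead of zero), and the fallback at a chromosome end is an assumption.

```python
def delete_and_pad(track, start, width, del_start, del_width, padding="zero"):
    """Toy model of a deletion inside a fixed-width window.

    track: per-position signal for a whole (toy) chromosome.
    Returns a list of exactly `width` values.
    """
    end = start + width
    if not (start <= del_start and del_start + del_width <= end):
        raise ValueError("deletion must lie inside the window")
    # Remove the deleted positions from the window.
    kept = track[start:del_start] + track[del_start + del_width:end]
    if padding == "zero":
        # Fill the gap at the end of the window with zeros.
        fill = [0] * del_width
    elif padding == "follow":
        # Fill the gap with the signal that follows the window.
        fill = track[end:end + del_width]
        # Assumption: near a chromosome end, pad whatever is missing with zeros.
        fill = fill + [0] * (del_width - len(fill))
    else:
        raise ValueError("padding must be 'zero' or 'follow'")
    return kept + fill


track = list(range(1, 13))  # toy signal at positions 0..11
zero_padded = delete_and_pad(track, start=2, width=6, del_start=4, del_width=2, padding="zero")
follow_padded = delete_and_pad(track, start=2, width=6, del_start=4, del_width=2, padding="follow")
print(zero_padded)    # [3, 4, 7, 8, 0, 0]
print(follow_padded)  # [3, 4, 7, 8, 9, 10]
```

In both modes the window keeps its original width, so the model input size does not change; only the content of the last `del_width` positions differs.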

## Screening

*In silico* genetic screening can be used to see which perturbed regions have the greatest impact on the prediction. Running this task produces a bedgraph file with four columns: chromosome, start position, end position, and impact score. The greater the impact of a perturbation, the higher its impact score.
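
Because the output is a plain four-column bedgraph, the highest-impact regions can be ranked with a few lines of Python. This is a sketch with a hypothetical helper name (`top_impact_regions`) and made-up scores; it assumes whitespace-separated columns in the order listed above.

```python
import io


def top_impact_regions(bedgraph, n=3):
    """Return the n rows with the highest impact score.

    bedgraph: an open text file (or any iterable of lines) with four
    whitespace-separated columns: chromosome, start, end, score.
    """
    rows = []
    for line in bedgraph:
        line = line.strip()
        # Skip blank lines and optional bedGraph header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, score = line.split()[:4]
        rows.append((chrom, int(start), int(end), float(score)))
    return sorted(rows, key=lambda row: row[3], reverse=True)[:n]


# Made-up scores, for illustration only.
example = io.StringIO(
    "chr2\t1250000\t1251000\t0.12\n"
    "chr2\t1251000\t1252000\t0.87\n"
    "chr2\t1252000\t1253000\t0.05\n"
)
best = top_impact_regions(example, n=1)
print(best)
```

For a real run, pass an open file handle for the saved bedgraph instead of the `io.StringIO` example.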

Screening can only be done for one chromosome at a time. Unless otherwise specified, the end position is 2 Mb from the start position. `perturb-width` sets the size of each deletion, that is, how many base pairs to remove. `step-size` is the distance between the start positions of consecutive deletions; it is fine for the deletions to overlap. 
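
The relationship between the screening range, `perturb-width`, and `step-size` can be sketched as a simple generator. `perturbation_windows` is a hypothetical function written for illustration; in particular, how the real script treats a deletion that would cross the end of the range is an assumption here.

```python
def perturbation_windows(screen_start, screen_end, perturb_width, step_size):
    """Yield (start, end) for each deletion in the screen.

    Assumption: a deletion is included only if it ends at or before
    screen_end; the real script may treat the boundary differently.
    """
    if perturb_width <= 0 or step_size <= 0:
        raise ValueError("perturb_width and step_size must be positive")
    pos = screen_start
    while pos + perturb_width <= screen_end:
        yield pos, pos + perturb_width
        pos += step_size


# Settings from the chr2 screening example in this README:
# 1.25 Mb to 2.25 Mb, perturb-width 1000, step-size 1000.
windows = list(perturbation_windows(1_250_000, 2_250_000, 1000, 1000))
print(len(windows), windows[0], windows[-1])

# A step smaller than the width gives overlapping deletions.
overlapping = list(perturbation_windows(0, 3000, 1000, 500))
print(overlapping)
```

With equal width and step the deletions tile the range without gaps; halving the step doubles the number of perturbations and therefore the compute cost.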

```docs
  Usage:
    python corigami/inference/screening.py [options] 

    Options:
    -h --help       Show this screen.
    --out           Output path for storing results
    --celltype      Sample cell type for prediction, used for output separation
    --chr           Chromosome for prediction
    --start         Starting point for prediction (width defaults to 2097152 bp which is the input window size)
    --model         Path to the model checkpoint
    --seq           Path to the folder where the sequence .fa.gz files are stored
    --ctcf          Path to the folder where the CTCF ChIP-seq .bw files are stored
    --atac          Path to the folder where the ATAC-seq .bw files are stored
    
    --screen-start        Starting point for screening
    --screen-end          Ending point for screening
    --perturb-width       Width of perturbation used for screening
    --step-size           Step size of perturbations in screening
    --plot-impact-score   Plot impact score and save png (not recommended for large-scale screening, >10000 perturbations)
    --save-pred           Save prediction tensor
    --save-perturbation   Save perturbed tensor
    --save-diff           Save difference tensor
    --save-impact-score   Save impact score array
    --save-bedgraph       Save bedgraph file for impact score
    --save-frames         Save each deletion instance with png and npy (could be taxing on computation and screening; not recommended)
    --padding             Padding type, either zero or follow. Using zero: the missing region at the end will be padded with zero for CTCF and ATAC-seq, while sequence will be padded with N (unknown nucleotide). Using follow: the end will be padded with features in the following region
```
**Please note that screening can be very computationally intensive, especially at a resolution of 1 kb or finer. For instance, screening chromosome 8, a medium-size chromosome with a length of 146 Mb, at 1 kb resolution requires the model to make 146 Mb / 1 kb * 2 = 292,000 separate predictions.**
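
The arithmetic in the note above generalizes to any region and step size. The helper `predictions_needed` is hypothetical; the factor of two per step is taken directly from the note.

```python
def predictions_needed(region_bp, step_bp, predictions_per_step=2):
    """Number of model predictions for a screen over region_bp base pairs."""
    return (region_bp // step_bp) * predictions_per_step


chr8_at_1kb = predictions_needed(146_000_000, 1_000)
chr8_at_10kb = predictions_needed(146_000_000, 10_000)
print(chr8_at_1kb)   # 292000
print(chr8_at_10kb)  # 29200
```

Estimating the count this way before launching a screen makes it easier to pick a step size that fits the available compute.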

`An example of a barplot representing the impact score of each perturbation. C.Origami screened chromosome 2 from position 1.25 MB to 2.25 MB with a perturbation of 1000 basepairs (perturb-width) being made every 1000 basepairs (step-size):`
<p align="center">
  <img  src="https://github.com/tanjimin/C.Origami/blob/dev/src/corigami/examples/imgs/chr2_screen_1250000_2250000_width_1000_step_1000.png">
  </p>


# Training

Coming soon!

## Citation

If you use the C.Origami code in your project, please cite the bioRxiv paper:

```BibTeX
@article{tan2022,
    title={Cell type-specific prediction of 3D chromatin architecture},
    author={Jimin Tan and Javier Rodriguez-Hernaez and Theodore Sakellaropoulos and Francesco Boccalatte and Iannis Aifantis and Jane Skok and David Fenyö and Bo Xia and Aristotelis Tsirigos},
    journal = {bioRxiv e-prints},
    archivePrefix = "bioRxiv",
    doi = {10.1101/2022.03.05.483136},
    year={2022}
}
```


## List of Papers

Papers from the C.Origami project:

Cell type-specific prediction of 3D chromatin architecture
Jimin Tan, Javier Rodriguez-Hernaez, Theodore Sakellaropoulos, Francesco Boccalatte, Iannis Aifantis, Jane Skok, David Fenyö, Bo Xia, Aristotelis Tsirigos
bioRxiv 2022.03.05.483136; doi: https://doi.org/10.1101/2022.03.05.483136

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "corigami",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Jimin Tan <tanjimin@nyu.edu>, Nina Shenker-Tauris <shenkn01@nyu.edu>",
    "keywords": "deep learning,chromatin,prediction,cell type-specific,epigenetics",
    "author": "",
    "author_email": "Jimin Tan <tanjimin@nyu.edu>, Nina Shenker-Tauris <shenkn01@nyu.edu>",
    "download_url": "https://files.pythonhosted.org/packages/ea/91/e3baf9c027843785a3f937832828488c1d806f0556a1278f0255c299ffd0/corigami-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "C.Origami: cell type-specific chromatin structure prediction.",
    "version": "0.0.1",
    "split_keywords": [
        "deep learning",
        "chromatin",
        "prediction",
        "cell type-specific",
        "epigenetics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "af97db47edfc818e7c24023f9022b758",
                "sha256": "8ff2d004804b35ddf22e31bd03dd1d2c41c221967824a5cdbe7c8459a2acec9b"
            },
            "downloads": -1,
            "filename": "corigami-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "af97db47edfc818e7c24023f9022b758",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 25909,
            "upload_time": "2022-11-30T21:14:55",
            "upload_time_iso_8601": "2022-11-30T21:14:55.262139Z",
            "url": "https://files.pythonhosted.org/packages/e9/ea/0ac67e262b29c15e90523f2f67d2018a7c9c56c54f2419d7722b2ba9b052/corigami-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "d82e20bfda9547b86a2b5e06d6ccb67a",
                "sha256": "fe685566830a35a7774529f64eaf883f50f4df4927b2328b348a524537750dcb"
            },
            "downloads": -1,
            "filename": "corigami-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "d82e20bfda9547b86a2b5e06d6ccb67a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 23750,
            "upload_time": "2022-11-30T21:14:56",
            "upload_time_iso_8601": "2022-11-30T21:14:56.891967Z",
            "url": "https://files.pythonhosted.org/packages/ea/91/e3baf9c027843785a3f937832828488c1d806f0556a1278f0255c299ffd0/corigami-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-11-30 21:14:56",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "corigami"
}
        