<div align="center">
<h1>transcript_transformer</h1>
Deep learning utility functions for processing and annotating transcript genome data.
</div>
`transcript_transformer` was developed in tandem with TIS Transformer ([paper](https://doi.org/10.1093/nargab/lqad021), [repository](https://github.com/TRISTAN-ORF/TIS_transformer)) and RIBO-former ([paper](https://doi.org/10.1101/2023.06.20.545724), [article repository](https://github.com/TRISTAN-ORF/RiboTIE_article), [tool repository](https://github.com/TRISTAN-ORF/RiboTIE)). It uses the [Performer](https://arxiv.org/abs/2009.14794) architecture to annotate and process transcripts at single-nucleotide resolution, `h5py` for data loading, and `pytorch-lightning` as a high-level interface for training and evaluating deep learning models. `transcript_transformer` is designed for a high degree of modularity, but has not been tested for every combination of arguments and can therefore return errors. For a more targeted and streamlined explanation of how to apply TIS Transformer or RIBO-former, please refer to their repositories.
## 🔗 Installation
`pytorch` needs to be [installed separately by the user](https://pytorch.org/get-started/locally/).
The package can then be installed by running
```bash
pip install transcript-transformer
```
## 📖 User guide <a name="code"></a>
The library provides a command-line tool, `transcript_transformer`, with four main functions: `data`, `pretrain`, `train`, and `predict`.
### Data loading
Information is separated by transcript and by information type. All information belonging to a single transcript shares the same index across the `h5py.dataset` objects used to store the different types of information. [Variable-length arrays](https://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data) are used to store the sequences and annotations of all transcripts under a single dataset.
Sequences are stored as integer arrays following the mapping `{A:0, T:1, C:2, G:3, N:4}`.
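For illustration, encoding a sequence under this mapping could look like the following minimal sketch (the `NUC_MAP` name and helper function are ours, not part of the package API):

```python
import numpy as np

# Nucleotide-to-integer mapping used for stored sequences
NUC_MAP = {"A": 0, "T": 1, "C": 2, "G": 3, "N": 4}

def encode_seq(seq: str) -> np.ndarray:
    """Encode a nucleotide string as an integer array."""
    return np.array([NUC_MAP[nt] for nt in seq.upper()], dtype=np.int64)

print(encode_seq("ATCGN"))  # -> [0 1 2 3 4]
```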
An example `data.h5` has the following structure:
```
data.h5 (h5py.file)
    transcript (h5py.group)
    ├── tis (h5py.dataset, dtype=vlen(int))
    ├── contig (h5py.dataset, dtype=str)
    ├── id (h5py.dataset, dtype=str)
    ├── seq (h5py.dataset, dtype=vlen(int))
    ├── ribo (h5py.group)
    │   ├── SRR0000001 (h5py.group)
    │   │   ├── 5 (h5py.group)
    │   │   │   ├── data (h5py.dataset, dtype=vlen(int))
    │   │   │   ├── indices (h5py.dataset, dtype=vlen(int))
    │   │   │   ├── indptr (h5py.dataset, dtype=vlen(int))
    │   │   │   ├── shape (h5py.dataset, dtype=vlen(int))
    │   ├── ...
    │   ....
```
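Given this layout, per-transcript information can be read back by indexing each dataset at the same position. A minimal sketch, assuming the dataset names from the tree above (the binary interpretation of `tis` is our assumption):

```python
import h5py

with h5py.File("data.h5", "r") as f:
    tr = f["transcript"]
    idx = 0  # all datasets share the same per-transcript index
    print(tr["id"][idx])         # transcript identifier
    print(tr["seq"][idx][:10])   # first 10 encoded nucleotides
    print(tr["tis"][idx].sum())  # assuming binary per-position TIS labels
```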
Ribosome profiling data is saved as the reads mapped to each transcript position, further separated by read length. As ribosome profiling data is often sparse, `scipy.sparse` is used to save data within the `h5` format: matrix objects are stored as separate arrays, which saves space. Saving and loading of the data is handled by the [``h5max``](https://github.com/jdcla/h5max) package.
<div align="center">
<img src="https://github.com/jdcla/h5max/raw/main/h5max.png" width="600">
</div>
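The `data`/`indices`/`indptr`/`shape` datasets in the tree above are the standard components of a `scipy.sparse.csr_matrix`. Conceptually, storing and loading one matrix works along these lines (a simplified sketch of the pattern `h5max` automates, not its actual code):

```python
import h5py
import numpy as np
from scipy import sparse

def save_csr(group: h5py.Group, m: sparse.csr_matrix) -> None:
    """Store a CSR matrix as its four component arrays."""
    group.create_dataset("data", data=m.data)
    group.create_dataset("indices", data=m.indices)
    group.create_dataset("indptr", data=m.indptr)
    group.create_dataset("shape", data=np.array(m.shape))

def load_csr(group: h5py.Group) -> sparse.csr_matrix:
    """Rebuild a CSR matrix from its component arrays."""
    return sparse.csr_matrix(
        (group["data"][:], group["indices"][:], group["indptr"][:]),
        shape=tuple(group["shape"][:]),
    )
```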
### data
`transcript_transformer data` processes the transcriptome of a given assembly to make it readily available for data loading. [Dictionary `.yml`/`.json`](https://github.com/TRISTAN-ORF/transcript_transformer/blob/main/template.yml) files specify which data is passed to the models. After processing, these configuration files can still be altered to define which data is used for a specific run; as such, all available data for a given assembly can be stored in a single database. New ribosome profiling experiments can be added to an existing database by running `transcript_transformer data` again after updating the config file.
Data is parsed by running:
```bash
transcript_transformer data template.yml
```
where `template.yml` is:
```yaml
gtf_path : path/to/gtf_file.gtf
fa_path : path/to/fa_file.fa
########################################################
## add entries when using ribosome profiling data.
## format: 'id : ribosome profiling paths'
## leave empty for sequence input models (TIS transformer)
## DO NOT change id after data is parsed to h5 file
########################################################
ribo_paths :
    SRR000001 : ribo/SRR000001.sam
    SRR000002 : ribo/SRR000002.sam
    SRR000003 : ribo/SRR000003.sam
########################################################
## Data is parsed and stored in a hdf5 format file.
########################################################
h5_path : my_experiment.h5
```
Several other options exist that specify how ribosome profiling data is loaded. Refer to [`template.yml`](https://github.com/TRISTAN-ORF/transcript_transformer/blob/main/template.yml), available in the root directory of this repository, for more information on each option.
### pretrain
In line with transformers trained on natural language processing objectives, models can first be trained following a self-supervised learning objective. Using a masked language modelling approach, models are tasked with predicting the classes of the masked input tokens; in this way, a model learns the 'semantics' of transcript sequences. The approach is similar to the one described by [Zaheer et al.](https://arxiv.org/abs/2007.14062).
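Masked language modelling boils down to hiding a fraction of the input tokens and training the model to recover their original classes. An illustrative sketch, not the package's internal implementation (the mask token id and mask rate below are assumptions):

```python
import torch

def mask_tokens(seq: torch.Tensor, mask_id: int = 5, mask_rate: float = 0.15):
    """Randomly mask tokens; targets are kept only at masked positions."""
    mask = torch.rand(seq.shape) < mask_rate
    inputs = seq.clone()
    inputs[mask] = mask_id           # replace with a dedicated mask token
    targets = seq.clone()
    targets[~mask] = -100            # ignored by cross-entropy loss
    return inputs, targets

seq = torch.randint(0, 5, (32,))     # toy encoded transcript
inputs, targets = mask_tokens(seq)
```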
Example
```bash
transcript_transformer pretrain input_data.yml --val 1 13 --test 2 14 --max_epochs 70 --accelerator gpu --devices 1
```
### train
The package supports training the model architectures listed under `transcript_transformer/models.py`. The function expects the configuration file containing the input data info (see [data loading](https://github.com/TRISTAN-ORF/transcript_transformer#data)). Use the `--transfer_checkpoint` flag to continue training from a pre-trained model.
Example
```bash
transcript_transformer train input_data.yml --val 1 13 --test 2 14 --max_epochs 70 --transfer_checkpoint lightning_logs/mlm_model/version_0/ --name experiment_1 --accelerator gpu --devices 1
```
### predict
The predict function returns probabilities for all nucleotide positions on a transcript; outputs can be saved in `.npy` or `.h5` format. In addition to reading from `.h5` files, the function accepts a single RNA sequence or a path to a `.fa` file as input. Note that the `.fa` and `.npy` formats are only supported for models that use transcript nucleotide information alone.
Example
```bash
transcript_transformer predict AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGT RNA --output_type npy models/example_model.ckpt
transcript_transformer predict data/example_data.fa fa --output_type npy models/example_model.ckpt
```
### Output data
The model returns predictions for every nucleotide on the transcripts. For each transcript, the array lists the transcript label followed by the model outputs. Predictions can be written in either the `npy` or `h5` format.
```python
>>> results = np.load('results.npy', allow_pickle=True)
>>> results[0]
array(['>ENST00000410304',
       array([2.3891837e-09, 7.0824785e-07, 8.3791534e-09, 4.3269135e-09,
              4.9220684e-08, 1.5315813e-10, 7.0196869e-08, 2.4103475e-10,
              4.5873511e-10, 1.4299616e-10, 6.1071654e-09, 1.9664975e-08,
              2.9255699e-07, 4.7719610e-08, 7.7600065e-10, 9.2305236e-10,
              3.3297397e-07, 3.5771163e-07, 4.1942007e-05, 4.5123262e-08,
              1.0270607e-11, 1.1841109e-09, 7.9038587e-10, 6.5511790e-10,
              6.0892291e-13, 1.6157842e-11, 6.9130129e-10, 4.5778301e-11,
              2.1682500e-03, 2.3315516e-09, 2.2578116e-11], dtype=float32)],
      dtype=object)
```
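From there, the per-position probabilities can be post-processed as needed, for example to locate the highest-scoring position per transcript. A minimal sketch over the `npy` output format shown above:

```python
import numpy as np

# Each entry holds a transcript label and its per-position probabilities
results = np.load("results.npy", allow_pickle=True)
for label, probs in results:
    top = int(np.argmax(probs))
    print(f"{label}: top position {top} (p={probs[top]:.3g})")
```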
### Other function flags
Various other flags dictate the properties of the dataloader, model architecture, and training procedure. List them by running:
```bash
transcript_transformer data -h
transcript_transformer pretrain -h
transcript_transformer train -h
transcript_transformer predict -h
```
## ✔️ Package features
- [x] creation of `h5` file from genome assemblies and ribosome profiling datasets
- [x] bucket sampling
- [x] pre-training functionality
- [x] data loading for sequence and ribosome data
- [x] custom target labels
- [ ] function hooks for custom data loading and pre-processing
- [x] model architectures
- [x] application of trained networks
- [ ] post-processing
- [ ] test scripts
## 🖊️ Citation <a name="citation"></a>
```bibtex
@article {10.1093/nargab/lqad021,
author = {Clauwaert, Jim and McVey, Zahra and Gupta, Ramneek and Menschaert, Gerben},
title = "{TIS Transformer: remapping the human proteome using deep learning}",
journal = {NAR Genomics and Bioinformatics},
volume = {5},
number = {1},
year = {2023},
month = {03},
issn = {2631-9268},
doi = {10.1093/nargab/lqad021},
url = {https://doi.org/10.1093/nargab/lqad021},
note = {lqad021},
eprint = {https://academic.oup.com/nargab/article-pdf/5/1/lqad021/49418780/lqad021_supplemental_file.pdf},
}
```