- Name: seq2squiggle
- Version: 0.2.0
- Home page: https://github.com/ZKI-PH-ImageAnalysis/seq2squiggle
- Summary: End-to-end simulation of nanopore sequencing signals with feed-forward transformers
- Upload time: 2024-08-16 12:52:37
- Author: Denis Beslic
- Requires Python: `<4.0,>=3.10`
- License: MIT

# seq2squiggle

`seq2squiggle` is a deep learning-based tool for generating artificial nanopore signals from DNA sequence data.

<img src="/img/seq2squiggle_architecture.png" width="750">


Please cite the following publication if you use `seq2squiggle` in your work:
- Beslic, D., Kucklick, M., Engelmann, S., Fuchs, S., Renard, B.Y., Körber, N. End-to-end simulation of nanopore sequencing signals with feed-forward transformers. bioRxiv (2024).

## Installation 

### Dependencies

`seq2squiggle` requires Python >= 3.10. 

We recommend running `seq2squiggle` in a separate conda / mamba environment. This keeps the tool and its dependencies isolated from your other Python environments.

```
conda create -n seq2squiggle-env python=3.10
conda activate seq2squiggle-env 
```

### Install with pip
```
pip install seq2squiggle 
```

### Install from source
```
git clone https://github.com/ZKI-PH-ImageAnalysis/seq2squiggle.git
cd seq2squiggle
pip install . 
```
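
After either installation route, a quick check from Python can confirm that the interpreter meets the minimum version and that the package is visible in the active environment. The sketch below uses only the standard library; the helper name `check_environment` is made up for this example and is not part of `seq2squiggle`. It checks only the lower bound (>= 3.10) of the version requirement.

```python
import sys
from importlib import metadata


def check_environment(package="seq2squiggle", min_python=(3, 10)):
    """Return (python_ok, installed_version_or_None) for the active environment."""
    # sys.version_info compares like a tuple, e.g. (3, 11, 4, ...) >= (3, 10)
    python_ok = sys.version_info >= min_python
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment
        version = None
    return python_ok, version


python_ok, version = check_environment()
print(f"Python >= 3.10: {python_ok}")
print(f"seq2squiggle version: {version or 'not installed'}")
```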

### Download training data and model weights

`seq2squiggle` needs compatible pretrained model weights to make predictions. You can specify them with the `--model` command-line parameter.

If you do not provide a model file, `seq2squiggle` attempts to download a compatible one automatically.

## Predict signals from a FASTA file
`seq2squiggle` simulates artificial signals from an input FASTX (FASTA or FASTQ) file. By default, the output is written in SLOW5/BLOW5 format. Export to the newer POD5 format is also supported, though BLOW5 is preferred for its stability. You will need to specify the path to the model through the configuration file.

Running `seq2squiggle` on a GPU is recommended because it speeds up inference. The tool also works on CPU-only systems, though inference is slower.

### Examples 

Generate 10,000 reads from a FASTA file:
```
seq2squiggle predict example.fasta -o example.blow5 -n 10000
```
Generate reads with a coverage of 30:
```
seq2squiggle predict example.fasta -o example.blow5 -c 30
```
Generate reads with a coverage of 30 and an average read length of 5,000:
```
seq2squiggle predict example.fasta -o example.blow5 -c 30 -r 5000
```
Simulate signals from basecalled reads (every input read is simulated individually):
```
seq2squiggle predict example.fastq -o example.blow5 --read-input
```
Export as POD5:
```
seq2squiggle predict example.fastq -o example.pod5 --read-input
```
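
The `-n`, `-c` and `-r` options are linked by the usual definition of coverage: coverage ≈ number of reads × mean read length / reference length. How `seq2squiggle` resolves these options internally is not described here, so the following is only a back-of-the-envelope sketch for estimating how many reads a given coverage implies. The function names and the 4.6 Mb reference length are illustrative assumptions.

```python
import math


def fasta_total_length(lines):
    """Sum the sequence length over all records in an iterable of FASTA lines."""
    total = 0
    for line in lines:
        if not line.startswith(">"):  # skip header lines
            total += len(line.strip())
    return total


def reads_for_coverage(reference_length, coverage, mean_read_length):
    """Approximate read count so that reads * mean_read_length ~= coverage * reference_length."""
    return math.ceil(coverage * reference_length / mean_read_length)


# Hypothetical 4.6 Mb reference at 30x coverage with 5,000-base reads
print(reads_for_coverage(4_600_000, coverage=30, mean_read_length=5_000))
```

To measure a real reference, pass an open file handle, for example `with open("example.fasta") as handle: fasta_total_length(handle)`.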



## Different noise options
`seq2squiggle` supports several options for generating the signal data.
By default, both the noise sampler and the duration sampler are used.

### Examples

Generate reads using both the noise sampler and duration sampler: 
```
seq2squiggle predict example.fasta -o example.blow5
```
Generate reads using the duration sampler and the noise sampler with an increased noise factor:
```
seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.5
```
Generate reads using the duration sampler and a static normal distribution for the noise:
```
seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.5 --noise-sampling False
```
Generate reads using the noise sampler and a static normal distribution for the event length (duration sampler disabled):
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length -1
```
Generate reads using only the noise sampler and ideal event lengths:
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0
```
Generate reads using a static normal distribution for the amplitude noise and ideal event lengths:
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0 --noise-sampling False --noise-std 1.0
```
Generate reads using no amplitude noise and ideal event lengths:
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0 --noise-sampling False --noise-std -1
```
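
To make the terms above more concrete, here is a toy illustration of what "ideal event lengths" and "static normal noise" mean for a signal. It is not `seq2squiggle`'s implementation: it assumes that an ideal event length is a fixed number of samples per event (rounded to an integer here), that static noise is zero-mean Gaussian noise with a constant standard deviation, and that a non-positive standard deviation disables the noise. The current levels are made-up numbers, not model predictions.

```python
import random


def expand_events(levels, event_length):
    """Repeat each per-event current level for a fixed number of samples."""
    return [level for level in levels for _ in range(event_length)]


def add_static_noise(signal, noise_std, rng):
    """Add zero-mean Gaussian noise with a constant std; std <= 0 means no noise."""
    if noise_std <= 0:
        return list(signal)
    return [value + rng.gauss(0.0, noise_std) for value in signal]


rng = random.Random(42)      # fixed seed so the toy example is reproducible
levels = [80.0, 95.5, 70.2]  # hypothetical current levels, one per event

ideal = expand_events(levels, event_length=10)            # 10 samples per event
noisy = add_static_noise(ideal, noise_std=1.0, rng=rng)   # constant-std noise
clean = add_static_noise(ideal, noise_std=-1, rng=rng)    # noise switched off
print(len(ideal), clean == ideal)
```

The noise sampler and duration sampler replace these fixed choices with values predicted by the model, which is why they are the defaults.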

## Train a new model
`seq2squiggle` uses the `align` output of uncalled4 (`events.tsv`) as training data.

Run the following command to generate the data with [uncalled4](https://github.com/skovaka/uncalled4):
```
uncalled4 align REF_FASTA SLOW5 --bam-in INPUT_BAM --eventalign-out OUTPUT_TSV --eventalign-flags print-read-names,signal-index,samples --pore-model dna_r10.4.1_400bps_9mer --flowcell FLO-MIN114 --kit SQK-LSK114
```

Additionally, we use a small script to standardize the `event_noise` column:
```
./src/seq2squiggle/standardize-events.py INPUT_TSV OUTPUT_TSV
```
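
What the script does exactly is not documented here. As a rough idea of what standardizing a column usually means, the sketch below applies a z-score transform (subtract the mean, divide by the standard deviation) to a list of values. This is an assumption about the general technique, not a description of `standardize-events.py`, and the sample values are invented.

```python
import statistics


def standardize(values):
    """Z-score a list of numbers: zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


event_noise = [1.2, 0.8, 1.5, 1.1]  # made-up values for illustration
print(standardize(event_noise))
```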

To preprocess and train a model from scratch:
```
seq2squiggle preprocess events.tsv train_dir --max-chunks -1 --config my_config.yml
seq2squiggle preprocess events_valid.tsv valid_dir --max-chunks -1 --config my_config.yml
seq2squiggle train train_dir valid_dir --config my_config.yml --model last.ckpt
```
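
When the three steps are run repeatedly with different inputs, it can help to assemble them in one place. The sketch below builds exactly the commands shown above as argument lists and prints them as a dry run; it uses only the subcommands and flags that appear in this README. The helper name `training_commands` is hypothetical.

```python
import shlex


def training_commands(train_tsv, valid_tsv, train_dir, valid_dir, config, checkpoint):
    """Return the preprocess/preprocess/train commands as argument lists."""
    return [
        ["seq2squiggle", "preprocess", train_tsv, train_dir,
         "--max-chunks", "-1", "--config", config],
        ["seq2squiggle", "preprocess", valid_tsv, valid_dir,
         "--max-chunks", "-1", "--config", config],
        ["seq2squiggle", "train", train_dir, valid_dir,
         "--config", config, "--model", checkpoint],
    ]


commands = training_commands(
    "events.tsv", "events_valid.tsv", "train_dir", "valid_dir",
    "my_config.yml", "last.ckpt",
)
for cmd in commands:
    # Dry run: print the command. To execute it instead, import subprocess
    # and call subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```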

## Acknowledgement
The model is based on [xcmyz's implementation of FastSpeech](https://github.com/xcmyz/FastSpeech). Some code snippets for preprocessing DNA-signal chunks have been taken from [bonito](https://github.com/nanoporetech/bonito). 

            
