ngs-analysis

Name	ngs-analysis
Version	0.0.6
Summary	Analyze deep sequencing of complex libraries
upload_time	2024-02-09 15:33:49
author	David Feldman
requires_python	>=3.9
keywords	ngs library variant barcode

# ngs-analysis

Convenient analysis of sequencing reads that span multiple DNA or protein parts. For instance, given a library of protein variants linked to DNA barcodes, this tool can answer questions like:

- How accurate are the variant sequences, at the DNA or protein level?
- How frequently is the same barcode linked to two different variants?
- Which reads contain parts required for function (e.g., a Kozak start sequence or a fused protein tag)?
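
As a rough illustration of the second question, the sketch below counts barcodes that were observed with more than one variant. It assumes reads have already been reduced to (variant, barcode) pairs; the function name and the toy sequences are made up for this example and are not part of `ngs-analysis`.

```python
from collections import defaultdict


def barcodes_with_multiple_variants(pairs):
    """Return {barcode: sorted variants} for barcodes seen with more than one variant."""
    variants_by_barcode = defaultdict(set)
    for variant, barcode in pairs:
        variants_by_barcode[barcode].add(variant)
    return {
        barcode: sorted(variants)
        for barcode, variants in variants_by_barcode.items()
        if len(variants) > 1
    }


# Toy (variant, barcode) observations, invented for illustration.
observed = [
    ("MKTAYIAK", "ACGTACGT"),
    ("MKTAYIAK", "ACGTACGT"),
    ("MKTAYLAK", "ACGTACGT"),  # same barcode, different variant
    ("MSTNPKPQ", "TTGACCAA"),
]

collisions = barcodes_with_multiple_variants(observed)
print(collisions)  # {'ACGTACGT': ['MKTAYIAK', 'MKTAYLAK']}
```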

This kind of analysis often involves parsing raw sequencing reads for DNA and/or protein sub-sequences (parts), then mapping the parts to a reference of anticipated part combinations. Here the workflow is: 

1. Define how to parse reads into parts using plain text expressions (no code)
2. Parse your anticipated DNA sequences to generate a reference
3. Parse a batch of sequencing samples
4. Map the parts found in each read to the reference
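
To make these four steps concrete, here is a minimal sketch in plain Python. It is not how `ngs-analysis` is configured or implemented: a regular expression stands in for the tool's plain text expressions, and the flanking sequences (`GGATCC`, `GAATTC`), the 8-nt barcode length, and all reads are invented for illustration.

```python
import re
from collections import Counter

# Step 1 (assumed layout): adapter, variant, linker, then an 8-nt barcode.
READ_PATTERN = re.compile(
    r"GGATCC(?P<variant>[ACGT]+?)GAATTC(?P<barcode>[ACGT]{8})"
)


def parse_read(read):
    """Return the parts found in a read as a dict, or None if the layout is absent."""
    match = READ_PATTERN.search(read)
    return match.groupdict() if match else None


# Step 2: parse the anticipated DNA sequences to build a reference.
reference_dna = [
    "TTGGATCCATGAAAACCGAATTCACGTACGTAA",
    "TTGGATCCATGAGCACCGAATTCTTGACCAAAA",
]
reference = {
    (parts["variant"], parts["barcode"])
    for parts in map(parse_read, reference_dna)
}

# Steps 3 and 4: parse sequencing reads, then look each one up in the reference.
reads = [
    "GGATCCATGAAAACCGAATTCACGTACGT",  # first reference sequence
    "GGATCCATGAGCACCGAATTCTTGACCAA",  # second reference sequence
    "GGATCCATGAAAACGGAATTCACGTACGT",  # one substitution in the variant
    "CCCCCCCCCCCC",                   # layout not present
]
counts = Counter()
for read in reads:
    parts = parse_read(read)
    if parts is None:
        counts["unparsed"] += 1
    elif (parts["variant"], parts["barcode"]) in reference:
        counts["exact match"] += 1
    else:
        counts["parsed, no exact match"] += 1

print(dict(counts))
# {'exact match': 2, 'parsed, no exact match': 1, 'unparsed': 1}
```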

It has been tested with Illumina paired-end reads and Oxford Nanopore long reads. Under the hood, it uses [NGmerge](https://github.com/jsh58/NGmerge) to merge paired reads and [MMseqs2](https://github.com/soedinglab/MMseqs2) for sequence mapping. It is moderately performant: 1 million paired-end reads can be mapped to a reference of 100,000 variant-barcode pairs in ~1 minute.

# Workflow

A cartoon example with two reference sequences, each consisting of a variant linked to a barcode:

<img src="examples/sequences.png" alt="sequences" width="500"/>

Here's the analysis workflow and outputs:

<img src="examples/workflow.png" alt="analysis workflow" width="500"/>

Note that in the last two columns, the parsed variant is mapped to the reference variant defined by the barcode present in the same read, rather than to all possible reference variants. See the [example notebook for paired-end reads](examples/paired_reads/paired_read_example.ipynb) for details.
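
The difference between comparing against the barcode-defined variant and comparing against every reference variant can be sketched as follows. This is only an illustration: a simple Hamming-style mismatch count stands in for the MMseqs2-based mapping, and the reference table, sequences, and function names are hypothetical.

```python
# Hypothetical reference: barcode -> variant it was designed to be linked to.
reference = {
    "ACGTACGT": "ATGAAAACC",
    "TTGACCAA": "ATGAGCACC",
}


def mismatches(a, b):
    """Count differing positions; any length difference also counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def compare_to_barcode_reference(variant, barcode):
    """Compare a parsed variant to the variant expected for the barcode in the same read."""
    expected = reference.get(barcode)
    if expected is None:
        return None  # barcode not in the reference
    return {"expected": expected, "mismatches": mismatches(variant, expected)}


def closest_reference_variant(variant):
    """Best match over all reference variants, ignoring the barcode."""
    return min(reference.values(), key=lambda ref: mismatches(variant, ref))


# A read whose variant matches the second design but carries the first barcode.
print(compare_to_barcode_reference("ATGAGCACC", "ACGTACGT"))
# {'expected': 'ATGAAAACC', 'mismatches': 2}
print(closest_reference_variant("ATGAGCACC"))
# ATGAGCACC
```

In this toy case the variant is a perfect match to some reference variant, yet it disagrees with the variant its barcode points to, which is the kind of discrepancy a barcode-directed comparison is meant to expose.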

### **TL;DR**

Run `ngs-analysis --help` to see available commands.

1. Make an empty directory, add `config.yaml` and `samples.csv` based on the example.
2. Add `reference_dna.csv` with anticipated DNA sequences (including adapters).
3. Run `ngs-analysis setup`. Add `--clean` to start the analysis from scratch.
4. Check that `designs.csv` is accurate; if not, fix `config.yaml`.
5. Add your sequencing data:
    - If you have paired-end data, put it in `0_paired_reads/` and run `ngs-analysis merge_read_pairs <sample>`.
    - If you have single-end data (e.g., nanopore), put it in `1_reads/`.
6. Run `ngs-analysis parse_reads <sample>`. Check that `2_parsed/<sample>.parsed.pq` looks correct (with pandas, load it with `pd.read_parquet`).
7. Run `ngs-analysis map_parsed_reads <sample>`. Results are in `3_mapped/<sample>.mapped.csv`.
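
If you prefer to drive the per-sample steps from a script, something like the sketch below could work. It only strings together the commands listed above and assumes `ngs-analysis setup` has already been run in the current directory; the helper names and the `dry_run` switch are inventions of this example, not part of the package.

```python
import subprocess


def pipeline_commands(sample, paired=True):
    """Build the per-sample commands from the steps above, in order."""
    commands = []
    if paired:
        # Only paired-end data needs merging first.
        commands.append(["ngs-analysis", "merge_read_pairs", sample])
    commands.append(["ngs-analysis", "parse_reads", sample])
    commands.append(["ngs-analysis", "map_parsed_reads", sample])
    return commands


def run_pipeline(sample, paired=True, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in pipeline_commands(sample, paired):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first step that exits with an error.
            subprocess.run(command, check=True)


run_pipeline("sample_A", paired=True)  # dry run: only prints the commands
```

Passing `dry_run=False` would actually invoke the commands, so the `ngs-analysis` executable needs to be on your `PATH` in that case.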

# Simulation mode

Debugging complex read structures and experimental layouts can be tricky. For example, your `config.yaml` might parse reference sequences incorrectly, or samples might map to the reference in an unexpected way (e.g., if the same barcode is attached to different variants). 

Before running an analysis (or designing an experiment), you can simulate the results by defining `sample_plan.csv` and running `simulate_single_reads` or `simulate_paired_reads`, which have options to add simple random mutations and variable coverage per subpool.

Here's `sample_plan.csv` from the [paired read example](examples/paired_reads/sample_plan.csv). Note that "source" refers to the optional "source" column in `reference_dna.csv`.

| sample | source | coverage |
| --- | --- | --- |
|  sample\_A | pool1 | 50 |
|  sample\_B | pool2 | 50 |
|  sample\_C | pool1 | 50 |
|  sample\_C | pool2 | 20 |
|  sample\_D | pool3 | 50 |
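
A plan like this can also be generated programmatically. The sketch below writes the same rows with Python's `csv` module and summarizes which sources each sample draws from; it assumes the file is plain comma-separated text with the three columns shown, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the table above.
plan_rows = [
    {"sample": "sample_A", "source": "pool1", "coverage": 50},
    {"sample": "sample_B", "source": "pool2", "coverage": 50},
    {"sample": "sample_C", "source": "pool1", "coverage": 50},
    {"sample": "sample_C", "source": "pool2", "coverage": 20},
    {"sample": "sample_D", "source": "pool3", "coverage": 50},
]


def write_sample_plan(rows, handle):
    """Write the plan as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["sample", "source", "coverage"])
    writer.writeheader()
    writer.writerows(rows)


def sources_per_sample(rows):
    """Group the plan as {sample: {source: coverage}}."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["sample"]][row["source"]] = int(row["coverage"])
    return dict(grouped)


buffer = io.StringIO()
write_sample_plan(plan_rows, buffer)
print(buffer.getvalue())
print(sources_per_sample(plan_rows)["sample_C"])  # {'pool1': 50, 'pool2': 20}
```

To write the file to disk instead of a buffer, pass a handle opened with `open("sample_plan.csv", "w", newline="")`.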

To analyze the simulated data, just add the `--simulate` flag to `merge_read_pairs`, `parse_reads`, `map_parsed_reads`, and `plot`. Results will be saved to `{step}/simulate/{sample}` rather than `{step}/{sample}`.
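
When collecting results in a script, the two layouts can be handled with a small helper such as the one below. It follows only the `{step}/{sample}` and `{step}/simulate/{sample}` patterns stated above; the function name is hypothetical, and file extensions are deliberately left out because they are not specified here for simulated outputs.

```python
from pathlib import PurePosixPath


def result_prefix(step, sample, simulate=False):
    """Return `{step}/{sample}`, or `{step}/simulate/{sample}` for simulated data."""
    base = PurePosixPath(step)
    if simulate:
        base = base / "simulate"
    return base / sample


print(result_prefix("3_mapped", "sample_A"))                 # 3_mapped/sample_A
print(result_prefix("3_mapped", "sample_A", simulate=True))  # 3_mapped/simulate/sample_A
```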


# Install

```bash
pip install ngs-analysis
```

Make sure that the `mmseqs` and `NGmerge` executables are available (NGmerge is only needed for paired reads). 
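
One quick way to check this from Python is `shutil.which`, which looks an executable up on your `PATH`. The helper below is a small sketch written for this README, not a command provided by `ngs-analysis`; the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil


def missing_executables(paired=True, which=shutil.which):
    """Return the names of required executables that cannot be found on PATH."""
    required = ["mmseqs"]
    if paired:
        required.append("NGmerge")  # only needed for paired reads
    return [name for name in required if which(name) is None]


missing = missing_executables(paired=True)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables found.")
```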

On Linux and Intel-based macOS, you can use `conda install -c bioconda -c conda-forge mmseqs2 ngmerge`. On Apple Silicon, `mmseqs` can be installed via Homebrew with `brew install mmseqs2`, and NGmerge can be installed [from source](https://github.com/jsh58/NGmerge?tab=readme-ov-file#compile) or via `brew install brewsci/bio/ngmerge`.

Tested on Linux and macOS (Apple Silicon).

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "ngs-analysis",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "NGS,library,variant,barcode",
    "author": "David Feldman",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/f6/55/d36ec35e3c068a77f581ca110abdbe47b8a235e9b51fd3f1f25c955026b9/ngs-analysis-0.0.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Analyze deep sequencing of complex libraries",
    "version": "0.0.6",
    "project_urls": {
        "Bug Tracker": "https://github.com/feldman4/ngs-analysis/issues",
        "Homepage": "https://github.com/feldman4/ngs-analysis"
    },
    "split_keywords": [
        "ngs",
        "library",
        "variant",
        "barcode"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f655d36ec35e3c068a77f581ca110abdbe47b8a235e9b51fd3f1f25c955026b9",
                "md5": "2baf0eb0496d1eca2708ce23bfece50e",
                "sha256": "cce2845975c7ddf4225623216cef45837a3b43897e620963657a94b12dc541b9"
            },
            "downloads": -1,
            "filename": "ngs-analysis-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "2baf0eb0496d1eca2708ce23bfece50e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 24186,
            "upload_time": "2024-02-09T15:33:49",
            "upload_time_iso_8601": "2024-02-09T15:33:49.784870Z",
            "url": "https://files.pythonhosted.org/packages/f6/55/d36ec35e3c068a77f581ca110abdbe47b8a235e9b51fd3f1f25c955026b9/ngs-analysis-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-09 15:33:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "feldman4",
    "github_project": "ngs-analysis",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "ngs-analysis"
}
