sma-finder


Namesma-finder JSON
Version 1.4.4 PyPI version JSON
download
home_pagehttps://github.com/broadinstitute/sma_finder
SummaryA tool for diagnosing spinal muscular atrophy (SMA) using exome or genome sequencing data
upload_time2024-01-31 16:54:43
maintainer
docs_urlNone
author
requires_python>=3.7
licenseMIT
keywords spinal muscular atrophy sma smn smn1 smn2
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            ## SMA Finder  

SMA Finder is a tool for diagnosing spinal muscular atrophy (SMA) based on Illumina exome, genome, or targeted sequencing data.  
It takes a reference sequence (FASTA) and 1 or more alignment files (CRAM or BAM) as input, evaluates reads at the 
c.840 position of *SMN1* and *SMN2* to detect the most common molecular causes of SMA, and then reports whether it found a complete loss of SMN1. 

It has been tested and confirmed to be highly accurate on short read data aligned to GRCh37, GRCh38, or T2T using the BWA aligner.

*Limitations:*  
- does not report SMA carrier status or *SMN1/SMN2* copy numbers  
- does not detect the ~5% of cases caused by *SMN1* loss-of-function mutations that do not involve the c.840 position  
- requires at least 14 reads to overlap the c.840 position in *SMN1* plus *SMN2* in order to make a call  
- was developed and tested on Illumina short read sequencing data generated from whole blood DNA and aligned using the BWA aligner. Performance on data from other sequencing technologies, sample types, and alignment pipelines is unknown. 


### Install

```
python3 -m pip install sma-finder
```

### Example

Example command:
```
sma_finder --verbose --hg38-reference-fasta /ref/hg38.fa  sample1.cram
```
Command output:
```
Input args:
    --hg38-reference-fasta: /ref/hg38.fa
    --output-tsv: sample1.sma_finder_results.tsv
    CRAMS or BAMS: sample1.cram
---
Output row #1:
        filename_prefix                     sample1
        file_type                           cram
        genome_version                      hg38
        sample_id                           s1
        sma_status                          has SMA
        confidence_score                    168
        c840_reads_with_smn1_base_C         0
        c840_total_reads                    174
Wrote 1 rows to sample1.sma_finder_results.tsv        
```

### Usage

Usage help text:

```
sma_finder --help

usage: sma_finder.py [-h] [--hg37-reference-fasta HG37_REFERENCE_FASTA]
                     [--hg38-reference-fasta HG38_REFERENCE_FASTA]
                     [--t2t-reference-fasta T2T_REFERENCE_FASTA]
                     [-o OUTPUT_TSV] [-v]
                     cram_or_bam_path [cram_or_bam_path ...]

positional arguments:
  cram_or_bam_path      One or more CRAM or BAM file paths

optional arguments:
  -h, --help            show this help message and exit
  --hg37-reference-fasta HG37_REFERENCE_FASTA
                        HG37 reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to HG37.
  --hg38-reference-fasta HG38_REFERENCE_FASTA
                        HG38 reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to HG38.
  --t2t-reference-fasta T2T_REFERENCE_FASTA
                        T2T reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to the
                        CHM13 telomere-to-telomere benchmark.
  -o OUTPUT_TSV, --output-tsv OUTPUT_TSV
                        Optional output tsv file path
  -v, --verbose         Whether to print extra details during the run
```


### Output

The output .tsv contains one row per input CRAM or BAM file and has the following columns:

<table>
    <tr>
        <td><b>filename_prefix</b></td>
        <td>CRAM or BAM filename prefix. If the input file is <i>/path/sample1.cram</i> this would be <i>"sample1"</i>.</td>
    </tr>
    <tr>
        <td><b>file_type</b></td>
        <td><i>"cram"</i> or <i>"bam"</i></td>
    </tr>
        <tr>
        <td><b>genome_version</b></td>
        <td><i>"hg37"</i>, <i>"hg38"</i>, or <i>"t2t"</i></td>
    </tr>
    <tr>
        <td><b>sample_id</b></td>
        <td>sample id from the CRAM or BAM file header (parsed from the read group metadata)</td>
    </tr>
    <tr>
        <td><b>sma_status</b></td>
        <td>possible values are:<br> 
            <i>"has SMA"</i><br>
            <i>"does not have SMA"</i><br>
            <i>"not enough coverage at SMN c.840 position"</i><br>
        </td>
    <tr>
        <td><b>confidence_score</b></td>
        <td>PHRED-scaled integer score measuring the level of confidence that the sma_status is correct. The bigger the score, the higher the confidence. It is calculated in a similar way to the PL field in GATK HaplotypeCaller genotypes.</td>
    <tr>
        <td><b>c840_reads_with_smn1_base_C</b></td>
        <td>number of reads that have a 'C' nucleotide at the c.840 position in SMN1 plus SMN2</td> 
    <tr>
        <td><b>c840_total_reads</b></td>
        <td>total number of reads overlapping the c.840 position in SMN1 plus SMN2</td>  
    </tr>
</table>
<br />  

---

  
### Combining results from multiple samples

After running SMA Finder on many samples, it often useful to combine the per-sample output tables into
a single table. One way to do this is with the following shell command:

```
combined_table_filename=combined_results.tsv
head -n 1 $(ls *.tsv | head -n 1) > ${combined_table_filename}   # get table header from the 1st table 
for i in *.tsv; do
    tail -n +2 $i >> ${combined_table_filename}    # concatenate all tables
done
```

### Plotting combined results

A scatter plot summarizing read counts from many samples can be generated using the `plot_SMN1_SMN2_scatter` command:

```
python3 plot_SMN1_SMN2_scatter.py --format svg --format png ${combined_table_filename}
```

It generates plots like this one which is based on a neuromuscular cohort with 16,626 exomes:

<img width="799" alt="image" src="https://github.com/broadinstitute/sma-finder/assets/6240170/d097a231-9b66-445b-b53c-84abdb9887d0">

---
### Details

This poster on SMA Finder was presented at the [SVAR22](https://www.grahamerwin.org/svar-conference) conference:

<img src="https://github.com/broadinstitute/sma_finder/raw/main/docs/SMA_poster_SVAR22.png" />


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/broadinstitute/sma_finder",
    "name": "sma-finder",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "Spinal Muscular Atrophy,SMA,SMN,SMN1,SMN2",
    "author": "",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/63/59/4b9cb498a264ca79ffe6ac60a3ef21fc83976e6782c5ae184e7341c8ebb7/sma_finder-1.4.4.tar.gz",
    "platform": null,
    "description": "## SMA Finder  \n\nSMA Finder is a tool for diagnosing spinal muscular atrophy (SMA) based on Illumina exome, genome, or targeted sequencing data.  \nIt takes a reference sequence (FASTA) and 1 or more alignment files (CRAM or BAM) as input, evaluates reads at the \nc.840 position of *SMN1* and *SMN2* to detect the most common molecular causes of SMA, and then reports whether it found a complete loss of SMN1. \n\nIt has been tested and confirmed to be highly accurate on short read data aligned to GRCh37, GRCh38, or T2T using the BWA aligner.\n\n*Limitations:*  \n- does not report SMA carrier status or *SMN1/SMN2* copy numbers  \n- does not detect the ~5% of cases caused by *SMN1* loss-of-function mutations that do not involve the c.840 position  \n- requires at least 14 reads to overlap the c.840 position in *SMN1* plus *SMN2* in order to make a call  \n- was developed and tested on Illumina short read sequencing data generated from whole blood DNA and aligned using the BWA aligner. Performance on data from other sequencing technologies, sample types, and alignment pipelines is unknown. \n\n\n### Install\n\n```\npython3 -m pip install sma-finder\n```\n\n### Example\n\nExample command:\n```\nsma_finder --verbose --hg38-reference-fasta /ref/hg38.fa  sample1.cram\n```\nCommand output:\n```\nInput args:\n    --hg38-reference-fasta: /ref/hg38.fa\n    --output-tsv: sample1.sma_finder_results.tsv\n    CRAMS or BAMS: sample1.cram\n---\nOutput row #1:\n        filename_prefix                     sample1\n        file_type                           cram\n        genome_version                      hg38\n        sample_id                           s1\n        sma_status                          has SMA\n        confidence_score                    168\n        c840_reads_with_smn1_base_C         0\n        c840_total_reads                    174\nWrote 1 rows to sample1.sma_finder_results.tsv        \n```\n\n### Usage\n\nUsage help text:\n\n```\nsma_finder --help\n\nusage: sma_finder.py [-h] [--hg37-reference-fasta HG37_REFERENCE_FASTA]\n                     [--hg38-reference-fasta HG38_REFERENCE_FASTA]\n                     [--t2t-reference-fasta T2T_REFERENCE_FASTA]\n                     [-o OUTPUT_TSV] [-v]\n                     cram_or_bam_path [cram_or_bam_path ...]\n\npositional arguments:\n  cram_or_bam_path      One or more CRAM or BAM file paths\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --hg37-reference-fasta HG37_REFERENCE_FASTA\n                        HG37 reference genome FASTA path. This should be\n                        specified if the input bam or cram is aligned to HG37.\n  --hg38-reference-fasta HG38_REFERENCE_FASTA\n                        HG38 reference genome FASTA path. This should be\n                        specified if the input bam or cram is aligned to HG38.\n  --t2t-reference-fasta T2T_REFERENCE_FASTA\n                        T2T reference genome FASTA path. This should be\n                        specified if the input bam or cram is aligned to the\n                        CHM13 telomere-to-telomere benchmark.\n  -o OUTPUT_TSV, --output-tsv OUTPUT_TSV\n                        Optional output tsv file path\n  -v, --verbose         Whether to print extra details during the run\n```\n\n\n### Output\n\nThe output .tsv contains one row per input CRAM or BAM file and has the following columns:\n\n<table>\n    <tr>\n        <td><b>filename_prefix</b></td>\n        <td>CRAM or BAM filename prefix. If the input file is <i>/path/sample1.cram</i> this would be <i>\"sample1\"</i>.</td>\n    </tr>\n    <tr>\n        <td><b>file_type</b></td>\n        <td><i>\"cram\"</i> or <i>\"bam\"</i></td>\n    </tr>\n        <tr>\n        <td><b>genome_version</b></td>\n        <td><i>\"hg37\"</i>, <i>\"hg38\"</i>, or <i>\"t2t\"</i></td>\n    </tr>\n    <tr>\n        <td><b>sample_id</b></td>\n        <td>sample id from the CRAM or BAM file header (parsed from the read group metadata)</td>\n    </tr>\n    <tr>\n        <td><b>sma_status</b></td>\n        <td>possible values are:<br> \n            <i>\"has SMA\"</i><br>\n            <i>\"does not have SMA\"</i><br>\n            <i>\"not enough coverage at SMN c.840 position\"</i><br>\n        </td>\n    <tr>\n        <td><b>confidence_score</b></td>\n        <td>PHRED-scaled integer score measuring the level of confidence that the sma_status is correct. The bigger the score, the higher the confidence. It is calculated in a similar way to the PL field in GATK HaplotypeCaller genotypes.</td>\n    <tr>\n        <td><b>c840_reads_with_smn1_base_C</b></td>\n        <td>number of reads that have a 'C' nucleotide at the c.840 position in SMN1 plus SMN2</td> \n    <tr>\n        <td><b>c840_total_reads</b></td>\n        <td>total number of reads overlapping the c.840 position in SMN1 plus SMN2</td>  \n    </tr>\n</table>\n<br />  \n\n---\n\n  \n### Combining results from multiple samples\n\nAfter running SMA Finder on many samples, it often useful to combine the per-sample output tables into\na single table. One way to do this is with the following shell command:\n\n```\ncombined_table_filename=combined_results.tsv\nhead -n 1 $(ls *.tsv | head -n 1) > ${combined_table_filename}   # get table header from the 1st table \nfor i in *.tsv; do\n    tail -n +2 $i >> ${combined_table_filename}    # concatenate all tables\ndone\n```\n\n### Plotting combined results\n\nA scatter plot summarizing read counts from many samples can be generated using the `plot_SMN1_SMN2_scatter` command:\n\n```\npython3 plot_SMN1_SMN2_scatter.py --format svg --format png ${combined_table_filename}\n```\n\nIt generates plots like this one which is based on a neuromuscular cohort with 16,626 exomes:\n\n<img width=\"799\" alt=\"image\" src=\"https://github.com/broadinstitute/sma-finder/assets/6240170/d097a231-9b66-445b-b53c-84abdb9887d0\">\n\n---\n### Details\n\nThis poster on SMA Finder was presented at the [SVAR22](https://www.grahamerwin.org/svar-conference) conference:\n\n<img src=\"https://github.com/broadinstitute/sma_finder/raw/main/docs/SMA_poster_SVAR22.png\" />\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for diagnosing spinal muscular atrophy (SMA) using exome or genome sequencing data",
    "version": "1.4.4",
    "project_urls": {
        "Homepage": "https://github.com/broadinstitute/sma_finder"
    },
    "split_keywords": [
        "spinal muscular atrophy",
        "sma",
        "smn",
        "smn1",
        "smn2"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "630f46dfd349199673b74b45708c838729b7db3a6dee53fcdad8d7b219527dfd",
                "md5": "18142d8947d47122328246e7776f1413",
                "sha256": "f3f7eb50a71921560194e16c1a8434078a6b9eef5bff8456bb13475b6e05e18f"
            },
            "downloads": -1,
            "filename": "sma_finder-1.4.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "18142d8947d47122328246e7776f1413",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 14155,
            "upload_time": "2024-01-31T16:54:42",
            "upload_time_iso_8601": "2024-01-31T16:54:42.377219Z",
            "url": "https://files.pythonhosted.org/packages/63/0f/46dfd349199673b74b45708c838729b7db3a6dee53fcdad8d7b219527dfd/sma_finder-1.4.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "63594b9cb498a264ca79ffe6ac60a3ef21fc83976e6782c5ae184e7341c8ebb7",
                "md5": "c13d70376982e9f431ebd701655f8b69",
                "sha256": "e0fdad3e594d1d909fc1f1be91a17464bb64dd0f86096da4b2bb66c4fc92f963"
            },
            "downloads": -1,
            "filename": "sma_finder-1.4.4.tar.gz",
            "has_sig": false,
            "md5_digest": "c13d70376982e9f431ebd701655f8b69",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 13565,
            "upload_time": "2024-01-31T16:54:43",
            "upload_time_iso_8601": "2024-01-31T16:54:43.944600Z",
            "url": "https://files.pythonhosted.org/packages/63/59/4b9cb498a264ca79ffe6ac60a3ef21fc83976e6782c5ae184e7341c8ebb7/sma_finder-1.4.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-31 16:54:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "broadinstitute",
    "github_project": "sma_finder",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "sma-finder"
}
        
Elapsed time: 0.17633s