hiscanner

- Name: hiscanner
- Version: 0.2b5
- Summary: High-resolution single-cell copy number analysis.
- Upload time: 2025-02-06 22:21:17
- Author: Yifan Zhao
- Requires Python: >=3.8
- License: MIT
- Keywords: bioinformatics, single-cell, cnv, genomics

# HiScanner (HIgh-resolution Single-Cell Allelic copy Number callER)
[![PyPI version](https://badge.fury.io/py/hiscanner.svg)](https://badge.fury.io/py/hiscanner)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

HiScanner is a Python package for high-resolution single-cell copy number analysis. It supports two modes of operation:

1. **Standard Pipeline** (RDR + BAF): Full analysis using both read depth ratios (RDR) and B-allele frequencies (BAF)
2. **RDR-only Pipeline**: Simplified analysis using only read depth ratios
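
To make the two signals concrete, the sketch below computes the RDR and BAF one would expect for an allele-specific copy number state under an idealized, noise-free model with a diploid baseline. The function name `expected_rdr_baf` is hypothetical and not part of the HiScanner API; it only illustrates why BAF can separate states that RDR alone cannot.

```python
from typing import Tuple


def expected_rdr_baf(major: int, minor: int, baseline_ploidy: int = 2) -> Tuple[float, float]:
    """Idealized RDR and BAF for an allele-specific copy number state.

    Illustrative only: assumes a noise-free signal and a diploid baseline,
    and takes the B allele to be the minor allele.
    """
    total = major + minor
    rdr = total / baseline_ploidy
    # BAF is undefined when no copies remain (homozygous deletion); report NaN.
    baf = minor / total if total > 0 else float("nan")
    return rdr, baf


# A normal diploid segment (1+1) and a copy-neutral LOH segment (2+0)
# share the same RDR and differ only in BAF.
print(expected_rdr_baf(1, 1))  # (1.0, 0.5)
print(expected_rdr_baf(2, 0))  # (1.0, 0.0)
```

In this simplified picture, the RDR-only pipeline would see both segments as identical, while the standard pipeline has the BAF signal to tell them apart.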

## Table of Contents

- [Installation](#installation)
- [Required External Files](#required-external-files)
- [Pipeline Options](#pipeline-options)
- [Running the Pipeline](#running-the-pipeline)
- [Output Structure](#output-structure)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)

## Installation
```bash
# Create a new conda environment and install HiScanner
conda create -n hiscanner_test python=3.8
conda activate hiscanner_test
pip install hiscanner --no-cache-dir
```

Install R and required packages:
```bash
conda install -c conda-forge r-base
# Quote the version spec so the shell does not treat ">" as an output redirect
conda install -c bioconda "r-mgcv>=1.8"
```

Install other dependencies:
```bash
conda install -c bioconda snakemake samtools bcftools
```
We tested with snakemake==7.32.4, samtools==1.15.1, bcftools==1.13.

## Required External Files

### 1) Mappability Track
<details>
<summary>hg19/GRCh37</summary>

```bash
# 100mer:
wget https://www.math.pku.edu.cn/teachers/xirb/downloads/software/BICseq2/Mappability/hg19CRG.100bp.tar.gz --no-check-certificate
tar -xvzf hg19CRG.100bp.tar.gz
```
</details>

<details>
<summary>hg38/GRCh38 (150bp)</summary>

150bp: Coming soon
</details>

<details>
<summary>Other reference genomes: instructions on custom track generation</summary>

For other genomes/read length configurations, follow instructions at [CGAP Annotations](https://cgap-annotations.readthedocs.io/en/latest/bic-seq2_mappability.html) to generate mappability tracks.
</details>

<details>
<summary>IMPORTANT: Guidelines when choosing mappability tracks</summary>

- **Shorter mappability track (e.g., 100bp) with longer reads (e.g., 150bp)**: Valid but conservative (some uniquely mappable regions may be missed)
- **Longer mappability track (e.g., 150bp) with shorter reads (e.g., 100bp)**: Not valid, will cause false positives
</details>
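
The guideline on choosing a mappability track can be expressed as a one-line check. This is a minimal sketch; the function name `mappability_track_is_valid` is hypothetical, and treating equal lengths as valid is an assumption rather than something stated in the guideline.

```python
def mappability_track_is_valid(track_kmer_bp: int, read_length_bp: int) -> bool:
    """Return True when the track's k-mer length does not exceed the read length."""
    return track_kmer_bp <= read_length_bp


print(mappability_track_is_valid(100, 150))  # True: valid but conservative
print(mappability_track_is_valid(150, 100))  # False: risks false positives
```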

### 2) Reference Genome
HiScanner requires the reference genome FASTA to be split into one file per chromosome so that chromosomes can be processed in parallel. The following commands perform the split:
```bash
samtools faidx /path/to/reference.fasta
mkdir -p /path/to/reference/split
# Loop over sequence names so that the redirect is evaluated once per chromosome
awk '{print $1}' /path/to/reference.fasta.fai | while read -r chrom; do
    samtools faidx /path/to/reference.fasta "$chrom" > "/path/to/reference/split/${chrom}.fasta"
done
```
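
After splitting, it may be worth confirming that every sequence in the index has its own non-empty file. The sketch below is a hypothetical helper (not part of HiScanner) that reads the sequence names from the first tab-separated column of the `.fai` index and reports those without a matching `<name>.fasta` file.

```python
from pathlib import Path
from typing import List


def missing_split_fastas(fai_path: str, split_dir: str) -> List[str]:
    """List sequences from a .fai index with no non-empty <name>.fasta in split_dir."""
    missing = []
    for line in Path(fai_path).read_text().splitlines():
        if not line.strip():
            continue
        name = line.split("\t")[0]  # first .fai column is the sequence name
        fasta = Path(split_dir) / f"{name}.fasta"
        if not fasta.is_file() or fasta.stat().st_size == 0:
            missing.append(name)
    return missing
```

For example, `missing_split_fastas("/path/to/reference.fasta.fai", "/path/to/reference/split")` should return an empty list when the split succeeded.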


## Pipeline Options

HiScanner offers two pipeline options:

### 1. Standard Pipeline (RDR + BAF)
Uses both read depth ratios and B-allele frequencies for comprehensive CNV analysis.

Steps:
1. SNP Calling (via SCAN2)
2. Phasing & BAF Computation
3. ADO Analysis
4. Segmentation
5. CNV Calling

Requirements:
- SCAN2 output files
- Bulk and single-cell BAM files
- Reference genome
- Mappability tracks

### 2. RDR-only Pipeline
Uses only read depth ratios for simplified CNV analysis.

Steps:
1. Segmentation
2. CNV Calling

Requirements:
- Single-cell BAM files
- Reference genome
- Mappability tracks

## Running the Pipeline

### Option 1: Standard Pipeline (RDR + BAF)


#### Step 1: SCAN2 (Prerequisites)

[SCAN2](https://github.com/parklab/SCAN2) needs to be run separately before using HiScanner. 

> **Note**: If you only need SCAN2 output for HiScanner (and are not interested in SNV calls), you can save time by running SCAN2 with:
> ```bash
> scan2 run --joblimit 5000 --snakemake-args ' --until shapeit/phased_hets.vcf.gz --latency-wait 120'
> ```
> This will stop at the phasing step, which is sufficient for HiScanner's requirements.

If you have already run SCAN2, ensure you have:
- VCF file with raw variants (`gatk/hc_raw.mmq60.vcf.gz`)
- Phased heterozygous variants (`shapeit/phased_hets.vcf`)

- The phased genotype field in `phased_hets.vcf` must be named `phasedgt`. This is the field name produced by the SCAN2 pipeline we tested with. If your VCF file uses a different field name, rename it to `phasedgt` manually in the VCF.

The expected location is `scan2_out/` in your project directory.
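
A quick pre-flight check of the SCAN2 prerequisites might look like the sketch below. Both helper names are hypothetical and not part of HiScanner or SCAN2; the relative paths are the ones listed in this section, and the `phasedgt` check is a crude text search rather than a real VCF parse.

```python
from pathlib import Path
from typing import List

# Relative paths as listed in this README; adjust if your SCAN2 layout differs.
REQUIRED_SCAN2_FILES = (
    "gatk/hc_raw.mmq60.vcf.gz",
    "shapeit/phased_hets.vcf",
)


def missing_scan2_outputs(scan2_dir: str = "scan2_out") -> List[str]:
    """Return the required SCAN2 files that are absent from scan2_dir."""
    root = Path(scan2_dir)
    return [rel for rel in REQUIRED_SCAN2_FILES if not (root / rel).is_file()]


def vcf_mentions_phasedgt(vcf_path: str) -> bool:
    """Return True if any line of an uncompressed VCF contains the text 'phasedgt'."""
    with open(vcf_path) as handle:
        return any("phasedgt" in line for line in handle)
```

An empty list from `missing_scan2_outputs()` together with `True` from `vcf_mentions_phasedgt("scan2_out/shapeit/phased_hets.vcf")` suggests the inputs are in place; `hiscanner --config config.yaml validate` remains the authoritative check.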

#### Steps 2-5: HiScanner Pipeline
<details>
<summary>1. Initialize HiScanner project</summary>

```bash
hiscanner init --output ./my_project
cd my_project
```
</details>

<details>
<summary>2. Edit config.yaml</summary>

Edit `config.yaml` with your paths and parameters.
</details>

<details>
<summary>3. Prepare metadata file</summary>

The metadata file must contain the following columns:
```
bamID    bam    singlecell
bulk1    /path/to/bulk.bam    N
cell1    /path/to/cell1.bam   Y
cell2    /path/to/cell2.bam   Y
```
</details>
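
The metadata layout described in step 3 can be sanity-checked before running the pipeline. The sketch below is a hypothetical helper, not HiScanner's own validator. It assumes whitespace-separated columns (so paths containing spaces would not parse), and it assumes the standard pipeline needs at least one bulk row (`singlecell` = `N`), based on the requirements listed earlier.

```python
from typing import List

REQUIRED_COLUMNS = ["bamID", "bam", "singlecell"]


def check_metadata(text: str, rdr_only: bool = False) -> List[str]:
    """Return a list of problems found in metadata text (empty list means OK)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return ["metadata file is empty"]
    header, records = rows[0], rows[1:]
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {missing}"]
    flag_idx = header.index("singlecell")
    problems = []
    flags = []
    for n, rec in enumerate(records, start=1):
        if len(rec) != len(header):
            problems.append(f"record {n}: expected {len(header)} fields, got {len(rec)}")
        elif rec[flag_idx] not in ("Y", "N"):
            problems.append(f"record {n}: singlecell must be Y or N, got {rec[flag_idx]!r}")
        else:
            flags.append(rec[flag_idx])
    if "Y" not in flags:
        problems.append("no single-cell sample (singlecell == Y)")
    if not rdr_only and "N" not in flags:
        problems.append("standard pipeline needs a bulk sample (singlecell == N)")
    return problems
```

Passing the contents of the metadata file, for example `check_metadata(open("metadata.txt").read())` with a hypothetical file name, returns an empty list when the layout looks right.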

<details>
<summary>4. Validate configuration</summary>

```bash
hiscanner --config config.yaml validate
```
</details>

<details>
<summary>5. Run the pipeline</summary>

```bash
hiscanner --config config.yaml run --step snp      # Check SCAN2 results
hiscanner --config config.yaml run --step phase    # Process SCAN2 results
hiscanner --config config.yaml run --step ado      # ADO analysis to identify optimal bin size
hiscanner --config config.yaml run --step normalize # Normalize read depth ratios
hiscanner --config config.yaml run --step segment  # Segmentation
hiscanner --config config.yaml run --step cnv      # CNV calling

# Or run all steps at once:
hiscanner --config config.yaml run --step all
```
</details>
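
When scripting the step-by-step run, the per-step commands from step 5 can be generated rather than typed out. The sketch below only assembles and prints the command lines, using the step names and flags shown in this README; the helper name `step_commands` is hypothetical.

```python
import shlex
from typing import List, Sequence

# Step names as they appear in this README, in execution order.
STANDARD_STEPS = ("snp", "phase", "ado", "normalize", "segment", "cnv")


def step_commands(config: str = "config.yaml",
                  steps: Sequence[str] = STANDARD_STEPS) -> List[List[str]]:
    """Build one `hiscanner --config <config> run --step <name>` argument list per step."""
    return [["hiscanner", "--config", config, "run", "--step", step] for step in steps]


for argv in step_commands():
    # Printing keeps this a dry run. To execute for real, one could call
    # subprocess.run(argv, check=True), which stops at the first failing step.
    print(shlex.join(argv))
```

Running steps individually like this makes it easier to resume after a failure than `--step all`.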

The `normalize` step is the most time-consuming. It can be run on a cluster with the `--use-cluster` option:
```bash
hiscanner --config config.yaml run --step normalize --use-cluster
```

### Option 2: RDR-only Pipeline
<details>
<summary>1. Initialize project</summary>

```bash
hiscanner init --output ./my_project
cd my_project
```
</details>

<details>
<summary>2. Edit config.yaml</summary>

- Set `rdr_only: true`
- Configure paths and parameters
</details>

<details>
<summary>3. Prepare metadata file</summary>

The metadata file must contain the following columns (bulk samples are not required):
```
bamID    bam    singlecell
cell1    /path/to/cell1.bam   Y
cell2    /path/to/cell2.bam   Y
```
</details>

<details>
<summary>4. Validate configuration</summary>

```bash
hiscanner --config config.yaml validate
```
</details>

<details>
<summary>5. Run the pipeline</summary>

```bash
hiscanner --config config.yaml run --step normalize # Normalize read depth ratios
hiscanner --config config.yaml run --step segment  # Segmentation
hiscanner --config config.yaml run --step cnv      # CNV calling
```
</details>


## Output Structure

```
hiscanner_output/
├── phased_hets/       # Processed heterozygous SNPs (Standard pipeline only)
├── ado/               # ADO analysis results (Standard pipeline only)
├── bins/              # Binned read depth
├── segs/              # Segmentation results
└── final_calls/       # Final CNV calls
```
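
To confirm that a run produced the directories listed above, a small check such as the following can help. It is a sketch under the assumption that the directory names match this README exactly; the helper name `missing_output_dirs` is hypothetical.

```python
from pathlib import Path
from typing import List

COMMON_DIRS = ["bins", "segs", "final_calls"]
STANDARD_ONLY_DIRS = ["phased_hets", "ado"]  # produced by the standard pipeline only


def missing_output_dirs(output_dir: str = "hiscanner_output",
                        rdr_only: bool = False) -> List[str]:
    """Return the expected subdirectories that do not exist under output_dir."""
    expected = COMMON_DIRS if rdr_only else STANDARD_ONLY_DIRS + COMMON_DIRS
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

For an RDR-only run, call `missing_output_dirs(rdr_only=True)` so that `phased_hets/` and `ado/` are not reported as missing.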

## Cleaning Up
HiScanner creates several temporary directories during analysis. Remove them with the `clean` command:
```bash
hiscanner --config config.yaml clean
```

## Troubleshooting

Common issues:
1. Missing SCAN2 results: Ensure the `scan2_output` directory is correctly specified
2. File permissions: Check access to BAM files and reference data
3. Memory issues: Adjust `batch_size` in `config.yaml`

For more detailed information, check the log files in `hiscanner_output/logs/`.


## Support
HiScanner is currently under active development. For support or questions, please open an issue on our [GitHub repository](https://github.com/parklab/hiscanner).


## Citation

If you use HiScanner in your research, please cite:

Zhao, Y., Luquette, L. J., Veit, A. D., Wang, X., Xi, R., Viswanadham, V. V., ... & Park, P. J. (2024). High-resolution detection of copy number alterations in single cells with HiScanner. bioRxiv, 2024-04.


            
