hiscanner


Namehiscanner JSON
Version 1.6 PyPI version JSON
download
home_pageNone
SummaryHigh-resolution single-cell copy number analysis.
upload_time2025-08-12 13:18:41
maintainerNone
docs_urlNone
authorYifan Zhao
requires_python>=3.8
licenseMIT
keywords bioinformatics single-cell cnv genomics
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # HiScanner (HIgh-resolution Single-Cell Allelic copy Number callER)
[![PyPI version](https://badge.fury.io/py/hiscanner.svg)](https://badge.fury.io/py/hiscanner)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

HiScanner is a python package for high-resolution single-cell copy number analysis. It supports two modes of operation:

1. **Standard Pipeline** (RDR + BAF): Full analysis using both read depth ratios (RDR) and B-allele frequencies (BAF)
2. **RDR-only Pipeline**: Simplified analysis using only read depth ratios

We provide a demo dataset and tutorial to help you get started. After installation, see https://github.com/parklab/hiscanner_demo for instructions. The typical run time for the demo without (`--use-cluster` option) is less than 30 minutes.

## Table of Contents

- [Installation](#installation)
- [Required External Files](#required-external-files)
- [Pipeline Options](#pipeline-options)
- [Running the Pipeline](#running-the-pipeline)
- [Output Structure](#output-structure)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)

### Installation
```bash
# Create new conda environment with all dependencies
conda create -n hiscanner_test \
    -c conda-forge -c bioconda -c defaults \
    python=3.8 samtools=1.15.1 bcftools=1.13 r-base "r-mgcv>=1.8"
conda activate hiscanner_test
pip install hiscanner --no-cache-dir
```

HiScanner (version 1.3) has been tested with Linux distributions:
- CentOS Linux release 7.9.2009
- Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-204-generic x86_64)

The typical installation time is less than 5 minutes.


## Required External Files: 

### 1) Mappability Track
<details>
<summary>hg19/GRCh37 (100mer) </summary>

```bash
wget https://www.math.pku.edu.cn/teachers/xirb/downloads/software/BICseq2/Mappability/hg19CRG.100bp.tar.gz --no-check-certificate
tar -xvzf hg19CRG.100bp.tar.gz
```
</details>

<details>
<summary>hg38/GRCh38 (150mer)</summary>

Download from: https://doi.org/10.6084/m9.figshare.28370357.v1

</details>

<details>
<summary>Other reference genomes: instructions on custom track generation</summary>

For other genomes/read length configurations, follow instructions at [CGAP Annotations](https://cgap-annotations.readthedocs.io/en/latest/bic-seq2_mappability.html) to generate mappability tracks.
</details>

<details>
<summary>IMPORTANT: Guidelines when choosing mappability tracks</summary>

- **Shorter mappability track (e.g., 100bp) with longer reads (e.g., 150bp)**: Valid but conservative (some uniquely mappable regions may be missed)
- **Longer mappability track (e.g., 150bp) with shorter reads (e.g., 100bp)**: Not valid, will cause false positives
</details>

### 2) Reference Genome
We require the reference genome fasta to be split into chromosomes, to allow for parallel processing. You can use the following command to split the reference genome:
```bash
samtools faidx /path/to/reference.fasta
mkdir /path/to/reference/split
awk '{print $1}' /path/to/reference.fasta.fai | xargs -I {} sh -c 'samtools faidx /path/to/reference.fasta {} > /path/to/reference/split/{}.fasta'
```


## Pipeline Options

HiScanner offers two pipeline options:

### 1. Standard Pipeline (RDR + BAF)
Uses both read depth ratios and B-allele frequencies for comprehensive CNV analysis.

Steps:
1. SNP Calling (via SCAN2)
2. Phasing & BAF Computation
3. ADO Analysis
4. Segmentation
5. CNV Calling

Requirements:
- SCAN2 output files
- Bulk and single cell BAM files
- Reference genome
- Mappability tracks

### 2. RDR-only Pipeline
Uses only read depth ratios for simplified CNV analysis.

Steps:
1. Segmentation
2. CNV Calling

Requirements:
- Single cell BAM files
- Reference genome
- Mappability tracks

## Running the Pipeline

### Option 1: Standard Pipeline (RDR + BAF)


#### Step 1: SCAN2 (Prerequisites)

[SCAN2](https://github.com/parklab/SCAN2) needs to be run separately before using HiScanner. 

> **Note**: If you only need SCAN2 output for HiScanner (and are not interested in SNV calls), you can save time by running SCAN2 with:
> ```bash
> scan2 run --joblimit 5000 --snakemake-args ' --until shapeit/phased_hets.vcf.gz --latency-wait 120'
> ```
> This will stop at the phasing step, which is sufficient for HiScanner's requirements.

If you have already run SCAN2, ensure you have:
- VCF file with raw variants (`gatk/hc_raw.mmq60.vcf.gz`)
- Phased heterozygous variants (`shapeit/phased_hets.vcf`)

- Additionally, we note that the phased genotype field in `phased_hets.vcf` should be named as `phasedgt`. This is the expected output from the SCAN2 pipeline that we have tested with. If your VCF file has a different field name, please manually rename it to `phasedgt` in the VCF.

The expected location is `scan2_out/` in your project directory.

#### Steps 2-5: HiScanner Pipeline
<details>
<summary>1. Initialize HiScanner project</summary>

```bash
hiscanner init --output ./my_project
cd my_project
```
</details>

<details>
<summary>2. Edit config.yaml</summary>

Edit with your paths and parameters
</details>

<details>
<summary>3. Prepare metadata file</summary>

Must contain the following columns:
```
bamID    bam    singlecell
bulk1    /path/to/bulk.bam    N
cell1    /path/to/cell1.bam   Y 
cell2    /path/to/cell2.bam   Y
```
</details>

<details>
<summary>4. Validate configuration</summary>

```bash
hiscanner validate
```
</details>

<details>
<summary>5. Run the pipeline</summary>

```bash
hiscanner run --step snp      # Check SCAN2 results
hiscanner run --step phase    # Process SCAN2 results
hiscanner run --step ado      # ADO analysis to identify optimal bin size
hiscanner run --step normalize # Normalize read depth ratios
hiscanner run --step segment  # Segmentation
hiscanner run --step cnv      # CNV calling

# Or run all steps at once:
hiscanner run --step all
```
</details>

For ```normalize``` (the most time-consuming step), we provide an option to run with cluster, e.g., 
```bash
hiscanner --config config.yaml run --step normalize --use-cluster
```

### Option 2: RDR-only Pipeline
<details>
<summary>1. Initialize project</summary>

```bash
hiscanner init --output ./my_project
cd my_project
```
</details>

<details>
<summary>2. Edit config.yaml</summary>

- Set `rdr_only: true`
- Configure paths and parameters
</details>

<details>
<summary>3. Prepare metadata file</summary>

Must contain the following columns (bulk samples are not required):
```
bamID    bam    singlecell
cell1    /path/to/cell1.bam   Y 
cell2    /path/to/cell2.bam   Y
```
</details>

<details>
<summary>4. Validate configuration</summary>

```bash
hiscanner validate
```
</details>

<details>
<summary>5. Run the pipeline</summary>

```bash
hiscanner run --step normalize # Normalize read depth ratios
hiscanner run --step segment  # Segmentation
hiscanner run --step cnv      # CNV calling
```
</details>


## Output Structure

```
hiscanner_output/
├── phased_hets/       # Processed heterozygous SNPs (Standard pipeline only)
├── ado/               # ADO analysis results (Standard pipeline only)
├── bins/             # Binned read depth
├── segs/             # Segmentation results
└── final_calls/      # Final CNV calls
```

## Cleaning Up
HiScanner creates several temporary directories during analysis. You can clean these up using the clean command:
```bash
hiscanner clean
```

## Troubleshooting

Common issues:
1. SCAN2 output formatting issues (hg38 users): Newer SCAN2 versions using Eagle phasing (instead of SHAPEIT) may produce incompatible genotype formatting. Solution: Clean the `GT` field in phased_hets.vcf.gz to keep only the phase information (e.g., "0|1", instead of "0|1:5,6:7,8").
2. Missing SCAN2 results: Ensure scan2_output directory is correctly specified. If the vcf is not zip-compressed, you can use `bgzip` to compress it (in scan2 environment).
```bash
bgzip scan2_out/gatk/hc_raw.mmq60.vcf
tabix -p vcf scan2_out/gatk/hc_raw.mmq60.vcf.gz
```
3. File permissions: Check access to BAM files and reference data
4. Memory issues: Adjust batch_size in config.yaml


## Support
HiScanner is currently under active development. For support or questions, please open an issue on our [GitHub repository](github.com/parklab/hiscanner).


## Citation

If you use HiScanner in your research, please cite:

Zhao, Y., Luquette, L. J., Veit, A. D., Wang, X., Xi, R., Viswanadham, V. V., ... & Park, P. J. (2025). High-resolution detection of copy number alterations in single cells with HiScanner. Nature Communications, 16(1), 5477.

## License
HiScanner is freely available for non-commercial use.


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "hiscanner",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "bioinformatics, single-cell, cnv, genomics",
    "author": "Yifan Zhao",
    "author_email": "yifnzhao@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d5/5a/abec8d5907e35ce4d7a7250ce13e87e5aecb2d0d5a3b6ef0b9c90cbd7476/hiscanner-1.6.tar.gz",
    "platform": null,
    "description": "# HiScanner (HIgh-resolution Single-Cell Allelic copy Number callER)\n[![PyPI version](https://badge.fury.io/py/hiscanner.svg)](https://badge.fury.io/py/hiscanner)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nHiScanner is a python package for high-resolution single-cell copy number analysis. It supports two modes of operation:\n\n1. **Standard Pipeline** (RDR + BAF): Full analysis using both read depth ratios (RDR) and B-allele frequencies (BAF)\n2. **RDR-only Pipeline**: Simplified analysis using only read depth ratios\n\nWe provide a demo dataset and tutorial to help you get started. After installation, see https://github.com/parklab/hiscanner_demo for instructions. The typical run time for the demo without (`--use-cluster` option) is less than 30 minutes.\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Required External Files](#required-external-files)\n- [Pipeline Options](#pipeline-options)\n- [Running the Pipeline](#running-the-pipeline)\n- [Output Structure](#output-structure)\n- [Troubleshooting](#troubleshooting)\n- [Citation](#citation)\n\n### Installation\n```bash\n# Create new conda environment with all dependencies\nconda create -n hiscanner_test \\\n    -c conda-forge -c bioconda -c defaults \\\n    python=3.8 samtools=1.15.1 bcftools=1.13 r-base \"r-mgcv>=1.8\"\nconda activate hiscanner_test\npip install hiscanner --no-cache-dir\n```\n\nHiScanner (version 1.3) has been tested with Linux distributions:\n- CentOS Linux release 7.9.2009\n- Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-204-generic x86_64)\n\nThe typical installation time is less than 5 minutes.\n\n\n## Required External Files: \n\n### 1) Mappability Track\n<details>\n<summary>hg19/GRCh37 (100mer) </summary>\n\n```bash\nwget https://www.math.pku.edu.cn/teachers/xirb/downloads/software/BICseq2/Mappability/hg19CRG.100bp.tar.gz --no-check-certificate\ntar -xvzf hg19CRG.100bp.tar.gz\n```\n</details>\n\n<details>\n<summary>hg38/GRCh38 (150mer)</summary>\n\nDownload from: https://doi.org/10.6084/m9.figshare.28370357.v1\n\n</details>\n\n<details>\n<summary>Other reference genomes: instructions on custom track generation</summary>\n\nFor other genomes/read length configurations, follow instructions at [CGAP Annotations](https://cgap-annotations.readthedocs.io/en/latest/bic-seq2_mappability.html) to generate mappability tracks.\n</details>\n\n<details>\n<summary>IMPORTANT: Guidelines when choosing mappability tracks</summary>\n\n- **Shorter mappability track (e.g., 100bp) with longer reads (e.g., 150bp)**: Valid but conservative (some uniquely mappable regions may be missed)\n- **Longer mappability track (e.g., 150bp) with shorter reads (e.g., 100bp)**: Not valid, will cause false positives\n</details>\n\n### 2) Reference Genome\nWe require the reference genome fasta to be split into chromosomes, to allow for parallel processing. You can use the following command to split the reference genome:\n```bash\nsamtools faidx /path/to/reference.fasta\nmkdir /path/to/reference/split\nawk '{print $1}' /path/to/reference.fasta.fai | xargs -I {} sh -c 'samtools faidx /path/to/reference.fasta {} > /path/to/reference/split/{}.fasta'\n```\n\n\n## Pipeline Options\n\nHiScanner offers two pipeline options:\n\n### 1. Standard Pipeline (RDR + BAF)\nUses both read depth ratios and B-allele frequencies for comprehensive CNV analysis.\n\nSteps:\n1. SNP Calling (via SCAN2)\n2. Phasing & BAF Computation\n3. ADO Analysis\n4. Segmentation\n5. CNV Calling\n\nRequirements:\n- SCAN2 output files\n- Bulk and single cell BAM files\n- Reference genome\n- Mappability tracks\n\n### 2. RDR-only Pipeline\nUses only read depth ratios for simplified CNV analysis.\n\nSteps:\n1. Segmentation\n2. CNV Calling\n\nRequirements:\n- Single cell BAM files\n- Reference genome\n- Mappability tracks\n\n## Running the Pipeline\n\n### Option 1: Standard Pipeline (RDR + BAF)\n\n\n#### Step 1: SCAN2 (Prerequisites)\n\n[SCAN2](https://github.com/parklab/SCAN2) needs to be run separately before using HiScanner. \n\n> **Note**: If you only need SCAN2 output for HiScanner (and are not interested in SNV calls), you can save time by running SCAN2 with:\n> ```bash\n> scan2 run --joblimit 5000 --snakemake-args ' --until shapeit/phased_hets.vcf.gz --latency-wait 120'\n> ```\n> This will stop at the phasing step, which is sufficient for HiScanner's requirements.\n\nIf you have already run SCAN2, ensure you have:\n- VCF file with raw variants (`gatk/hc_raw.mmq60.vcf.gz`)\n- Phased heterozygous variants (`shapeit/phased_hets.vcf`)\n\n- Additionally, we note that the phased genotype field in `phased_hets.vcf` should be named as `phasedgt`. This is the expected output from the SCAN2 pipeline that we have tested with. If your VCF file has a different field name, please manually rename it to `phasedgt` in the VCF.\n\nThe expected location is `scan2_out/` in your project directory.\n\n#### Steps 2-5: HiScanner Pipeline\n<details>\n<summary>1. Initialize HiScanner project</summary>\n\n```bash\nhiscanner init --output ./my_project\ncd my_project\n```\n</details>\n\n<details>\n<summary>2. Edit config.yaml</summary>\n\nEdit with your paths and parameters\n</details>\n\n<details>\n<summary>3. Prepare metadata file</summary>\n\nMust contain the following columns:\n```\nbamID    bam    singlecell\nbulk1    /path/to/bulk.bam    N\ncell1    /path/to/cell1.bam   Y \ncell2    /path/to/cell2.bam   Y\n```\n</details>\n\n<details>\n<summary>4. Validate configuration</summary>\n\n```bash\nhiscanner validate\n```\n</details>\n\n<details>\n<summary>5. Run the pipeline</summary>\n\n```bash\nhiscanner run --step snp      # Check SCAN2 results\nhiscanner run --step phase    # Process SCAN2 results\nhiscanner run --step ado      # ADO analysis to identify optimal bin size\nhiscanner run --step normalize # Normalize read depth ratios\nhiscanner run --step segment  # Segmentation\nhiscanner run --step cnv      # CNV calling\n\n# Or run all steps at once:\nhiscanner run --step all\n```\n</details>\n\nFor ```normalize``` (the most time-consuming step), we provide an option to run with cluster, e.g., \n```bash\nhiscanner --config config.yaml run --step normalize --use-cluster\n```\n\n### Option 2: RDR-only Pipeline\n<details>\n<summary>1. Initialize project</summary>\n\n```bash\nhiscanner init --output ./my_project\ncd my_project\n```\n</details>\n\n<details>\n<summary>2. Edit config.yaml</summary>\n\n- Set `rdr_only: true`\n- Configure paths and parameters\n</details>\n\n<details>\n<summary>3. Prepare metadata file</summary>\n\nMust contain the following columns (bulk samples are not required):\n```\nbamID    bam    singlecell\ncell1    /path/to/cell1.bam   Y \ncell2    /path/to/cell2.bam   Y\n```\n</details>\n\n<details>\n<summary>4. Validate configuration</summary>\n\n```bash\nhiscanner validate\n```\n</details>\n\n<details>\n<summary>5. Run the pipeline</summary>\n\n```bash\nhiscanner run --step normalize # Normalize read depth ratios\nhiscanner run --step segment  # Segmentation\nhiscanner run --step cnv      # CNV calling\n```\n</details>\n\n\n## Output Structure\n\n```\nhiscanner_output/\n\u251c\u2500\u2500 phased_hets/       # Processed heterozygous SNPs (Standard pipeline only)\n\u251c\u2500\u2500 ado/               # ADO analysis results (Standard pipeline only)\n\u251c\u2500\u2500 bins/             # Binned read depth\n\u251c\u2500\u2500 segs/             # Segmentation results\n\u2514\u2500\u2500 final_calls/      # Final CNV calls\n```\n\n## Cleaning Up\nHiScanner creates several temporary directories during analysis. You can clean these up using the clean command:\n```bash\nhiscanner clean\n```\n\n## Troubleshooting\n\nCommon issues:\n1. SCAN2 output formatting issues (hg38 users): Newer SCAN2 versions using Eagle phasing (instead of SHAPEIT) may produce incompatible genotype formatting. Solution: Clean the `GT` field in phased_hets.vcf.gz to keep only the phase information (e.g., \"0|1\", instead of \"0|1:5,6:7,8\").\n2. Missing SCAN2 results: Ensure scan2_output directory is correctly specified. If the vcf is not zip-compressed, you can use `bgzip` to compress it (in scan2 environment).\n```bash\nbgzip scan2_out/gatk/hc_raw.mmq60.vcf\ntabix -p vcf scan2_out/gatk/hc_raw.mmq60.vcf.gz\n```\n3. File permissions: Check access to BAM files and reference data\n4. Memory issues: Adjust batch_size in config.yaml\n\n\n## Support\nHiScanner is currently under active development. For support or questions, please open an issue on our [GitHub repository](github.com/parklab/hiscanner).\n\n\n## Citation\n\nIf you use HiScanner in your research, please cite:\n\nZhao, Y., Luquette, L. J., Veit, A. D., Wang, X., Xi, R., Viswanadham, V. V., ... & Park, P. J. (2025). High-resolution detection of copy number alterations in single cells with HiScanner. Nature Communications, 16(1), 5477.\n\n## License\nHiScanner is freely available for non-commercial use.\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "High-resolution single-cell copy number analysis.",
    "version": "1.6",
    "project_urls": {
        "Changelog": "https://github.com/parklab/hiscanner/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/parklab/hiscanner#readme",
        "Homepage": "https://github.com/parklab/hiscanner",
        "Repository": "https://github.com/parklab/hiscanner.git"
    },
    "split_keywords": [
        "bioinformatics",
        " single-cell",
        " cnv",
        " genomics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "992bde1e83a3d2dbb155d957c30dc14d91935f0b6a0fd341b46f3e28215cfedd",
                "md5": "a3f30d4ac297d6fd51be1e66c62f299a",
                "sha256": "c97a360b353a4e43c16d77a415d71e5c98f07976eca16369a5a73897b58ea258"
            },
            "downloads": -1,
            "filename": "hiscanner-1.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a3f30d4ac297d6fd51be1e66c62f299a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 4868852,
            "upload_time": "2025-08-12T13:18:39",
            "upload_time_iso_8601": "2025-08-12T13:18:39.592821Z",
            "url": "https://files.pythonhosted.org/packages/99/2b/de1e83a3d2dbb155d957c30dc14d91935f0b6a0fd341b46f3e28215cfedd/hiscanner-1.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d55aabec8d5907e35ce4d7a7250ce13e87e5aecb2d0d5a3b6ef0b9c90cbd7476",
                "md5": "ef174981afc7ffaed0249bc2ecdfe8f7",
                "sha256": "686f34aa4555aed84a5ebf317b47f7855b66ff695afd0cfd79996a234be39ef2"
            },
            "downloads": -1,
            "filename": "hiscanner-1.6.tar.gz",
            "has_sig": false,
            "md5_digest": "ef174981afc7ffaed0249bc2ecdfe8f7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 4794064,
            "upload_time": "2025-08-12T13:18:41",
            "upload_time_iso_8601": "2025-08-12T13:18:41.279761Z",
            "url": "https://files.pythonhosted.org/packages/d5/5a/abec8d5907e35ce4d7a7250ce13e87e5aecb2d0d5a3b6ef0b9c90cbd7476/hiscanner-1.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-12 13:18:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "parklab",
    "github_project": "hiscanner",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "hiscanner"
}
        
Elapsed time: 0.40854s