# MarkDup

| Field           | Value                                                                                          |
| --------------- | ---------------------------------------------------------------------------------------------- |
| Name            | markdup                                                                                        |
| Version         | 0.0.15                                                                                         |
| Summary         | Fast, accurate BAM deduplication with intelligent UMI detection and correct fragment detection |
| Upload time     | 2025-10-21 18:37:44                                                                            |
| Requires Python | >=3.10                                                                                         |
| License         | MIT                                                                                            |
| Keywords        | bioinformatics, bam, deduplication, umi, sequencing, genomics                                  |

[![Pypi Releases](https://img.shields.io/pypi/v/markdup.svg)](https://pypi.python.org/pypi/markdup)
[![Downloads](https://img.shields.io/pepy/dt/markdup)](https://pepy.tech/project/markdup)
[![Development Status](https://img.shields.io/badge/status-alpha-orange.svg)](https://github.com/y9c/markdup)

**A fast, accurate BAM deduplication tool with intelligent UMI detection and correct fragment detection.**

## Why MarkDup?

Existing BAM deduplication tools suffer from several critical issues: buggy duplicate detection due to incorrect biological positioning and strand handling, poor performance especially with UMI-based deduplication, and inadequate UMI clustering that leads to over-merging.

MarkDup solves these problems with:

- **Correct fragment detection** using proper strand-aware coordinate handling
- **Significantly faster processing** through optimized algorithms and parallel processing
- **Smart UMI clustering** that prevents over-merging with frequency-aware algorithms
- **Automatic detection** that handles both UMI and non-UMI data without requiring different tools

## Quick Start

```bash
# Installation
pip install markdup
```

```bash
# Basic usage (auto-detects everything)
markdup input.bam output.bam

# With multiple threads
markdup input.bam output.bam --threads 8

# Force coordinate-based (no UMIs)
markdup input.bam output.bam --no-umi
```

## Key Features

### 🧬 **Correct Fragment Detection**

- **Strand-aware coordinates**: Properly handles forward/reverse strand reads
- **CIGAR-aware positioning**: Correctly processes indels and complex alignments
- **Biological positioning**: Uses 5'/3' positions, not reference positions
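
To make the idea of a strand-aware, CIGAR-aware 5' position concrete, here is a minimal sketch in plain Python. It is not MarkDup's implementation: the function name `five_prime_position` is hypothetical, coordinates are assumed to be 0-based, and extending the position across soft/hard clipping is an assumption about how such tools commonly behave.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def five_prime_position(pos: int, cigar: str, is_reverse: bool) -> int:
    """Return the 0-based unclipped 5' coordinate of an aligned read.

    `pos` is the 0-based leftmost aligned reference position.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

    if not is_reverse:
        # Forward strand: the 5' end is on the left, so undo leading clipping.
        lead_clip = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead_clip += n
        return pos - lead_clip

    # Reverse strand: the 5' end is on the right. Walk the reference-consuming
    # operations (insertions do not count, deletions do), then add trailing clipping.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail_clip = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail_clip += n
    return pos + ref_len - 1 + trail_clip
```

With this sketch, a forward read `5S45M` at position 100 and a forward read `50M` at position 95 resolve to the same 5' coordinate, and a reverse read's coordinate shifts with deletions but not with insertions.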

### ⚡ **High Performance**

- **Significantly faster** UMI clustering with optimized algorithms
- **Parallel processing**: Multi-core support for large files
- **Memory efficient**: Window-based processing for large datasets

### 🔄 **Automatic Detection**

- **UMI auto-detection**: Finds UMIs in read names or BAM tags
- **Sequencing type detection**: Automatically detects single-end vs paired-end
- **Quality metrics**: Selects the best quality criteria automatically

### 🎯 **Smart UMI Clustering**

- **Frequency-aware**: Prevents over-clustering of high-frequency UMIs
- **Edit distance**: Configurable similarity thresholds
- **Exact matching**: Handles identical UMIs efficiently
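
The following sketch shows one way frequency-aware clustering can avoid over-merging. It is a simplified stand-in, not MarkDup's algorithm: it uses Hamming distance, interprets the distance threshold as a fraction of UMI length, and merges a UMI into a more abundant one only when its count is at most a given ratio of the abundant UMI's count. The parameter names mirror `--max-dist-frac` and `--max-frequency-ratio` from the options table, but their exact semantics here are assumed.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; unequal lengths count as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umis, max_dist_frac=0.1, max_frequency_ratio=0.1):
    """Group UMIs at one position into clusters keyed by their seed UMI."""
    counts = Counter(umis)
    # Visit UMIs from most to least frequent so abundant UMIs become seeds.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    clusters: dict[str, list[str]] = {}
    for umi in ordered:
        for seed in clusters:
            close = hamming(umi, seed) <= max_dist_frac * len(seed)
            rare = counts[umi] <= max_frequency_ratio * counts[seed]
            if close and rare:
                # Likely a sequencing error of the seed: absorb it.
                clusters[seed].append(umi)
                break
        else:
            # Too different or too abundant to be an error: start a new cluster.
            clusters[umi] = [umi]
    return clusters
```

In this sketch two UMIs that differ by one base but are both abundant stay separate, which is the behavior the frequency check is meant to protect.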

## Performance

- **Significantly faster** UMI clustering with optimized algorithms
- **Multi-core processing** for parallel performance
- **Memory efficient** window-based processing for large files
- **Automatic optimization** based on input data characteristics

## How It Works

1. **Auto-detect**: UMI presence, sequencing type, and quality metrics
2. **Group fragments**: By biological position and strand
3. **Cluster UMIs**: Using edit distance and frequency-aware algorithms
4. **Select best**: The highest quality read from each cluster
5. **Output**: Deduplicated reads with cluster information
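
The grouping and selection steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Read` record and its fields are hypothetical, UMIs are matched exactly (no edit-distance clustering), and "quality" is reduced to a single number per read.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    name: str
    chrom: str
    five_prime: int  # strand-aware 5' coordinate
    strand: str      # "+" or "-"
    umi: str         # empty string when the data has no UMIs
    quality: float   # e.g. mean base quality


def deduplicate(reads):
    """Return (best_read, cluster_size) for each group of duplicate reads."""
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing position, strand and UMI are treated as one fragment.
        groups[(read.chrom, read.five_prime, read.strand, read.umi)].append(read)

    kept = []
    for members in groups.values():
        best = max(members, key=lambda r: r.quality)
        kept.append((best, len(members)))
    return kept
```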

## Documentation

### Usage Examples

#### Basic Deduplication

```bash
# Auto-detect UMIs and process
markdup input.bam output.bam

# Force coordinate-based method (no UMIs)
markdup input.bam output.bam --no-umi
```

#### Advanced Options

```bash
# Custom UMI settings
markdup input.bam output.bam --umi-tag UB --min-edit-dist-frac 0.15

# Start-only positioning (useful for ChIP-seq)
markdup input.bam output.bam --start-only

# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates
```
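
For readers unfamiliar with how duplicates are usually marked in BAM files, the sketch below sets and tests the duplicate bit (`0x400`) of the SAM FLAG field. That MarkDup's `--keep-duplicates` uses this bit is an assumption based on the SAM convention; the helper names are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: "PCR or optical duplicate"


def mark_duplicate(flag: int) -> int:
    """Return the FLAG value with the duplicate bit set."""
    return flag | DUPLICATE_FLAG


def is_duplicate(flag: int) -> bool:
    """True when the duplicate bit is set in a FLAG value."""
    return bool(flag & DUPLICATE_FLAG)
```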

### Performance Tuning

```bash
# Use 8 threads
markdup input.bam output.bam --threads 8

# Use larger windows for better performance
markdup input.bam output.bam --window-size 200000
```
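
To show what window-based processing means in practice, this sketch splits a contig into fixed-size windows and bins fragment positions by window, so each window can be handled independently (for example by a separate worker). The function names are hypothetical, and how MarkDup treats fragments near window boundaries is not covered here.

```python
from collections import defaultdict


def iter_windows(contig_length: int, window_size: int):
    """Yield half-open (start, end) windows covering a contig."""
    for start in range(0, contig_length, window_size):
        yield start, min(start + window_size, contig_length)


def bin_by_window(positions, window_size: int):
    """Group 0-based positions by the index of the window they fall into."""
    bins = defaultdict(list)
    for pos in positions:
        bins[pos // window_size].append(pos)
    return dict(bins)
```

Larger windows mean fewer, bigger work units: less scheduling overhead per window, at the cost of holding more reads in memory at once.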

### Detailed Documentation

- [Installation Guide](docs/installation.md) - How to install MarkDup
- [Usage Guide](docs/usage.md) - How to use MarkDup
- [Algorithm Details](docs/algorithm.md) - How MarkDup works and fixes existing problems
- [FAQ](docs/faq.md) - Frequently asked questions
- [Contributing](docs/contributing.md) - How to contribute

### Command Line Options

| Option                  | Description                   | Default     |
| ----------------------- | ----------------------------- | ----------- |
| `INPUT_BAM`             | Input BAM file                | Required    |
| `OUTPUT_BAM`            | Output BAM file               | Required    |
| `--threads`             | Number of threads             | 1           |
| `--no-umi`              | Force coordinate-based method | Auto-detect |
| `--umi-tag`             | UMI BAM tag (e.g., UB)        | Auto-detect |
| `--start-only`          | Use start position only       | False       |
| `--end-only`            | Use end position only         | False       |
| `--keep-duplicates`     | Keep and mark duplicates      | False       |
| `--max-dist-frac`       | UMI edit distance threshold   | 0.1         |
| `--max-frequency-ratio` | UMI frequency threshold       | 0.1         |

### Output Format

MarkDup adds BAM tags to track deduplication:

| Tag  | Description                             |
| ---- | --------------------------------------- |
| `cn` | Cluster name (chr:start-end:strand:UMI) |
| `cs` | Cluster size (number of reads)          |
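
A downstream script may want to split the `cn` value back into its parts. The sketch below parses the `chr:start-end:strand:UMI` layout from the table; the `ClusterName` type is hypothetical, and the example assumes all four fields are always present. Splitting from the right keeps contig names that themselves contain `:` intact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterName:
    chrom: str
    start: int
    end: int
    strand: str
    umi: str


def parse_cluster_name(value: str) -> ClusterName:
    """Parse a cluster name of the form chr:start-end:strand:UMI."""
    # rsplit so that a contig name containing ":" is not broken apart.
    chrom, span, strand, umi = value.rsplit(":", 3)
    start, end = span.split("-")
    return ClusterName(chrom, int(start), int(end), strand, umi)
```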

## License

MIT License - see [LICENSE](LICENSE) for details.

 

<p align="center">
  <img
    src="https://raw.githubusercontent.com/y9c/y9c/master/resource/footer_line.svg?sanitize=true"
  />
</p>
<p align="center">
  Copyright &copy; 2025-present
  <a href="https://github.com/y9c" target="_blank">Chang Y</a>
</p>
<p align="center">
  <a href="https://github.com/y9c/markdup/blob/master/LICENSE">
    <img src="https://img.shields.io/static/v1.svg?style=for-the-badge&label=License&message=MIT&logoColor=d9e0ee&colorA=282a36&colorB=c678dd" />
  </a>
</p>

            
