# MarkDup
[PyPI](https://pypi.python.org/pypi/markdup)
[Downloads](https://pepy.tech/project/markdup)
[GitHub](https://github.com/y9c/markdup)
**A fast, accurate BAM deduplication tool with intelligent UMI detection and correct fragment detection.**
## Why MarkDup?
Existing BAM deduplication tools share three recurring problems: duplicates are misidentified because fragment positions and strand are computed incorrectly, performance is poor (especially for UMI-based deduplication), and UMI clustering is too permissive, which leads to over-merging.
MarkDup solves these problems with:
- **Correct fragment detection** using proper strand-aware coordinate handling
- **Significantly faster processing** through optimized algorithms and parallel processing
- **Smart UMI clustering** that prevents over-merging with frequency-aware algorithms
- **Automatic detection** that handles both UMI and non-UMI data without requiring different tools
## Quick Start
```bash
# Installation (requires Python >= 3.10)
pip install markdup
```
```bash
# Basic usage (auto-detects everything)
markdup input.bam output.bam
# With multiple threads
markdup input.bam output.bam --threads 8
# Force coordinate-based (no UMIs)
markdup input.bam output.bam --no-umi
```
## Key Features
### 🧬 **Correct Fragment Detection**
- **Strand-aware coordinates**: Properly handles forward/reverse strand reads
- **CIGAR-aware positioning**: Correctly processes indels and complex alignments
- **Biological positioning**: Uses 5'/3' positions, not reference positions
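
To illustrate what strand-aware, CIGAR-aware positioning means, here is a minimal sketch that derives a read's 5' coordinate from its leftmost aligned position, CIGAR string, and strand. It is not MarkDup's actual implementation: the function name `five_prime_position` and the choice to extend through clipped bases are assumptions made for illustration.

```python
import re

# CIGAR operations, as defined in the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def five_prime_position(pos, cigar, is_reverse):
    """Return the unclipped 5' coordinate of a read (hypothetical helper).

    pos is the 0-based leftmost aligned reference position.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

    if not is_reverse:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        leading = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            leading += n
        return pos - leading

    # Reverse strand: the 5' end is the rightmost base, so walk the
    # reference-consuming operations and then add trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += n
    return pos + ref_len - 1 + trailing


# Two forward reads of the same fragment: one is soft-clipped, so its
# reported start differs, but the 5' coordinate agrees.
print(five_prime_position(100, "50M", False))     # 100
print(five_prime_position(103, "3S47M", False))   # 100
# A reverse read with a deletion: the 5' end is at the right-hand side.
print(five_prime_position(100, "20M2D10M3I15M", True))  # 146
```

Comparing reads by this coordinate rather than by the reported leftmost position keeps soft-clipped and reverse-strand duplicates in the same group.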
### ⚡ **High Performance**
- **Significantly faster** UMI clustering with optimized algorithms
- **Parallel processing**: Multi-core support for large files
- **Memory efficient**: Window-based processing for large datasets
### 🔄 **Automatic Detection**
- **UMI auto-detection**: Finds UMIs in read names or BAM tags
- **Sequencing type detection**: Automatically detects single-end vs paired-end
- **Quality metrics**: Selects the best quality criteria automatically
### 🎯 **Smart UMI Clustering**
- **Frequency-aware**: Prevents over-clustering of high-frequency UMIs
- **Edit distance**: Configurable similarity thresholds
- **Exact matching**: Handles identical UMIs efficiently
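
The idea behind frequency-aware clustering can be sketched as follows. This is an illustration under stated assumptions, not MarkDup's algorithm: it uses Hamming distance as a simple stand-in for edit distance, a greedy pass from the most to the least abundant UMI, and parameter names (`max_dist_frac`, `max_frequency_ratio`) borrowed from the option table, with semantics that are assumed here.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umis, max_dist_frac=0.1, max_frequency_ratio=0.1):
    """Map each distinct UMI to the representative of its cluster.

    A UMI joins an existing cluster only if it is both similar to the
    representative AND much rarer than it (assumed semantics).
    """
    counts = Counter(umis)
    # Most abundant first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))

    seeds = []
    assignment = {}
    for umi in ordered:
        max_dist = int(max_dist_frac * len(umi))
        for seed in seeds:
            close = len(seed) == len(umi) and hamming(seed, umi) <= max_dist
            rare = counts[umi] <= max_frequency_ratio * counts[seed]
            if close and rare:
                assignment[umi] = seed
                break
        else:
            # No suitable cluster: this UMI starts its own.
            seeds.append(umi)
            assignment[umi] = umi
    return assignment


umis = ["AAAAAAAAAA"] * 50 + ["AAAAAAAAAC"] * 30 + ["AAAAAAAAAT"] * 2
for umi, rep in sorted(cluster_umis(umis).items()):
    print(umi, "->", rep)
```

In this toy input, the rare `AAAAAAAAAT` (2 reads) is absorbed as a likely sequencing error of `AAAAAAAAAA` (50 reads), while `AAAAAAAAAC` (30 reads) stays separate even though it is also one mismatch away, because it is too abundant to be explained as an error.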
## Performance
- **Significantly faster** UMI clustering with optimized algorithms
- **Multi-core processing** for parallel performance
- **Memory efficient** window-based processing for large files
- **Automatic optimization** based on input data characteristics
## How It Works
1. **Auto-detect**: UMI presence, sequencing type, and quality metrics
2. **Group fragments**: By biological position and strand
3. **Cluster UMIs**: Using edit distance and frequency-aware algorithms
4. **Select best**: The highest quality read from each cluster
5. **Output**: Deduplicated reads with cluster information
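
Steps 2 and 4 above can be sketched on toy records as shown below. This is a simplified illustration, not the tool's code: reads are plain dictionaries, the field names (`pos5`, `qual`, `cluster_size`) are hypothetical, UMIs are matched exactly, and a single numeric score stands in for whatever quality criterion MarkDup selects.

```python
from collections import defaultdict


def deduplicate(reads):
    """Keep the highest-scoring read per (chrom, 5' position, strand, UMI)."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos5"], read["strand"], read["umi"])
        groups[key].append(read)

    kept = []
    for members in groups.values():
        # max() returns the first read on ties, so input order breaks ties.
        best = max(members, key=lambda r: r["qual"])
        kept.append({**best, "cluster_size": len(members)})
    return kept


reads = [
    {"name": "r1", "chrom": "chr1", "pos5": 100, "strand": "+", "umi": "ACGT", "qual": 30},
    {"name": "r2", "chrom": "chr1", "pos5": 100, "strand": "+", "umi": "ACGT", "qual": 37},
    {"name": "r3", "chrom": "chr1", "pos5": 100, "strand": "-", "umi": "ACGT", "qual": 35},
]
for read in deduplicate(reads):
    print(read["name"], read["cluster_size"])
```

Here `r1` and `r2` share position, strand, and UMI, so only the higher-scoring `r2` is kept, while `r3` survives because it lies on the opposite strand.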
## Documentation
### Usage Examples
#### Basic Deduplication
```bash
# Auto-detect UMIs and process
markdup input.bam output.bam
# Force coordinate-based method (no UMIs)
markdup input.bam output.bam --no-umi
```
#### Advanced Options
```bash
# Custom UMI settings
markdup input.bam output.bam --umi-tag UB --min-edit-dist-frac 0.15
# Start-only positioning (useful for ChIP-seq)
markdup input.bam output.bam --start-only
# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates
```
### Performance Tuning
```bash
# Use 8 threads
markdup input.bam output.bam --threads 8
# Use larger windows for better performance
markdup input.bam output.bam --window-size 200000
```
### Detailed Documentation
- [Installation Guide](docs/installation.md) - How to install MarkDup
- [Usage Guide](docs/usage.md) - How to use MarkDup
- [Algorithm Details](docs/algorithm.md) - How MarkDup works and fixes existing problems
- [FAQ](docs/faq.md) - Frequently asked questions
- [Contributing](docs/contributing.md) - How to contribute
### Command Line Options
| Option | Description | Default |
| ----------------------- | ----------------------------- | ----------- |
| `INPUT_BAM` | Input BAM file | Required |
| `OUTPUT_BAM` | Output BAM file | Required |
| `--threads` | Number of threads | 1 |
| `--no-umi` | Force coordinate-based method | Auto-detect |
| `--umi-tag` | UMI BAM tag (e.g., UB) | Auto-detect |
| `--start-only` | Use start position only | False |
| `--end-only` | Use end position only | False |
| `--keep-duplicates` | Keep and mark duplicates | False |
| `--max-dist-frac` | UMI edit distance threshold | 0.1 |
| `--max-frequency-ratio` | UMI frequency threshold | 0.1 |
### Output Format
MarkDup adds BAM tags to track deduplication:
| Tag | Description |
| ---- | --------------------------------------- |
| `cn` | Cluster name (chr:start-end:strand:UMI) |
| `cs` | Cluster size (number of reads) |
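
For downstream analysis, the `cn` value can be split back into its fields. The sketch below assumes the layout shown in the table (`chr:start-end:strand:UMI`), that strand is written as `+` or `-`, and that a UMI field is always present; none of these details is confirmed beyond the table, and the helper name `parse_cluster_name` and the example values are hypothetical.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    umi: str


def parse_cluster_name(cn):
    """Split a cluster name of the assumed form chr:start-end:strand:UMI."""
    # Split from the right so contig names that contain ':' stay intact.
    chrom, span, strand, umi = cn.rsplit(":", 3)
    start, end = span.split("-")
    return Cluster(chrom, int(start), int(end), strand, umi)


print(parse_cluster_name("chr1:1000-1150:+:ACGTACGT"))
print(parse_cluster_name("HLA-A*01:01:100-200:-:AAAA").chrom)
```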
## License
MIT License - see [LICENSE](LICENSE) for details.
<p align="center">
<img
src="https://raw.githubusercontent.com/y9c/y9c/master/resource/footer_line.svg?sanitize=true"
/>
</p>
<p align="center">
Copyright © 2025-present
<a href="https://github.com/y9c" target="_blank">Chang Y</a>
</p>
<p align="center">
<a href="https://github.com/y9c/markdup/blob/master/LICENSE">
<img src="https://img.shields.io/static/v1.svg?style=for-the-badge&label=License&message=MIT&logoColor=d9e0ee&colorA=282a36&colorB=c678dd" />
</a>
</p>