# REMAG
**RE**covery of eukaryotic genomes using contrastive learning. REMAG is a specialized metagenomic binning tool for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.
## Quick Start
```bash
# Clone the repository
git clone https://github.com/danielzmbp/remag.git
cd remag
# Create conda environment and install
conda create -n remag python=3.9
conda activate remag
pip install .
# Run REMAG
remag -f contigs.fasta -b alignments.bam -o output_directory
```
## Installation
### From source
```bash
# Create and activate conda environment
conda create -n remag python=3.9
conda activate remag
# Clone and install
git clone https://github.com/danielzmbp/remag.git
cd remag
pip install .
```
### Development installation
For contributors and developers:
```bash
# Install with development dependencies
pip install -e ".[dev]"
```
### GPU-accelerated installation
For GPU-accelerated clustering (requires NVIDIA GPU):
```bash
# Install with RAPIDS support
pip install "remag[gpu]"
```
## Usage
### Command line interface
After installation, run REMAG from the command line:
```bash
remag -f contigs.fasta -b alignments.bam -o output_directory
```
### Python module mode
```bash
python -m remag -f contigs.fasta -b alignments.bam -o output_directory
```
## How REMAG Works
REMAG runs a six-stage pipeline designed for eukaryotic genome recovery:
1. **Bacterial Pre-filtering**: By default, REMAG filters out bacterial contigs using the integrated 4CAC classifier (disable with `--skip-bacterial-filter`).
2. **Feature Extraction**: Combines k-mer composition (4-mers) with coverage profiles across multiple samples. Large contigs are split into overlapping fragments for augmentation during training.
3. **Contrastive Learning**: Trains a Siamese neural network using the Barlow Twins self-supervised loss function. This produces embeddings in which fragments from the same contig lie close together.
4. **HDBSCAN Clustering**: Applies density-based clustering to the learned contig embeddings to form bins.
5. **Quality Assessment**: Uses miniprot to align bins against a database of eukaryotic core genes to detect contamination.
6. **Iterative Refinement**: Splits contaminated bins based on core gene duplications to improve bin quality.
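The feature extraction step can be illustrated with a minimal sketch in plain Python. It is not REMAG's implementation: the function names, the fragment size, and the step are assumptions chosen for illustration, and reverse-complement k-mers are not merged here.

```python
from collections import Counter
from itertools import product
from typing import List

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 256 possible 4-mers


def kmer_profile(seq: str) -> List[float]:
    """Return normalized 4-mer frequencies; windows with non-ACGT bases are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def overlapping_fragments(seq: str, size: int, step: int) -> List[str]:
    """Split a contig into windows of `size` bp that start every `step` bp."""
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


# Hypothetical toy contig; real contigs are at least --min-contig-length bp long.
fragments = overlapping_fragments("ACGTACGTAC", size=6, step=2)
profiles = [kmer_profile(f) for f in fragments]
```

Each fragment yields a 256-dimensional composition vector. In a binner of this kind, such a vector would be concatenated with per-sample coverage values before being passed to the network.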
## Key Features
- **Automatic Bacterial Filtering**: The 4CAC classifier automatically identifies and removes bacterial sequences before binning
- **Multi-Sample Support**: Can process coverage information from multiple samples (BAM files) simultaneously
- **Barlow Twins Loss**: Uses a self-supervised contrastive learning approach that doesn't require negative pairs
- **Fragment Augmentation**: Large contigs are split into multiple overlapping fragments during training to improve representation learning
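To show why Barlow Twins needs no negative pairs, here is a dependency-free sketch of the loss as defined in the Barlow Twins paper. It is not REMAG's training code: the function names are hypothetical, and the off-diagonal weight `lambd = 0.005` is the paper's value, not necessarily the one REMAG uses.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(z: Matrix) -> Matrix:
    """Scale every embedding dimension to zero mean and unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant column to avoid division by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out


def barlow_twins_loss(z_a: Matrix, z_b: Matrix, lambd: float = 0.005) -> float:
    """Loss for two batches of paired embeddings; row k of each batch is the same contig."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between the two views.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2   # invariance term: diagonal pushed to 1
            else:
                loss += lambd * c_ij ** 2   # redundancy term: off-diagonal pushed to 0
    return loss
```

The loss depends only on the cross-correlation between two views of the same contigs, so no pairs of fragments from different contigs have to be sampled. A real implementation would compute the same quantity with tensor operations in PyTorch.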
## Options
```
  -f, --fasta PATH                Input FASTA file with contigs to bin. Can be gzipped. [required]
  -b, --bam PATH                  Input BAM file(s) for coverage calculation. Must be indexed.
                                  Each BAM represents a sample. Supports space-separated files
                                  or glob patterns (e.g., "*.bam", "sample_*.bam"). Use quotes
                                  around glob patterns.
  -t, --tsv PATH                  Input TSV file(s) with coverage information.
  -o, --output PATH               Output directory for results. [required]
  --epochs INTEGER RANGE          Training epochs for neural network. [default: 400; 50<=x<=2000]
  --batch-size INTEGER RANGE      Batch size for training. [default: 2048; 64<=x<=8192]
  --embedding-dim INTEGER RANGE   Embedding dimension for contrastive learning. [default: 256; 64<=x<=512]
  --base-learning-rate FLOAT RANGE
                                  Base learning rate for optimizer. [default: 0.008; 0.00001<=x<=0.1]
  --min-cluster-size INTEGER RANGE
                                  Minimum fragments per cluster. [default: 2; 2<=x<=100]
  --min-samples INTEGER RANGE     Minimum samples for HDBSCAN core points. [default: None; 1<=x<=100]
  --cluster-selection-epsilon FLOAT RANGE
                                  Epsilon for HDBSCAN cluster selection. [default: 0.0; 0.0<=x<=1.0]
  --min-contig-length INTEGER RANGE
                                  Minimum contig length in bp. [default: 1000; 500<=x<=10000]
  --max-positive-pairs INTEGER RANGE
                                  Maximum positive pairs for contrastive learning. [default: 5000000; 100000<=x<=10000000]
  -c, --cores INTEGER RANGE       Number of CPU cores. [default: 8; 1<=x<=64]
  --min-bin-size INTEGER RANGE    Minimum bin size in bp. [default: 100000; 50000<=x<=10000000]
  -v, --verbose                   Enable verbose logging.
  --skip-bacterial-filter         Skip bacterial contig filtering (4CAC classifier + contrastive learning).
  --skip-refinement               Skip bin refinement.
  --skip-kmeans-filtering         Skip K-means filtering on embeddings.
  --max-refinement-rounds INTEGER RANGE
                                  Maximum refinement rounds. [default: 2; 1<=x<=10]
  --num-augmentations INTEGER RANGE
                                  Number of random fragments per contig. [default: 8; 1<=x<=32]
  --keep-intermediate             Keep intermediate files (training fragments, etc.).
  -h, --help                      Show this message and exit.
```
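When REMAG is launched from a script, the options above can be assembled into an argument list. The sketch below uses only flags listed in the options; the helper name `build_remag_command` and the file names are hypothetical. Because the list is meant to be run without a shell, the glob pattern reaches REMAG unexpanded, which matches the advice to quote glob patterns.

```python
import shlex
from typing import List, Optional


def build_remag_command(fasta: str, bam_pattern: str, outdir: str,
                        cores: int = 8, epochs: Optional[int] = None,
                        skip_refinement: bool = False) -> List[str]:
    """Assemble a remag argument list from documented flags only."""
    cmd = ["remag", "-f", fasta, "-b", bam_pattern, "-o", outdir, "-c", str(cores)]
    if epochs is not None:
        # Range taken from the --epochs entry in the options listing.
        if not 50 <= epochs <= 2000:
            raise ValueError("--epochs must be between 50 and 2000")
        cmd += ["--epochs", str(epochs)]
    if skip_refinement:
        cmd.append("--skip-refinement")
    return cmd


cmd = build_remag_command("contigs.fasta", "sample_*.bam", "remag_out",
                          cores=16, epochs=200)
# Shell-quoted form, convenient for logging the exact command.
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` once REMAG is installed and on the `PATH`.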
## Output
REMAG produces several output files:
### Core output files (always created):
- `bins/`: Directory containing FASTA files for each bin
- `bins.csv`: Final contig-to-bin assignments
- `remag.log`: Detailed log file
- `*_non_bacterial_filtered.fasta`: Filtered FASTA file with bacterial contigs removed (when bacterial filtering is enabled)
### Additional files (with `--keep-intermediate` option):
- `embeddings.csv`: Contig embeddings from the neural network
- `umap_embeddings.csv`: UMAP projections for visualization
- `umap_plot.pdf`: UMAP visualization plot with cluster assignments
- `siamese_model.pt`: Trained Siamese neural network model
- `params.json`: Complete run parameters for reproducibility
- `features.csv`: Extracted k-mer and coverage features
- `fragments.pkl`: Fragment information used during training
- `classification_results.csv`: 4CAC bacterial classification results
- `refinement_summary.json`: Summary of the bin refinement process
- `kmeans_filtering_stats.json`: Statistics from k-means pre-filtering (if enabled)
- `core_gene_duplication_results.json`: Core gene duplication analysis from refinement
- `temp_miniprot/`: Temporary directory for miniprot alignments (removed unless `--keep-intermediate` is set)
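A quick way to inspect a run is to count contigs per bin from `bins.csv`. The sketch below is an assumption-laden example: the column names `contig` and `bin` and the sample rows are hypothetical, so check the header of your own file and adjust `bin_column` accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of bins.csv; the real header names may differ.
example = io.StringIO(
    "contig,bin\n"
    "contig_1,bin_0\n"
    "contig_2,bin_0\n"
    "contig_3,bin_1\n"
)


def contigs_per_bin(handle, bin_column="bin"):
    """Count how many contigs were assigned to each bin."""
    return Counter(row[bin_column] for row in csv.DictReader(handle))


counts = contigs_per_bin(example)
for bin_id, n in counts.most_common():
    print(bin_id, n)
```

For a real run, replace the in-memory example with `open("output_directory/bins.csv", newline="")`.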
## Requirements
- Python 3.8+
- PyTorch (≥1.11.0)
- scikit-learn (≥1.0.0)
- XGBoost (≥1.6.0) - for 4CAC classifier
- HDBSCAN (≥0.8.28)
- UMAP (`umap-learn` ≥0.5.0)
- pandas (≥1.3.0)
- numpy (≥1.21.0)
- matplotlib (≥3.5.0)
- pysam (≥0.18.0)
- loguru (≥0.6.0)
- tqdm (≥4.62.0)
- rich-click (≥1.5.0)
- joblib (≥1.1.0)
- psutil (≥5.8.0)
The package includes a pre-trained 4CAC classifier model for bacterial contig filtering. The 4CAC classifier code and models are adapted from the [Shamir-Lab/4CAC repository](https://github.com/Shamir-Lab/4CAC).
## Acknowledgments
The integrated 4CAC classifier (`xgbclass` module) is adapted from the work by Shamir Lab:
- **Repository**: [Shamir-Lab/4CAC](https://github.com/Shamir-Lab/4CAC)
- **Paper**: Pu L, Shamir R. 4CAC: 4-class classifier of metagenome contigs using machine learning and assembly graphs. Nucleic Acids Res. 2024;52(19):e94–e94.
## License
MIT License - see LICENSE file for details.
## Citation
If you use REMAG in your research, please cite:
```
[Citation information will be added when available]
```