# REMAG

- PyPI package: `remag`, version 0.1.0 (uploaded 2025-07-26 10:06:10)
- Summary: Metagenomic binning using neural networks and contrastive learning
- Author: Daniel Gómez-Pérez
- Requires Python: >=3.8
- License: MIT
- Keywords: metagenomics, binning, neural networks, contrastive learning, bioinformatics
- Repository: https://github.com/danielzmbp/remag

**RE**covery of eukaryotic genomes using contrastive learning. A specialized metagenomic binning tool designed for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.

## Quick Start

```bash
# Clone the repository
git clone https://github.com/danielzmbp/remag.git
cd remag

# Create conda environment and install
conda create -n remag python=3.9
conda activate remag
pip install .

# Run REMAG
remag -f contigs.fasta -b alignments.bam -o output_directory
```

## Installation

### From source

```bash
# Create and activate conda environment
conda create -n remag python=3.9
conda activate remag

# Clone and install
git clone https://github.com/danielzmbp/remag.git
cd remag
pip install .
```

### Development installation

For contributors and developers:

```bash
# Install with development dependencies
pip install -e ".[dev]"
```

### GPU-accelerated installation

For GPU-accelerated clustering (requires NVIDIA GPU):

```bash
# Install with RAPIDS support
pip install "remag[gpu]"
```

## Usage

### Command line interface

After installation, you can use REMAG via the command line:

```bash
remag -f contigs.fasta -b alignments.bam -o output_directory
```

### Python module mode

```bash
python -m remag -f contigs.fasta -b alignments.bam -o output_directory
```

## How REMAG Works

REMAG runs a six-stage pipeline designed for eukaryotic genome recovery:

1. **Bacterial Pre-filtering**: By default, REMAG automatically filters out bacterial contigs using the integrated 4CAC classifier (can be disabled with `--skip-bacterial-filter`)
2. **Feature Extraction**: Combines k-mer composition (4-mers) with coverage profiles across multiple samples. Large contigs are split into overlapping fragments for augmentation during training
3. **Contrastive Learning**: Trains a Siamese neural network using the Barlow Twins self-supervised loss function. This creates embeddings where fragments from the same contig are close together
4. **HDBSCAN Clustering**: Density-based clustering on the learned contig embeddings to form bins
5. **Quality Assessment**: Uses miniprot to align bins against a database of eukaryotic core genes to detect contamination
6. **Iterative Refinement**: Automatically splits contaminated bins based on core gene duplications to improve bin quality
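
The feature extraction step above can be illustrated with a minimal sketch. It is not REMAG's actual implementation: the fragment `size` and `step` values, the function names, and the choice to count all 256 4-mers without merging reverse complements are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from typing import List

K = 4
# All 256 possible 4-mers over A, C, G, T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_frequencies(seq: str) -> List[float]:
    """Return the normalized 4-mer frequency vector of a sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Only unambiguous k-mers are counted; windows containing N are skipped.
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


def overlapping_fragments(seq: str, size: int, step: int) -> List[str]:
    """Split a contig into windows of `size` bp that overlap when step < size."""
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


contig = "ACGT" * 10  # toy 40 bp contig, hypothetical input
fragments = overlapping_fragments(contig, size=20, step=10)
profile = kmer_frequencies(fragments[0])
print(len(fragments), len(profile))  # 3 fragments, 256 features each
```

In a real run each fragment's composition vector would be combined with its per-sample coverage before being passed to the network; that part is omitted here.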

## Key Features

- **Automatic Bacterial Filtering**: The 4CAC classifier automatically identifies and removes bacterial sequences before binning
- **Multi-Sample Support**: Can process coverage information from multiple samples (BAM files) simultaneously
- **Barlow Twins Loss**: Uses a self-supervised contrastive learning approach that doesn't require negative pairs
- **Fragment Augmentation**: Large contigs are split into multiple overlapping fragments during training to improve representation learning
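
To make the "no negative pairs" point concrete, here is a minimal NumPy sketch of the Barlow Twins objective. It is a generic illustration, not REMAG's training code: the function name, the `lambd` weight, and the toy embeddings are assumptions. The loss pushes the cross-correlation matrix between two views of the same batch toward the identity, so only positive pairs (two fragments of the same contig) are needed.

```python
import numpy as np


def barlow_twins_loss(z_a: np.ndarray, z_b: np.ndarray, lambd: float = 5e-3) -> float:
    """Barlow Twins loss for two (batch, dim) arrays of paired embeddings."""
    n = z_a.shape[0]
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-9)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-9)
    c = z_a.T @ z_b / n  # (dim, dim) cross-correlation matrix
    diag = np.diagonal(c)
    on_diag = ((diag - 1.0) ** 2).sum()            # invariance term
    off_diag = (c ** 2).sum() - (diag ** 2).sum()  # redundancy-reduction term
    return float(on_diag + lambd * off_diag)


rng = np.random.default_rng(0)
view_a = rng.normal(size=(64, 8))                   # toy embeddings of one view
view_b = view_a + 0.05 * rng.normal(size=(64, 8))   # slightly perturbed second view
unrelated = rng.normal(size=(64, 8))                # embeddings with no pairing

loss_paired = barlow_twins_loss(view_a, view_b)
loss_unrelated = barlow_twins_loss(view_a, unrelated)
print(loss_paired < loss_unrelated)
```

Correctly paired views should give a much lower loss than unrelated ones, which is the signal the network is trained on.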

## Options

```
  -f, --fasta PATH                Input FASTA file with contigs to bin. Can be gzipped.  [required]
  -b, --bam PATH                  Input BAM file(s) for coverage calculation. Must be indexed. Each BAM represents a sample. Supports space-separated files or glob patterns (e.g., "*.bam", "sample_*.bam"). Use quotes around glob patterns.
  -t, --tsv PATH                  Input TSV file(s) with coverage information.
  -o, --output PATH               Output directory for results.  [required]
  --epochs INTEGER RANGE          Training epochs for neural network.  [default: 400; 50<=x<=2000]
  --batch-size INTEGER RANGE      Batch size for training.  [default: 2048; 64<=x<=8192]
  --embedding-dim INTEGER RANGE   Embedding dimension for contrastive learning.  [default: 256; 64<=x<=512]
  --base-learning-rate FLOAT RANGE
                                  Base learning rate for optimizer.  [default: 0.008; 0.00001<=x<=0.1]
  --min-cluster-size INTEGER RANGE
                                  Minimum fragments per cluster.  [default: 2; 2<=x<=100]
  --min-samples INTEGER RANGE     Minimum samples for HDBSCAN core points.  [default: None; 1<=x<=100]
  --cluster-selection-epsilon FLOAT RANGE
                                  Epsilon for HDBSCAN cluster selection.  [default: 0.0; 0.0<=x<=1.0]
  --min-contig-length INTEGER RANGE
                                  Minimum contig length in bp.  [default: 1000; 500<=x<=10000]
  --max-positive-pairs INTEGER RANGE
                                  Maximum positive pairs for contrastive learning.  [default: 5000000; 100000<=x<=10000000]
  -c, --cores INTEGER RANGE       Number of CPU cores.  [default: 8; 1<=x<=64]
  --min-bin-size INTEGER RANGE    Minimum bin size in bp.  [default: 100000; 50000<=x<=10000000]
  -v, --verbose                   Enable verbose logging.
  --skip-bacterial-filter         Skip bacterial contig filtering (4CAC classifier + contrastive learning).
  --skip-refinement               Skip bin refinement.
  --skip-kmeans-filtering         Skip K-means filtering on embeddings.
  --max-refinement-rounds INTEGER RANGE
                                  Maximum refinement rounds.  [default: 2; 1<=x<=10]
  --num-augmentations INTEGER RANGE
                                  Number of random fragments per contig.  [default: 8; 1<=x<=32]
  --keep-intermediate             Keep intermediate files (training fragments, etc.).
  -h, --help                      Show this message and exit.
```
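
When REMAG is driven from a script, the command can be assembled as an argument list. The sketch below is one possible wrapper, not part of REMAG: `build_remag_command` and `RANGES` are hypothetical names, and only a few of the ranges from the help text are copied in. Passing the list to `subprocess.run` without a shell delivers a glob pattern to REMAG unexpanded, which has the same effect as quoting it on the command line.

```python
import shlex
from typing import Dict, List, Optional

# Ranges copied from the help text above: option -> (min, max).
RANGES = {
    "--epochs": (50, 2000),
    "--batch-size": (64, 8192),
    "--cores": (1, 64),
}


def build_remag_command(
    fasta: str,
    bam: str,
    output: str,
    options: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Build a remag argument list, checking values against RANGES."""
    cmd = ["remag", "-f", fasta, "-b", bam, "-o", output]
    for flag, value in (options or {}).items():
        low, high = RANGES[flag]  # only options listed in RANGES are supported here
        if not low <= value <= high:
            raise ValueError(f"{flag}={value} is outside {low}<=x<={high}")
        cmd += [flag, str(value)]
    return cmd


cmd = build_remag_command(
    "contigs.fasta",
    "sample_*.bam",  # glob pattern, left for REMAG to expand
    "output_directory",
    {"--epochs": 200, "--cores": 16},
)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```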

## Output

REMAG produces several output files:

### Core output files (created unless noted):
- `bins/`: Directory containing FASTA files for each bin
- `bins.csv`: Final contig-to-bin assignments
- `remag.log`: Detailed log file
- `*_non_bacterial_filtered.fasta`: Filtered FASTA file with bacterial contigs removed (when bacterial filtering is enabled)

### Additional files (with `--keep-intermediate` option):
- `embeddings.csv`: Contig embeddings from the neural network
- `umap_embeddings.csv`: UMAP projections for visualization
- `umap_plot.pdf`: UMAP visualization plot with cluster assignments
- `siamese_model.pt`: Trained Siamese neural network model
- `params.json`: Complete run parameters for reproducibility
- `features.csv`: Extracted k-mer and coverage features
- `fragments.pkl`: Fragment information used during training
- `classification_results.csv`: 4CAC bacterial classification results
- `refinement_summary.json`: Summary of the bin refinement process
- `kmeans_filtering_stats.json`: Statistics from k-means pre-filtering (if enabled)
- `core_gene_duplication_results.json`: Core gene duplication analysis from refinement
- `temp_miniprot/`: Temporary directory for miniprot alignments (removed unless `--keep-intermediate` is set)
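
For downstream analysis, `bins.csv` can be grouped by bin with the standard library. The column layout of `bins.csv` is not documented above, so the column names `contig` and `bin` in this sketch are assumptions and are exposed as parameters; check the header of your own file first. The sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def contigs_per_bin(
    handle: TextIO,
    contig_col: str = "contig",  # assumed column name
    bin_col: str = "bin",        # assumed column name
) -> Dict[str, List[str]]:
    """Group contig names by their assigned bin."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle):
        bins[row[bin_col]].append(row[contig_col])
    return dict(bins)


# Made-up example content; a real run would use open("output_directory/bins.csv").
sample = io.StringIO("contig,bin\nc1,bin_0\nc2,bin_0\nc3,bin_1\n")
assignments = contigs_per_bin(sample)
print({name: len(contigs) for name, contigs in assignments.items()})
```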


## Requirements

- Python 3.8+
- PyTorch (≥1.11.0)
- scikit-learn (≥1.0.0)
- XGBoost (≥1.6.0) - for 4CAC classifier
- HDBSCAN (≥0.8.28)
- umap-learn (≥0.5.0)
- pandas (≥1.3.0)
- numpy (≥1.21.0)
- matplotlib (≥3.5.0)
- pysam (≥0.18.0)
- loguru (≥0.6.0)
- tqdm (≥4.62.0)
- rich-click (≥1.5.0)
- joblib (≥1.1.0)
- psutil (≥5.8.0)

The package includes a pre-trained 4CAC classifier model for bacterial contig filtering. The 4CAC classifier code and models are adapted from the [Shamir-Lab/4CAC repository](https://github.com/Shamir-Lab/4CAC).

## Acknowledgments

The integrated 4CAC classifier (`xgbclass` module) is adapted from the work by Shamir Lab:

- **Repository**: [Shamir-Lab/4CAC](https://github.com/Shamir-Lab/4CAC)
- **Paper**: Pu L, Shamir R. 4CAC: 4-class classifier of metagenome contigs using machine learning and assembly graphs. Nucleic Acids Res. 2024;52(19):e94.
   

## License

MIT License - see LICENSE file for details.

## Citation

If you use REMAG in your research, please cite:

```
[Citation information will be added when available]
```

            
