biobatchnet

- Name: biobatchnet
- Version: 0.1.4
- Summary: A deep learning framework for batch effect correction in biological data
- Upload time: 2025-09-09 20:04:06
- Author: Haochen Liu
- Requires Python: >=3.9
- License: MIT
- Keywords: batch-effect, deep-learning, single-cell, imc, scrna-seq, bioinformatics

# BioBatchNet

[![PyPI version](https://badge.fury.io/py/biobatchnet.svg)](https://badge.fury.io/py/biobatchnet)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

BioBatchNet is a deep learning framework for batch effect correction in biological data, supporting both single-cell RNA-seq (scRNA-seq) and Imaging Mass Cytometry (IMC) data.

## Installation

### From PyPI (Recommended)
```bash
pip install biobatchnet
```

### From Source
```bash
git clone https://github.com/Manchester-HealthAI/BioBatchNet
cd BioBatchNet
pip install -e .
```

### Prerequisites
```bash
conda create -n biobatchnet python=3.10
conda activate biobatchnet
pip install torch numpy pandas anndata
```

## Quick Start

### Basic Usage

```python
import BioBatchNet
from BioBatchNet import correct_batch_effects
import anndata as ad

# Load your data
adata = ad.read_h5ad('your_data.h5ad')
X = adata.X
batch_labels = adata.obs['batch'].values

# Correct batch effects with one line
corrected_data = correct_batch_effects(
    data=X,
    batch_info=batch_labels,
    data_type='imc',  # or 'scrna' for single-cell RNA-seq
    epochs=100
)

# Add corrected data back to AnnData
adata.layers['corrected'] = corrected_data
```
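
`adata.X` is often stored as a SciPy sparse matrix, while the Data Formats section lists NumPy arrays and PyTorch tensors as accepted inputs. Whether `correct_batch_effects` accepts sparse input is not stated here, so one cautious option is to densify first. The sketch below is an assumption-laden helper: the name `to_dense_array` is hypothetical and not part of BioBatchNet, and the `float32` default is a guess at what a PyTorch-based model would prefer.

```python
import numpy as np

def to_dense_array(X, dtype=np.float32):
    """Return X as a dense 2-D NumPy array (hypothetical helper, not part of BioBatchNet)."""
    # SciPy sparse matrices (a common type for adata.X) expose .toarray()
    if hasattr(X, "toarray"):
        X = X.toarray()
    X = np.asarray(X, dtype=dtype)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D (n_cells, n_features) matrix, got shape {X.shape}")
    return X
```

With this in place, `X = to_dense_array(adata.X)` can replace `X = adata.X` in the example above. Densifying a very large sparse matrix can use a lot of memory, so check the matrix size first.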

## Advanced Usage

### Custom Loss Weights

For IMC data with specific batch characteristics:

```python
# Define custom loss weights
loss_weights = {
    'recon_loss': 10,        # Reconstruction loss weight
    'discriminator': 0.3,    # Adversarial loss weight
    'classifier': 1,         # Batch classifier loss weight
    'mmd_loss_1': 0,        # MMD loss weight
    'kl_loss_1': 0.005,     # KL divergence weight for bio encoder
    'kl_loss_2': 0.1,       # KL divergence weight for batch encoder
    'ortho_loss': 0.01      # Orthogonality loss weight
}

corrected_data = correct_batch_effects(
    data=X,
    batch_info=batch_labels,
    data_type='imc',
    latent_dim=20,
    epochs=100,
    loss_weights=loss_weights
)
```

### Custom Architecture Parameters

Fine-tune the neural network architecture:

```python
corrected_data = correct_batch_effects(
    data=X,
    batch_info=batch_labels,
    data_type='imc',
    latent_dim=20,
    epochs=100,
    bio_encoder_hidden_layers=[500, 2000, 2000],
    batch_encoder_hidden_layers=[500],
    decoder_hidden_layers=[2000, 2000, 500],
    batch_classifier_layers_power=[500, 2000, 2000],
    batch_classifier_layers_weak=[128]
)
```

### Direct Model Access

For more control, use the models directly:

```python
from BioBatchNet import IMCVAE, GeneVAE

# For IMC data
model = IMCVAE(
    in_sz=40,           # Number of features
    out_sz=40,          # Output dimension
    num_batch=4,        # Number of batches
    latent_sz=20,       # Latent dimension
    bio_encoder_hidden_layers=[500, 2000, 2000],
    batch_encoder_hidden_layers=[500],
    decoder_hidden_layers=[2000, 2000, 500],
    batch_classifier_layers_power=[500, 2000, 2000],
    batch_classifier_layers_weak=[128]
)

# Train the model
model.fit(data, batch_labels, epochs=100, lr=1e-3)

# Get biological embeddings (batch-corrected representations)
bio_embeddings = model.get_bio_embeddings(data)

# Get corrected data
corrected_data = model.correct_batch_effects(data)
```

## Data Formats

### Input Data Requirements

1. **Data Matrix**: 
   - NumPy array or PyTorch tensor
   - Shape: (n_cells, n_features)
   - For scRNA-seq: gene expression matrix
   - For IMC: protein expression matrix

2. **Batch Information**:
   - NumPy array or list
   - Can be string labels (e.g., ['Patient1', 'Patient2']) or numeric
   - Length must match number of cells

### Output

- **Corrected Data**: Same shape as input, with batch effects removed
- **Bio Embeddings**: Low-dimensional biological representations (n_cells, latent_dim)
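
The input requirements above can be checked before training starts. The following is a minimal pre-flight sketch using only NumPy; `check_inputs` is a hypothetical helper, not a BioBatchNet function, and the integer encoding it returns is shown for illustration only, since the library is documented as accepting string labels directly.

```python
import numpy as np

def check_inputs(data, batch_info):
    """Validate shapes and map batch labels to integer codes (illustrative helper)."""
    data = np.asarray(data)
    labels = np.asarray(batch_info)
    if data.ndim != 2:
        raise ValueError("data must have shape (n_cells, n_features)")
    if labels.ndim != 1 or labels.shape[0] != data.shape[0]:
        raise ValueError(
            f"expected {data.shape[0]} batch labels (one per cell), got shape {labels.shape}"
        )
    # np.unique sorts the distinct labels; `codes` indexes into that sorted array
    batches, codes = np.unique(labels, return_inverse=True)
    return data, codes, batches

data = np.random.default_rng(0).random((4, 3))
_, codes, batches = check_inputs(data, ["Patient2", "Patient1", "Patient2", "Patient1"])
print(codes.tolist())    # [1, 0, 1, 0]
print(batches.tolist())  # ['Patient1', 'Patient2']
```

`len(batches)` gives the number of distinct batches, which is the value the Direct Model Access example passes as `num_batch`.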

## Recommended Parameters

### For IMC Data

```python
# Small dataset (< 10 batches)
loss_weights = {
    'recon_loss': 10,
    'discriminator': 0.3,
    'classifier': 1,
    'mmd_loss_1': 0,
    'kl_loss_1': 0.005,
    'kl_loss_2': 0.1,
    'ortho_loss': 0.01
}

# Large dataset (> 30 batches)
loss_weights = {
    'recon_loss': 10,
    'discriminator': 0.1,
    'classifier': 1,
    'mmd_loss_1': 0.01,
    'kl_loss_1': 0.0,
    'kl_loss_2': 0.1,
    'ortho_loss': 0.01
}
```
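
The two IMC presets above can be selected programmatically from the batch count. This is a sketch under stated assumptions: `pick_imc_loss_weights` is a hypothetical helper, not part of BioBatchNet, and the presets above give no guidance for datasets with 10 to 30 batches, so the fallback used for that range is an arbitrary choice that should be tuned.

```python
# Presets copied from the two IMC blocks above
IMC_SMALL = {
    'recon_loss': 10, 'discriminator': 0.3, 'classifier': 1, 'mmd_loss_1': 0,
    'kl_loss_1': 0.005, 'kl_loss_2': 0.1, 'ortho_loss': 0.01,
}
IMC_LARGE = {
    'recon_loss': 10, 'discriminator': 0.1, 'classifier': 1, 'mmd_loss_1': 0.01,
    'kl_loss_1': 0.0, 'kl_loss_2': 0.1, 'ortho_loss': 0.01,
}

def pick_imc_loss_weights(n_batches):
    """Choose an IMC preset by batch count (hypothetical helper, not part of BioBatchNet)."""
    if n_batches < 10:
        return dict(IMC_SMALL)
    if n_batches > 30:
        return dict(IMC_LARGE)
    # ASSUMPTION: no preset is documented for 10-30 batches; starting from the
    # small-dataset preset is a placeholder, not a recommendation from the authors.
    return dict(IMC_SMALL)

loss_weights = pick_imc_loss_weights(n_batches=4)
```

The function returns a copy, so editing the returned dictionary does not change the presets.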

### For scRNA-seq Data

```python
loss_weights = {
    'recon_loss': 10,
    'discriminator': 0.04,
    'classifier': 1,
    'kl_loss_1': 1e-7,
    'kl_loss_2': 0.01,
    'ortho_loss': 0.0002,
    'mmd_loss_1': 0,
    'kl_loss_size': 0.002
}
```

## Example Workflow

```python
import BioBatchNet
import anndata as ad
import numpy as np
from BioBatchNet import correct_batch_effects

# 1. Load data
adata = ad.read_h5ad('IMMUcan_batch.h5ad')
print(f"Data shape: {adata.shape}")
print(f"Batches: {adata.obs['BATCH'].unique()}")

# 2. Prepare data
X = adata.X
batch_labels = adata.obs['BATCH'].values

# Convert categorical to numpy array if needed
if hasattr(batch_labels, 'to_numpy'):
    batch_labels = batch_labels.to_numpy()

# 3. Correct batch effects
corrected = correct_batch_effects(
    data=X,
    batch_info=batch_labels,
    data_type='imc',
    latent_dim=20,
    epochs=100
)

# 4. Store results
adata.layers['corrected'] = corrected

# 5. Optional: Get embeddings for visualization
# (this trains a separate model from the one used inside correct_batch_effects in step 3)
from BioBatchNet import IMCVAE
model = IMCVAE(
    in_sz=X.shape[1],
    out_sz=X.shape[1],
    num_batch=len(np.unique(batch_labels)),
    latent_sz=20
)
model.fit(X, batch_labels, epochs=100)
embeddings = model.get_bio_embeddings(X)
adata.obsm['X_biobatchnet'] = embeddings

# 6. Visualize results (using scanpy)
import scanpy as sc
sc.pp.neighbors(adata, use_rep='X_biobatchnet')
sc.tl.umap(adata)
# 'celltype' is assumed to be an existing column in adata.obs
sc.pl.umap(adata, color=['BATCH', 'celltype'])
```

## Tips and Best Practices

1. **Data Preprocessing**: 
   - Ensure data is properly normalized before batch correction
   - For scRNA-seq: consider log-transformation
   - For IMC: consider arcsinh transformation

2. **Batch Size**:
   - Default batch size is 256
   - Reduce if encountering memory issues
   - Increase for faster training with sufficient memory

3. **Number of Epochs**:
   - Start with 100 epochs for initial testing
   - Increase to 200-500 for final results
   - Monitor loss convergence

4. **Latent Dimension**:
   - Default is 20
   - Increase for complex datasets with many cell types
   - Decrease for simpler datasets

5. **Post-processing**:
   - Output may need scaling/normalization
   - Consider clipping extreme values
   - Validate biological signals are preserved
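
The transformations named under Data Preprocessing can be sketched with NumPy alone. Both helpers are illustrative and not part of BioBatchNet. The arcsinh cofactor of 5 is a convention borrowed from mass cytometry analysis and is an assumption here, and the sketch does not perform library-size normalization.

```python
import numpy as np

def log_transform(counts):
    """log(1 + x) transform, commonly applied to normalized scRNA-seq counts."""
    return np.log1p(np.asarray(counts, dtype=np.float64))

def arcsinh_transform(intensities, cofactor=5.0):
    """arcsinh(x / cofactor); the cofactor is an assumption to tune per dataset."""
    return np.arcsinh(np.asarray(intensities, dtype=np.float64) / cofactor)
```

Apply one of these to the expression matrix before passing it as `data`, and keep the untransformed values in a separate AnnData layer if you need them later.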

## Troubleshooting

### Memory Issues
```python
# Reduce batch size
corrected = correct_batch_effects(
    data=X,
    batch_info=batch_labels,
    batch_size=64  # Smaller batch size
)
```

### Convergence Issues
```python
# Adjust learning rate
model.fit(data, batch_labels, lr=1e-4)  # Lower learning rate
```
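
To judge whether training has converged, it helps to compare recent losses against earlier ones. How BioBatchNet exposes per-epoch losses is not documented here, so the sketch below assumes you have already collected them in a plain Python list; `has_plateaued` and its default thresholds are hypothetical.

```python
def has_plateaued(losses, window=10, rel_tol=1e-3):
    """Return True when the mean loss over the last `window` epochs improved by
    less than `rel_tol` (relative) compared with the `window` epochs before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    improvement = previous - recent
    return improvement < rel_tol * abs(previous)
```

A plateau at a high loss suggests adjusting the learning rate or loss weights, whereas a loss that is still falling suggests training for more epochs.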

### Large Output Range
```python
import numpy as np

# Post-process corrected data: clip to [0, global 99th percentile]
corrected = correct_batch_effects(data=X, batch_info=batch_labels)
corrected = np.clip(corrected, 0, np.percentile(corrected, 99))
```
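
The snippet above uses a single global percentile, so one high-intensity feature can set the cap for every feature. A per-feature variant is one alternative. This is a sketch: `clip_per_feature` is a hypothetical helper, not part of BioBatchNet, and whether per-feature clipping suits your data is a judgment call.

```python
import numpy as np

def clip_per_feature(values, upper_percentile=99.0):
    """Clip each column to [0, its own upper percentile] (illustrative helper)."""
    values = np.asarray(values, dtype=np.float64)
    upper = np.percentile(values, upper_percentile, axis=0)  # one cap per feature
    return np.clip(values, 0.0, upper)
```

Calling `clip_per_feature(corrected)` returns an array of the same shape as `corrected`.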

## Features

- **Multi-modal Support**: Works with both scRNA-seq and IMC data
- **Easy-to-Use API**: One-line batch correction function
- **Flexible Architecture**: Customizable neural network parameters
- **Adaptive Loss Weights**: Loss weights are adjusted automatically based on dataset characteristics
- **Comprehensive Documentation**: Detailed usage examples and best practices

## Citation

If you use BioBatchNet in your research, please cite:

```
[Citation information to be added]
```

## Support

For issues and questions:
- GitHub Issues: https://github.com/Manchester-HealthAI/BioBatchNet/issues
- PyPI Package: https://pypi.org/project/biobatchnet/

## License

MIT License

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "biobatchnet",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Haochen Liu <haiping.liu.uom@gmail.com>",
    "keywords": "batch-effect, deep-learning, single-cell, IMC, scRNA-seq, bioinformatics",
    "author": "Haochen Liu",
    "author_email": "Haochen Liu <haiping.liu.uom@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/73/ce/982637d65de1503be25992a3070f200675a7a8c1981e18a8840977bf2c57/biobatchnet-0.1.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A deep learning framework for batch effect correction in biological data",
    "version": "0.1.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/Manchester-HealthAI/BioBatchNet/issues",
        "Documentation": "https://github.com/Manchester-HealthAI/BioBatchNet/blob/main/USAGE.md",
        "Homepage": "https://github.com/Manchester-HealthAI/BioBatchNet",
        "Repository": "https://github.com/Manchester-HealthAI/BioBatchNet"
    },
    "split_keywords": [
        "batch-effect",
        " deep-learning",
        " single-cell",
        " imc",
        " scrna-seq",
        " bioinformatics"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "425d9eac90ecda0a8081693c2de955000b5b47fc7d80ce8af72f45c7d5e51aa8",
                "md5": "67ac58601a18af480196935c4cef97a6",
                "sha256": "b5dad691b344d26b514e308a71867f8573d61fff157e4bb3a691aa00e3b64c9f"
            },
            "downloads": -1,
            "filename": "biobatchnet-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "67ac58601a18af480196935c4cef97a6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 32939,
            "upload_time": "2025-09-09T20:04:05",
            "upload_time_iso_8601": "2025-09-09T20:04:05.851554Z",
            "url": "https://files.pythonhosted.org/packages/42/5d/9eac90ecda0a8081693c2de955000b5b47fc7d80ce8af72f45c7d5e51aa8/biobatchnet-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "73ce982637d65de1503be25992a3070f200675a7a8c1981e18a8840977bf2c57",
                "md5": "0e8293ca563e0199507faac879396f12",
                "sha256": "f792063e3f17653a023ec12aba79d057caac50bfb123d64b2e9f0a23f7f7b582"
            },
            "downloads": -1,
            "filename": "biobatchnet-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "0e8293ca563e0199507faac879396f12",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 27855,
            "upload_time": "2025-09-09T20:04:06",
            "upload_time_iso_8601": "2025-09-09T20:04:06.949723Z",
            "url": "https://files.pythonhosted.org/packages/73/ce/982637d65de1503be25992a3070f200675a7a8c1981e18a8840977bf2c57/biobatchnet-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-09 20:04:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Manchester-HealthAI",
    "github_project": "BioBatchNet",
    "github_not_found": true,
    "lcname": "biobatchnet"
}