# BioBatchNet
BioBatchNet is a deep learning framework for batch effect correction in biological data, supporting both single-cell RNA-seq (scRNA-seq) and Imaging Mass Cytometry (IMC) data.
## Installation
### From PyPI (Recommended)
```bash
pip install biobatchnet
```
### From Source
```bash
git clone https://github.com/Manchester-HealthAI/BioBatchNet
cd BioBatchNet
pip install -e .
```
### Prerequisites
```bash
conda create -n biobatchnet python=3.10
conda activate biobatchnet
pip install torch numpy pandas anndata
```
## Quick Start
### Basic Usage
```python
from BioBatchNet import correct_batch_effects
import anndata as ad
# Load your data
adata = ad.read_h5ad('your_data.h5ad')
X = adata.X
batch_labels = adata.obs['batch'].values
# Correct batch effects with one line
corrected_data = correct_batch_effects(
data=X,
batch_info=batch_labels,
data_type='imc', # or 'scrna' for single-cell RNA-seq
epochs=100
)
# Add corrected data back to AnnData
adata.layers['corrected'] = corrected_data
```
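If you want to keep the corrected layer for downstream analysis, you can write the updated AnnData object back to disk with anndata's standard I/O (a minimal sketch; the output filename is illustrative):
```python
# Save the AnnData object, including the 'corrected' layer (filename is illustrative)
adata.write_h5ad('your_data_corrected.h5ad')

# Reload it later and pull the corrected matrix back out
adata = ad.read_h5ad('your_data_corrected.h5ad')
corrected = adata.layers['corrected']
```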
## Advanced Usage
### Custom Loss Weights
For IMC data, you can override the default loss weights to match the characteristics of your batches:
```python
# Define custom loss weights
loss_weights = {
'recon_loss': 10, # Reconstruction loss weight
'discriminator': 0.3, # Adversarial loss weight
'classifier': 1, # Batch classifier loss weight
'mmd_loss_1': 0, # MMD loss weight
'kl_loss_1': 0.005, # KL divergence weight for bio encoder
'kl_loss_2': 0.1, # KL divergence weight for batch encoder
'ortho_loss': 0.01 # Orthogonality loss weight
}
corrected_data = correct_batch_effects(
data=X,
batch_info=batch_labels,
data_type='imc',
latent_dim=20,
epochs=100,
loss_weights=loss_weights
)
```
### Custom Architecture Parameters
Fine-tune the neural network architecture:
```python
corrected_data = correct_batch_effects(
data=X,
batch_info=batch_labels,
data_type='imc',
latent_dim=20,
epochs=100,
bio_encoder_hidden_layers=[500, 2000, 2000],
batch_encoder_hidden_layers=[500],
decoder_hidden_layers=[2000, 2000, 500],
batch_classifier_layers_power=[500, 2000, 2000],
batch_classifier_layers_weak=[128]
)
```
### Direct Model Access
For more control, use the models directly:
```python
from BioBatchNet import IMCVAE, GeneVAE
# For IMC data
model = IMCVAE(
in_sz=40, # Number of features
out_sz=40, # Output dimension
num_batch=4, # Number of batches
latent_sz=20, # Latent dimension
bio_encoder_hidden_layers=[500, 2000, 2000],
batch_encoder_hidden_layers=[500],
decoder_hidden_layers=[2000, 2000, 500],
batch_classifier_layers_power=[500, 2000, 2000],
batch_classifier_layers_weak=[128]
)
# Train the model
model.fit(data, batch_labels, epochs=100, lr=1e-3)
# Get biological embeddings (batch-corrected representations)
bio_embeddings = model.get_bio_embeddings(data)
# Get corrected data
corrected_data = model.correct_batch_effects(data)
```
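The import above also exposes `GeneVAE` for scRNA-seq data. Its constructor is not documented here, so the sketch below simply assumes it mirrors the `IMCVAE` interface (`in_sz`, `out_sz`, `num_batch`, `latent_sz`, plus `fit` and `get_bio_embeddings`); treat the argument names as assumptions and check the installed package before relying on them:
```python
from BioBatchNet import GeneVAE

# Assumed to mirror the IMCVAE interface shown above -- verify the actual
# GeneVAE signature in the installed package before use.
gene_model = GeneVAE(
    in_sz=2000,      # number of genes (illustrative)
    out_sz=2000,
    num_batch=4,
    latent_sz=20
)
gene_model.fit(data, batch_labels, epochs=100, lr=1e-3)
scrna_embeddings = gene_model.get_bio_embeddings(data)
```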
## Data Formats
### Input Data Requirements
1. **Data Matrix**:
- NumPy array or PyTorch tensor
- Shape: (n_cells, n_features)
- For scRNA-seq: gene expression matrix
- For IMC: protein expression matrix
2. **Batch Information**:
- NumPy array or list
- Can be string labels (e.g., ['Patient1', 'Patient2']) or numeric
- Length must match number of cells
### Output
- **Corrected Data**: Same shape as input, with batch effects removed
- **Bio Embeddings**: Low-dimensional biological representations (n_cells, latent_dim)
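As a concrete example, the snippet below builds inputs in this format from an AnnData object, densifying a sparse `.X` and coercing the batch column to a NumPy array before calling `correct_batch_effects` (a minimal sketch; the `'batch'` column name is an assumption about your metadata):
```python
import numpy as np
import scipy.sparse as sp
import anndata as ad
from BioBatchNet import correct_batch_effects

adata = ad.read_h5ad('your_data.h5ad')

# Data matrix: (n_cells, n_features); densify if AnnData stores it as sparse
X = adata.X.toarray() if sp.issparse(adata.X) else np.asarray(adata.X)

# Batch information: one label per cell (string or numeric labels both work)
batch_labels = np.asarray(adata.obs['batch'])
assert len(batch_labels) == X.shape[0], "batch labels must match the number of cells"

corrected = correct_batch_effects(data=X, batch_info=batch_labels, data_type='imc')
```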
## Recommended Parameters
### For IMC Data
```python
# Small dataset (< 10 batches)
loss_weights = {
'recon_loss': 10,
'discriminator': 0.3,
'classifier': 1,
'mmd_loss_1': 0,
'kl_loss_1': 0.005,
'kl_loss_2': 0.1,
'ortho_loss': 0.01
}
# Large dataset (> 30 batches)
loss_weights = {
'recon_loss': 10,
'discriminator': 0.1,
'classifier': 1,
'mmd_loss_1': 0.01,
'kl_loss_1': 0.0,
'kl_loss_2': 0.1,
'ortho_loss': 0.01
}
```
### For scRNA-seq Data
```python
loss_weights = {
'recon_loss': 10,
'discriminator': 0.04,
'classifier': 1,
'kl_loss_1': 1e-7,
'kl_loss_2': 0.01,
'ortho_loss': 0.0002,
'mmd_loss_1': 0,
'kl_loss_size': 0.002
}
```
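To apply either preset, pass the dictionary to `correct_batch_effects` through `loss_weights` together with the matching `data_type`. A minimal sketch using the scRNA-seq preset defined just above (`X` and `batch_labels` follow the earlier examples):
```python
from BioBatchNet import correct_batch_effects

# Apply the scRNA-seq preset defined above; for IMC data, use data_type='imc'
# and whichever IMC preset matches your number of batches.
corrected = correct_batch_effects(
    data=X,
    batch_info=batch_labels,
    data_type='scrna',
    latent_dim=20,
    epochs=100,
    loss_weights=loss_weights
)
```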
## Example Workflow
```python
import BioBatchNet
import anndata as ad
import numpy as np
from BioBatchNet import correct_batch_effects
# 1. Load data
adata = ad.read_h5ad('IMMUcan_batch.h5ad')
print(f"Data shape: {adata.shape}")
print(f"Batches: {adata.obs['BATCH'].unique()}")
# 2. Prepare data
X = adata.X
batch_labels = adata.obs['BATCH'].values
# Convert categorical to numpy array if needed
if hasattr(batch_labels, 'to_numpy'):
batch_labels = batch_labels.to_numpy()
# 3. Correct batch effects
corrected = correct_batch_effects(
data=X,
batch_info=batch_labels,
data_type='imc',
latent_dim=20,
epochs=100
)
# 4. Store results
adata.layers['corrected'] = corrected
# 5. Optional: Get embeddings for visualization
from BioBatchNet import IMCVAE
model = IMCVAE(
in_sz=X.shape[1],
out_sz=X.shape[1],
num_batch=len(np.unique(batch_labels)),
latent_sz=20
)
model.fit(X, batch_labels, epochs=100)
embeddings = model.get_bio_embeddings(X)
adata.obsm['X_biobatchnet'] = embeddings
# 6. Visualize results (using scanpy)
import scanpy as sc
sc.pp.neighbors(adata, use_rep='X_biobatchnet')
sc.tl.umap(adata)
sc.pl.umap(adata, color=['BATCH', 'celltype'])
```
## Tips and Best Practices
1. **Data Preprocessing**:
   - Ensure data is properly normalized before batch correction
   - For scRNA-seq: consider log-transformation
   - For IMC: consider arcsinh transformation (see the sketch after this list)
2. **Batch Size**:
- Default batch size is 256
- Reduce if encountering memory issues
- Increase for faster training with sufficient memory
3. **Number of Epochs**:
- Start with 100 epochs for initial testing
- Increase to 200-500 for final results
- Monitor loss convergence
4. **Latent Dimension**:
- Default is 20
- Increase for complex datasets with many cell types
- Decrease for simpler datasets
5. **Post-processing**:
- Output may need scaling/normalization
- Consider clipping extreme values
- Validate biological signals are preserved
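For the preprocessing tip above, here is a minimal sketch of the two transformations mentioned: library-size normalization plus log1p for scRNA-seq (via scanpy) and an arcsinh transform for IMC. The cofactor of 5 is a common convention, not a package requirement, and `raw_imc_counts` is a placeholder for your raw protein matrix:
```python
import numpy as np
import scanpy as sc

# scRNA-seq: normalize counts per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# IMC: arcsinh-transform raw protein counts (cofactor 5 is a common choice)
cofactor = 5.0
X_imc = np.arcsinh(raw_imc_counts / cofactor)
```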
## Troubleshooting
### Memory Issues
```python
# Reduce batch size
corrected = correct_batch_effects(
data=X,
batch_info=batch_labels,
batch_size=64 # Smaller batch size
)
```
### Convergence Issues
```python
# Adjust learning rate
model.fit(data, batch_labels, lr=1e-4) # Lower learning rate
```
### Large Output Range
```python
import numpy as np
from BioBatchNet import correct_batch_effects

# Post-process corrected data by clipping extreme values
corrected = correct_batch_effects(data=X, batch_info=batch_labels)
corrected = np.clip(corrected, 0, np.percentile(corrected, 99))
```
## Features
- **Multi-modal Support**: Works with both scRNA-seq and IMC data
- **Easy-to-Use API**: One-line batch correction function
- **Flexible Architecture**: Customizable neural network parameters
- **Adaptive Loss Weights**: Automatically adjusts based on dataset characteristics
- **Comprehensive Documentation**: Detailed usage examples and best practices
## Citation
If you use BioBatchNet in your research, please cite:
```
[Citation information to be added]
```
## Support
For issues and questions:
- GitHub Issues: https://github.com/Manchester-HealthAI/BioBatchNet/issues
- PyPI Package: https://pypi.org/project/biobatchnet/
## License
MIT License