# Bacformer
Bacformer is a prokaryotic foundation model that represents whole bacterial genomes
as sequences of proteins ordered by their genomic coordinates on the chromosome and plasmid(s).
It takes as input average protein embeddings from a protein language model and computes contextualised protein
embeddings conditioned on the other proteins present in the genome. Bacformer is trained on a diverse dataset of ~1.3M bacterial genomes and ~3B proteins.

Bacformer can be applied to a wide range of tasks, including strain clustering, essential gene prediction, operon identification,
protein-protein interaction (PPI) prediction, protein function prediction, and more. We provide [model checkpoints](#pretrained-model-checkpoints) for pretrained models as well as Bacformer
finetuned for various tasks. We also provide tutorials and make Bacformer available via [HuggingFace](https://huggingface.co/macwiatrak).
## News
- **2025-05-15**: Bacformer is now available on [HuggingFace](https://huggingface.co/macwiatrak).
## Contents
- [Setup](#setup)
  - [Requirements](#requirements)
  - [Installation](#installation)
- [Usage](#usage)
  - [Tutorials](#tutorials)
- [HuggingFace](#huggingface)
- [Pretrained model checkpoints](#pretrained-model-checkpoints)
- [Contributing](#contributing)
- [Contact](#contact)
- [Citation](#citation)
## Setup
### Requirements
Bacformer is based on [PyTorch](https://pytorch.org/) and [HuggingFace Transformers](https://huggingface.co/docs/transformers/index)
and was developed in `python=3.10`.
Bacformer uses [ESM-2](https://github.com/facebookresearch/esm) protein embeddings as input (`esm2_t12_35M_UR50D`). We
recommend using the [faplm](https://github.com/pengzhangzhi/faplm) package to compute protein embeddings quickly and efficiently.
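
Bacformer's own preprocessing helpers (shown under [Usage](#usage)) compute these embeddings for you. Purely for reference, a minimal sketch of computing an average per-protein embedding with the reference `esm` package (the toy sequence and variable names are illustrative):

```python
import torch
import esm

# load the same ESM-2 checkpoint Bacformer expects as input
model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("toy_protein", "MGYDLVAGFQKNVRTI")]
labels, seqs, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[12])
residue_embeddings = out["representations"][12]

# average over residue positions (skipping the BOS/EOS tokens) to get
# a single embedding per protein
avg_embedding = residue_embeddings[0, 1 : len(seqs[0]) + 1].mean(dim=0)
print(avg_embedding.shape)  # torch.Size([480])
```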
### Installation
You can install Bacformer by cloning the repository and installing the dependencies:
```bash
git clone https://github.com/macwiatrak/Bacformer.git
cd Bacformer
# 1) install Bacformer **with its core dependencies**
pip install .
# 2) (optional but recommended) add the fast-attention extra ("faesm")
pip install ".[faesm]"
```
<details>
<summary>Having trouble installing Bacformer?</summary>

Create a clean conda environment and install `cuda-toolkit 12.1.0` for compilation:
```bash
# Create new environment with Python 3.10
micromamba create -n bacformer python=3.10 -y
# Activate the environment
micromamba activate bacformer
# Install CUDA toolkit
micromamba install -c nvidia/label/cuda-12.1.0 cuda-toolkit -y
# Install PyTorch with CUDA support (using pip for latest version)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install flash-attention
pip install flash-attn --no-build-isolation --no-cache-dir
# Optional: verify installations
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
Another workaround is to use a Docker container. You can use the official NVIDIA PyTorch [containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), which include all the dependencies for flash attention.
</details>
## Usage
Below are examples of how to use Bacformer to compute contextual protein embeddings.
### Computing contextual protein embeddings on a set of toy protein sequences
```python
import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs
device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-masked-MAG", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)

# Example input: a sequence of protein sequences
# in this case: 4 toy protein sequences
# Bacformer was trained with a maximum of 6000 proteins per genome.
protein_sequences = [
    "MGYDLVAGFQKNVRTI",
    "MKAILVVLLG",
    "MQLIESRFYKDPWGNVHATC",
    "MSTNPKPQRFAWL",
]
# embed the proteins with ESM-2 to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    protein_sequences,
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
)

# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# (batch_size, n_prots + special tokens or max_n_proteins, embedding_dim)
print('last hidden state shape:', outputs["last_hidden_state"].shape)
# (batch_size, embedding_dim)
print('genome embedding shape:', outputs.last_hidden_state.mean(dim=1).shape)
```
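The mean over `dim=1` above averages over every position, including padding and special tokens. A minimal sketch of pooling only over valid positions, assuming the `inputs` dict returned by `protein_seqs_to_bacformer_inputs` contains an `attention_mask` entry (an assumption worth verifying against `inputs.keys()`):

```python
# a sketch: mean-pool only positions marked valid by the attention mask
# (assumes inputs contains an "attention_mask" entry)
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs["last_hidden_state"].dtype)
summed = (outputs["last_hidden_state"] * mask).sum(dim=1)
genome_embedding = summed / mask.sum(dim=1).clamp(min=1)
print(genome_embedding.shape)  # (batch_size, embedding_dim)
```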
### Processing and embedding a whole bacterial genome
Process a whole bacterial genome assembly from GenBank (in this case, the `Pseudomonas aeruginosa PAO1` genome)
and compute contextualised protein embeddings with Bacformer.
```python
import torch
from transformers import AutoModel
from bacformer.pp import preprocess_genome_assembly, protein_seqs_to_bacformer_inputs
# preprocess a bacterial genome assembly
genome_info = preprocess_genome_assembly(filepath="files/pao1.gbff")
# load the model
device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-masked-complete-genomes", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)

# embed the proteins with ESM-2 to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    genome_info["protein_sequence"],
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
)

# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# the resulting contextualised protein embeddings can be used for analysis
print('last hidden state shape:', outputs["last_hidden_state"].shape)
```
### Embedding a dataset column with Bacformer
Use Bacformer to embed a column of protein sequences from a HuggingFace dataset. The example below can easily be adapted
to a pandas DataFrame or any other data structure containing protein sequences.
Below we show how to compute contextualised protein embeddings for all proteins in a genome (as required for operon prediction),
and how to compute a single embedding per genome (as required for strain clustering).
```python
from bacformer.pp import embed_dataset_col
from datasets import load_dataset
# load the operon dataset from long-read RNA sequencing
operon_dataset = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing", split="test")
# embed the protein sequences with Bacformer
# we compute contextualised protein embeddings for all proteins in the genome
operon_dataset = embed_dataset_col(
    dataset=operon_dataset,
    model_path="macwiatrak/bacformer-masked-complete-genomes",
    max_n_proteins=9000,
    genome_pooling_method=None,  # set to None to get embeddings for all proteins in the genome
)
# load the strain clustering toy dataset
strain_clustering_dataset = load_dataset("macwiatrak/strain-clustering-protein-sequences-sample", split="train")
# embed the protein sequences with Bacformer
# use mean genome pooling as we need a single genome embedding for each genome for clustering
strain_clustering_dataset = embed_dataset_col(
    dataset=strain_clustering_dataset,
    model_path="macwiatrak/bacformer-masked-MAG",
    max_n_proteins=9000,
    genome_pooling_method="mean",
)
# convert to pandas and print the first 5 rows
strain_clustering_df = strain_clustering_dataset.to_pandas()
strain_clustering_df.head()
```
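The pooled embeddings can then go straight into a standard clustering algorithm. A minimal sketch with scikit-learn, assuming `embed_dataset_col` writes the per-genome vectors to an `embeddings` column (check the returned dataset's column names):

```python
import numpy as np
from sklearn.cluster import KMeans

# stack the per-genome vectors into a (n_genomes, embedding_dim) matrix
# (assumes the vectors landed in an "embeddings" column)
X = np.stack(strain_clustering_df["embeddings"].to_numpy())

# the number of clusters is illustrative; pick it to match your data
strain_clustering_df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
strain_clustering_df["cluster"].value_counts()
```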
### Tutorials
We provide a set of tutorials to help you get started with Bacformer. The tutorials cover the following topics:
- [Bacformer for strain clustering](tutorials/strain_clustering_tutorial.ipynb)
- [Finetuning Bacformer for essential genes prediction](tutorials/finetune_gene_essentiality_prediction_tutorial.ipynb)
- [Bacformer for phenotypic traits prediction](tutorials/phenotypic_traits_prediction_tutorial.ipynb)
- [Finetuning Bacformer for phenotypic traits prediction](tutorials/finetune_phenotypic_traits_prediction_tutorial.ipynb)
- [Zero-shot operon identification with Bacformer](tutorials/zero_shot_operon_prediction.ipynb)
We are actively working on more tutorials, so stay tuned! If you have any suggestions for tutorials, please let us know by raising an issue in the [issue tracker][issue tracker].
## HuggingFace
Bacformer is integrated with [HuggingFace](https://huggingface.co/macwiatrak).
```python
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoModelForCausalLM
device = "cuda:0"
# load the Bacformer model trained on MAGs with an autoregressive objective
causal_model = AutoModelForCausalLM.from_pretrained(
    "macwiatrak/bacformer-causal-MAG", trust_remote_code=True
).to(torch.bfloat16).eval().to(device)

# load the Bacformer model trained on MAGs with a masked objective
masked_model = AutoModelForMaskedLM.from_pretrained(
    "macwiatrak/bacformer-masked-MAG", trust_remote_code=True
).to(torch.bfloat16).eval().to(device)

# load the Bacformer encoder model finetuned on complete genomes (i.e. without the protein family classification head)
# we recommend this model as the starting point for finetuning on your own dataset for all tasks except generation
encoder_model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-masked-complete-genomes", trust_remote_code=True
).to(torch.bfloat16).eval().to(device)
```
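For finetuning, a common pattern is to put a small task head on top of the encoder's contextual embeddings. A minimal sketch (the class name, `hidden_dim=480`, and the mean-pooling choice are illustrative assumptions, not part of the Bacformer API):

```python
import torch.nn as nn

class GenomeClassifier(nn.Module):
    """Hypothetical finetuning head: mean-pool Bacformer's contextual
    protein embeddings and classify the genome with a linear layer."""

    def __init__(self, encoder, hidden_dim: int = 480, n_classes: int = 2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, **inputs):
        hidden = self.encoder(**inputs, return_dict=True)["last_hidden_state"]
        return self.head(hidden.mean(dim=1))  # (batch_size, n_classes)

# cast the head to the encoder's dtype, then train with a standard cross-entropy loop
clf = GenomeClassifier(encoder_model).to(device).to(torch.bfloat16)
```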
## Pretrained model checkpoints
We provide a range of pretrained model checkpoints for Bacformer which are available via [HuggingFace](https://huggingface.co/macwiatrak).
| Checkpoint name | Genome type | Description |
|---|---|---|
| `bacformer-causal-MAG` | MAG | Pretrained on ~1.3M metagenome-assembled genomes (MAGs) with an autoregressive objective. |
| `bacformer-masked-MAG` | MAG | Pretrained on ~1.3M metagenome-assembled genomes (MAGs) with a masked objective, randomly masking 15% of proteins. |
| `bacformer-causal-complete-genomes` | Complete (i.e. uninterrupted) | `bacformer-causal-MAG` finetuned on ~40k complete genomes with an autoregressive objective. |
| `bacformer-masked-complete-genomes` | Complete (i.e. uninterrupted) | `bacformer-masked-MAG` finetuned on ~40k complete genomes with a masked objective, randomly masking 15% of proteins. |
| `bacformer-causal-protein-family-modeling-complete-genomes` | Complete (i.e. uninterrupted) | `bacformer-causal-MAG` finetuned on ~40k complete genomes with an autoregressive objective. Unlike the other models, it takes protein-family tokens as input rather than protein sequences, enabling generation of sequences of protein families. |
| `bacformer-for-essential-genes-prediction` | Complete (i.e. uninterrupted) | `bacformer-masked-complete-genomes` finetuned on the essential gene prediction task. |
## Contributing
We welcome contributions to Bacformer! If you would like to contribute, please follow these steps:
1. Fork the repository.
2. Install `pre-commit` and set up the pre-commit hooks (make sure to do this at the root of the repository):
```bash
pip install pre-commit
pre-commit install
```
3. Create a new branch for your feature or bug fix.
4. Make your changes and commit them.
5. Push your changes to your forked repository.
6. Create a pull request to the main repository.
7. Make sure to add tests for your changes and run the tests to ensure everything is working correctly.
## Contact
For questions, bugs, and feature requests, please raise an issue in the repository.
## Citation
> t.b.a
[issue tracker]: https://github.com/macwiatrak/Bacformer/issues