GenAIRR


| Field | Value |
| --- | --- |
| Name | GenAIRR |
| Version | 0.3.0 |
| Home page | https://github.com/MuteJester/GenAIRR |
| Summary | An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis. |
| Upload time | 2024-11-12 11:11:18 |
| Maintainer | None |
| Docs URL | None |
| Author | Thomas Konstantinovsky & Ayelet Peres |
| Requires Python | >=3.9 |
| License | None |
| Keywords | immunogenetics, sequence simulation, bioinformatics, alignment benchmarking |
| Requirements | pandas, numpy, scipy, setuptools |

# GenAIRR: AIRR Sequence Simulator

GenAIRR is a Python module that generates synthetic Adaptive Immune Receptor Repertoire (AIRR) sequences for benchmarking alignment algorithms and conducting sequence analysis in an unbiased manner.


- **Realistic Sequence Simulation**: Generate heavy and light immunoglobulin chain sequences with extensive customization options.
- **Advanced Mutation and Augmentation**: Introduce mutations and augment sequences to closely mimic the natural diversity and sequencing artifacts.
- **Precision in Allele-Specific Corrections**: Utilize sophisticated correction maps to accurately handle allele-specific trimming events and ambiguities.
- **Indel Simulation Capability**: Reflect the intricacies of sequencing data by simulating insertions and deletions within sequences.

# Visit GenAIRR's Documentation 
[GenAIRR's ReadTheDocs](https://genairr.readthedocs.io/en/latest/)

# Acknowledgements
Some parts of the code were inspired by and adapted from https://github.com/Cowanlab/airrship

# Quick Start Guide to GenAIRR

Welcome to the Quick Start Guide for GenAIRR, a Python module for generating synthetic Adaptive Immune Receptor Repertoire (AIRR) sequences. This guide walks you through the basic usage of GenAIRR: installing the package, simulating and customizing heavy chain sequences, generating naïve sequences, applying mutation models, and a set of common use cases.


## Installation

Before you begin, make sure that Python 3.9 or later is installed on your system. GenAIRR can be installed using pip, Python's package installer. Run the following command in your terminal:

```bash
# Install GenAIRR using pip
pip install GenAIRR
```

## Setting Up

To start using GenAIRR, you need to import the necessary classes from the module. We'll also set up a `DataConfig` object to specify our configuration.



```python
# Import the GenAIRR classes used in this guide
from GenAIRR.simulation import (
    HeavyChainSequenceAugmentor,
    LightChainSequenceAugmentor,
    SequenceAugmentorArguments,
)
from GenAIRR.utilities import DataConfig
from GenAIRR.data import (
    builtin_heavy_chain_data_config,
    builtin_kappa_chain_data_config,
    builtin_lambda_chain_data_config,
)

# Initialize DataConfig with the path to your own configuration
# data_config = DataConfig('/path/to/your/config')
# Or use one of the built-in data configs
data_config_builtin = builtin_heavy_chain_data_config()

# Set up augmentation arguments (defaults are used when none are given)
args = SequenceAugmentorArguments()

```

## Simulating Heavy Chain Sequences

Let's simulate a heavy chain sequence using `HeavyChainSequenceAugmentor`. This example demonstrates a simple simulation with default settings.



```python
# Initialize the HeavyChainSequenceAugmentor
heavy_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, args)

# Simulate a heavy chain sequence (the method must be called, hence the parentheses)
heavy_sequence = heavy_augmentor.simulate_augmented_sequence()

# Print the simulated heavy chain sequence
print("Simulated Heavy Chain Sequence:", heavy_sequence)

```

The call returns a dictionary that holds the simulated sequence together with its annotations: the V, D, and J calls, segment boundaries, trimming amounts, mutation rate, and corruption details. A full example of this output is shown under "Customizing Simulations".

## Customizing Simulations

GenAIRR allows for extensive customization to closely mimic the natural diversity of immune sequences. Below is an example of how to customize mutation rates and indel simulations.



```python
# Customize augmentation arguments
custom_args = SequenceAugmentorArguments(
    min_mutation_rate=0.01,
    max_mutation_rate=0.05,
    simulate_indels=True,
    max_indels=3,
    corrupt_proba=0.7,
    save_ns_record=True,         # keep the 'Ns' record in the output
    save_mutations_record=True,  # keep the 'mutations' record in the output
)

# Use custom arguments to simulate a heavy chain sequence
custom_heavy_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, custom_args)
custom_heavy_sequence = custom_heavy_augmentor.simulate_augmented_sequence()

# Print the customized heavy chain sequence
print("Customized Simulated Heavy Chain Sequence:", custom_heavy_sequence)

```

    Customized Simulated Heavy Chain Sequence: {'sequence': 'GTGTTGGAGTACGAACGCGGAGTTCTGTTGTGAATTGGGCGGTGAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGCCCCTGNGACTCTCCTGTGCAGCCTCTGGANTCACCTTTAGTAGCTATTGGNTGAGGTGNGTCCGCCAGGCTCCAGGGAAGGGACTGGAGTGGGTGGCCAACATAAAACAAGATGGAAGTGAGAAATACTATGTNGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACNCGGCNGTGTATTACTGTGCGAGAGTCCGACAGGAGCAGCCAAATCGTCTCTTCGGCTACTCAGGGACCCTTTCTGGTTNGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG', 'v_sequence_start': 43, 'v_sequence_end': 338, 'd_sequence_start': 347, 'd_sequence_end': 353, 'j_sequence_start': 386, 'j_sequence_end': 433, 'v_call': 'IGHVF10-G49*03,IGHVF10-G49*04', 'd_call': 'IGHD6-13*01,IGHD6-25*01,IGHD6-6*01', 'j_call': 'IGHJ5*02', 'mutation_rate': 0.02771362586605081, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 6, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'add', 'corruption_add_amount': 43, 'corruption_remove_amount': 0, 'mutations': b'ezkxOiAnVD5DJywgMTQ3OiAnQz5HJywgMTc0OiAnRz5BJywgMTk4OiAnRz5BJ30=', 'Ns': b'ezk3OiAnQT5OJywgMTIxOiAnVD5OJywgMTQyOiAnQT5OJywgMTUwOiAnRz5OJywgMjI1OiAnRz5OJywgMzEzOiAnQT5OJywgMzE4OiAnVD5OJywgMzkyOiAnQz5OJ30=', 'indels': {}}
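
The `mutations` and `Ns` fields in the output above are byte strings rather than dictionaries. Judging from this sample, they appear to be base64-encoded text representations of Python dictionaries that map positions to substitutions; that encoding is inferred from the example and is not documented behaviour. Under that assumption, a minimal sketch for decoding them with the standard library (the helper name `decode_record` is hypothetical):

```python
import ast
import base64


def decode_record(encoded: bytes) -> dict:
    """Decode a base64-encoded dict literal, e.g. the 'mutations' field."""
    text = base64.b64decode(encoded).decode("utf-8")
    # literal_eval parses the dict literal without executing arbitrary code
    return ast.literal_eval(text)


# Value copied from the 'mutations' field of the sample output above
encoded_mutations = b'ezkxOiAnVD5DJywgMTQ3OiAnQz5HJywgMTc0OiAnRz5BJywgMTk4OiAnRz5BJ30='
mutations = decode_record(encoded_mutations)
print(mutations)
```

For the sample record this gives four substitutions, at positions 91, 147, 174, and 198. The same helper can be applied to the `Ns` field.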
    

## Generating Naïve Sequences

In immunogenetics, a naïve sequence refers to an antibody sequence that has not undergone the process of somatic hypermutation. GenAIRR allows you to simulate such naïve sequences using the `HeavyChainSequence` class. Let's start by generating a naïve heavy chain sequence.



```python
from GenAIRR.sequence import HeavyChainSequence

# Create a naive heavy chain sequence
naive_heavy_sequence = HeavyChainSequence.create_random(data_config_builtin)

# Printing the object shows the V, D, and J segments with their positions
print("Naïve Heavy Chain Sequence:", naive_heavy_sequence)
print('Ungapped Sequence: ')
print(naive_heavy_sequence.ungapped_seq)

```

    Naïve Heavy Chain Sequence: 0|-----------------------------------------------------------------------------V(IGHVF3-G8*01)|294|296|----D(IGHD2-8*02)|312|332|------------J(IGHJ2*01)|381
    Ungapped Sequence: 
    CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTCTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG
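
The printed form of the naïve sequence packs the segment boundaries and allele names into a single line. If you need them as values, one option is to parse that line. The sketch below assumes the layout seen in the example output (`start|---SEGMENT(allele)|end`); the pattern and the helper `parse_segments` are illustrative and not part of GenAIRR, whose objects may well expose these values directly.

```python
import re

# Pattern inferred from the printed example: start|---SEGMENT(allele)|end
SEGMENT_PATTERN = re.compile(r"(\d+)\|-*([VDJ])\(([^)]+)\)\|(\d+)")


def parse_segments(representation: str) -> dict:
    """Map each segment letter to its allele name and start/end positions."""
    segments = {}
    for start, segment, allele, end in SEGMENT_PATTERN.findall(representation):
        segments[segment] = {"allele": allele, "start": int(start), "end": int(end)}
    return segments


# Printed form from the example above, with the dash runs shortened
representation = "0|-----V(IGHVF3-G8*01)|294|296|----D(IGHD2-8*02)|312|332|----J(IGHJ2*01)|381"
segments = parse_segments(representation)
for name, info in segments.items():
    print(name, info)
```

In this example the gaps between segments (294 to 296 and 312 to 332) presumably hold the nucleotides inserted at the V-D and D-J junctions.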
    

## Applying Mutations

To mimic the natural diversity and evolution of immune sequences, GenAIRR supports the simulation of mutations through various models. Here, we demonstrate how to apply mutations to a naïve sequence using the `S5F` and `Uniform` mutation models from the `GenAIRR.mutation` submodule.


### Using the S5F Mutation Model

The `S5F` model is a sophisticated mutation model that considers context-dependent mutation probabilities. It's particularly useful for simulating realistic somatic hypermutations.



```python
from GenAIRR.mutation import S5F

# Initialize the S5F mutation model with custom mutation rates
s5f_model = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)

# Apply mutations to the naive sequence using the S5F model
s5f_mutated_sequence, mutations, mutation_rate = s5f_model.apply_mutation(naive_heavy_sequence)

print("S5F Mutated Heavy Chain Sequence:", s5f_mutated_sequence)
print("S5F Mutation Details:", mutations)
print("S5F Mutation Rate:", mutation_rate)

```

    S5F Mutated Heavy Chain Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGCTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCTAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCAGAGCTCTGTGACCGCCGCGGACTCGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTCTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG
    S5F Mutation Details: {270: 'A>T', 192: 'A>T', 247: 'T>A', 76: 'G>C'}
    S5F Mutation Rate: 0.011222406361310347
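
To make "context-dependent" concrete: as its name suggests, S5F works with 5-mer contexts, that is, the mutated base together with its two neighbours on each side. The sketch below only extracts that context for a reported position; it does not reproduce GenAIRR's implementation, and `fivemer_context` is a hypothetical helper. The positions in the mutation details appear to be 0-based, which the example checks against the first 80 nucleotides of the naïve sequence.

```python
def fivemer_context(sequence: str, position: int) -> str:
    """Return the 5-mer centred on a 0-based position (shorter at the sequence ends)."""
    start = max(position - 2, 0)
    return sequence[start:position + 3]


# First 80 nucleotides of the naive sequence printed earlier
naive_prefix = (
    "CAGGTGCAGC" "TGCAGGAGTC" "GGGCCCAGGA" "CTGGTGAAGC"
    "CTCCGGGGAC" "CCTGTCCCTC" "ACCTGCGCTG" "TCTCTGGTGG"
)

position, change = 76, "G>C"  # one entry from the S5F mutation details above
original_base = change.split(">")[0]

# The reported original base should match the naive sequence at that index
print(naive_prefix[position] == original_base)
print(fivemer_context(naive_prefix, position))
```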
    

### Using the Uniform Mutation Model

The `Uniform` mutation model applies mutations at a uniform rate across the sequence, providing a simpler alternative to the context-dependent models.



```python
from GenAIRR.mutation import Uniform

# Initialize the Uniform mutation model with custom mutation rates
uniform_model = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)

# Apply mutations to the naive sequence using the Uniform model
uniform_mutated_sequence, mutations, mutation_rate = uniform_model.apply_mutation(naive_heavy_sequence)

print("Uniform Mutated Heavy Chain Sequence:", uniform_mutated_sequence)
print("Uniform Mutation Details:", mutations)
print("Uniform Mutation Rate:", mutation_rate)

```

    Uniform Mutated Heavy Chain Sequence: CAGGTGCACCTGCAGGAGTCGGGCCGAGGAGTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTTACTTGTGGAGTTGGGTCCGCCAGCCACCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTGTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG
    Uniform Mutation Details: {122: 'C>A', 8: 'G>C', 100: 'G>T', 96: 'A>T', 25: 'C>G', 346: 'C>G', 30: 'C>G'}
    Uniform Mutation Rate: 0.019802269687583134
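
To illustrate what a uniform model does conceptually, here is a small standard-library sketch. It is not GenAIRR's `Uniform` class: the function `uniform_mutate`, its parameters, and the choice to draw a single rate and then pick positions without replacement are all assumptions made for illustration. It returns the same three values as the example above (mutated sequence, mutation details, rate).

```python
import random


def uniform_mutate(sequence: str, min_rate: float, max_rate: float, rng: random.Random):
    """Substitute bases at uniformly chosen positions; every position is equally likely."""
    rate = rng.uniform(min_rate, max_rate)
    n_mutations = round(rate * len(sequence))
    positions = rng.sample(range(len(sequence)), n_mutations)

    bases = list(sequence)
    mutations = {}
    for pos in positions:
        original = bases[pos]
        # Always pick a base that differs from the original one
        replacement = rng.choice([b for b in "ACGT" if b != original])
        bases[pos] = replacement
        mutations[pos] = f"{original}>{replacement}"
    return "".join(bases), mutations, rate


sequence = "ACGT" * 25  # toy 100-nt sequence
mutated, mutations, rate = uniform_mutate(sequence, 0.01, 0.05, random.Random(0))
print(mutations)
print(rate)
```

Unlike a context-dependent model, nothing here looks at the neighbouring bases, so mutation hot spots and cold spots are not reproduced.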
    

## Common Use Cases

GenAIRR is a versatile tool designed to meet a broad range of needs in immunogenetics research. This section provides examples and explanations for some common use cases, including generating multiple sequences, simulating specific allele combinations, and more.


### Generating Many Sequences

One common requirement is to generate a large dataset of synthetic AIRR sequences for analysis or benchmarking. Below is an example of how to generate multiple sequences using GenAIRR in a loop.



```python
num_sequences = 5  # Number of sequences to generate

heavy_sequences = []
for _ in range(num_sequences):
    # Simulate a heavy chain sequence
    heavy_sequence = heavy_augmentor.simulate_augmented_sequence()
    heavy_sequences.append(heavy_sequence)

# Display the generated sequences
for i, seq in enumerate(heavy_sequences, start=1):
    print(f"Heavy Chain Sequence {i}: {seq}")

```

    Heavy Chain Sequence 1: {'sequence': 'TTGGNAAGCCAGGCCCTGGAGTGACTTTCACACACTGATCGGTGCGANGCTGAACTCCACAACGCCTCCCTCAAAACCTGACCCACCACGTCCAGGGACCCGTCCGGTAGTCACATGGTCCTGACACTGTCGAACATGGACCCTGTGGACACAGTCACACATTACTGTGCACCGATNCCCCCCCCTACGANGATTCCGGCCGGGCCCTGGCTAATCCAATCACTTGTTGGAGGTCTGGGGCAAAGGGACCACGGCCACCGACTCNTAAG', 'v_sequence_start': 9, 'v_sequence_end': 176, 'd_sequence_start': 185, 'd_sequence_end': 199, 'j_sequence_start': 207, 'j_sequence_end': 269, 'v_call': 'IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04', 'd_call': 'IGHD3-10*03', 'j_call': 'IGHJ6*03', 'mutation_rate': 0.241635687732342, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 4, 'd_trim_3': 13, 'j_trim_5': 2, 'j_trim_3': 0, 'corruption_event': 'remove_before_add', 'corruption_add_amount': 9, 'corruption_remove_amount': 132, 'indels': {}}
    Heavy Chain Sequence 2: {'sequence': 'CAGCTGCAGTTGCAGGAGTCGGGCCCNGGACTGGTGAAGCCTTTGGAGGCCCAGTCCCTCTCGTACCCTGTCTCTGGTGACTCCATCAGCAATAGTGGTTACTCCTGGGGCTGAATCCGTCCCCCCNCAGGGAAGGGGCTGGAGTGGATNGCGACTATANATTATAGGGGCAGCTCCTGCTACAACCCGTCCCTCAAGAGTCGAGTCACCATCTCCACAGACACGTCCAAGAAGCAGGTCTCCCTGATGCTGAGCTCTATGACCGCCGCANACACGACTGTNTATTACTGTGCGAGAGTCATGGTTCTGATGTTTTGGAGCAACTGGTTCGACCCCTGGGACCAGGGAAGCCTGGTCACCCTCTCCTCAN', 'v_sequence_start': 0, 'v_sequence_end': 297, 'd_sequence_start': 308, 'd_sequence_end': 317, 'j_sequence_start': 320, 'j_sequence_end': 370, 'v_call': 'IGHVF3-G10*06', 'd_call': 'IGHD3-9*01', 'j_call': 'IGHJ5*02', 'mutation_rate': 0.11621621621621622, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 8, 'd_trim_3': 15, 'j_trim_5': 1, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
    Heavy Chain Sequence 3: {'sequence': 'CAGAAGAGACTGGTGCAGTCTGGGGTTGACATGAAGACGACTGGGTCGTAATTGAAACTTTCACGAAAGACTTCTGAATACACTCGCACANACCGCTATCTGCACTGGGTCCGACAGGCCCCCAGACGGGCGTTTGAGTGGGTGGGGNGGATCACGCCTTTCAGTGGTAACACCCACTACGTGCAGACGTCCCAGGACAGAGTCCCCATTACCAGGNACAAGTNTACGAGTCCAGCCTATATAGAACTGAACACCCTNAAATGCGAGGACACAGACATATATTAATGCGCANGATCCACGGGAACCCCAGCNGAGAACTGGTACTTCGATCTTTGGGGCCGTGGCCCCCTGATCACCGTCTACTCTG', 'v_sequence_start': 0, 'v_sequence_end': 295, 'd_sequence_start': 295, 'd_sequence_end': 305, 'j_sequence_start': 316, 'j_sequence_end': 367, 'v_call': 'IGHVF6-G20*02', 'd_call': 'IGHD4-11*01,IGHD4-4*01', 'j_call': 'IGHJ2*01', 'mutation_rate': 0.1989100817438692, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 3, 'd_trim_3': 3, 'j_trim_5': 3, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
    Heavy Chain Sequence 4: {'sequence': 'CAGTTTCAGCTGGTGCCGTCTGGAGCTGAGGTGAAGAAGNCTGNGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGANTACACCTTCACCAGGTATGATATNAGCTNGGTTCGACAGGCCCCTGGACAAGGGCTTGAGTGGGTGGGATGGATCAGCGCTTACAAGGGTAACACAAACTATGAACAGAAGCTCCAGGGCAGAGTCACCATGACCACTGACACATCCACGAGCACAGCCTACATAGAGCTGAGGAGTCTGAGATCTGACGACACGGCCGTGTATCACTGTGCGAGAATCGGCGGCAGGGACGAGTCCGCAGATATCTCGCATCCCTATTGCTACTCCGGTATGGACGTCTGGGGCCAAGNNACCACGGTCACCGTCTCCTCAG', 'v_sequence_start': 0, 'v_sequence_end': 294, 'd_sequence_start': 316, 'd_sequence_end': 323, 'j_sequence_start': 332, 'j_sequence_end': 391, 'v_call': 'IGHVF6-G25*02', 'd_call': 'IGHD5-18*01,IGHD5-5*01', 'j_call': 'IGHJ6*02', 'mutation_rate': 0.0639386189258312, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 8, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
    Heavy Chain Sequence 5: {'sequence': 'GAGGTGCAACTGCTGCAGACCGGGTCAGACTTGATACAGCCAGGGGAGTCCCTCANACTGCCCTGTGCAGCCTCTGGATTCACCTGGTGAANGNATGCCGTGAATTGGGGCCGGCGGCCTCCAGGGATGGGACTTGATTGGGTCTCAGTTCTNAGTGCTAGTGGTGAGAGAACNTTCTCCATAGACTCCATGAAGGGCCGGGTCACCACCTCCAGGGTCAATTGCAAGAGTACGCTGTATCTGAAAATGAAGGGCCTGAGAGCCGAGGACGCGGCTGTTTATTATTGAGCGAGAGAGGCCTTAGGGTCGGATTACTACTCCTTTTACATGGACGTCTGGGGCACAGGGACCGCGGNCACCGTCTCGTCAC', 'v_sequence_start': 0, 'v_sequence_end': 296, 'd_sequence_start': 302, 'd_sequence_end': 306, 'j_sequence_start': 310, 'j_sequence_end': 370, 'v_call': 'IGHVF10-G41*02', 'd_call': 'Short-D', 'j_call': 'IGHJ6*03', 'mutation_rate': 0.1972972972972973, 'v_trim_5': 0, 'v_trim_3': 0, 'd_trim_5': 7, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
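
Each record is a plain dictionary, so a batch can be summarised with the standard library alone. The sketch below uses annotation fields copied from the five records above (sequences omitted). It tallies the corruption events and splits calls that list several comma-separated alleles, which presumably happens when the simulated sequence cannot distinguish between them.

```python
from collections import Counter

# Annotation fields copied from the five records above (sequences omitted for brevity)
records = [
    {"v_call": "IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04", "corruption_event": "remove_before_add"},
    {"v_call": "IGHVF3-G10*06", "corruption_event": "no-corruption"},
    {"v_call": "IGHVF6-G20*02", "corruption_event": "no-corruption"},
    {"v_call": "IGHVF6-G25*02", "corruption_event": "no-corruption"},
    {"v_call": "IGHVF10-G41*02", "corruption_event": "no-corruption"},
]

# Tally how often each corruption event occurred
event_counts = Counter(record["corruption_event"] for record in records)

# A call may list several comma-separated alleles; split it into candidates
candidate_v_alleles = [record["v_call"].split(",") for record in records]
ambiguous = [alleles for alleles in candidate_v_alleles if len(alleles) > 1]

print(event_counts)
print(ambiguous)
```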
    


The list of dictionaries can be loaded directly into a pandas DataFrame for inspection:

```python
import pandas as pd
pd.DataFrame(heavy_sequences)
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sequence</th>
      <th>v_sequence_start</th>
      <th>v_sequence_end</th>
      <th>d_sequence_start</th>
      <th>d_sequence_end</th>
      <th>j_sequence_start</th>
      <th>j_sequence_end</th>
      <th>v_call</th>
      <th>d_call</th>
      <th>j_call</th>
      <th>...</th>
      <th>v_trim_5</th>
      <th>v_trim_3</th>
      <th>d_trim_5</th>
      <th>d_trim_3</th>
      <th>j_trim_5</th>
      <th>j_trim_3</th>
      <th>corruption_event</th>
      <th>corruption_add_amount</th>
      <th>corruption_remove_amount</th>
      <th>indels</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>TTGGNAAGCCAGGCCCTGGAGTGACTTTCACACACTGATCGGTGCG...</td>
      <td>9</td>
      <td>176</td>
      <td>185</td>
      <td>199</td>
      <td>207</td>
      <td>269</td>
      <td>IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04</td>
      <td>IGHD3-10*03</td>
      <td>IGHJ6*03</td>
      <td>...</td>
      <td>0</td>
      <td>2</td>
      <td>4</td>
      <td>13</td>
      <td>2</td>
      <td>0</td>
      <td>remove_before_add</td>
      <td>9</td>
      <td>132</td>
      <td>{}</td>
    </tr>
    <tr>
      <th>1</th>
      <td>CAGCTGCAGTTGCAGGAGTCGGGCCCNGGACTGGTGAAGCCTTTGG...</td>
      <td>0</td>
      <td>297</td>
      <td>308</td>
      <td>317</td>
      <td>320</td>
      <td>370</td>
      <td>IGHVF3-G10*06</td>
      <td>IGHD3-9*01</td>
      <td>IGHJ5*02</td>
      <td>...</td>
      <td>0</td>
      <td>2</td>
      <td>8</td>
      <td>15</td>
      <td>1</td>
      <td>0</td>
      <td>no-corruption</td>
      <td>0</td>
      <td>0</td>
      <td>{}</td>
    </tr>
    <tr>
      <th>2</th>
      <td>CAGAAGAGACTGGTGCAGTCTGGGGTTGACATGAAGACGACTGGGT...</td>
      <td>0</td>
      <td>295</td>
      <td>295</td>
      <td>305</td>
      <td>316</td>
      <td>367</td>
      <td>IGHVF6-G20*02</td>
      <td>IGHD4-11*01,IGHD4-4*01</td>
      <td>IGHJ2*01</td>
      <td>...</td>
      <td>0</td>
      <td>1</td>
      <td>3</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>no-corruption</td>
      <td>0</td>
      <td>0</td>
      <td>{}</td>
    </tr>
    <tr>
      <th>3</th>
      <td>CAGTTTCAGCTGGTGCCGTCTGGAGCTGAGGTGAAGAAGNCTGNGG...</td>
      <td>0</td>
      <td>294</td>
      <td>316</td>
      <td>323</td>
      <td>332</td>
      <td>391</td>
      <td>IGHVF6-G25*02</td>
      <td>IGHD5-18*01,IGHD5-5*01</td>
      <td>IGHJ6*02</td>
      <td>...</td>
      <td>0</td>
      <td>2</td>
      <td>8</td>
      <td>6</td>
      <td>4</td>
      <td>0</td>
      <td>no-corruption</td>
      <td>0</td>
      <td>0</td>
      <td>{}</td>
    </tr>
    <tr>
      <th>4</th>
      <td>GAGGTGCAACTGCTGCAGACCGGGTCAGACTTGATACAGCCAGGGG...</td>
      <td>0</td>
      <td>296</td>
      <td>302</td>
      <td>306</td>
      <td>310</td>
      <td>370</td>
      <td>IGHVF10-G41*02</td>
      <td>Short-D</td>
      <td>IGHJ6*03</td>
      <td>...</td>
      <td>0</td>
      <td>0</td>
      <td>7</td>
      <td>6</td>
      <td>4</td>
      <td>0</td>
      <td>no-corruption</td>
      <td>0</td>
      <td>0</td>
      <td>{}</td>
    </tr>
  </tbody>
</table>
<p>5 rows × 21 columns</p>
</div>
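
Outside a notebook you may prefer to save the records to disk. Tab-separated files are a common exchange format for repertoire data, so the sketch below writes records as TSV with the `csv` module. The column subset, the truncated sequences, and the choice to store `indels` as JSON text are illustrative assumptions; this is not a GenAIRR export function.

```python
import csv
import io
import json

# Two records abbreviated from the output above (sequences truncated for brevity)
records = [
    {"sequence": "TTGGNAAGCC", "v_call": "IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04",
     "d_call": "IGHD3-10*03", "j_call": "IGHJ6*03", "indels": {}},
    {"sequence": "CAGCTGCAGT", "v_call": "IGHVF3-G10*06",
     "d_call": "IGHD3-9*01", "j_call": "IGHJ5*02", "indels": {}},
]

buffer = io.StringIO()  # swap for open("simulated.tsv", "w", newline="") to write a file
writer = csv.DictWriter(
    buffer, fieldnames=list(records[0]), delimiter="\t", lineterminator="\n"
)
writer.writeheader()
for record in records:
    row = dict(record)
    # Nested values such as the indels dict are serialised as JSON text
    row["indels"] = json.dumps(row["indels"])
    writer.writerow(row)

tsv_text = buffer.getvalue()
print(tsv_text)
```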



### Generating a Specific Allele Combination Sequence

In some cases, you might want to simulate sequences with specific V, D, and J allele combinations. Here's how to specify alleles for your simulations.



```python
# Names of the alleles to use
v_name = 'IGHVF6-G21*01'
d_name = 'IGHD5-18*01'
j_name = 'IGHJ6*03'

# Look up the matching allele objects in data_config
v_allele = next((allele for family in data_config_builtin.v_alleles.values() for allele in family if allele.name == v_name), None)
d_allele = next((allele for family in data_config_builtin.d_alleles.values() for allele in family if allele.name == d_name), None)
j_allele = next((allele for family in data_config_builtin.j_alleles.values() for allele in family if allele.name == j_name), None)

# Check that all alleles were found
if v_allele is None or d_allele is None or j_allele is None:
    raise ValueError("One or more specified alleles could not be found in the data config.")


# Generate a sequence with the specified allele combination
specific_allele_sequence = HeavyChainSequence([v_allele, d_allele, j_allele], data_config_builtin)
specific_allele_sequence.mutate(s5f_model)



print("Specific Allele Combination Sequence:", specific_allele_sequence.mutated_seq)

```

    Specific Allele Combination Sequence: CAGGTGCAGTTGGTGCAGTCTGGGACTGAGTTGAAGACGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTAGAGGCACCTTCAGCAGCTCTGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGATAAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGATACGGCCGTGTATTACTGTGCGAGAGAGGATGGGTCCGGATCCCACCCCATTTACTATTACTACTACTACATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCCTCAG
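
The allele lookup in the example repeats the same search for V, D, and J. It can be factored into a small helper. The sketch below assumes only what the example itself relies on: a dictionary whose values are lists of objects with a `.name` attribute. The `Allele` stand-in, the dictionary keys, and the helper `find_allele` are hypothetical and exist only so that the snippet runs without GenAIRR installed.

```python
from collections import namedtuple

# Stand-in for GenAIRR allele objects; only the .name attribute is assumed
Allele = namedtuple("Allele", ["name"])


def find_allele(alleles_by_family: dict, name: str):
    """Return the first allele whose name matches, or None if absent."""
    for family_alleles in alleles_by_family.values():
        for allele in family_alleles:
            if allele.name == name:
                return allele
    return None


# Toy mapping shaped like data_config_builtin.j_alleles in the example above
toy_j_alleles = {
    "IGHJ5": [Allele("IGHJ5*01"), Allele("IGHJ5*02")],
    "IGHJ6": [Allele("IGHJ6*02"), Allele("IGHJ6*03")],
}

found = find_allele(toy_j_alleles, "IGHJ6*03")
missing = find_allele(toy_j_alleles, "IGHJ9*01")
print(found, missing)
```

With a real data config, the same call would take the form `find_allele(data_config_builtin.v_alleles, 'IGHVF6-G21*01')`.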
    

### Simulating Sequences with Custom Mutation Rates

Adjusting mutation rates allows for the simulation of sequences at various stages of affinity maturation. Here's how to customize mutation rates in your simulations.



```python
# Customize augmentation arguments with your desired mutation rates
custom_args = SequenceAugmentorArguments(min_mutation_rate=0.15, max_mutation_rate=0.3)

# Initialize the augmentor with custom arguments
custom_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, custom_args)

# Generate a sequence with the custom mutation rates
custom_mutation_sequence = custom_augmentor.simulate_augmented_sequence()

print("Custom Mutation Rate Sequence:", custom_mutation_sequence)

```

    Custom Mutation Rate Sequence: {'sequence': 'GGTTGGAGCTCATTGGGAGCTNCTATTCTAGTGGGACTACCTAGTACAACCTGTCCCTCAAGAATCGCGTCACCATATCAGTCGACACGTCCAAGAATCANTCCTCCCTGGAGCTGAGCTCCGTGACCGCAGCGGACACGGCCGTGCCTNGTTGNGCGGGAAAGTTGAATATAGTGGCTAACTCTGCCTTTTGCTCTCTGGGGCCAGGGGACAGTGGCCACTGTTTTTTCAG', 'v_sequence_start': 0, 'v_sequence_end': 161, 'd_sequence_start': 165, 'd_sequence_end': 180, 'j_sequence_start': 186, 'j_sequence_end': 232, 'v_call': 'IGHVF3-G10*04', 'd_call': 'IGHD5-12*01,IGHD5-18*02', 'j_call': 'IGHJ3*02', 'mutation_rate': 0.15517241379310345, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 4, 'd_trim_3': 8, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'remove', 'corruption_add_amount': 0, 'corruption_remove_amount': 136, 'indels': {}}
    

### Generating Naïve vs. Mutated Sequence Pairs

Comparing naïve and mutated versions of the same sequence can be useful for studying somatic hypermutation effects. Here's how to generate such pairs with GenAIRR.



```python
# Generate a naive sequence, then mutate it with the S5F model defined earlier
sequence_object = HeavyChainSequence.create_random(data_config_builtin)
sequence_object.mutate(s5f_model)

print("Naïve Sequence:", sequence_object.ungapped_seq)
print("Mutated Sequence:", sequence_object.mutated_seq)

```

    Naïve Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTACATCTATTATAGTGGGAGCATCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAAAGCCACTCGGTCACACTACGGTGGTAACTCATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG
    Mutated Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGAACATCCATTATAGTGGGAGCATCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCACTAGACACGTCCAAGAACCAGTTCTCCCTGAAACTGAGCTCTGTGGCCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAACGCCACTCGGTCACACTACGGTGGTAATTCATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG
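
Once you have such a pair, the substitutions can be recovered by comparing the two strings position by position. The sketch below is a plain standard-library comparison (the helper `substitutions` is hypothetical) and assumes equal-length sequences, so it does not handle indels. It is applied to matching excerpts of the two sequences printed above, so the reported positions are relative to the excerpt.

```python
def substitutions(naive: str, mutated: str) -> dict:
    """Return {position: 'X>Y'} for every mismatch between equal-length sequences."""
    if len(naive) != len(mutated):
        raise ValueError("sequences must have equal length (no indels)")
    return {
        index: f"{before}>{after}"
        for index, (before, after) in enumerate(zip(naive, mutated))
        if before != after
    }


# Matching excerpts from the naive and mutated sequences printed above
naive_excerpt = "GATTGGGTACATCTATTATAGTGGG"
mutated_excerpt = "GATTGGGAACATCCATTATAGTGGG"

diff = substitutions(naive_excerpt, mutated_excerpt)
print(diff)
```

The result uses the same `position: 'X>Y'` layout as the mutation details printed by the mutation models.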
    


### Simulating TCR-Beta Sequences

GenAIRR also supports TCR-beta (TCRB) sequence simulation. Here's how to simulate TCRB sequences.




```python
# Import the TCR augmentor and the built-in TCRB data config
from GenAIRR.TCR.simulation import TCRHeavyChainSequenceAugmentor, SequenceAugmentorArguments
from GenAIRR.data import builtin_tcrb_data_config

tcr_data_config = builtin_tcrb_data_config()
custom_args = SequenceAugmentorArguments(simulate_indels=0.2)

# Initialize the augmentor with custom arguments
custom_augmentor = TCRHeavyChainSequenceAugmentor(tcr_data_config, custom_args)

# Generate 100 sequences
generated_seqs = []
for _ in range(100):
    generated_seqs.append(custom_augmentor.simulate_augmented_sequence())

print("Generated Sequences:", generated_seqs)

```

## Conclusion

This section highlighted some common use cases for GenAIRR, demonstrating its flexibility in simulating AIRR sequences for various research purposes. Whether you need large datasets, specific allele combinations, custom mutation rates, or comparative analyses of naïve and mutated sequences, GenAIRR provides the necessary tools to achieve your objectives.


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/MuteJester/GenAIRR",
    "name": "GenAIRR",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "immunogenetics, sequence simulation, bioinformatics, alignment benchmarking",
    "author": "Thomas Konstantinovsky & Ayelet Peres",
    "author_email": "thomaskon90@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/e2/ac/b4f16e007dce7153aa2d9953ee7d8404493a40d831f1e0d2f08386157027/GenAIRR-0.3.0.tar.gz",
    "platform": null
}
Chain Sequence: CAGGTGCACCTGCAGGAGTCGGGCCGAGGAGTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTTACTTGTGGAGTTGGGTCCGCCAGCCACCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTGTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG\r\n    Uniform Mutation Details: {122: 'C>A', 8: 'G>C', 100: 'G>T', 96: 'A>T', 25: 'C>G', 346: 'C>G', 30: 'C>G'}\r\n    Uniform Mutation Rate: 0.019802269687583134\r\n    \r\n\r\n## Common Use Cases\r\n\r\nGenAIRR is a versatile tool designed to meet a broad range of needs in immunogenetics research. This section provides examples and explanations for some common use cases, including generating multiple sequences, simulating specific allele combinations, and more.\r\n\r\n\r\n### Generating Many Sequences\r\n\r\nOne common requirement is to generate a large dataset of synthetic AIRR sequences for analysis or benchmarking. 
Below is an example of how to generate multiple sequences using GenAIRR in a loop.\r\n\r\n\r\n\r\n```python\r\nnum_sequences = 5  # Number of sequences to generate\r\n\r\nheavy_sequences = []\r\nfor _ in range(num_sequences):\r\n    # Simulate a heavy chain sequence\r\n    heavy_sequence = heavy_augmentor.simulate_augmented_sequence()\r\n    heavy_sequences.append(heavy_sequence)\r\n\r\n# Display the generated sequences\r\nfor i, seq in enumerate(heavy_sequences, start=1):\r\n    print(f\"Heavy Chain Sequence {i}: {seq}\")\r\n\r\n```\r\n\r\n    Heavy Chain Sequence 1: {'sequence': 'TTGGNAAGCCAGGCCCTGGAGTGACTTTCACACACTGATCGGTGCGANGCTGAACTCCACAACGCCTCCCTCAAAACCTGACCCACCACGTCCAGGGACCCGTCCGGTAGTCACATGGTCCTGACACTGTCGAACATGGACCCTGTGGACACAGTCACACATTACTGTGCACCGATNCCCCCCCCTACGANGATTCCGGCCGGGCCCTGGCTAATCCAATCACTTGTTGGAGGTCTGGGGCAAAGGGACCACGGCCACCGACTCNTAAG', 'v_sequence_start': 9, 'v_sequence_end': 176, 'd_sequence_start': 185, 'd_sequence_end': 199, 'j_sequence_start': 207, 'j_sequence_end': 269, 'v_call': 'IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04', 'd_call': 'IGHD3-10*03', 'j_call': 'IGHJ6*03', 'mutation_rate': 0.241635687732342, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 4, 'd_trim_3': 13, 'j_trim_5': 2, 'j_trim_3': 0, 'corruption_event': 'remove_before_add', 'corruption_add_amount': 9, 'corruption_remove_amount': 132, 'indels': {}}\r\n    Heavy Chain Sequence 2: {'sequence': 'CAGCTGCAGTTGCAGGAGTCGGGCCCNGGACTGGTGAAGCCTTTGGAGGCCCAGTCCCTCTCGTACCCTGTCTCTGGTGACTCCATCAGCAATAGTGGTTACTCCTGGGGCTGAATCCGTCCCCCCNCAGGGAAGGGGCTGGAGTGGATNGCGACTATANATTATAGGGGCAGCTCCTGCTACAACCCGTCCCTCAAGAGTCGAGTCACCATCTCCACAGACACGTCCAAGAAGCAGGTCTCCCTGATGCTGAGCTCTATGACCGCCGCANACACGACTGTNTATTACTGTGCGAGAGTCATGGTTCTGATGTTTTGGAGCAACTGGTTCGACCCCTGGGACCAGGGAAGCCTGGTCACCCTCTCCTCAN', 'v_sequence_start': 0, 'v_sequence_end': 297, 'd_sequence_start': 308, 'd_sequence_end': 317, 'j_sequence_start': 320, 'j_sequence_end': 370, 'v_call': 'IGHVF3-G10*06', 'd_call': 'IGHD3-9*01', 'j_call': 'IGHJ5*02', 'mutation_rate': 
0.11621621621621622, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 8, 'd_trim_3': 15, 'j_trim_5': 1, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}\r\n    Heavy Chain Sequence 3: {'sequence': 'CAGAAGAGACTGGTGCAGTCTGGGGTTGACATGAAGACGACTGGGTCGTAATTGAAACTTTCACGAAAGACTTCTGAATACACTCGCACANACCGCTATCTGCACTGGGTCCGACAGGCCCCCAGACGGGCGTTTGAGTGGGTGGGGNGGATCACGCCTTTCAGTGGTAACACCCACTACGTGCAGACGTCCCAGGACAGAGTCCCCATTACCAGGNACAAGTNTACGAGTCCAGCCTATATAGAACTGAACACCCTNAAATGCGAGGACACAGACATATATTAATGCGCANGATCCACGGGAACCCCAGCNGAGAACTGGTACTTCGATCTTTGGGGCCGTGGCCCCCTGATCACCGTCTACTCTG', 'v_sequence_start': 0, 'v_sequence_end': 295, 'd_sequence_start': 295, 'd_sequence_end': 305, 'j_sequence_start': 316, 'j_sequence_end': 367, 'v_call': 'IGHVF6-G20*02', 'd_call': 'IGHD4-11*01,IGHD4-4*01', 'j_call': 'IGHJ2*01', 'mutation_rate': 0.1989100817438692, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 3, 'd_trim_3': 3, 'j_trim_5': 3, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}\r\n    Heavy Chain Sequence 4: {'sequence': 'CAGTTTCAGCTGGTGCCGTCTGGAGCTGAGGTGAAGAAGNCTGNGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGANTACACCTTCACCAGGTATGATATNAGCTNGGTTCGACAGGCCCCTGGACAAGGGCTTGAGTGGGTGGGATGGATCAGCGCTTACAAGGGTAACACAAACTATGAACAGAAGCTCCAGGGCAGAGTCACCATGACCACTGACACATCCACGAGCACAGCCTACATAGAGCTGAGGAGTCTGAGATCTGACGACACGGCCGTGTATCACTGTGCGAGAATCGGCGGCAGGGACGAGTCCGCAGATATCTCGCATCCCTATTGCTACTCCGGTATGGACGTCTGGGGCCAAGNNACCACGGTCACCGTCTCCTCAG', 'v_sequence_start': 0, 'v_sequence_end': 294, 'd_sequence_start': 316, 'd_sequence_end': 323, 'j_sequence_start': 332, 'j_sequence_end': 391, 'v_call': 'IGHVF6-G25*02', 'd_call': 'IGHD5-18*01,IGHD5-5*01', 'j_call': 'IGHJ6*02', 'mutation_rate': 0.0639386189258312, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 8, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': 
{}}\r\n    Heavy Chain Sequence 5: {'sequence': 'GAGGTGCAACTGCTGCAGACCGGGTCAGACTTGATACAGCCAGGGGAGTCCCTCANACTGCCCTGTGCAGCCTCTGGATTCACCTGGTGAANGNATGCCGTGAATTGGGGCCGGCGGCCTCCAGGGATGGGACTTGATTGGGTCTCAGTTCTNAGTGCTAGTGGTGAGAGAACNTTCTCCATAGACTCCATGAAGGGCCGGGTCACCACCTCCAGGGTCAATTGCAAGAGTACGCTGTATCTGAAAATGAAGGGCCTGAGAGCCGAGGACGCGGCTGTTTATTATTGAGCGAGAGAGGCCTTAGGGTCGGATTACTACTCCTTTTACATGGACGTCTGGGGCACAGGGACCGCGGNCACCGTCTCGTCAC', 'v_sequence_start': 0, 'v_sequence_end': 296, 'd_sequence_start': 302, 'd_sequence_end': 306, 'j_sequence_start': 310, 'j_sequence_end': 370, 'v_call': 'IGHVF10-G41*02', 'd_call': 'Short-D', 'j_call': 'IGHJ6*03', 'mutation_rate': 0.1972972972972973, 'v_trim_5': 0, 'v_trim_3': 0, 'd_trim_5': 7, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}\r\n    \r\n\r\n\r\n```python\r\nimport pandas as pd\r\npd.DataFrame(heavy_sequences)\r\n```\r\n\r\n\r\n\r\n\r\n<div>\r\n<style scoped>\r\n    .dataframe tbody tr th:only-of-type {\r\n        vertical-align: middle;\r\n    }\r\n\r\n    .dataframe tbody tr th {\r\n        vertical-align: top;\r\n    }\r\n\r\n    .dataframe thead th {\r\n        text-align: right;\r\n    }\r\n</style>\r\n<table border=\"1\" class=\"dataframe\">\r\n  <thead>\r\n    <tr style=\"text-align: right;\">\r\n      <th></th>\r\n      <th>sequence</th>\r\n      <th>v_sequence_start</th>\r\n      <th>v_sequence_end</th>\r\n      <th>d_sequence_start</th>\r\n      <th>d_sequence_end</th>\r\n      <th>j_sequence_start</th>\r\n      <th>j_sequence_end</th>\r\n      <th>v_call</th>\r\n      <th>d_call</th>\r\n      <th>j_call</th>\r\n      <th>...</th>\r\n      <th>v_trim_5</th>\r\n      <th>v_trim_3</th>\r\n      <th>d_trim_5</th>\r\n      <th>d_trim_3</th>\r\n      <th>j_trim_5</th>\r\n      <th>j_trim_3</th>\r\n      <th>corruption_event</th>\r\n      <th>corruption_add_amount</th>\r\n      <th>corruption_remove_amount</th>\r\n      
<th>indels</th>\r\n    </tr>\r\n  </thead>\r\n  <tbody>\r\n    <tr>\r\n      <th>0</th>\r\n      <td>TTGGNAAGCCAGGCCCTGGAGTGACTTTCACACACTGATCGGTGCG...</td>\r\n      <td>9</td>\r\n      <td>176</td>\r\n      <td>185</td>\r\n      <td>199</td>\r\n      <td>207</td>\r\n      <td>269</td>\r\n      <td>IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04</td>\r\n      <td>IGHD3-10*03</td>\r\n      <td>IGHJ6*03</td>\r\n      <td>...</td>\r\n      <td>0</td>\r\n      <td>2</td>\r\n      <td>4</td>\r\n      <td>13</td>\r\n      <td>2</td>\r\n      <td>0</td>\r\n      <td>remove_before_add</td>\r\n      <td>9</td>\r\n      <td>132</td>\r\n      <td>{}</td>\r\n    </tr>\r\n    <tr>\r\n      <th>1</th>\r\n      <td>CAGCTGCAGTTGCAGGAGTCGGGCCCNGGACTGGTGAAGCCTTTGG...</td>\r\n      <td>0</td>\r\n      <td>297</td>\r\n      <td>308</td>\r\n      <td>317</td>\r\n      <td>320</td>\r\n      <td>370</td>\r\n      <td>IGHVF3-G10*06</td>\r\n      <td>IGHD3-9*01</td>\r\n      <td>IGHJ5*02</td>\r\n      <td>...</td>\r\n      <td>0</td>\r\n      <td>2</td>\r\n      <td>8</td>\r\n      <td>15</td>\r\n      <td>1</td>\r\n      <td>0</td>\r\n      <td>no-corruption</td>\r\n      <td>0</td>\r\n      <td>0</td>\r\n      <td>{}</td>\r\n    </tr>\r\n    <tr>\r\n      <th>2</th>\r\n      <td>CAGAAGAGACTGGTGCAGTCTGGGGTTGACATGAAGACGACTGGGT...</td>\r\n      <td>0</td>\r\n      <td>295</td>\r\n      <td>295</td>\r\n      <td>305</td>\r\n      <td>316</td>\r\n      <td>367</td>\r\n      <td>IGHVF6-G20*02</td>\r\n      <td>IGHD4-11*01,IGHD4-4*01</td>\r\n      <td>IGHJ2*01</td>\r\n      <td>...</td>\r\n      <td>0</td>\r\n      <td>1</td>\r\n      <td>3</td>\r\n      <td>3</td>\r\n      <td>3</td>\r\n      <td>0</td>\r\n      <td>no-corruption</td>\r\n      <td>0</td>\r\n      <td>0</td>\r\n      <td>{}</td>\r\n    </tr>\r\n    <tr>\r\n      <th>3</th>\r\n      <td>CAGTTTCAGCTGGTGCCGTCTGGAGCTGAGGTGAAGAAGNCTGNGG...</td>\r\n      <td>0</td>\r\n      <td>294</td>\r\n      <td>316</td>\r\n      <td>323</td>\r\n      
<td>332</td>\r\n      <td>391</td>\r\n      <td>IGHVF6-G25*02</td>\r\n      <td>IGHD5-18*01,IGHD5-5*01</td>\r\n      <td>IGHJ6*02</td>\r\n      <td>...</td>\r\n      <td>0</td>\r\n      <td>2</td>\r\n      <td>8</td>\r\n      <td>6</td>\r\n      <td>4</td>\r\n      <td>0</td>\r\n      <td>no-corruption</td>\r\n      <td>0</td>\r\n      <td>0</td>\r\n      <td>{}</td>\r\n    </tr>\r\n    <tr>\r\n      <th>4</th>\r\n      <td>GAGGTGCAACTGCTGCAGACCGGGTCAGACTTGATACAGCCAGGGG...</td>\r\n      <td>0</td>\r\n      <td>296</td>\r\n      <td>302</td>\r\n      <td>306</td>\r\n      <td>310</td>\r\n      <td>370</td>\r\n      <td>IGHVF10-G41*02</td>\r\n      <td>Short-D</td>\r\n      <td>IGHJ6*03</td>\r\n      <td>...</td>\r\n      <td>0</td>\r\n      <td>0</td>\r\n      <td>7</td>\r\n      <td>6</td>\r\n      <td>4</td>\r\n      <td>0</td>\r\n      <td>no-corruption</td>\r\n      <td>0</td>\r\n      <td>0</td>\r\n      <td>{}</td>\r\n    </tr>\r\n  </tbody>\r\n</table>\r\n<p>5 rows \u00d7 21 columns</p>\r\n</div>\r\n\r\n\r\n\r\n### Generating a Specific Allele Combination Sequence\r\n\r\nIn some cases, you might want to simulate sequences with specific V, D, and J allele combinations. 
Here's how to specify alleles for your simulations.\r\n\r\n\r\n\r\n```python\r\n# Define your specific alleles\r\nv_allele = 'IGHVF6-G21*01'\r\nd_allele = 'IGHD5-18*01'\r\nj_allele = 'IGHJ6*03'\r\n\r\n# Extract the allele objects from data_config\r\nv_allele = next((allele for family in data_config_builtin.v_alleles.values() for allele in family if allele.name == v_allele), None)\r\nd_allele = next((allele for family in data_config_builtin.d_alleles.values() for allele in family if allele.name == d_allele), None)\r\nj_allele = next((allele for family in data_config_builtin.j_alleles.values() for allele in family if allele.name == j_allele), None)\r\n\r\n# Check if all alleles were found\r\nif not v_allele or not d_allele or not j_allele:\r\n    raise ValueError(\"One or more specified alleles could not be found in the data config.\")\r\n\r\n\r\n# Generate a sequence with the specified allele combination\r\nspecific_allele_sequence = HeavyChainSequence([v_allele, d_allele, j_allele], data_config_builtin)\r\nspecific_allele_sequence.mutate(s5f_model)\r\n\r\n\r\n\r\nprint(\"Specific Allele Combination Sequence:\", specific_allele_sequence.mutated_seq)\r\n\r\n```\r\n\r\n    Specific Allele Combination Sequence: CAGGTGCAGTTGGTGCAGTCTGGGACTGAGTTGAAGACGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTAGAGGCACCTTCAGCAGCTCTGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGATAAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGATACGGCCGTGTATTACTGTGCGAGAGAGGATGGGTCCGGATCCCACCCCATTTACTATTACTACTACTACATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCCTCAG\r\n    \r\n\r\n### Simulating Sequences with Custom Mutation Rates\r\n\r\nAdjusting mutation rates allows for the simulation of sequences at various stages of affinity maturation. 
Here's how to customize mutation rates in your simulations.\r\n\r\n\r\n\r\n```python\r\n# Customize augmentation arguments with your desired mutation rates\r\ncustom_args = SequenceAugmentorArguments(min_mutation_rate=0.15, max_mutation_rate=0.3)\r\n\r\n# Initialize the augmentor with custom arguments\r\ncustom_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, custom_args)\r\n\r\n# Generate a sequence with the custom mutation rates\r\ncustom_mutation_sequence = custom_augmentor.simulate_augmented_sequence()\r\n\r\nprint(\"Custom Mutation Rate Sequence:\", custom_mutation_sequence)\r\n\r\n```\r\n\r\n    Custom Mutation Rate Sequence: {'sequence': 'GGTTGGAGCTCATTGGGAGCTNCTATTCTAGTGGGACTACCTAGTACAACCTGTCCCTCAAGAATCGCGTCACCATATCAGTCGACACGTCCAAGAATCANTCCTCCCTGGAGCTGAGCTCCGTGACCGCAGCGGACACGGCCGTGCCTNGTTGNGCGGGAAAGTTGAATATAGTGGCTAACTCTGCCTTTTGCTCTCTGGGGCCAGGGGACAGTGGCCACTGTTTTTTCAG', 'v_sequence_start': 0, 'v_sequence_end': 161, 'd_sequence_start': 165, 'd_sequence_end': 180, 'j_sequence_start': 186, 'j_sequence_end': 232, 'v_call': 'IGHVF3-G10*04', 'd_call': 'IGHD5-12*01,IGHD5-18*02', 'j_call': 'IGHJ3*02', 'mutation_rate': 0.15517241379310345, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 4, 'd_trim_3': 8, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'remove', 'corruption_add_amount': 0, 'corruption_remove_amount': 136, 'indels': {}}\r\n    \r\n\r\n### Generating Na\u00efve vs. Mutated Sequence Pairs\r\n\r\nComparing na\u00efve and mutated versions of the same sequence can be useful for studying somatic hypermutation effects. 
Here's how to generate such pairs with GenAIRR.\r\n\r\n\r\n\r\n```python\r\n# Generate a naive sequence, then mutate it with the S5F model defined earlier\r\nsequence_object = HeavyChainSequence.create_random(data_config_builtin)\r\nsequence_object.mutate(s5f_model)\r\n\r\nprint(\"Na\u00efve Sequence:\", sequence_object.ungapped_seq)\r\nprint(\"Mutated Sequence:\", sequence_object.mutated_seq)\r\n\r\n```\r\n\r\n    Na\u00efve Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTACATCTATTATAGTGGGAGCATCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAAAGCCACTCGGTCACACTACGGTGGTAACTCATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG\r\n    Mutated Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGAACATCCATTATAGTGGGAGCATCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCACTAGACACGTCCAAGAACCAGTTCTCCCTGAAACTGAGCTCTGTGGCCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAACGCCACTCGGTCACACTACGGTGGTAATTCATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG\r\n    \r\n\r\n\r\n### Simulating TCR-Beta Sequences \r\nGenAIRR also supports TCRB sequence simulation. 
Here's how you can simulate TCRB sequences.\r\n\r\n\r\n\r\n\r\n```python\r\n# Import the TCR augmentor, its arguments class, and the built-in TCRB data config\r\nfrom GenAIRR.TCR.simulation import TCRHeavyChainSequenceAugmentor, SequenceAugmentorArguments\r\nfrom GenAIRR.data import builtin_tcrb_data_config\r\n\r\ntcr_data_config = builtin_tcrb_data_config()\r\n# Customize augmentation arguments (here, the indel simulation setting)\r\ncustom_args = SequenceAugmentorArguments(simulate_indels=0.2)\r\n\r\n# Initialize the augmentor with custom arguments\r\ncustom_augmentor = TCRHeavyChainSequenceAugmentor(tcr_data_config, custom_args)\r\n\r\n# Generate 100 sequences\r\ngenerated_seqs = []\r\nfor _ in range(100):\r\n    generated_seqs.append(custom_augmentor.simulate_augmented_sequence())\r\n\r\nprint(\"Generated Sequences:\", generated_seqs)\r\n\r\n```\r\n\r\n## Conclusion\r\n\r\nThis section highlighted some common use cases for GenAIRR, demonstrating its flexibility in simulating AIRR sequences for various research purposes. Whether you need large datasets, specific allele combinations, custom mutation rates, or comparative analyses of na\u00efve and mutated sequences, GenAIRR provides the necessary tools to achieve your objectives.\r\n\r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis.",
    "version": "0.3.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/MuteJester/GenAIRR/issues",
        "Download": "https://github.com/MuteJester/GenAIRR/archive/refs/tags/0.3.0.tar.gz",
        "Homepage": "https://github.com/MuteJester/GenAIRR"
    },
    "split_keywords": [
        "immunogenetics",
        " sequence simulation",
        " bioinformatics",
        " alignment benchmarking"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e2acb4f16e007dce7153aa2d9953ee7d8404493a40d831f1e0d2f08386157027",
                "md5": "80d441138efdb887aee4f8ab933b64cd",
                "sha256": "90b595fa93a563c749f54c2f59e64dc4ffa01509ebb349b8ba062270cb05692d"
            },
            "downloads": -1,
            "filename": "GenAIRR-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "80d441138efdb887aee4f8ab933b64cd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 2119475,
            "upload_time": "2024-11-12T11:11:18",
            "upload_time_iso_8601": "2024-11-12T11:11:18.611929Z",
            "url": "https://files.pythonhosted.org/packages/e2/ac/b4f16e007dce7153aa2d9953ee7d8404493a40d831f1e0d2f08386157027/GenAIRR-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-12 11:11:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "MuteJester",
    "github_project": "GenAIRR",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    "~=",
                    "1.5.3"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "~=",
                    "1.24.3"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "~=",
                    "1.11.1"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "~=",
                    "68.0.0"
                ]
            ]
        }
    ],
    "lcname": "genairr"
}
        