# biodatatypes
Pure-Python package for handling biological sequence datatypes as Enum objects.
## Installation
```bash
pip install biodatatypes
```
## Basic usage
```python
from biodatatypes import Nucleotide, AminoAcid, Codon
from biodatatypes import NucleotideSequence, AminoAcidSequence, CodonSequence
# Nucleotide
nucleotide_a = Nucleotide['A']
nucleotide_c = Nucleotide.C
nucleotide_g = Nucleotide.from_str('G')
nucleotide_t = Nucleotide(4)
gap = Nucleotide.from_str('-')
also_gap = Nucleotide.Gap
# AminoAcid
amino_acid_ala = AminoAcid['Ala']
amino_acid_arg = AminoAcid.Arg
amino_acid_asn = AminoAcid.from_str('N')
amino_acid_gly = AminoAcid.from_str('Gly')
amino_acid_asp = AminoAcid(4)
stop = AminoAcid['Stop']
also_stop = AminoAcid.from_str('Ter')
also_also_stop = AminoAcid.from_str('*')
# Codon
codon_gca = Codon['GCA']
codon_gcg = Codon.GCG
codon_gct = Codon.from_str('GCT')
codon_aat = Codon(4)
codon_atg = Codon.start_codon()
# NucleotideSequence
nucleotide_sequence = NucleotideSequence.from_str('ATGAAACGATAG')
gapped_nucleotide_sequence = NucleotideSequence.from_str('ATG-AACGA--AG')
masked_nucleotide_sequence = NucleotideSequence.from_str('ATG#AA##ATAG')
print(nucleotide_sequence) # ATGAAACGATAG
print(repr(nucleotide_sequence)) # ATGAAACGATAG
print(gapped_nucleotide_sequence) # ATG-AACGA--AG
print(masked_nucleotide_sequence) # ATG#AA##ATAG
# AminoAcidSequence
amino_acid_sequence = AminoAcidSequence.from_str('ACDE')
gapped_amino_acid_sequence = AminoAcidSequence.from_str('A-C-E')
masked_amino_acid_sequence = AminoAcidSequence.from_str('A#DE')
print(amino_acid_sequence) # ACDE
print(gapped_amino_acid_sequence) # A-C-E
print(masked_amino_acid_sequence) # A#DE
# CodonSequence
codon_sequence = CodonSequence.from_str('ATGAAACGATAG')
gapped_codon_sequence = CodonSequence.from_str('ATGAAA---CGATAG')
masked_codon_sequence = CodonSequence.from_str('ATG###CGATAG')
print(codon_sequence) # ATGAAACGATAG
print(repr(codon_sequence)) # ATG AAA CGA TAG
print(gapped_codon_sequence) # ATGAAA---CGATAG
print(masked_codon_sequence) # ATG###CGATAG
```
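The `repr` output of `CodonSequence` above groups the string into triplets. As a minimal sketch of that grouping, written with plain Python strings rather than the package's own classes, the helper below (`split_codons` is a hypothetical name, not part of biodatatypes) splits a sequence into three-character chunks:

```python
def split_codons(sequence: str) -> list[str]:
    """Split a sequence string into consecutive three-character chunks."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


print(" ".join(split_codons("ATGAAACGATAG")))     # ATG AAA CGA TAG
print(" ".join(split_codons("ATGAAA---CGATAG")))  # ATG AAA --- CGA TAG
```

A gap or mask occupies a whole triplet (`---` or `###`) in this view, which is why the gapped and masked codon examples keep a length divisible by three.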
## Making custom datatypes
The package provides the default datatypes `Nucleotide`, `AminoAcid`, and `Codon`. Custom nucleotide, amino acid, and codon datatypes can be created by subclassing `NucleotideEnum`, `AminoAcidEnum`, and `CodonEnum` respectively.
### Custom nucleotide enum
```python
from biodatatypes.unit.base import NucleotideEnum, AminoAcidEnum, CodonEnum
from biodatatypes.unit.mixins import GapTokenMixin, MaskTokenMixin

# Create a custom nucleotide enum
class MyNucleotide(GapTokenMixin, NucleotideEnum):
    A = 1
    C = 2
    G = 3
    T = 4
    Gap = 5

my_a = MyNucleotide['A']
my_c = MyNucleotide.C
my_g = MyNucleotide.from_str('G')
my_t = MyNucleotide(4)
my_gap = MyNucleotide.from_str('-')

# Use methods inherited from NucleotideEnum
print(my_a.is_purine()) # True
print(my_c.is_purine()) # False
print(my_a.is_gap()) # False
print(my_gap.is_gap()) # True
print(my_a.is_standard()) # True
print(my_t.to_onehot()) # [0, 0, 0, 0, 1]
print(my_c.to_complement()) # G
```
`NucleotideEnum` contains methods for handling standard nucleotides.

To handle gaps, use `GapTokenMixin` together with the corresponding enum class (e.g. `NucleotideEnum` for nucleotides).
`GapTokenMixin` adds an `is_gap()` method that checks whether a token is a gap.
It expects the gap member of the enum to be named `Gap` (case-sensitive).

To handle masks, use `MaskTokenMixin` together with the corresponding enum class.
`MaskTokenMixin` adds an `is_mask()` method that checks whether a token is a mask.
It expects the mask member of the enum to be named `Mask` (case-sensitive).

To handle unspecified "other" nucleotides, use `OtherTokenMixin` together with the corresponding enum class.
`OtherTokenMixin` adds an `is_other()` method that checks whether a token is an unspecified "other" nucleotide.
It expects the "other" member of the enum to be named `Other` (case-sensitive).

To handle gaps, masks, and unspecified "other" nucleotides together, use `SpecialTokenMixin` with the corresponding enum class.
`SpecialTokenMixin` adds an `is_special()` method that checks whether a token is a gap, mask, or other.
The same member name requirements apply as above.
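
To illustrate why the member names matter, here is a minimal sketch of how name-based token mixins could be written on top of the standard library `enum` module. It is not the biodatatypes implementation; the class names `GapCheckMixin`, `MaskCheckMixin`, `OtherCheckMixin`, `SpecialCheckMixin`, and `DemoNucleotide` are hypothetical stand-ins.

```python
from enum import Enum


class GapCheckMixin:
    def is_gap(self) -> bool:
        # Exact, case-sensitive comparison against the member name.
        return self.name == "Gap"


class MaskCheckMixin:
    def is_mask(self) -> bool:
        return self.name == "Mask"


class OtherCheckMixin:
    def is_other(self) -> bool:
        return self.name == "Other"


class SpecialCheckMixin(GapCheckMixin, MaskCheckMixin, OtherCheckMixin):
    def is_special(self) -> bool:
        # A token is special if any of the three name checks matches.
        return self.is_gap() or self.is_mask() or self.is_other()


# Mixins are listed before Enum, as in the examples in this README.
class DemoNucleotide(SpecialCheckMixin, Enum):
    A = 1
    C = 2
    G = 3
    T = 4
    Gap = 5
    Mask = 6
    Other = 7


print(DemoNucleotide.Gap.is_gap())       # True
print(DemoNucleotide.A.is_special())     # False
print(DemoNucleotide.Other.is_special()) # True
```

In this sketch a member named `gap` or `GAP` would not be recognized, which mirrors the case-sensitive naming requirement described above.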
### Custom amino acid enum
Similarly, custom amino acid enums can be created by subclassing `AminoAcidEnum`.
Amino acid enums are expected to use the IUPAC three-letter amino acid codes as member names.

If the termination signal token is included, it is expected to be named `Stop` (case-sensitive).
The termination token has a one-letter code of `*` and a three-letter code of `Ter`.

```python
from biodatatypes.unit.base import AminoAcidEnum
from biodatatypes.unit.mixins import GapTokenMixin, MaskTokenMixin

# Create a custom amino acid enum
# Add mixins to extend functionality when non-standard amino acid tokens (gap, mask) are expected
class MyAminoAcid(MaskTokenMixin, GapTokenMixin, AminoAcidEnum):
    Ala = 1
    Arg = 2
    Asn = 3
    Asp = 4
    Gap = 5
    Mask = 6
    Stop = 7

my_ala = MyAminoAcid['Ala']
my_arg = MyAminoAcid.Arg
my_asn = MyAminoAcid.from_str('N')
my_gap = MyAminoAcid.from_str('-')
my_mask = MyAminoAcid.Mask
my_stop = MyAminoAcid['Stop']

print(my_asn.is_polar()) # True
print(my_ala.is_polar()) # False
print(my_gap.is_gap()) # True
print(my_ala.is_gap()) # False
print(my_mask.is_mask()) # True
print(my_asn.has_amide()) # True
print(my_asn.to_one_letter()) # N
```
The same mixins used for `NucleotideEnum` can be used for `AminoAcidEnum` to extend functionality to handle gaps, masks, and unspecified "other" amino acids.
The same enum name requirements apply as previously mentioned for making custom nucleotide enums.
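
The naming convention for amino acids, including the special case of the termination token, can be sketched with the standard library alone. The following is an illustration under stated assumptions rather than the package's code: `DemoAminoAcid`, the lookup tables, and `to_three_letter()` are hypothetical, and the table covers only a few residues.

```python
from enum import Enum

# Partial table for illustration: member name -> one-letter code.
ONE_LETTER = {"Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Stop": "*"}
# The termination token is named Stop, but its three-letter code is Ter.
THREE_LETTER_OVERRIDES = {"Stop": "Ter"}


class DemoAminoAcid(Enum):
    Ala = 1
    Arg = 2
    Asn = 3
    Asp = 4
    Stop = 5

    def to_one_letter(self) -> str:
        return ONE_LETTER[self.name]

    def to_three_letter(self) -> str:
        # Most members use their own name as the three-letter code.
        return THREE_LETTER_OVERRIDES.get(self.name, self.name)

    @classmethod
    def from_str(cls, text: str) -> "DemoAminoAcid":
        # Accept the member name, the one-letter code, or the three-letter code.
        for member in cls:
            if text in (member.name, member.to_one_letter(), member.to_three_letter()):
                return member
        raise ValueError(f"unknown amino acid code: {text!r}")


print(DemoAminoAcid.from_str("N"))    # DemoAminoAcid.Asn
print(DemoAminoAcid.from_str("Ter"))  # DemoAminoAcid.Stop
print(DemoAminoAcid.from_str("*"))    # DemoAminoAcid.Stop
```

Using the three-letter code as the member name keeps `DemoAminoAcid['Ala']` and `DemoAminoAcid.Ala` readable, while the lookup tables supply the alternative spellings.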
### Custom codon enum
When creating a custom codon enum, name each member after its three-letter codon (for example `GCA`). It is also necessary to specify the associated nucleotide and amino acid enums by overriding the `nucleotide_class` and `aminoacid_class` properties.

The `nucleotide_class` property is used when calling the `from_nucleotides()` and `to_nucleotides()` methods to convert the codon to and from a triplet nucleotide sequence.
The `aminoacid_class` property is used when calling the `translate()` method to translate the codon to an amino acid based on the standard genetic code.
Custom nucleotide and amino acid enums are not required for this purpose, since the included `Nucleotide` and `AminoAcid` can be used, but the corresponding enum classes must still be specified.

```python
from biodatatypes import Nucleotide, AminoAcid
from biodatatypes.unit.base import CodonEnum
from biodatatypes.unit.mixins import GapTokenMixin, MaskTokenMixin

# Create a custom codon enum
# Add mixins to extend functionality when non-standard codon tokens (gap, mask) are expected
class MyCodon(MaskTokenMixin, GapTokenMixin, CodonEnum):
    GCA = 1
    GCG = 2
    GCT = 3
    TAG = 4
    Gap = 5
    Mask = 6
    ATG = 7

    @property
    def nucleotide_class(self):
        return Nucleotide

    @property
    def aminoacid_class(self):
        # MyAminoAcid is the custom enum defined in the previous example
        return MyAminoAcid

my_gca = MyCodon['GCA']
my_gcg = MyCodon.GCG
my_gct = MyCodon.from_str('GCT')
my_gap = MyCodon.from_str('---')
my_mask = MyCodon.Mask
my_atg = MyCodon.start_codon()

print(my_gca.is_fourfold_degenerate()) # True, GCA, GCC, GCG, GCT encode Ala
print(my_atg.is_start_codon()) # True
print(my_atg.is_stop_codon()) # False
print(my_gap.is_gap()) # True
print(my_mask.is_mask()) # True
print(my_gca.translate()) # A
```
The same mixins can be used for `CodonEnum` to extend functionality to handle gaps, masks, and unspecified "other" codons. The same enum name requirements apply as previously mentioned for making custom nucleotide and amino acid enums.
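
The role of the two properties can be sketched without the package. The example below uses only the standard library `enum` module and a small slice of the standard genetic code; `DemoNucleotide`, `DemoAminoAcid`, `DemoCodon`, and `GENETIC_CODE` are hypothetical names, and the method bodies are an assumption about how such lookups might work, not the biodatatypes source.

```python
from enum import Enum


class DemoNucleotide(Enum):
    A = 1
    C = 2
    G = 3
    T = 4


class DemoAminoAcid(Enum):
    Ala = 1
    Met = 2
    Stop = 3


# Partial slice of the standard genetic code: codon name -> amino acid name.
GENETIC_CODE = {"GCA": "Ala", "GCG": "Ala", "ATG": "Met", "TAG": "Stop"}


class DemoCodon(Enum):
    GCA = 1
    GCG = 2
    ATG = 3
    TAG = 4

    @property
    def nucleotide_class(self):
        return DemoNucleotide

    @property
    def aminoacid_class(self):
        return DemoAminoAcid

    def to_nucleotides(self):
        # Look up each letter of the codon name in the associated nucleotide enum.
        return [self.nucleotide_class[base] for base in self.name]

    def translate(self):
        # Resolve the encoded amino acid in the associated amino acid enum.
        return self.aminoacid_class[GENETIC_CODE[self.name]]


print(DemoCodon.GCA.to_nucleotides())  # [<DemoNucleotide.G: 3>, <DemoNucleotide.C: 2>, <DemoNucleotide.A: 1>]
print(DemoCodon.GCA.translate())       # DemoAminoAcid.Ala
print(DemoCodon.TAG.translate())       # DemoAminoAcid.Stop
```

Because the codon enum only refers to its partner enums through the two properties, swapping in a different nucleotide or amino acid enum does not require changing the conversion methods.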
## License
MIT License
## Author
Kent Kawashima