| Field | Value |
|---|---|
| Name | plastburstalign |
| Version | 0.9.1 |
| Summary | A Python tool to extract and align genes, introns, and intergenic spacers across hundreds of plastid genomes using associative arrays |
| Upload time | 2024-08-27 22:57:52 |
| Requires Python | >=3.9 |
| Keywords | bioinformatics, alignment, flatfile, mafft |

# plastburstalign
A Python tool to extract and align genes, introns, and intergenic spacers across hundreds of plastid genomes using associative arrays
### Installation on Linux (Debian)
```bash
# Alignment software
apt install mafft
# Other dependencies
apt install python3-biopython
apt install python3-coloredlogs
```
### Overview of process
![Depiction of plastomes being split according to specified marker type; the extracted sequences are then aligned and concatenated](docs/PlastomeBurstAndAlign_ProcessOverview.png)
### Main features
- Extracts all sequences from each of three different genome marker types (i.e., genes, introns, or intergenic spacers) from a set of plastid genomes in GenBank flatfile format, groups and aligns homologous extracted sequences, and then saves the alignments to file
- Saves both the individual alignments and the concatenation of all alignments
- Automatic removal of any duplicate regions (e.g., features duplicated through the inverted repeats, IRs)
- Exon splicing operations #1: Automatic merging of all exons of any cis-spliced gene [see functions of class `ExonSpliceHandler`]
- Exon splicing operations #2: Automatic grouping of all exons of any trans-spliced gene (e.g., _rps12_), followed by merging the exons [see functions of class `ExonSpliceHandler`]
- Automatic removal of regions that do not meet either of the following criteria [see function `DataCleaning()` for both]:
  - a minimum, user-specified sequence length
  - a minimum, user-specified number of taxa in the dataset in which the region must be found
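
The two removal criteria listed above can be illustrated with a minimal sketch built on plain Python dictionaries (associative arrays). The function name `filter_regions`, its parameters, and the data layout are hypothetical and are not taken from the package; the sketch also assumes that the length threshold is applied per sequence before the taxon count is checked, which may differ from what `DataCleaning()` actually does.

```python
from typing import Dict

# Hypothetical layout: region name -> {taxon name -> extracted sequence}
Regions = Dict[str, Dict[str, str]]


def filter_regions(regions: Regions, min_seq_length: int, min_num_taxa: int) -> Regions:
    """Drop sequences that are too short, then drop regions found in too few taxa."""
    kept: Regions = {}
    for region_name, seqs_by_taxon in regions.items():
        # Assumption: the length threshold applies to each individual sequence
        long_enough = {
            taxon: seq
            for taxon, seq in seqs_by_taxon.items()
            if len(seq) >= min_seq_length
        }
        # Keep the region only if enough taxa still contribute a sequence
        if len(long_enough) >= min_num_taxa:
            kept[region_name] = long_enough
    return kept


# Made-up toy data for illustration only
example = {
    "rbcL": {"taxon_A": "ATGTCACCACAA", "taxon_B": "ATGTCACCACAA", "taxon_C": "ATG"},
    "psbA": {"taxon_A": "ATGACTGCAATT"},
}
cleaned = filter_regions(example, min_seq_length=6, min_num_taxa=2)
print(sorted(cleaned))
```

In this toy dataset, `psbA` is found in a single taxon and is therefore dropped, while `rbcL` is kept without its too-short sequence.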
### Additional features
- Rapid sequence extraction and alignment of the genes/introns/intergenic spacers due to process parallelization using multiple CPUs [see internal function `_nuc_MSA()`]
- Automatic removal of any user-specified genes/introns/intergenic spacers
- Choice of
  - the order of concatenation of the aligned genes/introns/intergenic spacers: either the natural order of the first input genome (command-line option `seq`) or alphabetical order (command-line option `alpha`)
  - automatic case standardization of gene names to adjust for letter-case differences between gene annotations of different genome records (which is especially relevant for anticodon and amino acid abbreviations of tRNAs); includes the option to remove anticodon and amino acid abbreviations from tRNA gene names altogether [see function `clean_gene()`]
- If a gene/intron/intergenic spacer cannot be extracted from a GenBank record, the software reports why the extraction failed
- Availability of two log levels:
  - default (suitable for regular software execution), and
  - verbose (suitable for debugging)
- Package works out of the box on Unix-like systems because the executable of the alignment software (MAFFT) is bundled with the package.
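
To make the case standardization of gene names more concrete, here is a minimal sketch. The function `standardize_gene_name` and its parameter `strip_trna_suffix` are hypothetical illustrations and do not reproduce the package's `clean_gene()`. The sketch assumes the common plastid naming pattern of three lowercase letters followed by uppercase characters (e.g., `rbcL`, `trnK-UUU`); names that deviate from this pattern (e.g., `trnfM`) would need special handling.

```python
import re


def standardize_gene_name(name: str, strip_trna_suffix: bool = False) -> str:
    """Normalize letter case; optionally drop the tRNA anticodon/amino acid suffix."""
    # Split off an optional suffix that follows the first "-" or "_"
    parts = re.split(r"[-_]", name.strip(), maxsplit=1)
    base = parts[0]
    suffix = parts[1] if len(parts) > 1 else ""

    # Assumption: first three letters lowercase, remainder uppercase
    base = base[:3].lower() + base[3:].upper()

    # Optionally remove the anticodon / amino acid abbreviation of tRNA genes
    if base.startswith("trn") and strip_trna_suffix:
        return base
    if suffix:
        return f"{base}-{suffix.upper()}"
    return base


for raw in ("TRNK_uuu", "trnk-UUU", "rbcl", "RPS12"):
    print(raw, "->", standardize_gene_name(raw))
```

With such a normalization, differently cased annotations of the same gene map to a single dictionary key, so homologous sequences from different genome records end up in the same group.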
### Usage
#### Option 1: As a script
If the current working directory is within `plastburstalign`, execute the package via:
```bash
python -m plastburstalign
```
#### Option 2: As a module
From within Python, execute the package functions via:
```python
from plastburstalign import PlastomeRegionBurstAndAlign
burst = PlastomeRegionBurstAndAlign()
burst.execute()
```
#### Usage of individual package components
Individual components can be used as well. For example, to use the class `MAFFT` by itself (e.g., instantiate one configuration of MAFFT that will execute with 1 thread and another that will execute with 10 threads), type:
```python
from plastburstalign import MAFFT
mafft_1 = MAFFT()
mafft_10 = MAFFT({"num_threads": 10})
```
### Explanation of exon splicing
As the software iterates over the gene list produced by parsing all input genomes, genes that comprise multiple exons are automatically flagged and treated according to the distance between their exons. Cis-spliced genes comprise only exons that are adjacent to each other, whereas trans-spliced genes comprise one or more exons that are not adjacent to the others. The software merges the exons of any cis-spliced gene in place (i.e., at the location specified in the source GenBank file; no repositioning of the exons is necessary). The exons of any trans-spliced gene (e.g., _rps12_), by contrast, are repositioned before being merged. Specifically, the software accounts for the fact that GenBank flatfiles list trans-spliced genes out of their natural order along the genome sequence: it repositions the exons of each trans-spliced gene so that they become adjacent and then merges them.
To reposition a trans-spliced gene, all annotations of that gene are first moved from the main gene list to a separate list. Second, the annotations are split into simple location features, one for each contiguous group of exons. Third, the expected location of each of these simple gene features is determined by comparing its end location with the end locations of the gene features in the main gene list. If the feature at its expected location overlaps neither the preceding nor the succeeding gene and differs in name from both, it is inserted directly at that location. If, instead, a flanking gene at the expected location (strictly adjacent or overlapping) has the same name, the annotations are merged; this merging applies to both the preceding and the succeeding gene.
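
The distinction between cis- and trans-spliced genes can be sketched as a simple distance check. Everything in this sketch is an assumption made for illustration: the exon representation as `(start, end, strand)` tuples, the function name `is_trans_spliced`, and the threshold `max_gap` are hypothetical, and the package's `ExonSpliceHandler` may use different criteria. The coordinates are made up.

```python
from typing import List, Tuple

# Hypothetical exon representation: (start, end, strand)
Exon = Tuple[int, int, int]


def is_trans_spliced(exons: List[Exon], max_gap: int = 10_000) -> bool:
    """Return True if any two consecutive exons are not adjacent.

    Assumption: exons count as adjacent if they lie on the same strand and
    are separated by at most `max_gap` bases (a hypothetical threshold).
    """
    ordered = sorted(exons)
    for (_, end_1, strand_1), (start_2, _, strand_2) in zip(ordered, ordered[1:]):
        if strand_1 != strand_2 or start_2 - end_1 > max_gap:
            return True
    return False


# Made-up coordinates: two exons separated by a short intron
cis_gene = [(100, 200, 1), (900, 1100, 1)]
# Made-up coordinates: first exon far away from the other two
trans_gene = [(71_000, 71_114, -1), (99_000, 99_232, -1), (99_800, 99_826, -1)]

print(is_trans_spliced(cis_gene), is_trans_spliced(trans_gene))
```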
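
The third step described above (locating a feature by its end position, then either inserting it or merging it with a same-named flanking gene) can be sketched as follows. The `Feature` class and the functions `touches` and `insert_or_merge` are hypothetical and are not the package's implementation; the sketch assumes half-open coordinates and a main gene list that is sorted by end location. Cases that the description does not cover (e.g., an overlap with a differently named gene) simply result in an insertion here.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    name: str
    start: int  # half-open coordinates: [start, end)
    end: int


def touches(a: Feature, b: Feature) -> bool:
    """True if two features overlap or are strictly adjacent."""
    return a.start <= b.end and b.start <= a.end


def insert_or_merge(genes: List[Feature], feat: Feature) -> None:
    """Place `feat` into `genes` (sorted by end location), merging with
    same-named flanking genes that are adjacent or overlapping."""
    # Expected location: compare end locations with those of the main list
    idx = bisect_left([g.end for g in genes], feat.end)
    merged = feat
    # Check the succeeding gene first so that popping it keeps idx valid
    if idx < len(genes) and genes[idx].name == merged.name and touches(genes[idx], merged):
        nxt = genes.pop(idx)
        merged = Feature(merged.name, min(merged.start, nxt.start), max(merged.end, nxt.end))
    # Then check the preceding gene
    if idx > 0 and genes[idx - 1].name == merged.name and touches(genes[idx - 1], merged):
        prev = genes.pop(idx - 1)
        merged = Feature(merged.name, min(merged.start, prev.start), max(merged.end, prev.end))
        idx -= 1
    genes.insert(idx, merged)


# Made-up coordinates for illustration only
main_list = [Feature("psbA", 0, 100), Feature("rps12", 200, 300), Feature("rpl20", 400, 500)]
insert_or_merge(main_list, Feature("rps12", 300, 350))  # adjacent, same name: merged
insert_or_merge(main_list, Feature("clpP", 360, 390))   # no same-named neighbor: inserted
print(main_list)
```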
### Testing
```bash
cd benchmarking
# CDS
python test_script_cds.py benchmarking1
python test_script_cds.py benchmarking2
# INT
python test_script_int.py benchmarking1
python test_script_int.py benchmarking2
# IGS
python test_script_igs.py benchmarking1
python test_script_igs.py benchmarking2
```
- Dataset `benchmarking1.tar.gz`: all Asteraceae (n=155) listed in [Yang et al. 2022](https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2022.808156)
- Dataset `benchmarking2.tar.gz`: all monocots (n=733) listed in [Yang et al. 2022](https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2022.808156)
### Exemplary usage
See [this document](https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign/blob/main/docs/exemplary_usage.md)
### Generating more test data
See [this document](https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign/blob/main/docs/generating_test_data.md)
Raw data
{
"_id": null,
"home_page": null,
"name": "plastburstalign",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "bioinformatics, alignment, flatfile, MAFFT",
"author": null,
"author_email": "\"Michael Gruenstaeudl, PhD\" <m_gruenstaeudl@fhsu.edu>, Gregory Smith <g_smith10@mail.fhsu.edu>",
"download_url": "https://files.pythonhosted.org/packages/78/59/d1d97f7bb1701daa5c84a065c973adfda1c3ad96b0c8cf0e87313e563b53/plastburstalign-0.9.1.tar.gz",
"platform": null,
"description": "# plastburstalign\nA Python tool to extract and align genes, introns, and intergenic spacers across hundreds of plastid genomes using associative arrays\n\n\n### Installation on Linux (Debian)\n```bash\n# Alignment software\napt install mafft\n\n# Other dependencies\napt install python3-biopython\napt install python3-coloredlogs\n```\n\n### Overview of process\n![Depiction of plastomes being split according to specified marker type; the extracted sequences are then aligned and concatenated](docs/PlastomeBurstAndAlign_ProcessOverview.png)\n\n### Main features\n- Extracts all sequences from each of three different genome marker types (i.e., genes, introns, or intergenic spacers) from a set of plastid genomes in GenBank flatfile format, groups and aligns homologous extracted sequences, and then saves the alignments to file\n- Saves both the individual alignments and the concatenation of all alignments\n- Automatic removal of any duplicate regions (i.e., relevant for features duplicated through the IRs)\n- Exon splicing operations #1: Automatic merging of all exons of any cis-splied gene [see functions of class `ExonSpliceHandler`]\n- Exon splicing operations #2: Automatic grouping of all exons of any trans-spliced gene (e.g., _rps12_), followed by merging the exons [see functions of class `ExonSpliceHandler`]\n- Automatic removal of regions that do not fulfill\n - a minimum, user-specified sequence length\n - a minimum, user-specified number of taxa in the dataset that the region must be found in [see function `DataCleaning()` for both]\n\n### Additional features\n- Rapid sequence extraction and alignment of the genes/introns/intergenic spacers due to process parallelization using multiple CPUs [see internal function `_nuc_MSA()`]\n- Automatic removal of any user-specified genes/introns/intergenic spacers\n- Choice of\n - the order of concatenation of the aligned genes/introns/intergenic spacers to either the natural order of the first input genome 
(commandline option `seq`) or an alphabetic order (commandline option `alpha`)\n - automatic case standardization of gene names to adjust for letter-case differences between gene annotations of different genome records (which is especially relevant for anticodon and amino acid abbreviations of tRNAs); includes the option to remove anticodon and amino acid abbreviations from tRNA gene names altogether [see function `clean_gene()`]\n- If a gene/intron/intergenic spacer cannot be extracted from a GenBank record, provision of explanation why the extraction failed\n- Availability of two log levels:\n - default (suitable for regular software execution), and \n - verbose (suitable for debugging)\n- Package works out of the box on Unix-like systems due to inclusion of the alignment software executable (MAFFT) into the package.\n\n\n### Usage\n\n#### Option 1: As a script\nIf current working directory within `plastburstalign`, execute the package via:\n```bash\npython -m plastburstalign\n```\n\n#### Option 2: As a module\nFrom within Python, execute the package functions via:\n```python\nfrom plastburstalign import PlastomeRegionBurstAndAlign\nburst = PlastomeRegionBurstAndAlign()\nburst.execute()\n```\n\n#### Usage of individual package components\nIndividual components can be used as well. For example, to use the class `MAFFT` by itself (e.g., instantiate a configuration of MAFFT that will execute with 1 thread; institute another that will execute with 10 threads), type:\n\n```python\nfrom plastburstalign import MAFFT\n\nmafft_1 = MAFFT()\nmafft_10 = MAFFT({\"num_threads\": 10})\n```\n\n\n### Explanation of exon splicing\n\nAs the gene list produced through parsing all input genomes is iterated over, genes that comprise multiple exons are automatically flagged and treated according to the distance between their exons. Cis-spliced genes only comprise exons that are adjacent to each other, trans-spliced genes comprise one or more exons that are not adjacent to each other. 
This software merges the exons of any cis-spliced gene in place (i.e., according to the location specified by the source GenBank file; no repositioning of the exons necessary). The exons of any trans-spliced gene (e.g., _rps12_), by contrast, undergo a repositioning before being merged. Specifically, the software accommodates the fact that GenBank flatfiles list trans-spliced genes (e.g., _rps12_) out of their natural order along the genome sequence and additionally repositions the exons of trans-spliced genes by converting them to adjacent exons and then merges these exons.\n\nFor the repositioning of trans-spliced gene, all annotations of that gene are first moved from the main gene list to a separate list. Then, the annotations are split into simple location features for each contiguous group of exons. Third, the expected location of each of these simple gene features is determined by comparing its end location with the end locations of the gene features in the main gene list: if the expected location has no overlap with either the proceeding and succeeding genes and the feature is different in name from either, it is directly inserted into that location. Alternatively, if the expected location of the feature results in a flanking gene (strictly adjacent or overlapping) with the same name, the annotations are merged; the merging is true for both the proceeding and the succeeding gene.\n\n\n### Testing\n```bash\ncd benchmarking\n# CDS\npython test_script_cds.py benchmarking1\npython test_script_cds.py benchmarking2\n# INT\npython test_script_int.py benchmarking1\npython test_script_int.py benchmarking2\n# IGS\npython test_script_igs.py benchmarking1\npython test_script_igs.py benchmarking2\n```\n- Dataset `benchmarking1.tar.gz`: all Asteraceae (n=155) listed in [Yang et al. 2022](https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2022.808156)\n- Dataset `benchmarking2.tar.gz`: all monocots (n=733) listed in [Yang et al. 
2022](https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2022.808156)\n\n\n### Exemplary usage\nSee [this document](https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign/blob/main/docs/exemplary_usage.md)\n\n\n### Generating more test data\nSee [this document](https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign/blob/main/docs/generating_test_data.md)\n\n",
"bugtrack_url": null,
"license": null,
"summary": "A Python tool to extract and align genes, introns, and intergenic spacers across hundreds of plastid genomes using associative arrays",
"version": "0.9.1",
"project_urls": {
"Homepage": "https://gruenstaeudl-lab.com/",
"Repository": "https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign"
},
"split_keywords": [
"bioinformatics",
" alignment",
" flatfile",
" mafft"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e5fd649fc8ac4452d978acfc8308d82c1bf496fb1c9d1b34de260b08142545f5",
"md5": "72d254cbe5ee5dfccf492b97db7eb498",
"sha256": "1c18839df8c8ffc473ff05cba5f5c5e07bd01208ef7b294e9ced0f7d443b452d"
},
"downloads": -1,
"filename": "plastburstalign-0.9.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "72d254cbe5ee5dfccf492b97db7eb498",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 28307,
"upload_time": "2024-08-27T22:57:51",
"upload_time_iso_8601": "2024-08-27T22:57:51.242272Z",
"url": "https://files.pythonhosted.org/packages/e5/fd/649fc8ac4452d978acfc8308d82c1bf496fb1c9d1b34de260b08142545f5/plastburstalign-0.9.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7859d1d97f7bb1701daa5c84a065c973adfda1c3ad96b0c8cf0e87313e563b53",
"md5": "9df50744e97a8b32081252b2c813b67b",
"sha256": "8d8a68e12f70f553801960e429b9201effd6d0d44618fddc9e5b525d50f979db"
},
"downloads": -1,
"filename": "plastburstalign-0.9.1.tar.gz",
"has_sig": false,
"md5_digest": "9df50744e97a8b32081252b2c813b67b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 27091,
"upload_time": "2024-08-27T22:57:52",
"upload_time_iso_8601": "2024-08-27T22:57:52.528882Z",
"url": "https://files.pythonhosted.org/packages/78/59/d1d97f7bb1701daa5c84a065c973adfda1c3ad96b0c8cf0e87313e563b53/plastburstalign-0.9.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-27 22:57:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "michaelgruenstaeudl",
"github_project": "PlastomeBurstAndAlign",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "plastburstalign"
}