bamnostic


:Name: bamnostic
:Version: 1.1.10
:Home page: https://github.com/betteridiot/bamnostic/
:Summary: Pure Python, OS-agnostic Binary Alignment Map (BAM) random access and parsing tool
:Upload time: 2023-04-26 15:26:55
:Author: Marcus D. Sherman
:License: BSD 3-Clause
:Keywords: bam, pysam, genomics, genetics, struct

|Documentation Status| |Conda Version| |PyPI version| |Maintainability|

|status| |DOI| |License|

+---------------------------+------------------------------------------+
| Platform                  | Build Status                             |
+===========================+==========================================+
| Linux                     | |Build Status TravisCI|                  |
+---------------------------+------------------------------------------+
| Windows                   | |Build status Appveyor|                  |
+---------------------------+------------------------------------------+
| conda                     | |noarch|                                 |
+---------------------------+------------------------------------------+

+---------------------+------------------------------------------------+
| Host                | Downloads                                      |
+=====================+================================================+
| PyPI                | |Downloads|                                    |
+---------------------+------------------------------------------------+
| conda               | |Conda Downloads|                              |
+---------------------+------------------------------------------------+

BAMnostic
=========

A *pure Python*, **OS-agnostic** Binary Alignment Map (BAM) file parser
and random access tool.

Note:
-----

Documentation is hosted on `Read the Docs <https://readthedocs.org/>`__
and can be found
`here <http://bamnostic.readthedocs.io/en/latest/>`__ or by going to
this address: http://bamnostic.readthedocs.io.

--------------

Installation
------------

There are four installation methods available (choose one):

Through the ``conda`` package manager (`Anaconda Cloud <https://anaconda.org/conda-forge/bamnostic>`__)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   # first, add the conda-forge channel to your conda configuration
   conda config --add channels conda-forge

   # now bamnostic is available for install
   conda install bamnostic

Through the Python Package Index (`PyPI <https://pypi.org/>`__)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   pip install bamnostic

   # or, if you don't have superuser access
   pip install --user bamnostic

Through pip+GitHub
~~~~~~~~~~~~~~~~~~

.. code:: bash

   pip install -e git+https://github.com/betteridiot/bamnostic.git#egg=bamnostic

   # or, if you don't have superuser access
   pip install --user -e git+https://github.com/betteridiot/bamnostic.git#egg=bamnostic

Traditional GitHub clone
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   git clone https://github.com/betteridiot/bamnostic.git
   cd bamnostic
   pip install -e .

   # or, if you don't have superuser access
   pip install --user -e .

--------------

Quickstart
----------

Bamnostic is meant to be a reduced drop-in replacement for
`pysam <https://github.com/pysam-developers/pysam>`__. As such, it has
much the same API as ``pysam`` with regard to BAM-related operations.
**Note**: the ``pileup()`` method is not supported at this time.

Importing
~~~~~~~~~

.. code:: python

   >>> import bamnostic as bs

Loading your BAM file (Note: CRAM format is not supported at this time)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Bamnostic comes with an example BAM (and respective BAI) file so that
you can experiment with the output. Note, however, that the example BAM
file does not contain many reference contigs, so random access is
limited. The example file is made available through
``bamnostic.example_bam``, which is just a string path to the BAM file
within the package.

.. code:: python

   >>> bam = bs.AlignmentFile(bs.example_bam, 'rb')

Get the header
~~~~~~~~~~~~~~

**Note**: this will print out the SAM header. If the SAM header is not
in the BAM file, it will print out the dictionary representation of the
BAM header instead. That dictionary maps each refID to a tuple of
contig name and contig length.

.. code:: python

   >>> bam.header
   {0: ('chr1', 1575), 1: ('chr2', 1584)}

Data validation through ``head()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

   >>> bam.head(n=2)
   [EAS56_57:6:190:289:82  69  chr1    100 0   *   =   100 0   CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:C:192,
    EAS56_57:6:190:289:82  137 chr1    100 73  35M =   100 0   AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:C:64 Aq:C:0  NM:C:0  UQ:C:0  H0:C:1  H1:C:0]

Getting the first read
~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

   >>> first_read = next(bam)
   >>> print(first_read)
   EAS56_57:6:190:289:82   69  chr1    100 0   *   =   100 0   CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:C:192

Exploring the read
~~~~~~~~~~~~~~~~~~

.. code:: python

   # read name
   >>> print(first_read.read_name)
   EAS56_57:6:190:289:82

   # 0-based position
   >>> print(first_read.pos)
   99

   # nucleotide sequence
   >>> print(first_read.seq)
   CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA

   # Read FLAG
   >>> print(first_read.flag)
   69

   # decoded FLAG
   >>> bs.utils.flag_decode(first_read.flag)
   [(1, 'read paired'), (4, 'read unmapped'), (64, 'first in pair')]
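
The decoding shown above is a bitmask test against the FLAG values defined
in the SAM specification. The following is a minimal standard-library
sketch of that idea, not bamnostic's own implementation. The names
``SAM_FLAGS`` and ``decode_flag`` are hypothetical, and the labels other
than the three shown in the output above are paraphrased from the
specification.

.. code:: python

   # Hypothetical helper for illustration; bamnostic ships its own
   # ``bs.utils.flag_decode``. Bit values follow the SAM specification.
   SAM_FLAGS = [
       (0x1, 'read paired'),
       (0x2, 'read mapped in proper pair'),
       (0x4, 'read unmapped'),
       (0x8, 'mate unmapped'),
       (0x10, 'read reverse strand'),
       (0x20, 'mate reverse strand'),
       (0x40, 'first in pair'),
       (0x80, 'second in pair'),
       (0x100, 'not primary alignment'),
       (0x200, 'read fails quality checks'),
       (0x400, 'read is PCR or optical duplicate'),
       (0x800, 'supplementary alignment'),
   ]


   def decode_flag(flag):
       """Return the (bit, label) pairs whose bit is set in ``flag``."""
       return [(bit, label) for bit, label in SAM_FLAGS if flag & bit]


   # 69 == 1 + 4 + 64, so exactly three bits are set.
   print(decode_flag(69))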

Random Access
~~~~~~~~~~~~~

.. code:: python

   >>> for i, read in enumerate(bam.fetch('chr2', 1, 100)):
   ...    if i >= 3:
   ...        break
   ...    print(read)

   B7_591:8:4:841:340  73  chr2    1   99  36M *   0   0   TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA    <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;;    MF:C:18 Aq:C:77 NM:C:0  UQ:C:0  H0:C:1  H1:C:0
   EAS54_67:4:142:943:582  73  chr2    1   99  35M *   0   0   TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA <<<<<<;<<<<<<:<<;<<<<;<<<;<<<:;<<<5 MF:C:18 Aq:C:41 NM:C:0  UQ:C:0  H0:C:1  H1:C:0
   EAS54_67:6:43:859:229   153 chr2    1   66  35M *   0   0   TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA +37<=<.;<<7.;77<5<<0<<<;<<<27<<<<<< MF:C:32 Aq:C:0  NM:C:0  UQ:C:0  H0:C:1  H1:C:0

--------------

Introduction
------------

Next-Generation Sequencing
~~~~~~~~~~~~~~~~~~~~~~~~~~

The field of genomics relies on sequencing data produced by
Next-Generation Sequencing (NGS) platforms (such as
`Illumina <https://www.illumina.com/>`__). These data take the form of
millions of short strings that represent the nucleotide sequences (A, T,
C, or G) of the sample fragments processed by the NGS platform. More
information regarding the NGS workflow can be found
`here <https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf>`__.
An example of a single entry (in the format known as FASTQ) can be seen
below (`FASTQ Format <https://en.wikipedia.org/wiki/FASTQ_format>`__):

.. code:: bash

   @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
   GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
   +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
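
The four lines of the record above can be pulled apart with a few lines of
standard-library Python. This is only a sketch: ``parse_fastq_record`` is
a hypothetical helper, and it assumes the common Phred+33 quality encoding
(the source does not state which encoding the example uses).

.. code:: python

   def parse_fastq_record(lines, phred_offset=33):
       """Split one four-line FASTQ record into name, sequence, and scores.

       Assumes Phred+33 quality encoding unless ``phred_offset`` says otherwise.
       """
       header, sequence, separator, quality = [line.rstrip('\n') for line in lines]
       if not header.startswith('@') or not separator.startswith('+'):
           raise ValueError('not a FASTQ record')
       if len(sequence) != len(quality):
           raise ValueError('sequence and quality lengths differ')
       name = header[1:].split()[0]
       # Each quality character encodes one integer score per base.
       scores = [ord(char) - phred_offset for char in quality]
       return name, sequence, scores


   record = [
       '@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36',
       'GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC',
       '+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36',
       'IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC',
   ]
   name, sequence, scores = parse_fastq_record(record)
   print(name, len(sequence), min(scores), max(scores))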

Each entry details the read name, length, string representation, and
quality of each base along the read.

SAM/BAM Format
~~~~~~~~~~~~~~

The data from the NGS platforms are often aligned to a reference genome.
That is, each entry goes through an alignment algorithm that finds the
best position that the entry matches along a known reference sequence.
The alignment step extends the original entry with a number of
additional attributes. A few of the included attributes are contig,
position, and Compact Idiosyncratic Gapped Alignment Report (CIGAR)
string. The modified entry is called a Sequence Alignment Map (SAM)
entry. An example can be seen below (`SAM
format <https://samtools.github.io/hts-specs/SAMv1.pdf>`__):

.. code:: bash

   @HD VN:1.5 SO:coordinate
   @SQ SN:ref LN:45
   r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
   r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
   r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
   r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
   r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
   r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
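
To make the CIGAR column more concrete, the sketch below computes how many
reference bases an alignment covers. It relies on the SAM specification's
rule that only the ``M``, ``D``, ``N``, ``=``, and ``X`` operations consume
the reference. The function names are hypothetical and this is not part of
bamnostic's API.

.. code:: python

   import re

   # One CIGAR element is a length followed by a single operation code.
   CIGAR_PATTERN = re.compile(r'(\d+)([MIDNSHP=X])')
   # Operations that advance the position along the reference.
   REFERENCE_CONSUMING = set('MDN=X')


   def reference_span(cigar):
       """Number of reference bases covered by an alignment."""
       return sum(
           int(length)
           for length, op in CIGAR_PATTERN.findall(cigar)
           if op in REFERENCE_CONSUMING
       )


   def reference_end(pos, cigar):
       """1-based, inclusive end position given the 1-based SAM ``POS``."""
       return pos + reference_span(cigar) - 1


   # r001 from the example: starts at position 7 with CIGAR 8M2I4M1D3M.
   print(reference_span('8M2I4M1D3M'), reference_end(7, '8M2I4M1D3M'))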

There are many benefits to the SAM format: it is human-readable, each
entry is contained on a single line (supporting simple stream analysis),
it gives a concise description of the read’s quality and position, and
its file header carries metadata that supports integrity and
reproducibility. Additionally, a compressed form of the SAM format was
designed in parallel. It is called the Binary Alignment Map
(`BAM <https://samtools.github.io/hts-specs/SAMv1.pdf>`__). Using a
series of clever byte encodings of each SAM entry, the data are
compressed into specialized, concatenated GZIP blocks called Blocked GNU
Zip Format (`BGZF <https://samtools.github.io/hts-specs/SAMv1.pdf>`__)
blocks. Each BGZF block contains a finite amount of data (up to 64 KiB).
While the whole file is GZIP compatible, each individual block is also
independently GZIP compatible. This data structure ultimately makes the
file larger than a normal GZIP file, but it also allows for random
access within the file through the use of a BAM Index file
(`BAI <https://samtools.github.io/hts-specs/SAMv1.pdf>`__).
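
The property that makes random access possible, namely that each block can
be decompressed on its own, can be seen with the standard ``gzip`` module.
The sketch below uses plain GZIP members rather than real BGZF blocks (which
carry an extra header field recording the block size), so it only
illustrates the principle.

.. code:: python

   import gzip

   # Compress two chunks independently, then concatenate the results.
   first = gzip.compress(b'read-one\n')
   second = gzip.compress(b'read-two\n')
   blob = first + second

   # The concatenation is still a valid GZIP stream...
   whole = gzip.decompress(blob)

   # ...and decompression can also start at the second member's byte offset,
   # without touching the data that precedes it.
   tail = gzip.decompress(blob[len(first):])

   print(whole, tail)

Knowing the byte offset at which a block starts is therefore enough to
jump straight to it, which is the information an index has to supply.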

BAI
~~~

The BAI file, often produced via
`samtools <http://samtools.sourceforge.net/>`__, requires the BAM file
to be sorted prior to indexing. Using a modified R-tree binning
strategy, each reference contig is divided into sequential,
non-overlapping bins. That is, a parent bin may contain numerous
children, but no child bin overlaps another child’s assigned interval.
Each BAM entry is then assigned to the smallest bin that fully contains
it. A visual description of the binning strategy can be found
`here <https://samtools.github.io/hts-specs/SAMv1.pdf>`__. Each bin is
composed of chunks, and each chunk contains its respective start and
stop byte positions within the BAM file.

In addition to the bin index, a linear index is produced as well. Again,
the reference contig is divided into equally sized windows (covering
≈16 kbp each). For each window, the start offset of the first read that
**overlaps** that window is stored.

Now, given a region of interest, the first bin that overlaps the region
is looked up. The chunks in the bin are stored as *virtual offsets*. A
virtual offset is a 64-bit unsigned integer that is composed of the
compressed offset ``coffset`` (indicating the byte position of the start
of the containing BGZF block) and the uncompressed offset ``uoffset``
(indicating the byte position within the uncompressed data of the BGZF
block at which the data starts). A virtual offset is calculated by:

.. code:: python

   virtual_offset = coffset << 16 | uoffset

Similarly, the inverse of the above is as follows:

.. code:: python

   coffset = virtual_offset >> 16
   uoffset = virtual_offset ^ (coffset << 16)
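
The two snippets above can be wrapped into a pair of round-trip helpers.
The function names are hypothetical, and the range checks simply encode the
64-bit layout described in the text (48 bits for ``coffset``, 16 bits for
``uoffset``).

.. code:: python

   def make_virtual_offset(coffset, uoffset):
       """Pack a block start and an in-block position into one integer."""
       if not 0 <= uoffset < (1 << 16):
           raise ValueError('uoffset must fit in 16 bits')
       if not 0 <= coffset < (1 << 48):
           raise ValueError('coffset must fit in 48 bits')
       return (coffset << 16) | uoffset


   def split_virtual_offset(virtual_offset):
       """Recover (coffset, uoffset) from a virtual offset."""
       # Masking the low 16 bits is equivalent to the XOR form shown above.
       return virtual_offset >> 16, virtual_offset & 0xFFFF


   voffset = make_virtual_offset(1000, 42)
   print(voffset, split_virtual_offset(voffset))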

A simple seek call against the BAM file will put the head at the start
of your region of interest.
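
For readers who want to see the binning scheme from this section in code,
the sketch below is a Python port of the ``reg2bin`` C function published
in the SAM specification, along with the 16 kbp linear-index window
calculation. It is given for illustration under that assumption and is not
necessarily how bamnostic implements the lookup; ``linear_window`` is a
hypothetical name.

.. code:: python

   def reg2bin(beg, end):
       """Smallest bin fully containing the 0-based half-open region [beg, end)."""
       end -= 1
       # Try the smallest bins (16 kbp) first, then each larger level.
       if beg >> 14 == end >> 14:
           return ((1 << 15) - 1) // 7 + (beg >> 14)
       if beg >> 17 == end >> 17:
           return ((1 << 12) - 1) // 7 + (beg >> 17)
       if beg >> 20 == end >> 20:
           return ((1 << 9) - 1) // 7 + (beg >> 20)
       if beg >> 23 == end >> 23:
           return ((1 << 6) - 1) // 7 + (beg >> 23)
       if beg >> 26 == end >> 26:
           return ((1 << 3) - 1) // 7 + (beg >> 26)
       # Bin 0 spans the whole 512 Mbp coordinate space.
       return 0


   def linear_window(pos):
       """Index of the 16 kbp linear-index window containing 0-based ``pos``."""
       return pos >> 14


   print(reg2bin(0, 100), reg2bin(16383, 16385), linear_window(16384))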

--------------

Motivation
----------

The common practice within the field of genomics/genetics when analyzing
BAM files is to use the program known as
`samtools <http://samtools.sourceforge.net/>`__. The maintainers of
samtools have done a tremendous job of providing distributions that work
on a multitude of operating systems. While samtools is powerful, as a
command line interface it is also limited in that it doesn’t really
afford the ability to perform real-time dynamic processing of reads
(without requiring many system calls to samtools).

To fill that gap, a package called
`pysam <https://github.com/pysam-developers/pysam>`__ was written in
Python, a language chosen for its general nature and inherent
readability. This package gave users a very comfortable means of doing
such dynamic processing. However, the foundation of these tools is built
on a C API called `htslib <https://github.com/samtools/htslib>`__, and
htslib cannot be compiled in a Windows environment. By extension,
neither can pysam.

In building a tool for genomic visualization, I wanted it to be platform
agnostic. This is precisely when I found out that the tools I had
planned to use as a backend did not work on Windows, the most prevalent
operating system in the end-user world. So, I wrote **bamnostic**.

As of this writing, bamnostic is OS-agnostic and written completely in
pure Python, requiring only the standard library (and ``pytest`` for the
test suite). Special care was taken to ensure that it would run on all
versions of CPython 2.7 or greater. Additionally, it runs in both stable
versions of PyPy. While it may perform slower than its C counterparts,
bamnostic opens up the science to a much greater end-user group. Lastly,
it is lightweight enough to fit into any simple web server
(e.g. `Flask <http://flask.pocoo.org/>`__), further expanding the
science of genetics/genomics.

--------------

Citation
--------

If you use bamnostic in your analyses, please consider citing `Li et al.
(2009) <http://www.ncbi.nlm.nih.gov/pubmed/19505943>`__ as well.
Regarding the citation for bamnostic, please use the JOSS journal
article (click on the JOSS badge above) or use the following:

   Sherman MD and Mills RE (2018). BAMnostic: an OS-agnostic toolkit for
   genomic sequence analysis. Journal of Open Source Software, 3(28), 826,
   https://doi.org/10.21105/joss.00826

--------------

Community Guidelines:
---------------------

Pull requests for improvements, optimizations, or features are eagerly
accepted. For any questions or issues, please feel free to post to
bamnostic’s
`Issue tracker <https://github.com/betteridiot/bamnostic/issues>`__ on
GitHub or read over our
`CONTRIBUTING <https://github.com/betteridiot/bamnostic/blob/master/CONTRIBUTING.md>`__
documentation.

--------------

Community Contributors:
-----------------------

Below you will find a list of contributors. It acts as a small token
of my gratitude to the community that has helped support this project.

1. `@GeekLogan <https://github.com/GeekLogan>`__
2. `@giesselmann <https://github.com/giesselmann>`__
3. `@olgabot <https://github.com/olgabot>`__
4. `@OliverVoogd <https://github.com/OliverVoogd>`__
5. `@gmat <https://github.com/gmat>`__

.. |Documentation Status| image:: https://readthedocs.org/projects/bamnostic/badge/?version=latest
   :target: https://bamnostic.readthedocs.io/en/latest/?badge=latest
.. |Conda Version| image:: https://img.shields.io/conda/vn/conda-forge/bamnostic.svg
   :target: https://anaconda.org/conda-forge/bamnostic
.. |PyPI version| image:: https://badge.fury.io/py/bamnostic.svg
   :target: https://badge.fury.io/py/bamnostic
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/d7e36e72f109c598c86d/maintainability
   :target: https://codeclimate.com/github/betteridiot/bamnostic/maintainability
.. |status| image:: http://joss.theoj.org/papers/9952b35bbb30ca6c01e6a27b80006bd8/status.svg
   :target: http://joss.theoj.org/papers/9952b35bbb30ca6c01e6a27b80006bd8
.. |DOI| image:: https://zenodo.org/badge/121782433.svg
   :target: https://zenodo.org/badge/latestdoi/121782433
.. |License| image:: https://img.shields.io/badge/License-BSD%203--Clause-blue.svg
   :target: https://github.com/betteridiot/bamnostic/blob/master/LICENSE
.. |Build Status TravisCI| image:: https://travis-ci.org/betteridiot/bamnostic.svg?branch=master
   :target: https://travis-ci.org/betteridiot/bamnostic
.. |Build status Appveyor| image:: https://ci.appveyor.com/api/projects/status/y95q02gkv3lgmlf4/branch/master?svg=true
   :target: https://ci.appveyor.com/project/betteridiot/bamnostic/branch/master
.. |noarch| image:: https://img.shields.io/circleci/project/github/conda-forge/bamnostic-feedstock/master.svg?label=noarch
   :target: https://circleci.com/gh/conda-forge/bamnostic-feedstock
.. |Downloads| image:: http://pepy.tech/badge/bamnostic
   :target: http://pepy.tech/project/bamnostic
.. |Conda Downloads| image:: https://img.shields.io/conda/dn/conda-forge/bamnostic.svg
   :target: https://anaconda.org/conda-forge/bamnostic

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/betteridiot/bamnostic/",
    "name": "bamnostic",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "BAM pysam genomics genetics struct",
    "author": "Marcus D. Sherman",
    "author_email": "mdsherm@umich.edu",
    "download_url": "https://files.pythonhosted.org/packages/53/89/9b260fded5f59d4acf8f5a218c91cd38467a6eb5cfd3b7edcb10f657eb77/bamnostic-1.1.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD 3-Clause",
    "summary": "Pure Python, OS-agnostic Binary Alignment Map (BAM) random access and parsing tool",
    "version": "1.1.10",
    "split_keywords": [
        "bam",
        "pysam",
        "genomics",
        "genetics",
        "struct"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6c3246455674d390862af25ec948ea149ffc2c4421984079dbcb14a2ed2e7b86",
                "md5": "ed053f75c7e10f358da6379810f76847",
                "sha256": "8fab604d56996f185844a7530b0bb3a96610a14d7cd5d3465fe61a5e3a1729de"
            },
            "downloads": -1,
            "filename": "bamnostic-1.1.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ed053f75c7e10f358da6379810f76847",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 183069,
            "upload_time": "2023-04-26T15:26:52",
            "upload_time_iso_8601": "2023-04-26T15:26:52.800213Z",
            "url": "https://files.pythonhosted.org/packages/6c/32/46455674d390862af25ec948ea149ffc2c4421984079dbcb14a2ed2e7b86/bamnostic-1.1.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "53899b260fded5f59d4acf8f5a218c91cd38467a6eb5cfd3b7edcb10f657eb77",
                "md5": "528400693b91ba4b257760f0716cf5f1",
                "sha256": "2f7e7e5cb693c5f933c5b5c3fde49c6c8dee62b608ebd13a4604401573e37017"
            },
            "downloads": -1,
            "filename": "bamnostic-1.1.10.tar.gz",
            "has_sig": false,
            "md5_digest": "528400693b91ba4b257760f0716cf5f1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 238497,
            "upload_time": "2023-04-26T15:26:55",
            "upload_time_iso_8601": "2023-04-26T15:26:55.916959Z",
            "url": "https://files.pythonhosted.org/packages/53/89/9b260fded5f59d4acf8f5a218c91cd38467a6eb5cfd3b7edcb10f657eb77/bamnostic-1.1.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-26 15:26:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "betteridiot",
    "github_project": "bamnostic",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "appveyor": true,
    "requirements": [],
    "lcname": "bamnostic"
}
        