This module allows fast random access to files compressed with bgzip_ and
indexed by tabix_. It includes a C extension with code from klib_. The bgzip
and tabix programs are available here_.
Installation
------------
::

    pip install --user pytabix
Synopsis
--------
Genomics data is often in a table where each row corresponds to a genomic
region (start, end) or a position::

    chrom  pos      snp
    1      1000760  rs75316104
    1      1000894  rs114006445
    1      1000910  rs79750022
    1      1001177  rs4970401
    1      1001256  rs78650406
With tabix_, you can quickly retrieve all rows in a genomic region by
specifying a query with a sequence name, start, and end:
.. code:: python

    import tabix

    # Open a remote or local file.
    url = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/"
    url += "ALL.2of4intersection.20100804.genotypes.vcf.gz"

    tb = tabix.open(url)

    # These three queries are equivalent. Each returns an iterator over the
    # results. queryi() takes a sequence index instead of a sequence name.
    records = tb.query("1", 1000000, 1250000)
    records = tb.queryi(0, 1000000, 1250000)
    records = tb.querys("1:1000000-1250000")

    # Each record is a list of strings.
    for record in records:
        print(record[:3])
.. code:: python

    ['1', '1000760', 'rs75316104']
    ['1', '1000894', 'rs114006445']
    ['1', '1000910', 'rs79750022']
    ['1', '1001177', 'rs4970401']
    ['1', '1001256', 'rs78650406']
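Because each record is a list of strings, numeric columns must be converted before they can be used in arithmetic or comparisons. The following is a minimal sketch, assuming the three-column layout shown above. The ``Snp`` tuple and ``parse_snps`` helper are hypothetical names for illustration and are not part of pytabix; the hard-coded ``records`` list stands in for the iterator that a query would return.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Snp(NamedTuple):
    """One row of the table: sequence name, position, SNP identifier."""
    chrom: str
    pos: int
    snp: str


def parse_snps(records: Iterable[List[str]]) -> Iterator[Snp]:
    """Convert raw records (lists of strings) into typed Snp tuples."""
    for record in records:
        chrom, pos, snp = record[:3]
        yield Snp(chrom, int(pos), snp)


# Stand-in for the iterator returned by tb.query(...); assumed data
# copied from the first two rows of the table above.
records = [
    ["1", "1000760", "rs75316104"],
    ["1", "1000894", "rs114006445"],
]

snps = list(parse_snps(records))

# Positions are now integers, so distances can be computed directly.
distance = snps[1].pos - snps[0].pos
print(distance)
```

The same generator can wrap a real query result, for example ``parse_snps(tb.query("1", 1000000, 1250000))``, without loading all rows into memory first.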
Example
-------
Let's say you have a table of gene coordinates:
.. code:: bash

    $ zcat example.bed.gz | shuf | head -n5 | column -t
    chr19  53611131   53636172   55786   ZNF415
    chr10  72149121   72150375   221017  CEP57L1P1
    chr4   185009858  185139113  133121  ENPP6
    chrX   132669772  133119672  2719    GPC3
    chr6   134924279  134925376  114182  FAM8A6P
Sort_ it by chromosome, then by start and end positions. Then, use bgzip_ to
deflate the file into compressed blocks:
.. code:: bash

    $ zcat example.bed.gz | sort -k1V -k2n -k3n | bgzip > example.bed.bgz
The compressed size is usually slightly larger than that obtained with gzip.
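The sort order used above (chromosome in version order, then numeric start and end) can also be sketched in Python. This is only an approximation of GNU version sort, written under the assumption that chromosome names are simple mixes of letters and digits such as ``chr10`` or ``chrX``. The ``natural_key`` and ``bed_sort_key`` helpers are hypothetical names, and the rows are the five example rows shown earlier.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so that chr4 < chr10.

    Each chunk becomes a tuple of the same shape, which keeps numbers and
    text comparable: numeric chunks sort before text chunks.
    """
    key = []
    for chunk in re.findall(r"\d+|\D+", name):
        if chunk.isdigit():
            key.append((0, int(chunk), ""))
        else:
            key.append((1, 0, chunk))
    return tuple(key)


def bed_sort_key(row):
    """Order rows by chromosome, then by numeric start and end."""
    return (natural_key(row[0]), int(row[1]), int(row[2]))


rows = [
    ["chr19", "53611131", "53636172", "55786", "ZNF415"],
    ["chr10", "72149121", "72150375", "221017", "CEP57L1P1"],
    ["chr4", "185009858", "185139113", "133121", "ENPP6"],
    ["chrX", "132669772", "133119672", "2719", "GPC3"],
    ["chr6", "134924279", "134925376", "114182", "FAM8A6P"],
]

rows.sort(key=bed_sort_key)
chrom_order = [row[0] for row in rows]
print(chrom_order)
```

A plain string sort would place ``chr10`` and ``chr19`` before ``chr4``, which is why the chromosome column is compared chunk by chunk rather than as a whole string.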
Index the file with tabix_:
.. code:: bash

    $ tabix -s 1 -b 2 -e 3 example.bed.bgz

    $ ls
    example.bed.gz  example.bed.bgz  example.bed.bgz.tbi
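Once the index exists, the file can be queried from Python in the same way as in the Synopsis. The following is a minimal sketch, assuming pytabix is installed and that ``example.bed.bgz`` and its ``.tbi`` index are in the working directory. The ``region_string`` and ``genes_in_region`` helpers are hypothetical names, and the query coordinates in the comment are chosen only so that they overlap the GPC3 row shown earlier.

```python
def region_string(chrom, start, end):
    """Format a region as 'name:start-end', the form passed to querys()."""
    if start > end:
        raise ValueError("start must not exceed end")
    return "{}:{}-{}".format(chrom, start, end)


def genes_in_region(path, chrom, start, end):
    """Return the gene names (fifth column) of rows overlapping a region."""
    # Imported lazily so region_string() can be used without pytabix.
    import tabix

    tb = tabix.open(path)
    region = region_string(chrom, start, end)
    return [record[4] for record in tb.querys(region)]


region = region_string("chrX", 132000000, 134000000)
print(region)

# With the files from this example present, this call would list the
# genes in that window:
# genes_in_region("example.bed.bgz", "chrX", 132000000, 134000000)
```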
.. _bgzip: http://samtools.sourceforge.net/tabix.shtml
.. _tabix: http://samtools.sourceforge.net/tabix.shtml
.. _klib: https://github.com/jmarshall/klib
.. _here: http://sourceforge.net/projects/samtools/files/tabix/
.. _Sort: https://www.gnu.org/software/coreutils/manual/html_node/Details-about-version-sort.html#Details-about-version-sort
Raw data
{
"_id": null,
"home_page": "https://github.com/slowkow/pytabix",
"name": "pytabix",
"maintainer": "Kamil Slowikowski",
"docs_url": null,
"requires_python": null,
"maintainer_email": "slowikow@broadinstitute.org",
"keywords": "tabix, bgzip, bioinformatics, genomics",
"author": "Hyeshik Chang, Kamil Slowikowski",
"author_email": "hyeshik@snu.ac.kr, slowikow@broadinstitute.org",
"download_url": "https://files.pythonhosted.org/packages/84/6a/520ecf75c2ada77492cb4ed21fb22aed178e791df434ca083b59fffadddd/pytabix-0.1.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "MIT",
"summary": "Python interface for tabix",
"version": "0.1",
"project_urls": {
"Download": "UNKNOWN",
"Homepage": "https://github.com/slowkow/pytabix"
},
"split_keywords": [
"tabix",
" bgzip",
" bioinformatics",
" genomics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "846a520ecf75c2ada77492cb4ed21fb22aed178e791df434ca083b59fffadddd",
"md5": "bf9c069c3787c0c240255b917ef34405",
"sha256": "0774f1687ebd41811fb07a0e50951b6be72d7cc7e22ed2b18972eaf7482eb7d1"
},
"downloads": -1,
"filename": "pytabix-0.1.tar.gz",
"has_sig": false,
"md5_digest": "bf9c069c3787c0c240255b917ef34405",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 45811,
"upload_time": "2014-04-16T17:49:24",
"upload_time_iso_8601": "2014-04-16T17:49:24.235849Z",
"url": "https://files.pythonhosted.org/packages/84/6a/520ecf75c2ada77492cb4ed21fb22aed178e791df434ca083b59fffadddd/pytabix-0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2014-04-16 17:49:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "slowkow",
"github_project": "pytabix",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pytabix"
}