biocommons.seqrepo


- Name: biocommons.seqrepo
- Version: 0.6.9
- Summary: Non-redundant, compressed, journalled, file-based storage for biological sequences
- Upload time: 2024-02-20 03:30:34
- Requires Python: >=3.9

# biocommons.seqrepo

SeqRepo is a Python package for storing and reading a local collection of
biological sequences. The repository is non-redundant, compressed, and
journalled, making it efficient to store and transfer multiple snapshots.

## Introduction

Specific, named biological sequences provide the reference and coordinate
system for communicating variation and consequential phenotypic changes.
Several databases of sequences exist, with significant overlap, all using
distinct names. Furthermore, these systems are often difficult to install
locally.

SeqRepo provides an efficient, non-redundant and indexed storage system for
biological sequences. Clients refer to sequences and metadata using familiar
identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based
identifiers. The interface supports fast slicing of arbitrary regions of large
sequences.

A "fully-qualified" identifier includes a namespace to disambiguate accessions
from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the
namespace is provided, seqrepo uses it as-is; if the namespace is not provided
and the unqualified identifier refers to a unique sequence, it is returned;
otherwise, the ambiguous identifier raises an error.
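
The resolution rule can be illustrated with a toy lookup. This is a minimal sketch, not SeqRepo's implementation: the `ALIASES` table, the `seq-a`/`seq-b`/`seq-c` ids, the `resolve` function, and `AmbiguousAliasError` are all hypothetical names invented for the example.

```python
# Toy illustration only: this is not SeqRepo's implementation.
class AmbiguousAliasError(KeyError):
    """Raised when an unqualified alias maps to more than one sequence."""


# Hypothetical alias table: (namespace, alias) -> internal sequence id.
ALIASES = {
    ("GRCh37", "1"): "seq-a",
    ("GRCh38", "1"): "seq-b",
    ("refseq", "NM_000551.3"): "seq-c",
}


def resolve(identifier: str) -> str:
    """Return the sequence id for a qualified or unqualified identifier."""
    namespace, sep, alias = identifier.partition(":")
    if sep:
        # Namespace given: use it as-is.
        return ALIASES[(namespace, alias)]
    # No namespace: collect every distinct sequence that carries this alias.
    matches = {seq_id for (_, a), seq_id in ALIASES.items() if a == identifier}
    if not matches:
        raise KeyError(identifier)
    if len(matches) > 1:
        raise AmbiguousAliasError(identifier)
    return matches.pop()
```

With this table, `resolve("GRCh38:1")` and `resolve("NM_000551.3")` each return a single id, while `resolve("1")` raises `AmbiguousAliasError` because "1" exists in both assemblies.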

SeqRepo favors namespaces from [identifiers.org](https://identifiers.org)
whenever available. Examples include
[refseq](<https://registry.identifiers.org/registry/refseq>) and
[ensembl](<https://registry.identifiers.org/registry/ensembl>).

[seqrepo-rest-service](https://github.com/biocommons/seqrepo-rest-service) provides a REST interface
and docker image.

Released under the Apache License, 2.0.

[![ci_rel](https://travis-ci.org/biocommons/biocommons.seqrepo.svg?branch=master)](https://travis-ci.org/biocommons/biocommons.seqrepo)
\|
[![cov](https://coveralls.io/repos/github/biocommons/biocommons.seqrepo/badge.svg?branch=)](https://coveralls.io/github/biocommons/biocommons.seqrepo?branch=)
\|
[![pypi_rel](https://badge.fury.io/py/biocommons.seqrepo.png)](https://pypi.org/pypi?name=biocommons.seqrepo)
\| [ChangeLog](https://github.com/biocommons/biocommons.seqrepo/tree/master/docs/changelog/0.5)

## Citation

Hart RK, Prlić A (2020). **SeqRepo: A system for managing local collections of
biological sequences.** PLoS ONE 15(12): e0239883.
<https://doi.org/10.1371/journal.pone.0239883>

## Features

- Timestamped, read-only snapshots.
- Space-efficient storage of sequences within a single snapshot and across snapshots.
- Bandwidth-efficient transfer of incremental updates.
- Fast fetching of sequence slices on chromosome-scale sequences.
- Precomputed digests that may be used as sequence aliases.
- Mappings of external aliases (i.e., accessions or identifiers like
  `NM_013305.4`) to sequences.

## Deployment Scenarios

- Local read-only archive, mirrored from public site, accessed via Python API
  (see [Mirroring documentation](docs/mirror.rst))
- Local read-write archive, maintained with command line utility and/or API (see
  [Command Line Interface documentation](docs/cli.rst)).
- Docker data-only container that may be linked to application container.
- SeqRepo and refget REST API for local or remote access (see
    [seqrepo-rest-service](https://github.com/biocommons/seqrepo-rest-service))

## Technical Quick Peek

Within a single snapshot, sequences are stored *non-redundantly* and
*compressed* in an add-only journalled filesystem structure. A truncated SHA-512
hash is used to assess uniqueness and as an internal id. (The digest is truncated
for space efficiency.)
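
As a rough sketch of what a truncated digest looks like, the snippet below hashes a sequence with SHA-512 and keeps only the leading bytes. The 24-byte truncation and the URL-safe base64 encoding are assumptions borrowed from common practice for sequence digests; the text above only states that the hash is a truncated SHA-512. The function name `truncated_sha512` is hypothetical.

```python
import base64
import hashlib


def truncated_sha512(sequence: str, nbytes: int = 24) -> str:
    """Return URL-safe base64 of the first nbytes of the SHA-512 digest.

    nbytes=24 is an assumed truncation length, chosen for illustration.
    """
    digest = hashlib.sha512(sequence.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:nbytes]).decode("ascii")


print(truncated_sha512("ACGT"))
```

Identical sequences always produce the same digest, which is what lets a digest serve both as a uniqueness check and as a storage key.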

Sequences are compressed using the Block GZipped Format
([BGZF](https://samtools.github.io/hts-specs/SAMv1.pdf)), which enables pysam
to provide fast random access to compressed sequences. (Variable compression
typically makes random access impossible.)
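
The idea behind block compression can be sketched with the standard library alone. The code below is a simplified illustration, not BGZF and not what pysam does internally: it compresses fixed-size chunks independently so that a slice can be served by decompressing only the chunks that overlap it. `BLOCK_SIZE`, `compress_blocks`, and `fetch` are hypothetical names.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative value)


def compress_blocks(data: bytes) -> list[bytes]:
    """Compress each fixed-size chunk of data independently."""
    return [
        zlib.compress(data[i : i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def fetch(blocks: list[bytes], start: int, end: int) -> bytes:
    """Return data[start:end], decompressing only the overlapping blocks."""
    if start >= end:
        return b""
    first = start // BLOCK_SIZE
    last = (end - 1) // BLOCK_SIZE
    chunk = b"".join(zlib.decompress(b) for b in blocks[first : last + 1])
    offset = first * BLOCK_SIZE
    return chunk[start - offset : end - offset]
```

Because every block covers a known range of uncompressed positions, a 20-residue slice from a chromosome-scale sequence touches one or two blocks rather than the whole file.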

Sequence files are immutable, thereby enabling the use of hardlinks across
snapshots and eliminating redundant transfers (e.g., with `rsync`).
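
The effect of hardlinking an immutable file into a new snapshot can be demonstrated in a temporary directory. This is a small sketch under assumed names: the `snapshot-old` and `snapshot-new` directories and the `seqs.fa.bgz` file are hypothetical and do not reflect SeqRepo's actual on-disk layout. It requires a filesystem that supports hardlinks.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, "snapshot-old")  # hypothetical snapshot names
    new = os.path.join(root, "snapshot-new")
    os.mkdir(old)
    os.mkdir(new)

    src = os.path.join(old, "seqs.fa.bgz")
    with open(src, "wb") as fh:
        fh.write(b"immutable sequence data")

    # Link the unchanged file into the new snapshot instead of copying it.
    dst = os.path.join(new, "seqs.fa.bgz")
    os.link(src, dst)

    same_file = os.path.samefile(src, dst)  # both paths name one file
    link_count = os.stat(src).st_nlink      # number of names for that file

print(same_file, link_count)
```

Both paths refer to the same data on disk, so the second snapshot adds a directory entry rather than a second copy of the file.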

Each sequence id is associated with namespaced aliases in a sqlite database,
such as `<seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>`, `<NCBI,NP_004009.1>`,
`<gi,5032303>`, `<ensembl-75,ENSP00000354464>`, `<ensembl-85,ENSP00000354464.4>`.
The sqlite database is mutable across releases.
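
A namespaced alias table can be pictured with an in-memory sqlite database. The schema below is a hypothetical simplification, not SeqRepo's actual schema: the table name `alias_demo`, its columns, and the placeholder id `seq-1` are invented for the example, and grouping these aliases under one sequence id is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: one row per (namespace, alias) pair.
conn.execute(
    "CREATE TABLE alias_demo ("
    " seq_id TEXT NOT NULL,"
    " namespace TEXT NOT NULL,"
    " alias TEXT NOT NULL,"
    " UNIQUE (namespace, alias))"
)
rows = [
    ("seq-1", "seguid", "rvvuhY0FxFLNwf10FXFIrSQ7AvQ"),
    ("seq-1", "NCBI", "NP_004009.1"),
    ("seq-1", "gi", "5032303"),
    ("seq-1", "ensembl-85", "ENSP00000354464.4"),
]
conn.executemany("INSERT INTO alias_demo VALUES (?, ?, ?)", rows)


def find_seq_id(namespace: str, alias: str):
    """Return the internal id for a namespaced alias, or None if unknown."""
    row = conn.execute(
        "SELECT seq_id FROM alias_demo WHERE namespace = ? AND alias = ?",
        (namespace, alias),
    ).fetchone()
    return row[0] if row else None
```

Keeping aliases in a separate, mutable table means new names can be attached to an existing sequence without touching the immutable sequence files.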

For calibration, recent releases that include 3 human genome assemblies
(including patches) and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consume
approximately 8 GB. The minimum marginal size for additional snapshots is
approximately 2 GB (for the sqlite database, which is not hardlinked).

For more information, see [docs/design.rst](docs/design.rst).

## Requirements

Reading a sequence repository requires several Python packages, all of which are
available from pypi. Installation should be as simple as `pip install
biocommons.seqrepo`.

*Writing* sequence files also requires `bgzip`, which is provided in the
[htslib](https://github.com/samtools/htslib) repo. Ubuntu users should install
the `tabix` package with `sudo apt install tabix`.

Development and deployments are on Ubuntu. Other systems may work but are not
tested. Patches to get other systems working would be welcomed.

## Quick Start

### OS X

    $ brew install python libpq

### Ubuntu

    $ sudo apt install -y python3-dev gcc zlib1g-dev tabix

### All platforms

    $ python -m venv venv
    $ source venv/bin/activate
    $ pip install biocommons.seqrepo
    $ sudo mkdir -p /usr/local/share/seqrepo
    $ sudo chown $USER /usr/local/share/seqrepo
    $ seqrepo pull -i 2018-11-26 
    $ seqrepo show-status -i 2018-11-26 
    seqrepo 0.2.3.post3.dev8+nb8298bd62283
    root directory: /usr/local/share/seqrepo/2018-11-26, 7.9 GB
    backends: fastadir (schema 1), seqaliasdb (schema 1) 
    sequences: 773587 sequences, 93051609959 residues, 192 files
    aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

    # Simple Pythonic interface to sequences
    >>> from biocommons.seqrepo import SeqRepo
    >>> sr = SeqRepo("/usr/local/share/seqrepo/latest")
    >>> sr["NC_000001.11"][780000:780020]
    'TGGTGGCACGCGCTTGTAGT'

    # Or, use the seqrepo shell for even easier access
    $ seqrepo start-shell -i 2018-11-26
    In [1]: sr["NC_000001.11"][780000:780020]
    Out[1]: 'TGGTGGCACGCGCTTGTAGT'

    # N.B. The following output is edited for simplicity
    $ seqrepo export -i 2018-11-26 | head -n100
    >SHA1:9a2acba3dd7603f... SEGUID:mirLo912A/MppLuS1cUyFMduLUQ Ensembl-85:GENSCAN00000003538 ...
    MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
    SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
    QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
    >SHA1:ca996b263102b1... SEGUID:yplrJjECsVqQufeYy0HkDD16z58 NCBI:XR_001733142.1 gi:1034683989
    TTTACGTCTTTCTGGGAATTTATACTGGAAGTATACTTACCTCTGTGCAAAATTGCAAATATATAAGGTAATTCATTCCAGCATTGCTTATATTAGGTTG
    AACTATGTAACATTGACATTGATGTGAATCAAAAATGGTTGAAGGCTGGCAGTTTCATATGATTCAGCCTATAATAGCAAAAGATTGAAAAAATCCATTA
    ATACAGTGTGGTTCAAAAAAATTTGTTGTATCAAGGTAAAATAATAGCCTGAATATAATTAAGATAGTCTGTGTATACATCGATGAAAACATTGCCAATA

See [Installation](docs/installation.rst) and
[Mirroring](docs/mirror.rst) for more information.

## Environment Variables

`SEQREPO_LRU_CACHE_MAXSIZE` sets the `lru_cache` maxsize for caching sqlite
query responses. It defaults to 1 million and can be set to "none" for an
unlimited cache.

`SEQREPO_FD_CACHE_MAXSIZE` sets the `lru_cache` maxsize for caching file handles
during FASTA sequence retrieval. It defaults to 0, which disables caching, and
can be set to a specific value or to "none" for an unlimited cache. A moderate
value (>10) greatly improves sequence retrieval performance.
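
The sketch below shows one way such a variable could be interpreted and passed to `functools.lru_cache`. It is illustrative only and does not reproduce SeqRepo's own parsing: `cache_maxsize` and `open_handle` are hypothetical names, and treating "none" case-insensitively is an assumption. In practice the variable would normally be set in the environment before the application starts.

```python
import os
from functools import lru_cache


def cache_maxsize(var_name: str, default: int):
    """Parse a cache-size variable: "none" means unlimited, otherwise an int."""
    raw = os.environ.get(var_name)
    if raw is None:
        return default
    return None if raw.strip().lower() == "none" else int(raw)


# Set here only for demonstration; normally exported in the shell.
os.environ["SEQREPO_FD_CACHE_MAXSIZE"] = "16"
maxsize = cache_maxsize("SEQREPO_FD_CACHE_MAXSIZE", default=0)


@lru_cache(maxsize=maxsize)
def open_handle(path: str) -> str:
    # Stand-in for an expensive file open; a real cache would hold file objects.
    return f"handle:{path}"
```

With `lru_cache`, a maxsize of 0 disables caching and a maxsize of `None` removes the size limit, which matches the two special cases described above.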

## Developing

### Developing on OS X

    brew install python libpq bash

If you get "xcrun: error: invalid active developer path", you need to install
Xcode. See this [Stack Exchange answer](https://apple.stackexchange.com/questions/254380/why-am-i-getting-an-invalid-active-developer-path-when-attempting-to-use-git-a).

### Developing on Ubuntu

    sudo apt install -y python3-dev gcc zlib1g-dev tabix

Here's how to get started developing:

    make devready
    source venv/bin/activate
    seqrepo --version

## Building a docker image

Docker images are available at https://hub.docker.com/r/biocommons/seqrepo.
Tags correspond to the version of data, not the version of seqrepo, because the
intent is to make it easy to depend on a local version of seqrepo *files*.  Each
docker image is an installation of seqrepo that downloads the corresponding
version of seqrepo data.  When used in conjunction with docker volumes for
persistence, this provides an easy way to incorporate seqrepo data into a docker
stack.

### Building

    cd misc/docker
    make 2021-01-29.log  # builds and pushes to hub.docker.com (i.e., you need creds)

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "biocommons.seqrepo",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "",
    "author": "",
    "author_email": "biocommons contributors <biocommons-dev@googlegroups.com>",
    "download_url": "https://files.pythonhosted.org/packages/a2/f9/62ef74b0aa3bed299e2c02dcc8abb386be924ea48dc735a58bafb4789b5b/biocommons.seqrepo-0.6.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Non-redundant, compressed, journalled, file-based storage for biological sequences",
    "version": "0.6.9",
    "project_urls": {
        "Bug Tracker": "https://github.com/biocommons/biocommons.seqrepo/issues",
        "Homepage": "https://github.com/biocommons/biocommons.seqrepo"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1dfd9bc4f46f7db19e1c1a94f27080ca113e9243fea98009b79b0fda9e842a19",
                "md5": "2bf42ca1c989c8ed291906717c7211f1",
                "sha256": "f16c9131fc08aa06ed50c385e3a0ffefc092ea39393571a0b44d2f2349d5f495"
            },
            "downloads": -1,
            "filename": "biocommons.seqrepo-0.6.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2bf42ca1c989c8ed291906717c7211f1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 39537,
            "upload_time": "2024-02-20T03:30:32",
            "upload_time_iso_8601": "2024-02-20T03:30:32.286256Z",
            "url": "https://files.pythonhosted.org/packages/1d/fd/9bc4f46f7db19e1c1a94f27080ca113e9243fea98009b79b0fda9e842a19/biocommons.seqrepo-0.6.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a2f962ef74b0aa3bed299e2c02dcc8abb386be924ea48dc735a58bafb4789b5b",
                "md5": "2b7a51d98483a92fe037cc59f2f51f09",
                "sha256": "b08d616c6ab5c4bc8e1ef5e09e94c5c274c6da41f75a435323eddc716c9b5575"
            },
            "downloads": -1,
            "filename": "biocommons.seqrepo-0.6.9.tar.gz",
            "has_sig": false,
            "md5_digest": "2b7a51d98483a92fe037cc59f2f51f09",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 103113,
            "upload_time": "2024-02-20T03:30:34",
            "upload_time_iso_8601": "2024-02-20T03:30:34.203188Z",
            "url": "https://files.pythonhosted.org/packages/a2/f9/62ef74b0aa3bed299e2c02dcc8abb386be924ea48dc735a58bafb4789b5b/biocommons.seqrepo-0.6.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-20 03:30:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "biocommons",
    "github_project": "biocommons.seqrepo",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "biocommons.seqrepo"
}
        