interlap

Name	interlap JSON
Version	0.2.7 JSON
	download
home_page	http://brentp.github.io/interlap
Summary	interlap: fast, simple interval overlap testing
upload_time	2020-10-02 17:54:06
maintainer
docs_url	None
author	Brent S Pedersen
requires_python
license	MIT
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            InterLap: simple, fast interval overlap testing
-----------------------------------------------
[![Build Status](https://travis-ci.org/brentp/interlap.svg?branch=master)](https://travis-ci.org/brentp/interlap)

InterLap is >20 times faster than doing a naive search (see: https://brentp.github.io/interlap/benchmark.html)
with **no memory overhead** because it operates on a sorted list. It is pure python and has no
dependencies.

It uses binary search and a knowledge of the longest observed interval to quickly query datasets
with 100's of thousands of intervals.

See the documentation at [https://brentp.github.io/interlap/](https://brentp.github.io/interlap)

Usage
-----

Interlap takes tuples or lists where the first 2 elements are start, end and the remaining
elements can be anything.


```Python
>>> from interlap import InterLap
>>> inter = InterLap()

#Here create 10K random intervals:

>>> import random
>>> random.seed(42)
>>> sites = random.sample(range(22, 100000000, 12), 50000)
>>> ranges = [(i, i + random.randint(2000, 20000)) for i in sites]

>>> inter.update(ranges)
>>> inter.add((20, 22, {'info': 'hi'}))

>>> [20, 21] in inter
True

>>> next(inter.find((21, 21)))
(20, 22, {'info': 'hi'})

>>> inter.add((2, 3, {'info': 'hello'}))

*NOTE*: below shows how edge-cases are handled.
>>> list(inter.find((2, 2)))
[(2, 3, {'info': 'hello'})]
>>> list(inter.find((3, 3)))
[(2, 3, {'info': 'hello'})]

Test every item in the InterLap. These 50K queries take < 0.5 seconds:

>>> for s, e in ranges:
...     assert (s, e) in inter

InterLap objects are iterable:

>>> for seo in inter:
...     assert (seo[0], seo[1]) in inter

```

Installation
------------

```Shell

pip install interlap

```

Example
-------

In general, we will want one interlap per chromosome for genomic data.
The snippet below shows a simple way to do that in the process of creating
and then querying some intervals.

```Python

from interlap import InterLap
from collections import defaultdict
import sys

# use defaultdict to key by chromosome.
inter = defaultdict(InterLap)

for toks in (x.rstrip().split("\t") for x in open(sys.argv[1])):
    start, end = map(int, toks[1:3])
    inter[toks[0]].add((start, end, toks))

# now find overlaps in another file:

for toks in (x.rstrip().split("\t") for x in open(sys.argv[2])):
    start, end = map(int, toks[1:3])

    print list(inter[toks[0]].find((start, end)))

```

Why
---

I am aware of bx-python's interval tree (I wrote the cython version)
but for some projects it is nice to have a simple dependency (or no
dependency since this can be included as a single file or 30 lines
of code) when we just need something a bit better than naive overlap
testing.

In my testing, the method implemented here, using a sorted list and keeping
track of the longest observed interval is the fastest *pure python* method
*as long as the longest observed interval is does not cover a substantial 
fraction of intervals in the set*.


IntervalSet Operations
----------------------

As of version 0.2.0 Interlap also includes an `Interval` class that behaves
like what is normally called an interval set.

```python

# note how it merges overlapping sub-intervals.
>>> Interval([(1, 95), (95, 100)]).add(Interval([(90, 100)]))
Interval([(1, 100)])

```

See the doctests under the Interval class for more details

Raw data

            {
    "_id": null,
    "home_page": "http://brentp.github.io/interlap",
    "name": "interlap",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Brent S Pedersen",
    "author_email": "bpederse@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/72/84/9f71369fc868dc963ddf51d1bfd8853a9793a37a21c9081a433f6e81d56a/interlap-0.2.7.tar.gz",
    "platform": "",
    "description": "InterLap: simple, fast interval overlap testing\n-----------------------------------------------\n[![Build Status](https://travis-ci.org/brentp/interlap.svg?branch=master)](https://travis-ci.org/brentp/interlap)\n\nInterLap is >20 times faster than doing a naive search (see: https://brentp.github.io/interlap/benchmark.html)\nwith **no memory overhead** because it operates on a sorted list. It is pure python and has no\ndependencies.\n\nIt uses binary search and a knowledge of the longest observed interval to quickly query datasets\nwith 100's of thousands of intervals.\n\nSee the documentation at [https://brentp.github.io/interlap/](https://brentp.github.io/interlap)\n\nUsage\n-----\n\nInterlap takes tuples or lists where the first 2 elements are start, end and the remaining\nelements can be anything.\n\n\n```Python\n>>> from interlap import InterLap\n>>> inter = InterLap()\n\n#Here create 10K random intervals:\n\n>>> import random\n>>> random.seed(42)\n>>> sites = random.sample(range(22, 100000000, 12), 50000)\n>>> ranges = [(i, i + random.randint(2000, 20000)) for i in sites]\n\n>>> inter.update(ranges)\n>>> inter.add((20, 22, {'info': 'hi'}))\n\n>>> [20, 21] in inter\nTrue\n\n>>> next(inter.find((21, 21)))\n(20, 22, {'info': 'hi'})\n\n>>> inter.add((2, 3, {'info': 'hello'}))\n\n*NOTE*: below shows how edge-cases are handled.\n>>> list(inter.find((2, 2)))\n[(2, 3, {'info': 'hello'})]\n>>> list(inter.find((3, 3)))\n[(2, 3, {'info': 'hello'})]\n\nTest every item in the InterLap. These 50K queries take < 0.5 seconds:\n\n>>> for s, e in ranges:\n...     assert (s, e) in inter\n\nInterLap objects are iterable:\n\n>>> for seo in inter:\n...     assert (seo[0], seo[1]) in inter\n\n```\n\nInstallation\n------------\n\n```Shell\n\npip install interlap\n\n```\n\nExample\n-------\n\nIn general, we will want one interlap per chromosome for genomic data.\nThe snippet below shows a simple way to do that in the process of creating\nand then querying some intervals.\n\n```Python\n\nfrom interlap import InterLap\nfrom collections import defaultdict\nimport sys\n\n# use defaultdict to key by chromosome.\ninter = defaultdict(InterLap)\n\nfor toks in (x.rstrip().split(\"\\t\") for x in open(sys.argv[1])):\n    start, end = map(int, toks[1:3])\n    inter[toks[0]].add((start, end, toks))\n\n# now find overlaps in another file:\n\nfor toks in (x.rstrip().split(\"\\t\") for x in open(sys.argv[2])):\n    start, end = map(int, toks[1:3])\n\n    print list(inter[toks[0]].find((start, end)))\n\n```\n\nWhy\n---\n\nI am aware of bx-python's interval tree (I wrote the cython version)\nbut for some projects it is nice to have a simple dependency (or no\ndependency since this can be included as a single file or 30 lines\nof code) when we just need something a bit better than naive overlap\ntesting.\n\nIn my testing, the method implemented here, using a sorted list and keeping\ntrack of the longest observed interval is the fastest *pure python* method\n*as long as the longest observed interval is does not cover a substantial \nfraction of intervals in the set*.\n\n\nIntervalSet Operations\n----------------------\n\nAs of version 0.2.0 Interlap also includes an `Interval` class that behaves\nlike what is normally called an interval set.\n\n```python\n\n# note how it merges overlapping sub-intervals.\n>>> Interval([(1, 95), (95, 100)]).add(Interval([(90, 100)]))\nInterval([(1, 100)])\n\n```\n\nSee the doctests under the Interval class for more details",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "interlap: fast, simple interval overlap testing",
    "version": "0.2.7",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "8a7147649d3393edfb3ab2e3f1887b6a",
                "sha256": "31e4f30c54b067c4939049f5d8131ae5e2fa682ec71aa56f89c0e5b900806ec9"
            },
            "downloads": -1,
            "filename": "interlap-0.2.7.tar.gz",
            "has_sig": false,
            "md5_digest": "8a7147649d3393edfb3ab2e3f1887b6a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 6066,
            "upload_time": "2020-10-02T17:54:06",
            "upload_time_iso_8601": "2020-10-02T17:54:06.464857Z",
            "url": "https://files.pythonhosted.org/packages/72/84/9f71369fc868dc963ddf51d1bfd8853a9793a37a21c9081a433f6e81d56a/interlap-0.2.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2020-10-02 17:54:06",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "interlap"
}

Brent S Pedersen