InterLap: simple, fast interval overlap testing
-----------------------------------------------
[![Build Status](https://travis-ci.org/brentp/interlap.svg?branch=master)](https://travis-ci.org/brentp/interlap)
InterLap is >20 times faster than doing a naive search (see: https://brentp.github.io/interlap/benchmark.html)
with **no memory overhead** because it operates on a sorted list. It is pure python and has no
dependencies.
It uses binary search and a knowledge of the longest observed interval to quickly query datasets
with 100's of thousands of intervals.
See the documentation at [https://brentp.github.io/interlap/](https://brentp.github.io/interlap)
Usage
-----
Interlap takes tuples or lists where the first 2 elements are start, end and the remaining
elements can be anything.
```Python
>>> from interlap import InterLap
>>> inter = InterLap()
#Here create 10K random intervals:
>>> import random
>>> random.seed(42)
>>> sites = random.sample(range(22, 100000000, 12), 50000)
>>> ranges = [(i, i + random.randint(2000, 20000)) for i in sites]
>>> inter.update(ranges)
>>> inter.add((20, 22, {'info': 'hi'}))
>>> [20, 21] in inter
True
>>> next(inter.find((21, 21)))
(20, 22, {'info': 'hi'})
>>> inter.add((2, 3, {'info': 'hello'}))
*NOTE*: below shows how edge-cases are handled.
>>> list(inter.find((2, 2)))
[(2, 3, {'info': 'hello'})]
>>> list(inter.find((3, 3)))
[(2, 3, {'info': 'hello'})]
Test every item in the InterLap. These 50K queries take < 0.5 seconds:
>>> for s, e in ranges:
... assert (s, e) in inter
InterLap objects are iterable:
>>> for seo in inter:
... assert (seo[0], seo[1]) in inter
```
Installation
------------
```Shell
pip install interlap
```
Example
-------
In general, we will want one interlap per chromosome for genomic data.
The snippet below shows a simple way to do that in the process of creating
and then querying some intervals.
```Python
from interlap import InterLap
from collections import defaultdict
import sys
# use defaultdict to key by chromosome.
inter = defaultdict(InterLap)
for toks in (x.rstrip().split("\t") for x in open(sys.argv[1])):
start, end = map(int, toks[1:3])
inter[toks[0]].add((start, end, toks))
# now find overlaps in another file:
for toks in (x.rstrip().split("\t") for x in open(sys.argv[2])):
start, end = map(int, toks[1:3])
print list(inter[toks[0]].find((start, end)))
```
Why
---
I am aware of bx-python's interval tree (I wrote the cython version)
but for some projects it is nice to have a simple dependency (or no
dependency since this can be included as a single file or 30 lines
of code) when we just need something a bit better than naive overlap
testing.
In my testing, the method implemented here, using a sorted list and keeping
track of the longest observed interval is the fastest *pure python* method
*as long as the longest observed interval is does not cover a substantial
fraction of intervals in the set*.
IntervalSet Operations
----------------------
As of version 0.2.0 Interlap also includes an `Interval` class that behaves
like what is normally called an interval set.
```python
# note how it merges overlapping sub-intervals.
>>> Interval([(1, 95), (95, 100)]).add(Interval([(90, 100)]))
Interval([(1, 100)])
```
See the doctests under the Interval class for more details
Raw data
{
"_id": null,
"home_page": "http://brentp.github.io/interlap",
"name": "interlap",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Brent S Pedersen",
"author_email": "bpederse@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/72/84/9f71369fc868dc963ddf51d1bfd8853a9793a37a21c9081a433f6e81d56a/interlap-0.2.7.tar.gz",
"platform": "",
"description": "InterLap: simple, fast interval overlap testing\n-----------------------------------------------\n[![Build Status](https://travis-ci.org/brentp/interlap.svg?branch=master)](https://travis-ci.org/brentp/interlap)\n\nInterLap is >20 times faster than doing a naive search (see: https://brentp.github.io/interlap/benchmark.html)\nwith **no memory overhead** because it operates on a sorted list. It is pure python and has no\ndependencies.\n\nIt uses binary search and a knowledge of the longest observed interval to quickly query datasets\nwith 100's of thousands of intervals.\n\nSee the documentation at [https://brentp.github.io/interlap/](https://brentp.github.io/interlap)\n\nUsage\n-----\n\nInterlap takes tuples or lists where the first 2 elements are start, end and the remaining\nelements can be anything.\n\n\n```Python\n>>> from interlap import InterLap\n>>> inter = InterLap()\n\n#Here create 10K random intervals:\n\n>>> import random\n>>> random.seed(42)\n>>> sites = random.sample(range(22, 100000000, 12), 50000)\n>>> ranges = [(i, i + random.randint(2000, 20000)) for i in sites]\n\n>>> inter.update(ranges)\n>>> inter.add((20, 22, {'info': 'hi'}))\n\n>>> [20, 21] in inter\nTrue\n\n>>> next(inter.find((21, 21)))\n(20, 22, {'info': 'hi'})\n\n>>> inter.add((2, 3, {'info': 'hello'}))\n\n*NOTE*: below shows how edge-cases are handled.\n>>> list(inter.find((2, 2)))\n[(2, 3, {'info': 'hello'})]\n>>> list(inter.find((3, 3)))\n[(2, 3, {'info': 'hello'})]\n\nTest every item in the InterLap. These 50K queries take < 0.5 seconds:\n\n>>> for s, e in ranges:\n... assert (s, e) in inter\n\nInterLap objects are iterable:\n\n>>> for seo in inter:\n... assert (seo[0], seo[1]) in inter\n\n```\n\nInstallation\n------------\n\n```Shell\n\npip install interlap\n\n```\n\nExample\n-------\n\nIn general, we will want one interlap per chromosome for genomic data.\nThe snippet below shows a simple way to do that in the process of creating\nand then querying some intervals.\n\n```Python\n\nfrom interlap import InterLap\nfrom collections import defaultdict\nimport sys\n\n# use defaultdict to key by chromosome.\ninter = defaultdict(InterLap)\n\nfor toks in (x.rstrip().split(\"\\t\") for x in open(sys.argv[1])):\n start, end = map(int, toks[1:3])\n inter[toks[0]].add((start, end, toks))\n\n# now find overlaps in another file:\n\nfor toks in (x.rstrip().split(\"\\t\") for x in open(sys.argv[2])):\n start, end = map(int, toks[1:3])\n\n print list(inter[toks[0]].find((start, end)))\n\n```\n\nWhy\n---\n\nI am aware of bx-python's interval tree (I wrote the cython version)\nbut for some projects it is nice to have a simple dependency (or no\ndependency since this can be included as a single file or 30 lines\nof code) when we just need something a bit better than naive overlap\ntesting.\n\nIn my testing, the method implemented here, using a sorted list and keeping\ntrack of the longest observed interval is the fastest *pure python* method\n*as long as the longest observed interval is does not cover a substantial \nfraction of intervals in the set*.\n\n\nIntervalSet Operations\n----------------------\n\nAs of version 0.2.0 Interlap also includes an `Interval` class that behaves\nlike what is normally called an interval set.\n\n```python\n\n# note how it merges overlapping sub-intervals.\n>>> Interval([(1, 95), (95, 100)]).add(Interval([(90, 100)]))\nInterval([(1, 100)])\n\n```\n\nSee the doctests under the Interval class for more details",
"bugtrack_url": null,
"license": "MIT",
"summary": "interlap: fast, simple interval overlap testing",
"version": "0.2.7",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "8a7147649d3393edfb3ab2e3f1887b6a",
"sha256": "31e4f30c54b067c4939049f5d8131ae5e2fa682ec71aa56f89c0e5b900806ec9"
},
"downloads": -1,
"filename": "interlap-0.2.7.tar.gz",
"has_sig": false,
"md5_digest": "8a7147649d3393edfb3ab2e3f1887b6a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6066,
"upload_time": "2020-10-02T17:54:06",
"upload_time_iso_8601": "2020-10-02T17:54:06.464857Z",
"url": "https://files.pythonhosted.org/packages/72/84/9f71369fc868dc963ddf51d1bfd8853a9793a37a21c9081a433f6e81d56a/interlap-0.2.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2020-10-02 17:54:06",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "interlap"
}