Roaring Bitmap in Cython
========================
A roaring bitmap is an efficient compressed datastructure to store a set
of integers. A Roaring bitmap stores a set of 32-bit integers in a series of
arrays and bitmaps, whichever takes the least space (which is always
``2 ** 16`` bits or less).
This datastructure is useful for storing a large number of integers, e.g., for
an inverted index used by search engines and databases. In particular, it is
possible to quickly compute the intersection of a series of sets, which can be
used to implement a query as the conjunction of subqueries.
This implementation is based on the Java and C implementations at
https://github.com/lemire/RoaringBitmap
and https://github.com/lemire/CRoaring
Additional features of this implementation:
- Inverted list representation: blocks that are mostly full are stored
compactly as an array of non-members (instead of as an array of members or a
fixed-size bitmap).
- Collections of immutable roaring bitmaps can be efficiently serialized with
``mmap`` in a single file.
Missing features w.r.t. CRoaring:
- Run-length encoded blocks
- Various AVX2 / SSE optimizations
See also PyRoaringBitmap, a Python wrapper of CRoaring:
https://github.com/Ezibenroc/PyRoaringBitMap
License, requirements
---------------------
The code is licensed under GNU GPL v2, or any later version at your option.
- Python 2.7+/3.3+ http://www.python.org (headers required, e.g. python-dev package)
- Cython 0.20+ http://www.cython.org
Installation, usage
-------------------
::
$ git clone https://github.com/andreasvc/roaringbitmap.git
$ cd roaringbitmap
$ make
(or ``make py2`` for Python 2)
A ``RoaringBitmap()`` can be used as a replacement for a normal (mutable)
Python set containing (unsigned) 32-bit integers:
.. code-block:: python
>>> from roaringbitmap import RoaringBitmap
>>> RoaringBitmap(range(10)) & RoaringBitmap(range(5, 15))
RoaringBitmap({5, 6, 7, 8, 9})
``ImmutableRoaringBitmap`` is an immutable variant (analogous to ``frozenset``)
which is stored compactly as a contiguous block of memory.
A sequence of immutable RoaringBitmaps can be stored in a single file and
accessed efficiently with ``mmap``, without needing to copy or deserialize:
.. code-block:: python
>>> from roaringbitmap import MultiRoaringBitmap
>>> mrb = MultiRoaringBitmap([range(n, n + 5) for n in range(10)], filename='index')
>>> mrb = MultiRoaringBitmap.fromfile('index')
>>> mrb[5]
ImmutableRoaringBitmap({5, 6, 7, 8, 9})
For API documentation cf. http://roaringbitmap.readthedocs.io
Benchmarks
----------
Output of ``$ make bench``::
small sparse set
100 runs with sets of 200 random elements n s.t. 0 <= n < 40000
set() RoaringBitmap() ratio
init 0.000834 0.00138 0.603
initsort 0.00085 0.000394 2.16
and 0.00102 8.49e-05 12.1
or 0.00171 0.000169 10.1
xor 0.00152 0.000213 7.11
sub 0.000934 0.000197 4.74
iand 1.29e-05 2.97e-06 4.35
ior 9.7e-06 3.26e-06 2.98
ixor 8.98e-06 3.43e-06 2.62
isub 6.83e-06 3.3e-06 2.07
eq 0.000438 1.17e-05 37.6
neq 6.37e-06 7.81e-06 0.816
jaccard 0.0029 0.000126 23.1
medium load factor
100 runs with sets of 59392 random elements n s.t. 0 <= n < 118784
set() RoaringBitmap() ratio
init 0.564 0.324 1.74
initsort 0.696 0.273 2.55
and 0.613 0.000418 1466
or 0.976 0.000292 3344
xor 0.955 0.000294 3250
sub 0.346 0.000316 1092
iand 0.00658 1.14e-05 575
ior 0.00594 1.08e-05 548
ixor 0.00434 1.12e-05 385
isub 0.00431 1.09e-05 397
eq 0.0991 0.000116 851
neq 9.62e-06 1.29e-05 0.743
jaccard 1.62 0.00025 6476
dense set / high load factor
100 runs with sets of 39800 random elements n s.t. 0 <= n < 40000
set() RoaringBitmap() ratio
init 0.33 0.0775 4.26
initsort 0.352 0.148 2.38
and 0.24 0.000223 1078
or 0.45 0.000165 2734
xor 0.404 0.000161 2514
sub 0.169 0.000173 973
iand 0.00287 6.02e-06 477
ior 0.00179 6.34e-06 282
ixor 0.00195 5.53e-06 353
isub 0.0017 6.35e-06 267
eq 0.0486 4.65e-05 1045
neq 1.01e-05 1.13e-05 0.888
jaccard 0.722 0.000118 6136
See https://github.com/Ezibenroc/roaring_analysis/ for a performance comparison
of PyRoaringBitmap and this library.
References
----------
- http://roaringbitmap.org/
- Chambi, S., Lemire, D., Kaser, O., & Godin, R. (2016). Better bitmap
performance with Roaring bitmaps. Software: practice and experience, 46(5),
pp. 709-719. http://arxiv.org/abs/1402.6407
- The idea of using the inverted list representation is based on
https://issues.apache.org/jira/browse/LUCENE-5983
Raw data
{
"_id": null,
"home_page": "http://roaringbitmap.readthedocs.io",
"name": "roaringbitmap",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Andreas van Cranenburgh",
"author_email": "A.W.van.Cranenburgh@rug.nl",
"download_url": "https://files.pythonhosted.org/packages/ca/52/335b9551b15c535ef78a0833e6b4e2279ef176694e3ea1de824b67e9bf2d/roaringbitmap-0.7.2.tar.gz",
"platform": "Many",
"description": "Roaring Bitmap in Cython\n========================\n\nA roaring bitmap is an efficient compressed datastructure to store a set\nof integers. A Roaring bitmap stores a set of 32-bit integers in a series of\narrays and bitmaps, whichever takes the least space (which is always\n``2 ** 16`` bits or less).\n\nThis datastructure is useful for storing a large number of integers, e.g., for\nan inverted index used by search engines and databases. In particular, it is\npossible to quickly compute the intersection of a series of sets, which can be\nused to implement a query as the conjunction of subqueries.\n\nThis implementation is based on the Java and C implementations at\nhttps://github.com/lemire/RoaringBitmap\nand https://github.com/lemire/CRoaring\n\nAdditional features of this implementation:\n\n- Inverted list representation: blocks that are mostly full are stored\n compactly as an array of non-members (instead of as an array of members or a\n fixed-size bitmap).\n- Collections of immutable roaring bitmaps can be efficiently serialized with\n ``mmap`` in a single file.\n\nMissing features w.r.t. CRoaring:\n\n- Run-length encoded blocks\n- Various AVX2 / SSE optimizations\n\nSee also PyRoaringBitmap, a Python wrapper of CRoaring:\nhttps://github.com/Ezibenroc/PyRoaringBitMap\n\nLicense, requirements\n---------------------\nThe code is licensed under GNU GPL v2, or any later version at your option.\n\n- Python 2.7+/3.3+ http://www.python.org (headers required, e.g. python-dev package)\n- Cython 0.20+ http://www.cython.org\n\nInstallation, usage\n-------------------\n\n::\n\n $ git clone https://github.com/andreasvc/roaringbitmap.git\n $ cd roaringbitmap\n $ make\n\n(or ``make py2`` for Python 2)\n\nA ``RoaringBitmap()`` can be used as a replacement for a normal (mutable)\nPython set containing (unsigned) 32-bit integers:\n\n.. code-block:: python\n\n >>> from roaringbitmap import RoaringBitmap\n >>> RoaringBitmap(range(10)) & RoaringBitmap(range(5, 15))\n RoaringBitmap({5, 6, 7, 8, 9})\n\n``ImmutableRoaringBitmap`` is an immutable variant (analogous to ``frozenset``)\nwhich is stored compactly as a contiguous block of memory.\n\nA sequence of immutable RoaringBitmaps can be stored in a single file and\naccessed efficiently with ``mmap``, without needing to copy or deserialize:\n\n.. code-block:: python\n\n >>> from roaringbitmap import MultiRoaringBitmap\n >>> mrb = MultiRoaringBitmap([range(n, n + 5) for n in range(10)], filename='index')\n\n >>> mrb = MultiRoaringBitmap.fromfile('index')\n >>> mrb[5]\n ImmutableRoaringBitmap({5, 6, 7, 8, 9})\n\nFor API documentation cf. http://roaringbitmap.readthedocs.io\n\nBenchmarks\n----------\nOutput of ``$ make bench``::\n\n small sparse set\n 100 runs with sets of 200 random elements n s.t. 0 <= n < 40000\n set() RoaringBitmap() ratio\n init 0.000834 0.00138 0.603\n initsort 0.00085 0.000394 2.16\n and 0.00102 8.49e-05 12.1\n or 0.00171 0.000169 10.1\n xor 0.00152 0.000213 7.11\n sub 0.000934 0.000197 4.74\n iand 1.29e-05 2.97e-06 4.35\n ior 9.7e-06 3.26e-06 2.98\n ixor 8.98e-06 3.43e-06 2.62\n isub 6.83e-06 3.3e-06 2.07\n eq 0.000438 1.17e-05 37.6\n neq 6.37e-06 7.81e-06 0.816\n jaccard 0.0029 0.000126 23.1\n\n medium load factor\n 100 runs with sets of 59392 random elements n s.t. 0 <= n < 118784\n set() RoaringBitmap() ratio\n init 0.564 0.324 1.74\n initsort 0.696 0.273 2.55\n and 0.613 0.000418 1466\n or 0.976 0.000292 3344\n xor 0.955 0.000294 3250\n sub 0.346 0.000316 1092\n iand 0.00658 1.14e-05 575\n ior 0.00594 1.08e-05 548\n ixor 0.00434 1.12e-05 385\n isub 0.00431 1.09e-05 397\n eq 0.0991 0.000116 851\n neq 9.62e-06 1.29e-05 0.743\n jaccard 1.62 0.00025 6476\n\n dense set / high load factor\n 100 runs with sets of 39800 random elements n s.t. 0 <= n < 40000\n set() RoaringBitmap() ratio\n init 0.33 0.0775 4.26\n initsort 0.352 0.148 2.38\n and 0.24 0.000223 1078\n or 0.45 0.000165 2734\n xor 0.404 0.000161 2514\n sub 0.169 0.000173 973\n iand 0.00287 6.02e-06 477\n ior 0.00179 6.34e-06 282\n ixor 0.00195 5.53e-06 353\n isub 0.0017 6.35e-06 267\n eq 0.0486 4.65e-05 1045\n neq 1.01e-05 1.13e-05 0.888\n jaccard 0.722 0.000118 6136\n\nSee https://github.com/Ezibenroc/roaring_analysis/ for a performance comparison\nof PyRoaringBitmap and this library.\n\nReferences\n----------\n- http://roaringbitmap.org/\n- Chambi, S., Lemire, D., Kaser, O., & Godin, R. (2016). Better bitmap\n performance with Roaring bitmaps. Software: practice and experience, 46(5),\n pp. 709-719. http://arxiv.org/abs/1402.6407\n- The idea of using the inverted list representation is based on\n https://issues.apache.org/jira/browse/LUCENE-5983\n",
"bugtrack_url": null,
"license": "GPL",
"summary": "Roaring Bitmap",
"version": "0.7.2",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "2253891280e88d438f23b613daf87d18",
"sha256": "bfe049fd9f200a011081ea5e4b2746d0c35256c755cd78886e375a0ef473fa0d"
},
"downloads": -1,
"filename": "roaringbitmap-0.7.2.tar.gz",
"has_sig": false,
"md5_digest": "2253891280e88d438f23b613daf87d18",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 233062,
"upload_time": "2022-12-29T15:02:16",
"upload_time_iso_8601": "2022-12-29T15:02:16.683173Z",
"url": "https://files.pythonhosted.org/packages/ca/52/335b9551b15c535ef78a0833e6b4e2279ef176694e3ea1de824b67e9bf2d/roaringbitmap-0.7.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-29 15:02:16",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "roaringbitmap"
}