bbhash

* Name: bbhash
* Version: 0.5.4
* Home page: http://github.com/dib-lab/pybbhash
* Summary: A Python wrapper for the BBHash Minimal Perfect Hash Function
* Upload time: 2020-10-25 23:22:07
* Author: C. Titus Brown
* License: BSD 3-clause
* Requirements: No requirements were recorded.
* Travis-CI: No Travis.
* Coveralls test coverage: No coveralls.

# pybbhash

<a href="https://pypi.org/project/bbhash/"><img alt="PyPI" src="https://badge.fury.io/py/bbhash.svg"></a>
<a href="https://github.com/dib-lab/pybbhash/blob/latest/LICENSE.txt"><img alt="License: 3-Clause BSD" src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg"></a>

This is a Python (Cython) wrapper for the
[BBHash codebase](https://github.com/rizkg/BBHash) for building
[minimal perfect hash functions](https://en.wikipedia.org/wiki/Perfect_hash_function#Minimal_perfect_hash_function).

Right now, this is supporting k-mer-based hashing needs from
[spacegraphcats](https://github.com/spacegraphcats/spacegraphcats),
using hash values generated (mostly) by murmurhash, e.g. from
[khmer's Nodetable](https://github.com/dib-lab/khmer/) and
[sourmash](https://github.com/dib-lab/sourmash/) hashing.  As such, I
am focused on building MPHFs for 64-bit hashes and am wrapping only
that bit of the interface; the rest should be ~straightforward (hah!).

I've also added a Python-accessible "values table", `BBHashTable`, in
the `bbhash_table` module. It works like a dictionary: you associate a
hash with a value, and then query the table with the hash to retrieve
the value. The only tricky bit here is that, unlike the bbhash module,
this table supports queries with hashes that are *not* in the MPHF;
such queries return `None`, as the `BBHashTable` usage example shows.
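
One way a values table can answer queries for hashes outside the MPHF is to keep the original hash next to each stored value and compare on lookup, since an MPHF on its own may send an unknown key to some valid slot. The sketch below is pure Python and is not the `BBHashTable` implementation; `ToyValueTable` and its `_mphf` method are hypothetical names, and the dictionary-plus-modulo function only stands in for a real MPHF.

```python
class ToyValueTable:
    """Values table layered over a stand-in minimal perfect hash (illustrative only)."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("need at least one key")
        # Stand-in for an MPHF: each known key gets a unique slot in [0, n).
        self._slot = {k: i for i, k in enumerate(keys)}
        self._n = len(keys)
        self._keys = keys                  # slot -> original key
        self._values = [None] * self._n    # slot -> stored value

    def _mphf(self, key):
        # Assumption: an unknown key may land on an arbitrary valid slot,
        # which is mimicked here with a modulo.
        return self._slot.get(key, key % self._n)

    def __setitem__(self, key, value):
        slot = self._mphf(key)
        if self._keys[slot] != key:
            raise KeyError(key)            # key was not part of the initial set
        self._values[slot] = value

    def __getitem__(self, key):
        slot = self._mphf(key)
        if self._keys[slot] != key:
            return None                    # stored key differs: not in the MPHF
        return self._values[slot]


table = ToyValueTable([10, 20, 50, 80])
table[20] = "node-7"
print(table[20])  # node-7
print(table[21])  # None
```

The extra array of keys costs memory, which is the price of being able to reject hashes the MPHF was never built on.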

## Thoughts for further improvement

* I would like to be able to use generic Python iterators in the PyMPHF
  construction. Right now there is a round of memory-inefficient copying of
  hashes, which is bad when you have a lot of k-mers!
  
* I would like to be able to save to/load from strings, not just files.

I also need to investigate thread safety.
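
Until saving to strings is supported, one possible workaround is to round-trip through a temporary file. This is a sketch, not part of pybbhash: `mphf_to_bytes` and `mphf_from_bytes` are hypothetical helpers that assume only an object with a `save(filename)` method and a loader function that takes a filename, as in the usage section.

```python
import os
import tempfile


def mphf_to_bytes(mph):
    """Serialize via mph.save(filename) and return the file's contents as bytes."""
    fd, path = tempfile.mkstemp(suffix=".mphf")
    os.close(fd)  # only the path is needed; save() opens the file itself
    try:
        mph.save(path)
        with open(path, "rb") as fp:
            return fp.read()
    finally:
        os.remove(path)


def mphf_from_bytes(data, loader):
    """Write data to a temporary file and pass its name to loader."""
    fd, path = tempfile.mkstemp(suffix=".mphf")
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
        return loader(path)
    finally:
        os.remove(path)
```

With pybbhash this would presumably be called as `blob = mphf_to_bytes(mph)` and `mph2 = mphf_from_bytes(blob, bbhash.load_mphf)`. It still touches the disk, so it only hides the file handling rather than removing it.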

## Usage

### Usage of core bbhash functionality

```
import bbhash

# some collection of 64-bit (or smaller) hashes
uint_hashes = [10, 20, 50, 80]

num_threads = 1 # hopefully self-explanatory :)
gamma = 1.0     # internal gamma parameter for BBHash

mph = bbhash.PyMPHF(uint_hashes, len(uint_hashes), num_threads, gamma)

for val in uint_hashes:
    print('{} now hashes to {}'.format(val, mph.lookup(val)))

# can also use 'mph.save(filename)' and 'mph = bbhash.load_mphf(filename)'.
```
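
A quick sanity check for the result is that the lookups of the `n` input hashes cover `0..n-1` exactly once, which is what "minimal perfect" means. The helper below is a small sketch; `is_minimal_perfect` is a hypothetical name, and the two stand-in lookup functions are there only so the snippet runs without bbhash installed.

```python
def is_minimal_perfect(lookup, keys):
    """Return True if lookup maps the distinct keys one-to-one onto 0..n-1."""
    keys = list(keys)
    if len(set(keys)) != len(keys):
        return False  # duplicate keys cannot map to distinct slots
    slots = [lookup(k) for k in keys]
    # n lookups whose set equals {0, ..., n-1} must form a bijection.
    return set(slots) == set(range(len(keys)))


hashes = [10, 20, 50, 80]

# Stand-in lookups (assumptions for illustration, not BBHash output).
good = {10: 2, 20: 0, 50: 3, 80: 1}.get
bad = lambda h: h % 4  # 10 and 50 collide, as do 20 and 80

print(is_minimal_perfect(good, hashes))  # True
print(is_minimal_perfect(bad, hashes))   # False
```

With the example above, the call would be `is_minimal_perfect(mph.lookup, uint_hashes)`.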

### Usage of BBHashTable

```
import random
from collections import defaultdict
from bbhash_table import BBHashTable

# 200 distinct hashes; an MPHF is built over a set of unique keys
all_hashes = random.sample(range(100, 2**32), 200)
half_hashes = all_hashes[:100]

table = BBHashTable()

# hash the first 100 of the hashes
table.initialize(half_hashes)

# store associated values
for hashval, value in zip(half_hashes, [1, 2, 3, 4, 5] * 20):
    table[hashval] = value

# retrieve & count for all (which will include hashes not in MPHF)
d = defaultdict(int)
for hashval in all_hashes:
    value = table[hashval]
    d[value] += 1

assert d[1] == 20
assert d[None] == 100
```

The last for loop can be done quickly, in Cython, using

```
d = table.get_unique_values(all_hashes)
```

Motivation: the table is a useful way to (just for one hypothetical
example :) store a mapping from k-mers to compact De Bruijn graph node
IDs.  (We use this in several places in spacegraphcats!)

----

CTB Oct 2020
            

Raw data

{
    "_id": null,
    "home_page": "http://github.com/dib-lab/pybbhash",
    "name": "bbhash",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "C. Titus Brown",
    "author_email": "titus@idyll.org",
    "download_url": "https://files.pythonhosted.org/packages/b2/c5/e5868524242e28fb56f7b0d0f8eafa061d6ade8202bd82fd723d1c62c0e8/bbhash-0.5.4.tar.gz",
    "platform": "",
    "bugtrack_url": null,
    "license": "BSD 3-clause",
    "summary": "A Python wrapper for the BBHash Minimal Perfect Hash Function",
    "version": "0.5.4",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "0ed687dd9edd595e7800386d6e73d65a",
                "sha256": "1b5e07cd99927c1517441b97bf625c5f4fb3a3bafa114c621ad83f014b4d9ea8"
            },
            "downloads": -1,
            "filename": "bbhash-0.5.4.tar.gz",
            "has_sig": false,
            "md5_digest": "0ed687dd9edd595e7800386d6e73d65a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 19440,
            "upload_time": "2020-10-25T23:22:07",
            "upload_time_iso_8601": "2020-10-25T23:22:07.320693Z",
            "url": "https://files.pythonhosted.org/packages/b2/c5/e5868524242e28fb56f7b0d0f8eafa061d6ade8202bd82fd723d1c62c0e8/bbhash-0.5.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2020-10-25 23:22:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "dib-lab",
    "github_project": "pybbhash",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "bbhash"
}