perfect-hash


Nameperfect-hash JSON
Version 0.4.3 PyPI version JSON
download
home_pagehttps://github.com/ilanschnell/perfect-hash
Summarycreating perfect minimal hash function
upload_time2023-11-05 04:43:02
maintainer
docs_urlNone
authorIlan Schnell
requires_python
licenseBSD
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            Creating minimal perfect hash functions
=======================================

Generate a minimal perfect hash function for a given set of keys.
A given code template is filled with parameters, such that the
output is code which implements the hash function.
Templates can easily be constructed for any programming language.


Installation
------------

The minimal perfect hash function generator is written in pure Python,
and can be installed using:

.. code-block:: shell-session

    $ pip install perfect-hash

The code supports Python 2.7 and Python 3.5 or higher.
However, some of the examples do not support Python 2 anymore.


Introduction
------------

A perfect hash function of a certain set S of keys is a hash function
which maps all keys in S to different numbers.
That means that for the set S, the hash function is collision-free,
or perfect.
Further, a perfect hash function is called "minimal" when it maps N keys
to N *consecutive* integers, usually in the range from 0 to N-1.


Usage
-----

Given a set of keys which are character strings, the program returns a minimal
perfect hash function.  This hash function is returned in the form of Python
code by default.  Suppose we have a file with keys:

.. code-block::

    # 'animals.txt'
    Elephant
    Horse
    Camel
    Python
    Dog
    Cat


The exact way this file is parsed can be specified using command line
options, for example it is possible to only read one column from a file
which contains different items in each row.
The program is invoked like this:

.. code-block:: python

    $ perfect-hash animals.txt
    # =======================================================================
    # ================= Python code for perfect hash function ===============
    # =======================================================================

    G = [0, 3, 6, 0, 4, 1, 5]

    def hash_f(key, T):
        return sum(ord(T[i % 8]) * ord(c) for i, c in enumerate(key)) % 7

    def perfect_hash(key):
        return (G[hash_f(key, "1mmhoNMG")] +
                G[hash_f(key, "gf53KKbH")]) % 7

    # ============================ Sanity check =============================

    K = ["Elephant", "Horse", "Camel", "Python", "Dog", "Cat"]
    assert len(K) == 6

    for h, k in enumerate(K):
        assert perfect_hash(k) == h


The way the program works is by filling a code template with the calculated
parameters.  The program can take such a template in form of a file and
fill in the calculated parameters, this allows the generation of perfect
hash function in any programming language.  The hash function is kept quite
simple and does not require machine or language specific byte level operations
which might be hard to implement in the target language.
The following parameters are available in the template:

==========  =======================================
string      expands to
==========  =======================================
``$NS``     length of S1 and S2 salt
``$S1``     S1 salt
``$S2``     S2 salt
``$NG``     length of array G
``$G``      array of integers G
``$NK``     number of keys, i.e. length of array K
``$K``      array with (quoted) keys
``$$``      $ (a literal dollar sign)
==========  =======================================


Since the syntax for arrays is not the same in all programming languages,
some specifics can be adjusted using command line options.
The built-in template which creates the above code is:

.. code-block:: python

    G = [$G]

    def hash_f(key, T):
        return sum(ord(T[i % $NS]) * ord(c) for i, c in enumerate(str(key))) % $NG

    def perfect_hash(key):
        return (G[hash_f(key, "$S1")] +
                G[hash_f(key, "$S2")]) % $NG


Using code templates, makes this program very flexible.  The source repository
includes several complete examples for C.  There are many choices one
faces when implementing a static hash table: Do the parameter lists go into
a separate header file?  Should the API for the table only contain the hash
values, but not the objects being mapped?  And so on.
All these various choices are possible because of the template is simply
filled with the parameters, no matter what else is inside the template.


Hash function types
-------------------

One important option the ``perfect-hash`` command provides is ``--hft`` which
is short of "hash function type".  There are two types to choose from:

1. A random hash function generation which creates hash function with a
   random string being used as it's salt.   This is the default.
   Since the generated random hash function does not include large enough
   output for a very large number of keys (over 10,000), the perfect hash
   function generation will fail for such large keys.  However, the
   implementation of this hash function is quite simple and fast.

2. A random hash function generation which creates hash function with a
   random integers being used as it's salt.  Using this option will always
   succeed, but an implementation requires two additional integer
   arrays (apart from the always present array ``G``).


Examples
--------

The source repository contains many useful examples (in ``examples/``) which
illustrate how to use the ``perfect-hash`` command, as well
as ``python_hash.py`` as a library.


License of output
-----------------

perfect-hash is released under the BSD license.  However, that does not
cause the output produced by perfect-hash to be under BSD.  The reason is
that the output contains only small pieces of text that come directly from
perfect-hash's source code – less than 10 lines long if the default template
is being used, which serves more for illustration purposes - too small for
being significant.  Therefore the output is not “work based on perfect-hash”.

The output produced by perfect-hash contains essentially all of the
input data.  Therefore the output is a “derivative work” of the input (in
the sense of U.S. copyright law); and its copyright status depends on the
copyright of the input.  For most software licenses, the result is that the
output is under the same license, with the same copyright holder, as the
input that was passed to perfect-hash.


Acknowledgments
---------------

Part of the code is based on an a program A.M. Kuchling wrote:
http://www.amk.ca/python/code/perfect-hash

The algorithm this library is based on is described in the paper
"Optimal algorithms for minimal perfect hashing",
Z. J. Czech, G. Havas and B.S. Majewski.
http://cmph.sourceforge.net/papers/chm92.pdf

I tried to illustrate the algorithm and explain how it works on:
http://ilan.schnell-web.net/prog/perfect-hash/algo.html

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ilanschnell/perfect-hash",
    "name": "perfect-hash",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Ilan Schnell",
    "author_email": "ilanschnell@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/6a/6b/d94b0f8ab4aab03eb0bb01bfc6e9a10897bf2eb74d81d30be20dac95389f/perfect-hash-0.4.3.tar.gz",
    "platform": null,
    "description": "Creating minimal perfect hash functions\n=======================================\n\nGenerate a minimal perfect hash function for a given set of keys.\nA given code template is filled with parameters, such that the\noutput is code which implements the hash function.\nTemplates can easily be constructed for any programming language.\n\n\nInstallation\n------------\n\nThe minimal perfect hash function generator is written in pure Python,\nand can be installed using:\n\n.. code-block:: shell-session\n\n    $ pip install perfect-hash\n\nThe code supports Python 2.7 and Python 3.5 or higher.\nHowever, some of the examples do not support Python 2 anymore.\n\n\nIntroduction\n------------\n\nA perfect hash function of a certain set S of keys is a hash function\nwhich maps all keys in S to different numbers.\nThat means that for the set S, the hash function is collision-free,\nor perfect.\nFurther, a perfect hash function is called \"minimal\" when it maps N keys\nto N *consecutive* integers, usually in the range from 0 to N-1.\n\n\nUsage\n-----\n\nGiven a set of keys which are character strings, the program returns a minimal\nperfect hash function.  This hash function is returned in the form of Python\ncode by default.  Suppose we have a file with keys:\n\n.. code-block::\n\n    # 'animals.txt'\n    Elephant\n    Horse\n    Camel\n    Python\n    Dog\n    Cat\n\n\nThe exact way this file is parsed can be specified using command line\noptions, for example it is possible to only read one column from a file\nwhich contains different items in each row.\nThe program is invoked like this:\n\n.. code-block:: python\n\n    $ perfect-hash animals.txt\n    # =======================================================================\n    # ================= Python code for perfect hash function ===============\n    # =======================================================================\n\n    G = [0, 3, 6, 0, 4, 1, 5]\n\n    def hash_f(key, T):\n        return sum(ord(T[i % 8]) * ord(c) for i, c in enumerate(key)) % 7\n\n    def perfect_hash(key):\n        return (G[hash_f(key, \"1mmhoNMG\")] +\n                G[hash_f(key, \"gf53KKbH\")]) % 7\n\n    # ============================ Sanity check =============================\n\n    K = [\"Elephant\", \"Horse\", \"Camel\", \"Python\", \"Dog\", \"Cat\"]\n    assert len(K) == 6\n\n    for h, k in enumerate(K):\n        assert perfect_hash(k) == h\n\n\nThe way the program works is by filling a code template with the calculated\nparameters.  The program can take such a template in form of a file and\nfill in the calculated parameters, this allows the generation of perfect\nhash function in any programming language.  The hash function is kept quite\nsimple and does not require machine or language specific byte level operations\nwhich might be hard to implement in the target language.\nThe following parameters are available in the template:\n\n==========  =======================================\nstring      expands to\n==========  =======================================\n``$NS``     length of S1 and S2 salt\n``$S1``     S1 salt\n``$S2``     S2 salt\n``$NG``     length of array G\n``$G``      array of integers G\n``$NK``     number of keys, i.e. length of array K\n``$K``      array with (quoted) keys\n``$$``      $ (a literal dollar sign)\n==========  =======================================\n\n\nSince the syntax for arrays is not the same in all programming languages,\nsome specifics can be adjusted using command line options.\nThe built-in template which creates the above code is:\n\n.. code-block:: python\n\n    G = [$G]\n\n    def hash_f(key, T):\n        return sum(ord(T[i % $NS]) * ord(c) for i, c in enumerate(str(key))) % $NG\n\n    def perfect_hash(key):\n        return (G[hash_f(key, \"$S1\")] +\n                G[hash_f(key, \"$S2\")]) % $NG\n\n\nUsing code templates, makes this program very flexible.  The source repository\nincludes several complete examples for C.  There are many choices one\nfaces when implementing a static hash table: Do the parameter lists go into\na separate header file?  Should the API for the table only contain the hash\nvalues, but not the objects being mapped?  And so on.\nAll these various choices are possible because of the template is simply\nfilled with the parameters, no matter what else is inside the template.\n\n\nHash function types\n-------------------\n\nOne important option the ``perfect-hash`` command provides is ``--hft`` which\nis short of \"hash function type\".  There are two types to choose from:\n\n1. A random hash function generation which creates hash function with a\n   random string being used as it's salt.   This is the default.\n   Since the generated random hash function does not include large enough\n   output for a very large number of keys (over 10,000), the perfect hash\n   function generation will fail for such large keys.  However, the\n   implementation of this hash function is quite simple and fast.\n\n2. A random hash function generation which creates hash function with a\n   random integers being used as it's salt.  Using this option will always\n   succeed, but an implementation requires two additional integer\n   arrays (apart from the always present array ``G``).\n\n\nExamples\n--------\n\nThe source repository contains many useful examples (in ``examples/``) which\nillustrate how to use the ``perfect-hash`` command, as well\nas ``python_hash.py`` as a library.\n\n\nLicense of output\n-----------------\n\nperfect-hash is released under the BSD license.  However, that does not\ncause the output produced by perfect-hash to be under BSD.  The reason is\nthat the output contains only small pieces of text that come directly from\nperfect-hash's source code \u2013 less than 10 lines long if the default template\nis being used, which serves more for illustration purposes - too small for\nbeing significant.  Therefore the output is not \u201cwork based on perfect-hash\u201d.\n\nThe output produced by perfect-hash contains essentially all of the\ninput data.  Therefore the output is a \u201cderivative work\u201d of the input (in\nthe sense of U.S. copyright law); and its copyright status depends on the\ncopyright of the input.  For most software licenses, the result is that the\noutput is under the same license, with the same copyright holder, as the\ninput that was passed to perfect-hash.\n\n\nAcknowledgments\n---------------\n\nPart of the code is based on an a program A.M. Kuchling wrote:\nhttp://www.amk.ca/python/code/perfect-hash\n\nThe algorithm this library is based on is described in the paper\n\"Optimal algorithms for minimal perfect hashing\",\nZ. J. Czech, G. Havas and B.S. Majewski.\nhttp://cmph.sourceforge.net/papers/chm92.pdf\n\nI tried to illustrate the algorithm and explain how it works on:\nhttp://ilan.schnell-web.net/prog/perfect-hash/algo.html\n",
    "bugtrack_url": null,
    "license": "BSD",
    "summary": "creating perfect minimal hash function",
    "version": "0.4.3",
    "project_urls": {
        "Homepage": "https://github.com/ilanschnell/perfect-hash"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6a6bd94b0f8ab4aab03eb0bb01bfc6e9a10897bf2eb74d81d30be20dac95389f",
                "md5": "09b501487ab9e8181cad5b3f217d06e1",
                "sha256": "b886b33f79bbf429c2a8c6dbb9329748fb931124f28a2eda461195e6430f88b5"
            },
            "downloads": -1,
            "filename": "perfect-hash-0.4.3.tar.gz",
            "has_sig": false,
            "md5_digest": "09b501487ab9e8181cad5b3f217d06e1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 10834,
            "upload_time": "2023-11-05T04:43:02",
            "upload_time_iso_8601": "2023-11-05T04:43:02.227460Z",
            "url": "https://files.pythonhosted.org/packages/6a/6b/d94b0f8ab4aab03eb0bb01bfc6e9a10897bf2eb74d81d30be20dac95389f/perfect-hash-0.4.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-05 04:43:02",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ilanschnell",
    "github_project": "perfect-hash",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "perfect-hash"
}
        
Elapsed time: 1.14880s