aghasher

* **Name** aghasher
* **Version** 0.1.1
* **Home page** https://github.com/dstein64/aghasher
* **Summary** An implementation of Anchor Graph Hashing
* **Upload time** 2022-12-25 20:22:06
* **Author** Daniel Steinberg
* **License** MIT
* **Keywords** anchor-graph-hashing, hashing, locality-sensitive-hashing, machine-learning
* **Requirements** No requirements were recorded.

[![Build Status](https://github.com/dstein64/aghasher/workflows/build/badge.svg)](https://github.com/dstein64/aghasher/actions)

aghasher
========

An implementation of the Anchor Graph Hashing algorithm (1-AGH), presented in *Hashing with Graphs* (Liu et al. 2011).

Dependencies
------------

*aghasher* supports Python 2.7 and Python 3, and requires numpy and scipy. These should be linked with a BLAS
implementation (e.g., OpenBLAS, ATLAS, Intel MKL). Without a BLAS link, numpy/scipy use a fallback that causes
*aghasher* to run over 50x slower.
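
One way to check which BLAS/LAPACK libraries numpy was built against is `numpy.show_config()`. The following is a minimal sketch; the helper name `print_numpy_build_info` is hypothetical and not part of aghasher, and the printed output differs across numpy versions and platforms.

```python
def print_numpy_build_info():
    """Print numpy's build configuration, or report that numpy is missing."""
    try:
        import numpy
    except ImportError:
        print("numpy is not installed")
        return False
    # Lists the BLAS/LAPACK libraries that numpy was built with.
    numpy.show_config()
    return True


print_numpy_build_info()
```

If no optimized BLAS library appears in the output, numpy is likely using its slower fallback.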

Installation
------------

[aghasher](https://pypi.python.org/pypi/aghasher) is available on PyPI, the Python Package Index.

```sh
$ pip install aghasher
```

How To Use
----------

To use aghasher, first import the *aghasher* module.

    import aghasher
    
### Training a Model

An AnchorGraphHasher is constructed using the *train* method, which returns an AnchorGraphHasher and the hash bit
embedding for the training data.

    agh, H_train = aghasher.AnchorGraphHasher.train(X, anchors, num_bits, nn_anchors, sigma)

AnchorGraphHasher.train takes 5 arguments:

* **X** An *n-by-d* numpy.ndarray with training data. The rows correspond to *n* observations, and the columns
  correspond to *d* dimensions.
* **anchors** An *m-by-d* numpy.ndarray with anchors. *m* is the total number of anchors. Rows correspond to anchors,
  and columns correspond to dimensions. The dimensionality of the anchors must match the dimensionality of the training
  data.
* **num_bits** (optional; defaults to 12) Number of hash bits for the embedding.
* **nn_anchors** (optional; defaults to 2) Number of nearest anchors that are used for approximating the neighborhood
  structure.
* **sigma** (optional; defaults to *None*) sigma for the Gaussian radial basis function that is used to determine
  similarity between points. When sigma is specified as *None*, the code will automatically set a value, depending on
  the training data and anchors.
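
To illustrate how *nn_anchors* and *sigma* interact, here is a minimal pure-Python sketch of anchor weighting in the spirit of Liu et al. (2011): each point keeps only its *nn_anchors* nearest anchors, weights them with a Gaussian RBF, and normalizes the weights to sum to 1. The function name `anchor_weights` is hypothetical, and this is not aghasher's internal code. The 2 in the denominator of the exponent follows the kernel convention described in the section on differences from the Matlab implementation, and the automatic choice of *sigma* is not covered.

```python
import math


def anchor_weights(x, anchors, nn_anchors=2, sigma=1.0):
    """Return one weight per anchor for the point x (illustrative only).

    Only the nn_anchors nearest anchors get a nonzero weight; the nonzero
    weights come from a Gaussian RBF and are normalized to sum to 1.
    """
    # Squared Euclidean distance from x to every anchor.
    sq_dists = [sum((xi - ai) ** 2 for xi, ai in zip(x, anchor))
                for anchor in anchors]
    # Indices of the nn_anchors closest anchors.
    nearest = sorted(range(len(anchors)), key=lambda j: sq_dists[j])[:nn_anchors]
    # Gaussian RBF with a 2 in the denominator of the exponent.
    kernel = {j: math.exp(-sq_dists[j] / (2.0 * sigma ** 2)) for j in nearest}
    total = sum(kernel.values())
    return [kernel.get(j, 0.0) / total for j in range(len(anchors))]


weights = anchor_weights([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(weights)  # two nonzero weights that sum to 1; the far anchor gets 0.0
```

A smaller *sigma* concentrates the weight on the single closest anchor, while a larger *sigma* spreads it more evenly across the *nn_anchors* nearest ones.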

### Hashing Data with an AnchorGraphHasher Model

Given an AnchorGraphHasher object (named *agh* in the examples here), out-of-sample data is hashed with the object's
*hash* method.

    agh.hash(X)
    
The hash method takes one argument:

* **X** An *n-by-d* numpy.ndarray with data. The rows correspond to *n* observations, and the columns correspond to *d*
  dimensions. The dimensionality of the data must match the dimensionality of the training data used to train the
  AnchorGraphHasher.

Since Python does not have a native bit vector data structure, the hash method returns an *n-by-r* numpy.ndarray, where
*n* is the number of observations in *X*, and *r* is the number of hash bits specified when the model was trained.
The elements of the returned array are boolean values that correspond to bits.
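
If a more compact representation is useful, each row of booleans can be packed into a Python integer, after which Hamming distances reduce to an XOR and a bit count. This is a minimal sketch and not part of aghasher; the helper names `pack_bits` and `hamming_distance` are hypothetical, and the small list `H` stands in for rows of the array returned by the hash method.

```python
def pack_bits(row):
    """Pack an iterable of booleans into one int (first bit is most significant)."""
    code = 0
    for bit in row:
        code = (code << 1) | int(bool(bit))
    return code


def hamming_distance(code_a, code_b):
    """Count the bit positions where two packed codes differ."""
    return bin(code_a ^ code_b).count("1")


# Hypothetical stand-in for two rows of the array returned by agh.hash(X).
H = [[True, False, True], [True, True, False]]
codes = [pack_bits(row) for row in H]
print(codes)                                 # [5, 6]
print(hamming_distance(codes[0], codes[1]))  # 2
```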

### Testing an AnchorGraphHasher Model

Testing is performed with the AnchorGraphHasher.test method.

    precision = aghasher.AnchorGraphHasher.test(H_train, H_test, y_train, y_test, radius)
    
AnchorGraphHasher.test takes 5 arguments:

* **H_train** An *n-by-r* numpy.ndarray with the hash bit embedding corresponding to the training data. The rows
  correspond to the *n* observations, and the columns correspond to the *r* hash bits.
* **H_test** An *m-by-r* numpy.ndarray with the hash bit embedding corresponding to the testing data. The rows
  correspond to the *m* observations, and the columns correspond to the *r* hash bits.
* **y_train** An *n-by-1* numpy.ndarray with the ground truth labels for the training data.
* **y_test** An *m-by-1* numpy.ndarray with the ground truth labels for the testing data.
* **radius** (optional; defaults to 2) Hamming radius to use for calculating precision.
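
For intuition about what a Hamming-radius precision measures, the following pure-Python sketch computes one common definition: for each test code, retrieve the training codes within the given Hamming radius and take the fraction that share the test label, then average over test items. This is an assumption about the metric, not aghasher's code. In particular, scoring a query with no retrieved neighbors as 0 is a choice made here, labels are flat lists rather than *n-by-1* arrays, and the function name `hamming_radius_precision` is hypothetical.

```python
def hamming_radius_precision(H_train, H_test, y_train, y_test, radius=2):
    """Mean precision of label matches within a Hamming radius (illustrative)."""
    precisions = []
    for test_code, test_label in zip(H_test, y_test):
        # Labels of training items whose codes differ in at most `radius` bits.
        retrieved = [
            train_label
            for train_code, train_label in zip(H_train, y_train)
            if sum(a != b for a, b in zip(test_code, train_code)) <= radius
        ]
        if retrieved:
            hits = sum(1 for label in retrieved if label == test_label)
            precisions.append(hits / len(retrieved))
        else:
            # Assumption: a query that retrieves nothing scores 0.
            precisions.append(0.0)
    return sum(precisions) / len(precisions) if precisions else 0.0


H_train = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
y_train = [0, 1, 1]
print(hamming_radius_precision(H_train, [[0, 0, 0, 0]], y_train, [0], radius=1))  # 0.5
```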

Tests
-----

Tests are in [tests/](https://github.com/dstein64/aghasher/blob/master/tests).

```sh
# Run tests
$ python3 -m unittest discover tests -v
```

Differences from the Matlab Reference Implementation
----------------------------------------------------

The code is structured differently from the Matlab reference implementation.

The Matlab code implements an additional hashing method, hierarchical hashing (referred to as 2-AGH), an extension of
1-AGH that is not implemented here.

There is one functional difference relative to the Matlab code. If *sigma* is specified (as opposed to being
auto-estimated), then for the same value of *sigma*, the Matlab and Python code will produce different results. They
will produce the same results when the Matlab *sigma* is sqrt(2) times bigger than the manually specified *sigma* in the
Python code. This is because in the Gaussian RBF kernel, the Python code uses a 2 in the denominator of the exponent,
and the Matlab code does not. A 2 was included in the denominator of the Python code, as that is the canonical way to
use an RBF kernel.
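
The sqrt(2) relationship can be checked numerically. The sketch below writes out the two kernel conventions as described above (with and without the 2 in the denominator of the exponent) and converts a Python-side *sigma* to the equivalent Matlab-side value. The function names are hypothetical, and the kernel forms are inferred from this description rather than copied from either codebase.

```python
import math


def rbf_python_convention(sq_dist, sigma):
    """Gaussian RBF with a 2 in the denominator of the exponent."""
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_matlab_convention(sq_dist, sigma):
    """Gaussian RBF without the 2 in the denominator of the exponent."""
    return math.exp(-sq_dist / sigma ** 2)


def matlab_sigma_from_python(sigma):
    """Matlab-side sigma that matches a given Python-side sigma."""
    return math.sqrt(2.0) * sigma


python_sigma = 0.7
matlab_sigma = matlab_sigma_from_python(python_sigma)
# Both conventions give the same kernel value for the same squared distance.
print(math.isclose(rbf_python_convention(3.0, python_sigma),
                   rbf_matlab_convention(3.0, matlab_sigma)))  # True
```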

License
-------

*aghasher* has an [MIT License](https://en.wikipedia.org/wiki/MIT_License).

See [LICENSE](LICENSE).

References
----------

Liu, Wei, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2011. “Hashing with Graphs.” In Proceedings of the 28th
International Conference on Machine Learning (ICML-11), edited by Lise Getoor and Tobias Scheffer, 1–8. ICML ’11. New
York, NY, USA: ACM.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/dstein64/aghasher",
    "name": "aghasher",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "anchor-graph-hashing,hashing,locality-sensitive-hashing,machine-learning",
    "author": "Daniel Steinberg",
    "author_email": "ds@dannyadam.com",
    "download_url": "https://files.pythonhosted.org/packages/f8/26/834018eef9c1921a919b64b5ae857515942c366c68f77096295944561f6c/aghasher-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An implementation of Anchor Graph Hashing",
    "version": "0.1.1",
    "split_keywords": [
        "anchor-graph-hashing",
        "hashing",
        "locality-sensitive-hashing",
        "machine-learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "36a513594c88cfcd8274e3164f9f2138",
                "sha256": "00b0da42afff65c9dab3f749e38aba1038b67162162ad7f39dfa4d3d55640841"
            },
            "downloads": -1,
            "filename": "aghasher-0.1.1-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "36a513594c88cfcd8274e3164f9f2138",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 7823,
            "upload_time": "2022-12-25T20:22:05",
            "upload_time_iso_8601": "2022-12-25T20:22:05.187895Z",
            "url": "https://files.pythonhosted.org/packages/be/4a/1bebc89e83ef3a53e0481edda9c35b2f9d6645401ccd63b7eaf640e82edc/aghasher-0.1.1-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "2a6fb509776fd1b79492746d5bb2a48a",
                "sha256": "50d916fff7c224c689ab1e2a86abc673b4b43265cd8b631aff5041295749a93f"
            },
            "downloads": -1,
            "filename": "aghasher-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "2a6fb509776fd1b79492746d5bb2a48a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7598,
            "upload_time": "2022-12-25T20:22:06",
            "upload_time_iso_8601": "2022-12-25T20:22:06.870177Z",
            "url": "https://files.pythonhosted.org/packages/f8/26/834018eef9c1921a919b64b5ae857515942c366c68f77096295944561f6c/aghasher-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-25 20:22:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "dstein64",
    "github_project": "aghasher",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "aghasher"
}
        