sparselsh

Name: sparselsh
Version: 2.1.1
Home page: https://github.com/brandonrobertz/sparselsh
Summary: A locality sensitive hashing library with an emphasis on large, sparse datasets.
Author: Brandon Roberts
Upload time: 2022-12-20 22:22:21
Keywords: clustering, sparse, lsh
Requirements: numpy (>=1.18.6,<2.0), scipy (>=1.0.0,<2.0), scikit-learn (>=0.24.0,<2.0)
# SparseLSH

A locality sensitive hashing library with an emphasis on large, high-dimensional datasets.

## Features

- Fast and memory-efficient calculations using sparse matrices.
- Built-in support for key-value storage backends: pure-Python, Redis (memory-bound), LevelDB, BerkeleyDB.
- Support for multiple hash indexes (based on Kay Zhu's lshash).
- Built-in support for common distance/objective functions for ranking outputs.

## Details

SparseLSH is based on a fork of Kay Zhu's lshash and is suited for datasets that won't
fit into main memory or are high-dimensional. Using sparse matrices
allows for speedups of easily over an order of magnitude compared to dense, list-based
or numpy array-based vector math. Sparse matrices also make it possible to deal with
these datasets purely in memory using Python dicts or through Redis. When that isn't
appropriate, you can use one of the disk-based key-value stores, LevelDB or BerkeleyDB.
Serialization is done using cPickle (for raw C speedups), falling back to pure-Python
pickle if it's not available.
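
For instance, building a bag-of-words matrix with scikit-learn (already a dependency of SparseLSH) keeps the data sparse end to end. A minimal sketch, assuming the `LSH` constructor and batch `index` call shown in the Quickstart below:

    from sklearn.feature_extraction.text import CountVectorizer
    from sparselsh import LSH

    # A toy corpus; a realistic one might have millions of rows and a
    # vocabulary of tens of thousands of columns, far too big to densify.
    corpus = ["the quick brown fox", "a lazy dog", "the quick dog"]

    # CountVectorizer returns a scipy.sparse CSR matrix, which LSH consumes directly
    X = CountVectorizer().fit_transform(corpus)

    lsh = LSH(4, X.shape[1], num_hashtables=1, storage_config={"dict": None})
    lsh.index(X, extra_data=corpus)  # batch indexing, available in v2.1.0+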

## Installation

The easy way:

    pip install sparselsh

Or you can clone this repo and follow these instructions:

`SparseLSH` depends on the following libraries:
- [numpy](http://www.numpy.org/)
- [scipy](http://www.scipy.org/)

Optionally (for in-memory and disk-based persistence):
- [redis](https://pypi.python.org/pypi/redis/)
- [leveldb](https://code.google.com/p/py-leveldb/)
- [bsddb](https://pypi.python.org/pypi/bsddb3/6.0.1) (built-in on Python 2.7.x)

To install (minimal install):

    python setup.py install

If you would like to use the LevelDB or Redis storage backends, you can
install the dependencies from the `optional-requirements.txt`:

    pip install -r optional-requirements.txt

## Quickstart

You can try LSH quickly with the bundled `sparselsh` command-line utility. Simply pass the path to a file containing records to be clustered, one per line, and the script will output groups of similar items.

    sparselsh path/to/recordsfile.txt

To create 4-bit hashes for input data of 7 dimensions:

    from sparselsh import LSH
    from scipy.sparse import csr_matrix

    X = csr_matrix([
        [3, 0, 0, 0, 0, 0, -1],
        [0, 1, 0, 0, 0, 0,  1],
        [1, 1, 1, 1, 1, 1,  1]
    ])

    # One label for each input point
    y = ["label-one", "second", "last"]

    lsh = LSH(
        4,
        X.shape[1],
        num_hashtables=1,
        storage_config={"dict":None}
    )

    # If you're using >= v2.1.0, this is much faster
    # lsh.index(X, extra_data=y)

    # For all versions
    for ix in range(X.shape[0]):
        x = X.getrow(ix)
        c = y[ix]
        lsh.index(x, extra_data=c)

    # Build a 1-D (single row) sparse matrix
    X_sim = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])
    # find the point in X nearest to X_sim
    points = lsh.query(X_sim, num_results=1)

The query above results in a list of tuples, each pairing a (matrix, extra_data)
tuple with a similarity score. A lower score is better in this case (the score here is 1.0).
Here's a breakdown of the return value when `query` is called with a
single input row (a 1-dimensional sparse matrix; the output is different
when passing multiple query points):

    [((<1x7 sparse matrix of type '<type 'numpy.int64'>' with 7 stored elements in Compressed Sparse Row format>, 'label'), 1.0)]

We can look at the most similar matched item by accessing the sparse matrix
and invoking its `todense` method:

    #                      ,------- Get first result-score tuple
    #                     |   ,---- Get data. [1] is distance score
    #                     |  |  ,-- Get the point. [1] is extra_data
    #                     |  |  |
    In [11]: print(points[0][0][0].todense())
    [[1 1 1 1 1 1 1]]

You can also pass multiple records to `query` by simply increasing the
dimension of the input to `query`. This adds one extra dimension to the
output data, representing each input query. (NOTE: when the input is a
single sparse row, a.k.a. 1-dimensional, this extra dimension won't
be added.) Here's the output when `query` is passed a 2-dimensional input:

    [
      [((<1x7 sparse matrix ...>, 'label'), 1.0)],
      [((<1x7 sparse matrix ...>, 'extra'), 0.8),
       ((<1x7 sparse matrix ...>, 'another'), 0.3)]
    ]

Here you can see the extra dimension, one entry for each query point passed
to `query`. The data structure for each query point's result is the same
as the 1-dimensional output.
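
A minimal sketch of walking that nested structure, assuming the `lsh` instance and labels from the Quickstart above (unpacking `(point, extra)` assumes extra_data was supplied at index time):

    # Two query rows at once; `results` has one entry per query row
    queries = csr_matrix([
        [1, 1, 1, 1, 1, 1, 0],
        [3, 0, 0, 0, 0, 0, 0],
    ])
    results = lsh.query(queries, num_results=2)

    for qix, matches in enumerate(results):
        for (point, extra), score in matches:
            # `point` is a 1x7 CSR matrix; `extra` is the label given as extra_data
            print(qix, extra, score)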

## Main Interface

Most of the parameters are supplied at class init time:

    LSH( hash_size,
         input_dim,
         num_hashtables=1,
         storage_config=None,
         matrices_filename=None,
         overwrite=False)

Parameters:

    hash_size:
        The length of the resulting binary hash, in bits. This controls how many
        "buckets" there will be for items to be sorted into (2^hash_size per
        hash table).

    input_dim:
        The dimension of the input vector. This must match the dimension of
        the input points.

    num_hashtables = 1:
        (optional) The number of hash tables used. More hash tables increase the
        probability of hash collisions, making it more likely that similar
        items will be found for a query item. Increase this to get better
        accuracy at the expense of increased memory usage.

    storage_config = None:
        (optional) A dict naming the storage backend and its configuration
        options. The following storage backends are supported, with the
        following configurations (see the sketch after this parameter list):
            In-memory Python dictionary:
                {"dict": None} # Takes no options
            Redis:
                {"redis": {"host": "127.0.0.1", "port": 6379, "db": 0}}
            LevelDB:
                {"leveldb": {"db": "ldb"}}
                Where "ldb" specifies the directory to store the LevelDB database
                (in this case it will be `./ldb/`).
            Berkeley DB:
                {"berkeleydb": {"filename": "./db"}}
                Where "filename" is the location of the database file.

    matrices_filename = None:
        (optional) The path to the .npz file where the random matrices are
        stored, or will be stored if the file does not exist yet. If you change
        the input dimensions or the number of hash tables, you'll need to set
        the overwrite option to True, or delete this file.

    overwrite = False:
        (optional) Whether to overwrite the matrices file if it already exists.
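
As a minimal sketch of the storage backends described above (the Redis example assumes a server is reachable at the address shown):

    from sparselsh import LSH

    # In-memory dict backend (no external services required)
    lsh_mem = LSH(4, 7, storage_config={"dict": None})

    # Redis backend, using the configuration keys documented above
    lsh_redis = LSH(
        4,
        7,
        num_hashtables=2,
        storage_config={"redis": {"host": "127.0.0.1", "port": 6379, "db": 0}},
    )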

### Index (Add points to hash table)

To index a data point with a given `LSH` instance:

    lsh.index(input_point, extra_data=None)

Parameters:

    input_point:
        The input data point: a sparse (CSR) matrix of dimension input_dim,
        as in the Quickstart.

    extra_data = None:
        (optional) Extra data to be added along with the input_point.
        This can be used to hold values like class labels, URIs, titles, etc.

This function returns nothing.
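
Since records are serialized with pickle (see Details above), extra_data can plausibly carry structured values like the titles and URIs mentioned; a minimal sketch, assuming the `lsh` instance from the Quickstart:

    from scipy.sparse import csr_matrix

    row = csr_matrix([[0, 2, 0, 0, 1, 0, 0]])
    # extra_data can hold a class label or richer metadata; any picklable
    # value should work (an assumption, not a documented guarantee)
    lsh.index(row, extra_data={"title": "example", "uri": "http://example.com"})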

### Query (Search for similar points)

To query a data point against a given `LSH` instance:

    lsh.query(query_point, num_results=None, distance_func="euclidean")

Parameters:

    query_point:
        The query data point is a sparse CSR matrix.

    num_results = None:
        (optional) Integer specifying the maximum number of results to
        return. If not specified, all candidates will be returned as a
        list in ranked order.
        NOTE: limiting the results does not save processing. Currently, a
        similarity ranking and sort is done on all items in the hashtable
        regardless of this parameter.

    distance_func = "euclidean":
        (optional) The distance function used to rank the candidates. By
        default the euclidean distance function is used.

Returns a list of tuples, each of which pairs the original indexed point (a
tuple of (csr-matrix, extra_data), or just the csr-matrix if no extra data
was supplied) with a similarity score.
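
Putting it together, a minimal sketch of a single-point query that handles both return shapes described above:

    matches = lsh.query(csr_matrix([[1, 1, 1, 1, 1, 1, 0]]), num_results=3)

    for item, score in matches:
        if isinstance(item, tuple):  # extra_data was supplied at index time
            point, extra = item
        else:                        # bare csr-matrix otherwise
            point, extra = item, None
        print(score, extra, point.todense())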


            
