embeddingcache

Name: embeddingcache
Version: 0.1.0 (PyPI)
Summary: Retrieve text embeddings, but cache them locally if we have already computed them.
Upload time: 2023-07-13 16:30:06
Requires Python: >=3.7
License: Apache License, Version 2.0 (Copyright 2023, Joseph Turian)
Keywords: sample, setuptools, development
Requirements: no requirements were recorded.
<div align="center">

# embeddingcache

[![PyPI](https://img.shields.io/pypi/v/embeddingcache)](https://pypi.org/project/embeddingcache/)
[![license](https://img.shields.io/badge/License-Apache--2.0-green.svg?labelColor=gray)](https://github.com/turian/embeddingcache#license)
[![python](https://img.shields.io/badge/-Python_3.7_%7C_3.8_%7C_3.9_%7C_3.10_%7C_3.11-blue?logo=python&logoColor=white)](https://www.python.org/)
[![black](https://img.shields.io/badge/Code%20Style-Black-black.svg?labelColor=gray)](https://black.readthedocs.io/en/stable/)
[![isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![tests](https://github.com/turian/embeddingcache/actions/workflows/test.yml/badge.svg)](https://github.com/turian/embeddingcache/actions/workflows/test.yml)

Retrieve text embeddings, but cache them locally if we have already computed them.

</div>

## Motivation

If you are working on a handful of different NLP tasks, or keep
tuning a single NLP pipeline, you probably don't want to recompute
embeddings for the same text. Hence, we cache them.

## Quickstart

```
pip install embeddingcache
```

```
from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)
```
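
Assuming `get_embeddings` returns one float32 vector per input
string, in the same order as `strs` (consistent with the storage
format described below), the result can be used like any other
embedding matrix. A minimal sketch:

```
import numpy as np

# Assumption: embeddings is a (num_strings, dim) float32 numpy
# array, one row per input string, in the same order as strs.
hi_vec, berlin_vec = embeddings[0], embeddings[1]

# Cosine similarity between the two cached embeddings.
similarity = np.dot(hi_vec, berlin_vec) / (
    np.linalg.norm(hi_vec) * np.linalg.norm(berlin_vec)
)
print(f"cosine similarity: {similarity:.3f}")
```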

## Design assumptions

We use SQLite3 to cache embeddings. (Since we use SQLAlchemy, this
could easily be adapted to other database backends.)

We assume read-heavy loads with at most one concurrent writer.
(However, we retry on write failures; a sketch of the idea follows.)
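
The retry logic itself isn't documented here; as a rough
illustration (the function name and backoff parameters below are
ours, not the package's API), one way to retry writes that hit
SQLite's single-writer lock:

```
import sqlite3
import time

def retry_write(write_fn, retries=5, backoff=0.1):
    """Illustrative only: retry a write that fails because SQLite's
    single-writer lock is held, with exponential backoff."""
    for attempt in range(retries):
        try:
            return write_fn()
        except sqlite3.OperationalError:  # e.g. "database is locked"
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2**attempt)
```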

We shard SQLite3 into two databases (illustrated in the sketch
after this list):

* `hashstring.db`: the `hashstring` table. Each row maps a (unique,
primary-key) SHA512 hash to its text (also unique). Both fields are
indexed.
* `[embedding_model_name].db`: the `embedding` table. Each row maps
a (unique, primary-key) SHA512 hash to a 1-dimensional numpy
(float32) vector, which we serialize to the table as bytes.
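
As a concrete illustration of the layout above (not the package's
actual models; the class and helper names here are ours), the two
tables might look like this in SQLAlchemy:

```
import hashlib

import numpy as np
from sqlalchemy import Column, LargeBinary, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HashString(Base):
    # hashstring.db: maps each SHA512 hash (hex) to its original text.
    __tablename__ = "hashstring"
    hash = Column(String(128), primary_key=True)
    text = Column(Text, unique=True, index=True)

class Embedding(Base):
    # [embedding_model_name].db: maps each SHA512 hash to a vector.
    __tablename__ = "embedding"
    hash = Column(String(128), primary_key=True)
    vector = Column(LargeBinary)  # raw float32 bytes

def sha512_hex(text: str) -> str:
    return hashlib.sha512(text.encode("utf-8")).hexdigest()

def serialize(vec: np.ndarray) -> bytes:
    # 1-dim float32 vector -> bytes, as described above.
    return vec.astype(np.float32).tobytes()

def deserialize(blob: bytes) -> np.ndarray:
    return np.frombuffer(blob, dtype=np.float32)
```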

## Developer instructions

```
pre-commit install
pip install -e .
pytest
```

## TODO

* Update pyproject.toml
* Add tests
* Consider other hash functions?
* float32 and float64 support
* Consider adding optional joblib for caching?
* Different ways of computing embeddings (e.g. using an API) rather than locally
* S3 backup and/or:
    * WAL
    * [LiteStream](https://fly.io/blog/all-in-on-sqlite-litestream/)
* Retry on write errors
* Other DB backends
* Best practices: give a specific OpenAI model version number.
* RocksDB / RocksDB-cloud?
* Include the model name in the DB as a sanity check on slugify.
* Validate numpy array size.
* Validate BLOB size for hashes.
* Add optional libraries like openai and sentence-transformers
    * Also consider other embedding providers, e.g. cohere
    * And libs just for devs
* Consider the max_length of each text to embed; warn if we exceed it
* pdoc3 and/or sphinx
* Normalize embeddings by default, but add option
* Option to return torch tensors
* Consider reusing the same DB connection instead of creating it
from scratch every time.
* Add batch_size parameter?
* Add a test that checks for hash collisions
* Use logging, not verbose output.
* Rewrite using classes.
* Fix dependabot.
* Don't keep re-creating the DB session; store it in the class or a global
* DRY.
* Suggest using a versioned OpenAI model
* Add device to sentence transformers
* Allow fast_sentence_transformers
* Test that things work if there are duplicate strings
* Remove DBs after test
* Do we have to have nested embedding.embedding for all calls?
* codecov and code quality shields