staticvectors

Name: staticvectors
Version: 0.1.0
Home page: https://github.com/neuml/staticvectors
Summary: Work with static vector models
Upload time: 2025-01-23 21:06:25
Author: NeuML
Requires Python: >=3.9
License: Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0
Keywords: search embedding machine-learning nlp
            <p align="center">
    <img src="https://raw.githubusercontent.com/neuml/staticvectors/master/logo.png"/>
</p>

<p align="center">
    <b>Work with static vector models</b>
</p>

<p align="center">
    <a href="https://github.com/neuml/staticvectors/releases">
        <img src="https://img.shields.io/github/release/neuml/staticvectors.svg?style=flat&color=success" alt="Version"/>
    </a>
    <a href="https://github.com/neuml/staticvectors/releases">
        <img src="https://img.shields.io/github/release-date/neuml/staticvectors.svg?style=flat&color=blue" alt="GitHub Release Date"/>
    </a>
    <a href="https://github.com/neuml/stativectors/issues">
        <img src="https://img.shields.io/github/issues/neuml/staticvectors.svg?style=flat&color=success" alt="GitHub issues"/>
    </a>
    <a href="https://github.com/neuml/staticvectors">
        <img src="https://img.shields.io/github/last-commit/neuml/staticvectors.svg?style=flat&color=blue" alt="GitHub last commit"/>
    </a>
    <a href="https://github.com/neuml/staticvectors/actions?query=workflow%3Abuild">
        <img src="https://github.com/neuml/staticvectors/workflows/build/badge.svg" alt="Build Status"/>
    </a>
    <a href="https://coveralls.io/github/neuml/staticvectors?branch=master">
        <img src="https://img.shields.io/coverallsCoverage/github/neuml/staticvectors" alt="Coverage Status">
    </a>
</p>

`staticvectors` makes it easy to work with static vector models. This includes word vector models such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec), [GloVe](https://nlp.stanford.edu/projects/glove/) and [FastText](https://fasttext.cc/). While [Transformers-based models](https://github.com/huggingface/transformers) are now the primary way to embed content for vector search, these older models still have a purpose.

For example, this [FastText language identification model](https://fasttext.cc/docs/en/language-identification.html) is still one of the fastest and most efficient ways to detect languages. N-grams work well for this task and inference is lightning fast.

Additionally, there are historical, low-resource and other languages where there just isn't enough training data to build a solid language model. In these cases, a simpler model using one of these older techniques might be the best option.

## What's wrong with the existing libraries?

Unfortunately, the tooling to use word vector models is aging and in some cases unmaintained. The world is moving forward and these libraries are getting harder to install.

As a concrete example, the build script for [txtai](https://github.com/neuml/txtai/blob/master/.github/workflows/build.yml#L42) often has to be modified to get FastText to work on all supported platforms. There are pre-compiled versions but they're often slow to support the latest version of Python or fix issues.

This project breathes life into word vector models and integrates them with modern tooling such as the [Hugging Face Hub](https://huggingface.co/models) and [Safetensors](https://github.com/huggingface/safetensors). While it's pure Python, it's still fast due to its heavy usage of [NumPy](https://github.com/numpy/numpy) and [vectorization techniques](https://numpy.org/doc/stable/user/whatisnumpy.html#why-is-numpy-fast).
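
As a rough illustration of the vectorization idea (not the library's internal code), the sketch below compares a plain Python loop against a single vectorized NumPy lookup for averaging word vectors. The `vectors` matrix and `ids` list are toy data made up for the example.

```python
import numpy as np

# Toy embedding matrix: 50,000 "words" with 100-dimensional vectors
vectors = np.random.rand(50000, 100).astype(np.float32)
ids = np.random.randint(0, 50000, size=1000)

# Pure Python loop: one lookup and one accumulation per token
def mean_loop(vectors, ids):
    total = np.zeros(vectors.shape[1], dtype=np.float32)
    for i in ids:
        total += vectors[i]
    return total / len(ids)

# Vectorized: fancy indexing pulls all rows at once, the mean runs in C
def mean_vectorized(vectors, ids):
    return vectors[ids].mean(axis=0)

# Both produce the same result, the vectorized version is much faster
assert np.allclose(mean_loop(vectors, ids), mean_vectorized(vectors, ids), atol=1e-4)
```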

This also makes the project easier to maintain, as there is only a single pure Python package to install.

## Installation
The easiest way to install is via pip and PyPI.

```
pip install staticvectors
```

Python 3.9+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.

`staticvectors` can also be installed directly from GitHub to access the latest, unreleased features.

```
pip install git+https://github.com/neuml/staticvectors
```

## Quickstart

See the following examples for how to use this library.

### Convert an existing FastText model

```python
from staticvectors import FastTextConverter, StaticVectors

# Download https://huggingface.co/julien-c/fasttext-language-id/blob/main/lid.176.bin
converter = FastTextConverter()
converter("lid.176.bin", "langid")

# Load the converted model - runs in pure Python, FastText library install not required for inference
model = StaticVectors("langid")
model.predict(["Hello, what language is this?"])
```

### Load an existing Magnitude SQLite database

This library replaces [Magnitude Lite](https://github.com/neuml/magnitude), which is now deprecated. Existing Magnitude SQLite databases can be loaded directly.

```python
from staticvectors import StaticVectors

model = StaticVectors("/path/to/vectors.magnitude")

# Get word vectors
model.embeddings(["hello"])
```
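
Assuming `embeddings` returns one vector per input term as a NumPy-compatible array (a reasonable reading of the example above, not a documented guarantee), the results can be compared with standard NumPy operations. A minimal sketch:

```python
import numpy as np

from staticvectors import StaticVectors

model = StaticVectors("/path/to/vectors.magnitude")

# Get word vectors for two terms; each row is one term's embedding
x = np.array(model.embeddings(["hello", "hi"]))

# Cosine similarity between the two vectors
similarity = np.dot(x[0], x[1]) / (np.linalg.norm(x[0]) * np.linalg.norm(x[1]))
print(similarity)
```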

### Convert and quantize

```python
from staticvectors import FastTextConverter, StaticVectors

# Download https://huggingface.co/julien-c/fasttext-language-id/blob/main/lid.176.bin
converter = FastTextConverter()

# Quantize with PQ and two subspaces - model goes from 125MB to 4MB with minimal accuracy impacts!
converter("lid.176.bin", "langid-pq2x256", quantize=2)

# Load the converted model - runs in pure Python, FastText library install not required for inference
model = StaticVectors("langid")
model.predict(["Hello, what language is this?"])
```
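
The 125MB to 4MB reduction comes from product quantization (PQ). The sketch below is a conceptual illustration of PQ with two subspaces and 256 centroids, not the converter's actual implementation; it uses scikit-learn's KMeans for clustering and toy vectors made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy vectors: 10,000 embeddings with 16 dimensions
vectors = np.random.rand(10000, 16).astype(np.float32)

# Split each vector into 2 subspaces and learn 256 centroids per subspace
subspaces = np.split(vectors, 2, axis=1)
codebooks = [KMeans(n_clusters=256, n_init=1).fit(s) for s in subspaces]

# Encode: each vector becomes 2 bytes, one centroid id per subspace
codes = np.stack(
    [k.predict(s) for k, s in zip(codebooks, subspaces)], axis=1
).astype(np.uint8)

# Decode: approximate a vector by concatenating its subspace centroids
def decode(i):
    return np.concatenate(
        [k.cluster_centers_[c] for k, c in zip(codebooks, codes[i])]
    )

print(vectors[0])
print(decode(0))
```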

See the unit tests in this project for more examples.

## Libraries for Static Embeddings with Transformers models

This library is primarily focused on word vector models. There is a recent push to distill Transformers models into static embeddings models. The difference between `staticvectors` and these libraries is that their base models are Transformers models. Additionally, they use Transformers tokenizers, whereas word vector models tokenize on whitespace and use n-grams.
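
For a sense of that difference, here is a simplified FastText-style tokenizer: whitespace splitting plus character n-grams per token. The boundary markers and n-gram range are illustrative defaults, not the exact settings of any particular model.

```python
def ngrams(word, minn=3, maxn=5):
    """Generates character n-grams for a word, using < and > as boundary markers."""
    word = f"<{word}>"
    return [word[i:i + n] for n in range(minn, maxn + 1) for i in range(len(word) - n + 1)]

def tokenize(text):
    """Whitespace tokenization plus subword n-grams for each token."""
    return {token: ngrams(token) for token in text.lower().split()}

print(tokenize("Hello, what language is this?"))
```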

Check out these links for more on static embeddings with Transformers models.

- [Model2Vec](https://github.com/MinishLab/model2vec) - Turn any sentence transformer into a really small static model
- [Static Sentence Transformers](https://huggingface.co/blog/static-embeddings) - Run 100x to 400x faster on CPU than state-of-the-art embedding models

            
