# reach

[![Documentation Status](https://readthedocs.org/projects/reach/badge/?version=latest)](https://reach.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/reach.svg)](https://badge.fury.io/py/reach)
[![Downloads](https://pepy.tech/badge/reach)](https://pepy.tech/project/reach)

A light-weight package for working with pre-trained word embeddings.
Useful for input into neural networks, or for doing compositional semantics.

`reach` can read in word vectors in `word2vec` or `glove` format without
any preprocessing.

The assumption behind `reach` is a no-hassle approach to featurization: the
vectorization and bag-of-words (bow) methods know how to deal with
out-of-vocabulary (OOV) words, so your own code doesn't have to handle them.
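
For instance, an OOV token can simply fall back to the unknown word. A minimal sketch (the exact fallback behaviour of `bow` is an assumption here; see the API reference):

```python
import numpy as np

from reach import Reach

# Toy space with an explicit unknown word at index 0.
r = Reach(np.random.randn(3, 8), ["UNK", "cat", "dog"], unk_index=0)

# "pangolin" is OOV; in this sketch we assume it falls back to the
# unk index instead of raising an error.
print(r.bow(["cat", "pangolin", "dog"]))  # e.g. [1, 0, 2]
```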

`reach` also includes nearest neighbor calculation for arbitrary vectors.

## [Documentation](https://reach.readthedocs.io/en/latest/)

* [API reference](https://reach.readthedocs.io/en/latest/source/api.html)
* Tutorial coming soon (see below for an example)

## Installation

If you just want `reach`:

```
pip install reach
```

If you also want [`AutoReach`](#autoreach):

```
pip install reach[auto]
```

## Example

```python
import numpy as np

from reach import Reach

# Load from a .vec or .txt file
# unk_word specifies which token is the "unknown" token.
# If this token is not in your vector space, it is added as an extra word
# with a corresponding zero vector.
# If it is in your embedding space, it is used.
r = Reach.load("path/to/embeddings", unk_word="UNK")

# Alternatively, if you have a matrix, you can directly
# input it.

# Stand-in for word embeddings
mtr = np.random.randn(8, 300)
words = ["UNK", "cat", "dog", "best", "creature", "alive", "span", "prose"]
r = Reach(mtr, words, unk_index=0)

# Get vectors through indexing.
# Throws a KeyError if a word is not present.
vector = r['cat']

# Compare two words.
similarity = r.similarity('cat', 'dog')

# Find most similar.
similarities = r.most_similar('cat', 2)

sentence = 'a dog is the best creature alive'.split()
corpus = [sentence, sentence, sentence]

# bow representation consistent with the word vectors,
# for input into a neural network.
bow = r.bow(sentence)

# vectorized representation.
vectorized = r.vectorize(sentence)

# can remove OOV words automatically.
vectorized = r.vectorize(sentence, remove_oov=True)

# Can mean pool out of the box.
mean = r.mean_pool(sentence)
# Invalid sentences are handled automatically:
# they are set to the vector of the unk word, or to a zero vector.
corpus_mean = r.mean_pool_corpus([sentence, sentence, ["not_a_word"]], remove_oov=True, safeguard=False)

# vectorize corpus.
transformed = r.transform(corpus)

# Get nearest words to arbitrary vector
nearest = r.nearest_neighbor(np.random.randn(1, 300))

# Get every word within a certain threshold
thresholded = r.threshold("cat", threshold=0.0)
```

## Loading and saving

`reach` has many options for saving and loading files, including custom separators, a custom number of dimensions, loading from a custom wordlist, loading a custom number of words, and error recovery. One difference between `gensim` and `reach` is that `reach` loads both GloVe-style `.vec` files and regular `word2vec` files. Unlike `gensim`, `reach` does not support loading binary files.
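
As an illustration, loading with these options might look like the sketch below. The keyword names (`num_to_load`, `wordlist`, `recover_from_errors`) are assumptions made for illustration; check the API reference for the actual signature.

```python
from reach import Reach

# Hypothetical keyword names for the options described above.
r = Reach.load(
    "path/to/embeddings.vec",
    num_to_load=100_000,        # assumed: read only the first 100k words
    wordlist=None,              # assumed: optionally restrict to a wordlist
    recover_from_errors=True,   # assumed: skip malformed lines
)
```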

### Benchmark

On my machine (a 2022 M1 MacBook Pro), we get the following loading times for [`COW BIG`](https://github.com/clips/dutchembeddings), a file containing about 3 million rows and 320 dimensions.

| System | Time (7 loops)    |
|--------|-------------------|
| Gensim | 3min 57s ± 344 ms |
| reach  | 2min 14s ± 4.09 s |
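
A comparison along these lines can be run as follows (a sketch; `cow-big.txt` is a placeholder path, and `gensim` is exercised through its standard `KeyedVectors.load_word2vec_format` call):

```python
from gensim.models import KeyedVectors

from reach import Reach

# IPython magic commands; "cow-big.txt" is a placeholder for the COW BIG file.
%timeit KeyedVectors.load_word2vec_format("cow-big.txt", binary=False)
%timeit Reach.load("cow-big.txt")
```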

## Fast format

`reach` has a special fast format, which is useful if you want to reload your word vectors often. The fast format can be created with the `save_fast_format` function and loaded with the `load_fast_format` function. In terms of loading speed, this is roughly on par with `gensim`'s own format.
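
For example (a minimal sketch, assuming `save_fast_format` is an instance method and `load_fast_format` is, like `load`, a classmethod; the paths are placeholders):

```python
from reach import Reach

r = Reach.load("path/to/embeddings.vec")

# Write the vectors once in the fast format...
r.save_fast_format("embeddings_fast")

# ...and reload them much faster on subsequent runs.
r = Reach.load_fast_format("embeddings_fast")
```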

# autoreach

Reach also has a way of automatically inferring words from strings without using a pre-defined tokenizer, i.e., without splitting the string into words. This is useful because there might be a mismatch between the tokenizer you happen to have on hand and the word vectors you use. For example, if your vector space contains an embedding for the word `"it's"`, and your tokenizer splits this string into two tokens, `["it", "'s"]`, the embedding for `"it's"` will never be found.

autoreach solves this problem by finding only words from your pre-defined vocabulary in a string, thus removing the need for any tokenization. We use the [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm), which allows us to find substrings in linear time. The downside of using Aho-Corasick is that it also finds substrings of regular words. For example, the word `the` will be found as a substring of `these`. To circumvent this, we perform a regex-based clean-up step.

**Warning! The clean-up step involves checking for surrounding spaces and punctuation marks. Hence, if the language for which you use Reach does not actually use spaces and/or punctuation marks to designate word boundaries, the entire process might not work.**
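
To illustrate the mechanism, here is a toy sketch using `pyahocorasick` directly, with a simplified boundary check standing in for the actual clean-up regex:

```python
import re

import ahocorasick

# Build an automaton over a tiny vocabulary.
automaton = ahocorasick.Automaton()
for word in ["the", "these", "dog"]:
    automaton.add_word(word, word)
automaton.make_automaton()

text = "these dogs chased the dog"
for end, word in automaton.iter(text):
    start = end - len(word) + 1
    # Keep a match only if it is bounded by whitespace, punctuation,
    # or the string edges (simplified; not reach's exact regex).
    before = text[start - 1] if start > 0 else " "
    after = text[end + 1] if end + 1 < len(text) else " "
    if re.fullmatch(r"[\W\s]", before) and re.fullmatch(r"[\W\s]", after):
        print(word, (start, end))
# Prints "these", "the", and the final "dog", but not the "the"
# inside "these" or the "dog" inside "dogs".
```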

### Example

```python
import numpy as np

from reach import AutoReach

words = ["dog", "walked", "home"]
vectors = np.random.randn(3, 32)

r = AutoReach(vectors, words)

sentence = "The dog, walked, home"
bow = r.bow(sentence)

found_words = [r.indices[index] for index in bow]
```

### Benchmark

Because we no longer need to tokenize, `AutoReach` can be many times faster. In this benchmark, we compare against plain whitespace splitting and `nltk`'s `word_tokenize` function.

We will use the entirety of Mary Shelley's Frankenstein, which you can find [here](https://www.gutenberg.org/cache/epub/42324/pg42324.txt), and the `glove.6B.100d` vectors, which you can find [here](https://nlp.stanford.edu/data/glove.6B.zip).

```python
from pathlib import Path

from nltk import word_tokenize

from reach import AutoReach, Reach


txt = Path("pg42324.txt").read_text().lower()
normal_reach = Reach.load("glove.6B.100d.txt")
auto_reach = AutoReach.load("glove.6B.100d.txt")

# IPython magic commands
%timeit normal_reach.vectorize(word_tokenize(txt), remove_oov=True)
# 345 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit normal_reach.vectorize(txt.split(), remove_oov=True)
# 25.4 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit auto_reach.vectorize(txt)
# 69.9 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

As you can see, the tokenizer introduces significant overhead compared to plain splitting, while matching words with the Aho-Corasick algorithm is still reasonably fast.

# License

MIT

# Author

Stéphan Tulkens

            
