unisim


- Name: unisim
- Version: 1.0.1
- Home page: https://github.com/google/unisim
- Summary: UniSim: Universal Similarity
- Upload time: 2024-08-08 21:04:42
- Author: Google
- License: MIT

# UniSim: Universal Similarity

UniSim is a package for efficiently computing similarity, performing fuzzy matching, deduplicating datasets, and clustering data (text and images). The UniSim package is in beta and currently supports text (e.g. for fuzzy string matching); image support will be added soon.

## News

- Aug 2024: 1.0.1 with Domain phishing detection colab and bug fixes is out.
- May 2024: Initial version with Text Similarity and RAG tutorial is out.

## Installation

You can use `pip` to install the latest version of UniSim:

```
pip install unisim
```


### Simple case

Computing text similarity is as easy as:

```python
from unisim import TextSim
text_sim = TextSim()
text_sim.similarity("this is a text", "This is a txt! 😀")  # 0.9113
```
The higher the similarity, the more similar the strings are.

Beyond this simple example, UniSim supports GPU-accelerated batch matching and ANN indexing for medium to large matching use cases, including dataset deduplication, [LLM RAG](/notebooks/unisim-gemma-text_rag_demo.ipynb), [address lookup](/notebooks/unisim_text_demo.ipynb), and [phishing domain lookup](/notebooks/ct-domain-demo.ipynb).

### GPU Acceleration
By default, UniSim uses [ONNX](https://github.com/onnx/onnx) when running on CPU, and [TensorFlow](https://www.tensorflow.org/) or [ONNX GPU](https://github.com/onnx/onnx) for GPU acceleration. You can switch backends by setting the `BACKEND` environment variable (e.g. `os.environ["BACKEND"] = "tf"` or `"onnx"`). If you have a GPU, you can additionally install either TensorFlow or the ONNX GPU runtime:

```bash
pip install unisim[tensorflow]
```

or

```bash
pip uninstall onnxruntime
pip install onnxruntime-gpu
```
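
As a rough illustration of the `BACKEND` switch described above, the sketch below validates a backend name and sets the environment variable. The `select_backend` helper and the `SUPPORTED_BACKENDS` tuple are hypothetical names introduced here, not part of UniSim. It also assumes the variable is read when `unisim` is imported, so the import should come after the call.

```python
import os

# Backend names taken from the text above; the helper itself is hypothetical.
SUPPORTED_BACKENDS = ("tf", "onnx")


def select_backend(name: str) -> str:
    """Validate the backend name, then set the BACKEND environment variable."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["BACKEND"] = name
    return name


select_backend("onnx")

# Assumption: the variable is read when unisim is imported, so import afterwards.
# from unisim import TextSim
```

Validating the name up front turns a typo into an immediate error instead of a silent fallback later.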

## Text UniSim (TextSim)

The goal of TextSim is to provide an easy-to-use tool for efficient, accurate and multilingual fuzzy string matching, near-duplicate detection, and string similarity. Please see the tutorial [colab](notebooks/unisim_text_demo.ipynb) for an in-depth example on using TextSim for real-world use cases like fuzzy matching for addresses.

TextSim is significantly faster than edit-distance algorithms such as Levenshtein Distance for fuzzy string matching and more accurate than ngram-based methods such as MinHash for near-duplicate text detection and clustering. TextSim accepts strings of arbitrary length and can scale to datasets with millions of examples.

To accomplish this, TextSim leverages the [RETSim model](https://arxiv.org/abs/2311.17264) to efficiently embed texts into high-dimensional vectors that can be compared using cosine similarity. TextSim then uses [USearch](https://github.com/unum-cloud/usearch) for fast vector search.

### Basic Usage

You can compute the similarity between two strings using the `.similarity(text1, text2)` function. The similarity is a float between 0 and 1, with 1.0 representing most similar (identical strings). The similarity value is the cosine similarity between the vector representations of strings. You can directly get the vector representation of strings using the `.embed(inputs)` function as well.

```python
from unisim import TextSim

text_sim = TextSim()

# compute similarity between two strings
text_sim.similarity("this is a text", "This is a txt! 😀")  # 0.9113
text_sim.similarity("this is a text", "apples")  # 0.4220
```
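
To make the similarity value more concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are toy stand-ins chosen for illustration; they are not real RETSim embeddings, and this is not UniSim's internal implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, so similarity is ~1.0
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1, so similarity is ~0.0

same_direction = cosine_similarity(v1, v2)
orthogonal = cosine_similarity(v1, v3)
```

Because cosine similarity depends only on direction and not on vector length, `v1` and `v2` score as identical even though their magnitudes differ.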

TextSim offers efficient fuzzy string matching between two lists using the `.match` function, similar to the [PolyFuzz](https://maartengr.github.io/PolyFuzz/) package. The `.match` function accepts `queries` (list of strings you want to find matches for) and `targets` (list of strings you are finding matches in).

`.match(queries, targets)` returns a Pandas DataFrame, where each row contains a query, its most similar match found in targets, their similarity, and whether or not they are a match (if their similarity is >= `similarity_threshold`). A `similarity_threshold` of `0.9` is typically a good default for matching near-duplicate strings.

```python
from unisim import TextSim

text_sim = TextSim()

queries = ["apple", "appl", "icecream", "house", "random"]
targets = ["apple", "ice cream", "mouse"]

results_df = text_sim.match(queries, targets, similarity_threshold=0.9)
```

This gives you the following Pandas DataFrame of (fuzzy) matches:
```
      query     target  similarity  is_match
0     apple      apple    1.000000      True
1      appl      apple    0.914230      True
2  icecream  ice cream    0.950734      True
3     house      mouse    0.760066     False
4    random      mouse    0.456315     False
```
TextSim is able to find fuzzy matches of strings ("appl" to "apple" and "icecream" to "ice cream") while not matching "house" to "mouse". Note that TextSim can accept strings of arbitrary length and works on longer texts. You can also perform fuzzy matching within a single list by passing only a single list, e.g. `text_sim.match(queries)`.
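
Conceptually, the matching described above picks the most similar target for each query and then applies the threshold. The sketch below shows that logic with brute-force search over hand-made two-dimensional vectors. The `best_matches` function and the vectors are hypothetical illustrations; UniSim's own implementation embeds the strings with RETSim and is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def best_matches(query_vecs, target_vecs, similarity_threshold=0.9):
    """For each query, find its most similar target and apply the threshold."""
    rows = []
    for q_name, q_vec in query_vecs.items():
        t_name, sim = max(
            ((name, cosine(q_vec, vec)) for name, vec in target_vecs.items()),
            key=lambda pair: pair[1],
        )
        rows.append(
            {
                "query": q_name,
                "target": t_name,
                "similarity": sim,
                "is_match": sim >= similarity_threshold,
            }
        )
    return rows


# Hand-made toy vectors (assumptions for illustration, not real embeddings).
targets = {"apple": [1.0, 0.0], "mouse": [0.0, 1.0]}
queries = {"appl": [0.9, 0.1], "house": [0.6, 0.8]}

rows = best_matches(queries, targets)
```

With these toy vectors, "appl" lands on "apple" above the threshold, while "house" is closest to "mouse" but stays below it. The DataFrame above has the same shape: every query gets a best target, and `is_match` records whether that target is close enough.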


### Large-scale Matching and Near-Duplicate Detection Workflow

TextSim offers more complex functionality that lets you maintain an index of texts (e.g. from a large dataset) and query the index to find similar texts. TextSim supports efficient approximate nearest neighbor (ANN) search using [USearch](https://github.com/unum-cloud/usearch), which allows it to scale to large datasets with millions of examples.

Please see a minimal working example below for how to use the `.add` and `.search` methods to create and search an index of texts, as well as the demo [colab](notebooks/unisim_text_demo.ipynb) for an in-depth example using TextSim for fuzzy matching on a real-world address matching dataset.

<details>

```python
from unisim import TextSim

text_sim = TextSim(
    store_data=True, # set to False for large datasets to save memory
    index_type="exact", # set to "approx" for large datasets to use ANN search
    batch_size=128, # increasing batch_size on GPU may be faster
    use_accelerator=True, # uses GPU if available, otherwise uses CPU
)

# the dataset can be very large, e.g. millions of texts
dataset = [
    "I love ice cream and cookies",
    "Ice cream is super delicious",
    "my mom makes the best homemade cookies 🍪🍪🍪",
    "This is an example text.",
    "UniSim supports very long texts as well.",
    "UniSim supports multilingual texts too. 你好!",
]

# index your dataset using .add
text_sim.add(dataset)

# queries can also be a very large dataset
queries = [
    "I luv ice cream and cookies🍦🍪",
    "This is an example query text.",
    "Unrelated text with no match in the dataset..."
]

# search the indexed dataset and find the most similar matches to queries
result_collection = text_sim.search(
    queries,
    similarity_threshold=0.9, # texts match if their similarity >= similarity_threshold
    k=5, # the number of most similar texts in indexed dataset to return for each query
)
```
NOTE: you can set `drop_closest_match=True` in `.search` to ignore the closest match if you know your query already exists in the dataset, e.g. for dataset deduplication, where your search queries are the same as your indexed dataset.

NOTE 2: you do not need to add your dataset all at once. You can continuously add to and search your index, which is useful in production use cases where you have incoming data.

`.search` returns a ResultCollection, which contains the total number of matches found for your queries as well as detailed results containing the most similar matches, their similarity values, and their content. You can visualize the results using `text_sim.visualize(result)`.

```python
# get total matches found across all queries
total_matches = result_collection.total_matches

# visualize a query result (query 0 in this case) in the result_collection
result = result_collection.results[0]
text_sim.visualize(result)
```
`.visualize` prints the following output:
```
Query 0: "I luv ice cream and cookies🍦🍪"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  ---------------------------------------------
    0  True                0.91  I love ice cream and cookies
    1  False               0.66  Ice cream is super delicious
    2  False               0.53  my mom makes the best homemade cookies 🍪🍪🍪
    3  False               0.42  This is an example text.
    4  False               0.36  UniSim supports very long texts as well.
```
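
For the deduplication case mentioned in the notes above, where the queries are the same texts as the indexed dataset, the following sketch shows what an exact (brute-force) top-k search with a threshold does. The `top_k` function, its `skip_self` flag, and the toy vectors are hypothetical. `skip_self` stands in for the idea of ignoring the closest match when a query is already in the index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(index, query_id, k=2, similarity_threshold=0.9, skip_self=True):
    """Exact top-k search over a dict of id -> vector, optionally skipping the query itself."""
    query = index[query_id]
    scored = [
        (other_id, cosine(query, vec))
        for other_id, vec in index.items()
        if not (skip_self and other_id == query_id)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        (other_id, sim, sim >= similarity_threshold) for other_id, sim in scored[:k]
    ]


# Toy index: item 1 is a near-duplicate of item 0, item 2 is unrelated.
index = {
    0: [1.0, 0.0],
    1: [0.98, 0.2],
    2: [0.0, 1.0],
}

# Query the index with its own contents to find near-duplicates of each item.
duplicates = {
    item_id: [other for other, _, is_match in top_k(index, item_id) if is_match]
    for item_id in index
}
```

Without `skip_self`, every item would trivially match itself with similarity 1.0, which is why the closest match is dropped when the queries come from the indexed dataset.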
</details>


## Citing

If you use the UniSim package in your work, please cite:

```bibtex
@software{UniSim_Universal_Similarity_2023,
    title = {{UniSim: Universal Similarity}},
    author = {Marina Zhang and Owen Vallis and Ali Zand and Elie Bursztein},
    url = {https://github.com/google/unisim},
    version = {0.0.1},
    year = {2023}
}

```
Additionally, if you use TextSim or the RETSim model, please cite the following paper:

```bibtex
@article{RETSim_2023,
    title = {{RETSim: Resilient and Efficient Text Similarity}},
    author = {Marina Zhang and Owen Vallis and Aysegul Bumin and Tanay Vakharia and Elie Bursztein},
    year = {2023},
    eprint = {arXiv:2311.17264}
}
```

## Contributing
To contribute to the project, please check out the [contribution guidelines](CONTRIBUTING.md). Thank you!

## Disclaimer
This is not an official Google product.

            
