yasem

Name: yasem
Version: 0.4.1
Summary: YASEM - Yet Another Splade|Sparse Embedder - A simple and efficient library for SPLADE embeddings
Author: Yuichi Tateno <hotchpotch@gmail.com>
Homepage / Repository: https://github.com/hotchpotch/yasem
Upload time: 2024-12-16 04:55:17
Requires Python: >=3.9
License: MIT
Keywords: nlp, embeddings, splade, sparse-vectors

## YASEM (Yet Another Splade|Sparse Embedder)

YASEM is a simple and efficient library for executing SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) and creating sparse vectors. It provides a straightforward interface inspired by [SentenceTransformers](https://sbert.net/) for easy integration into your projects.

## Why YASEM?

- Simplicity: YASEM focuses on providing a clean and simple implementation of SPLADE without unnecessary complexity.
- Efficiency: Generate sparse embeddings quickly and easily.
- Flexibility: Works with both NumPy and PyTorch backends.
- Convenience: Includes helpful utilities like get_token_values for inspecting feature representations.

## Installation

You can install YASEM using pip:

```bash
pip install yasem
```

## Quick Start

Here's a simple example of how to use YASEM:

```python
from yasem import SpladeEmbedder

# Initialize the embedder
embedder = SpladeEmbedder("naver/splade-v3")

# Prepare some sentences
sentences = [
    "Hello, my dog is cute",
    "Hello, my cat is cute",
    "Hello, I like a ramen",
    "Hello, I like a sushi",
]

# Generate embeddings
embeddings = embedder.encode(sentences)
# or sparse csr matrix
# embeddings = embedder.encode(sentences, convert_to_csr_matrix=True)

# Compute similarity
similarity = embedder.similarity(embeddings, embeddings)
print(similarity)
# [[148.62903569 106.88184372  18.86930016  22.87525314]
#  [106.88184372 122.79656474  17.45339064  21.44758757]
#  [ 18.86930016  17.45339064  61.00272733  40.92700849]
#  [ 22.87525314  21.44758757  40.92700849  73.98511539]]


# Inspect token values for the first sentence
token_values = embedder.get_token_values(embeddings[0])
print(token_values)
# {'hello': 6.89453125, 'dog': 6.48828125, 'cute': 4.6015625,
#  'message': 2.38671875, 'greeting': 2.259765625,
#    ...

token_values = embedder.get_token_values(embeddings[3])
print(token_values)
# {'##shi': 3.63671875, 'su': 3.470703125, 'eat': 3.25,
#  'hello': 2.73046875, 'you': 2.435546875, 'like': 2.26953125, 'taste': 1.8203125,
```
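
The Features list below notes that similarity is computed as a dot product over the sparse term weights, so the matrix above can be reproduced directly. A minimal sanity-check sketch, assuming `encode` returned a dense NumPy array here (as the `embeddings[0]` indexing suggests):

```python
import numpy as np

# Sanity check: the similarity matrix should equal a plain dot product
# of the embedding matrix with its transpose.
manual_similarity = embeddings @ embeddings.T
print(np.allclose(manual_similarity, similarity))
# Expected: True
```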

## Rank API

```python
# Rank documents against a query (reuses the embedder from the Quick Start above)
query = "What programming language is best for machine learning?"
documents = [
    "Python is widely used in machine learning due to its extensive libraries like TensorFlow and PyTorch",
    "JavaScript is primarily used for web development and front-end applications",
    "SQL is essential for database management and data manipulation",
]

# Get ranked results with relevance scores
results = embedder.rank(query, documents)
print(results)
# [
#   {'corpus_id': 0, 'score': 12.453},  # Python/ML document ranks highest
#   {'corpus_id': 2, 'score': 5.234},
#   {'corpus_id': 1, 'score': 3.123}
# ]

# Get ranked results including document text
results = embedder.rank(query, documents, return_documents=True)
print(results)  
# [
#   {
#     'corpus_id': 0,
#     'score': 12.453,
#     'text': 'Python is widely used in machine learning due to its extensive libraries like TensorFlow and PyTorch'
#   },
#   {
#     'corpus_id': 2, 
#     'score': 5.234,
#     'text': 'SQL is essential for database management and data manipulation'
#   },
#   ...
# ]
```
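
The README does not spell out how `rank` computes its scores; the sketch below only illustrates the idea with the `encode` and `similarity` calls shown earlier, and is not the library's actual implementation:

```python
# Illustration only: score each document by similarity to the query embedding,
# then sort descending to mimic the shape of rank()'s output.
query_emb = embedder.encode([query])
doc_embs = embedder.encode(documents)
scores = embedder.similarity(query_emb, doc_embs)[0]

ranked = sorted(
    ({"corpus_id": i, "score": float(s)} for i, s in enumerate(scores)),
    key=lambda r: r["score"],
    reverse=True,
)
print(ranked)
# Expected to place the Python/ML document (corpus_id 0) first, like rank() above.
```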

## Features

- Easy-to-use API inspired by SentenceTransformers
- Support for both NumPy arrays and scipy.sparse.csr_matrix outputs (see the sketch below)
- Efficient dot product similarity computation
- Utility function to inspect token values in embeddings
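
The CSR path is the `convert_to_csr_matrix=True` option shown commented out in the Quick Start. A minimal sketch, assuming `similarity` accepts `scipy.sparse.csr_matrix` inputs as the feature list states:

```python
from scipy.sparse import csr_matrix

# Sketch only: request CSR output for the Quick Start sentences. Assumes
# similarity() accepts CSR matrices, per the feature list above.
sparse_embeddings = embedder.encode(sentences, convert_to_csr_matrix=True)
assert isinstance(sparse_embeddings, csr_matrix)

print(embedder.similarity(sparse_embeddings, sparse_embeddings))
# Expected to match the dense similarity matrix from the Quick Start.
```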

## License

This project is licensed under the MIT License. See the LICENSE file for the full license text. Copyright (c) 2024 Yuichi Tateno (@hotchpotch)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgements

This library is inspired by the SPLADE model and aims to provide a simple interface for its usage. Special thanks to the authors of the original SPLADE paper and the developers of the model.

            
