# text2vec-onnx
This project is an onnxruntime inference port of the [text2vec](https://github.com/shibing624/text2vec) project, providing embedding extraction and semantic text search. To keep it lightweight, it depends on only three libraries: `onnxruntime`, `tokenizers`, and `numpy`.
It has mainly been tested with the [GanymedeNil/text2vec-base-chinese-onnx](https://huggingface.co/GanymedeNil/text2vec-base-chinese-onnx) model, but in principle it supports any BERT-family model.
## Installation
### CPU
```bash
pip install "text2vec2onnx[cpu]"
```
### GPU
```bash
pip install "text2vec2onnx[gpu]"
```
## Usage
### Model download
Using GanymedeNil/text2vec-base-chinese-onnx as an example, download the model to a local directory.
- Download from Hugging Face
```bash
huggingface-cli download --resume-download GanymedeNil/text2vec-base-chinese-onnx --local-dir text2vec-base-chinese-onnx
```
### Getting embeddings
```python
from text2vec2onnx import SentenceModel

embedder = SentenceModel(model_dir_path='text2vec-base-chinese-onnx')
emb = embedder.encode("你好")
```
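Embeddings returned by `encode` can be compared with cosine similarity. A minimal sketch using `numpy` (the package's own dependency); toy vectors stand in for real model output here, so the snippet runs without downloading the model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embedder.encode(...) output.
emb_a = np.array([0.2, 0.7, 0.1])
emb_b = np.array([0.21, 0.68, 0.12])

# Similar vectors score near 1.0; orthogonal vectors score 0.0.
print(cosine_similarity(emb_a, emb_b))
```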
### Semantic search
```python
from text2vec2onnx import SentenceModel, semantic_search

embedder = SentenceModel(model_dir_path='text2vec-base-chinese-onnx')

corpus = [
    "谢谢观看 下集再见",
    "感谢您的观看",
    "请勿模仿",
    "记得订阅我们的频道哦",
    "The following are sentences in English.",
    "Thank you. Bye-bye.",
    "It's true",
    "I don't know.",
    "Thank you for watching!",
]
corpus_embeddings = embedder.encode(corpus)

queries = [
    'Thank you. Bye.',
    '你干啥呢',
    '感谢您的收听',
]

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=1)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nMost similar sentence in corpus:")
    hits = hits[0]  # hits for the first (only) query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
```
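Conceptually, `semantic_search` scores the query embedding against every corpus embedding and keeps the top-k hits. A rough, self-contained sketch of that ranking step in `numpy` (toy vectors rather than real model output; the `corpus_id`/`score` dict shape mirrors the hits format used above, but this is an illustration, not the library's implementation):

```python
import numpy as np

def top_k_hits(query_emb, corpus_embs, top_k=1):
    """Rank corpus embeddings by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarities, shape (n_corpus,)
    best = np.argsort(-scores)[:top_k]  # indices of the highest scores
    return [{'corpus_id': int(i), 'score': float(scores[i])} for i in best]

# Toy corpus of three 2-D "embeddings".
corpus_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
hits = top_k_hits(np.array([0.9, 0.1]), corpus_embs, top_k=2)
```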
## License
[Apache License 2.0](LICENSE)
## References
- [text2vec](https://github.com/shibing624/text2vec)
## Buy me a coffee
<div align="center">
<a href="https://www.buymeacoffee.com/ganymedenil" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 60px !important;width: 217px !important;" ></a>
</div>
<div align="center">
<img height="360" src="https://user-images.githubusercontent.com/9687786/224522468-eafb7042-d000-4799-9d16-450489e8efa4.png"/>
<img height="360" src="https://user-images.githubusercontent.com/9687786/224522477-46f3e80b-0733-4be9-a829-37928260038c.png"/>
</div>