# text2vec-onnx
This project is an onnxruntime inference port of the [text2vec](https://github.com/shibing624/text2vec) project, providing embedding extraction and semantic text search. To keep it lightweight, it depends on only three libraries: `onnxruntime`, `tokenizers`, and `numpy`.
It has mainly been tested with the [GanymedeNil/text2vec-base-chinese-onnx](https://huggingface.co/GanymedeNil/text2vec-base-chinese-onnx) model, but in principle it supports any BERT-family model.
## Installation
### CPU
```bash
pip install "text2vec2onnx[cpu]"
```
### GPU
```bash
pip install "text2vec2onnx[gpu]"
```
## Usage
### Model download
Using GanymedeNil/text2vec-base-chinese-onnx as an example, download the model to a local directory.
- Download from Hugging Face
```bash
huggingface-cli download --resume-download GanymedeNil/text2vec-base-chinese-onnx --local-dir text2vec-base-chinese-onnx
```
### Getting embeddings
```python
from text2vec2onnx import SentenceModel

embedder = SentenceModel(model_dir_path='text2vec-base-chinese-onnx')
emb = embedder.encode("你好")
```
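Embeddings returned by `encode` can be compared with cosine similarity. A minimal sketch using `numpy` (the package's own dependency); toy vectors stand in for real model output here, so the snippet runs without downloading the model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embedder.encode(...) output.
emb_a = np.array([0.2, 0.7, 0.1])
emb_b = np.array([0.21, 0.68, 0.12])

# Similar vectors score near 1.0; orthogonal vectors score 0.0.
print(cosine_similarity(emb_a, emb_b))
```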
### Semantic search
```python
from text2vec2onnx import SentenceModel, semantic_search

embedder = SentenceModel(model_dir_path='text2vec-base-chinese-onnx')

corpus = [
    "谢谢观看 下集再见",
    "感谢您的观看",
    "请勿模仿",
    "记得订阅我们的频道哦",
    "The following are sentences in English.",
    "Thank you. Bye-bye.",
    "It's true",
    "I don't know.",
    "Thank you for watching!",
]
corpus_embeddings = embedder.encode(corpus)

queries = [
    'Thank you. Bye.',
    '你干啥呢',
    '感谢您的收听',
]

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=1)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nMost similar sentence in corpus:")
    hits = hits[0]  # hits for the first (only) query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
```
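Conceptually, `semantic_search` scores the query embedding against every corpus embedding and keeps the top-k hits. A rough, self-contained sketch of that ranking step in `numpy` (toy vectors rather than real model output; the `corpus_id`/`score` dict shape mirrors the hits format used above, but this is an illustration, not the library's implementation):

```python
import numpy as np

def top_k_hits(query_emb, corpus_embs, top_k=1):
    """Rank corpus embeddings by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarities, shape (n_corpus,)
    best = np.argsort(-scores)[:top_k]  # indices of the highest scores
    return [{'corpus_id': int(i), 'score': float(scores[i])} for i in best]

# Toy corpus of three 2-D "embeddings".
corpus_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
hits = top_k_hits(np.array([0.9, 0.1]), corpus_embs, top_k=2)
```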
## License
[Apache License 2.0](LICENSE)
## References
- [text2vec](https://github.com/shibing624/text2vec)
## Buy me a coffee
<div align="center">
<a href="https://www.buymeacoffee.com/ganymedenil" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 60px !important;width: 217px !important;" ></a>
</div>
<div align="center">
<img height="360" src="https://user-images.githubusercontent.com/9687786/224522468-eafb7042-d000-4799-9d16-450489e8efa4.png"/>
<img height="360" src="https://user-images.githubusercontent.com/9687786/224522477-46f3e80b-0733-4be9-a829-37928260038c.png"/>
</div>