| Field | Value |
| --- | --- |
| Name | uniem |
| Version | 0.3.2 |
| Summary | unified embedding model |
| upload_time | 2023-07-20 03:54:28 |
| author | wangyuxin |
| requires_python | >=3.10,<4.0 |
| license | MIT |
| docs_url | None |
| requirements | No requirements were recorded. |
# uniem
[![Release](https://img.shields.io/pypi/v/uniem)](https://pypi.org/project/uniem/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/uniem)
[![ci](https://github.com/wangyuxinwhy/uniem/actions/workflows/ci.yml/badge.svg)](https://github.com/wangyuxinwhy/uniem/actions/workflows/ci.yml)
[![cd](https://github.com/wangyuxinwhy/uniem/actions/workflows/cd.yml/badge.svg)](https://github.com/wangyuxinwhy/uniem/actions/workflows/cd.yml)
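The goal of the uniem project is to create the best general-purpose text embedding model for Chinese.
The project mainly consists of code for model training, fine-tuning, and evaluation. The models and datasets are open-sourced on the [HuggingFace](https://huggingface.co/) community.
## 🌟 Important updates
- ➿ **2023.07.11**: released uniem 0.3.0. Besides M3E, `FineTuner` now also supports fine-tuning `sentence_transformers`, `text2vec`, and similar models, as well as training GPT-family models in the [SGPT](https://github.com/Muennighoff/sgpt) style and Prefix Tuning. **The `FineTuner` initialization API has changed slightly and is not compatible with 0.2.0.**
- ➿ **2023.06.17**: released uniem 0.2.1, which implements `FineTuner` for native model fine-tuning: **a few lines of code, adapted right away**!
- 📊 **2023.06.17**: released the official version of [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh), with automated evaluation covering 6 categories of embedding models, 4 categories of tasks, and 9 datasets in total.
- 🎉 **2023.06.08**: released [M3E models](https://huggingface.co/moka-ai/m3e-base), which outperform `openai text-embedding-ada-002` on both Chinese text classification and text retrieval. For details, see the [M3E models README](https://huggingface.co/moka-ai/m3e-base/blob/main/README.md).
## 🔧 Using M3E
The M3E model family is fully compatible with [sentence-transformers](https://www.sbert.net/). By **replacing the model name**, you can use M3E models seamlessly in any project that supports sentence-transformers, such as [chroma](https://docs.trychroma.com/getting-started), [guidance](https://github.com/microsoft/guidance), and [semantic-kernel](https://github.com/microsoft/semantic-kernel).
Installation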
```bash
pip install sentence-transformers
```
Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
```
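The embeddings returned by `encode` are typically compared with cosine similarity. The following is a minimal sketch of that step, not part of uniem itself: `cosine_similarity` and `embed_with_m3e` are hypothetical helper names introduced here, and the model is loaded lazily so the similarity function can be used without downloading anything.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed_with_m3e(texts: List[str], model_name: str = "moka-ai/m3e-base"):
    """Hypothetical wrapper around the README's usage snippet.

    Requires `pip install sentence-transformers` and downloads the model
    on first use, so the import is kept inside the function.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)
```

With the model available, `a, b = embed_with_m3e(['Hello World!', '你好,世界!'])` followed by `cosine_similarity(a, b)` would give a similarity score for the two sentences.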
## 🎨 Fine-tuning models
`uniem` provides a very easy-to-use fine-tuning interface: a few lines of code, adapted right away!
```python
from datasets import load_dataset
from uniem.finetuner import FineTuner
dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Use m3e-small as the model to fine-tune
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)
```
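Before handing your own data to `FineTuner`, it can help to check that every record has a consistent shape. The sketch below is an illustration only: the three field layouts are assumptions recalled from the uniem fine-tuning tutorial and are not stated in this README (the `sentence1`/`sentence2`/`label` layout matches the `nli_zh` STS-B dataset used above), and `guess_record_type` and `check_dataset` are hypothetical helpers, not uniem functions. Confirm the expected formats in the tutorial before relying on them.

```python
from typing import Dict, Iterable, Optional, Set

# ASSUMED field layouts; these names are not confirmed by this README.
# "triplet" is listed before "pair" because a triplet also contains the pair keys.
ASSUMED_LAYOUTS: Dict[str, Set[str]] = {
    "scored_pair": {"sentence1", "sentence2", "label"},
    "triplet": {"text", "text_pos", "text_neg"},
    "pair": {"text", "text_pos"},
}


def guess_record_type(record: dict) -> Optional[str]:
    """Return the first assumed layout whose required keys are all present."""
    keys = set(record)
    for name, required in ASSUMED_LAYOUTS.items():
        if required <= keys:
            return name
    return None


def check_dataset(records: Iterable[dict]) -> str:
    """Raise ValueError unless all records share one recognised layout."""
    types = {guess_record_type(r) for r in records}
    if not types:
        raise ValueError("dataset is empty")
    if None in types:
        raise ValueError("at least one record matches no assumed layout")
    if len(types) != 1:
        raise ValueError(f"mixed record layouts: {sorted(types)}")
    return types.pop()
```

Running `check_dataset` on a small sample before `finetuner.run(...)` surfaces a malformed record early instead of partway through training.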
For details on fine-tuning, see the [uniem fine-tuning tutorial](https://github.com/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb) or <a target="_blank" href="https://colab.research.google.com/github/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
To run it locally, prepare the environment with the following commands:
```bash
conda create -n uniem python=3.10
# Activate the new environment so that pip installs into it
conda activate uniem
pip install uniem
```
## 💯 MTEB-zh
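Chinese embedding models lack a unified evaluation standard, so, following [MTEB](https://huggingface.co/spaces/mteb/leaderboard), we built a Chinese benchmark, MTEB-zh. So far 6 kinds of models have been compared side by side on a range of datasets. For the detailed evaluation method and code, see [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh).
### Text classification
- Datasets: 6 text classification datasets open-sourced on HuggingFace, covering news, e-commerce reviews, stock comments, long texts, and more.
- Evaluation: follows the MTEB protocol and reports Accuracy.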
| | text2vec | m3e-small | m3e-base | m3e-large-0619 | openai | DMetaSoul | uer | erlangshen |
| ----------------- | -------- | --------- | -------- | ------ | ----------- | ------- | ----------- | ----------- |
| TNews | 0.43 | 0.4443 | 0.4827 | **0.4866** | 0.4594 | 0.3084 | 0.3539 | 0.4361 |
| JDIphone | 0.8214 | 0.8293 | 0.8533 | **0.8692** | 0.746 | 0.7972 | 0.8283 | 0.8356 |
| GubaEastmony | 0.7472 | 0.712 | 0.7621 | 0.7663 | 0.7574 | 0.735 | 0.7534 | **0.7787** |
| TYQSentiment | 0.6099 | 0.6596 | 0.7188 | **0.7247** | 0.68 | 0.6437 | 0.6662 | 0.6444 |
| StockComSentiment | 0.4307 | 0.4291 | 0.4363 | 0.4475 | **0.4819** | 0.4309 | 0.4555 | 0.4482 |
| IFlyTek | 0.414 | 0.4263 | 0.4409 | 0.4445 | **0.4486** | 0.3969 | 0.3762 | 0.4241 |
| Average | 0.5755 | 0.5834 | 0.6157 | **0.6231** | 0.5956 | 0.552016667 | 0.57225 | 0.594516667 |
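The Average row appears to be the unweighted mean of the six per-dataset accuracies in each column. The sketch below recomputes it for three columns copied from the table; `ACCURACY` and `macro_average` are names introduced here for illustration.

```python
from statistics import mean
from typing import Dict, List

# Per-dataset accuracies copied from the table above, in row order:
# TNews, JDIphone, GubaEastmony, TYQSentiment, StockComSentiment, IFlyTek
ACCURACY: Dict[str, List[float]] = {
    "text2vec": [0.43, 0.8214, 0.7472, 0.6099, 0.4307, 0.414],
    "m3e-base": [0.4827, 0.8533, 0.7621, 0.7188, 0.4363, 0.4409],
    "m3e-large-0619": [0.4866, 0.8692, 0.7663, 0.7247, 0.4475, 0.4445],
}


def macro_average(scores: List[float]) -> float:
    """Unweighted mean over datasets: every dataset counts equally."""
    return mean(scores)


if __name__ == "__main__":
    for model, scores in ACCURACY.items():
        print(f"{model}: {macro_average(scores):.4f}")
```

Rounded to four decimals, these three means match the Average row (0.5755, 0.6157, 0.6231). Because the mean is unweighted, a small dataset influences the average as much as a large one.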
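### Retrieval and ranking
- Dataset: the [T2Ranking](https://github.com/THUIR/T2Ranking/tree/main) dataset. Because T2Ranking is very large, evaluating openai on it is costly in both time and API fees, so we use only the first 10000 articles in T2Ranking.
- Evaluation: follows the MTEB protocol and reports map@1, map@10, mrr@1, mrr@10, ndcg@1, ndcg@10.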
| | text2vec | openai-ada-002 | m3e-small | m3e-base | m3e-large-0619 | DMetaSoul | uer | erlangshen |
| ------- | -------- | -------------- | --------- | -------- | --------- | ------- | ---------- | ---------- |
| map@1 | 0.4684 | 0.6133 | 0.5574 | **0.626** | 0.6256 | 0.25203 | 0.08647 | 0.25394 |
| map@10 | 0.5877 | 0.7423 | 0.6878 | **0.7656** | 0.7627 | 0.33312 | 0.13008 | 0.34714 |
| mrr@1 | 0.5345 | 0.6931 | 0.6324 | 0.7047 | **0.7063** | 0.29258 | 0.10067 | 0.29447 |
| mrr@10 | 0.6217 | 0.7668 | 0.712 | **0.7841** | 0.7827 | 0.36287 | 0.14516 | 0.3751 |
| ndcg@1 | 0.5207 | 0.6764 | 0.6159 | 0.6881 | **0.6884** | 0.28358 | 0.09748 | 0.28578 |
| ndcg@10 | 0.6346 | 0.7786 | 0.7262 | **0.8004** | 0.7974 | 0.37468 | 0.15783 | 0.39329 |
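To make the `@k` metrics concrete, here is a minimal sketch of MRR@k and NDCG@k for a single query with binary relevance. It uses the textbook definitions and is not the code MTEB or MTEB-zh actually runs; the function names and the example document IDs are made up for illustration. The reported scores are these per-query values averaged over all queries.

```python
import math
from typing import Sequence, Set


def mrr_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # The ideal ranking places all relevant documents first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["d3", "d1", "d7"]  # hypothetical system output for one query
    relevant = {"d1"}             # hypothetical ground truth
    print(mrr_at_k(ranking, relevant, k=10))
    print(ndcg_at_k(ranking, relevant, k=10))
```

With `k=1`, both metrics reduce to whether the top result is relevant, which is why the `@1` rows sit noticeably below the `@10` rows for every model.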
## 🤝 Contributing
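If you would like to add an evaluation dataset or model to MTEB-zh, feel free to open an issue or PR. I will support it as soon as possible and look forward to your contribution!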
## 📜 License
uniem is licensed under the Apache-2.0 License. See the LICENSE file for more details.
## Raw data
{
"_id": null,
"home_page": "",
"name": "uniem",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "wangyuxin",
"author_email": "wangyuxin@mokahr.com",
"download_url": "https://files.pythonhosted.org/packages/28/8a/c4388f01a50ab9fcd5de6f2e203e3df965b456737fdf98ff9f0d19499f70/uniem-0.3.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "unified embedding model",
"version": "0.3.2",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7e0f8d11765142ddde42f4f2860456c465e9b867d0c55912d826072250471470",
"md5": "2f45f7c83b6c3a287578cb3aff518bcb",
"sha256": "a19778db6d6d992ae6c263d971fc67ead5a15de35c123629f830470381100ea5"
},
"downloads": -1,
"filename": "uniem-0.3.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2f45f7c83b6c3a287578cb3aff518bcb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10,<4.0",
"size": 25030,
"upload_time": "2023-07-20T03:54:27",
"upload_time_iso_8601": "2023-07-20T03:54:27.185354Z",
"url": "https://files.pythonhosted.org/packages/7e/0f/8d11765142ddde42f4f2860456c465e9b867d0c55912d826072250471470/uniem-0.3.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "288ac4388f01a50ab9fcd5de6f2e203e3df965b456737fdf98ff9f0d19499f70",
"md5": "f269757df357c80f72121451cadacb5c",
"sha256": "24c14eea3f4c8d35a69cfaa059e3ef0cac48a2252b8237a392ff8da473732353"
},
"downloads": -1,
"filename": "uniem-0.3.2.tar.gz",
"has_sig": false,
"md5_digest": "f269757df357c80f72121451cadacb5c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10,<4.0",
"size": 24109,
"upload_time": "2023-07-20T03:54:28",
"upload_time_iso_8601": "2023-07-20T03:54:28.123475Z",
"url": "https://files.pythonhosted.org/packages/28/8a/c4388f01a50ab9fcd5de6f2e203e3df965b456737fdf98ff9f0d19499f70/uniem-0.3.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-20 03:54:28",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "uniem"
}
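The `sha256` digests in the `urls` entries above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; `sha256_of_file` and `verify_download` are hypothetical helper names, and the expected digests are copied from the raw data.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA-256 digests copied from the "urls" entries in the raw data above.
EXPECTED_SHA256 = {
    "uniem-0.3.2-py3-none-any.whl": "a19778db6d6d992ae6c263d971fc67ead5a15de35c123629f830470381100ea5",
    "uniem-0.3.2.tar.gz": "24c14eea3f4c8d35a69cfaa059e3ef0cac48a2252b8237a392ff8da473732353",
}


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Union[str, Path]) -> bool:
    """True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of_file(path) == expected
```

A file with an unrecognised name is treated as unverified rather than raising, so the caller can decide how strict to be.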