- Name: uniem
- Version: 0.3.2
- Summary: unified embedding model
- Upload time: 2023-07-20 03:54:28
- Author: wangyuxin
- Requires Python: >=3.10,<4.0
- License: MIT
# uniem
[![Release](https://img.shields.io/pypi/v/uniem)](https://pypi.org/project/uniem/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/uniem)
[![ci](https://github.com/wangyuxinwhy/uniem/actions/workflows/ci.yml/badge.svg)](https://github.com/wangyuxinwhy/uniem/actions/workflows/ci.yml)
[![cd](https://github.com/wangyuxinwhy/uniem/actions/workflows/cd.yml/badge.svg)](https://github.com/wangyuxinwhy/uniem/actions/workflows/cd.yml)

The goal of the uniem project is to create the best general-purpose text embedding model for Chinese.

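This project mainly contains the code for model training, fine-tuning, and evaluation. The models and datasets are open-sourced on the [HuggingFace](https://huggingface.co/) community.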

## 🌟 Important Updates

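- ➿ **2023.07.11**: Released uniem 0.3.0. Besides M3E, `FineTuner` now also supports fine-tuning `sentence_transformers`, `text2vec`, and other models. It also supports training GPT-family models in the [SGPT](https://github.com/Muennighoff/sgpt) style, as well as Prefix Tuning. **The `FineTuner` initialization API has changed slightly and is not compatible with 0.2.0.**
- ➿ **2023.06.17**: Released uniem 0.2.1, which implements `FineTuner` for native model fine-tuning support. **A few lines of code, instant adaptation!**
- 📊 **2023.06.17**: Released the official version of [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh), which supports automated evaluation of 6 categories of embedding models on 4 categories of tasks across 9 datasets.
- 🎉 **2023.06.08**: Released the [M3E models](https://huggingface.co/moka-ai/m3e-base), which outperform `openai text-embedding-ada-002` on both Chinese text classification and Chinese text retrieval. For details, see the [M3E models README](https://huggingface.co/moka-ai/m3e-base/blob/main/README.md).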

## 🔧 Using M3E

The M3E model family is fully compatible with [sentence-transformers](https://www.sbert.net/). By **replacing the model name**, you can use M3E models seamlessly in any project that supports sentence-transformers, such as [chroma](https://docs.trychroma.com/getting-started), [guidance](https://github.com/microsoft/guidance), and [semantic-kernel](https://github.com/microsoft/semantic-kernel).

Installation

```bash
pip install sentence-transformers
```

Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
```
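A common next step after encoding is to compare two embeddings with cosine similarity. The following is a minimal sketch of that comparison in plain Python. The vectors `vec_en`, `vec_zh`, and `vec_other` are hypothetical toy values standing in for rows of `embeddings`; they are not actual M3E outputs, and the helper `cosine_similarity` is written here for illustration rather than taken from uniem or sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (assumption): stand-ins for embedding rows.
vec_en = [0.2, 0.1, 0.7]
vec_zh = [0.25, 0.05, 0.65]
vec_other = [0.9, -0.3, 0.0]

sim_close = cosine_similarity(vec_en, vec_zh)
sim_far = cosine_similarity(vec_en, vec_other)

# Vectors pointing in similar directions score closer to 1.
print(sim_close > sim_far)
```

With real embeddings, each row of the array returned by `model.encode` can be passed to the same function.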

## 🎨 Fine-tuning Models

`uniem` provides a very easy-to-use fine-tuning interface: a few lines of code, instant adaptation!

```python
from datasets import load_dataset

from uniem.finetuner import FineTuner

dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Use m3e-small as the model to fine-tune
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)
```

For details on fine-tuning, see the [uniem fine-tuning tutorial](https://github.com/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb) or <a target="_blank" href="https://colab.research.google.com/github/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


If you want to run it locally, run the following commands to set up the environment:

```bash
conda create -n uniem python=3.10
conda activate uniem
pip install uniem
```

## 💯 MTEB-zh

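Chinese embedding models lack a unified evaluation standard, so we built MTEB-zh, a Chinese evaluation benchmark modeled on [MTEB](https://huggingface.co/spaces/mteb/leaderboard). So far, 6 kinds of models have been compared side by side on a range of datasets. For the detailed evaluation method and code, see [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh).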


### Text Classification

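- Datasets: 6 text classification datasets open-sourced on HuggingFace, covering news, e-commerce reviews, stock comments, long text, and more.
- Evaluation: follows the MTEB evaluation procedure and reports Accuracy.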

|                   | text2vec | m3e-small | m3e-base | m3e-large-0619 | openai | DMetaSoul   | uer     | erlangshen  |
| ----------------- | -------- | --------- | -------- | -------------- | ------ | ----------- | ------- | ----------- |
| TNews             | 0.43     | 0.4443    | 0.4827   | **0.4866** | 0.4594 | 0.3084      | 0.3539  | 0.4361      |
| JDIphone          | 0.8214   | 0.8293    | 0.8533   | **0.8692** | 0.746  | 0.7972      | 0.8283  | 0.8356      |
| GubaEastmony      | 0.7472   | 0.712     | 0.7621   | 0.7663 | 0.7574 | 0.735       | 0.7534  | **0.7787**      |
| TYQSentiment      | 0.6099   | 0.6596    | 0.7188   | **0.7247** | 0.68   | 0.6437      | 0.6662  | 0.6444      |
| StockComSentiment | 0.4307   | 0.4291    | 0.4363   | 0.4475 | **0.4819** | 0.4309      | 0.4555  | 0.4482      |
| IFlyTek           | 0.414    | 0.4263    | 0.4409   | 0.4445 | **0.4486** | 0.3969      | 0.3762  | 0.4241      |
| Average           | 0.5755   | 0.5834    | 0.6157   | **0.6231** | 0.5956 | 0.552016667 | 0.57225 | 0.594516667 |
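The `Average` row appears to be the unweighted mean of the six per-dataset accuracies. As a quick check under that assumption, the sketch below recomputes the value for the `m3e-base` column from the numbers in the table above; the variable names are illustrative only.

```python
# Per-dataset accuracies for m3e-base, copied from the table above.
m3e_base_accuracy = {
    "TNews": 0.4827,
    "JDIphone": 0.8533,
    "GubaEastmony": 0.7621,
    "TYQSentiment": 0.7188,
    "StockComSentiment": 0.4363,
    "IFlyTek": 0.4409,
}

# Unweighted mean over datasets (assumption: every dataset counts equally).
average = sum(m3e_base_accuracy.values()) / len(m3e_base_accuracy)

print(round(average, 4))  # matches the 0.6157 reported in the Average row
```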

### Retrieval and Ranking

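- Dataset: the [T2Ranking](https://github.com/THUIR/T2Ranking/tree/main) dataset. Because T2Ranking is very large, evaluating OpenAI on it is costly in both time and API fees, so we use only the first 10,000 documents from T2Ranking.
- Evaluation: follows the MTEB evaluation procedure and reports map@1, map@10, mrr@1, mrr@10, ndcg@1, ndcg@10.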

|         | text2vec | openai-ada-002 | m3e-small | m3e-base | m3e-large-0619 | DMetaSoul | uer     | erlangshen |
| ------- | -------- | -------------- | --------- | -------- | -------------- | --------- | ------- | ---------- |
| map@1   | 0.4684   | 0.6133         | 0.5574    | **0.626**    | 0.6256 | 0.25203   | 0.08647 | 0.25394    |
| map@10  | 0.5877   | 0.7423         | 0.6878    | **0.7656**   | 0.7627 | 0.33312   | 0.13008 | 0.34714    |
| mrr@1   | 0.5345   | 0.6931         | 0.6324    | 0.7047   | **0.7063** | 0.29258   | 0.10067 | 0.29447    |
| mrr@10  | 0.6217   | 0.7668         | 0.712     | **0.7841**   | 0.7827 | 0.36287   | 0.14516 | 0.3751     |
| ndcg@1  | 0.5207   | 0.6764         | 0.6159    | 0.6881   | **0.6884** | 0.28358   | 0.09748 | 0.28578    |
| ndcg@10 | 0.6346   | 0.7786         | 0.7262    | **0.8004**   | 0.7974 | 0.37468   | 0.15783 | 0.39329    |
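For readers unfamiliar with the `@k` metrics reported above, here is a minimal sketch of MRR@k and nDCG@k for a single query, using the textbook definitions with binary relevance. This is an illustration only: the function names, document IDs, and toy ranking are hypothetical, and the actual MTEB implementation may differ in details such as graded relevance or averaging across queries.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy query: the system ranks four documents, two are relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(mrr_at_k(ranked, relevant, 1))   # first result is not relevant
print(mrr_at_k(ranked, relevant, 10))  # first relevant result is at rank 2
print(ndcg_at_k(ranked, relevant, 10))
```

The per-query values would then be averaged over all queries to obtain a single score per metric.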

## 🤝 Contributing

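If you would like to add evaluation datasets or models to MTEB-zh, issues and PRs are welcome. I will support them as soon as possible and look forward to your contributions!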

## 📜 License

uniem is licensed under the Apache-2.0 License. See the LICENSE file for more details.
            
