# ms2vec: MindSpore Text to Vector
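Ported from shibing624's [text2vec](https://github.com/shibing624/text2vec) library.
**Text2vec**: Text to Vector, Get Sentence Embeddings. Text vectorization represents text (words, sentences, and paragraphs) as vector matrices.
**text2vec** implements several text representation and text similarity models, including Word2Vec, RankBM25, BERT, Sentence-BERT, and CoSENT, and compares their performance on text semantic matching (similarity calculation) tasks.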
**Guide**
- [Features](#Features)
- [Evaluation](#Evaluation)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [References](#references)
## Features
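### Text vector representation models
- [Word2Vec](https://github.com/shibing624/text2vec/blob/master/text2vec/word2vec.py): looks up word vectors in the large-scale, high-quality Chinese [word vector data (lightweight version, 8 million Chinese words)](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ) open-sourced by Tencent AI Lab (file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project implements a word2vec sentence representation obtained by averaging the word vectors.
- [SBERT(Sentence-BERT)](https://github.com/shibing624/text2vec/blob/master/text2vec/sentencebert_model.py): a sentence representation model that balances performance and efficiency. It is trained with supervision using BERT and a softmax classification function; for text matching at prediction time, it takes the cosine of the sentence vectors directly. This project reproduces Sentence-BERT prediction on MindSpore.
- [CoSENT(Cosine Sentence)](https://github.com/shibing624/text2vec/blob/master/text2vec/cosent_model.py): CoSENT proposes a ranking loss function that brings training closer to prediction; it converges faster and performs better than Sentence-BERT. This project implements CoSENT prediction on MindSpore.
- [BGE(BAAI general embedding)](https://github.com/shibing624/text2vec/blob/master/text2vec/bge_model.py): this project implements BGE model prediction on MindSpore.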
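The Word2Vec and Sentence-BERT entries above both reduce to two steps at prediction time: build one vector per sentence, then compare vectors by cosine. The following is a minimal sketch of that idea in plain Python. It does not use the ms2vec API; `TOY_VECTORS`, `sentence_vector`, and `cosine_similarity` are hypothetical names, and the three-dimensional vectors are made up for illustration.

```python
import math

# Hypothetical toy word vectors; a real setup would load pretrained embeddings.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "runs": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def sentence_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of known tokens; return zeros if none are known."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each `col` holds one coordinate.
    return [sum(col) / len(known) for col in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


v1 = sentence_vector(["cat", "runs"])
v2 = sentence_vector(["dog", "runs"])
v3 = sentence_vector(["car"])
print(cosine_similarity(v1, v2))  # close to 1: the sentences share most content
print(cosine_similarity(v1, v3))  # 0.0: no overlapping dimensions
```

Averaging ignores word order, which is one reason the BERT-based models in the list are usually preferred for semantic matching.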
For details on text vector representation methods, see the wiki: [Text vector representation methods](https://github.com/shibing624/text2vec/wiki/%E6%96%87%E6%9C%AC%E5%90%91%E9%87%8F%E8%A1%A8%E7%A4%BA%E6%96%B9%E6%B3%95)
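To make the "ranking loss" in the CoSENT entry concrete, here is a minimal sketch of the loss as it is commonly written: for every two sentence pairs where the first is labeled more similar than the second, the loss penalizes the second pair's cosine exceeding the first's. This project covers prediction only, so the code is illustrative and not part of ms2vec; `cosent_loss` is a hypothetical name, and the scale of 20 is an assumed default.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of sentence pairs.

    cos_scores[k] is the predicted cosine for pair k, labels[k] its gold
    similarity. Computes log(1 + sum(exp(scale * (cos_j - cos_i)))) over
    all (i, j) where labels[i] > labels[j].
    """
    terms = [
        scale * (cos_scores[j] - cos_scores[i])
        for i in range(len(labels))
        for j in range(len(labels))
        if labels[i] > labels[j]
    ]
    # The extra 0.0 term is the "1 +" inside the log (exp(0) == 1).
    terms.append(0.0)
    # Stable log-sum-exp: subtract the maximum before exponentiating.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


well_ranked = cosent_loss([0.9, 0.1], [1, 0])   # similar pair scores higher
badly_ranked = cosent_loss([0.1, 0.9], [1, 0])  # similar pair scores lower
print(well_ranked < badly_ranked)
```

Because the loss depends only on the ordering of cosine values, training optimizes the same quantity that is used at prediction time, which is the sense in which the training process is "closer to prediction".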
Raw data
{
"_id": null,
"home_page": "https://github.com/lvyufeng/ms2vec",
"name": "ms2vec",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7.0",
"maintainer_email": null,
"keywords": "word embedding, text2vec, Chinese Text Similarity Calculation Tool, similarity, word2vec",
"author": "Lvyufeng",
"author_email": "lvyufeng@cqu.edu.cn",
"download_url": null,
"platform": null,
"description": "# ms2vec: MindSpore Text to Vector\n\n\u79fb\u690d\u81eashibing624\u7684[text2vec](https://github.com/shibing624/text2vec)\u5e93\u3002\n\n**Text2vec**: Text to Vector, Get Sentence Embeddings. \u6587\u672c\u5411\u91cf\u5316\uff0c\u628a\u6587\u672c(\u5305\u62ec\u8bcd\u3001\u53e5\u5b50\u3001\u6bb5\u843d)\u8868\u5f81\u4e3a\u5411\u91cf\u77e9\u9635\u3002\n\n**text2vec**\u5b9e\u73b0\u4e86Word2Vec\u3001RankBM25\u3001BERT\u3001Sentence-BERT\u3001CoSENT\u7b49\u591a\u79cd\u6587\u672c\u8868\u5f81\u3001\u6587\u672c\u76f8\u4f3c\u5ea6\u8ba1\u7b97\u6a21\u578b\uff0c\u5e76\u5728\u6587\u672c\u8bed\u4e49\u5339\u914d\uff08\u76f8\u4f3c\u5ea6\u8ba1\u7b97\uff09\u4efb\u52a1\u4e0a\u6bd4\u8f83\u4e86\u5404\u6a21\u578b\u7684\u6548\u679c\u3002\n\n**Guide**\n- [Features](#Features)\n- [Evaluation](#Evaluation)\n- [Install](#install)\n- [Usage](#usage)\n- [Contact](#Contact)\n- [References](#references)\n\n\n## Features\n### \u6587\u672c\u5411\u91cf\u8868\u793a\u6a21\u578b\n- [Word2Vec](https://github.com/shibing624/text2vec/blob/master/text2vec/word2vec.py)\uff1a\u901a\u8fc7\u817e\u8bafAI Lab\u5f00\u6e90\u7684\u5927\u89c4\u6a21\u9ad8\u8d28\u91cf\u4e2d\u6587[\u8bcd\u5411\u91cf\u6570\u636e\uff08800\u4e07\u4e2d\u6587\u8bcd\u8f7b\u91cf\u7248\uff09](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ) (\u6587\u4ef6\u540d\uff1alight_Tencent_AILab_ChineseEmbedding.bin \u5bc6\u7801: tawe\uff09\u5b9e\u73b0\u8bcd\u5411\u91cf\u68c0\u7d22\uff0c\u672c\u9879\u76ee\u5b9e\u73b0\u4e86\u53e5\u5b50\uff08\u8bcd\u5411\u91cf\u6c42\u5e73\u5747\uff09\u7684word2vec\u5411\u91cf\u8868\u793a\n- 
[SBERT(Sentence-BERT)](https://github.com/shibing624/text2vec/blob/master/text2vec/sentencebert_model.py)\uff1a\u6743\u8861\u6027\u80fd\u548c\u6548\u7387\u7684\u53e5\u5411\u91cf\u8868\u793a\u6a21\u578b\uff0c\u8bad\u7ec3\u65f6\u901a\u8fc7\u6709\u76d1\u7763\u8bad\u7ec3BERT\u548csoftmax\u5206\u7c7b\u51fd\u6570\uff0c\u6587\u672c\u5339\u914d\u9884\u6d4b\u65f6\u76f4\u63a5\u53d6\u53e5\u5b50\u5411\u91cf\u505a\u4f59\u5f26\uff0c\u53e5\u5b50\u8868\u5f81\u65b9\u6cd5\uff0c\u672c\u9879\u76ee\u57fa\u4e8eMindSpore\u590d\u73b0\u4e86Sentence-BERT\u6a21\u578b\u7684\u9884\u6d4b\n- [CoSENT(Cosine Sentence)](https://github.com/shibing624/text2vec/blob/master/text2vec/cosent_model.py)\uff1aCoSENT\u6a21\u578b\u63d0\u51fa\u4e86\u4e00\u79cd\u6392\u5e8f\u7684\u635f\u5931\u51fd\u6570\uff0c\u4f7f\u8bad\u7ec3\u8fc7\u7a0b\u66f4\u8d34\u8fd1\u9884\u6d4b\uff0c\u6a21\u578b\u6536\u655b\u901f\u5ea6\u548c\u6548\u679c\u6bd4Sentence-BERT\u66f4\u597d\uff0c\u672c\u9879\u76ee\u57fa\u4e8eMindSpore\u5b9e\u73b0\u4e86CoSENT\u6a21\u578b\u7684\u9884\u6d4b\n- [BGE(BAAI general embedding)](https://github.com/shibing624/text2vec/blob/master/text2vec/bge_model.py)\uff1aBGE\u672c\u9879\u76ee\u57fa\u4e8eMindSpore\u5b9e\u73b0\u4e86BGE\u6a21\u578b\u7684\u9884\u6d4b\n\n\n\u8be6\u7ec6\u6587\u672c\u5411\u91cf\u8868\u793a\u65b9\u6cd5\u89c1wiki: [\u6587\u672c\u5411\u91cf\u8868\u793a\u65b9\u6cd5](https://github.com/shibing624/text2vec/wiki/%E6%96%87%E6%9C%AC%E5%90%91%E9%87%8F%E8%A1%A8%E7%A4%BA%E6%96%B9%E6%B3%95)\n\n",
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "MindSpore Text to vector Tool, encode text",
"version": "0.0.2",
"project_urls": {
"Homepage": "https://github.com/lvyufeng/ms2vec"
},
"split_keywords": [
"word embedding",
" text2vec",
" chinese text similarity calculation tool",
" similarity",
" word2vec"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b574e5932c62673fc71b30a2dde0d3d1ab3094acb80b91b8e8a448df8c8c856e",
"md5": "9c9f529c3412660a9554aca958c3e3d9",
"sha256": "82b3ef6e8f51d11b83a3eb308c93efac1e2cdd3434aa27b3c42cffb33f16f2ed"
},
"downloads": -1,
"filename": "ms2vec-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9c9f529c3412660a9554aca958c3e3d9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7.0",
"size": 25266,
"upload_time": "2024-08-15T02:50:14",
"upload_time_iso_8601": "2024-08-15T02:50:14.998273Z",
"url": "https://files.pythonhosted.org/packages/b5/74/e5932c62673fc71b30a2dde0d3d1ab3094acb80b91b8e8a448df8c8c856e/ms2vec-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-15 02:50:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "lvyufeng",
"github_project": "ms2vec",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "ms2vec"
}