similarities


- Name: similarities
- Version: 1.2.3
- Home page: https://github.com/shibing624/similarities
- Summary: Similarities is a toolkit for computing similarity scores between texts and performing text searches.
- Upload time: 2024-09-08 07:12:03
- Maintainer: None
- Docs URL: None
- Author: XuMing
- Requires Python: >=3.6.0
- License: Apache License 2.0
- Keywords: similarities, Chinese text similarity calculation tool, similarity, word2vec
- Requirements: text2vec, jieba, loguru, transformers, Pillow, autofaiss, fire, fastapi, uvicorn, pydantic, requests, starlette, gradio
- Travis-CI: none
- Coveralls test coverage: none

[**🇨🇳Chinese**](https://github.com/shibing624/similarities/blob/main/README.md) | [**🌐English**](https://github.com/shibing624/similarities/blob/main/README_EN.md) | [**📖Docs**](https://github.com/shibing624/similarities/wiki) | [**🤖Models**](https://huggingface.co/shibing624)

<div align="center">
  <a href="https://github.com/shibing624/similarities">
    <img src="https://raw.githubusercontent.com/shibing624/similarities/main/docs/logo.png" height="150" alt="Logo">
  </a>
</div>

-----------------

# Similarities: Similarity Calculation and Semantic Search
[![PyPI version](https://badge.fury.io/py/similarities.svg)](https://badge.fury.io/py/similarities)
[![Downloads](https://static.pepy.tech/badge/similarities)](https://pepy.tech/project/similarities)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.6%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
[![Wechat Group](https://img.shields.io/badge/wechat-group-green.svg?logo=wechat)](#Contact)


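**similarities**: a toolkit for similarity calculation and semantic search that supports both text and images.

**similarities** implements a range of similarity calculation and semantic matching and retrieval algorithms for text and images. It supports text-to-text, text-to-image, and image-to-image search over datasets with hundreds of millions of items. It is written in Python 3, installs with pip, and works out of the box.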

**Guide**

- [Features](#Features)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Acknowledgements](#Acknowledgements)

## Features

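### Text similarity calculation + text search

- Semantic matching models (recommended): built on text2vec, this project implements text similarity calculation and text search with the CoSENT model
  - Supports Chinese, English, and multilingual SentenceBERT-style pretrained models
  - Supports multiple similarity measures, including Cos Similarity, Dot Product, Hamming Distance, and Euclidean Distance
  - Supports multiple text search algorithms, including SemanticSearch, Faiss, Annoy, and Hnsw
  - Supports efficient retrieval over datasets with hundreds of millions of items
  - Supports command-line text-to-vector conversion (multi-GPU), index building, batch retrieval, and service startup
- Literal matching models: this project implements Word2Vec, BM25, RankBM25, TFIDF, SimHash, Cilin (同义词词林, a Chinese synonym thesaurus), HowNet sememe matching, and other literal matching models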
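
As a rough illustration of the four similarity measures named above, here is a minimal sketch in plain Python. It is not the project's implementation: the function names and the toy vectors are assumptions made for this example, and the library itself works on model embeddings rather than hand-written lists.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    """Cosine similarity: the dot product divided by the product of the L2 norms."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy vectors: v points in the same direction as u but is twice as long.
u, v = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]
print(dot(u, v))                 # 18.0
print(cos_sim(u, v))             # 1.0, since the vectors are parallel
print(euclidean_distance(u, v))  # 3.0
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```

Cosine similarity ignores vector length, which is why `u` and `v` score 1.0 even though their Euclidean distance is 3.0. Hamming distance only makes sense for discrete codes such as hash bits.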


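### Image similarity calculation / image-text similarity calculation + image-to-image / text-to-image search
- CLIP (Contrastive Language-Image Pre-Training) model: an image-text matching model that can be used for image and text features (embeddings), similarity calculation, image-text retrieval, and zero-shot image classification. Built on PyTorch, this project implements CLIP vector representation, index building (based on AutoFaiss), batch retrieval, a backend service (based on FastAPI), and a frontend UI (based on Gradio)
  - Supports CLIP-series models such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
  - Supports Chinese-CLIP-series models such as [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)
  - Supports separate frontend and backend deployment, with a FastAPI backend service and a Gradio frontend
  - Supports efficient retrieval over datasets with hundreds of millions of items, based on Faiss, with GPU acceleration
  - Supports image-to-image, text-to-image, and vector-to-image search
  - Supports image embedding extraction and text embedding extraction
  - Supports image similarity calculation and image-text similarity calculation
  - Supports command-line image-to-vector conversion (multi-GPU), index building, batch retrieval, and service startup
- Image feature extraction: built on cv2, this project implements pHash, dHash, wHash, aHash, SIFT, and other image feature extraction algorithms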
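
To give a feel for how a difference hash (dHash) works, the sketch below computes one over a tiny grayscale image stored as nested lists. It is an illustration only, not the project's cv2-based code: `dhash_bits` and the toy images are hypothetical, and a real pipeline would first convert the image to grayscale and resize it to a small fixed size.

```python
def dhash_bits(gray):
    """Difference hash of a small grayscale image given as rows of pixel values.

    Each bit records whether a pixel is brighter than its right-hand neighbor,
    so a row of N pixels yields N - 1 bits.
    """
    return [int(left > right) for row in gray for left, right in zip(row, row[1:])]


def hamming_distance(bits_a, bits_b):
    """Number of differing bits; a smaller value means more similar images."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 3x4 "images": img_b is img_a uniformly brightened by 10,
# img_c is img_a mirrored horizontally.
img_a = [[10, 20, 15, 15], [50, 40, 30, 60], [5, 5, 9, 1]]
img_b = [[pixel + 10 for pixel in row] for row in img_a]
img_c = [row[::-1] for row in img_a]

hash_a, hash_b, hash_c = dhash_bits(img_a), dhash_bits(img_b), dhash_bits(img_c)
print(hamming_distance(hash_a, hash_b))  # 0: brightening keeps every comparison
print(hamming_distance(hash_a, hash_c))  # non-zero: mirroring changes the gradients
```

Because the hash encodes only the sign of neighboring-pixel differences, it is insensitive to uniform brightness changes but does react to structural changes such as mirroring.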

## Demo
Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search

![](https://github.com/shibing624/similarities/blob/main/docs/white_cat.png)

Text Search Demo: https://huggingface.co/spaces/shibing624/similarities

![](https://github.com/shibing624/similarities/blob/main/docs/hf_search.png)


## Install

```
pip install torch # conda install pytorch
pip install -U similarities
```

or

```
git clone https://github.com/shibing624/similarities.git
cd similarities
pip install -e .
```

## Usage

### 1. Text embedding similarity calculation

example: [examples/text_similarity_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_similarity_demo.py)


```python
from similarities import BertSimilarity
m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186
```

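- `model_name_or_path`: the model name or path. By default, the Chinese semantic matching model [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) is downloaded from the HF model hub and used. For multilingual text, switch to the [shibing624/text2vec-base-multilingual](https://huggingface.co/shibing624/text2vec-base-multilingual) model, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages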

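### 2. Text embedding search

Finds the texts most similar to a query within a candidate document set. Commonly used for similar-question matching in QA scenarios, text search, and related tasks.

#### SemanticSearch exact search: Cos Similarity + top-K clustered retrieval, suitable for datasets of up to one million items

example: [examples/text_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_semantic_search_demo.py)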
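
The idea behind exact search can be sketched as follows: score the query against every corpus vector with cosine similarity and keep the K best. This is a simplified illustration in plain Python, not the code in the linked demo. `top_k` and the 2-D vectors are assumptions for the example, where a real run would use sentence embeddings from a model.

```python
import heapq
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus_vecs, k=2):
    """Score the query against every corpus vector and keep the k best.

    Returns (corpus_index, score) pairs, highest score first.
    """
    scored = ((idx, cos_sim(query_vec, vec)) for idx, vec in enumerate(corpus_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical 2-D "embeddings" standing in for real sentence vectors.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], corpus, k=2)
print([idx for idx, _ in hits])  # [0, 2]
```

Every query touches every corpus vector, so the cost grows linearly with corpus size. That is why exhaustive search is usually reserved for smaller collections.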

#### Approximate search with Annoy, Hnswlib, and similar algorithms, suitable for datasets with millions of items

example: [examples/fast_text_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/fast_text_semantic_search_demo.py)

#### Efficient vector retrieval with Faiss, suitable for datasets with hundreds of millions of items

- Text-to-vector conversion, index building, batch retrieval, and service startup: [examples/faiss_bert_search_server_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_bert_search_server_demo.py)

- Calling from a Python client: [examples/faiss_bert_search_client_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_bert_search_client_demo.py)


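### 3. Literal text similarity calculation and text search

Supports similarity calculation and literal-matching search with Cilin (同义词词林), HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, BM25, and other algorithms. Commonly used for cold-starting text matching.

example: [examples/literal_text_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/literal_text_semantic_search_demo.py)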
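
As one example of literal matching, the sketch below scores documents against a query with Okapi BM25. It is a minimal illustration rather than the project's implementation: `bm25_scores`, the default `k1` and `b` values, the non-negative IDF variant, and the whitespace tokenization are all assumptions for this example. Chinese text would need a word segmenter such as jieba before scoring.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of the query against each tokenized document."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Longer-than-average documents are penalized through b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical corpus, tokenized on whitespace for simplicity.
docs = [
    "how to change the bank card bound to my account".split(),
    "the weather is sunny today".split(),
    "change bank card".split(),
]
scores = bm25_scores("change bank card".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 2: the shortest document containing all query terms
```

Literal scoring like this needs no trained model, which is what makes it useful for cold starts. It cannot match paraphrases that share no tokens with the query.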

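### 4. Image similarity calculation and image search

Supports image similarity calculation and matching search with CLIP, pHash, SIFT, and other algorithms. The Chinese CLIP model supports image-to-image and text-to-image search, as well as cross-modal image-text search in both Chinese and English.

example: [examples/image_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/image_semantic_search_demo.py)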

![image_sim](https://github.com/shibing624/similarities/blob/main/docs/image_sim.png)


#### Efficient vector retrieval with Faiss, suitable for datasets with hundreds of millions of items

- Image-to-vector conversion, index building, batch retrieval, and service startup: [examples/faiss_clip_search_server_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_clip_search_server_demo.py)

- Calling from a Python client: [examples/faiss_clip_search_client_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_clip_search_client_demo.py)

- Calling from a Gradio frontend: [examples/faiss_clip_search_gradio_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_clip_search_gradio_demo.py)

<img src="https://github.com/shibing624/similarities/blob/main/docs/dog-img.png"/>

### 5. Clustering

The community detection (community_detection) algorithm can cluster large-scale datasets to find clusters, i.e., groups of similar sentences.

example: [examples/text_clustering_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_clustering_demo.py)
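
One common way to do threshold-based community detection can be sketched as below: collect, for every item, the neighbors whose cosine similarity clears a threshold, then let the largest groups claim their members first. This is a simplified, quadratic-time illustration and not necessarily how the linked demo or the library does it. The function signature, parameter names, and toy vectors are assumptions.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def community_detection(vectors, threshold=0.75, min_community_size=2):
    """Group items whose similarity to a shared center item is at least threshold."""
    n = len(vectors)
    candidates = []
    for i in range(n):
        # Every item whose similarity to item i clears the threshold (including i).
        members = [j for j in range(n) if cos_sim(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)
    # Largest candidate communities claim their members first.
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        remaining = [j for j in members if j not in assigned]
        if len(remaining) >= min_community_size:
            communities.append(remaining)
            assigned.update(remaining)
    return communities


# Hypothetical 2-D "embeddings": two tight groups and one outlier.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
print(community_detection(vectors, threshold=0.9))  # [[0, 1, 2], [3, 4]]
```

Unlike k-means, this approach does not need the number of clusters in advance, and items that resemble nothing else (index 5 here) are simply left unassigned.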


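### 6. Semantic deduplication of images and text

The paraphrase mining (paraphrase_mining_embeddings) algorithm mines pairs of sentences with similar meaning from large collections of sentences or documents. It can be used to detect redundant images and text and for semantic deduplication.

- Semantic text deduplication: [examples/text_duplicates_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_duplicates_demo.py)
- Semantic image deduplication: [examples/image_duplicates_demo.py](https://github.com/shibing624/similarities/blob/main/examples/image_duplicates_demo.py)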
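
A brute-force version of pair mining and deduplication might look like the sketch below. It compares every pair of vectors, which is only practical for small collections. The helper names `mine_similar_pairs` and `deduplicate`, the threshold, and the toy vectors are hypothetical and are not taken from the linked demos.

```python
import math
from itertools import combinations


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_similar_pairs(vectors, threshold=0.9):
    """Return (score, i, j) for every pair i < j scoring at least threshold, best first."""
    pairs = [
        (cos_sim(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    return sorted((pair for pair in pairs if pair[0] >= threshold), reverse=True)


def deduplicate(vectors, threshold=0.9):
    """Keep an item only if it is not a near-duplicate of an already kept item."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cos_sim(vec, vectors[k]) < threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical 2-D "embeddings": items 0/1 and 2/3 are near-duplicates.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [0.7, 0.7]]
pairs = mine_similar_pairs(vectors)
print([(i, j) for _, i, j in pairs])  # [(2, 3), (0, 1)]
print(deduplicate(vectors))           # [0, 2, 4]
```

The greedy `deduplicate` pass keeps the first item of each near-duplicate group, so its result depends on input order.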

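### Command-line mode (CLI)

- Supports batch extraction of text and image vectors (embedding)
- Supports index building (index)
- Supports batch retrieval (filter)
- Supports service startup (server)

code: [cli.py](https://github.com/shibing624/similarities/blob/main/similarities/cli.py)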

```
> similarities -h                                    

NAME
    similarities

SYNOPSIS
    similarities COMMAND

COMMANDS
    COMMAND is one of the following:

     bert_embedding
       Compute embeddings for a list of sentences

     bert_index
       Build indexes from text embeddings using autofaiss

     bert_filter
       Entry point of bert filter, batch search index

     bert_server
       Main entry point of bert search backend, start the server

     clip_embedding
       Embedding text and image with clip model

     clip_index
       Build indexes from embeddings using autofaiss

     clip_filter
       Entry point of clip filter, batch search index

     clip_server
       Main entry point of clip search backend, start the server
```

run:

```shell
pip install similarities -U
similarities clip_embedding -h

# example
cd examples
similarities clip_embedding data/toy_clip/
```

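- `bert_embedding` and the like are subcommands: those starting with `bert` are text-related, and those starting with `clip` are image-related
- For the usage of each subcommand, see its help output, e.g. `similarities clip_embedding -h`
- In the example above, `data/toy_clip/` is the `input_dir` parameter of the `clip_embedding` command: the input file directory (required)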



## Contact

- Issue (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
- Email: xuming: xuming624@qq.com
- WeChat: add me (*WeChat ID: xuming624, note: name-company-NLP*) to join the NLP discussion group.

<img src="https://github.com/shibing624/similarities/blob/main/docs/wechat.jpeg" width="200" />

## Citation

If you use similarities in your research, please cite it in the following format:

APA:

```
Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities
```

BibTeX:

```
@misc{Xu_Similarities_Compute_similarity,
  title={Similarities: similarity calculation and semantic search toolkit},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/similarities}},
}
```

## License

Licensed under [The Apache License 2.0](/LICENSE), and free for commercial use. Please include a link to similarities and the license in your product documentation.

## Contribute

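The project code is still rough. If you have improved it, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:

- Add corresponding unit tests in `tests`
- Run all unit tests with `python -m pytest` and make sure they all pass

You can then submit a PR.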

## Acknowledgements 

- [A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]](https://openreview.net/forum?id=SyK00v5xx)
- [https://github.com/liuhuanyong/SentenceSimilarity](https://github.com/liuhuanyong/SentenceSimilarity)
- [https://github.com/qwertyforce/image_search](https://github.com/qwertyforce/image_search)
- [ImageHash - Official Github repository](https://github.com/JohannesBuchner/imagehash)
- [https://github.com/openai/CLIP](https://github.com/openai/CLIP)
- [https://github.com/OFA-Sys/Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)
- [https://github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [https://github.com/rom1504/clip-retrieval](https://github.com/rom1504/clip-retrieval)

Thanks for their great work!

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/shibing624/similarities",
    "name": "similarities",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6.0",
    "maintainer_email": null,
    "keywords": "similarities, Chinese Text Similarity Calculation Tool, similarity, word2vec",
    "author": "XuMing",
    "author_email": "xuming624@qq.com",
    "download_url": "https://files.pythonhosted.org/packages/b6/d4/545731d7a430e231aab32634638c443011e674e794d92e9db15fe6eb1070/similarities-1.2.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Similarities is a toolkit for compute similarity scores between texts, performing text searches.",
    "version": "1.2.3",
    "project_urls": {
        "Homepage": "https://github.com/shibing624/similarities"
    },
    "split_keywords": [
        "similarities",
        " chinese text similarity calculation tool",
        " similarity",
        " word2vec"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b6d4545731d7a430e231aab32634638c443011e674e794d92e9db15fe6eb1070",
                "md5": "8c2d3f0695ef3d69582cf1030f2c0036",
                "sha256": "0450df8d45498a428291808c3a751dcacf8b056e8944401dd0c4715434094d32"
            },
            "downloads": -1,
            "filename": "similarities-1.2.3.tar.gz",
            "has_sig": false,
            "md5_digest": "8c2d3f0695ef3d69582cf1030f2c0036",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6.0",
            "size": 1277927,
            "upload_time": "2024-09-08T07:12:03",
            "upload_time_iso_8601": "2024-09-08T07:12:03.921052Z",
            "url": "https://files.pythonhosted.org/packages/b6/d4/545731d7a430e231aab32634638c443011e674e794d92e9db15fe6eb1070/similarities-1.2.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-08 07:12:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "shibing624",
    "github_project": "similarities",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "text2vec",
            "specs": [
                [
                    ">=",
                    "1.2.8"
                ]
            ]
        },
        {
            "name": "jieba",
            "specs": [
                [
                    ">=",
                    "0.39"
                ]
            ]
        },
        {
            "name": "loguru",
            "specs": []
        },
        {
            "name": "transformers",
            "specs": []
        },
        {
            "name": "Pillow",
            "specs": []
        },
        {
            "name": "autofaiss",
            "specs": []
        },
        {
            "name": "fire",
            "specs": []
        },
        {
            "name": "fastapi",
            "specs": []
        },
        {
            "name": "uvicorn",
            "specs": []
        },
        {
            "name": "pydantic",
            "specs": []
        },
        {
            "name": "requests",
            "specs": []
        },
        {
            "name": "starlette",
            "specs": []
        },
        {
            "name": "gradio",
            "specs": []
        }
    ],
    "lcname": "similarities"
}
        