pke-zh


- Name: pke-zh
- Version: 0.2.6
- Home page: https://github.com/shibing624/pke_zh
- Summary: pke_zh, context-aware bag-of-words term weights for query and document.
- Upload time: 2023-11-09 12:22:22
- Author: XuMing
- Requires Python: >=3.5
- License: Apache 2.0
- Keywords: pke_zh, term weighting, textrank, word rank, wordweight

[**🇨🇳Chinese**](https://github.com/shibing624/pke_zh/blob/main/README.md) | [**📖Docs**](https://github.com/shibing624/pke_zh/wiki) | [**🤖Models**](https://huggingface.co/shibing624)

<div align="center">
  <a href="https://github.com/shibing624/pke_zh">
    <img src="https://github.com/shibing624/pke_zh/blob/main/docs/pke_zh.png" alt="Logo" height="156">
  </a>
</div>

-----------------

# pke_zh: Python Keyphrase Extraction for zh (Chinese)
[![PyPI version](https://badge.fury.io/py/pke_zh.svg)](https://badge.fury.io/py/pke_zh)
[![Downloads](https://static.pepy.tech/badge/pke_zh)](https://pepy.tech/project/pke_zh)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pke_zh.svg)](https://github.com/shibing624/pke_zh/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pke_zh.svg)](https://github.com/shibing624/pke_zh/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)


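pke_zh: Python Keyphrase Extraction for zh (Chinese).

**pke_zh** implements several Chinese keyphrase extraction algorithms, including the supervised WordRank and the unsupervised TextRank, TfIdf, KeyBert, PositionRank, and TopicRank. It is easy to extend and works out of the box.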


**Guide**

- [Features](#Features)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [References](#references)

## Features
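#### Supervised methods
- [x] WordRank: implemented in Python. It extracts text, statistical, tag, and language-model features from a sentence and uses a GBDT model to score the importance of each word, from which keywords are selected. Fast and accurate, but it generalizes only moderately and depends on labeled data.
#### Unsupervised methods
- Statistical algorithms
- [x] TFIDF: keyword extraction built on jieba's IDF table. It is a strong, broadly applicable baseline that handles most keyword extraction scenarios. Simple and very fast, with average quality.
- [x] YAKE: relies on hand-crafted rules (word position, word frequency, context, and how often the word appears across sentences) rather than an external corpus, and extracts keywords from a single document. Very fast, but quality is poor.
- Graph algorithms
- [x] TextRank: implemented on networkx. It applies the PageRank idea directly to keyword extraction. Quality is no better than TFIDF, and graph construction plus random-walk iteration make it slow. Average quality.
- [x] SingleRank: implemented on networkx. Similar to TextRank, it is a PageRank variant that can extract keyphrases. Fast, with average quality.
- [x] TopicRank: implemented on networkx. A topic-based method that considers the semantic relations between words in the document and extracts keywords related to the document's topics. Slow, with average quality.
- [x] MultipartiteRank: implemented on networkx. It extracts keywords from a multipartite graph; building on TopicRank, it also takes semantic relations between words and word position into account. Slow, with average quality.
- [x] PositionRank: implemented on networkx. It computes word weights from a PageRank-style graph while accounting for word position and word frequency. Moderate speed, good quality.
- Semantic models
- [x] KeyBERT: implemented on text2vec. It uses a pretrained sentence embedding model to compare the sentence embedding with each word's embedding and picks the most similar words as keywords. Very slow, but the best quality.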

- Further reading: [Approaches to Chinese keyword extraction](https://github.com/shibing624/pke_zh/blob/main/docs/solution.md)
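
To make the graph-based idea concrete, here is a minimal, standard-library sketch of TextRank-style scoring: build a co-occurrence graph over already segmented words, then run a PageRank-like iteration. It is not pke_zh's networkx-based implementation; the function name `textrank`, the `window` parameter, and the unnormalized update rule are assumptions made for illustration, and the input is assumed to be pre-segmented (for example by jieba).

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Score tokens with a PageRank-style iteration on a co-occurrence graph."""
    # Connect each token to the distinct tokens in the next `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every node in `neighbors` has at least one edge, so the division is safe.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical pre-segmented input; "电视剧" co-occurs with every other word.
    words = ["下载", "电视剧", "周恩来", "电视剧", "在线", "电视剧"]
    for word, score in textrank(words, window=1):
        print(word, round(score, 3))
```

Words that co-occur with many other words collect score from all of them, which is why the hub word ranks first in this toy input.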

**Choosing a model**
- For speed, choose TFIDF, PositionRank, or WordRank
- For quality, choose KeyBERT
- If you have labeled data, choose WordRank



## Install
* From pip:
```zsh
pip install -U pke_zh
```

* From source:
```zsh
git clone https://github.com/shibing624/pke_zh.git
cd pke_zh
python setup.py install
```
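
After installing, one way to confirm that the package is visible to the current interpreter is to query its installed version. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and `importlib.metadata` requires Python 3.8 or later, which is newer than the package's stated minimum.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pke_zh")
    print("pke_zh is not installed" if version is None else f"pke_zh {version}")
```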

## Usage

### Supervised keyphrase extraction

#### Quick prediction with pke_zh
example: [examples/keyphrase_extraction_demo.py](examples/keyphrase_extraction_demo.py)

```python
from pke_zh import WordRank
m = WordRank()
print(m.extract("哪里下载电视剧周恩来?"))
```

output:
```shell
[('电视剧', 3), ('周恩来', 3), ('下载', 2), ('哪里', 1), ('?', 0)]
```
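- Return value: a list of key phrases as (keyphrase, score) pairs, where score is 3 for core words, 2 for qualifiers, 1 for omissible words, and 0 for noise words.
- **score** has four levels:
  - Super important: level 3, mainly POI core words such as "方特" (Fantawild) or "欢乐谷" (Happy Valley)
  - Required: level 2, including administrative-region words, category words, and so on; in "北京 温泉" (Beijing hot spring), both "北京" and "温泉" are important
  - Important: level 1, including category words, tickets, and so on; in "顺景 温泉" (Shunjing hot spring), "温泉" is less important because most users who search for "顺景" already want a hot spring
  - Unimportant: level 0, including modal particles, pronouns, generic-intent words, stop words, and so on
- Model: by default the pretrained WordRank model [wordrank_model.pkl](https://github.com/shibing624/pke_zh/releases/tag/0.2.2) is used; it is downloaded automatically to `~/.cache/pke_zh/wordrank_model.pkl`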
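
Because the score is a discrete level, a common follow-up step is to keep only the terms at or above some level. The sketch below works on the sample output shown above as plain data, so it runs without the model; `keep_important` and its `min_score` parameter are hypothetical names, not part of pke_zh.

```python
# Sample WordRank output copied from the example above.
result = [("电视剧", 3), ("周恩来", 3), ("下载", 2), ("哪里", 1), ("?", 0)]


def keep_important(pairs, min_score=2):
    """Keep the terms whose importance level is at least `min_score`."""
    return [term for term, score in pairs if score >= min_score]


print(keep_important(result))               # ['电视剧', '周恩来', '下载']
print(keep_important(result, min_score=3))  # ['电视剧', '周恩来']
```
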
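#### Training a model
**WordRank model**: the input query is segmented into words, several kinds of features are extracted, and the features are fed to a classifier such as GBDT. The model scores the importance of each word, and the top K words are selected as keywords.

* Text features: query length, term length, the term's offset in the query, term part of speech, length information, number of terms, position information, syntactic dependency tag, whether the term is a number, whether it is English, whether it is a stop word, whether it is a named entity, whether it is an important industry word, embedding norm, the difference caused by deleting the term, and the term weight obtained from a phrase generation tree
* Statistical features: PMI, IDF, TextRank score, mutual information with the preceding and following words, left and right neighbor entropy, standalone search ratio (the qv of the term used alone as a query / the summed qv of all queries containing the term), statistical probability, and iqf, a variant of idf
* Language model features: the language model probability of the whole query / the language model probability of the query with the term removed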
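
The language model feature in the last bullet can be sketched as follows, computed in log space as log P(query) minus log P(query without the term). The unigram table `TOY_UNIGRAM`, its probabilities, and `UNKNOWN_PROB` are invented for illustration; the source does not say which language model pke_zh uses. With a unigram model the feature collapses to the term's own log probability, whereas an n-gram or neural model would also capture context.

```python
import math

# Hypothetical unigram probabilities; a real system would use a trained language model.
TOY_UNIGRAM = {"哪里": 0.02, "下载": 0.01, "电视剧": 0.005, "周恩来": 0.0005}
UNKNOWN_PROB = 1e-6  # fallback for terms missing from the toy table


def log_prob(terms):
    """Log probability of a term sequence under the toy unigram model."""
    return sum(math.log(TOY_UNIGRAM.get(term, UNKNOWN_PROB)) for term in terms)


def lm_ratio_features(terms):
    """For each term, return log P(query) - log P(query with that term removed)."""
    full = log_prob(terms)
    return [
        (term, full - log_prob(terms[:i] + terms[i + 1 :]))
        for i, term in enumerate(terms)
    ]


if __name__ == "__main__":
    for term, value in lm_ratio_features(["哪里", "下载", "电视剧", "周恩来"]):
        print(term, round(value, 3))
```

A more negative value means that removing the term raises the query's probability more, that is, the term is rarer and carries more information.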


Training sample format:
```shell
邪御天娇 免费 阅读,3 1 1
```
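
Judging from this single sample, each line holds space-separated terms, a comma, and one space-separated label per term. The parser below is a sketch under that assumption; `parse_sample` is a hypothetical helper, and the training demo script remains the authority on the actual loader.

```python
def parse_sample(line):
    """Split 'term term ...,label label ...' into (term, label) pairs."""
    # Split on the last comma so the label block is always the final field.
    text, labels = line.strip().rsplit(",", 1)
    terms = text.split()
    scores = [int(label) for label in labels.split()]
    if len(terms) != len(scores):
        raise ValueError(f"{len(terms)} terms but {len(scores)} labels: {line!r}")
    return list(zip(terms, scores))


print(parse_sample("邪御天娇 免费 阅读,3 1 1"))
# [('邪御天娇', 3), ('免费', 1), ('阅读', 1)]
```
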
Model architecture:

![term-weighting](https://github.com/shibing624/pke_zh/blob/main/docs/gbdt.png)

training example: [examples/train_supervised_wordrank_demo.py](examples/train_supervised_wordrank_demo.py)

### Unsupervised keyphrase extraction
Keyphrase extraction algorithms such as TextRank, TfIdf, PositionRank, and KeyBert are supported.

example: [examples/unsupervised_demo.py](examples/unsupervised_demo.py)


```python
from pke_zh import TextRank, TfIdf, SingleRank, PositionRank, TopicRank, MultipartiteRank, Yake, KeyBert
q = '哪里下载电视剧周恩来?'
TextRank_m = TextRank()
TfIdf_m = TfIdf()
PositionRank_m = PositionRank()
KeyBert_m = KeyBert()

r = TextRank_m.extract(q)
print('TextRank:', r)

r = TfIdf_m.extract(q)
print('TfIdf:', r)

r = PositionRank_m.extract(q)
print('PositionRank_m:', r)

r = KeyBert_m.extract(q)
print('KeyBert_m:', r)
```

output:
```shell
TextRank: [('电视剧', 1.00000002)]
TfIdf: [('哪里下载', 1.328307500322222), ('下载电视剧', 1.328307500322222), ('电视剧周恩来', 1.328307500322222)]
PositionRank_m: [('电视剧', 1.0)]
KeyBert_m: [('电视剧', 0.47165293)]
```
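
The KeyBert score in this output is a similarity between embeddings. The sketch below shows the ranking step alone, using toy 3-dimensional vectors in place of real sentence-model embeddings; the vectors, `cosine`, and `rank_candidates` are all invented for illustration and are not pke_zh's or text2vec's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(doc_vec, candidate_vecs, top_k=3):
    """Rank candidate terms by how similar their embedding is to the document's."""
    scored = [(term, cosine(doc_vec, vec)) for term, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy vectors standing in for embeddings from a pretrained sentence model.
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "电视剧": [0.8, 0.2, 0.3],
    "下载": [0.1, 0.9, 0.2],
    "哪里": [0.0, 0.3, 0.9],
}

if __name__ == "__main__":
    for term, score in rank_candidates(doc_vec, candidates):
        print(term, round(score, 4))
```

In a real pipeline the expensive part is computing the embeddings, which is consistent with KeyBERT being the slowest method listed above.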

### Unsupervised key sentence extraction (automatic summarization)
TextRank-based summary extraction is supported.

example: [examples/keysentences_extraction_demo.py](examples/keysentences_extraction_demo.py)


```python
from pke_zh import TextRank
m = TextRank()
r = m.extract_sentences("较早进入中国市场的星巴克,是不少小资钟情的品牌。相比 在美国的平民形象,星巴克在中国就显得“高端”得多。用料并无差别的一杯中杯美式咖啡,在美国仅约合人民币12元,国内要卖21元,相当于贵了75%。  第一财经日报")
print(r)
```

output:
```shell
[('相比在美国的平民形象', 0.13208935993025409), ('在美国仅约合人民币12元', 0.1320761453200497), ('星巴克在中国就显得“高端”得多', 0.12497451534612379), ('国内要卖21元', 0.11929080110899569) ...]
```
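
To turn the ranked sentences into a readable summary, one option is to keep the top few and put them back in document order. The sketch below uses the four entries visible in the output above, with scores rounded to five decimals; `summarize` and `document_order` are hypothetical names, and the order list is taken from the input text of the example.

```python
def summarize(ranked, document_order, top_n=2):
    """Pick the `top_n` highest-scoring sentences, then restore document order."""
    best = sorted(ranked, key=lambda item: item[1], reverse=True)[:top_n]
    chosen = {sentence for sentence, _ in best}
    return [sentence for sentence in document_order if sentence in chosen]


# The four entries shown in the sample output; the remaining ones are elided there.
ranked = [
    ("相比在美国的平民形象", 0.13209),
    ("在美国仅约合人民币12元", 0.13208),
    ("星巴克在中国就显得“高端”得多", 0.12497),
    ("国内要卖21元", 0.11929),
]
# The same sentences in the order they appear in the input text.
document_order = [
    "相比在美国的平民形象",
    "星巴克在中国就显得“高端”得多",
    "在美国仅约合人民币12元",
    "国内要卖21元",
]

print(",".join(summarize(ranked, document_order, top_n=3)))
```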

## Contact

- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/pke_zh.svg)](https://github.com/shibing624/pke_zh/issues)
- Email: xuming: xuming624@qq.com
- WeChat: add *WeChat ID: xuming624* with the note *name-company-NLP* to join the NLP discussion group.

<img src="docs/wechat.jpeg" width="200" />


## Citation

If you use pke_zh in your research, please cite it in the following format:

APA:
```latex
Xu, M. pke_zh: Python keyphrase extraction toolkit for chinese (Version 0.2.2) [Computer software]. https://github.com/shibing624/pke_zh
```

BibTeX:
```latex
@misc{pke_zh,
  author = {Xu, Ming},
  title = {pke_zh: Python keyphrase extraction toolkit for chinese},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/pke_zh}},
}
```

## License


pke_zh is licensed under [The Apache License 2.0](LICENSE) and is free for commercial use. Please include a link to pke_zh and the license in your product documentation.


## Contribute
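The project code is still rough. If you have improved it, you are welcome to contribute your changes back. Before submitting, please note two points:

 - Add corresponding unit tests under `tests`
 - Run all unit tests with `python -m pytest` and make sure they all pass

You can then submit a PR.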


## References

- [boudinfl/pke](https://github.com/boudinfl/pke)
- [Context-Aware Document Term Weighting for Ad-Hoc Search](http://www.cs.cmu.edu/~zhuyund/papers/TheWebConf_2020_Dai.pdf)
- [term weighting](https://zhuanlan.zhihu.com/p/90957854)
- [DeepCT](https://github.com/AdeDZY/DeepCT)
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/shibing624/pke_zh",
    "name": "pke-zh",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": "",
    "keywords": "pke_zh,term weighting,textrank,word rank,wordweight",
    "author": "XuMing",
    "author_email": "xuming624@qq.com",
    "download_url": "https://files.pythonhosted.org/packages/1f/d8/ca6af9403ad410c9e5c72b0d68ba9f19c59951e525d8f0fc242d62691871/pke_zh-0.2.6.tar.gz",
    "platform": "Windows",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "pke_zh, context-aware bag-of-words term weights for query and document.",
    "version": "0.2.6",
    "project_urls": {
        "Homepage": "https://github.com/shibing624/pke_zh"
    },
    "split_keywords": [
        "pke_zh",
        "term weighting",
        "textrank",
        "word rank",
        "wordweight"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1fd8ca6af9403ad410c9e5c72b0d68ba9f19c59951e525d8f0fc242d62691871",
                "md5": "9eab8d1dde055bd7ed3ffb976204c7c0",
                "sha256": "34ce3e6a8eb421afb74ed6d5cb41524bb25e9ba97c3b6c784823b821a4483b5a"
            },
            "downloads": -1,
            "filename": "pke_zh-0.2.6.tar.gz",
            "has_sig": false,
            "md5_digest": "9eab8d1dde055bd7ed3ffb976204c7c0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.5",
            "size": 5575643,
            "upload_time": "2023-11-09T12:22:22",
            "upload_time_iso_8601": "2023-11-09T12:22:22.126463Z",
            "url": "https://files.pythonhosted.org/packages/1f/d8/ca6af9403ad410c9e5c72b0d68ba9f19c59951e525d8f0fc242d62691871/pke_zh-0.2.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-09 12:22:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "shibing624",
    "github_project": "pke_zh",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "pke-zh"
}
        