Name | pystopwords |
Version | 0.0.2 |
home_page | |
Summary | Python interface to a comprehensive collection of Chinese stopwords |
upload_time | 2022-12-02 10:22:21 |
maintainer | |
docs_url | None |
author | |
requires_python | |
license | MIT License |
keywords | 停用词, stopwords, 中文, chinese |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# pystopwords
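## Introduction
A comprehensive collection of Chinese stopwords with a Python interface. You can choose among publicly available stopword lists from Baidu, Harbin Institute of Technology (HIT), the Chinese Academy of Sciences, and others.
The package currently focuses on Chinese only; multi-language support may be added in the future.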
## Installation
```shell
pip install pystopwords
```
## Usage
```python
from pystopwords import stopwords
```
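The `stopwords` function returns a `set` of stopwords and takes two parameters:
- langs: string, the language; currently only Chinese (zh) is supported
- source: string, the stopword source; currently supported values:
  - baidu: Baidu stopword list
  - hit: Harbin Institute of Technology (HIT) stopword list
  - ict: stopword list of the Institute of Computing Technology, Chinese Academy of Sciences
  - scu: stopword list of the Machine Intelligence Laboratory, Sichuan University
  - cn: a widely circulated Chinese stopword list of unknown origin
  - marimo: the Chinese stopwords in the Marimo multi-lingual stopwords collection
  - iso: the Chinese stopwords in Stopwords ISO
  - all: the union of all of the above

The default call is `stopwords(langs='zh', source='all')`.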
```python
from pystopwords import stopwords
import jieba
# The default arguments are:
# all_stopwords = stopwords(langs='zh', source='all')
all_stopwords = stopwords()
# A different source can be selected
baidu_stopwords = stopwords(source='baidu')
hit_stopwords = stopwords(source='hit')
word_list = jieba.lcut('我想找一个简单好用的停用词典')
word_list_drop_stopwords = [word for word in word_list if word not in all_stopwords]
print(word_list_drop_stopwords)
# Stdout: ['想', '找', '简单', '好用', '停用', '词典']
```
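The filtering step in the example above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of pystopwords: `drop_stopwords`, `demo_stopwords`, and `demo_tokens` are hypothetical names, and the tiny hand-written set and token list stand in for the results of `stopwords(...)` and `jieba.lcut(...)` so that the block runs without either package installed.

```python
from typing import Iterable, List, Set


def drop_stopwords(tokens: Iterable[str], stopword_set: Set[str]) -> List[str]:
    """Return the tokens that are not in stopword_set, preserving their order."""
    return [token for token in tokens if token not in stopword_set]


# Stand-in data (assumption): in practice the set would come from
# stopwords(source=...) and the tokens from jieba.lcut(text).
demo_stopwords = {'我', '一个', '的'}
demo_tokens = ['我', '想', '找', '一个', '简单', '好用', '的', '停用', '词典']

filtered = drop_stopwords(demo_tokens, demo_stopwords)
print(filtered)
```

Because membership tests on a `set` take constant time on average, the helper stays cheap even when the stopword set holds thousands of entries.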
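## Sources
| Name   | Source                                                         | Source URL                                     | Count | Notes                                                                                                   |
|--------|----------------------------------------------------------------|------------------------------------------------|-------|---------------------------------------------------------------------------------------------------------|
| ict    | Institute of Computing Technology, Chinese Academy of Sciences |                                                | 1207  | Most links circulating online are dead; the list contains 1207 entries, not the 1208 often cited online |
| baidu  | Baidu                                                          |                                                | 1429  |                                                                                                         |
| hit    | Harbin Institute of Technology                                 |                                                | 767   |                                                                                                         |
| scu    | Machine Intelligence Laboratory, Sichuan University            |                                                | 976   |                                                                                                         |
| cn     | Unknown                                                        |                                                | 746   |                                                                                                         |
| marimo | koheiw                                                         | https://github.com/koheiw/marimo               | 387   | The original file has a more fine-grained classification scheme                                         |
| iso    | stopwords-iso                                                  | https://github.com/stopwords-iso/stopwords-iso | 794   | The original file supports many languages                                                               |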
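Since the `all` source is described as a union, entries shared between lists are counted only once, so its size can be smaller than the sum of the per-source counts. The sketch below shows one way to inspect sizes and pairwise overlap for any collection of stopword sets. `summarize_sources` and the `list_a`/`list_b`/`list_c` data are hypothetical stand-ins; with pystopwords installed, the dictionary could presumably be built as `{name: stopwords(source=name) for name in ('baidu', 'hit', 'ict')}`.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def summarize_sources(
    lists: Dict[str, Set[str]],
) -> Tuple[Dict[str, int], int, Dict[Tuple[str, str], int]]:
    """Return per-list sizes, the size of the union, and pairwise overlap counts."""
    sizes = {name: len(words) for name, words in lists.items()}
    union_size = len(set().union(*lists.values()))
    overlaps = {
        (a, b): len(lists[a] & lists[b])
        for a, b in combinations(sorted(lists), 2)
    }
    return sizes, union_size, overlaps


# Stand-in data (assumption): three tiny sets instead of real stopword lists.
demo_lists = {
    'list_a': {'的', '了', '和'},
    'list_b': {'的', '是'},
    'list_c': {'了', '是', '在'},
}

sizes, union_size, overlaps = summarize_sources(demo_lists)
print(sizes)
print(union_size)
print(overlaps)
```

In this toy data the three lists hold 8 entries in total but only 5 distinct words, which is the same effect that makes a union of real lists smaller than the sum of their counts.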
Raw data
{
"_id": null,
"home_page": "",
"name": "pystopwords",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "\u505c\u7528\u8bcd stopwords \u4e2d\u6587 chinese",
"author": "",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/af/51/dbdfcd158e413dc574749187dd6d8bae9bfc391bf24c48de72a5c14fc653/pystopwords-0.0.2.tar.gz",
"platform": null,
"description": "# pystopwords\n\n## \u7b80\u4ecb\n\u4e2d\u6587\u505c\u7528\u8bcd\u5927\u5168\uff0c\u652f\u6301Python\u63a5\u53e3, \u53ef\u9009\u62e9\u767e\u5ea6\uff0c\u54c8\u5de5\u5927\uff0c\u4e2d\u79d1\u9662\u7b49\u516c\u5f00\u505c\u7528\u8bcd\u5178\u3002\n\n\u76ee\u524d\u53ea\u4e13\u6ce8\u4e8e\u4e2d\u6587\uff0c\u672a\u6765\u8003\u8651\u52a0\u5165\u591a\u8bed\u8a00\u652f\u6301\u3002\n\n## \u5b89\u88c5\n\n```shell\npip install pystopwords\n```\n\n## \u4f7f\u7528\u65b9\u6cd5\n\n```python\nfrom pystopwords import stopwords\n```\n\n\nstopwords\u51fd\u6570\u8fd4\u56de\u4e00\u4e2a\u505c\u7528\u8bcdset\uff0c\u6709\u4e24\u4e2a\u53c2\u6570\uff1a\n\n - langs: string\uff0c\u652f\u6301\u7684\u8bed\u8a00\uff0c\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u6587(zh)\n - source: string, \u505c\u7528\u8bcd\u6765\u6e90\uff0c\u76ee\u524d\u652f\u6301\n - baidu: \u767e\u5ea6\u505c\u7528\u8bcd\u8868\n - hit: \u54c8\u5de5\u5927\u505c\u7528\u8bcd\u8868\n - ict: \u4e2d\u79d1\u9662\u8ba1\u7b97\u6240\u505c\u7528\u8bcd\u8868\n - scu: \u56db\u5ddd\u5927\u5b66\u673a\u5668\u667a\u80fd\u5b9e\u9a8c\u5ba4\u505c\u7528\u8bcd\u5e93\n - cn: \u5e7f\u4e3a\u6d41\u4f20\u672a\u77e5\u6765\u6e90\u7684\u4e2d\u6587\u505c\u7528\u8bcd\u8868\n - marimo: Marimo multi-lingual stopwords collection \u5185\u7684\u4e2d\u6587\u505c\u7528\u8bcd\n - iso: Stopwords ISO \u5185\u7684\u4e2d\u6587\u505c\u7528\u8bcd\n - all: \u4e0a\u8ff0\u6240\u6709\u505c\u7528\u8bcd\u5e76\u96c6\n\n\u9ed8\u8ba4\u53c2\u6570\u662f`stopwords(langs='zh', source='all')`\n\n\n```python\nfrom pystopwords import stopwords\nimport jieba\n\n# \u9ed8\u8ba4\u7684\u53c2\u6570\u4e3a\uff1a\n# all_stopwords = stopwords(langs='zh', source='all')\nall_stopwords = stopwords()\n\n# \u53ef\u4ee5\u9009\u62e9\u4e0d\u540c\u7684\u6765\u6e90\nbaidu_stopwords = stopwords(source='baidu')\nhit_stopwords = stopwords(source='hit')\n\nword_list = jieba.lcut('\u6211\u60f3\u627e\u4e00\u4e2a\u7b80\u5355\u597d\u7528\u7684\u505c\u7528\u8bcd\u5178')\nword_list_drop_stopwords = [word for 
word in word_list if word not in all_stopwords]\nprint(word_list_drop_stopwords)\n\n# Stdout: ['\u60f3', '\u627e', '\u7b80\u5355', '\u597d\u7528', '\u505c\u7528', '\u8bcd\u5178']\n```\n\n\n## \u6765\u6e90\u8bf4\u660e\n\n\n\n| \u540d\u79f0 | \u6765\u6e90 | \u6765\u6e90url | \u4e2a\u6570 | \u5907\u6ce8 |\n|--------|------------------------|------------------------------------------------|------|------------------------------------------------------------|\n| ict | \u4e2d\u79d1\u9662\u8ba1\u7b97\u6240 | | 1207 | \u7f51\u7edc\u4e0a\u5927\u90e8\u5206\u5f88\u591a\u94fe\u63a5\u5931\u6548\uff0c\u800c\u4e14\u4e00\u51711207\u4e2a\uff0c\u4e0d\u662f\u7f51\u4f20\u76841208\u4e2a |\n| baidu | \u767e\u5ea6 | | 1429 | |\n| hit | \u54c8\u5de5\u5927 | | 767 | |\n| scu | \u56db\u5ddd\u5927\u5b66\u673a\u5668\u667a\u80fd\u5b9e\u9a8c\u5ba4 | | 976 | |\n| cn | \u672a\u77e5\u6765\u6e90 | | 746 | |\n| marimo | koheiw | https://github.com/koheiw/marimo | 387 | \u539f\u59cb\u6587\u4ef6\u6709\u66f4\u7ec6\u81f4\u7684\u5206\u7c7b\u4f53\u7cfb |\n| iso | stopwords-iso | https://github.com/stopwords-iso/stopwords-iso | 794 | \u539f\u59cb\u6587\u4ef6\u652f\u6301\u5f88\u591a\u8bed\u8a00 |\n\n\n\n\n\n",
"bugtrack_url": null,
"license": "MIT License",
"summary": "\u4e2d\u6587\u505c\u7528\u8bcd\u5927\u5168Python\u63a5\u53e3",
"version": "0.0.2",
"split_keywords": [
"\u505c\u7528\u8bcd",
"stopwords",
"\u4e2d\u6587",
"chinese"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "ab55da4947cdbfe72156ed64de396ef6",
"sha256": "454c5f49bb6a5efdb921fa57447f4cfec7e3d7c439fc1e7f0726321c62b9d8d7"
},
"downloads": -1,
"filename": "pystopwords-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ab55da4947cdbfe72156ed64de396ef6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 37888,
"upload_time": "2022-12-02T10:22:16",
"upload_time_iso_8601": "2022-12-02T10:22:16.638089Z",
"url": "https://files.pythonhosted.org/packages/32/46/74aa49737e9b0be37141ad377f71f4251b4ba499f2a65ed2ae069f9296e3/pystopwords-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "b34ac2b46d3568a81264436f6285127c",
"sha256": "61497f4c70a85f35ae4d6d4e46911c0095b984bed566bcc7ae8b2d72f04724c7"
},
"downloads": -1,
"filename": "pystopwords-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "b34ac2b46d3568a81264436f6285127c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 118544,
"upload_time": "2022-12-02T10:22:21",
"upload_time_iso_8601": "2022-12-02T10:22:21.390320Z",
"url": "https://files.pythonhosted.org/packages/af/51/dbdfcd158e413dc574749187dd6d8bae9bfc391bf24c48de72a5c14fc653/pystopwords-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-02 10:22:21",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "pystopwords"
}