pystopwords


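| Field        | Value                                                               |
|--------------|---------------------------------------------------------------------|
| Name         | pystopwords                                                         |
| Version      | 0.0.2                                                               |
| Download     | https://files.pythonhosted.org/packages/af/51/dbdfcd158e413dc574749187dd6d8bae9bfc391bf24c48de72a5c14fc653/pystopwords-0.0.2.tar.gz |
| Summary      | Python interface to a comprehensive collection of Chinese stopwords |
| Upload time  | 2022-12-02 10:22:21                                                 |
| License      | MIT License                                                         |
| Keywords     | 停用词, stopwords, 中文, chinese                                    |
| Requirements | No requirements were recorded.                                      |
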
# pystopwords

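## Introduction
A comprehensive collection of Chinese stopwords with a Python interface. You can choose among publicly available stopword lists from Baidu, Harbin Institute of Technology (HIT), the Chinese Academy of Sciences, and others.

The package currently focuses only on Chinese; multilingual support may be added in the future.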

## Installation

```shell
pip install pystopwords
```

## Usage

```python
from pystopwords import stopwords
```


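The `stopwords` function returns a `set` of stopwords and takes two parameters:

 - langs: string, the language to use; currently only Chinese (`zh`) is supported
 - source: string, the stopword source; currently supported values are
      - baidu: Baidu stopword list
      - hit: Harbin Institute of Technology (HIT) stopword list
      - ict: stopword list of the Institute of Computing Technology, Chinese Academy of Sciences
      - scu: stopword list of the Machine Intelligence Laboratory, Sichuan University
      - cn: a widely circulated Chinese stopword list of unknown origin
      - marimo: the Chinese stopwords in the Marimo multi-lingual stopwords collection
      - iso: the Chinese stopwords in Stopwords ISO
      - all: the union of all of the above

The defaults are `stopwords(langs='zh', source='all')`.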


```python
from pystopwords import stopwords
import jieba

# The default call is equivalent to:
# all_stopwords = stopwords(langs='zh', source='all')
all_stopwords = stopwords()

# A specific source can be selected instead
baidu_stopwords = stopwords(source='baidu')
hit_stopwords = stopwords(source='hit')

word_list = jieba.lcut('我想找一个简单好用的停用词典')
word_list_drop_stopwords = [word for word in word_list if word not in all_stopwords]
print(word_list_drop_stopwords)

# Stdout: ['想', '找', '简单', '好用', '停用', '词典']
```
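
The filtering step above can be wrapped in a small reusable helper. The following is a minimal sketch and not part of pystopwords: `remove_stopwords` is a hypothetical name, and the tiny `demo_stopwords` set is an assumed stand-in for the set returned by `stopwords()`, so the snippet runs without pystopwords or jieba installed.

```python
from typing import Iterable, List, Set


def remove_stopwords(tokens: Iterable[str], stopword_set: Set[str]) -> List[str]:
    """Return the tokens that are not in stopword_set, preserving order."""
    # Set membership is a constant-time lookup on average, so filtering
    # stays fast even when the stopword set is large.
    return [token for token in tokens if token not in stopword_set]


# Hypothetical stand-in for a real stopword set;
# in real use: demo_stopwords = stopwords(source='baidu')
demo_stopwords = {'我', '一个', '的'}

# Pre-segmented tokens, similar to what a tokenizer such as jieba.lcut returns.
demo_tokens = ['我', '想', '找', '一个', '简单', '好用', '的', '停用', '词典']

filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)
```

Passing the stopword set as an argument keeps the helper independent of any particular source, so the same function can be reused with `baidu`, `hit`, or `all`.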


## Sources



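| Name   | Source                                                          | Source URL                                     | Count | Notes                                                                                         |
|--------|-----------------------------------------------------------------|------------------------------------------------|-------|-----------------------------------------------------------------------------------------------|
| ict    | Institute of Computing Technology, Chinese Academy of Sciences  |                                                | 1207  | Most links found online are dead; the list has 1207 entries, not the 1208 often cited online   |
| baidu  | Baidu                                                           |                                                | 1429  |                                                                                               |
| hit    | Harbin Institute of Technology                                  |                                                |  767  |                                                                                               |
| scu    | Machine Intelligence Laboratory, Sichuan University             |                                                |  976  |                                                                                               |
| cn     | Unknown origin                                                  |                                                |  746  |                                                                                               |
| marimo | koheiw                                                          | https://github.com/koheiw/marimo               |  387  | The original file has a more fine-grained classification scheme                               |
| iso    | stopwords-iso                                                   | https://github.com/stopwords-iso/stopwords-iso |  794  | The original file supports many languages                                                     |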
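
The counts above are per source. Because the same word can appear in several lists, the `all` union can contain at most 1207 + 1429 + 767 + 976 + 746 + 387 + 794 = 6306 entries, and fewer wherever lists overlap. Below is a rough sketch of how two sources might be compared with plain set operations. `compare_sources` is a hypothetical helper, not part of pystopwords, and the toy sets are assumed stand-ins for real results such as `stopwords(source='baidu')` and `stopwords(source='hit')`.

```python
from typing import Dict, Set


def compare_sources(a: Set[str], b: Set[str]) -> Dict[str, int]:
    """Summarize how two stopword sets overlap."""
    return {
        'only_a': len(a - b),  # words found only in the first set
        'only_b': len(b - a),  # words found only in the second set
        'shared': len(a & b),  # words present in both sets
        'union': len(a | b),   # size of the combined set
    }


# Toy stand-ins for real stopword sets (assumed data, for illustration only).
toy_a = {'的', '了', '和', '是'}
toy_b = {'的', '了', '在'}

report = compare_sources(toy_a, toy_b)
print(report)
```

The union size always equals `only_a + only_b + shared`, which is why the size of a combined list is smaller than the sum of the per-source counts whenever the sources share words.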






            
