Self-collected training data, dictionaries, and stopwords, bundled into a package so nobody has to reinvent the wheel.
## Usage
Install: `pip install NCHU_nlptoolkit`
1. Remove stopwords and segment the text (a combined usage sketch follows the demos below)
P.S. Removing stopwords also loads the lab dictionary automatically.
```
from NCHU_nlptoolkit.cut import *
# minword is the minimum token length in characters (shortest segment to keep)

# default: return segmented tokens only
cut_sentence(input_string, flag=False, minword=1)

# flag=True: return segmentation with part-of-speech tags
cut_sentence(input_string, flag=True, minword=1)
```
2. Load the law dictionary (see the combined sketch after the demos)
```
from NCHU_nlptoolkit.cut import *
load_law_dict()
```
3. demo:
* zh:
```
>>> doc = '首先,對區塊鏈需要的第一個理解是,它是一種「將資料寫錄的技術」。'
>>> cut_sentence(doc, flag=True)
[('區塊鏈', 'n'), ('需要', 'n'), ('第一個', 'm'), ('理解', 'n'), ('一種', 'm'), ('資料', 'n'), ('寫錄', 'v'), ('技術', 'n')]
```
* en:
```
>>> doc = 'The City of New York, often called New York City (NYC) or simply New York, is the most populous city in the United States.'
>>> list(cut_sentence_en(doc))
['City', 'New York', 'called', 'New York City', 'NYC', 'simply', 'New York', 'populous', 'city', 'United States']
>>> list(cut_sentence_en(doc, flag=True))
[('City', 'NNP'), ('New York', 'NNP/NNP'), ('called', 'VBN'), ('New York City', 'NNP/NNP/NNP'), ('NYC', 'NN'), ('simply', 'RB'), ('New York', 'NNP/NNP'), ('populous', 'JJ'), ('city', 'NN'), ('United States', 'NNP/NNS')]
```
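
Putting steps 1 and 2 together, the sketch below loads the law dictionary and then segments a sentence with a larger `minword` so single-character tokens are dropped. The sample sentence is illustrative only; the calls use just the `load_law_dict()` and `cut_sentence()` APIs shown above.

```
from NCHU_nlptoolkit.cut import *

# step 2: add the legal terms to the segmentation dictionary
load_law_dict()

# illustrative input only; any Chinese text works the same way
doc = '本案被告涉嫌違反著作權法,經檢察官提起公訴。'

# step 1: keep only tokens of at least two characters, without POS tags
print(cut_sentence(doc, flag=False, minword=2))

# same call, but returning (token, POS tag) pairs
print(cut_sentence(doc, flag=True, minword=2))
```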
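
Because `flag=True` yields `(token, tag)` pairs, a common follow-up is filtering by tag. The snippet below keeps only the nouns from the Chinese demo above, assuming the jieba-style tag `'n'` for nouns as shown in that output.

```
from NCHU_nlptoolkit.cut import *

doc = '首先,對區塊鏈需要的第一個理解是,它是一種「將資料寫錄的技術」。'

# keep tokens whose tag starts with 'n' (nouns), following the tags in the demo above
nouns = [token for token, tag in cut_sentence(doc, flag=True) if tag.startswith('n')]
print(nouns)  # per the demo output: ['區塊鏈', '需要', '理解', '資料', '技術']
```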