# pnlp

- Name: pnlp
- Version: 0.4.15
- Home page: https://github.com/hscspring/pnlp
- Summary: A pre/post-processing tool for NLP.
- Author: Yam
- Upload time: 2024-09-02 06:50:10
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [Text IO](#text-io)
    - [IO Processing](#io-processing)
    - [Built-in Methods](#built-in-methods)
  - [Text Processing](#text-processing)
    - [Cleaning and Extraction](#cleaning-and-extraction)
    - [Built-in Regexes](#built-in-regexes)
  - [Text Splitting](#text-splitting)
    - [Cutting into Arbitrary Parts](#cutting-into-arbitrary-parts)
    - [Sentence Splitting](#sentence-splitting)
    - [Sub-sentence Splitting with Threshold Merging](#sub-sentence-splitting-with-threshold-merging)
    - [Chinese Character Splitting](#chinese-character-splitting)
    - [Sentence Grouping](#sentence-grouping)
  - [Text Augmentation](#text-augmentation)
    - [Token Level](#token-level)
    - [Sentence Level](#sentence-level)
  - [Text Normalization](#text-normalization)
    - [Chinese Numbers](#chinese-numbers)
  - [Format Conversion](#format-conversion)
    - [BIO to Entities](#bio-to-entities)
    - [Arbitrary Arguments to UUID](#arbitrary-arguments-to-uuid)
  - [Built-in Dictionaries](#built-in-dictionaries)
    - [Stopwords](#stopwords)
  - [Text Length](#text-length)
  - [Magic Methods](#magic-methods)
  - [Parallel Processing](#parallel-processing)
- [Testing](#testing)
- [Changelog](#changelog)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

A pre/post-processing tool for NLP.

## Features

- A flexible pipeline designed for text IO
- Flexible text cleaning/extraction tools
- Text augmentation
- Text splitting by sentence or by Chinese character
- Text bucketing
- Chinese character normalization
- Various text length calculations
- Common Chinese and English stopwords
- Magic methods for preprocessing
- Concurrency, batching, and entity BIO-to-entity conversion

## Installation

Requires Python 3.7+.

`pip install pnlp`

## Usage

### Text IO

#### IO Processing

```bash
tree tests/piop_data/
├── a.md
├── b.txt
├── c.data
├── first
│   ├── fa.md
│   ├── fb.txt
│   ├── fc.data
│   └── second
│       ├── sa.md
│       ├── sb.txt
│       └── sc.data
├── json.json
├── outfile.file
├── outjson.json
└── yml.yml
```

```python
import os
from pnlp import Reader

DATA_PATH = "./pnlp/tests/piop_data/"
pattern = '*.md'  # can also be '*.txt', 'f*.*', etc.; regex is supported
reader = Reader(pattern, use_regex=True)

# Get the lines of all matched files, yielding line text, line index (lid), and filename
for line in reader(DATA_PATH):
    print(line.lid, line.fname, line.text)
"""
0 a.md line 1 in a.
1 a.md line 2 in a.
2 a.md line 3 in a.
0 fa.md line 1 in fa.
1 fa.md line 2 in fa
...
"""

# Get all lines of a given file; since a filename is specified, the pattern is ignored
for line in reader(os.path.join(DATA_PATH, "a.md")):
    print(line.lid, line.fname, line.text)
"""
0 a.md line 1 in a.
1 a.md line 2 in a.
2 a.md line 3 in a.
"""

# Get the paths of all matched files under a directory
for path in Reader.gen_files(DATA_PATH, pattern, use_regex=True):
    print(path)
"""
pnlp/tests/piop_data/a.md
pnlp/tests/piop_data/first/fa.md
pnlp/tests/piop_data/first/second/sa.md
"""

# Get the names and contents of all matched files under a directory
paths = Reader.gen_files(DATA_PATH, pattern)
articles = Reader.gen_articles(paths)
for article in articles:
    print(article.fname)
    print(article.f.read())
"""
a.md
line 1 in a.
line 2 in a.
line 3 in a.
...
"""

# Combines the previous two examples
paths = Reader.gen_files(DATA_PATH, pattern)
articles = Reader.gen_articles(paths)
for line in Reader.gen_flines(articles, strip="\n"):
    print(line.lid, line.fname, line.text)
```

#### Built-in Methods

```python
import pnlp

# Read
file_string = pnlp.read_file(file_path)
file_list = pnlp.read_lines(file_path)
file_json = pnlp.read_json(file_path)
file_yaml = pnlp.read_yaml(file_path)
file_csv = pnlp.read_csv(file_path)
file_pickle = pnlp.read_pickle(file_path)
list_dict = pnlp.read_file_to_list_dict(file_path)

# Write
pnlp.write_json(file_path, data, indent=2)
pnlp.write_file(file_path, data)
pnlp.write_pickle(file_path, data)
pnlp.write_list_dict_to_file(file_path, data)

# Others
pnlp.check_dir(dirname) # creates the directory if it does not exist
```
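
For instance, a minimal write-then-read round trip using only the calls above (the `out/demo.json` path is a hypothetical example):

```python
import pnlp

data = {"name": "pnlp", "tags": ["nlp", "preprocess"]}
pnlp.check_dir("out")  # creates ./out if it does not exist
pnlp.write_json("out/demo.json", data, indent=2)
assert pnlp.read_json("out/demo.json") == data
```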

### Text Processing

#### Cleaning and Extraction

```python
import re
from pnlp import Text

text = "这是https://www.yam.gift长度测试,《 》*)FSJfdsjf😁![](http://xx.jpg)。233."
pattern = re.compile(r'\d+')

# pattern is a re.Pattern or a str
# The default is the empty string '', meaning no pattern is used
# (internally re.compile(r'.+')); in that case clean returns an empty
# string (everything is removed) and extract returns the original text.
# The following string patterns (each backed by a regex) are supported:
#	'chi': Chinese characters
#	'pun': punctuation
#	'whi': whitespace
#	'nwh': non-whitespace
#	'wnb': word (incl. Chinese characters) or number
#	'nwn': non-word (incl. Chinese characters) and non-number
#	'eng': English characters
#	'num': numbers
#	'pic': pictures
#	'lnk': links
#	'emj': emojis

pt = Text(['chi', pattern])

# Extract all texts matching the patterns along with their positions
res = pt.extract(text)
print(res)
"""
{'text': '这是长度测试233', 'mats': ['这是', '长度测试', '233'], 'locs': [(0, 2), (22, 26), (60, 63)]}
"""
# Keys can also be accessed via dot notation
print(res.text, res.mats, res.locs)
"""
'这是长度测试' ['这是', '长度测试'] [(0, 2), (22, 26)]
"""

# Return the text cleaned by the given patterns
print(pt.clean(text))
"""
https://www.yam.gift,《 》*)FSJfdsjf😁![](http://xx.jpg)。233.
"""

# Multiple patterns can be given; note that their order may affect the result
pt = Text(['pic', 'lnk'])
# What gets extracted
res = pt.extract(text)
print(res.mats)
"""
['https://www.yam.gif',
 '![](http://xx.jpg)',
 'https://www.yam.gift',
 'http://xx.jpg']
"""
# What remains after cleaning
print(pt.clean(text))
"""
这是t长度测试,《 》*)FSJfdsjf😁。233.
"""
```

#### Built-in Regexes

```python
# USE Regex
from pnlp import reg
def clean_text(text: str) -> str:
    text = reg.pwhi.sub("", text)
    text = reg.pemj.sub("", text)
    text = reg.ppic.sub("", text)
    text = reg.plnk.sub("", text)
    return text
```
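
Applied to the sample text from the earlier extraction example, for instance (the exact output depends on pnlp's internal regexes, so it is elided here):

```python
text = "这是https://www.yam.gift长度测试,《 》*)FSJfdsjf😁![](http://xx.jpg)。233."
print(clean_text(text))  # whitespace, emojis, image markup, and links removed
```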

### Text Splitting

#### Cutting into Arbitrary Parts

```python
# Cut by regex
import re
from pnlp import cut_part, psent

text = "你好!欢迎使用。"
sent_list = cut_part(text, psent, with_spliter=True, with_offset=False)
print(sent_list)
"""
['你好!', '欢迎使用。']
"""
pcustom_sent = re.compile(r'[。!]')
sent_list = cut_part(text, pcustom_sent, with_spliter=False, with_offset=False)
print(sent_list)
"""
['你好', '欢迎使用']
"""
sent_list = cut_part(text, pcustom_sent, with_spliter=False, with_offset=True)
print(sent_list)
"""
[('你好', 0, 3), ('欢迎使用', 3, 8)]
"""
```

#### Sentence Splitting

```python
# Cut Sentence
from pnlp import cut_sentence as pcs
text = "你好!欢迎使用。"
sent_list = pcs(text)
print(sent_list)
"""
['你好!', '欢迎使用。']
"""
```

#### Sub-sentence Splitting with Threshold Merging

```python
from pnlp import cut_sub_sentence as pcss
text = "你好!你好。你好?你坏~欢迎使用。"
sent_list = pcss(text)
print(sent_list)
"""
['你好!', '你好。', '你好?', '你坏~', '欢迎使用。']
"""
sent_list = pcss(text, 6)
print(sent_list)
"""
['你好!你好。', '你好?你坏~', '欢迎使用。']
"""
sent_list = pcss(text, 12)
print(sent_list)
"""
['你好!你好。你好?你坏~', '欢迎使用。']
"""
```

This feature is very useful in many scenarios ;) if you know, you know :D

#### Chinese Character Splitting

```python
# Cut by Chinese character
from pnlp import cut_zhchar
text = "你好,hello, 520 i love u. = ”我爱你“。"
char_list = cut_zhchar(text)
print(char_list)
"""
['你', '好', ',', 'hello', ',', ' ', '520', ' ', 'i', ' ', 'love', ' ', 'u', '.', ' ', '=', ' ', '”', '我', '爱', '你', '“', '。']
"""
char_list = cut_zhchar(text, remove_blank=True)
print(char_list)
"""
['你', '好', ',', 'hello', ',', '520', 'i', 'love', 'u', '.', '=', '”', '我', '爱', '你', '“', '。']
"""
```

#### Sentence Grouping

```python
from pnlp import combine_bucket
parts = [
    "先生,那夜,我因胸中纳闷,无法入睡,",
    "折腾得比那铐了脚镣的叛变水手还更难过;",
    "那时,我就冲动的 ——",
    "好在有那一时之念,",
    "因为有时我们在无意中所做的事能够圆满……"
]
buckets = combine_bucket(parts.copy(), 10, truncate=True, keep_remain=True)
print(buckets)
"""
['先生,那夜,我因胸中',
 '纳闷,无法入睡,',
 '折腾得比那铐了脚镣的',
 '叛变水手还更难过;',
 '那时,我就冲动的 —',
 '—',
 '好在有那一时之念,',
 '因为有时我们在无意中',
 '所做的事能够圆满……']
"""
```

### Text Augmentation

The samplers support delete, swap, and insert operations; none of the operations crosses punctuation.

#### Token Level

- Default tokenizers:
    - Chinese: character-level tokenizer (see above)
    - English: whitespace tokenizer
- Any tokenizer can be supplied, but its output should be either a list of tokens or a list of tuples, each tuple containing a token and a POS tag (see the sketch after the next code block).
- Character-level augmentation does not operate on every token by default; the tokens or POS tags to operate on can be customized.
    - The default tokens are stopwords.
    - The default POS tags (when the tokenizer outputs POS tags) are function words: adverbs, prepositions, conjunctions, particles, and other function words (tagged d p c u xc).

```python
# The changed parts are wrapped in 【】
text = "人为什么活着?生而为人必须要有梦想!还要有尽可能多的精神体验。"
# Character granularity
import jieba  # used by the custom-tokenizer example below
from pnlp import TokenLevelSampler
tls = TokenLevelSampler()
tls.make_samples(text)
"""
{'delete': '人为什么活着?生而为人必须要【有】梦想!还要有尽可能多的精神体验。',
 'swap': '【为】【人】什么活着?生而为人必须要有梦想!还要有尽可能多的精神体验。',
 'insert': '人为什么活着?生而为人必须要有梦想!【还】还要有尽可能多的精神体验。',
 'together': '人什么着着活?生而必为为须要有梦想!还要有尽可能多的精神体验。'}
"""
# A custom tokenizer is also supported
tls.make_samples(text, jieba.lcut)
"""
{'delete': '人为什么活着?生而为人【必须】要有梦想!还要有尽可能多的精神体验。',
 'swap': '【为什么】【人】活着?生而为人必须要有梦想!还要有尽可能多的精神体验。',
 'insert': '人为什么活着?生而为人必须要有梦想!【还要】还要有尽可能多的精神体验。',
 'together': '人为什么活着?生而为人人要有梦想!还要有多尽可能的精神体验。'}
"""
# Customization
tls = TokenLevelSampler(
    rate=0.05,  # replacement ratio, defaults to 5%
    types=["delete", "swap", "insert"],  # defaults to all three
    sample_words=["word1", "word2"],  # defaults to stopwords
    sample_pos=["pos1", "pos2"],  # defaults to function-word POS tags
)
```
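
As noted in the bullet list above, the tokenizer may also emit (token, POS) tuples. A minimal sketch based on `jieba.posseg` (the `pos_tokenize` helper is hypothetical, not part of pnlp):

```python
import jieba.posseg as pseg

def pos_tokenize(text):
    # jieba.posseg yields pairs with .word and .flag (the POS tag)
    return [(pair.word, pair.flag) for pair in pseg.cut(text)]

# With POS tags available, sampling targets function words (d p c u xc) by default
tls.make_samples(text, pos_tokenize)
```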

#### Sentence Level

```python
from pnlp import SentenceLevelSampler
sls = SentenceLevelSampler()
sls.make_samples(text)  # text is reused from the Token Level example above
"""
{'delete': '生而为人必须要有梦想!还要有尽可能多的精神体验。',
 'swap': '人为什么活着?还要有尽可能多的精神体验。生而为人必须要有梦想!',
 'insert': '人为什么活着?还要有尽可能多的精神体验。生而为人必须要有梦想!生而为人必须要有梦想!',
 'together': '生而为人必须要有梦想!人为什么活着?人为什么活着?'}
"""
# Customization
sls = SentenceLevelSampler(types=["delete", "swap", "insert"])  # defaults to all three
```

### Text Normalization

#### Chinese Numbers

```python
from pnlp import num_norm
num_norm.num2zh(1024) == "一千零二十四"
num_norm.num2zh(1024).to_money() == "壹仟零贰拾肆"
num_norm.zh2num("一千零二十四") == 1024
```
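
The two conversions should invert each other for ordinary positive integers; a quick round-trip check using only the calls shown above:

```python
from pnlp import num_norm

for n in (15, 1024, 100000):
    assert num_norm.zh2num(num_norm.num2zh(n)) == n
```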

### Format Conversion

#### BIO to Entities

```python
# Convert entity BIO tokens to entities
from pnlp import pick_entity_from_bio_labels
pairs = [('天', 'B-LOC'), ('安', 'I-LOC'), ('门', 'I-LOC'), ('有', 'O'), ('毛', 'B-PER'), ('主', 'I-PER'), ('席', 'I-PER')]
pick_entity_from_bio_labels(pairs)
"""
[('天安门', 'LOC'), ('毛主席', 'PER')]
"""
pick_entity_from_bio_labels(pairs, with_offset=True)
"""
[('天安门', 'LOC', 0, 3), ('毛主席', 'PER', 4, 7)]
"""
```
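
In practice the pairs usually come from zipping model tokens with predicted labels; `tokens` and `labels` below are hypothetical model outputs:

```python
tokens = ["天", "安", "门", "有", "毛", "主", "席"]
labels = ["B-LOC", "I-LOC", "I-LOC", "O", "B-PER", "I-PER", "I-PER"]
print(pick_entity_from_bio_labels(list(zip(tokens, labels)), with_offset=True))
```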

#### Arbitrary Arguments to UUID

```python
from pnlp import generate_uuid

uid1 = generate_uuid("a", 1, 0.02)
uid2 = generate_uuid("a", 1)
"""
uid1 == 3fbc8b70d05b5abdb5badca1d26e1dbd
uid2 == f7b0ffc589e453e88d4faf66eb92f669
"""
```

### Built-in Dictionaries

#### Stopwords

```python
from pnlp import StopWords, chinese_stopwords, english_stopwords

csw = StopWords("/path/to/custom/stopwords.txt")
csw.stopwords # a set of the custom stopwords

csw.zh == chinese_stopwords # Chinese stopwords
csw.en == english_stopwords # English stopwords
```
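
A small filtering sketch built on the pieces above, assuming the stopword collections behave as plain sets of strings:

```python
from pnlp import chinese_stopwords, cut_zhchar

tokens = cut_zhchar("这是一个停用词过滤的例子", remove_blank=True)
content = [t for t in tokens if t not in chinese_stopwords]
```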


### Text Length

```python
from pnlp import Length

text = "这是https://www.yam.gift长度测试,《 》*)FSJfdsjf😁![](http://xx.jpg)。233."

pl = Length(text)
# Note: even when patterns are used, all lengths are computed on the original text
# Lengths count characters, not whole words
print("Length of all characters: ", pl.len_all)
print("Length of all non-white characters: ", pl.len_nwh)
print("Length of all Chinese characters: ", pl.len_chi)
print("Length of all words and numbers: ", pl.len_wnb)
print("Length of all punctuations: ", pl.len_pun)
print("Length of all English characters: ", pl.len_eng)
print("Length of all numbers: ", pl.len_num)

"""
Length of all characters:  64
Length of all non-white characters:  63
Length of all Chinese characters:  6
Length of all words and numbers:  41
Length of all punctuations:  14
Length of all English characters:  32
Length of all numbers:  3
"""
```

### Magic Methods

#### Dictionary

```python
from pnlp import MagicDict

# Nested dict
pmd = MagicDict()
pmd['a']['b']['c'] = 2
print(pmd)

"""
{'a': {'b': {'c': 2}}}
"""

# When the dict is reversed, all keys sharing a duplicated value are kept
dx = {1: 'a',
      2: 'a',
      3: 'a',
      4: 'b' }
print(MagicDict.reverse(dx))

"""
{'a': [1, 2, 3], 'b': 4}
"""
```

#### Getting a Unique Filename

```python
from pnlp import get_unique_fn

get_unique_fn("a/b/c.md") == "a_b_c.md"
```

### Parallel Processing

Four parallel processing modes are supported:

- Thread pool: `thread_pool`
- Process pool: `process_pool`
- Thread executor: `thread_executor`, the default
- Thread: `thread`

Note: processing is lazy; a generator is returned.

```python
import math
def is_prime(x):
    if x < 2:
        return False
    for i in range(2, int(math.sqrt(x)) + 1):
        if x % i == 0:
            return False
    return True

from pnlp import concurring

# max_workers defaults to 4
@concurring
def get_primes(lst):
    res = []
    for i in lst:
        if is_prime(i):
            res.append(i)
    return res

@concurring(type="thread_pool", max_workers=10)
def get_primes(lst):
    pass
```

The `concurring` decorator parallelizes your iterating function.
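
A minimal usage sketch, assuming the decorated function is called with an iterable and its results are consumed lazily:

```python
# Each iteration pulls the next available result from the workers
for part in get_primes(range(10000)):
    print(part)
```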

### Background Processing

```python
from pnlp import run_in_new_thread

def func(file, a, b, c):
    background_task()

run_in_new_thread(func, file, 1, 2, 3)
```

## Testing

Clone the repo, then run:

```bash
$ python -m pytest
```

## Changelog

See the English README.




            
