jieba3


Name: jieba3
Version: 1.0.2
Summary: "Jieba 3" Chinese word segmentation: built to be the best Chinese word segmentation component for Modern Python 3
Upload time: 2024-10-12 06:09:52
Requires Python: <4,>=3.10
Keywords: chinese, nlp, analysis, tokenizer
Requirements: pydantic
# jieba3

"Jieba 3" Chinese word segmentation: built to be the best Chinese word segmentation component for Modern Python 3

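# Differences from jieba

jieba3 is a Modern Python 3 rewrite of the [jieba](https://github.com/fxsjy/jieba) word segmentation module.

- Removes the Python 2 compatibility code and supports Modern Python 3 features such as type hints
- Refactors the segmentation module: while remaining a pure-Python implementation, it is about **20%** faster, and its segmentation results are aligned with jieba's
- Does not yet support jieba features other than segmentation, such as keyword extraction and part-of-speech tagging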

# Installation

jieba3 requires Python 3.10 or later.

```bash
pip install jieba3
```

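# Algorithm

- Uses a prefix dictionary for efficient word-graph scanning, building a directed acyclic graph (DAG) of every possible word that the Chinese characters in a sentence can form
- Uses dynamic programming to find the maximum-probability path, that is, the best segmentation according to word frequency
- For out-of-vocabulary words, uses an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm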
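The first two steps can be sketched in a few lines of pure Python. This is a simplified illustration, not jieba3's actual code: the `FREQ` counts are made up, and the names `build_dag` and `best_path` are hypothetical.

```python
import math

# Toy word-frequency table (made-up counts, not the real dictionary).
FREQ = {"中国": 50, "科学": 40, "学院": 30, "科学院": 20, "中国科学院": 10,
        "中": 5, "国": 5, "科": 5, "学": 5, "院": 5}
TOTAL = sum(FREQ.values())
# Every prefix of every word, so scanning can stop as soon as no word can match.
PREFIXES = {word[:k] for word in FREQ for k in range(1, len(word) + 1)}


def build_dag(sentence: str) -> dict[int, list[int]]:
    """Map each start index to the inclusive end indices of dictionary words."""
    dag: dict[int, list[int]] = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        while end < n and sentence[start:end + 1] in PREFIXES:
            if sentence[start:end + 1] in FREQ:
                ends.append(end)
            end += 1
        dag[start] = ends or [start]  # unknown character: a one-character edge
    return dag


def best_path(sentence: str) -> list[str]:
    """Pick the maximum log-probability path through the DAG, right to left."""
    dag = build_dag(sentence)
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:end + 1], 1) / TOTAL) + route[end + 1][0], end)
            for end in dag[i]
        )
    words, i = [], 0
    while i < n:
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print(build_dag("中国科学院")[0])  # end indices of the words starting at position 0
print(best_path("中国科学院"))
```

With these toy counts the single long word wins, because one moderately rare word is still more probable than the product of two or more word probabilities.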

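# Main features

Create a `jieba3.jieba3` tokenizer instance. The following parameters are supported:

- `model: Literal["base", "small", "large"] = "base"`
  - Segmentation model; one of `small`, `base`, or `large`; defaults to `base`
  - `base` is the default model provided by jieba
  - `small` is the jieba model with a smaller memory footprint
  - `large` is the jieba model with better support for Traditional Chinese segmentation
- `use_hmm: bool = True`
  - Whether to enable HMM new-word discovery; `True` or `False`; defaults to `True`

Example: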

```python
import jieba3

tokenizer = jieba3.jieba3()  # defaults: base model, HMM new-word discovery enabled
tokenizer = jieba3.jieba3(model="small")  # use the small model
tokenizer = jieba3.jieba3(model="base")  # use the base model
tokenizer = jieba3.jieba3(model="large")  # use the large model
tokenizer = jieba3.jieba3(use_hmm=False)  # disable HMM new-word discovery
tokenizer = jieba3.jieba3(use_hmm=True)  # enable HMM new-word discovery
```
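To give a feel for what `use_hmm` switches on, here is a minimal Viterbi sketch over the usual B/M/E/S character tags (begin, middle, end, single). All probabilities below are invented for illustration, and the function names are hypothetical. The real model's parameters and implementation may differ.

```python
import math

FLOOR = -1e9  # stand-in log-probability for "impossible"

# Made-up parameters: a word can only start with B or S, and so on.
START = {"B": math.log(0.6), "S": math.log(0.4)}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}
EMIT = {
    "B": {"小": math.log(0.5)},
    "M": {},
    "E": {"明": math.log(0.5)},
    "S": {"小": math.log(0.2), "明": math.log(0.1)},
}


def viterbi_tags(text: str) -> list[str]:
    """Return the most probable B/M/E/S tag sequence for a non-empty text."""
    score = {s: START.get(s, FLOOR) + EMIT[s].get(text[0], FLOOR) for s in "BMES"}
    path = {s: [s] for s in "BMES"}
    for ch in text[1:]:
        new_score, new_path = {}, {}
        for s in "BMES":
            # Best previous tag to transition from, given the scores so far.
            prev = max("BMES", key=lambda p: score[p] + TRANS[p].get(s, FLOOR))
            new_score[s] = score[prev] + TRANS[prev].get(s, FLOOR) + EMIT[s].get(ch, FLOOR)
            new_path[s] = path[prev] + [s]
        score, path = new_score, new_path
    last = max("ES", key=lambda s: score[s])  # a sentence must end a word
    return path[last]


def tags_to_words(text: str, tags: list[str]) -> list[str]:
    """Cut the text after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words


tags = viterbi_tags("小明")
print(tags, tags_to_words("小明", tags))
```

With these toy numbers the two characters are tagged B then E and joined into one word, which is the kind of new-word discovery the switch controls.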

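## Document mode

Attempts to segment the sentence as precisely as possible; suited to document analysis.

> With the default `base` model, jieba3's document mode produces exactly the same segmentation as jieba's accurate mode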

```python
import jieba3
import jieba

# With HMM new-word discovery enabled

tokenizer = jieba3.jieba3()
tokenizer.cut_text("小明硕士毕业于中国科学院计算所")
# ["小明", "硕士", "毕业", "于", "中国科学院", "计算所"]

jieba.lcut("小明硕士毕业于中国科学院计算所")
# ["小明", "硕士", "毕业", "于", "中国科学院", "计算所"]

# With HMM new-word discovery disabled

tokenizer = jieba3.jieba3(use_hmm=False)
tokenizer.cut_text("小明硕士毕业于中国科学院计算所")
# ["小", "明", "硕士", "毕业", "于", "中国科学院", "计算所"]

jieba.lcut("小明硕士毕业于中国科学院计算所", HMM=False)
# ["小", "明", "硕士", "毕业", "于", "中国科学院", "计算所"]
```
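One property worth noting about the outputs listed above: in document mode the tokens form a partition of the input, so joining them reproduces the original sentence. The check below uses only the token lists shown in the example; `is_partition` is a hypothetical helper, not part of jieba3.

```python
text = "小明硕士毕业于中国科学院计算所"

# Token lists copied from the example output above.
with_hmm = ["小明", "硕士", "毕业", "于", "中国科学院", "计算所"]
without_hmm = ["小", "明", "硕士", "毕业", "于", "中国科学院", "计算所"]


def is_partition(tokens: list[str], source: str) -> bool:
    """True when the tokens, joined in order, give back the source text."""
    return "".join(tokens) == source


print(is_partition(with_hmm, text), is_partition(without_hmm, text))
# The only difference between the two runs is the name split into single characters.
print(set(without_hmm) - set(with_hmm))
```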

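## Query mode

Builds on document mode by splitting long words again to improve recall; suited to query analysis.

> With the default `base` model, jieba3's query mode produces exactly the same segmentation as jieba's search engine mode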

```python
import jieba3
import jieba

# With HMM new-word discovery enabled

tokenizer = jieba3.jieba3()
tokenizer.cut_query("小明硕士毕业于中国科学院计算所")
# ["小明", "硕士", "毕业", "于", "中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"]

jieba.lcut_for_search("小明硕士毕业于中国科学院计算所")
# ["小明", "硕士", "毕业", "于", "中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"]

# With HMM new-word discovery disabled

tokenizer = jieba3.jieba3(use_hmm=False)
tokenizer.cut_query("小明硕士毕业于中国科学院计算所")
# ["小", "明", "硕士", "毕业", "于", "中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"]

jieba.lcut_for_search("小明硕士毕业于中国科学院计算所", HMM=False)
# ["小", "明", "硕士", "毕业", "于", "中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"]
```
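The re-splitting step can be sketched as follows: for each word from document mode, emit any two-character and then three-character substrings found in the dictionary, followed by the word itself. This follows the way the original jieba's search mode is usually described; whether jieba3 does exactly this internally is an assumption, and `DICTIONARY` and `resplit_for_query` are hypothetical names with a toy word list.

```python
# Toy dictionary standing in for the real word list.
DICTIONARY = {"中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"}


def resplit_for_query(words: list[str]) -> list[str]:
    """Add in-dictionary 2-grams and 3-grams of each long word before the word."""
    result = []
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in DICTIONARY:
                        result.append(gram)
        result.append(word)
    return result


document_tokens = ["小明", "硕士", "毕业", "于", "中国科学院", "计算所"]
print(resplit_for_query(document_tokens))
```

Applied to the document-mode tokens from the example, this toy version yields the same eleven tokens as the query-mode output listed above.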

# Performance

All jieba3 runs use the default `base` model and are compared against jieba's default model.

Test environment: MacBookPro18,3, macOS 14.5, Apple M1 Pro @ 3.20 GHz, 16 GB

## SIGHAN Bakeoff 2005 test set (line-by-line segmentation)

### `as_test.utf8` (Traditional Chinese)

| Mode                    | jieba time | jieba speed | jieba3 time | jieba3 speed | Improvement |
| ----------------------- | ---------- | ----------- | ----------- | ------------ | ----------- |
| Document mode (HMM off) | 0.26 s     | 2.28 MB/s   | 0.20 s      | 2.94 MB/s    | 22%         |
| Document mode (HMM on)  | 0.60 s     | 0.98 MB/s   | 0.48 s      | 1.23 MB/s    | 20%         |
| Query mode (HMM off)    | 0.27 s     | 2.17 MB/s   | 0.21 s      | 2.79 MB/s    | 22%         |
| Query mode (HMM on)     | 0.63 s     | 0.93 MB/s   | 0.51 s      | 1.15 MB/s    | 20%         |

### `cityu_test.utf8` (Traditional Chinese)

| Mode                    | jieba time | jieba speed | jieba3 time | jieba3 speed | Improvement |
| ----------------------- | ---------- | ----------- | ----------- | ------------ | ----------- |
| Document mode (HMM off) | 0.09 s     | 2.22 MB/s   | 0.07 s      | 2.87 MB/s    | 23%         |
| Document mode (HMM on)  | 0.21 s     | 0.93 MB/s   | 0.17 s      | 1.16 MB/s    | 20%         |
| Query mode (HMM off)    | 0.09 s     | 2.11 MB/s   | 0.07 s      | 2.71 MB/s    | 22%         |
| Query mode (HMM on)     | 0.21 s     | 0.90 MB/s   | 0.17 s      | 1.12 MB/s    | 20%         |

### `msr_test.utf8` (Simplified Chinese)

| Mode                    | jieba time | jieba speed | jieba3 time | jieba3 speed | Improvement |
| ----------------------- | ---------- | ----------- | ----------- | ------------ | ----------- |
| Document mode (HMM off) | 0.26 s     | 2.06 MB/s   | 0.20 s      | 2.69 MB/s    | 24%         |
| Document mode (HMM on)  | 0.30 s     | 1.79 MB/s   | 0.24 s      | 2.25 MB/s    | 20%         |
| Query mode (HMM off)    | 0.28 s     | 1.91 MB/s   | 0.22 s      | 2.47 MB/s    | 23%         |
| Query mode (HMM on)     | 0.32 s     | 1.67 MB/s   | 0.26 s      | 2.08 MB/s    | 20%         |

### `pku_test.utf8` (Simplified Chinese)

| Mode                    | jieba time | jieba speed | jieba3 time | jieba3 speed | Improvement |
| ----------------------- | ---------- | ----------- | ----------- | ------------ | ----------- |
| Document mode (HMM off) | 0.25 s     | 1.91 MB/s   | 0.20 s      | 2.48 MB/s    | 23%         |
| Document mode (HMM on)  | 0.30 s     | 1.64 MB/s   | 0.24 s      | 2.04 MB/s    | 20%         |
| Query mode (HMM off)    | 0.26 s     | 1.85 MB/s   | 0.20 s      | 2.41 MB/s    | 23%         |
| Query mode (HMM on)     | 0.33 s     | 1.48 MB/s   | 0.27 s      | 1.82 MB/s    | 19%         |

## 《围城》 (*Fortress Besieged*, full-text segmentation)

| Mode                    | jieba time | jieba speed | jieba3 time | jieba3 speed | Improvement |
| ----------------------- | ---------- | ----------- | ----------- | ------------ | ----------- |
| Document mode (HMM off) | 0.35 s     | 1.85 MB/s   | 0.28 s      | 2.32 MB/s    | 20%         |
| Document mode (HMM on)  | 0.51 s     | 1.25 MB/s   | 0.42 s      | 1.52 MB/s    | 18%         |
| Query mode (HMM off)    | 0.33 s     | 1.93 MB/s   | 0.26 s      | 2.45 MB/s    | 21%         |
| Query mode (HMM on)     | 0.55 s     | 1.17 MB/s   | 0.45 s      | 1.42 MB/s    | 18%         |
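The source does not say how the last column of these tables is computed. Judging from the numbers, it appears to be the relative reduction in elapsed time rather than the relative gain in throughput; this is an inference, not a documented definition. The sketch below compares the two readings using speeds copied from the `as_test.utf8` table, with hypothetical function names.

```python
def time_reduction_percent(old_speed: float, new_speed: float) -> float:
    """Relative drop in elapsed time for a fixed input, derived from throughput."""
    return (1 - old_speed / new_speed) * 100


def throughput_gain_percent(old_speed: float, new_speed: float) -> float:
    """Relative increase in MB/s."""
    return (new_speed / old_speed - 1) * 100


# (jieba MB/s, jieba3 MB/s) from the first two rows of the as_test.utf8 table.
rows = {
    "document mode, HMM off": (2.28, 2.94),
    "document mode, HMM on": (0.98, 1.23),
}
for label, (jieba_speed, jieba3_speed) in rows.items():
    print(
        label,
        round(time_reduction_percent(jieba_speed, jieba3_speed)),
        round(throughput_gain_percent(jieba_speed, jieba3_speed)),
    )
```

For the first row the time reduction comes out at about 22%, matching the table, while the throughput gain is about 29%. Both describe the same measurement, so it is worth stating which one is meant when quoting the figures.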

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "jieba3",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4,>=3.10",
    "maintainer_email": null,
    "keywords": "Chinese, NLP, Analysis, Tokenizer",
    "author": null,
    "author_email": "Shihong Yan <yansh97@foxmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/23/67/8eaae19dd87120dc8a7ba6dff5b24676bd6b92b7570954283edf2bfb5696/jieba3-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "\u201c\u7ed3\u5df4 3\u201d\u4e2d\u6587\u5206\u8bcd\uff1a\u505a\u6700\u597d\u7684 Modern Python 3 \u4e2d\u6587\u5206\u8bcd\u7ec4\u4ef6",
    "version": "1.0.2",
    "project_urls": {
        "Home": "https://github.com/yansh97/jieba3"
    },
    "split_keywords": [
        "chinese",
        " nlp",
        " analysis",
        " tokenizer"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "665d83622ef6020c83766b4d0c9675722c7b6ed00a1a37289c4f6e707c695d8f",
                "md5": "78cf69b18ea5d7d3bf5fabfe43a69ac4",
                "sha256": "b6f33845d8dc32a7a55db95611efef7de5d42ba9964716d26f8d22d20cee9785"
            },
            "downloads": -1,
            "filename": "jieba3-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "78cf69b18ea5d7d3bf5fabfe43a69ac4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4,>=3.10",
            "size": 6910285,
            "upload_time": "2024-10-12T06:09:44",
            "upload_time_iso_8601": "2024-10-12T06:09:44.147888Z",
            "url": "https://files.pythonhosted.org/packages/66/5d/83622ef6020c83766b4d0c9675722c7b6ed00a1a37289c4f6e707c695d8f/jieba3-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "23678eaae19dd87120dc8a7ba6dff5b24676bd6b92b7570954283edf2bfb5696",
                "md5": "f33e9d5eced410c7567c449e5942d077",
                "sha256": "80054b147115ac6a09f50d54d68abcf55f2cb8d435ab71128da40effd0f4e2cb"
            },
            "downloads": -1,
            "filename": "jieba3-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "f33e9d5eced410c7567c449e5942d077",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4,>=3.10",
            "size": 6850234,
            "upload_time": "2024-10-12T06:09:52",
            "upload_time_iso_8601": "2024-10-12T06:09:52.684254Z",
            "url": "https://files.pythonhosted.org/packages/23/67/8eaae19dd87120dc8a7ba6dff5b24676bd6b92b7570954283edf2bfb5696/jieba3-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-12 06:09:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yansh97",
    "github_project": "jieba3",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pydantic",
            "specs": []
        }
    ],
    "lcname": "jieba3"
}
        