# doc2txt
A Python package for extracting text from Microsoft Word documents. It is built on the antiword tool and bundles cross-platform binaries and data files.
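## Features
- Extracts plain text from Microsoft Word documents in .doc format
- Cross-platform support (Windows, Linux, macOS ARM64)
- Bundled antiword binaries, so no separate installation is needed
- Optional text format optimization that automatically handles line breaks and tables
- Simple Python API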
## Supported platforms
- Windows (AMD64)
- Linux (AMD64)
- macOS (ARM64/Apple Silicon)
> Note: macOS Intel (x86_64) is not supported yet.
## Installation
```bash
pip install doc2txt
```
## Quick start
### Basic usage
```python
from doc2txt import extract_text
# Extract text from a Word document
text = extract_text('document.doc')
print(text)
```
### Enabling text format optimization
```python
from doc2txt import extract_text
# Extract text and optimize its format (merge broken lines, handle tables)
text = extract_text('document.doc', optimize_format=True)
print(text)
```
### Using the text optimizer directly
```python
from doc2txt import extract_text, optimize_text
# Extract the raw text first
raw_text = extract_text('document.doc')
# Optimize the text format manually
optimized_text = optimize_text(raw_text)
print(optimized_text)
```
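The same call pattern extends to whole folders. The following is a minimal sketch, not part of the package: `convert_folder` is a hypothetical helper, and the extraction function is passed in as a parameter so that `extract_text` (or a stand-in) can be supplied by the caller.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(src_dir: str, extractor: Callable[[str], str]) -> List[Path]:
    """Write a .txt file next to every .doc file found directly in src_dir.

    `extractor` is any function mapping a .doc path to its text,
    for example doc2txt.extract_text (assumed to be installed by the caller).
    """
    written = []
    for doc_file in sorted(Path(src_dir).glob("*.doc")):
        text = extractor(str(doc_file))
        txt_file = doc_file.with_suffix(".txt")
        txt_file.write_text(text, encoding="utf-8")
        written.append(txt_file)
    return written
```

With the package installed, this could be called as `convert_folder("docs", extract_text)`. Passing the extractor as an argument also makes it easy to wrap `extract_text` with `optimize_format=True` using a lambda.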
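## API reference
### extract_text(doc_path, optimize_format=False)
Extracts text from a Microsoft Word document.
**Parameters:**
- `doc_path` (str): path to the .doc file
- `optimize_format` (bool): whether to optimize the text format; defaults to False
**Returns:**
- `str`: the text content extracted from the document
**Raises:**
- `RuntimeError`: the platform is not supported or the binary is missing
- `subprocess.CalledProcessError`: antiword failed to run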
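The two documented exceptions can be handled separately. The sketch below assumes nothing beyond the exception types listed above; `safe_extract` is a hypothetical wrapper, and the extraction function is injected so the example runs without the package installed.

```python
import subprocess
from typing import Callable, Optional


def safe_extract(doc_path: str, extractor: Callable[[str], str]) -> Optional[str]:
    """Return the extracted text, or None if extraction fails.

    `extractor` is expected to behave like doc2txt.extract_text:
    it may raise RuntimeError or subprocess.CalledProcessError.
    """
    try:
        return extractor(doc_path)
    except RuntimeError as exc:
        # Unsupported platform or missing bundled binary
        print(f"doc2txt cannot run here: {exc}")
    except subprocess.CalledProcessError as exc:
        # antiword exited with a non-zero status
        print(f"antiword failed on {doc_path} (exit code {exc.returncode})")
    return None
```

A typical call would be `safe_extract("document.doc", extract_text)` after `from doc2txt import extract_text`.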
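### optimize_text(text)
Optimizes the format of text extracted from a document.
**Parameters:**
- `text` (str): the raw text extracted from the document
**Returns:**
- `str`: the text after format optimization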
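## Text optimization
Text optimization addresses formatting problems that commonly appear when extracting text from Word documents:
- **Line merging**: automatically merges consecutive lines that have no indentation, keeping paragraphs intact
- **Table handling**: recognizes table rows (lines containing the `|` separator) and preserves the table layout
- **Whitespace handling**: removes extra leading spaces from lines to keep the output clean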
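To make the three rules above concrete, here is a rough sketch of how such an optimizer could work. It is not the package's `optimize_text` implementation; `merge_lines_sketch` is a hypothetical function, and the choices to treat an indented line as the start of a new paragraph and to join merged lines with a single space are assumptions.

```python
def merge_lines_sketch(text: str) -> str:
    """Illustrative line merger; NOT the actual doc2txt.optimize_text."""
    out = []      # finished output lines
    current = []  # pieces of the paragraph currently being assembled

    def flush() -> None:
        # Emit the paragraph collected so far as a single line
        if current:
            out.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line ends the paragraph and is kept as a separator
            flush()
            out.append("")
        elif "|" in stripped:
            # Table row: keep on its own line, never merged
            flush()
            out.append(stripped)
        elif line[0].isspace():
            # Indented line: assumed to start a new paragraph
            flush()
            current.append(stripped)
        else:
            # Unindented line: continuation of the current paragraph
            current.append(stripped)
    flush()
    return "\n".join(out)
```

For text without spaces between words, such as Chinese, joining with an empty string instead of a space would likely be the better choice.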
## Project structure
```
doc2txt/
├── __init__.py # Main entry point of the package
├── antiword_wrapper.py # Python wrapper around the antiword tool
├── text_optimizer.py # Text format optimization utilities
├── bin/ # Cross-platform binaries
│ ├── darwin-arm64/
│ ├── linux-amd64/
│ └── win-amd64/
└── antiword_share/ # antiword data files
    ├── fontnames
    └── *.txt # Character encoding mapping files
```
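The `bin/` subdirectory names suggest that the wrapper picks a binary based on the operating system and CPU architecture. The sketch below shows one way such a lookup could be written; `binary_dir_name` and the alias table are assumptions for illustration, not code taken from `antiword_wrapper.py`. Only the three directory names and the `RuntimeError` on unsupported platforms come from this README.

```python
import platform
from typing import Optional

# Normalize the architecture names reported by different operating systems
_MACHINE_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}

# (system, architecture) pairs mapped to the directory names shown in the tree
_SUPPORTED = {
    ("darwin", "arm64"): "darwin-arm64",
    ("linux", "amd64"): "linux-amd64",
    ("windows", "amd64"): "win-amd64",
}


def binary_dir_name(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return the bin/ subdirectory for a platform, or raise RuntimeError."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = _MACHINE_ALIASES.get(machine, machine)
    try:
        return _SUPPORTED[(system, arch)]
    except KeyError:
        raise RuntimeError(f"unsupported platform: {system}/{machine}") from None
```

Calling `binary_dir_name()` with no arguments inspects the current machine; passing explicit values makes the mapping easy to check for other platforms.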
## Dependencies
This package has no external dependencies; all required tools and data files are bundled.
## License
MIT License
## Contributing
Issues and pull requests that improve this project are welcome.
## Changelog
### 1.0.0
- Initial release
- Text extraction from .doc files
- Bundled cross-platform antiword binaries
- Text format optimization
Raw data
{
"_id": null,
"home_page": "https://github.com/quantatirsk/doc2txt-pypi",
"name": "doc2txt",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "word doc text extraction antiword document",
"author": "Quant",
"author_email": "pengzhia@mail.com",
"download_url": "https://files.pythonhosted.org/packages/b0/f9/bc85a56b94fd3357b5f7a6f6e01b670b845b45a9d36402a8aea71ea6f1d6/doc2txt-1.0.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python wrapper for antiword with bundled binary and data files",
"version": "1.0.4",
"project_urls": {
"Bug Reports": "https://github.com/quantatirsk/doc2txt-pypi/issues",
"Homepage": "https://github.com/quantatirsk/doc2txt-pypi",
"Source": "https://github.com/quantatirsk/doc2txt-pypi"
},
"split_keywords": [
"word",
"doc",
"text",
"extraction",
"antiword",
"document"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "8616d503f30c9addbe5bf9bd0fcfce11960ddcea7a02fc20abc9a2aa552e7e88",
"md5": "b5dbd6385492a4c9915a76abfb1ade52",
"sha256": "2106d926d152eacf0fce9c2e7b4ef0f1225611f51eb9d99f7e40d424838abb86"
},
"downloads": -1,
"filename": "doc2txt-1.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b5dbd6385492a4c9915a76abfb1ade52",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 411428,
"upload_time": "2025-07-17T03:25:14",
"upload_time_iso_8601": "2025-07-17T03:25:14.730236Z",
"url": "https://files.pythonhosted.org/packages/86/16/d503f30c9addbe5bf9bd0fcfce11960ddcea7a02fc20abc9a2aa552e7e88/doc2txt-1.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b0f9bc85a56b94fd3357b5f7a6f6e01b670b845b45a9d36402a8aea71ea6f1d6",
"md5": "0f6a9880efd914c0f9f2d6c0f8c05d42",
"sha256": "558dee4d9cb94ab12b238bfba3b4cd4c4a5c7c5be543e81ac05c061e650643e0"
},
"downloads": -1,
"filename": "doc2txt-1.0.4.tar.gz",
"has_sig": false,
"md5_digest": "0f6a9880efd914c0f9f2d6c0f8c05d42",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 349242,
"upload_time": "2025-07-17T03:25:16",
"upload_time_iso_8601": "2025-07-17T03:25:16.375535Z",
"url": "https://files.pythonhosted.org/packages/b0/f9/bc85a56b94fd3357b5f7a6f6e01b670b845b45a9d36402a8aea71ea6f1d6/doc2txt-1.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-17 03:25:16",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "quantatirsk",
"github_project": "doc2txt-pypi",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "doc2txt"
}