doc2txt


Name: doc2txt
Version: 1.0.4
Home page: https://github.com/quantatirsk/doc2txt-pypi
Summary: Python wrapper for antiword with bundled binary and data files
Upload time: 2025-07-17 03:25:16
Author: Quant
Requires Python: >=3.6
License: MIT
Keywords: word, doc, text, extraction, antiword, document
Requirements: No requirements were recorded.
# doc2txt

A Python package for extracting text from Microsoft Word documents. It is built on the antiword tool and bundles cross-platform binaries and data files.

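## Features

- Extracts plain text from Microsoft Word documents in the .doc format
- Cross-platform support (Windows, Linux, macOS ARM64)
- Bundled antiword binary, so no separate installation is required
- Text format optimization that automatically handles line breaks and tables
- Simple, easy-to-use Python API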

## Supported Platforms

- Windows (AMD64)
- Linux (AMD64)
- macOS (ARM64/Apple Silicon)

> Note: macOS Intel (x86_64) is not yet supported.

## Installation

```bash
pip install doc2txt
```

## Quick Start

### Basic Usage

```python
from doc2txt import extract_text

# Extract text from a Word document
text = extract_text('document.doc')
print(text)
```

### Enabling Text Format Optimization

```python
from doc2txt import extract_text

# Extract text and optimize its format (merge broken lines, handle tables)
text = extract_text('document.doc', optimize_format=True)
print(text)
```

### Using the Text Optimization Tool

```python
from doc2txt import extract_text, optimize_text

# Extract the raw text first
raw_text = extract_text('document.doc')

# Optimize the text format manually
optimized_text = optimize_text(raw_text)
print(optimized_text)
```
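
The calls above handle one file at a time. As a minimal sketch of how they might be applied to a whole folder, the helper below writes a `.txt` file next to every `.doc` file it finds. The name `convert_folder` and its `extractor` parameter are hypothetical and not part of doc2txt; the parameter exists only so the loop can be tried with a stand-in function on a machine where the package is not installed.

```python
from pathlib import Path


def convert_folder(folder, extractor=None, optimize_format=True):
    """Write a .txt file next to every .doc file in `folder`.

    `convert_folder` and `extractor` are illustrative names, not doc2txt API.
    Returns the list of written paths, sorted by file name.
    """
    if extractor is None:
        # Default to the documented entry point of the package.
        from doc2txt import extract_text as extractor

    written = []
    for doc in sorted(Path(folder).glob("*.doc")):
        text = extractor(str(doc), optimize_format=optimize_format)
        target = doc.with_suffix(".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder('reports')` would then use `extract_text` with `optimize_format=True` for each file; existing `.txt` files with the same stem are overwritten in this sketch.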

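## API Reference

### extract_text(doc_path, optimize_format=False)

Extracts text from a Microsoft Word document.

**Parameters:**
- `doc_path` (str): Path to the .doc file
- `optimize_format` (bool): Whether to optimize the text format; defaults to False

**Returns:**
- `str`: The text content extracted from the document

**Raises:**
- `RuntimeError`: The platform is not supported or the binary is missing
- `subprocess.CalledProcessError`: antiword execution failed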
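
The two exceptions listed above can be handled separately. The following is a minimal sketch, assuming only the documented behaviour; `try_extract` and its `extractor` parameter are hypothetical names introduced for illustration, and the error messages are this example's own wording.

```python
import subprocess


def try_extract(doc_path, extractor=None):
    """Return (text, error_message); exactly one of the two is None.

    `try_extract` and `extractor` are illustrative names, not doc2txt API.
    """
    if extractor is None:
        # Default to the documented entry point of the package.
        from doc2txt import extract_text as extractor

    try:
        return extractor(doc_path), None
    except RuntimeError as exc:
        # Documented as: platform not supported or bundled binary missing.
        return None, f"unsupported platform or missing binary: {exc}"
    except subprocess.CalledProcessError as exc:
        # Documented as: the antiword process itself failed.
        return None, f"antiword failed with exit code {exc.returncode}"
```

Returning a pair instead of re-raising is a design choice of this sketch; it keeps a long batch run going when a single document cannot be read.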

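### optimize_text(text)

Optimizes the format of text extracted from a document.

**Parameters:**
- `text` (str): The raw text extracted from the document

**Returns:**
- `str`: The text after format optimization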

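## Text Optimization

Text optimization addresses formatting problems that commonly arise when extracting text from Word documents:

- **Line merging**: Automatically merges consecutive non-indented lines to keep paragraphs intact
- **Table handling**: Detects table rows (lines containing the `|` separator) and preserves the table layout
- **Whitespace handling**: Removes extra leading spaces so the output stays clean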
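
To make the three rules above concrete, here is a rough sketch of how such a pass could work. It is not the code in `text_optimizer.py`; the function name `merge_lines_sketch` is hypothetical, and details such as treating an indented line as the start of a new paragraph and joining merged lines with a single space are assumptions of this example.

```python
def merge_lines_sketch(text):
    """Illustrative line-merging pass; not doc2txt's actual implementation."""
    out = []   # finished output lines
    buf = []   # pieces of the paragraph currently being assembled

    def flush():
        # Emit the buffered paragraph, if any, as a single line.
        if buf:
            out.append(" ".join(buf))
            buf.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line ends the current paragraph and is kept.
            flush()
            out.append("")
        elif "|" in stripped:
            # Table row: keep it on its own line, never merge it.
            flush()
            out.append(stripped)
        elif line[0].isspace():
            # Assumption: an indented line starts a new paragraph.
            flush()
            buf.append(stripped)
        else:
            # Non-indented line: continue the current paragraph.
            buf.append(stripped)

    flush()
    return "\n".join(out)
```

Joining with a space suits space-delimited languages; for Chinese or Japanese text an empty separator would likely be the better choice.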

## Project Structure

```
doc2txt/
├── __init__.py              # Main entry point of the package
├── antiword_wrapper.py      # Python wrapper around the antiword tool
├── text_optimizer.py        # Text format optimization tool
├── bin/                     # Cross-platform binaries
│   ├── darwin-arm64/
│   ├── linux-amd64/
│   └── win-amd64/
└── antiword_share/          # antiword data files
    ├── fontnames
    └── *.txt                # Character encoding mapping files
```
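
The `bin/` layout above suggests that the wrapper picks a binary directory from the running platform. The sketch below shows one way that lookup might be done with the standard `platform` module. The function `binary_dir_name` and the mapping table are hypothetical and are not taken from `antiword_wrapper.py`; only the three directory names and the `RuntimeError` for unsupported platforms come from this README.

```python
import platform

# Assumed mapping from (system, machine) to the directory names listed above.
_SUPPORTED = {
    ("darwin", "arm64"): "darwin-arm64",
    ("linux", "x86_64"): "linux-amd64",
    ("linux", "amd64"): "linux-amd64",
    ("windows", "amd64"): "win-amd64",
    ("windows", "x86_64"): "win-amd64",
}


def binary_dir_name(system=None, machine=None):
    """Return the bin/ subdirectory for a platform (illustrative helper)."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    try:
        return _SUPPORTED[(system, machine)]
    except KeyError:
        # Mirrors the documented RuntimeError for unsupported platforms.
        raise RuntimeError(f"unsupported platform: {system}/{machine}") from None
```

With no arguments the function inspects the current machine; on macOS Intel, for example, it would raise `RuntimeError`, matching the note that x86_64 Macs are not yet supported.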

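## Dependencies

This package has no external dependencies; all required tools and data files are bundled.

## License

MIT License

## Contributing

Issues and pull requests to improve this project are welcome.

## Changelog

### 1.0.0
- Initial release
- Support for extracting text from .doc files
- Bundled cross-platform antiword binaries
- Text format optimization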

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/quantatirsk/doc2txt-pypi",
    "name": "doc2txt",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "word doc text extraction antiword document",
    "author": "Quant",
    "author_email": "pengzhia@mail.com",
    "download_url": "https://files.pythonhosted.org/packages/b0/f9/bc85a56b94fd3357b5f7a6f6e01b670b845b45a9d36402a8aea71ea6f1d6/doc2txt-1.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python wrapper for antiword with bundled binary and data files",
    "version": "1.0.4",
    "project_urls": {
        "Bug Reports": "https://github.com/quantatirsk/doc2txt-pypi/issues",
        "Homepage": "https://github.com/quantatirsk/doc2txt-pypi",
        "Source": "https://github.com/quantatirsk/doc2txt-pypi"
    },
    "split_keywords": [
        "word",
        "doc",
        "text",
        "extraction",
        "antiword",
        "document"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8616d503f30c9addbe5bf9bd0fcfce11960ddcea7a02fc20abc9a2aa552e7e88",
                "md5": "b5dbd6385492a4c9915a76abfb1ade52",
                "sha256": "2106d926d152eacf0fce9c2e7b4ef0f1225611f51eb9d99f7e40d424838abb86"
            },
            "downloads": -1,
            "filename": "doc2txt-1.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b5dbd6385492a4c9915a76abfb1ade52",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 411428,
            "upload_time": "2025-07-17T03:25:14",
            "upload_time_iso_8601": "2025-07-17T03:25:14.730236Z",
            "url": "https://files.pythonhosted.org/packages/86/16/d503f30c9addbe5bf9bd0fcfce11960ddcea7a02fc20abc9a2aa552e7e88/doc2txt-1.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b0f9bc85a56b94fd3357b5f7a6f6e01b670b845b45a9d36402a8aea71ea6f1d6",
                "md5": "0f6a9880efd914c0f9f2d6c0f8c05d42",
                "sha256": "558dee4d9cb94ab12b238bfba3b4cd4c4a5c7c5be543e81ac05c061e650643e0"
            },
            "downloads": -1,
            "filename": "doc2txt-1.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "0f6a9880efd914c0f9f2d6c0f8c05d42",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 349242,
            "upload_time": "2025-07-17T03:25:16",
            "upload_time_iso_8601": "2025-07-17T03:25:16.375535Z",
            "url": "https://files.pythonhosted.org/packages/b0/f9/bc85a56b94fd3357b5f7a6f6e01b670b845b45a9d36402a8aea71ea6f1d6/doc2txt-1.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-17 03:25:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "quantatirsk",
    "github_project": "doc2txt-pypi",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "doc2txt"
}
        