pdfdocx


| Field | Value |
| --- | --- |
| Name | pdfdocx |
| Version | 1.7 |
| home_page | https://github.com/hidadeng/pdfdocx |
| Summary | Reads pdf and docx files and returns the text they contain. |
| upload_time | 2023-09-10 09:34:00 |
| maintainer | |
| docs_url | None |
| author | 大邓 |
| requires_python | >=3.5 |
| license | MIT |
| keywords | pdf extraction, docx extraction, text mining |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

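Recently, while running my course code, I found that the function for reading pdf files had stopped working. I tracked down working code for reading pdf files and, to make later study easier, packaged the pdf and docx reading methods into the pdfdocx package.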



# pdfdocx

The package provides just two simple reading functions:

- read_pdf(file)
- read_docx(file)

`file` is a file path; each function returns the text contained in that file.
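
Because both functions take a path and return text, a small wrapper can pick the right one from the file extension. The following is only a sketch: `read_document` and its `readers` parameter are hypothetical names introduced here for illustration and are not part of pdfdocx; only `read_pdf` and `read_docx` come from the package.

```python
from pathlib import Path


def read_document(file, readers=None):
    """Return the text of a file, choosing the reader by extension.

    `read_document` and `readers` are illustrative names, not pdfdocx APIs.
    """
    if readers is None:
        # Imported lazily so the helper can also be used with stand-in readers.
        from pdfdocx import read_pdf, read_docx
        readers = {".pdf": read_pdf, ".docx": read_docx}

    suffix = Path(file).suffix.lower()
    if suffix not in readers:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return readers[suffix](str(file))
```

With pdfdocx installed, `read_document('test/data.pdf')` would be expected to behave like `read_pdf('test/data.pdf')`.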

<br>

### Installation

```
pip install pdfdocx
```

<br>

### Usage

Read a pdf file

```python
from pdfdocx import read_pdf
p_text = read_pdf('test/data.pdf')
print(p_text)
```

Run

```
这是来⾃pdf⽂件内的内容
```



Read a docx file

```python
from pdfdocx import read_docx
d_text = read_docx('test/data.docx')
print(d_text)
```

Run

```
这是来⾃docx⽂件内的内容
```
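
The two calls above each handle one file. For a folder of mixed pdf and docx files, the same idea can be applied in a loop. This is a minimal sketch under assumptions: `read_folder` is a hypothetical helper, not something pdfdocx provides, and the reader functions are passed in explicitly so the helper itself has no hard dependency on the package.

```python
from pathlib import Path


def read_folder(folder, readers):
    """Map each supported file name in `folder` to its extracted text.

    `readers` maps a lowercase extension (e.g. ".pdf") to a function that
    takes a path string and returns text. Hypothetical helper, not in pdfdocx.
    """
    texts = {}
    for path in sorted(Path(folder).iterdir()):
        reader = readers.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue  # skip sub-folders and files with no matching reader
        texts[path.name] = reader(str(path))
    return texts


# Possible usage with pdfdocx installed:
# from pdfdocx import read_pdf, read_docx
# texts = read_folder("test", {".pdf": read_pdf, ".docx": read_docx})
```

Returning a dictionary keyed by file name keeps each text traceable to its source file, which tends to be convenient for later text analysis.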


<br>




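# If you are a beginner

If you come from an economics, management, humanities, or social science background, are new to programming, and face the daunting task of collecting and analyzing large volumes of text data, consider the video course [《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df) (Python Web Scraping and Text Data Analysis). I was a liberal arts student who also started from zero, and this course condenses five years of experience. I believe the explanations are easy to follow o(* ̄︶ ̄*)o. It covers:

- Python basics
- Web scraping
- Reading data
- Introduction to text analysis
- Machine learning and text analysis
- Applications of text analysis in economics and management research

If you are interested, take a look at [《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df).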

[![](img/课程.png)](https://ke.qq.com/course/482241?tuin=163164df)


<br>


# More

- [Bilibili: 大邓和他的python](https://space.bilibili.com/122592901/channel/detail?cid=66008)

- WeChat official account: 大邓和他的python

- [Zhihu column: 数据科学家](https://zhuanlan.zhihu.com/dadeng)

<br>

![](img/大邓和他的Python.png)


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hidadeng/pdfdocx",
    "name": "pdfdocx",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": "",
    "keywords": "pdf extraction,docx extraction,text mining",
    "author": "\u5927\u9093",
    "author_email": "thunderhit@qq.com",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "\u8bfb\u53d6pdf\u3001docx\u6587\u4ef6\uff0c\u8fd4\u56de\u6587\u4ef6\u5185\u7684\u6587\u672c\u6570\u636e\u3002",
    "version": "1.7",
    "project_urls": {
        "Homepage": "https://github.com/hidadeng/pdfdocx"
    },
    "split_keywords": [
        "pdf extraction",
        "docx extraction",
        "text mining"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "80ecaebe998b9d19edcab9cb2f03cf427a73dc4018b5927d8035f64bddb9b343",
                "md5": "a0d052e6f067e27ef1eb736188d8cfdc",
                "sha256": "e5e325d99177de54ff9eeef7e0be6cd4de5e12b9496c327670e559775f7119cc"
            },
            "downloads": -1,
            "filename": "pdfdocx-1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a0d052e6f067e27ef1eb736188d8cfdc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.5",
            "size": 3837,
            "upload_time": "2023-09-10T09:34:00",
            "upload_time_iso_8601": "2023-09-10T09:34:00.902810Z",
            "url": "https://files.pythonhosted.org/packages/80/ec/aebe998b9d19edcab9cb2f03cf427a73dc4018b5927d8035f64bddb9b343/pdfdocx-1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-10 09:34:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hidadeng",
    "github_project": "pdfdocx",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pdfdocx"
}
        