docx2txt2


Name: docx2txt2
Version: 1.0.4
Summary: Extract text from .docx and .odt files to strings in pure python.
upload_time: 2024-03-13 22:36:43
requires_python: >=3.8
license: MIT License Copyright (c) 2024 Toby Devlin Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: word, docx, odt, extract, text, images
requirements: No requirements were recorded.

# docx2txt2

> Extract text from .docx and .odt files to strings in pure python.

[![codecov](https://codecov.io/gh/GitToby/docx2txt2/graph/badge.svg?token=12KF8ARYVZ)](https://codecov.io/gh/GitToby/docx2txt2)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/GitToby/docx2txt2/lint-and-test.yaml)](https://github.com/GitToby/docx2txt2/actions/workflows/lint-and-test.yaml)
[![GitHub file size in bytes](https://img.shields.io/github/size/GitToby/docx2txt2/src%2Fdocx2txt2%2F__init__.py)](https://github.com/GitToby/docx2txt2/blob/master/src/docx2txt2/__init__.py)
[![PyPI - License](https://img.shields.io/pypi/l/docx2txt2)](https://github.com/GitToby/docx2txt2/blob/master/LICENSE.txt)
[![PyPI - Version](https://img.shields.io/pypi/v/docx2txt2)](https://pypi.org/project/docx2txt2/)
[![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2FGitToby%2Fdocx2txt2%2Fmaster%2Fpyproject.toml)](https://pypi.org/project/docx2txt2/)

My personal replacement for [docx2txt](https://github.com/ankushshah89/python-docx2txt).

It's intended to be very simple and provide some utilities to match the functionality of the original lib.


## Usage

Install with your fave package manager; anything that pulls from PyPI will work (pip, poetry, pdm, etc.):

```
pip install docx2txt2
```

Use with a file path (a `str` or any [`PathLike`](https://docs.python.org/3/library/os.html#os.PathLike) object) or a binary IO stream.

```python
import io
from pathlib import Path
import docx2txt2

# string paths
text = docx2txt2.extract_text("path/to/my.docx")
image_paths = docx2txt2.extract_images("path/to/my.docx", "path/to/images/out")

# pathlib.Path objects
docx_path = Path(__file__).parent / "my.docx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the directory already exists

text2 = docx2txt2.extract_text(docx_path)
image_paths2 = docx2txt2.extract_images(docx_path, image_out)

# byte streams
docx_bytes = b"..."  # placeholder: the raw bytes of a .docx file
bytes_io = io.BytesIO(docx_bytes)
text3 = docx2txt2.extract_text(bytes_io)
image_paths3 = docx2txt2.extract_images(bytes_io, "path/to/images/out")
```
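
For readers curious what "pure python" extraction from a `.docx` can look like, here is a minimal standard-library sketch. It is not docx2txt2's actual implementation: the helper names `make_minimal_docx` and `sketch_extract_text` are hypothetical, and it only reads `word/document.xml` (no headers, footers, images, or `.odt` support). It relies on the fact that a `.docx` file is a zip archive whose main text lives in `<w:t>` elements.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_minimal_docx(paragraphs):
    """Hypothetical helper: build an in-memory archive holding only word/document.xml.

    Enough for this sketch, but not a complete document that Word would open.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


def sketch_extract_text(source):
    """Hypothetical helper: join the text of every <w:t> element with single spaces.

    `source` may be a path or a binary stream, since zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    pieces = [node.text for node in root.iter(f"{{{W_NS}}}t") if node.text]
    return " ".join(pieces)


print(sketch_extract_text(make_minimal_docx(["Hello", "world"])))  # Hello world
```

The in-memory archive also shows one way to get a real byte stream to experiment with, in place of the `b"..."` placeholder above.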

## Compatibility & Motivation

docx2txt2 provides a superset of the data returned by docx2txt, with some caveats (listed below), so the following holds:

```python
import docx2txt

import docx2txt2

orig_content = docx2txt.process("my/file.docx").split()
new_content = docx2txt2.process("my/file.docx").split()

assert all(orig in new_content for orig in orig_content)
```

_This is a test in `test_extract_data.test_docx2txt_compatability`_


Compatibility & Caveats:

- Whitespace and styling aren't preserved like in the original; new pages, tabs and the like are now just spaces.
- Headers and footers contain "PAGE" where there would be a page number, unlike the original, which removed them.
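
To illustrate the first caveat, here is a small sketch of what "just spaces" means for the output. The function `collapse_whitespace` is a hypothetical stand-in for illustration, not part of docx2txt2, and the sample string is made up. It also shows why the token-based compatibility check above is unaffected: splitting on whitespace gives the same tokens before and after flattening.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (tabs, newlines, form feeds) with one space."""
    return " ".join(text.split())


# Made-up sample containing blank lines, a tab and a form feed (page break).
original_style = "Title\n\n\tIndented line\x0cNext page"
flattened = collapse_whitespace(original_style)

print(flattened)  # Title Indented line Next page

# The whitespace-separated tokens are identical, so a .split()-based comparison still passes.
assert original_style.split() == flattened.split()
```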


Motivations for rewrite:

- **Speed**: I have lots of Word docs to process, and I saw some efficiency gains over the original lib.
- **Formatting**: I didn't want to do whitespace removal for every run; this preformats output to only include spaces.

## Benchmarks

Basic benchmarking using [pytest-benchmark](https://pytest-benchmark.readthedocs.io) with a basic test document on my M1 MacBook and on GitHub Actions.
From these tests, this lib appears to be just under 2x faster on average (mean-time ratios of 1.88x and 1.95x in the tables below).

MacBook:

```
----------------------------------------------------------------------------------- benchmark: 2 tests ----------------------------------------------------------------------------------
Name (time in ms)               Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2     1.1498 (1.0)      6.2305 (1.0)      1.1949 (1.0)      0.3096 (1.0)      1.1685 (1.0)      0.0142 (1.0)          3;74  836.9124 (1.0)         724           1
test_benchmark_docx2txt      2.1684 (1.89)     7.5298 (1.21)     2.2469 (1.88)     0.3941 (1.27)     2.2044 (1.89)     0.0231 (1.62)         2;41  445.0671 (0.53)        365           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

GitHub Actions, Python 3.12:

```
----------------------------------------------------------------------------------- benchmark: 2 tests -----------------------------------------------------------------------------------
Name (time in ms)               Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2     1.5368 (1.0)       8.6408 (1.0)      1.6104 (1.0)      0.4961 (1.0)      1.5697 (1.0)      0.0349 (1.0)          3;11  620.9509 (1.0)         565           1
test_benchmark_docx2txt      3.0235 (1.97)     10.1797 (1.18)     3.1365 (1.95)     0.5956 (1.20)     3.0822 (1.96)     0.0356 (1.02)         2;10  318.8220 (0.51)        279           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

Disclaimer: More thorough benchmarking could be conducted. This is a faster lib in general but I haven't tested edge cases.
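
As a quick sanity check on the "just under 2x" reading, the speedup can be recomputed from the Mean column of the two tables above. This is only arithmetic on the reported numbers; the dictionary layout and the `speedup` helper are illustrative choices, not part of the benchmark suite.

```python
# Mean times in milliseconds, copied from the Mean column of the tables above.
mean_ms = {
    "MacBook": {"docx2txt2": 1.1949, "docx2txt": 2.2469},
    "GitHub Actions": {"docx2txt2": 1.6104, "docx2txt": 3.1365},
}


def speedup(times: dict) -> float:
    """How many times faster docx2txt2 is than docx2txt, by mean time."""
    return times["docx2txt"] / times["docx2txt2"]


for environment, times in mean_ms.items():
    print(f"{environment}: {speedup(times):.2f}x")
# MacBook: 1.88x
# GitHub Actions: 1.95x
```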


### Also see:
*  [pptx2txt2](https://github.com/GitToby/pptx2txt2) for pptx/odp conversion

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "docx2txt2",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Toby Devlin <toby@tobydevlin.com>",
    "keywords": "word,docx,odt,extract,text,images",
    "author": "",
    "author_email": "Toby Devlin <toby@tobydevlin.com>",
    "download_url": "https://files.pythonhosted.org/packages/d1/fc/c07c6013a66b74f428a1ec841d8898f10fd4b387f98bb0ae98789e908edd/docx2txt2-1.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Toby Devlin  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Extract text from .docx and .odt files to strings in pure python.",
    "version": "1.0.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/GitToby/docx2txt2/issues",
        "Repository": "https://github.com/GitToby/docx2txt2.git"
    },
    "split_keywords": [
        "word",
        "docx",
        "odt",
        "extract",
        "text",
        "images"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "eabd19e106b5e5225d9214445fc0dbdf2600279f359c9b5fb5aca54c267cfba7",
                "md5": "1ed5a8f9b57278c30e8891a3f23a473f",
                "sha256": "59c3ea13eaf15613224b7912c241fca455ba16abe93e493e6c9e05c8e59d17fa"
            },
            "downloads": -1,
            "filename": "docx2txt2-1.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1ed5a8f9b57278c30e8891a3f23a473f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 6524,
            "upload_time": "2024-03-13T22:36:41",
            "upload_time_iso_8601": "2024-03-13T22:36:41.257895Z",
            "url": "https://files.pythonhosted.org/packages/ea/bd/19e106b5e5225d9214445fc0dbdf2600279f359c9b5fb5aca54c267cfba7/docx2txt2-1.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d1fcc07c6013a66b74f428a1ec841d8898f10fd4b387f98bb0ae98789e908edd",
                "md5": "d4c00f9cc13e12aed8d5063f1199a443",
                "sha256": "62e3c508726f668a21bc2cfa4c376714c9074edced492a9b9760ed0dafb20db5"
            },
            "downloads": -1,
            "filename": "docx2txt2-1.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "d4c00f9cc13e12aed8d5063f1199a443",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 1205246,
            "upload_time": "2024-03-13T22:36:43",
            "upload_time_iso_8601": "2024-03-13T22:36:43.807410Z",
            "url": "https://files.pythonhosted.org/packages/d1/fc/c07c6013a66b74f428a1ec841d8898f10fd4b387f98bb0ae98789e908edd/docx2txt2-1.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-13 22:36:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "GitToby",
    "github_project": "docx2txt2",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "docx2txt2"
}
        