pptx2txt2


- Name: pptx2txt2
- Version: 1.1.0
- Summary: Extract text from .pptx and .odp files to strings in pure python.
- Upload time: 2024-03-15 00:39:58
- Docs URL: None
- Requires Python: >=3.8
- License: MIT License Copyright (c) 2024 Toby Devlin Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Keywords: word, pptx, odp, extract, text, images
- Requirements: No requirements were recorded.
- Travis-CI: No Travis.
- Coveralls test coverage: No coveralls.
# pptx2txt2

> Extract text from .pptx and .odp files to strings in pure python.

[![codecov](https://codecov.io/gh/GitToby/pptx2txt2/graph/badge.svg?token=OW9957N278)](https://codecov.io/gh/GitToby/pptx2txt2)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/GitToby/pptx2txt2/lint-and-test.yaml)](https://github.com/GitToby/pptx2txt2/actions/workflows/lint-and-test.yaml)
[![GitHub file size in bytes](https://img.shields.io/github/size/GitToby/pptx2txt2/src%2Fpptx2txt2%2F__init__.py)](https://github.com/GitToby/pptx2txt2/blob/master/src/pptx2txt2/__init__.py)
[![PyPI - License](https://img.shields.io/pypi/l/pptx2txt2)](https://github.com/GitToby/pptx2txt2/blob/master/LICENSE.txt)
[![PyPI - Version](https://img.shields.io/pypi/v/pptx2txt2)](https://pypi.org/project/pptx2txt2/)
[![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2FGitToby%2Fpptx2txt2%2Fmaster%2Fpyproject.toml)](https://pypi.org/project/pptx2txt2/)

My personal replacement for [pptx2txt](https://github.com/shakiyam/pptx2txt).

It's intended to be very simple and provide some utilities to extract content similar to the original lib.


## Usage

Install with your fave package manager (anything that pulls from PyPI will work: pip, poetry, pdm, etc.)

```
pip install pptx2txt2
```

Use with a file path (a `str` or any [`PathLike`](https://docs.python.org/3/library/os.html#os.PathLike) object) or a binary IO stream such as `io.BytesIO`.

There are 3 functions:
- `extract_text_per_slide` returns a `dict[int, str]` of each slide's content and notes
- `extract_text` is a utility that joins all slide content into one string
- `extract_images` copies images to another directory

```python
import io
from pathlib import Path

import pptx2txt2

# String paths
text = pptx2txt2.extract_text("path/to/my.pptx")
text_per_slide = pptx2txt2.extract_text_per_slide("path/to/my.pptx")
image_paths = pptx2txt2.extract_images("path/to/my.pptx", "path/to/images/out")

# pathlib.Path objects
pptx_path = Path(__file__).parent / "my.pptx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the directory already exists

text2 = pptx2txt2.extract_text(pptx_path)
text_per_slide2 = pptx2txt2.extract_text_per_slide(pptx_path)
image_paths2 = pptx2txt2.extract_images(pptx_path, image_out)

# Byte streams
pptx_bytes = b"..."  # placeholder: the raw bytes of a .pptx file
bytes_io = io.BytesIO(pptx_bytes)
text3 = pptx2txt2.extract_text(bytes_io)
text_per_slide3 = pptx2txt2.extract_text_per_slide(bytes_io)
image_paths3 = pptx2txt2.extract_images(bytes_io, "path/to/images/out")
```
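Since `extract_text_per_slide` returns a plain `dict[int, str]`, ordinary dictionary code is enough to work with the result. As a small sketch, the hypothetical helper `find_slides` below looks up which slides mention a keyword. The `sample` dict is a hand-written stand-in for real output, and the way its keys are numbered is an assumption.

```python
def find_slides(text_per_slide: dict[int, str], keyword: str) -> list[int]:
    """Return the keys of slides whose text contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    return sorted(
        number
        for number, text in text_per_slide.items()
        if needle in text.casefold()
    )


# Stand-in for the result of pptx2txt2.extract_text_per_slide(...)
sample = {
    1: "Quarterly results",
    2: "Revenue by region",
    3: "Questions about revenue?",
}
matches = find_slides(sample, "revenue")
print(matches)
```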


## Considerations
- Unlike the original, whitespace and styling are not preserved; slide breaks, tabs and similar separators all become plain spaces.
- Headers and footers contain "<#>" or "<number>" where a number would appear; the original removed these.
- Text from pptx files contains a UUID wherever an image was.
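If these markers get in the way, they can be stripped after extraction. The following is a minimal sketch, not part of pptx2txt2: `clean_extracted_text` is a hypothetical helper, and it assumes the image markers are canonical 8-4-4-4-12 hexadecimal UUIDs, which the notes above do not confirm.

```python
import re

# Assumption: image markers are canonical 8-4-4-4-12 hex UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
# Slide-number placeholders left in headers and footers.
PLACEHOLDER_RE = re.compile(r"<#>|<number>")


def clean_extracted_text(text: str) -> str:
    """Drop UUID markers and number placeholders, then collapse whitespace."""
    text = UUID_RE.sub(" ", text)
    text = PLACEHOLDER_RE.sub(" ", text)
    return " ".join(text.split())


sample = "Agenda 123e4567-e89b-12d3-a456-426614174000 Slide <#> of <number> Thanks"
cleaned = clean_extracted_text(sample)
print(cleaned)
```

Genuine slide text that happens to contain a UUID or the literal string "<number>" would be removed too, so this kind of cleanup is best applied only when that is acceptable.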

## Benchmarks

Basic benchmarking using [pytest-benchmark](https://pytest-benchmark.readthedocs.io) with a basic test document on my M1 MacBook and on GitHub Actions.

MacBook:

```
------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)               Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2     2.4470  7.1815  2.5762  0.4344  2.4987  0.1050       2;7  388.1666     122           1
-------------------------------------------------------------------------------------------------------------------
```

GitHub Actions, Python 3.12:

```
------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)               Min      Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2     4.0548  11.4523  4.2387  0.8312  4.1343  0.0484      3;11  235.9197     217           1
--------------------------------------------------------------------------------------------------------------------
```
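To relate the two tables: the OPS column is the reciprocal of the mean time, so it can be recomputed from the Mean column. The short sketch below does that arithmetic using the mean values copied from the tables above. The function name is illustrative, and small differences from the printed OPS come from the means being rounded.

```python
def ops_per_second(mean_ms: float) -> float:
    """Convert a mean time per call in milliseconds to operations per second."""
    return 1000.0 / mean_ms


# Mean times (ms) copied from the benchmark tables above.
macbook_mean_ms = 2.5762
actions_mean_ms = 4.2387

macbook_ops = ops_per_second(macbook_mean_ms)
actions_ops = ops_per_second(actions_mean_ms)
slowdown = actions_mean_ms / macbook_mean_ms

print(f"MacBook: {macbook_ops:.1f} ops/s")
print(f"GitHub Actions: {actions_ops:.1f} ops/s")
print(f"GitHub Actions mean is {slowdown:.2f}x the MacBook mean")
```

On these figures the GitHub Actions runner takes roughly 1.6 times as long per extraction as the M1 MacBook.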

## Also See
- [docx2txt2](https://github.com/GitToby/docx2txt2) for docx conversion

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "pptx2txt2",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Toby Devlin <toby@tobydevlin.com>",
    "keywords": "word,pptx,odp,extract,text,images",
    "author": "",
    "author_email": "Toby Devlin <toby@tobydevlin.com>",
    "download_url": "https://files.pythonhosted.org/packages/20/12/fe9375794a8287fe7d477711fc1b4cea5941f869ed077f6809fccb36a8e5/pptx2txt2-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Toby Devlin  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Extract text from .pptx and .odp files to strings in pure python.",
    "version": "1.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/GitToby/pptx2txt2/issues",
        "Repository": "https://github.com/GitToby/pptx2txt2.git"
    },
    "split_keywords": [
        "word",
        "pptx",
        "odp",
        "extract",
        "text",
        "images"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "090e8d810dd84b90381ee4f29110a742e523609a35d8bb255c163d14322349b0",
                "md5": "423f9fea72daa2eabd37dfa9ea9389ee",
                "sha256": "0041b57f814039399b317367b5e0e54ca7d3ca5800b170fb906788116f834141"
            },
            "downloads": -1,
            "filename": "pptx2txt2-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "423f9fea72daa2eabd37dfa9ea9389ee",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 6163,
            "upload_time": "2024-03-15T00:39:56",
            "upload_time_iso_8601": "2024-03-15T00:39:56.452630Z",
            "url": "https://files.pythonhosted.org/packages/09/0e/8d810dd84b90381ee4f29110a742e523609a35d8bb255c163d14322349b0/pptx2txt2-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2012fe9375794a8287fe7d477711fc1b4cea5941f869ed077f6809fccb36a8e5",
                "md5": "00b4b7d20ef864a83548d32d68de25c7",
                "sha256": "fa71cf0799c60266c3ffa415bfb5b8c61364cb4814a92a4ce7e96901f7aecfcc"
            },
            "downloads": -1,
            "filename": "pptx2txt2-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "00b4b7d20ef864a83548d32d68de25c7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6664298,
            "upload_time": "2024-03-15T00:39:58",
            "upload_time_iso_8601": "2024-03-15T00:39:58.392417Z",
            "url": "https://files.pythonhosted.org/packages/20/12/fe9375794a8287fe7d477711fc1b4cea5941f869ed077f6809fccb36a8e5/pptx2txt2-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-15 00:39:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "GitToby",
    "github_project": "pptx2txt2",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pptx2txt2"
}
        