msword


- Name: msword
- Version: 0.0.5
- Home page: https://github.com/thorwhalen/msword
- Summary: Simple mapping view to docx (Word Doc) elements
- Upload time: 2025-02-03 15:06:16
- Author: Thor Whalen
- License: apache-2.0
- Keywords: docx, doc file, microsoft word, msword
            
# msword: local files Manager(s)

Simple mapping view to docx (Word Doc) elements

To install: `pip install msword`


- [Installation](#installation)
- [Quick Start](#quick-start)
- [LocalDocxTextStore](#localdocxtextstore)
- [LocalDocxStore](#localdocxstore)
- [Package Architecture](#package-architecture)
  - [Main Classes](#main-classes)
  - [Helper Functions](#helper-functions)
  - [Mapping Wrappers](#mapping-wrappers)
- [Mermaid Graphs](#mermaid-graphs)
- [License](#license)

---

## Quick Start

For users who just want to extract text from a collection of local MS Word documents, the simplest approach is to use the `LocalDocxTextStore`. The following example demonstrates how to create a text store from a folder containing MS Word files and access the text content of a document:

```python
from msword import LocalDocxTextStore
from msword.tests.util import test_data_dir  # Directory with test data

# Create a text store that extracts and returns text from MS Word documents.
docs_text_content = LocalDocxTextStore(test_data_dir)

# List the available document keys (filtered to valid MS Word files).
print(sorted(docs_text_content))

# Access the text of a specific document.
print(docs_text_content['simple.docx'])
```

For users needing the full `docx.Document` objects for more advanced processing (e.g., modifying document structure, styling, etc.), use the `LocalDocxStore`:

```python
from msword import LocalDocxStore
from msword.tests.util import test_data_dir
import docx

store = LocalDocxStore(test_data_dir)
doc = store['with_doc_extension.doc']
assert isinstance(doc, docx.document.Document)
print(doc.paragraphs[0].text)
```


## LocalDocxTextStore

A local file store whose values are the text extracted from the documents.
Use this when you just want the text contents of the documents.
If you want more, use `LocalDocxStore` with the appropriate content extractor
(i.e. the `obj_of_data` function in a `dol.wrap_kvs` wrapper).

Note: This store filters for valid msword extensions (`.doc` and `.docx`).
To NOT filter for valid extensions, use ``AllLocalFilesDocxTextStore`` instead.

```python
>>> from msword import LocalDocxTextStore, test_data_dir
>>> import docx
>>> s = LocalDocxTextStore(test_data_dir)
>>> assert {'with_doc_extension.doc', 'simple.docx'}.issubset(s)
>>> v = s['simple.docx']
>>> assert isinstance(v, str)
>>> print(v)
Just a bit of text to show that is works. Another sentence.
This is after a newline.
<BLANKLINE>
This is after two newlines.
```

## LocalDocxStore

A local file store whose values are docx objects.
Note: This store filters for valid msword extensions (`.doc` and `.docx`).
To NOT filter for valid extensions, use ``AllLocalFilesDocxStore`` instead.

```python
>>> from msword import LocalDocxStore, test_data_dir
>>> import docx
>>> s = LocalDocxStore(test_data_dir)
>>> assert {'with_doc_extension.doc', 'simple.docx'}.issubset(s)
>>> v = s['with_doc_extension.doc']
>>> assert isinstance(v, docx.document.Document)
```

What does a ``docx.document.Document`` have to offer?
If you really want to get into it, see here: https://python-docx.readthedocs.io/en/latest/

Meanwhile, we'll give a few examples here as an amuse-bouche.

```python
>>> ddir = lambda x: set([xx for xx in dir(x) if not xx.startswith('_')])  # to see what an object has
>>> assert ddir(v).issuperset({
...     'add_heading', 'add_page_break', 'add_paragraph', 'add_picture', 'add_section', 'add_table',
...     'core_properties', 'element', 'inline_shapes', 'paragraphs', 'part',
...     'save', 'sections', 'settings', 'styles', 'tables'
... })
```

``paragraphs`` is where the main content is, so let's have a look at what it has.

```python
>>> len(v.paragraphs)
21
>>> paragraph = v.paragraphs[0]
>>> assert ddir(paragraph).issuperset({
...     'add_run', 'alignment', 'clear', 'insert_paragraph_before',
...     'paragraph_format', 'part', 'runs', 'style', 'text'
... })
>>> paragraph.text
'Section 1'
>>> assert ddir(paragraph.style).issuperset({
...     'base_style', 'builtin', 'delete', 'element', 'font', 'hidden', 'locked', 'name', 'next_paragraph_style',
...     'paragraph_format', 'part', 'priority', 'quick_style', 'style_id', 'type', 'unhide_when_used'
... })
>>> paragraph.style.style_id
'Heading1'
>>> paragraph.style.font.color.rgb
RGBColor(0x2f, 0x54, 0x96)
```

You get the point...

If you're only interested in one particular aspect of the documents, you should use your favorite
`dol` wrappers to get the store you really want. For example:

```python
>>> from dol import wrap_kvs
>>> ss = wrap_kvs(s, obj_of_data=lambda doc: [paragraph.style.style_id for paragraph in doc.paragraphs])
>>> assert ss['with_doc_extension.doc'] == [
...     'Heading1', 'Normal', 'Normal', 'Heading2', 'Normal', 'Normal',
...     'Heading1', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal',
...     'ListParagraph', 'ListParagraph', 'Normal', 'Normal', 'ListParagraph', 'ListParagraph', 'Normal'
... ]
```

The most common use case is probably getting text, not styles, out of a document.
It's so common that we've done the wrapping for you:
just use the already wrapped `LocalDocxTextStore` for that purpose.


---

## Package Architecture

The module is built around a set of core classes and helper functions. Its functionality is entirely defined by the combination of a base file store (from `dol.Files`) and a set of decorators that wrap the store to transform its values.

### Main Classes

- **AllLocalFilesDocxStore**  
  A wrapper around a local file store that returns file contents as `docx.Document` objects. This class does not filter file extensions, meaning that non-MS Word files might cause errors.

- **AllLocalFilesDocxTextStore**  
  Inherits from `AllLocalFilesDocxStore` and applies a text extraction function (`get_text_from_docx`). It returns the concatenated text of all paragraphs in the document.

- **LocalDocxStore**  
  Extends `AllLocalFilesDocxStore` and uses the `only_files_with_msword_extension` decorator to filter out files that do not have valid MS Word extensions (i.e., `.doc` or `.docx`).

- **LocalDocxTextStore**  
  Extends `AllLocalFilesDocxTextStore` and applies the same file filtering as `LocalDocxStore`, ensuring that only valid MS Word files are processed.
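
The layering of these four classes can be sketched with plain class decorators. This is a minimal, stdlib-only illustration under stated assumptions, not the package's source: `BytesStoreSketch` stands in for `dol.Files`, the two decoders are toy stand-ins for the bytes-to-`docx.Document` and document-to-text steps, and `decode_values`, `only_keys`, and every name ending in `Sketch` are hypothetical.

```python
from collections.abc import Mapping


class BytesStoreSketch(Mapping):
    """Hypothetical stand-in for dol.Files: maps file names to raw bytes."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, k):
        return self._data[k]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def decode_values(decode):
    """Class decorator: apply `decode` to every value read from the store."""

    def decorator(cls):
        class Wrapped(cls):
            def __getitem__(self, k):
                return decode(super().__getitem__(k))

        Wrapped.__name__ = cls.__name__
        return Wrapped

    return decorator


def only_keys(key_filter):
    """Class decorator: expose only the keys accepted by `key_filter`."""

    def decorator(cls):
        class Wrapped(cls):
            def __getitem__(self, k):
                if not key_filter(k):
                    raise KeyError(k)
                return super().__getitem__(k)

            def __iter__(self):
                return filter(key_filter, super().__iter__())

            def __len__(self):
                return sum(1 for _ in self)

        Wrapped.__name__ = cls.__name__
        return Wrapped

    return decorator


def is_word_file(k):
    """Accept keys ending in .doc or .docx."""
    return k.endswith(('.doc', '.docx'))


@decode_values(lambda b: b.decode())  # toy stand-in for bytes -> document
class AllFilesDocStoreSketch(BytesStoreSketch):
    pass


@decode_values(str.upper)  # toy stand-in for document -> text
class AllFilesTextStoreSketch(AllFilesDocStoreSketch):
    pass


@only_keys(is_word_file)
class LocalDocStoreSketch(AllFilesDocStoreSketch):
    pass


@only_keys(is_word_file)
class LocalTextStoreSketch(AllFilesTextStoreSketch):
    pass


files = {'a.docx': b'hello', 'b.doc': b'world', 'notes.txt': b'skip me'}
text_store = LocalTextStoreSketch(files)
print(sorted(text_store))
print(text_store['a.docx'])
```

The unfiltered classes still list `notes.txt`, which is why the text above warns that non-MS Word files might cause errors when their bytes reach a real document parser.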

### Helper Functions

- **_extension(k: str)**  
  Splits a filename and returns its extension.

- **has_msword_extension(k: str)**  
  Checks if a filename has a valid MS Word extension (`doc` or `docx`).

- **_remove_docx_extension(k: str) & _add_docx_extension(k: str)**  
  Utilities to remove or add the default `.docx` extension to keys.

- **paragraphs_text(doc)**  
  Yields the text from each paragraph of a `docx.Document`.

- **get_text_from_docx(doc, paragraph_sep='\n')**  
  Concatenates all paragraph texts from a document using the specified separator.

- **bytes_to_doc(doc_bytes: bytes)**  
  Converts raw bytes into a `docx.Document` using an in-memory buffer.
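
To make the descriptions above concrete, here is a hedged sketch of what such helpers could look like. These are illustrative re-implementations written from the one-line descriptions, so the package's real code may differ in details such as case handling. The text helpers only assume an object with a `.paragraphs` sequence whose items have a `.text` attribute, so a stand-in object is used in the demo; `bytes_to_doc` imports `python-docx` lazily and is not exercised here.

```python
import os
from io import BytesIO
from types import SimpleNamespace

MSWORD_EXTENSIONS = {'doc', 'docx'}  # assumed set of valid extensions


def extension(k):
    """Return the extension of a file name, without the leading dot."""
    _, ext = os.path.splitext(k)
    return ext[1:]


def has_msword_extension(k):
    """True if the file name ends in one of the MS Word extensions."""
    return extension(k) in MSWORD_EXTENSIONS


def paragraphs_text(doc):
    """Yield the text of each paragraph of a document-like object."""
    for paragraph in doc.paragraphs:
        yield paragraph.text


def get_text_from_docx(doc, paragraph_sep='\n'):
    """Join all paragraph texts with the given separator."""
    return paragraph_sep.join(paragraphs_text(doc))


def bytes_to_doc(doc_bytes):
    """Parse raw bytes into a docx document via an in-memory buffer."""
    import docx  # python-docx, imported lazily so the rest runs without it

    return docx.Document(BytesIO(doc_bytes))


# Stand-in for a docx document: anything with .paragraphs[i].text works.
fake_doc = SimpleNamespace(
    paragraphs=[SimpleNamespace(text='Section 1'), SimpleNamespace(text='Body')]
)
print(has_msword_extension('simple.docx'))
print(get_text_from_docx(fake_doc, paragraph_sep=' / '))
```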

### Mapping Wrappers

- **with_bytes_to_doc_decoding**  
  Wraps a mapping to convert byte values into `docx.Document` objects.

- **with_doc_to_text_decoding**  
  Wraps a mapping to convert `docx.Document` objects into text.

- **with_bytes_to_text_decoding**  
  Combines the two above by first converting bytes into a document and then extracting its text.

- **only_files_with_msword_extension**  
  Filters the keys of a store so that only those with valid MS Word extensions are processed.
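
The relationship between the three decoding wrappers can be illustrated at the instance level. The sketch below is an assumption-laden toy, not the package's implementation: `DecodedView`, `with_decoding`, and `compose` are hypothetical names, a "document" is modelled as a list of paragraph strings, and the two decoders are stand-ins. It shows that chaining the bytes-to-document and document-to-text wrappers gives the same values as a single wrapper built from the composed decoder.

```python
from collections.abc import Mapping


class DecodedView(Mapping):
    """Read-only view of `store` whose values are passed through `decode`."""

    def __init__(self, store, decode):
        self._store = store
        self._decode = decode

    def __getitem__(self, k):
        return self._decode(self._store[k])

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


def with_decoding(decode):
    """Return a function that wraps a mapping so its values are decoded."""
    return lambda store: DecodedView(store, decode)


def compose(*funcs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""

    def composed(x):
        for func in funcs:
            x = func(x)
        return x

    return composed


def toy_bytes_to_doc(b):
    """Stand-in decoder: a 'document' is a list of paragraph strings."""
    return b.decode().split('\n')


def toy_doc_to_text(doc):
    """Stand-in extractor: join paragraphs with a visible separator."""
    return ' | '.join(doc)


with_bytes_to_doc = with_decoding(toy_bytes_to_doc)
with_doc_to_text = with_decoding(toy_doc_to_text)
with_bytes_to_text = with_decoding(compose(toy_bytes_to_doc, toy_doc_to_text))

raw = {'a.docx': b'first\nsecond'}
two_step = with_doc_to_text(with_bytes_to_doc(raw))
one_step = with_bytes_to_text(raw)
print(two_step['a.docx'] == one_step['a.docx'])
```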

---

## Mermaid Graphs

### Overall Object Relationships

```mermaid
flowchart TD
    A[dol.Files]
    B[AllLocalFilesDocxStore]
    C[AllLocalFilesDocxTextStore]
    D[LocalDocxStore]
    E[LocalDocxTextStore]

    A --> B
    B --> C
    B --> D
    C --> E

    subgraph Helper Functions
        F[bytes_to_doc]
        G[get_text_from_docx]
        H[has_msword_extension]
        I[only_files_with_msword_extension]
    end

    F --> B
    G --> C
    H --> I
    I --> D
    I --> E
```

### Mapping Wrappers Pipeline

```mermaid
flowchart LR
    RawBytes[Raw Bytes]
    BytesToDoc[with_bytes_to_doc_decoding]
    Doc[docx.Document]
    DocToText[with_doc_to_text_decoding]
    Text[get_text_from_docx]
    BytesToText[with_bytes_to_text_decoding]

    RawBytes --> BytesToDoc --> Doc
    Doc --> DocToText --> Text
    RawBytes --> BytesToText --> Text
```

---

## License

This module is distributed under the terms of the Apache 2.0 license.


            
