# Docx Parser and Converter 📄✨
A powerful library for converting DOCX documents into HTML and plain text, with detailed parsing of document properties and styles.
## Table of Contents
- [Introduction 🌟](#introduction-)
- [Project Overview 🛠️](#project-overview-)
- [Key Features 🌟](#key-features-)
- [Installation 💾](#installation-)
- [Usage 🚀](#usage-)
- [Quick Start Guide 📖](#quick-start-guide-)
- [Supported XML Parsing Types 📄](#supported-xml-parsing-types-)
- [General Code Flow 🔄](#general-code-flow-)
- [Conversion Table of DOCX XML Elements to HTML](#conversion-table-of-docx-xml-elements-to-html)
- [Examples 📚](#examples-)
- [API Reference 📜](#api-reference-)
## Introduction 🌟
Welcome to the Docx Parser and Converter project! This library converts DOCX documents into HTML and plain text, and exposes their detailed properties and styles as Pydantic models.
## Project Overview 🛠️
The project is structured to parse DOCX files, convert their content into structured data using Pydantic models, and provide conversion utilities to transform this data into HTML or plain text.
## Key Features 🌟
- Convert DOCX documents to HTML or plain text.
- Parse and extract detailed document properties and styles.
- Structured data representation using Pydantic models.
## Installation 💾
Install the library with pip:
```sh
pip install docx-parser-converter
```
## Usage 🚀
### Importing the Library
To start using the library, import the necessary modules:
```python
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter
from docx_parser_converter.docx_to_txt.docx_to_txt_converter import DocxToTxtConverter
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
```
### Quick Start Guide 📖
1. **Convert to HTML**:
```python
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path

docx_path = "path_to_your_docx_file.docx"
html_output_path = "output.html"

# Read the DOCX file as raw bytes
docx_file_content = read_binary_from_file_path(docx_path)

# Parse the document, render it to HTML, and write the result to disk
converter = DocxToHtmlConverter(docx_file_content, use_default_values=True)
html_output = converter.convert_to_html()
converter.save_html_to_file(html_output, html_output_path)
```
2. **Convert to Plain Text**:
```python
from docx_parser_converter.docx_to_txt.docx_to_txt_converter import DocxToTxtConverter
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path

docx_path = "path_to_your_docx_file.docx"
txt_output_path = "output.txt"

# Read the DOCX file as raw bytes
docx_file_content = read_binary_from_file_path(docx_path)

# Parse the document, render it to indented plain text, and write the result to disk
converter = DocxToTxtConverter(docx_file_content, use_default_values=True)
txt_output = converter.convert_to_txt(indent=True)
converter.save_txt_to_file(txt_output, txt_output_path)
```
## Supported XML Parsing Types 📄
The Docx Parser and Converter library supports parsing various XML components within a DOCX file. Below is a detailed list of the supported and unsupported components:
### Supported Components
1. **document.xml**:
   - **Document Parsing**: Parses the main document structure.
   - **Paragraphs**: Extracts paragraphs and their properties.
   - **Runs**: Extracts individual text runs within paragraphs.
   - **Tables**: Parses table structures and properties.
   - **Table Rows**: Extracts rows within tables.
   - **Table Cells**: Extracts cells within rows.
   - **List Items**: Handles both bulleted and numbered lists through paragraph properties.
2. **numbering.xml**:
   - **Numbering Definitions**: Parses numbering definitions and properties for lists.
   - **Numbering Levels**: Extracts different levels of numbering for nested lists.
3. **styles.xml**:
   - **Paragraph Styles**: Extracts styles applied to paragraphs.
   - **Run Styles**: Extracts styles applied to text runs.
   - **Table Styles**: Parses styles applied to tables and table elements.
   - **Default Styles**: Extracts default document styles for paragraphs, runs, and tables.
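
To make the `document.xml` items above concrete, here is a minimal sketch of pulling paragraphs and runs out of a WordprocessingML fragment using only the standard library. It is not the library's `DocumentParser`: the function name `extract_paragraphs`, the sample XML, and the returned dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# WordprocessingML main namespace used by document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample fragment (an assumption, not taken from a real file)
SAMPLE_DOCUMENT_XML = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>
      <w:r><w:t xml:space="preserve"> world</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def extract_paragraphs(document_xml: str) -> List[Dict]:
    """Return one dict per top-level w:p with its style id and text runs."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.findall("./w:body/w:p", NS):
        style = p.find("./w:pPr/w:pStyle", NS)
        runs = []
        for r in p.findall("./w:r", NS):
            text = "".join(t.text or "" for t in r.findall("./w:t", NS))
            # Simplification: treats any w:b element as bold and ignores w:val="false"
            bold = r.find("./w:rPr/w:b", NS) is not None
            runs.append({"text": text, "bold": bold})
        paragraphs.append({
            "style_id": style.get(f"{{{W_NS}}}val") if style is not None else None,
            "runs": runs,
        })
    return paragraphs


for paragraph in extract_paragraphs(SAMPLE_DOCUMENT_XML):
    print(paragraph["style_id"], [run["text"] for run in paragraph["runs"]])
```

The sketch skips tables and list properties entirely; the library itself maps these elements onto Pydantic models rather than plain dictionaries.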
### Unsupported Components
- **Images**: Parsing and extraction of images embedded within the document.
- **Headers and Footers**: Parsing of headers and footers content.
- **Footnotes and Endnotes**: Handling footnotes and endnotes within the document.
- **Comments**: Extraction and handling of comments.
- **Custom XML Parts**: Any custom XML parts beyond the standard DOCX schema.
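
Because unsupported components are skipped during conversion, it can help to check a file for them beforehand. The sketch below is one possible pre-flight check, assuming the part names Word conventionally writes (`word/media/`, `word/header1.xml`, `word/footnotes.xml`, and so on). The helpers `find_unsupported_parts` and `build_sample_docx` are hypothetical and are not part of this library.

```python
import io
import zipfile
from typing import Dict, List

# Conventional part-name prefixes (an assumption: a producer may name parts differently)
UNSUPPORTED_PREFIXES = {
    "word/media/": "images",
    "word/header": "headers",
    "word/footer": "footers",
    "word/footnotes.xml": "footnotes",
    "word/endnotes.xml": "endnotes",
    "word/comments.xml": "comments",
    "customXml/": "custom XML parts",
}


def find_unsupported_parts(docx_bytes: bytes) -> Dict[str, List[str]]:
    """Group the archive's part names by the unsupported feature they suggest."""
    found: Dict[str, List[str]] = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            for prefix, label in UNSUPPORTED_PREFIXES.items():
                if name.startswith(prefix):
                    found.setdefault(label, []).append(name)
    return found


def build_sample_docx() -> bytes:
    """Build a tiny in-memory archive that only mimics a DOCX part layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<doc/>")
        archive.writestr("word/styles.xml", "<styles/>")
        archive.writestr("word/header1.xml", "<hdr/>")
        archive.writestr("word/media/image1.png", b"")
    return buffer.getvalue()


print(find_unsupported_parts(build_sample_docx()))
```

Treat a hit as a hint rather than proof: a part such as `word/footnotes.xml` may be present even when the document shows no footnotes.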
## General Code Flow 🔄
The Docx Parser and Converter library follows a structured workflow: it parses the XML parts, merges properties and styles according to the DOCX specification, and converts the result. Here's a detailed overview of the technical process:
1. **Parsing XML Files**:
   - **Document XML Parsing**: The `DocumentParser` class reads and parses the `document.xml` file to extract the document structure, including paragraphs, tables, and runs. This data is converted into `DocumentSchema` Pydantic models.
   - **Numbering XML Parsing**: The `NumberingParser` class parses the `numbering.xml` file to extract numbering definitions and levels, converting them into `NumberingSchema` Pydantic models.
   - **Styles XML Parsing**: The `StylesParser` class parses the `styles.xml` file to extract styles for paragraphs, runs, and tables, converting them into `StylesSchema` Pydantic models.
2. **Property and Style Merging**:
   - **Hierarchical Style Application**: The library applies styles to paragraphs and runs based on a defined hierarchy. Explicit properties in the `DocumentSchema` remain unchanged, while styles are applied based on the `style_id` if present.
   - **Default Style Application**: If no specific `style_id` is present, default styles from `StyleDefaults` are applied. Finally, any remaining null properties are filled with `default_rpr` and `default_ppr` from the `StylesSchema`.
   - **Efficient Property Merging**: The `merge_properties` function is used to efficiently merge properties by converting Pydantic models to dictionaries, adding only non-null properties, and reassigning them to the original models.
3. **Conversion to HTML and TXT**:
   - **DOCX to HTML**:
     - The `DocxToHtmlConverter` class takes the parsed `DocumentSchema` and converts the document elements into HTML format.
     - Styles and properties are translated into equivalent HTML tags and CSS attributes.
     - The converted HTML content can be saved to a file using the `save_html_to_file` method.
     - **WYSIWYG Support**: The conversion maintains the visual representation of the document, ensuring accurate rendering of numbering, margins, and indentations as they appear in the original DOCX file.
   - **DOCX to TXT**:
     - The `DocxToTxtConverter` class converts the `DocumentSchema` into plain text format.
     - Paragraphs, lists, and tables are transformed into a readable plain text representation.
     - The converted text content can be saved to a file using the `save_txt_to_file` method.
     - **WYSIWYG Support**: The conversion preserves the structure of the document, maintaining numbering, margins, and indentations to ensure the text layout resembles the original document's format.
This process lets the library preserve the original document's structure and style as far as the supported components allow.
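
The merging order from step 2 (explicit properties first, then the referenced style, then document defaults) can be sketched with plain dataclasses standing in for the Pydantic models. This is an illustrative assumption about the layering, not the library's `merge_properties` implementation. `RunProps`, `fill_missing`, and `resolve` are hypothetical names.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunProps:
    """Stand-in for a run-properties model; None means 'not set at this level'."""
    bold: Optional[bool] = None
    font: Optional[str] = None
    size_pt: Optional[float] = None


def fill_missing(target: RunProps, source: RunProps) -> RunProps:
    """Return a copy of target whose None fields are filled from source."""
    merged = asdict(target)
    for key, value in asdict(source).items():
        if merged[key] is None and value is not None:
            merged[key] = value
    return RunProps(**merged)


def resolve(explicit: RunProps,
            style: Optional[RunProps],
            defaults: Optional[RunProps]) -> RunProps:
    """Apply fallbacks in priority order; values already set are never overwritten."""
    resolved = explicit
    for fallback in (style, defaults):
        if fallback is not None:
            resolved = fill_missing(resolved, fallback)
    return resolved


explicit = RunProps(bold=True)                              # set directly on the run
style = RunProps(bold=False, font="Calibri")                # looked up via a style id
defaults = RunProps(font="Times New Roman", size_pt=11.0)   # document defaults

print(resolve(explicit, style, defaults))
# RunProps(bold=True, font='Calibri', size_pt=11.0)
```

Each value comes from the highest-priority level that sets it: `bold` from the run itself, `font` from the style, and `size_pt` from the defaults.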
## Conversion Table of DOCX XML Elements to HTML
| XML Element | HTML Element | Notes |
|----------------|--------------------------------------|-----------------------------------------------------------------------|
| w:p | p | Paragraph element |
| w:r | span | Run element, used for inline text formatting |
| w:tbl | table | Table element |
| w:tr | tr | Table row |
| w:tc | td | Table cell |
| w:tblGrid | colgroup | Table grid, converted to colgroup for column definitions |
| w:gridCol | col | Grid column, converted to col for column width |
| w:tblPr | table | Table properties |
| w:tblW | table style="width:X%;" | Table width, converted using CSS `width` property |
| w:tblBorders | table style="border:X;" | Table borders, converted using CSS `border` property |
| w:tblCellMar | td style="padding:Xpt;" | Table cell margins, converted using CSS `padding` property |
| w:tblCellSpacing | table style="border-spacing:Xpt;" | Cell spacing, converted using CSS `border-spacing` property |
| w:b | b | Bold text |
| w:i | i | Italic text |
| w:u | span style="text-decoration:underline;" | Underline text, converted using CSS `text-decoration` property |
| w:color | span style="color:#RRGGBB;" | Text color, converted using CSS `color` property |
| w:sz | span style="font-size:Xpt;" | Text size, converted using CSS `font-size` property (in points) |
| w:jc           | p style="text-align:left\|center\|right\|justify;" | Text alignment, converted using CSS `text-align` property |
| w:ind | p style="margin-left:Xpt;" | Regular indent, converted using CSS `margin-left` property |
| w:ind | p style="text-indent:Xpt;" | Hanging/first-line indent, converted using CSS `text-indent` property |
| w:spacing | p style="line-height:X%;" | Line spacing, converted using CSS `line-height` property |
| w:highlight | span style="background-color:#RRGGBB;" | Text highlight, converted using CSS `background-color` property |
| w:shd | span style="background-color:#RRGGBB;" | Shading, converted using CSS `background-color` property |
| w:vertAlign | span style="vertical-align:super/sub;" | Vertical alignment, converted using CSS `vertical-align` property |
| w:pgMar | div style="padding: Xpt;" | Margins, converted using CSS `padding` property |
| w:rFonts | span style="font-family:'font-name';"| Font name, converted using CSS `font-family` property |
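
Several rows in the table involve a unit conversion: `w:sz` stores font size in half-points, and `w:ind` stores indents in twentieths of a point. The sketch below shows how a few rows could be turned into inline CSS under those rules. The helper names (`run_css`, `paragraph_css`) and the exact output format are assumptions, and the library's own converter may produce different markup.

```python
from typing import Dict, Optional

# w:jc values mapped to CSS text-align keywords ("both" means justified)
JC_TO_TEXT_ALIGN = {
    "left": "left", "start": "left",
    "center": "center",
    "right": "right", "end": "right",
    "both": "justify",
}


def half_points_to_pt(value: str) -> float:
    """w:sz is expressed in half-points."""
    return int(value) / 2


def twips_to_pt(value: str) -> float:
    """w:ind attributes are expressed in twentieths of a point."""
    return int(value) / 20


def css(declarations: Dict[str, str]) -> str:
    """Serialize declarations in insertion order as an inline style string."""
    return "".join(f"{name}:{value};" for name, value in declarations.items())


def run_css(sz: Optional[str] = None, color: Optional[str] = None,
            underline: bool = False) -> str:
    declarations: Dict[str, str] = {}
    if sz is not None:
        declarations["font-size"] = f"{half_points_to_pt(sz):g}pt"
    if color is not None and color != "auto":
        declarations["color"] = f"#{color}"
    if underline:
        declarations["text-decoration"] = "underline"
    return css(declarations)


def paragraph_css(jc: Optional[str] = None, ind_left: Optional[str] = None,
                  first_line: Optional[str] = None,
                  hanging: Optional[str] = None) -> str:
    declarations: Dict[str, str] = {}
    if jc in JC_TO_TEXT_ALIGN:
        declarations["text-align"] = JC_TO_TEXT_ALIGN[jc]
    if ind_left is not None:
        declarations["margin-left"] = f"{twips_to_pt(ind_left):g}pt"
    if hanging is not None:
        # A hanging indent pulls the first line back, hence the negative value
        declarations["text-indent"] = f"-{twips_to_pt(hanging):g}pt"
    elif first_line is not None:
        declarations["text-indent"] = f"{twips_to_pt(first_line):g}pt"
    return css(declarations)


print(run_css(sz="24", color="FF0000", underline=True))
# font-size:12pt;color:#FF0000;text-decoration:underline;
print(paragraph_css(jc="both", ind_left="720", hanging="360"))
# text-align:justify;margin-left:36pt;text-indent:-18pt;
```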
## Examples 📚
### Original DOCX File


### Converted to HTML


### Converted to Plain Text

## API Reference 📜
For detailed API documentation, please visit our [Read the Docs](https://docx-parser-and-converter.readthedocs.io/en/latest/) page.
Enjoy using Docx Parser and Converter! 🚀✨