html-for-docx


- Name: html-for-docx
- Version: 1.0.9
- Home page: https://github.com/dfop02/html4docx
- Summary: Convert HTML to Docx easily and fastly
- Upload time: 2025-07-18 15:34:23
- Maintainer: Diogo Fernandes
- Author: Diogo Fernandes
- Requires Python: >=3.7
- License: MIT
- Keywords: html, docx, docs, office, word, convert, transform
- Requirements: beautifulsoup4, python-docx

# HTML FOR DOCX
![Tests](https://github.com/dfop02/html4docx/actions/workflows/tests.yml/badge.svg)
![Version](https://img.shields.io/pypi/v/html-for-docx.svg)
![Supported Versions](https://img.shields.io/pypi/pyversions/html-for-docx.svg)

Convert HTML to DOCX. This project is a fork of the discontinued [pqzx/html2docx](https://github.com/pqzx/html2docx).

## How to install

`pip install html-for-docx`

## Usage

#### The basic usage

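Add formatted HTML to an existing Docx document:

```python
from docx import Document
from html4docx import HtmlToDocx

# add_html_to_document expects a python-docx Document, so load the file first
document = Document(filename_docx)
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
document.save(filename_docx)
```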

You can also use `python-docx` to manipulate the file. Here is an example:

```python
from docx import Document
from html4docx import HtmlToDocx

document = Document()
new_parser = HtmlToDocx()

html_string = '<h1>Hello world</h1>'
new_parser.add_html_to_document(html_string, document)

document.save('your_file_name.docx')
```

#### Convert files directly

```python
from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
new_parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')
```

#### Convert from a string

```python
from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
docx = new_parser.parse_html_string(input_html_file_string)
```

#### Change table styles

Tables are not styled by default. Set the `table_style` attribute on the parser before converting the HTML to choose a table style. The same style is applied to all tables.

```python
from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.table_style = 'Light Shading Accent 4'
docx = new_parser.parse_html_string(input_html_file_string)
```

To add borders to tables, use the `Table Grid` style:

```python
new_parser.table_style = 'Table Grid'
```

All table styles we support can be found [here](https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html#table-styles-in-default-template).

#### Metadata

You're able to read or set docx metadata:

```python
from docx import Document
from html4docx import HtmlToDocx

document = Document()
new_parser = HtmlToDocx()
new_parser.set_initial_attrs(document)
metadata = new_parser.metadata

# You can get the metadata as a dict
metadata_json = metadata.get_metadata()
print(metadata_json['author'])  # the author currently stored in the document
# or just print all metadata if you want
metadata.get_metadata(print_result=True)

# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")
document.save('your_file_name.docx')
```

You can find all available metadata attributes [here](https://python-docx.readthedocs.io/en/latest/dev/analysis/features/coreprops.html).

### Why

My goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. Instead of creating a new package from scratch, I preferred to update this one.

### Differences (fixes and new features)

**Fixes**
- Fix `table_style` not working | [Dfop02](https://github.com/dfop02) from [Issue](https://github.com/dfop02/html4docx/issues/11)
- Handle missing run for leading br tag | [dashingdove](https://github.com/dashingdove) from [PR](https://github.com/pqzx/html2docx/pull/53)
- Fix base64 images | [djplaner](https://github.com/djplaner) from [Issue](https://github.com/pqzx/html2docx/issues/28#issuecomment-1052736896)
- Handle img tag without src attribute | [johnjor](https://github.com/johnjor) from [PR](https://github.com/pqzx/html2docx/pull/63)
- Fix bug when any style has `!important` | [Dfop02](https://github.com/dfop02)
- Fix 'style lookup by style_id is deprecated.' | [Dfop02](https://github.com/dfop02)
- Fix `background-color` not working | [Dfop02](https://github.com/dfop02)
- Fix crashes when img or bookmark is created without paragraph | [Dfop02](https://github.com/dfop02)
- Fix Ordered and Unordered Lists | [TaylorN15](https://github.com/TaylorN15) from [PR](https://github.com/dfop02/html4docx/pull/16)

**New Features**
- Add Width/Height style to images | [maifeeulasad](https://github.com/maifeeulasad) from [PR](https://github.com/pqzx/html2docx/pull/29)
- Support px, cm, pt, in, rem, em, mm, pc and % units for styles | [Dfop02](https://github.com/dfop02)
- Improve performance on large tables | [dashingdove](https://github.com/dashingdove) from [PR](https://github.com/pqzx/html2docx/pull/58)
- Support for HTML Pagination | [Evilran](https://github.com/Evilran) from [PR](https://github.com/pqzx/html2docx/pull/39)
- Support Table style | [Evilran](https://github.com/Evilran) from [PR](https://github.com/pqzx/html2docx/pull/39)
- Support alternative encoding | [HebaElwazzan](https://github.com/HebaElwazzan) from [PR](https://github.com/pqzx/html2docx/pull/59)
- Support colors by name | [Dfop02](https://github.com/dfop02)
- Support `font-size` given as a keyword, e.g. small, medium, etc. | [Dfop02](https://github.com/dfop02)
- Support for internal links (anchors) | [Dfop02](https://github.com/dfop02)
- Support for rowspan and colspan in tables | [Dfop02](https://github.com/dfop02) from [Issue](https://github.com/dfop02/html4docx/issues/25)
- Support for 'vertical-align' in table cells | [Dfop02](https://github.com/dfop02)
- Support for metadata | [Dfop02](https://github.com/dfop02)
- Add support for table cell styles (border, background-color, width, height, margin) | [Dfop02](https://github.com/dfop02)
- Support inline images in the same paragraph | [Dfop02](https://github.com/dfop02)
- Refactor tests to be more consistent and rely less on 'human validation' | [Dfop02](https://github.com/dfop02)
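
To give an idea of what supporting the units listed above (px, cm, pt, in, rem, em, mm, pc and %) involves, here is a standalone sketch that converts a CSS length to points, the unit Word uses for sizes. This is not the package's implementation: the function name `css_length_to_points`, the 12pt base font size for `em`/`rem`, and the `percent_of_pt` parameter are all assumptions made for illustration.

```python
import re

# Points per unit for absolute CSS units (CSS defines 1in = 96px = 72pt)
POINTS_PER_UNIT = {
    "pt": 1.0,
    "px": 72 / 96,
    "in": 72.0,
    "cm": 72 / 2.54,
    "mm": 72 / 25.4,
    "pc": 12.0,
}

_LENGTH_RE = re.compile(
    r"^\s*(-?\d*\.?\d+)\s*(px|cm|pt|in|rem|em|mm|pc|%)\s*$", re.IGNORECASE
)


def css_length_to_points(value, base_font_pt=12.0, percent_of_pt=None):
    """Convert a CSS length such as '10px' or '1.5em' to points.

    Hypothetical helper: em/rem are resolved against base_font_pt, and
    percentages against percent_of_pt (falling back to base_font_pt).
    """
    match = _LENGTH_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported CSS length: {value!r}")
    number, unit = float(match.group(1)), match.group(2).lower()
    if unit in ("em", "rem"):
        return number * base_font_pt
    if unit == "%":
        reference = base_font_pt if percent_of_pt is None else percent_of_pt
        return number / 100 * reference
    return number * POINTS_PER_UNIT[unit]


print(css_length_to_points("96px"))  # 72.0
print(css_length_to_points("1in"))   # 72.0
print(css_length_to_points("2em"))   # 24.0
```

Relative units need a reference value, which is why the sketch takes the base font size and the percentage reference as parameters instead of hard-coding them.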

## Known Issues

- **Maximum Nesting Depth:** Ordered lists support up to 3 nested levels. Any additional depth beyond level 3 will be treated as level 3.
- **Counter Reset Behavior:**
  - At level 1, starting a new ordered list will reset the counter.
  - At levels 2 and 3, the counter will continue from the previous item unless explicitly reset.
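
The numbering rules above can be illustrated with a small simulation. This is only a model of the documented behavior, not code from the package: the function `number_items` and its input format (one `(depth, item_count)` pair per `<ol>`, in document order) are hypothetical, and explicit counter resets at levels 2 and 3 are not modeled.

```python
MAX_LEVEL = 3  # deeper lists are treated as level 3


def number_items(lists):
    """Return (level, number) for every item of the given ordered lists.

    lists: iterable of (depth, item_count) pairs, one per <ol>, in
    document order. Hypothetical input format used for illustration.
    """
    counters = {level: 0 for level in range(1, MAX_LEVEL + 1)}
    numbered = []
    for depth, item_count in lists:
        # Clamp the nesting depth to the supported range 1..3
        level = min(max(depth, 1), MAX_LEVEL)
        if level == 1:
            # A new top-level list restarts its counter
            counters[1] = 0
        # Levels 2 and 3 keep counting from the previous item
        for _ in range(item_count):
            counters[level] += 1
            numbered.append((level, counters[level]))
    return numbered


result = number_items([(1, 2), (2, 2), (2, 1), (5, 1), (1, 1)])
print(result)
# [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 1), (1, 1)]
```

In this run the second level-2 list continues at 3 instead of restarting at 1, the depth-5 list is numbered as level 3, and the final top-level list starts again at 1.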

## Project Guidelines

This project is primarily designed for compatibility with Microsoft Word, but it currently works well with LibreOffice and Google Docs, based on our testing. The goal is to maintain this cross-platform harmony while continuing to implement fixes and updates.

> ⚠️ However, please note that Microsoft Word is the priority. Bugs or issues specific to other editors (e.g., LibreOffice or Google Docs) may be considered, but fixing them is secondary to maintaining full compatibility with Word.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/dfop02/html4docx",
    "name": "html-for-docx",
    "maintainer": "Diogo Fernandes",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "dfop02@hotmail.com",
    "keywords": "html, docx, docs, office, word, convert, transform",
    "author": "Diogo Fernandes",
    "author_email": "diogofernandesop@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/e8/78/4e07ae08768693a88046dbc24e37e0a88593387927d82bda08f8f879aeba/html_for_docx-1.0.9.tar.gz",
    "platform": "any",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convert HTML to Docx easily and fastly",
    "version": "1.0.9",
    "project_urls": {
        "Bug Tracker": "https://github.com/dfop02/html4docx/issues",
        "Changelog": "https://github.com/dfop02/html4docx/blob/master/HISTORY.rst",
        "Download": "https://github.com/dfop02/html4docx/archive/1.0.9.tar.gz",
        "Homepage": "https://github.com/dfop02/html4docx",
        "Repository": "https://github.com/dfop02/html4docx"
    },
    "split_keywords": [
        "html",
        " docx",
        " docs",
        " office",
        " word",
        " convert",
        " transform"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4e75d779ff2490a027a48299d083e4ffe1f7b449c6d2a08b366efc692c779f20",
                "md5": "fe5e1d153c75d694a5af3e9e70f58227",
                "sha256": "5f61747e28a172433725080f0f4ccaba68edc54cb757a9f41bc4be7181e98fe1"
            },
            "downloads": -1,
            "filename": "html_for_docx-1.0.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fe5e1d153c75d694a5af3e9e70f58227",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 28111,
            "upload_time": "2025-07-18T15:34:22",
            "upload_time_iso_8601": "2025-07-18T15:34:22.305868Z",
            "url": "https://files.pythonhosted.org/packages/4e/75/d779ff2490a027a48299d083e4ffe1f7b449c6d2a08b366efc692c779f20/html_for_docx-1.0.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e8784e07ae08768693a88046dbc24e37e0a88593387927d82bda08f8f879aeba",
                "md5": "dc7da9e734c736d0284b4508aed9d4cb",
                "sha256": "7507d33f166568d7ab9706f3bf7ece8cdc49331a8e006ed91bffbfa18eabf695"
            },
            "downloads": -1,
            "filename": "html_for_docx-1.0.9.tar.gz",
            "has_sig": false,
            "md5_digest": "dc7da9e734c736d0284b4508aed9d4cb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 29556,
            "upload_time": "2025-07-18T15:34:23",
            "upload_time_iso_8601": "2025-07-18T15:34:23.413564Z",
            "url": "https://files.pythonhosted.org/packages/e8/78/4e07ae08768693a88046dbc24e37e0a88593387927d82bda08f8f879aeba/html_for_docx-1.0.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-18 15:34:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dfop02",
    "github_project": "html4docx",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.12.2"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    ">=",
                    "1.1.0"
                ]
            ]
        }
    ],
    "lcname": "html-for-docx"
}
        