extract-office-content


| Field | Value |
| --- | --- |
| Name | extract-office-content |
| Version | 0.0.7 |
| Home page | https://github.com/SWHL/ExtractOfficeText.git |
| Summary | Tool for extracting content from office files. |
| Upload time | 2023-07-16 14:19:44 |
| Author | SWHL |
| Requires Python | >=3.6,<3.12 |
| License | Apache-2.0 |
| Keywords | extract, office, text, content |
| Requirements | No requirements were recorded. |

## extract_office_content

<p>
    <a href=""><img src="https://img.shields.io/badge/Python->=3.7,<=3.10-aff.svg"></a>
    <a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a>
    <a href="https://pypi.org/project/extract_office_content/"><img alt="PyPI" src="https://img.shields.io/pypi/v/extract_office_content"></a>
    <a href="https://pepy.tech/project/extract_office_content"><img src="https://static.pepy.tech/personalized-badge/extract_office_content?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a>
</p>


### Use
1. Install `extract_office_content`.
   ```bash
   $ pip install extract_office_content
   ```
2. Run from the command line.
    - Extract the content of all office files.
        ```bash
        $ extract_office_content -h
        usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path

        positional arguments:
        file_path

        optional arguments:
        -h, --help            show this help message and exit
        -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

        $ extract_office_content tests/test_files
        ```
    - Extract Word.
        ```bash
        $ extract_word -h
        usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path

        positional arguments:
        word_path

        optional arguments:
        -h, --help            show this help message and exit
        -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

        $ extract_word tests/test_files/word_example.docx
        ```
    - Extract PPT.
        ```bash
        $ extract_ppt -h
        usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path

        positional arguments:
        ppt_path

        optional arguments:
        -h, --help            show this help message and exit
        -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

        $ extract_ppt tests/test_files/ppt_example.pptx
        ```
    - Extract Excel.
        ```bash
        $ extract_excel -h
        usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                            excel_path

        positional arguments:
        excel_path

        optional arguments:
        -h, --help            show this help message and exit
        -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
        -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

        $ extract_excel tests/test_files/excel_example.xlsx
        ```
3. Run from a Python script.
    - Extract all.
        ```python
        from pathlib import Path
        from extract_office_content import ExtractOfficeContent


        extracter = ExtractOfficeContent()

        file_list = list(Path('tests/test_files').iterdir())

        for file_path in file_list:
            res = extracter(file_path)
            print(res)
        ```
    - Extract Word.
        ```python
        from extract_office_content import ExtractWord

        word_extract = ExtractWord()

        word_path = 'tests/test_files/word_example.docx'
        text = word_extract(word_path, "outputs/word")

        # Alternatively, pass the file content as bytes
        with open(word_path, 'rb') as f:
            word_content = f.read()
        text = word_extract(word_content, "outputs/word")
        print(text)
        ```
    - Extract PPT.
        ```python
        from pathlib import Path

        from extract_office_content import ExtractPPT

        ppt_extracter = ExtractPPT()

        ppt_path = 'tests/test_files/ppt_example.pptx'
        save_dir = 'outputs'
        save_img_dir = Path(save_dir) / Path(ppt_path).stem
        res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))

        # Alternatively, pass the file content as bytes
        with open(ppt_path, 'rb') as f:
            ppt_content = f.read()
        res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
        print(res)
        ```
    - Extract Excel.
        ```python
        from extract_office_content import ExtractExcel

        excel_extract = ExtractExcel()

        excel_path = 'tests/test_files/excel_with_image.xlsx'
        res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')

        # Alternatively, pass the file content as bytes
        with open(excel_path, 'rb') as f:
            excel_content = f.read()
        res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
        print(res)
        ```
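
When running the "Extract all" example over a directory that also contains non-office files, it may help to filter by file suffix first and to keep going if a single file fails. The following is a minimal sketch under stated assumptions, not part of the package: `find_office_files`, `extract_all`, and the `OFFICE_SUFFIXES` tuple are hypothetical helpers, and the suffix list is illustrative rather than the set of formats the library actually supports. The extractor is passed in as a plain callable, so any of the extractor objects shown above could be used.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple, Union

# Assumption: illustrative suffixes only; check the project for the
# formats it actually supports.
OFFICE_SUFFIXES = (".docx", ".pptx", ".xlsx")


def find_office_files(
    root: Union[str, Path], suffixes: Iterable[str] = OFFICE_SUFFIXES
) -> List[Path]:
    """Return files under root (recursively) with a matching suffix, sorted."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def extract_all(
    root: Union[str, Path], extractor: Callable
) -> Tuple[Dict[str, object], Dict[str, Exception]]:
    """Call extractor on each matching file.

    Returns (results, errors), both keyed by the file path as a string.
    """
    results: Dict[str, object] = {}
    errors: Dict[str, Exception] = {}
    for path in find_office_files(root):
        try:
            results[str(path)] = extractor(path)
        except Exception as exc:  # record the failure and continue
            errors[str(path)] = exc
    return results, errors


# Possible usage with the class from the examples above:
#   from extract_office_content import ExtractOfficeContent
#   results, errors = extract_all("tests/test_files", ExtractOfficeContent())
```

Taking the extractor as an argument keeps the helper independent of the library, so it can be tried out with a stand-in function before running it on real files.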

### See [ExtractOfficeContent](https://github.com/SWHL/ExtractOfficeContent) for details.


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/SWHL/ExtractOfficeText.git",
    "name": "extract-office-content",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6,<3.12",
    "maintainer_email": "",
    "keywords": "extract,office,text,content",
    "author": "SWHL",
    "author_email": "liekkaskono@163.com",
    "download_url": "",
    "platform": "Any",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Tool for extracting content from office files.",
    "version": "0.0.7",
    "project_urls": {
        "Homepage": "https://github.com/SWHL/ExtractOfficeText.git"
    },
    "split_keywords": [
        "extract",
        "office",
        "text",
        "content"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "00a83c3de77223cba5b5bafa9dc8f2fed86b0c2fc994ef493c55f249072c9c44",
                "md5": "a432cd4131525238012b9bea7014a37e",
                "sha256": "8d5bc5f7bb5a8dc90d596d95bdd65aa43787b060345626130f23a24e58b98c85"
            },
            "downloads": -1,
            "filename": "extract_office_content-0.0.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a432cd4131525238012b9bea7014a37e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6,<3.12",
            "size": 10742,
            "upload_time": "2023-07-16T14:19:44",
            "upload_time_iso_8601": "2023-07-16T14:19:44.981212Z",
            "url": "https://files.pythonhosted.org/packages/00/a8/3c3de77223cba5b5bafa9dc8f2fed86b0c2fc994ef493c55f249072c9c44/extract_office_content-0.0.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-16 14:19:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "SWHL",
    "github_project": "ExtractOfficeText",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "extract-office-content"
}
        