## extract_office_content
<p>
<a href=""><img src="https://img.shields.io/badge/Python->=3.7,<=3.10-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a>
<a href="https://pypi.org/project/extract_office_content/"><img alt="PyPI" src="https://img.shields.io/pypi/v/extract_office_content"></a>
<a href="https://pepy.tech/project/extract_office_content"><img src="https://static.pepy.tech/personalized-badge/extract_office_content?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a>
</p>
### Use
1. Install `extract_office_content`.
```bash
$ pip install extract_office_content
```
2. Run from the CLI.
- Extract the content of all office files in a directory.
```bash
$ extract_office_content -h
usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path

positional arguments:
  file_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_office_content tests/test_files
```
- Extract Word.
```bash
$ extract_word -h
usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path

positional arguments:
  word_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_word tests/test_files/word_example.docx
```
- Extract PPT.
```bash
$ extract_ppt -h
usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path

positional arguments:
  ppt_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_ppt tests/test_files/ppt_example.pptx
```
- Extract Excel.
```bash
$ extract_excel -h
usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                     excel_path

positional arguments:
  excel_path

optional arguments:
  -h, --help            show this help message and exit
  -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
  -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_excel tests/test_files/excel_example.xlsx
```
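The CLI commands above can also be driven from another program. The following is a minimal sketch, not part of the package: `build_excel_command` and `run_excel_cli` are hypothetical helper names, the flags are taken from the `extract_excel -h` output above, and it is an assumption that the command prints its result to standard output.

```python
import shutil
import subprocess
from typing import List, Optional

# Output formats listed in the `extract_excel -h` text above.
EXCEL_FORMATS = ("markdown", "html", "latex", "string")


def build_excel_command(excel_path: str,
                        output_format: Optional[str] = None,
                        save_img_dir: Optional[str] = None) -> List[str]:
    """Assemble an `extract_excel` argument list (hypothetical helper)."""
    cmd = ["extract_excel"]
    if output_format is not None:
        # Reject values the help text does not list before spawning a process.
        if output_format not in EXCEL_FORMATS:
            raise ValueError(f"unsupported format: {output_format!r}")
        cmd += ["-f", output_format]
    if save_img_dir is not None:
        cmd += ["-o", save_img_dir]
    cmd.append(excel_path)
    return cmd


def run_excel_cli(excel_path: str,
                  output_format: Optional[str] = None,
                  save_img_dir: Optional[str] = None) -> str:
    """Run the CLI and return its stdout.

    Assumption: the extracted content is written to stdout.
    """
    if shutil.which("extract_excel") is None:
        raise FileNotFoundError("extract_excel is not on PATH; install the package first")
    cmd = build_excel_command(excel_path, output_format, save_img_dir)
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.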
3. Run from a Python script.
- Extract all.
```python
from pathlib import Path

from extract_office_content import ExtractOfficeContent

extracter = ExtractOfficeContent()

# Run the extractor on every file in the test directory
file_list = list(Path('tests/test_files').iterdir())
for file_path in file_list:
    res = extracter(file_path)
    print(res)
```
- Extract Word.
```python
from extract_office_content import ExtractWord

word_extract = ExtractWord()

# Pass a file path
word_path = 'tests/test_files/word_example.docx'
text = word_extract(word_path, "outputs/word")

# or pass the file content as bytes
with open(word_path, 'rb') as f:
    word_content = f.read()
text = word_extract(word_content, "outputs/word")
print(text)
```
- Extract PPT.
```python
from pathlib import Path

from extract_office_content import ExtractPPT

ppt_extracter = ExtractPPT()

# Save extracted images under outputs/<presentation name>
ppt_path = 'tests/test_files/ppt_example.pptx'
save_dir = 'outputs'
save_img_dir = Path(save_dir) / Path(ppt_path).stem
res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))

# or pass the file content as bytes
with open(ppt_path, 'rb') as f:
    ppt_content = f.read()
res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
print(res)
```
- Extract Excel.
```python
from extract_office_content import ExtractExcel

excel_extract = ExtractExcel()

# Pass a file path
excel_path = 'tests/test_files/excel_with_image.xlsx'
res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')

# or pass the file content as bytes
with open(excel_path, 'rb') as f:
    excel_content = f.read()
res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
print(res)
```
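When a directory mixes office files with other files, or one bad file should not stop the whole run, the "Extract all" loop above can be wrapped as in the sketch below. `extract_directory` and `OFFICE_SUFFIXES` are hypothetical names introduced here, and the suffix list is an assumption based only on the file types shown in this README. The extractor is passed in as a plain callable, so the helper does not depend on the package's internals.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumed suffix list: only the file types that appear in this README.
OFFICE_SUFFIXES = (".docx", ".pptx", ".xlsx")


def extract_directory(extractor: Callable,
                      root,
                      suffixes: Iterable[str] = OFFICE_SUFFIXES) -> Dict[str, object]:
    """Call `extractor` on each matching file directly under `root`.

    Returns a dict mapping each file path (as a string) to the extractor's
    result, or to the exception raised for that file.
    """
    wanted = {s.lower() for s in suffixes}
    results: Dict[str, object] = {}
    for file_path in sorted(Path(root).iterdir()):
        # Skip sub-directories and files with other suffixes.
        if not file_path.is_file() or file_path.suffix.lower() not in wanted:
            continue
        try:
            results[str(file_path)] = extractor(file_path)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            results[str(file_path)] = exc
    return results
```

With the package installed, calling `extract_directory(ExtractOfficeContent(), 'tests/test_files')` would process the same directory as the "Extract all" example, storing an exception object in place of the result for any file that fails.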
### See [ExtractOfficeContent](https://github.com/SWHL/ExtractOfficeContent) for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/SWHL/ExtractOfficeText.git",
"name": "extract-office-content",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6,<3.12",
"maintainer_email": "",
"keywords": "extract,office,text,content",
"author": "SWHL",
"author_email": "liekkaskono@163.com",
"download_url": "",
"platform": "Any",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Tool for extracting content from office files.",
"version": "0.0.7",
"project_urls": {
"Homepage": "https://github.com/SWHL/ExtractOfficeText.git"
},
"split_keywords": [
"extract",
"office",
"text",
"content"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "00a83c3de77223cba5b5bafa9dc8f2fed86b0c2fc994ef493c55f249072c9c44",
"md5": "a432cd4131525238012b9bea7014a37e",
"sha256": "8d5bc5f7bb5a8dc90d596d95bdd65aa43787b060345626130f23a24e58b98c85"
},
"downloads": -1,
"filename": "extract_office_content-0.0.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a432cd4131525238012b9bea7014a37e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6,<3.12",
"size": 10742,
"upload_time": "2023-07-16T14:19:44",
"upload_time_iso_8601": "2023-07-16T14:19:44.981212Z",
"url": "https://files.pythonhosted.org/packages/00/a8/3c3de77223cba5b5bafa9dc8f2fed86b0c2fc994ef493c55f249072c9c44/extract_office_content-0.0.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-16 14:19:44",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SWHL",
"github_project": "ExtractOfficeText",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "extract-office-content"
}