# extract-hwp
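A Python library for extracting text from Hancom HWP files (HWP 5.0 and HWPX formats).
## Features
- **Multi-format support**: Handles both HWP 5.0 (OLE) and HWPX (ZIP/XML) files
- **Encrypted file detection**: Detects password-protected files before processing
- **Structured extraction**: Preserves paragraph structure when extracting text
- **Robust error handling**: Handles corrupted or invalid files defensively
- **Unicode support**: Full support for Korean and multilingual text
## Installation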
```bash
pip install extract-hwp
```
## Usage
### Basic usage
```python
from extract_hwp import extract_text_from_hwp

# Extract text from an HWP or HWPX file
text, error = extract_text_from_hwp("document.hwp")
if error is None:
    print(text)
else:
    print(f"Error: {error}")
```
### Format-specific extraction
```python
from extract_hwp import extract_text_from_hwpx, extract_text_from_hwp5

# HWPX files only
hwpx_text = extract_text_from_hwpx("document.hwpx")

# HWP 5.0 files only
hwp5_text = extract_text_from_hwp5("document.hwp")
```
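When the file extension cannot be trusted, one way to choose between the two format-specific functions is to look at the container signature. The sketch below is a hypothetical helper, not part of extract-hwp. It relies only on the standard OLE compound file and ZIP signatures, so it identifies the container type rather than confirming that the file is a Hancom document.

```python
# Well-known container signatures (file "magic numbers").
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header


def sniff_hwp_format(path):
    """Return "hwp5", "hwpx", or None based on the file's leading bytes.

    Hypothetical helper: it only checks the container type and does not
    verify that the content is actually an HWP/HWPX document.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE_MAGIC):
        return "hwp5"
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    return None
```

A caller could then pass the path to `extract_text_from_hwp5` or `extract_text_from_hwpx` depending on the result, and treat `None` as an unsupported file.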
### Detecting encrypted files
```python
from extract_hwp import extract_text_from_hwp, is_hwp_file_password_protected

if is_hwp_file_password_protected("document.hwp"):
    print("The file is password protected.")
else:
    text, error = extract_text_from_hwp("document.hwp")
```
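The same check can be applied across a whole directory. The following is a minimal sketch, assuming the two library functions behave as documented in this README. `extract_directory` is a hypothetical name, and the checker and extractor are passed in as arguments so the helper makes no assumptions about how the library is imported.

```python
from pathlib import Path


def extract_directory(root, is_protected, extractor):
    """Extract text from every .hwp/.hwpx file under `root`.

    Returns (texts, skipped): `texts` maps path -> extracted text and
    `skipped` maps path -> reason. `is_protected` is expected to return a
    bool and `extractor` a (text, error_message) tuple.
    """
    texts, skipped = {}, {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".hwp", ".hwpx"}:
            continue
        key = str(path)
        if is_protected(key):
            skipped[key] = "password protected"
            continue
        text, error = extractor(key)
        if error is None:
            texts[key] = text
        else:
            skipped[key] = error
    return texts, skipped
```

With the library installed, this might be called as `extract_directory("docs", is_hwp_file_password_protected, extract_text_from_hwp)`.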
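## API reference
### Core functions
#### `extract_text_from_hwp(filepath)`
Extracts text from an HWP or HWPX file (unified interface).
**Parameters:**
- `filepath` (str): Path to the HWP or HWPX file
**Returns:**
- `tuple`: (extracted_text, error_message). On success, error_message is None
**Raises:**
- `FileNotFoundError`: The file could not be found
- `PermissionError`: The file could not be accessed due to insufficient permissions
- `ValueError`: The file format is not supported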
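Because the function is documented as both returning an error message and raising exceptions, callers may want a single failure path. The wrapper below is a minimal sketch of one way to do that. `HwpExtractionError` and `extract_or_raise` are hypothetical names introduced for illustration and are not provided by extract-hwp.

```python
class HwpExtractionError(Exception):
    """Hypothetical exception type for this sketch; not defined by extract-hwp."""


def extract_or_raise(filepath, extractor):
    """Call a (text, error_message)-returning extractor and raise on failure.

    `extractor` is assumed to behave like extract_text_from_hwp as documented
    above: it returns (text, error_message) and may raise FileNotFoundError,
    PermissionError, or ValueError.
    """
    try:
        text, error = extractor(filepath)
    except (FileNotFoundError, PermissionError, ValueError) as exc:
        raise HwpExtractionError(f"{filepath}: {exc}") from exc
    if error is not None:
        raise HwpExtractionError(f"{filepath}: {error}")
    return text
```

With the library installed, usage would look like `text = extract_or_raise("document.hwp", extract_text_from_hwp)`.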
### Format-specific functions
#### `extract_text_from_hwpx(hwpx_file_path)`
Extracts text from an HWPX file.
**Parameters:**
- `hwpx_file_path` (str): Path to the HWPX file
**Returns:**
- `str`: The extracted text (an empty string on error)
#### `extract_text_from_hwp5(filepath)`
Extracts text from an HWP 5.0 (OLE) file.
**Parameters:**
- `filepath` (str): Path to the HWP file
**Returns:**
- `str`: The extracted text (an empty string on error)
### Encryption detection function
#### `is_hwp_file_password_protected(filepath)`
Checks whether an HWP or HWPX file is password protected.
**Parameters:**
- `filepath` (str): Path to the file to check
**Returns:**
- `bool`: True if the file is password protected, False otherwise
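## Supported formats
### HWP 5.0 (OLE format)
- Extension: `.hwp`
- Structure: OLE compound document format
- Compression: zlib compression supported
- Notes: Text is extracted by parsing the binary structure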
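To illustrate the kind of binary parsing involved, the sketch below decodes the property flags of an HWP 5.0 `FileHeader` stream and inflates a compressed stream. It is not the library's implementation. The layout (32-byte signature, 4-byte version, 4-byte property bit field) and the use of raw deflate without a zlib header are assumptions taken from the publicly documented HWP 5.0 format, and both function names are hypothetical.

```python
import struct
import zlib

HWP5_SIGNATURE = b"HWP Document File"


def parse_file_header(header):
    """Decode property flags from the bytes of an HWP 5.0 FileHeader stream.

    Assumed layout: bytes 0-31 signature, bytes 32-35 version,
    bytes 36-39 little-endian property bit field.
    """
    if len(header) < 40 or not header.startswith(HWP5_SIGNATURE):
        raise ValueError("not an HWP 5.0 FileHeader")
    (props,) = struct.unpack_from("<I", header, 36)
    return {
        "compressed": bool(props & 0x01),          # bit 0
        "password_protected": bool(props & 0x02),  # bit 1
        "distribution": bool(props & 0x04),        # bit 2
    }


def inflate_stream(data):
    """Decompress a raw-deflate stream (negative wbits means no zlib header)."""
    return zlib.decompress(data, -15)
```

With `olefile`, the header bytes could be read with something like `olefile.OleFileIO(path).openstream("FileHeader").read()` before being passed to `parse_file_header`.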
### HWPX (ZIP/XML format)
- Extension: `.hwpx`
- Structure: ZIP archive containing XML documents
- Notes: Text is extracted in structured form by parsing the XML
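The ZIP/XML approach can be sketched with the standard library alone. The example below is illustrative and not the library's implementation. It assumes that body text is stored in `Contents/section*.xml`, that paragraphs are `p` elements, and that text runs are `t` elements. These details are not stated in this README, and the function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def hwpx_paragraphs(source):
    """Yield one string per paragraph from an HWPX-like ZIP archive.

    `source` is a path or a binary file object. Section files are visited
    in plain lexicographic order, and paragraphs nested inside other
    paragraphs (for example in tables) are not de-duplicated.
    """
    with zipfile.ZipFile(source) as zf:
        names = sorted(
            n for n in zf.namelist()
            if n.startswith("Contents/section") and n.endswith(".xml")
        )
        for name in names:
            root = ET.fromstring(zf.read(name))
            for elem in root.iter():
                if local_name(elem.tag) != "p":
                    continue
                # Join the text of every run inside this paragraph.
                yield "".join(
                    "".join(t.itertext())
                    for t in elem.iter()
                    if local_name(t.tag) == "t"
                )
```

Joining the yielded strings with `"\n"` gives plain text with one paragraph per line.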
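## Dependencies
- `olefile>=0.46`: Handles HWP 5.0 OLE files
## Development
### Setting up a development environment
```bash
# Clone the repository
git clone https://github.com/thlee/extract-hwp.git
cd extract-hwp
# Install dependencies
uv sync
# Install with development dependencies
uv sync --extra dev
```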
### Testing
```bash
# Run the tests
pytest
# Run with coverage
pytest --cov=src/extract_hwp
```
### Code quality
```bash
# Format the code
black src/ tests/
# Type checking
mypy src/
```
## License
BSD 3-Clause License. See the [LICENSE](LICENSE) file for details.
## Changelog
The version history is available in [CHANGELOG.md](CHANGELOG.md).