extract-hwp


Name: extract-hwp
Version: 0.1.0
Home page: None
Summary: Python library for extracting text from Korean HWP files (HWP 5.0 and HWPX formats)
Upload time: 2025-08-13 10:13:03
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.8
License: BSD-3-Clause
Keywords: document, hwp, hwpx, korean, text-extraction
Requirements: No requirements were recorded.
Travis CI: none
Coveralls test coverage: none
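# extract-hwp

A Python library for extracting text from Hancom (Hangul and Computer) HWP files in the HWP 5.0 and HWPX formats.

## Features

- **Multi-format support**: Handles both HWP 5.0 (OLE) and HWPX (ZIP/XML) files
- **Encrypted file detection**: Detects password-protected files before processing them
- **Structured extraction**: Preserves paragraph structure when extracting text
- **Robust error handling**: Handles corrupted or invalid files defensively
- **Unicode support**: Full support for Korean and multilingual text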

## Installation

```bash
pip install extract-hwp
```

## Usage

### Basic usage

```python
from extract_hwp import extract_text_from_hwp

# Extract text from an HWP or HWPX file
text, error = extract_text_from_hwp("document.hwp")
if error is None:
    print(text)
else:
    print(f"Error: {error}")
```

### Format-specific extraction

```python
from extract_hwp import extract_text_from_hwpx, extract_text_from_hwp5

# HWPX files only
hwpx_text = extract_text_from_hwpx("document.hwpx")

# HWP 5.0 files only
hwp5_text = extract_text_from_hwp5("document.hwp")
```

### Detecting encrypted files

```python
from extract_hwp import extract_text_from_hwp, is_hwp_file_password_protected

if is_hwp_file_password_protected("document.hwp"):
    print("The file is password-protected.")
else:
    text, error = extract_text_from_hwp("document.hwp")
```

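## API reference

### Core functions

#### `extract_text_from_hwp(filepath)`

Extracts text from an HWP or HWPX file (unified interface).

**Parameters:**
- `filepath` (str): Path to the HWP or HWPX file

**Returns:**
- `tuple`: (extracted_text, error_message). On success, error_message is None

**Raises:**
- `FileNotFoundError`: The file could not be found
- `PermissionError`: The file could not be accessed because of missing permissions
- `ValueError`: The file format is not supported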
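
The function reports failures in two ways: through the returned error message and through raised exceptions. The sketch below shows one way a caller might fold both into a single result. `extract_or_none` and its `extractor` parameter are hypothetical names, not part of the library; the extractor is passed in so the helper assumes nothing beyond the return shape and exception list documented above.

```python
from typing import Callable, Optional, Tuple

# Hypothetical helper, not part of extract-hwp. `extractor` is expected to
# behave like the documented extract_text_from_hwp: it returns (text, error)
# and may raise FileNotFoundError, PermissionError, or ValueError.
Extractor = Callable[[str], Tuple[Optional[str], Optional[str]]]


def extract_or_none(
    filepath: str, extractor: Extractor
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, reason) on a documented failure."""
    try:
        text, error = extractor(filepath)
    except FileNotFoundError:
        return None, f"file not found: {filepath}"
    except PermissionError:
        return None, f"permission denied: {filepath}"
    except ValueError as exc:
        return None, f"unsupported file format: {exc}"
    if error is not None:
        # The extractor ran but reported a problem through its return value.
        return None, error
    return text, None
```

With the library installed, a call might look like `extract_or_none("document.hwp", extract_text_from_hwp)`.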

### Format-specific functions

#### `extract_text_from_hwpx(hwpx_file_path)`

Extracts text from an HWPX file.

**Parameters:**
- `hwpx_file_path` (str): Path to the HWPX file

**Returns:**
- `str`: The extracted text (an empty string on error)

#### `extract_text_from_hwp5(filepath)`

Extracts text from an HWP 5.0 (OLE) file.

**Parameters:**
- `filepath` (str): Path to the HWP file

**Returns:**
- `str`: The extracted text (an empty string on error)
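
Because each format-specific function expects one container type, a caller may want to check the file's leading bytes instead of trusting its extension. The following is a minimal sketch of such a check, not a description of how the library decides internally. The function names are hypothetical; the two signatures are the standard OLE compound file and ZIP local-file-header magic numbers.

```python
from typing import Optional

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header signature


def sniff_format_from_bytes(head: bytes) -> Optional[str]:
    """Guess the container type from the first bytes of a file.

    Returns "hwp5" for an OLE container, "hwpx" for a ZIP container,
    and None otherwise. This identifies the container only; it does not
    prove that the file really is an HWP document.
    """
    if head.startswith(OLE_MAGIC):
        return "hwp5"
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    return None


def sniff_format(filepath: str) -> Optional[str]:
    """Read the first 8 bytes of `filepath` and guess its container type."""
    with open(filepath, "rb") as handle:
        return sniff_format_from_bytes(handle.read(8))
```

A caller could then route `"hwp5"` to `extract_text_from_hwp5` and `"hwpx"` to `extract_text_from_hwpx`.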

### Encryption detection function

#### `is_hwp_file_password_protected(filepath)`

Checks whether an HWP or HWPX file is password-protected.

**Parameters:**
- `filepath` (str): Path to the file to check

**Returns:**
- `bool`: True if the file is password-protected, otherwise False
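
For HWP 5.0 files, a check like this can in principle be made from the `FileHeader` stream alone. The sketch below assumes the layout given in the publicly available HWP 5.0 format description: a 32-byte signature, a 4-byte version, then a 4-byte little-endian property field in which bit 0 marks compression and bit 1 marks password protection. It is an illustration under those assumptions, not the library's implementation, and `hwp5_header_flags` is a hypothetical name. HWPX files would need a different check, which is not shown.

```python
import struct

HWP5_SIGNATURE = b"HWP Document File"
FLAG_COMPRESSED = 0x01  # bit 0 of the property field (assumed layout)
FLAG_PASSWORD = 0x02  # bit 1 of the property field (assumed layout)


def hwp5_header_flags(file_header: bytes) -> dict:
    """Decode two property bits from the bytes of an HWP 5.0 FileHeader stream."""
    if len(file_header) < 40 or not file_header.startswith(HWP5_SIGNATURE):
        raise ValueError("not an HWP 5.0 FileHeader stream")
    # Offset 36 = 32-byte signature + 4-byte version.
    (properties,) = struct.unpack_from("<I", file_header, 36)
    return {
        "compressed": bool(properties & FLAG_COMPRESSED),
        "password_protected": bool(properties & FLAG_PASSWORD),
    }
```

The header bytes could be read with the `olefile` dependency, for example through `olefile.OleFileIO(path).openstream("FileHeader").read()`.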

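## Supported formats

### HWP 5.0 (OLE format)
- Extension: `.hwp`
- Structure: OLE compound document format
- Compression: zlib-compressed streams are supported
- Feature: Text extraction by analyzing the binary structure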
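
To make "analyzing the binary structure" concrete, here is a rough sketch of how a body section stream might be decoded once it has been read from the OLE container. It assumes the record layout in the publicly available HWP 5.0 format description: raw-deflate compression, a 32-bit record header packing a 10-bit tag, 10-bit level, and 12-bit size, and paragraph text stored as UTF-16LE under tag 67. All names are hypothetical and the control-character handling is simplified; none of it is taken from the library's source.

```python
import struct
import zlib
from typing import Iterator, Tuple

HWPTAG_PARA_TEXT = 67  # assumed: HWPTAG_BEGIN (16) + 51
# Control codes assumed to occupy a single 16-bit unit; other codes
# below 32 are assumed to occupy eight units.
SINGLE_UNIT_CONTROLS = {0, 10, 13, *range(24, 32)}


def inflate_stream(data: bytes) -> bytes:
    """Decompress a raw-deflate stream (no zlib header or checksum)."""
    return zlib.decompress(data, -15)


def iter_records(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (tag_id, level, payload) for each record in a section."""
    pos = 0
    while pos + 4 <= len(data):
        (header,) = struct.unpack_from("<I", data, pos)
        pos += 4
        tag_id = header & 0x3FF
        level = (header >> 10) & 0x3FF
        size = (header >> 20) & 0xFFF
        if size == 0xFFF:  # real size follows in the next 4 bytes
            if pos + 4 > len(data):
                break
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4
        yield tag_id, level, data[pos:pos + size]
        pos += size


def para_text(payload: bytes) -> str:
    """Decode a paragraph-text payload, dropping control sequences."""
    count = len(payload) // 2
    units = struct.unpack_from(f"<{count}H", payload)
    out = bytearray()
    i = 0
    while i < count:
        code = units[i]
        if code >= 32:
            out += struct.pack("<H", code)  # ordinary UTF-16 code unit
            i += 1
        elif code == 10:
            out += "\n".encode("utf-16-le")  # line break inside a paragraph
            i += 1
        elif code in SINGLE_UNIT_CONTROLS:
            i += 1  # e.g. 13, the paragraph terminator
        else:
            if code == 9:
                out += "\t".encode("utf-16-le")  # tab control
            i += 8  # skip the eight-unit control sequence
    return out.decode("utf-16-le", errors="replace")


def section_text(section: bytes, compressed: bool = True) -> str:
    """Join the text of every paragraph record, one paragraph per line."""
    data = inflate_stream(section) if compressed else section
    return "\n".join(
        para_text(payload)
        for tag_id, _level, payload in iter_records(data)
        if tag_id == HWPTAG_PARA_TEXT
    )
```

Joining paragraphs with newlines is one simple way to keep the paragraph structure visible in the output.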

### HWPX (ZIP/XML format)
- Extension: `.hwpx`
- Structure: ZIP archive containing XML documents
- Feature: XML parsing for structured text extraction
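
As an illustration of the ZIP/XML approach, the sketch below reads paragraphs from an HWPX archive using only the standard library. It assumes that body text lives in `Contents/section<N>.xml` entries and that each paragraph is a `p` element whose `run` children hold `t` text elements. Elements are matched by local name so that the exact namespace URIs need not be assumed. `hwpx_paragraphs` is a hypothetical name and this is not the library's code.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# Assumed archive layout: Contents/section0.xml, Contents/section1.xml, ...
SECTION_RE = re.compile(r"^Contents/section(\d+)\.xml$")


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def hwpx_paragraphs(source) -> List[str]:
    """Return one string per paragraph, in section order.

    `source` may be a file path or a binary file-like object.
    """
    paragraphs: List[str] = []
    with zipfile.ZipFile(source) as archive:
        sections = []
        for name in archive.namelist():
            match = SECTION_RE.match(name)
            if match:
                sections.append((int(match.group(1)), name))
        for _index, name in sorted(sections):
            root = ET.fromstring(archive.read(name))
            for elem in root.iter():
                if local_name(elem.tag) != "p":
                    continue
                # Collect only p > run > t so that paragraphs nested deeper
                # (for example inside table cells) are not counted twice;
                # they are visited on their own by root.iter().
                parts = []
                for run in elem:
                    if local_name(run.tag) != "run":
                        continue
                    for child in run:
                        if local_name(child.tag) == "t":
                            parts.append("".join(child.itertext()))
                paragraphs.append("".join(parts))
    return paragraphs
```

Calling `"\n".join(hwpx_paragraphs("document.hwpx"))` would give plain text with one paragraph per line.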

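## Dependencies

- `olefile>=0.46`: Handles HWP 5.0 OLE files

## Development

### Setting up the development environment

```bash
# Clone the repository
git clone https://github.com/thlee/extract-hwp.git
cd extract-hwp

# Install dependencies
uv sync

# Install with development dependencies
uv sync --extra dev
```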

### Testing

```bash
# Run the tests
pytest

# Run with coverage
pytest --cov=src/extract_hwp
```

### Code quality

```bash
# Format the code
black src/ tests/

# Type checking
mypy src/
```

## License

BSD 3-Clause License. See the [LICENSE](LICENSE) file for details.

## Changelog

The version history is available in [CHANGELOG.md](CHANGELOG.md).
            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "extract-hwp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "document, hwp, hwpx, korean, text-extraction",
    "author": null,
    "author_email": "extract-hwp <extract-hwp@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/1a/d0/ff70a595a50198409478bb927c02b28d002d7a7426e674cc8ffde125fe45/extract_hwp-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "Python library for extracting text from Korean HWP files (HWP 5.0 and HWPX formats)",
    "version": "0.1.0",
    "project_urls": {
        "Bug Reports": "https://github.com/thlee/extract-hwp/issues",
        "Homepage": "https://github.com/thlee/extract-hwp",
        "Source": "https://github.com/thlee/extract-hwp"
    },
    "split_keywords": [
        "document",
        " hwp",
        " hwpx",
        " korean",
        " text-extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6f57f15fa218433b854e0cd8faf93ebff48808b35d49a6420c98a67d1b9d1803",
                "md5": "04fe34b6c214c8aeda343bca6d0a8267",
                "sha256": "695ef54e9bb52b12b0e31b8517173be074b5b7dd721aede9dfd6c575b628d6c5"
            },
            "downloads": -1,
            "filename": "extract_hwp-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "04fe34b6c214c8aeda343bca6d0a8267",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 3721,
            "upload_time": "2025-08-13T10:13:02",
            "upload_time_iso_8601": "2025-08-13T10:13:02.235205Z",
            "url": "https://files.pythonhosted.org/packages/6f/57/f15fa218433b854e0cd8faf93ebff48808b35d49a6420c98a67d1b9d1803/extract_hwp-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1ad0ff70a595a50198409478bb927c02b28d002d7a7426e674cc8ffde125fe45",
                "md5": "02ae4553498d604cb11a8031f5200541",
                "sha256": "cc5a7137c462292955085f2f004e6b627afc12ec5a40521a4f6ba15d0b84edea"
            },
            "downloads": -1,
            "filename": "extract_hwp-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "02ae4553498d604cb11a8031f5200541",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 43255,
            "upload_time": "2025-08-13T10:13:03",
            "upload_time_iso_8601": "2025-08-13T10:13:03.807595Z",
            "url": "https://files.pythonhosted.org/packages/1a/d0/ff70a595a50198409478bb927c02b28d002d7a7426e674cc8ffde125fe45/extract_hwp-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-13 10:13:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "thlee",
    "github_project": "extract-hwp",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "extract-hwp"
}
        