epub-cfi-toolkit


- **Name**: epub-cfi-toolkit
- **Version**: 0.3.0
- **Summary**: A Python toolkit for processing EPUB Canonical Fragment Identifiers (CFIs)
- **Upload time**: 2025-08-25 21:05:27
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: epub, cfi, ebook, canonical-fragment-identifier
- **Author email**: PagePal <info@pagepalapp.com>
- **Repository**: https://github.com/PagePalApp/epub-cfi-toolkit

**Disclaimer**: everything in this repository was generated entirely by Claude Code (including this README) and has not been reviewed or verified by a human. Content may be inaccurate or incomplete. Use at your own risk.

# EPUB CFI Toolkit

[![PyPI version](https://badge.fury.io/py/epub-cfi-toolkit.svg)](https://badge.fury.io/py/epub-cfi-toolkit)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![CI](https://github.com/PagePalApp/epub-cfi-toolkit/workflows/CI/badge.svg?branch=main)](https://github.com/PagePalApp/epub-cfi-toolkit/actions?query=workflow%3ACI+branch%3Amain)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A Python toolkit for extracting text from EPUB files using **EPUB Canonical Fragment Identifiers (CFIs)** with full CFI specification compliance.

## Installation

```bash
pip install epub-cfi-toolkit
```

## Features

- **Full CFI Specification Compliance** - Supports all CFI features per the EPUB CFI specification
- **Character Escaping** - Handles special characters with circumflex (^) escaping
- **Range CFI Support** - Processes both simple and range CFI syntax
- **Element Assertion Validation** - Validates element ID assertions in CFI paths
- **UTF-16 Character Offsets** - Proper Unicode handling for character positioning
- **Virtual Element Indices** - Support for virtual elements in DOM navigation
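
To illustrate the UTF-16 point in the list above: the CFI specification counts character offsets in UTF-16 code units, while Python strings are indexed by code point, so characters outside the Basic Multilingual Plane shift positions. The following is a minimal sketch of that conversion, not the toolkit's own implementation; `utf16_offset_to_index` is a hypothetical helper name.

```python
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Convert an offset counted in UTF-16 code units to a Python str index."""
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Code points above U+FFFF occupy two UTF-16 code units (a surrogate pair).
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


sample = "a\U0001F600b"  # "a", one emoji outside the BMP, "b"
print(utf16_offset_to_index(sample, 1))  # 1 (the emoji)
print(utf16_offset_to_index(sample, 3))  # 2 ("b", because the emoji used two units)
```

An offset that falls in the middle of a surrogate pair is rounded forward to the next code point in this sketch; a real implementation might prefer to reject it.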

## Usage

### Basic Text Extraction

```python
from epub_cfi_toolkit import CFIProcessor

processor = CFIProcessor("path/to/book.epub")
text = processor.extract_text_from_cfi_range(
    start_cfi="epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)",
    end_cfi="epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:20)"
)
print(text)  # Extracted text from the EPUB
```

### CFI Parsing and Analysis

```python
from epub_cfi_toolkit import CFIParser

parser = CFIParser()
parsed_cfi = parser.parse("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)")

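# Fields as listed under "Data Classes" in the API reference
spine_step = parsed_cfi.spine_steps[-1]   # last spine step: /4[chap01ref]
print(spine_step.index)              # 4 (itemref index)
print(spine_step.assertion)          # "chap01ref" (element ID)
print(parsed_cfi.location.offset)    # 0 (character offset)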
```

### Range CFI Processing

```python
from epub_cfi_toolkit import CFIParser

parser = CFIParser()
# Range CFI with comma syntax: parent path, start subpath, end subpath
range_cfi = "epubcfi(/6/4[chapter]!,/2/1:5,/2/1:15)"
parsed = parser.parse(range_cfi)
```

### Character Escaping

The parser handles special characters that are escaped with a circumflex (`^`):

```python
from epub_cfi_toolkit import CFIParser

parser = CFIParser()
# Characters that require ^ escaping: [ ] ^ , ( ) ;
cfi = "epubcfi(/6/4!/2[element^[with^]brackets]/1:0)"
parsed = parser.parse(cfi)  # "^[" and "^]" are read as literal brackets
```

## API Reference

### CFIProcessor

```python
class CFIProcessor:
    def __init__(self, epub_path: str) -> None:
        """Initialize processor with EPUB file path."""
    
    def extract_text_from_cfi_range(self, start_cfi: str, end_cfi: str) -> str:
        """Extract text between two CFI positions with full spec compliance."""
```

### CFIParser

```python
class CFIParser:
    def __init__(self) -> None:
        """Initialize CFI parser with specification compliance."""
    
    def parse(self, cfi: str) -> ParsedCFI:
        """Parse CFI string with support for simple and range CFIs."""
    
    def compare_cfis(self, cfi1: ParsedCFI, cfi2: ParsedCFI) -> int:
        """Compare two CFIs for document order (-1, 0, 1)."""
```

### EPUBParser

```python
class EPUBParser:
    def __init__(self, epub_path: str) -> None:
        """Initialize parser for EPUB file structure."""
```

### Data Classes

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CFIStep:
    index: int                       # Step index
    assertion: Optional[str]         # Element ID assertion

@dataclass
class CFILocation:
    offset: int                      # Character offset
    length: Optional[int]            # Range length (optional)

@dataclass
class ParsedCFI:
    spine_steps: List[CFIStep]       # Spine navigation steps
    content_steps: List[CFIStep]     # Content navigation steps
    location: Optional[CFILocation]  # Character position
```

### Exceptions

```python
from epub_cfi_toolkit import CFIError, EPUBError

# CFIError: Base exception for CFI parsing/processing errors
# EPUBError: Raised when EPUB file cannot be processed
```

## CFI Specification Compliance

This library fully implements the EPUB CFI specification, including:

### Character Escaping
Special characters are escaped with circumflex (^): `[ ] ^ , ( ) ;`
```python
# Original: element[with]brackets  
# Escaped:  element^[with^]brackets
```
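
The escaping rule above can be sketched in a few lines. This is an illustrative helper pair, not part of the toolkit's documented API; the names `escape_cfi_text` and `unescape_cfi_text` are hypothetical, and the character set is the one listed in this section.

```python
import re

# Characters listed above as requiring a circumflex escape.
_SPECIAL = "[]^,();"


def escape_cfi_text(text: str) -> str:
    """Prefix each special character with '^'."""
    return "".join("^" + ch if ch in _SPECIAL else ch for ch in text)


def unescape_cfi_text(text: str) -> str:
    """Drop each escaping '^' and keep the character it protects."""
    return re.sub(r"\^(.)", r"\1", text, flags=re.DOTALL)


escaped = escape_cfi_text("element[with]brackets")
print(escaped)                     # element^[with^]brackets
print(unescape_cfi_text(escaped))  # element[with]brackets
```

Because the circumflex itself is escaped as `^^`, the two functions round-trip any input string.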

### Range CFI Syntax
Supports comma-separated range CFIs:
```python
# Format: epubcfi(parent_path,start_subpath,end_subpath)
"epubcfi(/6/4[chapter]!,/2/1:5,/2/1:15)"
```
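
A range CFI of the form `epubcfi(P,S,E)` denotes the span from `P` + `S` to `P` + `E`. The sketch below expands a range CFI into those two simple CFIs. It is an illustration under stated assumptions, not the toolkit's parser: `split_range_cfi` is a hypothetical name, commas inside `[...]` assertions or after a `^` escape are treated as part of the component, and whitespace around components is stripped.

```python
from typing import List, Tuple


def split_range_cfi(cfi: str) -> Tuple[str, str]:
    """Expand epubcfi(P,S,E) into the simple CFIs epubcfi(PS) and epubcfi(PE)."""
    if not (cfi.startswith("epubcfi(") and cfi.endswith(")")):
        raise ValueError("not an epubcfi(...) string")
    body = cfi[len("epubcfi("):-1]

    parts: List[str] = [""]
    in_brackets = False
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "^" and i + 1 < len(body):
            # Escaped character: copy the escape and its target verbatim.
            parts[-1] += body[i:i + 2]
            i += 2
            continue
        if ch == "[":
            in_brackets = True
        elif ch == "]":
            in_brackets = False
        if ch == "," and not in_brackets:
            parts.append("")  # top-level comma starts the next component
        else:
            parts[-1] += ch
        i += 1

    if len(parts) != 3:
        raise ValueError("expected parent, start, and end components")
    parent, start, end = (p.strip() for p in parts)
    return f"epubcfi({parent}{start})", f"epubcfi({parent}{end})"


start_cfi, end_cfi = split_range_cfi("epubcfi(/6/4[chapter]!,/2/1:5,/2/1:15)")
print(start_cfi)  # epubcfi(/6/4[chapter]!/2/1:5)
print(end_cfi)    # epubcfi(/6/4[chapter]!/2/1:15)
```

The resulting pair has the same shape as the `start_cfi` and `end_cfi` arguments of `extract_text_from_cfi_range`, so it could plausibly be passed there.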

### Element Assertions
Validates element ID assertions in square brackets:
```python
"epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)"
#             ^^^^^^^^^     ^^^^^^     ^^^^^^
#             Element ID assertions are validated
```

### Virtual Element Indices
Handles virtual elements (index 0 and the index just past the last child element):
```python
"/4/0"    # Before the first child element
"/4/10"   # After the last child element (if only 4 child elements exist, at /2 through /8)
```
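
In the CFI scheme, child elements are assigned even step indices (2, 4, 6, ...), so a parent with four child elements has `/10` as its trailing virtual index. The following sketch enumerates those indices with the standard library's `xml.etree.ElementTree`; `element_step_indices` is a hypothetical helper, not a function of this toolkit.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def element_step_indices(parent: ET.Element) -> Dict[int, Optional[ET.Element]]:
    """Map CFI step indices to child elements; None marks a virtual position."""
    children = list(parent)  # child elements only, in document order
    steps: Dict[int, Optional[ET.Element]] = {0: None}  # before the first child
    for position, child in enumerate(children, start=1):
        steps[2 * position] = child  # elements take even indices 2, 4, 6, ...
    steps[2 * len(children) + 2] = None  # just past the last child element
    return steps


body = ET.fromstring("<body><p>one</p><p>two</p><p>three</p><p>four</p></body>")
steps = element_step_indices(body)
print(sorted(steps))  # [0, 2, 4, 6, 8, 10]
print(steps[4].text)  # two
```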

## What are EPUB CFIs?

EPUB Canonical Fragment Identifiers (CFIs) are a standard way to reference specific locations within EPUB documents.

**CFI Structure**: `epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)`

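- `epubcfi(...)` - CFI wrapper
- `/6` - Step from the package document root to its `spine` element (typically the third child element, hence index 6)
- `/4[chap01ref]` - Spine `itemref` step with an ID assertion
- `!` - Indirection step (follows the `itemref` into the referenced content document)
- `/4[body01]/10[para05]` - Content navigation path
- `/3:10` - Text node (odd index 3) with character offset 10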
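
The breakdown above can be reproduced with a small tokenizer. This is a simplified sketch for simple CFIs only, not the toolkit's `CFIParser`: `split_simple_cfi` and `parse_steps` are hypothetical names, and the code assumes a single `!`, no `!` or `:` inside assertions, and a plain integer offset.

```python
import re
from typing import List, Optional, Tuple

Step = Tuple[int, Optional[str]]

# One step: "/", an integer, and an optional [assertion] that may contain ^-escapes.
_STEP = re.compile(r"/(\d+)(?:\[((?:\^.|[^\]^])*)\])?")


def parse_steps(path: str) -> List[Step]:
    """Return (index, assertion) pairs for each /N[...] step in a path."""
    return [(int(index), assertion or None) for index, assertion in _STEP.findall(path)]


def split_simple_cfi(cfi: str) -> Tuple[List[Step], List[Step], Optional[int]]:
    """Split a simple CFI into spine steps, content steps, and character offset."""
    body = cfi[len("epubcfi("):-1]
    spine_path, _, content_path = body.partition("!")
    content_path, _, offset = content_path.partition(":")
    return parse_steps(spine_path), parse_steps(content_path), int(offset) if offset else None


spine, content, offset = split_simple_cfi(
    "epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)"
)
print(spine)    # [(6, None), (4, 'chap01ref')]
print(content)  # [(4, 'body01'), (10, 'para05'), (3, None)]
print(offset)   # 10
```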

## Requirements

- **Python 3.8+**
- **lxml** - XML/HTML processing library

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

            
