**Disclaimer**: everything in this repository was generated entirely by Claude Code (including this README) and has not been reviewed or verified by a human. Content may be inaccurate or incomplete. Use at your own risk.
# EPUB CFI Toolkit
A Python toolkit for extracting text from EPUB files using **EPUB Canonical Fragment Identifiers (CFIs)** with full CFI specification compliance.
## Installation
```bash
pip install epub-cfi-toolkit
```
## Features
- **Full CFI Specification Compliance** - Supports all CFI features per EPUB CFI specification
- **Character Escaping** - Handles special characters with circumflex (^) escaping
- **Range CFI Support** - Processes both simple and range CFI syntax
- **Element Assertion Validation** - Validates element ID assertions in CFI paths
- **UTF-16 Character Offsets** - Proper Unicode handling for character positioning
- **Virtual Element Indices** - Support for virtual elements in DOM navigation
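To illustrate why the UTF-16 offset feature matters: Python strings are indexed by code point, while CFI character offsets count UTF-16 code units, so a character outside the Basic Multilingual Plane takes two offset positions. The helper below is a minimal sketch of that conversion; `utf16_offset_to_index` is a hypothetical name and is not part of this library's API.

```python
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Convert an offset counted in UTF-16 code units to a Python string index.

    Hypothetical helper for illustration only. An offset that falls in the
    middle of a surrogate pair is rounded up to the index after that character.
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        # Code points above U+FFFF need two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


sample = "a\U0001F600b"  # "a", an emoji outside the BMP, "b"
index = utf16_offset_to_index(sample, 3)
print(sample[index:])  # b
```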
## Usage
### Basic Text Extraction
```python
from epub_cfi_toolkit import CFIProcessor
processor = CFIProcessor("path/to/book.epub")
text = processor.extract_text_from_cfi_range(
    start_cfi="epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)",
    end_cfi="epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:20)"
)
print(text) # Extracted text from the EPUB
```
### CFI Parsing and Analysis
```python
from epub_cfi_toolkit import CFIParser
parser = CFIParser()
parsed_cfi = parser.parse("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)")
print(parsed_cfi.spine_index) # 4 (itemref index)
print(parsed_cfi.spine_assertion) # "chap01ref" (element ID)
print(parsed_cfi.location.offset) # 0 (character offset)
```
### Range CFI Processing
```python
# Range CFI with comma syntax (reuses `parser` from the previous example)
range_cfi = "epubcfi(/6/4[chapter]!, /2/1:5, /2/1:15)"
parsed = parser.parse(range_cfi)
```
### Character Escaping
The parser handles circumflex-escaped special characters in CFIs:
```python
# CFI with escaped characters: [, ], ^, ,, (, ), ;
cfi = "epubcfi(/6/4!/2[element^[with^]brackets]/1:0)"
parsed = parser.parse(cfi) # Correctly handles escaped brackets
```
## API Reference
### CFIProcessor
```python
class CFIProcessor:
    def __init__(self, epub_path: str) -> None:
        """Initialize processor with EPUB file path."""

    def extract_text_from_cfi_range(self, start_cfi: str, end_cfi: str) -> str:
        """Extract text between two CFI positions with full spec compliance."""
```
### CFIParser
```python
class CFIParser:
    def __init__(self) -> None:
        """Initialize CFI parser with specification compliance."""

    def parse(self, cfi: str) -> ParsedCFI:
        """Parse CFI string with support for simple and range CFIs."""

    def compare_cfis(self, cfi1: ParsedCFI, cfi2: ParsedCFI) -> int:
        """Compare two CFIs for document order (-1, 0, 1)."""
```
### EPUBParser
```python
class EPUBParser:
    def __init__(self, epub_path: str) -> None:
        """Initialize parser for EPUB file structure."""
```
### Data Classes
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CFIStep:
    index: int                       # Step index
    assertion: Optional[str]         # Element ID assertion

@dataclass
class CFILocation:
    offset: int                      # Character offset
    length: Optional[int]            # Range length (optional)

@dataclass
class ParsedCFI:
    spine_steps: List[CFIStep]       # Spine navigation steps
    content_steps: List[CFIStep]     # Content navigation steps
    location: Optional[CFILocation]  # Character position
```
### Exceptions
```python
from epub_cfi_toolkit import CFIError, EPUBError
# CFIError: Base exception for CFI parsing/processing errors
# EPUBError: Raised when EPUB file cannot be processed
```
## CFI Specification Compliance
This library implements the EPUB CFI specification, including:
### Character Escaping
Special characters are escaped with circumflex (^): `[ ] ^ , ( ) ;`
```python
# Original: element[with]brackets
# Escaped: element^[with^]brackets
```
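The escaping rule above can be sketched in a few lines. This is a standalone illustration, not the library's implementation: `escape_cfi_text` and `unescape_cfi_text` are hypothetical names, and the set of special characters is taken from the list given in this README.

```python
# Characters this README lists as requiring a circumflex escape.
SPECIAL_CHARS = "[]^,();"


def escape_cfi_text(text: str) -> str:
    """Prefix every special character with a circumflex."""
    return "".join("^" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def unescape_cfi_text(text: str) -> str:
    """Drop each escaping circumflex and keep the character that follows it."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "^":
            # The next character is taken literally; a trailing lone ^ is dropped.
            ch = next(chars, "")
        out.append(ch)
    return "".join(out)


escaped = escape_cfi_text("element[with]brackets")
print(escaped)                     # element^[with^]brackets
print(unescape_cfi_text(escaped))  # element[with]brackets
```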
### Range CFI Syntax
Supports comma-separated range CFIs:
```python
# Format: epubcfi(parent_path, start_path, end_path); start and end are relative to parent_path
"epubcfi(/6/4[chapter]!, /2/1:5, /2/1:15)"
```
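As a rough sketch of how the three parts of a range CFI can be separated, the function below splits on commas that are neither escaped with a circumflex nor inside an assertion's square brackets. `split_range_cfi` is a hypothetical helper written for this example; the library's own parser may work differently.

```python
from typing import List, Tuple


def split_range_cfi(cfi: str) -> Tuple[str, str, str]:
    """Split a range CFI into (parent_path, start_path, end_path).

    Hypothetical helper: splits on top-level commas only, skipping
    circumflex-escaped characters and anything inside [...] assertions.
    """
    if not (cfi.startswith("epubcfi(") and cfi.endswith(")")):
        raise ValueError("not an epubcfi(...) string")
    body = cfi[len("epubcfi("):-1]

    parts: List[str] = []
    current: List[str] = []
    in_brackets = False
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "^" and i + 1 < len(body):
            # Keep the escape and the escaped character together, unexamined.
            current.append(body[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_brackets = True
        elif ch == "]":
            in_brackets = False
        if ch == "," and not in_brackets:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current).strip())

    if len(parts) != 3:
        raise ValueError("a range CFI has exactly three comma-separated parts")
    return parts[0], parts[1], parts[2]


parent, start, end = split_range_cfi("epubcfi(/6/4[chapter]!, /2/1:5, /2/1:15)")
print(parent + start)  # /6/4[chapter]!/2/1:5
print(parent + end)    # /6/4[chapter]!/2/1:15
```

Concatenating the parent path with each relative path gives the two end points as ordinary paths.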
### Element Assertions
Validates element ID assertions in square brackets:
```python
"epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)"
#             ^^^^^^^^^     ^^^^^^     ^^^^^^
# Element ID assertions are validated
```
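Conceptually, validating an assertion means checking that the element a step resolves to carries the asserted `id`. The snippet below is a minimal sketch of that check using the standard library's `xml.etree.ElementTree` rather than this toolkit; `assertion_matches` is a hypothetical name, and how the library reacts to a mismatch is not shown here.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def assertion_matches(element: ET.Element, assertion: Optional[str]) -> bool:
    """Return True when the step has no assertion or the element's id equals it."""
    if assertion is None:
        return True
    return element.get("id") == assertion


body = ET.fromstring('<body id="body01"><p id="para05">Hello</p></body>')
paragraph = body[0]

print(assertion_matches(body, "body01"))       # True
print(assertion_matches(paragraph, "para99"))  # False
```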
### Virtual Element Indices
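Handles virtual elements (index 0 and the index just past the last child element). Child elements take even indices (2, 4, 6, ...):
```python
"/4/0" # Before first child element
"/4/10" # After last child element (if the last element is at index 8, i.e. four child elements)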
```
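The index arithmetic can be sketched as follows, assuming the convention that child elements occupy even step indices. `resolve_element_step` is a hypothetical helper built on the standard library's `xml.etree.ElementTree`; it treats only index 0 and the single index after the last element as virtual, which may be stricter than what this library accepts.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def resolve_element_step(parent: ET.Element, index: int) -> Optional[ET.Element]:
    """Return the child element for an even step index, or None for a virtual position."""
    if index < 0 or index % 2 != 0:
        raise ValueError("element steps use even, non-negative indices")
    children = list(parent)  # child elements only; text is not included
    position = index // 2    # 2 -> first element, 4 -> second, ...
    if position == 0 or position == len(children) + 1:
        return None  # virtual: before the first or after the last child element
    if position > len(children) + 1:
        raise IndexError("step index is past the virtual end position")
    return children[position - 1]


parent = ET.fromstring("<div><p>a</p><p>b</p><p>c</p><p>d</p></div>")
print(resolve_element_step(parent, 0))       # None
print(resolve_element_step(parent, 8).text)  # d
print(resolve_element_step(parent, 10))      # None
```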
## What are EPUB CFIs?
EPUB Canonical Fragment Identifiers (CFIs) are a standard way to reference specific locations within EPUB documents.
**CFI Structure**: `epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)`
- `epubcfi(...)` - CFI wrapper
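- `/6` - Step within the package document, conventionally the `spine` element (its third child element)
- `/4[chap01ref]` - Spine item reference with assertion
- `!` - Indirection step (follows the spine item's reference into the content document)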
- `/4[body01]/10[para05]` - Content navigation path
- `/3:10` - Text node (3) with character offset (10)
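The breakdown above can be reproduced with a small regular-expression sketch. It only covers simple CFIs of the shape shown (one `!`, an optional trailing `:offset`, no ranges), and `break_down_cfi` is a hypothetical function written for this example, not the library's `CFIParser`.

```python
import re
from typing import Dict, List, Optional, Tuple

# One step: "/" + digits, optionally followed by an [assertion].
# Inside the assertion, "^" escapes the next character.
STEP_RE = re.compile(r"/(\d+)(?:\[((?:\^.|[^\]^])*)\])?")
OFFSET_RE = re.compile(r":(\d+)$")

Step = Tuple[int, Optional[str]]


def parse_steps(path: str) -> List[Step]:
    """Turn "/4[body01]/10" into [(4, "body01"), (10, None)]."""
    return [(int(index), assertion or None) for index, assertion in STEP_RE.findall(path)]


def break_down_cfi(cfi: str) -> Dict[str, object]:
    """Split a simple CFI into spine steps, content steps, and character offset."""
    if not (cfi.startswith("epubcfi(") and cfi.endswith(")")):
        raise ValueError("not an epubcfi(...) string")
    body = cfi[len("epubcfi("):-1]

    offset: Optional[int] = None
    match = OFFSET_RE.search(body)
    if match:
        offset = int(match.group(1))
        body = body[:match.start()]

    # Everything before "!" addresses the package document, the rest the content.
    spine_path, _, content_path = body.partition("!")
    return {
        "spine_steps": parse_steps(spine_path),
        "content_steps": parse_steps(content_path),
        "offset": offset,
    }


parts = break_down_cfi("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)")
print(parts["spine_steps"])    # [(6, None), (4, 'chap01ref')]
print(parts["content_steps"])  # [(4, 'body01'), (10, 'para05'), (3, None)]
print(parts["offset"])         # 10
```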
## Requirements
- **Python 3.8+**
- **lxml** - XML/HTML processing library
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.