domcontext

Name: domcontext
Version: 0.1.3
Summary: Parse DOM trees into clean, LLM-friendly context
Upload time: 2025-10-19 06:24:04
Author: Steve Wang
Requires Python: >=3.8
License: MIT
Keywords: dom, html, llm, parsing, web, automation

# domcontext

**Parse DOM trees into clean, LLM-friendly context.**

Converts messy HTML/CDP snapshots into structured markdown for LLM context windows. Designed for web automation agents that need to provide clean DOM context to LLMs.

> **โš ๏ธ Development Status:** This package is in active development (v0.1.x). APIs may change between minor versions. Not recommended for production use yet.

> **Why "domcontext"?** It's a double pun! 🎯
> - **DOM** (Document Object Model) + **context** (LLM context windows)
> - Provides **DOM context** for your LLM agents

[![Tests](https://img.shields.io/badge/tests-188%20passing-brightgreen)]()
[![Python](https://img.shields.io/badge/python-3.8+-blue)]()
[![License](https://img.shields.io/badge/license-MIT-blue)]()
[![Version](https://img.shields.io/badge/version-0.1.3--alpha-orange)]()

---

## Quick Start

```python
from domcontext import DomContext

# Parse HTML string
html = """
<html>
<head><title>Example</title></head>
<body>
    <nav><a href="/home">Home</a></nav>
    <main>
        <button type="submit">Search</button>
    </main>
</body>
</html>
"""

# Create DOM context
context = DomContext.from_html(html)

# Get markdown representation
print(context.markdown)
print(f"Tokens: {context.tokens}")

# Iterate through interactive elements
for element in context.elements():
    print(f"{element.id}: {element.tag} - {element.text}")
```

**Output:**
```
- body-1
  - nav-1
    - a-1 (href="/home")
      - "Home"
  - main-1
    - button-1 (type="submit")
      - "Search"

Tokens: 42
```

---

## Installation

```bash
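# Install the released package from PyPI
pip install domcontext

# Install from source (run inside a clone of the repository)
pip install -e .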

# Install with dev dependencies
pip install -e ".[dev]"

# Install with Playwright support (for live browser CDP capture)
pip install -e ".[playwright]"

# Install with Jupyter notebooks support (to run examples)
pip install -e ".[examples,playwright]"

# Install with all optional dependencies
pip install -e ".[dev,playwright,examples]"
```

After installing with Playwright support, install browser binaries:
```bash
playwright install chromium
```

---

## Features

- 🧹 **Semantic filtering** - Removes scripts, styles, hidden elements automatically
- 📉 **Token reduction** - 60% average reduction in token count
- 🎯 **Structure preservation** - Maintains DOM hierarchy in clean format
- 🔍 **Element lookup** - Access original DOM elements by their generated IDs
- 📊 **Token counting** - Built-in token counting with tiktoken
- 🎛️ **Configurable filtering** - Fine-tune visibility and semantic filters
- 📦 **Multiple input formats** - Support for HTML strings and CDP snapshots
- 🧩 **Smart chunking** - Split large DOMs with continuation markers (`...`) and parent context for seamless chunk boundaries

---

## API

### Parse HTML

```python
from domcontext import DomContext

# Basic parsing
context = DomContext.from_html(html_string)

# With custom filter options
context = DomContext.from_html(
    html_string,
    filter_non_visible=True,      # Remove script, style tags
    filter_css_hidden=True,        # Remove display:none, visibility:hidden
    filter_zero_dimensions=True,   # Remove zero-width/height elements
    filter_empty_elements=True,    # Remove empty wrapper divs
    filter_attributes=True,        # Keep only semantic attributes
    collapse_wrappers=True         # Collapse single-child wrappers
)
```
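
To make the `filter_css_hidden` flag more concrete, here is a minimal standalone sketch of how an inline `style` attribute could be checked for `display:none` or `visibility:hidden`. The helper `is_css_hidden` is hypothetical and is not part of domcontext; the library's actual check may differ (for example, it may rely on computed styles when parsing CDP snapshots).

```python
def is_css_hidden(style: str) -> bool:
    """Return True if an inline style hides the element (hypothetical helper)."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            # Normalise case and ignore a trailing "!important".
            value = value.lower().replace("!important", "").strip()
            declarations[name.strip().lower()] = value
    return (
        declarations.get("display") == "none"
        or declarations.get("visibility") == "hidden"
    )


print(is_css_hidden("color: red; display: none"))
print(is_css_hidden("display: block"))
```

This only inspects inline styles; rules coming from stylesheets would need computed style information that a plain HTML string does not carry.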

### Parse CDP Snapshot

```python
# From Chrome DevTools Protocol snapshot
cdp_snapshot = {
    'documents': [...],
    'strings': [...]
}

context = DomContext.from_cdp(cdp_snapshot)
```
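
The `documents` and `strings` keys match the shape returned by the CDP method `DOMSnapshot.captureSnapshot`, where node fields are stored as indices into a shared string table. Assuming that is the format `from_cdp` expects (the README does not say so explicitly), the sketch below resolves node names from a tiny hand-built snapshot. The helper `node_names` and the sample data are illustrative only.

```python
from typing import Any, Dict, List


def node_names(snapshot: Dict[str, Any]) -> List[str]:
    """Resolve nodeName string indices for the first document (hypothetical helper)."""
    strings = snapshot["strings"]
    nodes = snapshot["documents"][0]["nodes"]
    return [strings[index] for index in nodes["nodeName"]]


# Tiny hand-built snapshot: HTML > BODY > BUTTON (illustrative data only).
tiny_snapshot = {
    "strings": ["HTML", "BODY", "BUTTON"],
    "documents": [
        {"nodes": {"parentIndex": [-1, 0, 1], "nodeName": [0, 1, 2]}}
    ],
}

print(node_names(tiny_snapshot))
```

Real snapshots carry many more parallel arrays (attributes, node values, layout data), all indexed the same way.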

### Access Context

```python
# Markdown representation
markdown = context.markdown

# Token count
token_count = context.tokens

# Get all interactive elements
for element in context.elements():
    print(f"ID: {element.id}")
    print(f"Tag: {element.tag}")
    print(f"Text: {element.text}")
    print(f"Attributes: {element.attributes}")

# Get element by ID
element = context.get_element("button-1")
print(element.attributes)  # {'type': 'submit'}

# Get as dictionary
data = context.to_dict()
```

### Chunking

```python
# Split large DOMs into chunks with smart continuation markers
for chunk in context.chunks(max_tokens=1000, overlap=100):
    print(f"Chunk tokens: {chunk.tokens}")
    print(chunk.markdown)

# Chunks automatically include:
# - Parent path context (e.g., "- body-1\n  - div-1")
# - Continuation markers (...) when elements span chunks
# Example: "- button-1 (type="submit" ...)" → continues in next chunk

# Disable parent path if needed
for chunk in context.chunks(max_tokens=1000, overlap=100, include_parent_path=False):
    print(chunk.markdown)  # No parent context
```
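
The parent path idea can be sketched without the library. The function below greedily packs the lines of an indented outline into chunks and re-emits the ancestors of the first line at the start of each later chunk. It is a simplified illustration, not domcontext's chunker: `chunk_outline` is a hypothetical name, tokens are approximated by whitespace-separated words, and overlap is left out.

```python
from typing import List


def indent_of(line: str) -> int:
    """Number of leading spaces, used as the nesting depth."""
    return len(line) - len(line.lstrip(" "))


def chunk_outline(lines: List[str], max_tokens: int) -> List[List[str]]:
    """Greedy chunking; later chunks start with the parent path (sketch)."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    stack: List[str] = []  # ancestors of the line being processed

    for line in lines:
        # Drop entries that are not strictly less indented than this line.
        while stack and indent_of(stack[-1]) >= indent_of(line):
            stack.pop()
        cost = len(line.split())  # crude whitespace token count
        if current and used + cost > max_tokens:
            chunks.append(current)
            current = list(stack)  # re-emit the parent path
            used = sum(len(ancestor.split()) for ancestor in current)
        current.append(line)
        used += cost
        stack.append(line)
    if current:
        chunks.append(current)
    return chunks


outline = [
    "- body-1",
    "  - nav-1",
    '    - a-1 (href="/home")',
    '      - "Home"',
    "  - main-1",
    '    - button-1 (type="submit")',
    '      - "Search"',
]
for chunk in chunk_outline(outline, max_tokens=9):
    print("\n".join(chunk))
    print("---")
```

Note that this sketch lets a chunk exceed `max_tokens` when the parent path plus a single line is already too large; a real chunker has to split such lines further.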

### Custom Tokenizer

```python
from domcontext import DomContext, Tokenizer

class CustomTokenizer(Tokenizer):
    def count_tokens(self, text: str) -> int:
        # Your custom token counting logic
        return len(text.split())

context = DomContext.from_html(html, tokenizer=CustomTokenizer())
```

### Playwright Utilities (Optional)

Capture CDP snapshots directly from live browser sessions:

```python
import asyncio

from playwright.async_api import async_playwright
from domcontext import DomContext
from domcontext.utils import capture_snapshot

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')

        # Capture CDP snapshot from live page
        snapshot = await capture_snapshot(page)

        # Parse into DomContext
        context = DomContext.from_cdp(snapshot)
        print(context.markdown)

        await browser.close()

# Run the coroutine when executed as a script: python script.py
if __name__ == "__main__":
    asyncio.run(main())
```

**Note:** Requires the Playwright extra: `pip install "domcontext[playwright]"`

---

## Architecture

The library uses a multi-stage filtering pipeline:

1. **Parse** - HTML/CDP → DomIR (complete DOM tree with all data)
2. **Visibility Filter** - Remove non-visible elements (optional flags)
   - Non-visible tags (script, style, head)
   - CSS hidden elements (display:none, visibility:hidden)
   - Zero-dimension elements
3. **Semantic Filter** - Extract semantic information (optional flags)
   - Convert to SemanticIR
   - Filter to semantic attributes only
   - Remove empty nodes
   - Collapse wrapper divs
   - Generate readable IDs
4. **Output** - SemanticIR → Markdown/JSON
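
The stages above can be sketched in a few dozen lines using only the standard library. This is an illustration of the idea, not the library's `DomIR`/`SemanticIR` implementation: the class `OutlineBuilder` and the function `to_outline` are hypothetical, the stages are compressed into a single pass, and only the non-visible-tag filter and readable ID generation are shown.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

SKIPPED_TAGS = {"script", "style", "head"}  # non-visible tags listed above
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have children


class OutlineBuilder(HTMLParser):
    """Builds an indented outline with readable IDs (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []
        self.counters: Dict[str, int] = {}
        self.open_tags: List[str] = []  # emitted elements that are still open
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if self.skip_depth or tag in SKIPPED_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.counters[tag] = self.counters.get(tag, 0) + 1
        attr_text = " ".join(f'{k}="{v}"' for k, v in attrs if v is not None)
        suffix = f" ({attr_text})" if attr_text else ""
        indent = "  " * len(self.open_tags)
        self.lines.append(f"{indent}- {tag}-{self.counters[tag]}{suffix}")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag: str) -> None:
        if tag in VOID_TAGS:
            return  # void elements were never pushed or counted
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text and not self.skip_depth:
            indent = "  " * len(self.open_tags)
            self.lines.append(f'{indent}- "{text}"')


def to_outline(html: str) -> str:
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return "\n".join(builder.lines)


sample = (
    "<html><head><title>Example</title></head><body><main>"
    '<button type="submit">Search</button></main></body></html>'
)
print(to_outline(sample))
```

Unlike the Quick Start output, this sketch keeps the root `html-1` node and does not filter attributes or collapse wrappers.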

---

## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=domcontext --cov-report=html

# Run specific test suite
pytest tests/unit/parsers/
pytest tests/unit/filters/
pytest tests/unit/ir/
```

**Test Coverage:**
- 188 tests passing
- HTML Parser (13 tests)
- CDP Parser (12 tests)
- DomIR Layer (27 tests)
- SemanticIR Layer (34 tests)
- Visibility Filters (43 tests)
- Semantic Filters (28 tests)
- Chunker (15 tests)
- Tokenizers (13 tests)
- Smoke tests (3 tests)

---

## Use Cases

- **Web automation agents** - Provide clean DOM context to LLMs for element selection
- **Web scraping** - Extract structured content from complex pages
- **Testing** - Generate clean snapshots of DOM state
- **Accessibility** - Extract semantic structure from pages

---

## License

MIT

---

## Examples

Check out the interactive Jupyter notebooks in `examples/`:

- **`simple_demo.ipynb`** - Quick start guide with Google search example
  - Element lookup by ID
  - Chunking demonstration
  - Perfect for beginners

- **`advanced_demo.ipynb`** - Advanced features showcase
  - Custom filters and tokenizers
  - Element iteration and statistics
  - LLM prompt generation
  - Production patterns

Run with:
```bash
jupyter notebook examples/simple_demo.ipynb
```

---

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint
ruff check src/ tests/
```

---

## TODO

### Collapsing Improvements

1. **Collapse text-wrapping elements** - Improve wrapper collapsing to also collapse elements that only wrap text (not just elements that wrap other elements). Currently, `<a><span>text</span></a>` keeps the `span`, but it should be collapsed to `<a>text</a>` if the span has no attributes. Exception: Don't collapse interactive elements (button, input, a, select, textarea, etc.) even when they only wrap text, as these are semantically meaningful.
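
A minimal sketch of the proposed rule, under the assumptions stated in the item above: a child is inlined into its parent when it wraps only text, has no attributes, and is not an interactive element. The `Node` type and `collapse_text_wrappers` are hypothetical stand-ins for illustration, not the library's SemanticIR classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

INTERACTIVE_TAGS = {"button", "input", "a", "select", "textarea"}


@dataclass
class Node:
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Node", str]] = field(default_factory=list)


def collapse_text_wrappers(node: Node) -> Node:
    """Inline attribute-less, non-interactive children that wrap only text."""
    new_children: List[Union[Node, str]] = []
    for child in node.children:
        if isinstance(child, str):
            new_children.append(child)
            continue
        child = collapse_text_wrappers(child)  # collapse bottom-up
        only_text = bool(child.children) and all(
            isinstance(grandchild, str) for grandchild in child.children
        )
        if only_text and not child.attributes and child.tag not in INTERACTIVE_TAGS:
            new_children.extend(child.children)  # drop the wrapper, keep its text
        else:
            new_children.append(child)
    return Node(node.tag, dict(node.attributes), new_children)


link = Node("a", {"href": "/home"}, [Node("span", {}, ["text"])])
print(collapse_text_wrappers(link))
```

Here the `span` disappears and the `a` keeps its `href`, while a `button` nested in a plain `div` would be left in place because it is interactive.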

### Evaluation & Benchmarking

2. **Mind2Web dataset evaluation** - Conduct comprehensive testing on the [Mind2Web dataset](https://osu-nlp-group.github.io/Mind2Web/) to evaluate DOM context quality, token reduction rates, and element selection accuracy across diverse real-world websites. The report will include performance metrics, edge cases discovered, and a comparison with baseline HTML parsing.

---

## Recently Completed

### ✅ Chunking Improvements (v0.1.3)

- **Atomic-level chunking** - Implemented word-by-word text splitting and attribute-by-attribute element splitting with continuation markers (`...`)
- **Smart chunk boundaries** - Text and attributes now split across chunks seamlessly with proper context preservation
- **Parent path context** - Each chunk includes parent hierarchy for better LLM understanding
- **Better token utilization** - No more wasted chunk capacity from oversized single-line elements
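
Word-by-word splitting with continuation markers can be illustrated as follows. This is a sketch only: `split_text` is a hypothetical helper, it budgets by words rather than tokens, and the exact placement of the `...` markers in domcontext's output is an assumption.

```python
from typing import List


def split_text(text: str, max_words: int) -> List[str]:
    """Split text into pieces of at most max_words words.

    Unfinished pieces end with "..." and resumed pieces start with "...".
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    pieces: List[str] = []
    for start in range(0, len(words), max_words):
        piece = " ".join(words[start:start + max_words])
        if start > 0:
            piece = "... " + piece  # continues a previous piece
        if start + max_words < len(words):
            piece = piece + " ..."  # more text follows in the next piece
        pieces.append(piece)
    return pieces


for piece in split_text("one two three four five", max_words=2):
    print(piece)
```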

---

## Contributing

Contributions welcome! Please ensure tests pass and add new tests for new features.

```bash
# Run full test suite
pytest -v

# Check coverage
pytest --cov=domcontext
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "domcontext",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "dom, html, llm, parsing, web, automation",
    "author": "Steve Wang",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/46/5b/9ba30e283a4d80273c5b8050ae2391fe906f17323766fe0f623ba0a1ee21/domcontext-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Parse DOM trees into clean, LLM-friendly context",
    "version": "0.1.3",
    "project_urls": {
        "Homepage": "https://github.com/steve-z-wang/domcontext",
        "Issues": "https://github.com/steve-z-wang/domcontext/issues",
        "Repository": "https://github.com/steve-z-wang/domcontext"
    },
    "split_keywords": [
        "dom",
        " html",
        " llm",
        " parsing",
        " web",
        " automation"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7a5307ad368ad4a56440387f40a6083d25086d015f0bc1c70a87cebece780ed1",
                "md5": "ad2ee9744998ef74e7292af1f5bb6a88",
                "sha256": "5602a31d6163eee6cfc06d1dc5785f8edd051e28b45254f03cb629b724e1ddc0"
            },
            "downloads": -1,
            "filename": "domcontext-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ad2ee9744998ef74e7292af1f5bb6a88",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 37991,
            "upload_time": "2025-10-19T06:24:03",
            "upload_time_iso_8601": "2025-10-19T06:24:03.708737Z",
            "url": "https://files.pythonhosted.org/packages/7a/53/07ad368ad4a56440387f40a6083d25086d015f0bc1c70a87cebece780ed1/domcontext-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "465b9ba30e283a4d80273c5b8050ae2391fe906f17323766fe0f623ba0a1ee21",
                "md5": "d4d2663cc4e7b68fde074f8b1d70cb90",
                "sha256": "f279a41bdb17a25400083ff97e9e272470b5624384329acd3c3b4db4b3ea9377"
            },
            "downloads": -1,
            "filename": "domcontext-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "d4d2663cc4e7b68fde074f8b1d70cb90",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 30809,
            "upload_time": "2025-10-19T06:24:04",
            "upload_time_iso_8601": "2025-10-19T06:24:04.542263Z",
            "url": "https://files.pythonhosted.org/packages/46/5b/9ba30e283a4d80273c5b8050ae2391fe906f17323766fe0f623ba0a1ee21/domcontext-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-19 06:24:04",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "steve-z-wang",
    "github_project": "domcontext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "domcontext"
}
        