Name | domcontext |
Version | 0.1.3 |
home_page | None |
Summary | Parse DOM trees into clean, LLM-friendly context |
upload_time | 2025-10-19 06:24:04 |
maintainer | None |
docs_url | None |
author | Steve Wang |
requires_python | >=3.8 |
license | MIT |
keywords | dom, html, llm, parsing, web, automation |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# domcontext
**Parse DOM trees into clean, LLM-friendly context.**
Converts messy HTML/CDP snapshots into structured markdown for LLM context windows. Designed for web automation agents that need to provide clean DOM context to LLMs.
> **⚠️ Development Status:** This package is in active development (v0.1.x). APIs may change between minor versions. Not recommended for production use yet.
> **Why "domcontext"?** It's a double pun! 🎯
> - **DOM** (Document Object Model) + **context** (LLM context windows)
> - Provides **DOM context** for your LLM agents
---
## Quick Start
```python
from domcontext import DomContext
# Parse HTML string
html = """
<html>
<head><title>Example</title></head>
<body>
<nav><a href="/home">Home</a></nav>
<main>
<button type="submit">Search</button>
</main>
</body>
</html>
"""
# Create DOM context
context = DomContext.from_html(html)
# Get markdown representation
print(context.markdown)
print(f"Tokens: {context.tokens}")
# Iterate through interactive elements
for element in context.elements():
    print(f"{element.id}: {element.tag} - {element.text}")
```
**Output:**
```
- body-1
  - nav-1
    - a-1 (href="/home")
      - "Home"
  - main-1
    - button-1 (type="submit")
      - "Search"
Tokens: 42
```
---
## Installation
```bash
# Install from source
pip install -e .
# Install with dev dependencies
pip install -e ".[dev]"
# Install with Playwright support (for live browser CDP capture)
pip install -e ".[playwright]"
# Install with Jupyter notebooks support (to run examples)
pip install -e ".[examples,playwright]"
# Install with all optional dependencies
pip install -e ".[dev,playwright,examples]"
```
After installing with Playwright support, install browser binaries:
```bash
playwright install chromium
```
---
## Features
- 🧹 **Semantic filtering** - Removes scripts, styles, hidden elements automatically
- 📉 **Token reduction** - 60% average reduction in token count
- 🎯 **Structure preservation** - Maintains DOM hierarchy in clean format
- 🔍 **Element lookup** - Access original DOM elements by their generated IDs
- 📊 **Token counting** - Built-in token counting with tiktoken
- 🎛️ **Configurable filtering** - Fine-tune visibility and semantic filters
- 📦 **Multiple input formats** - Support for HTML strings and CDP snapshots
- 🧩 **Smart chunking** - Split large DOMs with continuation markers (`...`) and parent context for seamless chunk boundaries
---
## API
### Parse HTML
```python
from domcontext import DomContext
# Basic parsing
context = DomContext.from_html(html_string)
# With custom filter options
context = DomContext.from_html(
    html_string,
    filter_non_visible=True,      # Remove script, style tags
    filter_css_hidden=True,       # Remove display:none, visibility:hidden
    filter_zero_dimensions=True,  # Remove zero-width/height elements
    filter_empty_elements=True,   # Remove empty wrapper divs
    filter_attributes=True,       # Keep only semantic attributes
    collapse_wrappers=True        # Collapse single-child wrappers
)
```
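To make the visibility flags more concrete, here is a minimal standard-library sketch of the kind of filtering that `filter_non_visible` and `filter_css_hidden` describe. It is not domcontext's implementation: the tag sets, the `VisibleTextExtractor` class, and the `visible_text` helper are assumptions made for illustration, and the sketch only inspects inline `style` attributes on well-formed markup.

```python
from html.parser import HTMLParser

# Tags treated as non-visible in this sketch (an assumption, not the library's list).
NON_VISIBLE_TAGS = {"script", "style", "head"}
# Void elements never get a closing tag, so they must not affect nesting.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def is_css_hidden(attrs):
    """Return True if an inline style hides the element."""
    style = (dict(attrs).get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside a non-visible or CSS-hidden element."""

    def __init__(self):
        super().__init__()
        self.hidden_stack = []  # one bool per open element: does it hide its subtree?
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.hidden_stack.append(tag in NON_VISIBLE_TAGS or is_css_hidden(attrs))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when no open ancestor hides its subtree.
        if text and not any(self.hidden_stack):
            self.texts.append(text)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


sample = """
<html>
<head><title>Example</title><style>p { color: red; }</style></head>
<body>
<script>var x = 1;</script>
<p>Hello</p>
<div style="display: none"><span>Secret</span></div>
<button type="submit">Go</button>
</body>
</html>
"""
print(visible_text(sample))  # ['Hello', 'Go']
```

A real filter would also need computed styles and layout data (which is what a CDP snapshot provides) to catch elements hidden by stylesheets or zero dimensions.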
### Parse CDP Snapshot
```python
# From Chrome DevTools Protocol snapshot
cdp_snapshot = {
    'documents': [...],
    'strings': [...]
}
context = DomContext.from_cdp(cdp_snapshot)
```
### Access Context
```python
# Markdown representation
markdown = context.markdown
# Token count
token_count = context.tokens
# Get all interactive elements
for element in context.elements():
    print(f"ID: {element.id}")
    print(f"Tag: {element.tag}")
    print(f"Text: {element.text}")
    print(f"Attributes: {element.attributes}")
# Get element by ID
element = context.get_element("button-1")
print(element.attributes) # {'type': 'submit'}
# Get as dictionary
data = context.to_dict()
```
### Chunking
```python
# Split large DOMs into chunks with smart continuation markers
for chunk in context.chunks(max_tokens=1000, overlap=100):
    print(f"Chunk tokens: {chunk.tokens}")
    print(chunk.markdown)
# Chunks automatically include:
# - Parent path context (e.g., "- body-1\n  - div-1")
# - Continuation markers (...) when elements span chunks
# Example: "- button-1 (type="submit" ...)" → continues in next chunk
# Disable parent path if needed
for chunk in context.chunks(max_tokens=1000, overlap=100, include_parent_path=False):
    print(chunk.markdown)  # No parent context
```
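The parent-path idea can be illustrated without the library. The sketch below is a simplified assumption of how such chunking might work, not domcontext's algorithm: `chunk_outline`, `count_tokens`, and `indent_of` are hypothetical helpers, whitespace word counts stand in for a real tokenizer, and neither `overlap` nor continuation markers are modeled.

```python
def count_tokens(text):
    # Whitespace word count stands in for a real tokenizer here.
    return len(text.split())


def indent_of(line):
    return len(line) - len(line.lstrip(" "))


def chunk_outline(lines, max_tokens):
    """Group outline lines into chunks, repeating ancestors at each chunk start."""
    chunks, current, used = [], [], 0
    ancestors = []  # lines on the path from the root to the previous line

    for line in lines:
        # Keep only ancestors that are less indented than this line.
        while ancestors and indent_of(ancestors[-1]) >= indent_of(line):
            ancestors.pop()
        cost = count_tokens(line)
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            # Start the next chunk with the parent path for context.
            current = list(ancestors)
            used = sum(count_tokens(a) for a in current)
        current.append(line)
        used += cost
        ancestors.append(line)

    if current:
        chunks.append("\n".join(current))
    return chunks


outline = [
    "- body-1",
    "  - nav-1",
    '    - a-1 (href="/home")',
    '      - "Home"',
    "  - main-1",
    '    - button-1 (type="submit")',
    '      - "Search"',
]
for chunk in chunk_outline(outline, max_tokens=10):
    print(chunk)
    print("---")
```

With this budget the outline splits into two chunks, and the second one begins with `- body-1` so that `main-1` is not shown without its parent. Note that the repeated parent path counts against the budget in this sketch, so a deep path can push a chunk past `max_tokens`.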
### Custom Tokenizer
```python
from domcontext import DomContext, Tokenizer
class CustomTokenizer(Tokenizer):
    def count_tokens(self, text: str) -> int:
        # Your custom token counting logic
        return len(text.split())
context = DomContext.from_html(html, tokenizer=CustomTokenizer())
```
### Playwright Utilities (Optional)
Capture CDP snapshots directly from live browser sessions:
```python
import asyncio
from playwright.async_api import async_playwright
from domcontext import DomContext
from domcontext.utils import capture_snapshot
async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        # Capture CDP snapshot from live page
        snapshot = await capture_snapshot(page)
        # Parse into DomContext
        context = DomContext.from_cdp(snapshot)
        print(context.markdown)
        await browser.close()
# Run the coroutine when executed as a script: python script.py
asyncio.run(main())
```
**Note:** Requires installation with Playwright support: `pip install "domcontext[playwright]"`
---
## Architecture
The library uses a multi-stage filtering pipeline:
1. **Parse** - HTML/CDP → DomIR (complete DOM tree with all data)
2. **Visibility Filter** - Remove non-visible elements (optional flags)
   - Non-visible tags (script, style, head)
   - CSS hidden elements (display:none, visibility:hidden)
   - Zero-dimension elements
3. **Semantic Filter** - Extract semantic information (optional flags)
   - Convert to SemanticIR
   - Filter to semantic attributes only
   - Remove empty nodes
   - Collapse wrapper divs
   - Generate readable IDs
4. **Output** - SemanticIR → Markdown/JSON
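The final two steps of the list above (readable IDs and Markdown output) can be sketched on a tiny hand-built tree. This is an illustration under assumptions, not the library's code: the `Node` dataclass and `render` function are hypothetical, and the ID scheme (a per-tag counter in document order) is inferred from the Quick Start output rather than confirmed by the source.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a node in the semantic tree."""
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def render(root):
    """Render a tree as an indented outline, using per-tag counters as IDs."""
    counts = Counter()
    lines = []

    def visit(node, depth):
        counts[node.tag] += 1
        label = f"{node.tag}-{counts[node.tag]}"
        if node.attributes:
            attrs = " ".join(f'{k}="{v}"' for k, v in node.attributes.items())
            label += f" ({attrs})"
        lines.append("  " * depth + "- " + label)
        if node.text:
            lines.append("  " * (depth + 1) + f'- "{node.text}"')
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)


tree = Node("body", children=[
    Node("nav", children=[Node("a", {"href": "/home"}, "Home")]),
    Node("main", children=[Node("button", {"type": "submit"}, "Search")]),
])
print(render(tree))
```

For this tree the sketch prints the same outline as the Quick Start example, which is why the counter-based ID scheme seems a reasonable guess.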
---
## Testing
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=domcontext --cov-report=html
# Run specific test suite
pytest tests/unit/parsers/
pytest tests/unit/filters/
pytest tests/unit/ir/
```
**Test Coverage:**
- 188 tests passing
- HTML Parser (13 tests)
- CDP Parser (12 tests)
- DomIR Layer (27 tests)
- SemanticIR Layer (34 tests)
- Visibility Filters (43 tests)
- Semantic Filters (28 tests)
- Chunker (15 tests)
- Tokenizers (13 tests)
- Smoke tests (3 tests)
---
## Use Cases
- **Web automation agents** - Provide clean DOM context to LLMs for element selection
- **Web scraping** - Extract structured content from complex pages
- **Testing** - Generate clean snapshots of DOM state
- **Accessibility** - Extract semantic structure from pages
---
## License
MIT
---
## Examples
Check out the interactive Jupyter notebooks in `examples/`:
- **`simple_demo.ipynb`** - Quick start guide with Google search example
  - Element lookup by ID
  - Chunking demonstration
  - Perfect for beginners
- **`advanced_demo.ipynb`** - Advanced features showcase
  - Custom filters and tokenizers
  - Element iteration and statistics
  - LLM prompt generation
  - Production patterns
Run with:
```bash
jupyter notebook examples/simple_demo.ipynb
```
---
## Development
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
```
---
## TODO
### Collapsing Improvements
1. **Collapse text-wrapping elements** - Improve wrapper collapsing to also collapse elements that only wrap text (not just elements that wrap other elements). Currently, `<a><span>text</span></a>` keeps the `span`, but it should be collapsed to `<a>text</a>` if the span has no attributes. Exception: Don't collapse interactive elements (button, input, a, select, textarea, etc.) even when they only wrap text, as these are semantically meaningful.
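A minimal sketch of the proposed rule, assuming a simple tree model: the `Node` dataclass, the `INTERACTIVE_TAGS` set, and both helper functions are hypothetical and only illustrate the behavior described in this TODO item, which the library does not implement yet.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Interactive tags named in the TODO item; the real list may be longer.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


@dataclass
class Node:
    """Hypothetical stand-in for a node in the semantic tree."""
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def is_text_wrapper(node):
    """A non-interactive, attribute-free element that holds only text."""
    return (
        node.tag not in INTERACTIVE_TAGS
        and not node.attributes
        and not node.children
        and bool(node.text)
    )


def collapse_text_wrappers(node):
    """Hoist text out of a sole text-wrapper child, bottom-up, in place."""
    for child in node.children:
        collapse_text_wrappers(child)
    only_child = node.children[0] if len(node.children) == 1 else None
    if only_child is not None and not node.text and is_text_wrapper(only_child):
        node.text = only_child.text
        node.children = []
    return node


# <a href="/home"><span>Home</span></a> becomes <a href="/home">Home</a>
link = collapse_text_wrappers(Node("a", {"href": "/home"}, children=[Node("span", text="Home")]))
print(link.text, link.children)  # Home []

# <div><button>Go</button></div> keeps the button because it is interactive
wrapper = collapse_text_wrappers(Node("div", children=[Node("button", text="Go")]))
print(wrapper.children[0].tag)  # button
```

Processing children first means nested wrappers such as `<a><span><b>text</b></span></a>` collapse all the way up to the `a` element in a single pass.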
### Evaluation & Benchmarking
2. **Mind2Web dataset evaluation** - Conduct comprehensive testing on the [Mind2Web dataset](https://osu-nlp-group.github.io/Mind2Web/) to evaluate DOM context quality, token reduction rates, and element selection accuracy across diverse real-world websites. Report will include performance metrics, edge cases discovered, and comparison with baseline HTML parsing.
---
## Recently Completed
### ✅ Chunking Improvements (v0.1.3)
- **Atomic-level chunking** - Implemented word-by-word text splitting and attribute-by-attribute element splitting with continuation markers (`...`)
- **Smart chunk boundaries** - Text and attributes now split across chunks seamlessly with proper context preservation
- **Parent path context** - Each chunk includes parent hierarchy for better LLM understanding
- **Better token utilization** - No more wasted chunk capacity from oversized single-line elements
---
## Contributing
Contributions welcome! Please ensure tests pass and add new tests for new features.
```bash
# Run full test suite
pytest -v
# Check coverage
pytest --cov=domcontext
```
Raw data
{
"_id": null,
"home_page": null,
"name": "domcontext",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "dom, html, llm, parsing, web, automation",
"author": "Steve Wang",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/46/5b/9ba30e283a4d80273c5b8050ae2391fe906f17323766fe0f623ba0a1ee21/domcontext-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Parse DOM trees into clean, LLM-friendly context",
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/steve-z-wang/domcontext",
"Issues": "https://github.com/steve-z-wang/domcontext/issues",
"Repository": "https://github.com/steve-z-wang/domcontext"
},
"split_keywords": [
"dom",
" html",
" llm",
" parsing",
" web",
" automation"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "7a5307ad368ad4a56440387f40a6083d25086d015f0bc1c70a87cebece780ed1",
"md5": "ad2ee9744998ef74e7292af1f5bb6a88",
"sha256": "5602a31d6163eee6cfc06d1dc5785f8edd051e28b45254f03cb629b724e1ddc0"
},
"downloads": -1,
"filename": "domcontext-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ad2ee9744998ef74e7292af1f5bb6a88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 37991,
"upload_time": "2025-10-19T06:24:03",
"upload_time_iso_8601": "2025-10-19T06:24:03.708737Z",
"url": "https://files.pythonhosted.org/packages/7a/53/07ad368ad4a56440387f40a6083d25086d015f0bc1c70a87cebece780ed1/domcontext-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "465b9ba30e283a4d80273c5b8050ae2391fe906f17323766fe0f623ba0a1ee21",
"md5": "d4d2663cc4e7b68fde074f8b1d70cb90",
"sha256": "f279a41bdb17a25400083ff97e9e272470b5624384329acd3c3b4db4b3ea9377"
},
"downloads": -1,
"filename": "domcontext-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "d4d2663cc4e7b68fde074f8b1d70cb90",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 30809,
"upload_time": "2025-10-19T06:24:04",
"upload_time_iso_8601": "2025-10-19T06:24:04.542263Z",
"url": "https://files.pythonhosted.org/packages/46/5b/9ba30e283a4d80273c5b8050ae2391fe906f17323766fe0f623ba0a1ee21/domcontext-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-19 06:24:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "steve-z-wang",
"github_project": "domcontext",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "domcontext"
}