# domnode
DOM nodes with browser rendering data for web automation.
A Python library for parsing and filtering DOM trees with browser rendering information. Supports HTML and Chrome DevTools Protocol snapshots.
## Installation
```bash
pip install domnode
```
## Quick Start
```python
from domnode import parse_html, filter_visible
html = """
<div>
    <script>console.log('hidden')</script>
    <div style="display: none">Hidden content</div>
    <button role="button" class="btn">Click me</button>
</div>
"""
root = parse_html(html)
visible = filter_visible(root)
for child in visible:
    print(child.tag, child.attrib)
# Output: button {'role': 'button', 'class': 'btn'}
```
## Features
- Parse HTML strings and CDP snapshots into rich DOM trees
- Filter visibility (display:none, visibility:hidden, opacity:0, zero-size)
- Filter semantically (keep only meaningful attributes, collapse wrappers)
- Access computed styles and bounding boxes
- 86 comprehensive unit tests
## Usage
### Parsing HTML
```python
from domnode.parsers import parse_html
html = '<div class="container"><button>Click</button></div>'
root = parse_html(html)
print(root.tag) # 'div'
print(root.attrib) # {'class': 'container'}
print(root.children[0]) # Node(tag='button', ...)
```
### Parsing CDP Snapshots
```python
from domnode.parsers import parse_cdp
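# With Playwright (inside an async function): open a CDP session for the page
cdp = await page.context.new_cdp_session(page)
snapshot = await cdp.send('DOMSnapshot.captureSnapshot', {
    # Computed style properties to capture; an empty list captures none
    'computedStyles': ['display', 'visibility', 'opacity', 'position'],
    'includeDOMRects': True
})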
root = parse_cdp(snapshot)
print(root.bounds) # BoundingBox(x=0, y=0, width=1920, height=1080)
print(root.styles) # {'display': 'block', 'position': 'static', ...}
```
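For orientation, a `DOMSnapshot.captureSnapshot` result is a flattened structure: node fields are parallel arrays, and names are indexes into a shared `strings` table. The sketch below decodes a hand-written miniature snapshot (not real browser output) using only the standard library. The `describe` helper is hypothetical and only illustrates the data layout; it is not how `parse_cdp` is implemented.

```python
# Hand-written miniature of a captureSnapshot result (illustrative data only).
snapshot = {
    "strings": ["HTML", "BODY", "BUTTON"],
    "documents": [{
        "nodes": {
            "parentIndex": [-1, 0, 1],   # -1 marks the root
            "nodeName": [0, 1, 2],       # indexes into "strings"
        },
        "layout": {
            "nodeIndex": [0, 1, 2],      # which node each layout entry belongs to
            "bounds": [[0, 0, 800, 600], [0, 0, 800, 600], [10, 20, 100, 50]],
        },
    }],
}


def describe(snapshot):
    """Return (tag, parent_tag, bounds) tuples for the first document."""
    strings = snapshot["strings"]
    doc = snapshot["documents"][0]
    names = [strings[i] for i in doc["nodes"]["nodeName"]]
    parents = doc["nodes"]["parentIndex"]
    # Layout arrays are parallel: bounds[k] belongs to node nodeIndex[k].
    bounds = dict(zip(doc["layout"]["nodeIndex"], doc["layout"]["bounds"]))
    rows = []
    for i, name in enumerate(names):
        parent = names[parents[i]] if parents[i] >= 0 else None
        rows.append((name, parent, bounds.get(i)))
    return rows


rows = describe(snapshot)
for row in rows:
    print(row)
```

Nodes without a layout entry (for example, elements that are not rendered) would get `None` as their bounds in this sketch.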
### Filtering Visible Elements
```python
from domnode import parse_html, filter_visible
html = """
<div>
    <script>alert('hidden')</script>
    <style>.hide { display: none; }</style>
    <div style="display: none">Hidden</div>
    <div style="opacity: 0">Invisible</div>
    <button>Visible</button>
</div>
"""
root = parse_html(html)
visible = filter_visible(root)
# Only button remains
assert len(visible.children) == 1
assert visible.children[0].tag == 'button'
```
### Filtering Semantic Content
```python
from domnode import parse_html, filter_semantic
html = """
<div class="wrapper" id="container">
    <div class="inner">
        <button class="btn" role="button" aria-label="Submit">Click</button>
    </div>
</div>
"""
root = parse_html(html)
semantic = filter_semantic(root)
# Wrappers collapsed, only semantic attributes remain
assert semantic.tag == 'button'
assert semantic.attrib == {'role': 'button', 'aria-label': 'Submit'}
```
### Combining Filters
```python
from domnode import parse_html, filter_all
html = """
<html>
    <head>
        <script src="app.js"></script>
    </head>
    <body class="page">
        <div class="wrapper">
            <button class="btn" role="button">Click</button>
        </div>
    </body>
</html>
"""
root = parse_html(html)
clean = filter_all(root)
# Head removed, wrappers collapsed, only semantic attributes
assert clean.tag == 'button'
assert clean.attrib == {'role': 'button'}
```
### Granular Filtering
```python
from domnode.parsers import parse_html
from domnode.filters.visibility import filter_css_hidden, filter_zero_dimensions
from domnode.filters.semantic import filter_attributes, collapse_wrappers
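html = '<div class="wrapper"><button class="btn" role="button">Click</button></div>'
root = parse_html(html)
# Apply specific filters one at a time, in the order you need
root = filter_css_hidden(root)
root = filter_attributes(root)
root = collapse_wrappers(root)
# filter_zero_dimensions(root) is called the same way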
```
### Working with Nodes
```python
from domnode import Node, Text, BoundingBox
# Create nodes
div = Node(tag='div', attrib={'class': 'container'})
button = Node(
    tag='button',
    attrib={'role': 'button'},
    styles={'display': 'block'},
    bounds=BoundingBox(x=10, y=20, width=100, height=50)
)
# Build tree
div.append(Text('Click here: '))
div.append(button)
button.append(Text('Submit'))
# Navigate
for child in div:
    if isinstance(child, Node):
        print(f"Element: {child.tag}")
    elif isinstance(child, Text):
        print(f"Text: {child.content}")
# Get all text
print(div.get_text()) # "Click here: Submit"
# Check visibility
print(button.is_visible()) # True
print(button.has_zero_size()) # False
```
## API Reference
### Types
**Node**
DOM element with tag, attributes, styles, bounds, metadata, and children.
**Text**
Text node with content.
**BoundingBox**
Element bounding box with x, y, width, height.
### Parsers
**parse_html(html: str) -> Node**
Parse HTML string to Node tree.
**parse_cdp(snapshot: dict) -> Node**
Parse CDP snapshot to Node tree.
### Preset Filters
**filter_visible(node) -> Node | None**
Remove all hidden elements.
**filter_semantic(node) -> Node | None**
Keep only semantic content.
**filter_all(node) -> Node | None**
Apply all filters.
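
Because the preset filters are typed `Node | None`, calling code may want to handle the case where the root itself is filtered out. The sketch below shows that pattern with plain dicts; `filter_stub` and `child_tags` are hypothetical stand-ins written for this example, not domnode functions.

```python
from typing import Optional


def filter_stub(tree: dict) -> Optional[dict]:
    """Stand-in for a preset filter: drops dict-nodes flagged hidden, may return None."""
    if tree.get("hidden"):
        return None
    filtered_children = (filter_stub(child) for child in tree.get("children", []))
    kept = [child for child in filtered_children if child is not None]
    return {**tree, "children": kept}


def child_tags(tree: dict) -> list:
    """Return the tags of surviving children, or [] if the whole tree was removed."""
    filtered = filter_stub(tree)
    if filtered is None:
        return []
    return [child["tag"] for child in filtered["children"]]


page = {"tag": "div", "children": [{"tag": "script", "hidden": True}, {"tag": "button"}]}
print(child_tags(page))
print(child_tags({"tag": "div", "hidden": True}))
```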
### Visibility Filters
**filter_non_visible_tags(node)**
Remove script, style, head, meta, etc.
**filter_css_hidden(node)**
Remove display:none, visibility:hidden, opacity:0.
**filter_zero_dimensions(node)**
Remove zero-width/height elements.
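
As a rough illustration of the rules listed above, the following sketch checks an inline style string and a pair of dimensions. The helper names (`parse_inline_style`, `is_css_hidden`, `has_zero_size`) are assumptions made for this example; domnode's own filters may use computed styles and handle more cases.

```python
def parse_inline_style(style):
    """Split 'a: b; c: d' into {'a': 'b', 'c': 'd'} (names and values lowercased)."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def is_css_hidden(styles):
    """True for display:none, visibility:hidden, or an opacity that parses to 0."""
    if styles.get("display") == "none":
        return True
    if styles.get("visibility") == "hidden":
        return True
    try:
        return float(styles.get("opacity", "1")) == 0.0
    except ValueError:
        # Unparseable opacity values are treated as visible in this sketch.
        return False


def has_zero_size(width, height):
    """True when either dimension is zero (or negative)."""
    return width <= 0 or height <= 0


print(is_css_hidden(parse_inline_style("display: none")))
print(is_css_hidden(parse_inline_style("color: red; opacity: 0.5")))
print(has_zero_size(0, 50))
```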
### Semantic Filters
**filter_attributes(node, keep=SEMANTIC_ATTRIBUTES)**
Keep only semantic attributes.
**filter_empty(node)**
Remove empty nodes.
**collapse_wrappers(node)**
Collapse single-child wrapper elements.
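
One way to picture wrapper collapsing: an element that carries no attributes and wraps exactly one element can be replaced by that child. The sketch below applies this rule to a tiny stand-in tree. The `El` class and `collapse` function are hypothetical, and the exact rule domnode applies may differ.

```python
from dataclasses import dataclass, field


@dataclass
class El:
    """Stand-in for a DOM element; not domnode's Node class."""
    tag: str
    attrib: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # El instances or plain strings


def collapse(el):
    """Replace attribute-less elements that wrap exactly one element with that child."""
    el.children = [collapse(c) if isinstance(c, El) else c for c in el.children]
    only_child = el.children[0] if len(el.children) == 1 else None
    if isinstance(only_child, El) and not el.attrib:
        return only_child
    return el


tree = El("div", children=[El("div", children=[El("button", {"role": "button"}, ["Click"])])])
result = collapse(tree)
print(result.tag, result.attrib)  # button {'role': 'button'}

# A wrapper that also holds text is kept, since it has more than one child.
mixed = collapse(El("div", children=["Label", El("a", {"href": "/next"})]))
print(mixed.tag)  # div
```

Collapsing bottom-up (children first) lets a chain of nested wrappers disappear in a single pass.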
### Node Methods
**node.append(child)**
Add a child node or text.
**node.remove(child)**
Remove a child.
**node.is_visible()**
Check if element is visible.
**node.has_zero_size()**
Check if element has zero dimensions.
**node.get_text(separator='')**
Get all text content recursively.
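
The `separator` argument matters when adjacent text nodes carry no whitespace of their own. The sketch below mimics recursive text collection on a stand-in tree; `El` and `collect_text` are hypothetical names for this example and only approximate what `get_text` is described as doing.

```python
from dataclasses import dataclass, field


@dataclass
class El:
    """Stand-in for a DOM element; children are El instances or plain strings."""
    tag: str
    children: list = field(default_factory=list)


def collect_text(node, separator=""):
    """Gather every text piece in document order and join with separator."""
    pieces = []

    def walk(current):
        if isinstance(current, str):
            pieces.append(current)
        else:
            for child in current.children:
                walk(child)

    walk(node)
    return separator.join(pieces)


tree = El("p", ["Name", El("b", ["Ada"])])
print(collect_text(tree))        # NameAda
print(collect_text(tree, " "))   # Name Ada
```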
## Semantic Attributes
By default, `filter_attributes` keeps these attributes:
```python
SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected"
}
```
To keep a different set of attributes, pass your own set as `keep`:
```python
from domnode.filters.semantic import filter_attributes
custom_attrs = {"role", "href", "data-test-id"}
filtered = filter_attributes(node, keep=custom_attrs)
```
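The `keep=SEMANTIC_ATTRIBUTES` signature suggests that a custom set replaces the defaults rather than extending them, though that is an assumption drawn from the signature. Under that assumption, a set union is one way to add attributes while keeping the defaults. The sketch below demonstrates the difference on a plain dict; `keep_attributes` is a hypothetical stand-in and the set is copied from the list above.

```python
# Copied from the default list above so the example is self-contained.
SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected",
}


def keep_attributes(attrib, keep=SEMANTIC_ATTRIBUTES):
    """Stand-in for attribute filtering: keep only the names listed in keep."""
    return {name: value for name, value in attrib.items() if name in keep}


attrib = {"class": "btn", "aria-label": "Submit", "data-test-id": "submit", "href": "/go"}

replaced = keep_attributes(attrib, keep={"role", "href", "data-test-id"})
extended = keep_attributes(attrib, keep=SEMANTIC_ATTRIBUTES | {"data-test-id"})

print(replaced)  # aria-label is dropped because it is not in the custom set
print(extended)  # aria-label survives because the defaults are included
```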
## Use Cases
**Web Scraping**
Extract only visible, meaningful content from web pages.
**Browser Automation**
Filter DOM to only interactive elements for AI agents.
**LLM Context**
Reduce HTML to essential semantic structure for language models.
**Accessibility Testing**
Analyze semantic attributes and ARIA labels.
**Testing**
Build and manipulate DOM trees programmatically.
## Development
```bash
# Clone repository
git clone https://github.com/steve-z-wang/domnode.git
cd domnode
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=domnode --cov-report=html
```
## License
MIT
## Contributing
Contributions are welcome. Please submit a Pull Request.
## Related Projects
[domcontext](https://github.com/steve-z-wang/domcontext) - DOM to LLM context with markdown serialization
[natural-selector](https://github.com/steve-z-wang/natural-selector) - Natural language element selection with RAG
Raw data
{
"_id": null,
"home_page": null,
"name": "domnode",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "dom, html, parser, web-automation, browser, cdp, selenium, playwright, puppeteer, web-scraping",
"author": null,
"author_email": "Steve Wang <steve.z.wang@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/85/e9/16eb57a4b7c814ad9da4e207472bd33de1fbbf45d4d5a2f4ed18134d2d5f/domnode-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "DOM nodes with browser rendering data for web automation",
"version": "0.2.0",
"project_urls": {
"Changelog": "https://github.com/steve-z-wang/domnode/releases",
"Homepage": "https://github.com/steve-z-wang/domnode",
"Issues": "https://github.com/steve-z-wang/domnode/issues",
"Repository": "https://github.com/steve-z-wang/domnode"
},
"split_keywords": [
"dom",
" html",
" parser",
" web-automation",
" browser",
" cdp",
" selenium",
" playwright",
" puppeteer",
" web-scraping"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "02d642d2da646b860ef7e1b8f0ddf6c88d96d4b59841cd25c5d8c10613f6e44a",
"md5": "fa3d6c6c47299a625feabafd0cb936a7",
"sha256": "4b645edb338ef82ac3b33a945185783979663a5483dd70575da6fc74a7c395c6"
},
"downloads": -1,
"filename": "domnode-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fa3d6c6c47299a625feabafd0cb936a7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 19758,
"upload_time": "2025-10-19T23:58:27",
"upload_time_iso_8601": "2025-10-19T23:58:27.368082Z",
"url": "https://files.pythonhosted.org/packages/02/d6/42d2da646b860ef7e1b8f0ddf6c88d96d4b59841cd25c5d8c10613f6e44a/domnode-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "85e916eb57a4b7c814ad9da4e207472bd33de1fbbf45d4d5a2f4ed18134d2d5f",
"md5": "a22710d55a35224b4dff08cc486f7ed3",
"sha256": "033e0fdaeadca57e0325133e5c000eb7ef4434c23d148d8cf6497d33c6389fc0"
},
"downloads": -1,
"filename": "domnode-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "a22710d55a35224b4dff08cc486f7ed3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 20803,
"upload_time": "2025-10-19T23:58:28",
"upload_time_iso_8601": "2025-10-19T23:58:28.660653Z",
"url": "https://files.pythonhosted.org/packages/85/e9/16eb57a4b7c814ad9da4e207472bd33de1fbbf45d4d5a2f4ed18134d2d5f/domnode-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-19 23:58:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "steve-z-wang",
"github_project": "domnode",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "domnode"
}