- Name: domnode
- Version: 0.2.0
- Summary: DOM nodes with browser rendering data for web automation
- Upload time: 2025-10-19 23:58:28
- Requires Python: >=3.9
- License: MIT
- Keywords: dom, html, parser, web-automation, browser, cdp, selenium, playwright, puppeteer, web-scraping

# domnode

DOM nodes with browser rendering data for web automation.

A Python library for parsing and filtering DOM trees with browser rendering information. Supports HTML and Chrome DevTools Protocol snapshots.

## Installation

```bash
pip install domnode
```

## Quick Start

```python
from domnode import parse_html, filter_visible

html = """
<div>
    <script>console.log('hidden')</script>
    <div style="display: none">Hidden content</div>
    <button role="button" class="btn">Click me</button>
</div>
"""

root = parse_html(html)
visible = filter_visible(root)

for child in visible:
    print(child.tag, child.attrib)
# Output: button {'role': 'button', 'class': 'btn'}
```

## Features

- Parse HTML strings and CDP snapshots into rich DOM trees
- Filter by visibility (display:none, visibility:hidden, opacity:0, zero-size)
- Filter by semantics (keep only meaningful attributes, collapse wrappers)
- Access computed styles and bounding boxes
- Test suite of 86 unit tests

## Usage

### Parsing HTML

```python
from domnode.parsers import parse_html

html = '<div class="container"><button>Click</button></div>'
root = parse_html(html)

print(root.tag)          # 'div'
print(root.attrib)       # {'class': 'container'}
print(root.children[0])  # Node(tag='button', ...)
```
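For intuition about the shape of a parsed tree, here is a minimal stand-alone sketch that uses only the standard library's `html.parser`. It is not domnode's implementation: `SketchNode` and `parse_sketch` are hypothetical names, and void elements, comments, and malformed markup are not handled.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SketchNode:
    """Hypothetical stand-in for an element node (not domnode's Node)."""
    tag: str
    attrib: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # SketchNode or str items


class _TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        node = SketchNode(tag, {k: (v or "") for k, v in attrs})
        if self._stack:
            self._stack[-1].children.append(node)
        elif self.root is None:
            self.root = node
        self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Keep non-blank text as a plain string child.
        if self._stack and data.strip():
            self._stack[-1].children.append(data)


def parse_sketch(html):
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse_sketch('<div class="container"><button>Click</button></div>')
print(root.tag, root.attrib, root.children[0].tag)
```

The stack of open elements is what turns a flat stream of start and end tags into parent/child links.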

### Parsing CDP Snapshots

```python
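from domnode.parsers import parse_cdp

# With Playwright's async API (Chromium only), open a CDP session for the page.
# `page` is an existing Page object; run this inside an async function.
cdp = await page.context.new_cdp_session(page)
snapshot = await cdp.send('DOMSnapshot.captureSnapshot', {
    # Computed style properties to capture; an empty list captures none.
    'computedStyles': ['display', 'visibility', 'opacity', 'position'],
    'includeDOMRects': True
})

root = parse_cdp(snapshot)
print(root.bounds)  # e.g. BoundingBox(x=0, y=0, width=1920, height=1080)
print(root.styles)  # e.g. {'display': 'block', 'position': 'static', ...}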
```
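A `DOMSnapshot.captureSnapshot` result stores each string once in a shared `strings` table and refers to it by index, and it keeps layout data in arrays that run parallel to `nodeIndex`. The sketch below decodes a tiny hand-written snapshot with plain Python to show that layout. It illustrates the CDP format only and is not how `parse_cdp` is implemented; `decode_nodes` is a hypothetical helper.

```python
def decode_nodes(snapshot):
    """Return one dict per DOM node with its name, parent index and bounds."""
    strings = snapshot["strings"]
    doc = snapshot["documents"][0]
    nodes, layout = doc["nodes"], doc["layout"]

    # Layout arrays are parallel: bounds[i] belongs to DOM node nodeIndex[i].
    bounds_by_node = dict(zip(layout["nodeIndex"], layout["bounds"]))

    decoded = []
    for i, name_idx in enumerate(nodes["nodeName"]):
        decoded.append({
            "name": strings[name_idx],           # look up the shared string table
            "parent": nodes["parentIndex"][i],   # -1 for the root in this sample
            "bounds": bounds_by_node.get(i),     # [x, y, width, height] or None
        })
    return decoded


# Hand-written miniature snapshot; real ones carry many more fields.
snapshot = {
    "strings": ["DIV", "BUTTON"],
    "documents": [{
        "nodes": {"parentIndex": [-1, 0], "nodeName": [0, 1]},
        "layout": {"nodeIndex": [0, 1],
                   "bounds": [[0, 0, 800, 600], [10, 20, 100, 50]]},
    }],
}

for node in decode_nodes(snapshot):
    print(node)
```

Nodes without a layout entry (for example elements that are not rendered) simply get `None` for their bounds in this sketch.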

### Filtering Visible Elements

```python
from domnode import parse_html, filter_visible

html = """
<div>
    <script>alert('hidden')</script>
    <style>.hide { display: none; }</style>
    <div style="display: none">Hidden</div>
    <div style="opacity: 0">Invisible</div>
    <button>Visible</button>
</div>
"""

root = parse_html(html)
visible = filter_visible(root)

# Only button remains
assert len(visible.children) == 1
assert visible.children[0].tag == 'button'
```
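The CSS rules named in the feature list (display:none, visibility:hidden, opacity:0) can be checked with a few lines of plain Python. This is a sketch under stated assumptions, not domnode's code: `parse_inline_style` and `is_css_hidden` are hypothetical helpers, only inline `style` attributes are considered, and an unparseable opacity is treated as visible.

```python
def parse_inline_style(style):
    """Turn 'display: none; opacity: 0' into {'display': 'none', 'opacity': '0'}."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def is_css_hidden(styles):
    """True if the style dict matches display:none, visibility:hidden or opacity:0."""
    if styles.get("display") == "none":
        return True
    if styles.get("visibility") == "hidden":
        return True
    try:
        return float(styles.get("opacity", "1")) == 0.0
    except ValueError:
        return False  # assumption: unparseable opacity counts as visible


print(is_css_hidden(parse_inline_style("display: none")))
print(is_css_hidden(parse_inline_style("opacity: 0")))
print(is_css_hidden(parse_inline_style("color: red")))
```

Styles that come from a stylesheet rather than an inline attribute are invisible to this check; computed styles from a CDP snapshot would be needed for those.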

### Filtering Semantic Content

```python
from domnode import parse_html, filter_semantic

html = """
<div class="wrapper" id="container">
    <div class="inner">
        <button class="btn" role="button" aria-label="Submit">Click</button>
    </div>
</div>
"""

root = parse_html(html)
semantic = filter_semantic(root)

# Wrappers collapsed, only semantic attributes remain
assert semantic.tag == 'button'
assert semantic.attrib == {'role': 'button', 'aria-label': 'Submit'}
```
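To make the two steps in that example concrete, the following sketch reproduces them on plain dictionaries. It assumes that a wrapper is an element left with no attributes and exactly one element child; domnode's actual rules may differ, and `strip_attributes`, `collapse`, and `KEEP` are hypothetical names.

```python
KEEP = {"role", "aria-label"}  # small illustrative subset of semantic attributes


def strip_attributes(node, keep=KEEP):
    """Return a copy of the tree keeping only attributes named in `keep`."""
    return {
        "tag": node["tag"],
        "attrib": {k: v for k, v in node["attrib"].items() if k in keep},
        "children": [c if isinstance(c, str) else strip_attributes(c, keep)
                     for c in node["children"]],
    }


def collapse(node):
    """Replace an attribute-less element that wraps one element by that element."""
    children = [c if isinstance(c, str) else collapse(c) for c in node["children"]]
    if not node["attrib"] and len(children) == 1 and not isinstance(children[0], str):
        return children[0]
    return {**node, "children": children}


tree = {"tag": "div", "attrib": {"class": "wrapper", "id": "container"}, "children": [
    {"tag": "div", "attrib": {"class": "inner"}, "children": [
        {"tag": "button",
         "attrib": {"class": "btn", "role": "button", "aria-label": "Submit"},
         "children": ["Click"]},
    ]},
]}

result = collapse(strip_attributes(tree))
print(result["tag"], result["attrib"])
```

Order matters here: the `div` elements only qualify as wrappers after their `class` and `id` attributes have been stripped.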

### Combining Filters

```python
from domnode import parse_html, filter_all

html = """
<html>
    <head>
        <script src="app.js"></script>
    </head>
    <body class="page">
        <div class="wrapper">
            <button class="btn" role="button">Click</button>
        </div>
    </body>
</html>
"""

root = parse_html(html)
clean = filter_all(root)

# Head removed, wrappers collapsed, only semantic attributes
assert clean.tag == 'button'
assert clean.attrib == {'role': 'button'}
```

### Granular Filtering

```python
from domnode.parsers import parse_html
from domnode.filters.visibility import filter_css_hidden, filter_zero_dimensions
from domnode.filters.semantic import filter_attributes, collapse_wrappers

html = '<div class="wrapper"><button class="btn" role="button">Click</button></div>'
root = parse_html(html)

# Apply only the filters you need, in order
root = filter_css_hidden(root)
root = filter_attributes(root)
root = collapse_wrappers(root)
# filter_zero_dimensions(root) is applied the same way
```
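The preset filters are documented as returning `Node | None`. Assuming the granular filters can also return `None` when the root itself is filtered out, a small helper keeps a chain from crashing on the next call. `apply_filters` is a hypothetical helper, shown here with toy filters over plain dictionaries so that it runs without domnode installed.

```python
def apply_filters(node, *filters):
    """Run filters left to right; skip the rest once one returns None."""
    for f in filters:
        if node is None:
            return None
        node = f(node)
    return node


# Toy filters over plain dicts, standing in for real filter functions.
def drop_hidden(node):
    return None if node.get("hidden") else node


def strip_class(node):
    return {k: v for k, v in node.items() if k != "class"}


print(apply_filters({"tag": "button", "class": "btn"}, drop_hidden, strip_class))
print(apply_filters({"tag": "div", "hidden": True}, drop_hidden, strip_class))
```

With domnode, the same helper could be called as `apply_filters(root, filter_css_hidden, filter_attributes, collapse_wrappers)`.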

### Working with Nodes

```python
from domnode import Node, Text, BoundingBox

# Create nodes
div = Node(tag='div', attrib={'class': 'container'})
button = Node(
    tag='button',
    attrib={'role': 'button'},
    styles={'display': 'block'},
    bounds=BoundingBox(x=10, y=20, width=100, height=50)
)

# Build tree
div.append(Text('Click here: '))
div.append(button)
button.append(Text('Submit'))

# Navigate
for child in div:
    if isinstance(child, Node):
        print(f"Element: {child.tag}")
    elif isinstance(child, Text):
        print(f"Text: {child.content}")

# Get all text
print(div.get_text())  # "Click here: Submit"

# Check visibility
print(button.is_visible())      # True
print(button.has_zero_size())   # False
```

## API Reference

### Types

**Node**
DOM element with tag, attributes, styles, bounds, metadata, and children.

**Text**
Text node with content.

**BoundingBox**
Element bounding box with x, y, width, height.

### Parsers

**parse_html(html: str) -> Node**
Parse HTML string to Node tree.

**parse_cdp(snapshot: dict) -> Node**
Parse CDP snapshot to Node tree.

### Preset Filters

**filter_visible(node) -> Node | None**
Remove all hidden elements.

**filter_semantic(node) -> Node | None**
Keep only semantic content.

**filter_all(node) -> Node | None**
Apply the visibility and semantic filters together.

### Visibility Filters

**filter_non_visible_tags(node)**
Remove script, style, head, meta, etc.

**filter_css_hidden(node)**
Remove display:none, visibility:hidden, opacity:0.

**filter_zero_dimensions(node)**
Remove zero-width/height elements.
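
As a rough illustration of a zero-dimension check, the sketch below uses a stand-in dataclass. Two assumptions are made that the README does not confirm: an element counts as zero-sized when either its width or its height is zero, and an element without a bounding box is not treated as zero-sized. `Box` and this `has_zero_size` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a bounding box (not domnode's BoundingBox)."""
    x: float
    y: float
    width: float
    height: float


def has_zero_size(box: Optional[Box]) -> bool:
    """Assumption: either dimension being zero counts; a missing box does not."""
    return box is not None and (box.width <= 0 or box.height <= 0)


print(has_zero_size(Box(10, 20, 100, 50)))
print(has_zero_size(Box(10, 20, 0, 50)))
print(has_zero_size(None))
```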

### Semantic Filters

**filter_attributes(node, keep=SEMANTIC_ATTRIBUTES)**
Keep only semantic attributes.

**filter_empty(node)**
Remove empty nodes.

**collapse_wrappers(node)**
Collapse single-child wrapper elements.

### Node Methods

**node.append(child)**
Add a child node or text.

**node.remove(child)**
Remove a child.

**node.is_visible()**
Check if element is visible.

**node.has_zero_size()**
Check if element has zero dimensions.

**node.get_text(separator='')**
Get all text content recursively.
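
Recursive text collection with a separator can be sketched on a nested tuple tree as follows. This shows the general technique only; how domnode orders or joins text internally is not stated here, and this `get_text` function is a hypothetical stand-in for the method.

```python
def get_text(node, separator=""):
    """Collect text from a nested (tag, children) tuple tree in document order."""
    parts = []

    def walk(item):
        if isinstance(item, str):
            parts.append(item)
        else:
            _tag, children = item
            for child in children:
                walk(child)

    walk(node)
    return separator.join(parts)


tree = ("div", ["Click here: ", ("button", ["Submit"])])
print(get_text(tree))        # Click here: Submit
print(get_text(tree, "|"))   # Click here: |Submit
```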

## Semantic Attributes

By default, `filter_attributes` keeps these attributes:

```python
SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected"
}
```

To keep a different set of attributes, pass your own set as `keep`:

```python
from domnode.filters.semantic import filter_attributes

custom_attrs = {"role", "href", "data-test-id"}
filtered = filter_attributes(node, keep=custom_attrs)
```
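Passing a custom set replaces the defaults. To extend them instead, a set union works, since the defaults are an ordinary Python set. The sketch below copies the default list so that it runs without domnode installed; whether `SEMANTIC_ATTRIBUTES` can be imported from `domnode.filters.semantic` is an assumption to verify, and the dict comprehension only stands in for what `filter_attributes` does to a single element.

```python
# Copied from the default list above so the sketch is self-contained.
SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected",
}

# Extend the defaults rather than replacing them.
extended = SEMANTIC_ATTRIBUTES | {"data-test-id"}

attrib = {"class": "btn", "role": "button", "data-test-id": "submit"}
kept = {name: value for name, value in attrib.items() if name in extended}
print(kept)  # {'role': 'button', 'data-test-id': 'submit'}
```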

## Use Cases

**Web Scraping**
Extract only visible, meaningful content from web pages.

**Browser Automation**
Filter DOM to only interactive elements for AI agents.

**LLM Context**
Reduce HTML to essential semantic structure for language models.

**Accessibility Testing**
Analyze semantic attributes and ARIA labels.

**Testing**
Build and manipulate DOM trees programmatically.

## Development

```bash
# Clone repository
git clone https://github.com/steve-z-wang/domnode.git
cd domnode

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=domnode --cov-report=html
```

## License

MIT

## Contributing

Contributions are welcome. Please submit a Pull Request.

## Related Projects

[domcontext](https://github.com/steve-z-wang/domcontext) - DOM to LLM context with markdown serialization

[natural-selector](https://github.com/steve-z-wang/natural-selector) - Natural language element selection with RAG

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "domnode",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "dom, html, parser, web-automation, browser, cdp, selenium, playwright, puppeteer, web-scraping",
    "author": null,
    "author_email": "Steve Wang <steve.z.wang@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/85/e9/16eb57a4b7c814ad9da4e207472bd33de1fbbf45d4d5a2f4ed18134d2d5f/domnode-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "DOM nodes with browser rendering data for web automation",
    "version": "0.2.0",
    "project_urls": {
        "Changelog": "https://github.com/steve-z-wang/domnode/releases",
        "Homepage": "https://github.com/steve-z-wang/domnode",
        "Issues": "https://github.com/steve-z-wang/domnode/issues",
        "Repository": "https://github.com/steve-z-wang/domnode"
    },
    "split_keywords": [
        "dom",
        " html",
        " parser",
        " web-automation",
        " browser",
        " cdp",
        " selenium",
        " playwright",
        " puppeteer",
        " web-scraping"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "02d642d2da646b860ef7e1b8f0ddf6c88d96d4b59841cd25c5d8c10613f6e44a",
                "md5": "fa3d6c6c47299a625feabafd0cb936a7",
                "sha256": "4b645edb338ef82ac3b33a945185783979663a5483dd70575da6fc74a7c395c6"
            },
            "downloads": -1,
            "filename": "domnode-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fa3d6c6c47299a625feabafd0cb936a7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 19758,
            "upload_time": "2025-10-19T23:58:27",
            "upload_time_iso_8601": "2025-10-19T23:58:27.368082Z",
            "url": "https://files.pythonhosted.org/packages/02/d6/42d2da646b860ef7e1b8f0ddf6c88d96d4b59841cd25c5d8c10613f6e44a/domnode-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "85e916eb57a4b7c814ad9da4e207472bd33de1fbbf45d4d5a2f4ed18134d2d5f",
                "md5": "a22710d55a35224b4dff08cc486f7ed3",
                "sha256": "033e0fdaeadca57e0325133e5c000eb7ef4434c23d148d8cf6497d33c6389fc0"
            },
            "downloads": -1,
            "filename": "domnode-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a22710d55a35224b4dff08cc486f7ed3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 20803,
            "upload_time": "2025-10-19T23:58:28",
            "upload_time_iso_8601": "2025-10-19T23:58:28.660653Z",
            "url": "https://files.pythonhosted.org/packages/85/e9/16eb57a4b7c814ad9da4e207472bd33de1fbbf45d4d5a2f4ed18134d2d5f/domnode-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-19 23:58:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "steve-z-wang",
    "github_project": "domnode",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "domnode"
}
        