# Html Universal Identifier
Html Universal Identifier is an alpha version of an application designed for identifying server-side HTML parsers. This package provides a way to determine which HTML, SVG, and MathML tags are allowed, helps to find parser features (incorrectly implemented tags), and can also help to guess which parser is used on the backend.
The library relies primarily on the fact that parsers handle HTML incorrectly in different ways. Two classic examples:
- `<form><form>text</form></form>` should be transformed to `<form>text</form>`
- `<h1><h2>text</h2></h1>` should be transformed to `<h1></h1><h2>text</h2>`
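
To make the idea concrete, here is a minimal sketch of such a check. It is not the library's code: `QUIRK_CASES`, `find_deviations`, and the two stand-in handlers are hypothetical names used only for illustration. The expected outputs follow the HTML parsing algorithm, where a nested `<form>` start tag is ignored and a heading start tag inside an open heading closes that heading first.

```python
# Hypothetical fingerprint cases: payload -> output of a spec-conformant parser.
QUIRK_CASES = {
    "<form><form>text</form></form>": "<form>text</form>",
    "<h1><h2>text</h2></h1>": "<h1></h1><h2>text</h2>",
}


def find_deviations(handler):
    """Return the cases where the handler's output differs from the expected one."""
    deviations = []
    for payload, expected in QUIRK_CASES.items():
        output = handler(payload)
        if output != expected:
            deviations.append(
                {"payload": payload, "output": output, "expected": expected}
            )
    return deviations


def passthrough_handler(payload):
    # Stand-in for a backend that returns the markup unchanged.
    return payload


def conformant_handler(payload):
    # Stand-in for a backend whose parser follows the specification.
    return QUIRK_CASES[payload]


print(len(find_deviations(passthrough_handler)))  # both cases deviate
print(find_deviations(conformant_handler))        # no deviations
```

Each deviation is a small piece of evidence about which parser sits behind the endpoint; a single case rarely identifies a parser, but a set of them can.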
There are several reasons why you don't want to rely entirely on allowed tags:
- It won't help you determine which parser your custom sanitization is based on
- Allowed tags can be changed, so they are not a stable fingerprint of the parser
## Features
- Identify allowed HTML, SVG, and MathML tags.
- Identify allowed attributes.
- Identify incorrect parsing.
- Use a customizable handler function to process HTML payloads.
- Load and compare results against predefined Parser outputs.
## Installation
To install the package, use pip:
```
pip install hui
```
## Usage
Here is a basic example of how to use the `Identifier` class from the package:
```python
import requests

from hui.identify import Identifier


def handler(payload):
    # Send the payload to the sanitizer endpoint and return the sanitized HTML.
    return requests.get("http://localhost:3005/sanitize", params={"html": payload}).text


a = Identifier(handler=handler, buffer_enabled=False, buffer_limit=64, debug_mode=False)
print(a.identify())
# run all
# Example output
# [[1.0, 27, 'JS_SANITIZE_HTML'], [0.8148148148148148, 22, 'PYTHON_HTML_SANITIZE'], ...
print(a.check_attr_allowed("href",tag="a"))
# True or False
print(a.INCORRECT_PARSED)
# Example output
# [{'output': '<h5><h6>govnoed</h6></h5>', 'expected': '<h5></h5><h6>$text</h6>'}, .. ]
print(a.ALLOWED_TAGS)
# print allowed tags
print(a.ATTRIBUTES)
# Prints ATTRIBUTES info
print(a.DEPTH_LIMITS)
# Example Outputs:
# (514, 'No max tags limit')
# (512, 'Flattening')
# (255, 'Removing')
```
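
The handler is an ordinary callable that takes a payload string and returns the resulting HTML, so it does not have to make an HTTP request. For offline experiments, a sketch like the following could stand in for a real backend. `toy_sanitizer` is a hypothetical, deliberately naive function: it is neither a safe sanitizer nor part of `hui`.

```python
import re

# Deliberately naive: removes <script> elements and inline event-handler attributes.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
EVENT_ATTR_RE = re.compile(
    r"""\son\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)


def toy_sanitizer(html):
    html = SCRIPT_RE.sub("", html)
    return EVENT_ATTR_RE.sub("", html)


def handler(payload):
    # Same contract as the HTTP-based handler: payload in, HTML string out.
    return toy_sanitizer(payload)


print(handler('<p onclick="x()">hi</p><script>alert(1)</script>'))
```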
## Current Parsers
The following parsers are currently supported in the project:
- **DOMPurify with JSDOM (JS)**
- **JSDOM (JS)**
- **sanitize_html (JS)**
- **htmlparser2 (JS)**
- **JSXSS (JS)**
- **html (python)**
- **lxml (python)**
- **html_sanitizer (python)**
- **net/html (go)**
- **bluemonday (go)**
If you believe a new parser/sanitizer should be added, please create an issue, and I will be happy to include it.
## Identifier Class
The `Identifier` class is the core of this package. It is responsible for identifying allowed HTML, SVG, and MathML tags based on a handler function that processes HTML payloads.
The class also maintains an `INCORRECT_PARSED` list, which contains payloads that were incorrectly parsed by the handler. For example, this may include cases where the parser fails to remove nested forms and similar issues.
### Constructor Parameters
- **`handler`**: A function that takes a payload and returns an HTML response. Example:
  ```python
  lambda payload: requests.get("http://localhost:3000", params={"payload": payload}).text
  ```
- **`buffer_enabled`** (optional, default=False): A boolean flag to enable or disable buffering of payloads before sending them to the handler. By default, buffering is disabled, as it can sometimes lead to incorrect results. For example, some sanitizers may simply remove all input if it contains a dangerous tag. Use buffering only if the server you are interacting with has strict rate limits.
- **`buffer_delimeter`** (optional, default=`<div>TEXTTEXT</div>`): A string used to delimit buffered payloads when sending them to the handler.
- **`buffer_limit`** (optional, default=32): An integer that specifies the maximum number of payloads to buffer before sending them to the handler.
- **`template_vars`** (optional, default=None): A dictionary of template variables to use for substitution in payloads.
- **`debug_mode`** (optional, default=False): A boolean flag to enable or disable debug logging.
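
The trade-off behind `buffer_enabled` can be illustrated with a small sketch. It assumes that buffering joins several payloads with the delimiter, sends them in one request, and splits the response on the same delimiter. This is an assumption about the general technique, not a description of how `hui` implements it, and `call_buffered` and both stand-in handlers are hypothetical.

```python
DELIMITER = "<div>TEXTTEXT</div>"  # the documented default delimiter


def call_buffered(handler, payloads, limit=32, delimiter=DELIMITER):
    """Send payloads in batches of `limit` and split each response back apart."""
    results = []
    for start in range(0, len(payloads), limit):
        batch = payloads[start:start + limit]
        response = handler(delimiter.join(batch))
        parts = response.split(delimiter)
        if len(parts) != len(batch):
            # The backend altered or dropped the delimiter, so results cannot be
            # attributed to individual payloads; fall back to one call per payload.
            parts = [handler(p) for p in batch]
        results.extend(parts)
    return results


def echo_handler(html):
    # Stand-in backend that returns its input unchanged.
    return html


def strict_handler(html):
    # Stand-in backend that rejects the whole input if it sees a <script> tag.
    return "" if "<script" in html else html


print(call_buffered(echo_handler, ["<b>a</b>", "<i>b</i>", "<u>c</u>"], limit=2))
print(call_buffered(strict_handler, ["<b>a</b>", "<script>x</script>"]))
```

The `strict_handler` case shows why buffering can give wrong results: one dangerous payload wipes out the whole batch, so the sketch has to fall back to individual calls to learn anything about the harmless payload.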
### Methods
- **`check_allowed_tags()`**: Checks and populates the `ALLOWED_TAGS` dictionary with allowed tags for HTML, SVG, and MathML.
- **`call_handler(template_payloads: list[str])`**: Calls the handler function with a list of template payloads and returns the processed results.
- **`check_namespace(namespace: str)`**: Checks for allowed tags in the specified namespace (SVG or MathML).
- **`identify()`**: Identifies the best matching Parser based on generated payloads and returns a list of matches.
- **`check_allowed_attrs()`**: Checks and validates allowed attributes for HTML tags.
### identify() Method
The `identify()` method checks if allowed tags have been determined. If not, it calls `check_allowed_tags()` to populate the `ALLOWED_TAGS`. It then loads a list of generated payloads from a JSON file and calls the handler for each payload. Finally, it compares the results against all JSON files in the `results_parsers` directory to count matches and returns a sorted list of results.
- **Returns**: A list of tuples, each containing:
  - The match ratio (float)
  - The number of matches (int)
  - The name of the Parser (str)
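
The ranking step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `rank_parsers` and the sample data are hypothetical, and the ratio is assumed to be the number of matching outputs divided by the number of payloads. That assumption is consistent with the example output in the Usage section, where 22 matches out of 27 gives 0.8148....

```python
def rank_parsers(observed, references):
    """Rank reference parsers by how many of their outputs match the observed ones.

    observed:   list of outputs returned by the handler, one per payload
    references: dict mapping parser name -> list of expected outputs (same order)
    """
    ranking = []
    for name, expected in references.items():
        matches = sum(1 for got, want in zip(observed, expected) if got == want)
        ratio = matches / len(observed) if observed else 0.0
        ranking.append((ratio, matches, name))
    # Best match first.
    return sorted(ranking, reverse=True)


# Hypothetical data: three payload outputs and two made-up reference parsers.
observed = ["<form>text</form>", "<h1></h1><h2>text</h2>", "<b>x</b>"]
references = {
    "PARSER_A": ["<form>text</form>", "<h1></h1><h2>text</h2>", "<b>x</b>"],
    "PARSER_B": ["<form><form>text</form></form>", "<h1></h1><h2>text</h2>", "<b>x</b>"],
}
print(rank_parsers(observed, references))
```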
### Attributes
- **`ATTRIBUTES`**: A dictionary that holds information about allowed attributes for HTML tags, including:
  - `custom_attribute`: Indicates if custom attributes are allowed.
  - `event_attributes_blocked`: Indicates if event attributes are directly blocked.
  - `data_attributes`: Indicates if data attributes are allowed.
  - `attrs_allowed`: A nested dictionary categorizing allowed attributes into global, event, and tag-specific attributes.
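
A single attribute probe could look like the sketch below. It assumes that an attribute counts as allowed when it is still present on the tag after the handler has processed the payload. `AttrCollector`, `attr_survives`, and the stand-in handlers are hypothetical and use only the standard library; this is not how `hui` is known to perform the check.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute name) pairs seen in a document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            self.seen.add((tag, name))


def attr_survives(handler, attr, tag="a", value="x"):
    """Return True if `attr` is still present on `tag` after the handler ran."""
    collector = AttrCollector()
    collector.feed(handler(f'<{tag} {attr}="{value}">text</{tag}>'))
    collector.close()
    return (tag, attr) in collector.seen


def passthrough(html):
    # Stand-in backend that returns its input unchanged.
    return html


def drop_onclick(html):
    # Stand-in backend that strips the probe's onclick attribute.
    return html.replace(' onclick="x"', "")


print(attr_survives(passthrough, "href"))      # regular attribute
print(attr_survives(passthrough, "data-id"))   # data attribute
print(attr_survives(drop_onclick, "onclick"))  # event attribute
```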
### Allowed Tags
- **`ALLOWED_TAGS`**: A dictionary that holds information about allowed tags for HTML, SVG, and MathML, including:
  - `html`: A list of allowed HTML tags.
  - `svg`: A list of allowed SVG tags.
  - `math`: A list of allowed MathML tags.
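
As a rough sketch of how a dictionary of this shape could be built, the code below wraps each candidate tag in its namespace root (`<svg>` or `<math>`) where needed and checks whether the tag is still present in the handler's output. `build_probe`, `probe_tags`, and `no_script` are hypothetical names, and the probing strategy is an assumption rather than the library's documented behavior.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def build_probe(tag, namespace="html"):
    element = f"<{tag}>text</{tag}>"
    if namespace == "html":
        return element
    # SVG and MathML candidates are wrapped in their namespace root element.
    return f"<{namespace}>{element}</{namespace}>"


def probe_tags(handler, candidates):
    """candidates maps 'html' / 'svg' / 'math' to lists of tag names to try."""
    allowed = {}
    for namespace, tags in candidates.items():
        allowed[namespace] = []
        for tag in tags:
            collector = TagCollector()
            collector.feed(handler(build_probe(tag, namespace)))
            collector.close()
            # HTMLParser lowercases tag names, so compare in lowercase.
            if tag.lower() in collector.tags:
                allowed[namespace].append(tag)
    return allowed


def no_script(html):
    # Stand-in backend that strips <script> start and end tags.
    return re.sub(r"</?script\b[^>]*>", "", html, flags=re.IGNORECASE)


print(probe_tags(no_script, {"html": ["b", "script"], "svg": ["circle", "script"], "math": ["mi"]}))
```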
### Incorrectly Parsed Tags
- **`INCORRECT_PARSED`**: A dictionary that holds information about incorrectly parsed tags for HTML, SVG, and MathML, including:
  - `html`: A list of incorrectly parsed HTML tags.
  - `svg`: A list of incorrectly parsed SVG tags.
  - `math`: A list of incorrectly parsed MathML tags.
### DEPTH_LIMITS
- **`DEPTH_LIMITS`**: A tuple that holds information about the depth limits of HTML tags, including:
  - `max_depth`: The maximum depth of HTML tags.
  - `limit_strategy`: The strategy used to handle tags exceeding the depth limit, which can be 'No max tags limit', 'Flattening', or 'Removing'.
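
One way to probe a depth limit is to send deeply nested markup and measure how deep the returned markup still is. The sketch below shows only that measurement step; `nested_payload`, `measure_depth`, and `DepthMeter` are hypothetical helpers built on the standard library, and the sketch makes no claim about how `hui` distinguishes the 'Flattening' and 'Removing' strategies.

```python
from html.parser import HTMLParser


class DepthMeter(HTMLParser):
    """Track the deepest nesting level of elements in a document.

    Assumes non-void elements that are explicitly closed, as in the probe below.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)


def nested_payload(depth, tag="div"):
    """Build `depth` nested elements around a text node."""
    return f"<{tag}>" * depth + "text" + f"</{tag}>" * depth


def measure_depth(html):
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


def capped_handler(html):
    # Stand-in backend that never returns more than 3 levels of nesting.
    return nested_payload(min(measure_depth(html), 3))


print(measure_depth(nested_payload(5)))
print(measure_depth(capped_handler(nested_payload(600))))
```

Comparing the depth sent with the depth returned shows whether a limit exists; what happened to the content beyond the limit then has to be checked separately.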
Raw data
{
"_id": null,
"home_page": "https://github.com/slonser/hui",
"name": "hui",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "HTML, hui, HTML GUESSER, HTML identifier, XSS, bugbounty",
"author": "Slonser",
"author_email": "slonser@slonser.info",
"download_url": "https://files.pythonhosted.org/packages/b9/4a/889e01d1bfdb03a22511c43dfa353bbe236afbc9e71d27e6cac3628563cd/hui-0.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": null,
"version": "0.2.2",
"project_urls": {
"Download": "https://github.com/Slonser/hui/archive/v_01.tar.gz",
"Homepage": "https://github.com/slonser/hui"
},
"split_keywords": [
"html",
" hui",
" html guesser",
" html identifier",
" xss",
" bugbounty"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "eb15f48b8818f3969917f915484c99384dca2b18eefd69ff3e69f0ad18469c39",
"md5": "40744ac9d8b69725b9c73c8ae745ce00",
"sha256": "587247cbf07be33cd9ffedc79a11ce2911c8eeff5e1a961a6eecd8b8198a744a"
},
"downloads": -1,
"filename": "hui-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "40744ac9d8b69725b9c73c8ae745ce00",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 25883,
"upload_time": "2024-12-15T10:55:07",
"upload_time_iso_8601": "2024-12-15T10:55:07.952308Z",
"url": "https://files.pythonhosted.org/packages/eb/15/f48b8818f3969917f915484c99384dca2b18eefd69ff3e69f0ad18469c39/hui-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "b94a889e01d1bfdb03a22511c43dfa353bbe236afbc9e71d27e6cac3628563cd",
"md5": "425a32dcddb1412fc53606471046abd1",
"sha256": "50afc30537d08f00ce9c12542691dd9054ec7eeb7e37a4c11b28ee30115f7a7d"
},
"downloads": -1,
"filename": "hui-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "425a32dcddb1412fc53606471046abd1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 20148,
"upload_time": "2024-12-15T10:55:11",
"upload_time_iso_8601": "2024-12-15T10:55:11.255325Z",
"url": "https://files.pythonhosted.org/packages/b9/4a/889e01d1bfdb03a22511c43dfa353bbe236afbc9e71d27e6cac3628563cd/hui-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-15 10:55:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "slonser",
"github_project": "hui",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "hui"
}