Name | domselect |
Version | 0.0.6 |
home_page | None |
Summary | None |
upload_time | 2025-08-31 12:41:24 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | The MIT License (MIT)
Copyright (c) 2025, Gregory Petukhov
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
|
keywords |
html
dom
lxml
css
xpath
lexbor
parser
|
# Domselect
Domselect provides a high-level API for working with the structure of an HTML document using one of several HTML processing backends.
To work with an HTML document, you create a so-called selector object from the raw content of the document.
That selector is bound to the root node of the HTML structure. You can then call methods of this selector
to build other selectors bound to nested parts of the HTML structure.
A selector object extracts low-level nodes from the DOM constructed by the HTML processing backend and wraps them
in the high-level selector interface. If needed, you can always access the low-level node stored in the selector object.
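
The wrapping idea can be sketched with the standard library alone. The `MiniSelector` class below is hypothetical and is not domselect code: it uses `xml.etree.ElementTree` as a stand-in backend, with ElementTree's limited XPath as the query language, and only illustrates how a selector keeps a raw node and hands back new selectors of its own type.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


class MiniSelector:
    """Hypothetical wrapper around an ElementTree node (not domselect code)."""

    def __init__(self, node: ET.Element) -> None:
        # The low-level node stays accessible to the caller.
        self.node = node

    @classmethod
    def from_content(cls, content: str) -> "MiniSelector":
        # Parse the document and bind the selector to its root node.
        return cls(ET.fromstring(content))

    def find(self, query: str) -> List["MiniSelector"]:
        # Wrap every matching raw node in a selector of the same type.
        return [type(self)(n) for n in self.node.findall(query)]

    def first(self, query: str) -> Optional["MiniSelector"]:
        found = self.node.find(query)
        return None if found is None else type(self)(found)


root = MiniSelector.from_content("<div><p>one</p><p>two</p></div>")
texts = [p.node.text for p in root.find(".//p")]
print(texts)
```

Using `type(self)` rather than a fixed class name keeps the "same type in, same type out" behaviour if the sketch is subclassed for another backend.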
### Selector Backends
The Domselect library provides these selectors:
1. LexborSelector, powered by the [selectolax](https://github.com/rushter/selectolax)
   and [lexbor](https://github.com/lexbor/lexbor) libraries. The type of the raw node is `selectolax.lexbor.LexborNode`.
   The query language is CSS. The lexbor parser is 3-4 times faster than the lxml parser.
2. LxmlCssSelector, powered by the [lxml](https://github.com/lxml/lxml) library. The type of the raw node is `lxml.html.HtmlElement`.
   The query language is CSS.
3. LxmlXpathSelector, powered by the [lxml](https://github.com/lxml/lxml) library. The type of the raw node is `lxml.html.HtmlElement`.
   The query language is XPath.
### Selector Creating
To create a lexbor selector from the content of an HTML document:
```python
from domselect import LexborSelector
sel = LexborSelector.from_content("<div>test</div>")
```
You can also create a selector from a raw node:
```python
from domselect import LexborSelector
from selectolax.lexbor import LexborHTMLParser
node = LexborHTMLParser("<div>test</div>").css_first("div")
sel = LexborSelector(node)
```
The same goes for the lxml backend. Here is an example of creating lxml selectors from a raw node:
```python
from lxml.html import fromstring
from domselect import LxmlCssSelector, LxmlXpathSelector
node = fromstring("<div>test</div>")
sel = LxmlCssSelector(node)
# or
sel = LxmlXpathSelector(node)
```
### Node Traversal Methods
Each of these methods returns selectors of the same type as the selector it is called on, i.e. LexborSelector returns
other LexborSelectors and LxmlCssSelector returns other LxmlCssSelectors.
Method `find(query: str)` returns a list of selectors bound to the raw nodes found by the query.
Method `first(query: str)` returns `None` or a selector bound to the first raw node found by the query.
There are similar `find_raw` and `first_raw` methods, which work the same way but return low-level raw nodes,
i.e. they do not wrap the found nodes in the selector interface.
Method `parent()` returns a selector bound to the raw node that is the parent of the current selector's raw node.
Method `exists(query: str)` returns a boolean flag indicating whether any node has been found by the query.
Method `first_contains(query: str, pattern: str[, default: None])` returns a selector bound to the first raw node
found by the query whose text contains the value of the `pattern` parameter. If no such node is found,
`NodeNotFoundError` is raised. You can pass the optional `default=None` parameter to return `None` when
the node is not found.
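
The "raise unless a default is given" behaviour described for `first_contains` can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the function works on `xml.etree.ElementTree` nodes rather than selectors, `NodeNotFoundError` is a local stand-in class, `_UNSET` is a hypothetical sentinel, and `pattern` is treated as a plain substring.

```python
import xml.etree.ElementTree as ET


class NodeNotFoundError(Exception):
    """Local stand-in for the exception named above."""


# Sentinel that distinguishes "no default passed" from an explicit default=None.
_UNSET = object()


def first_contains(root, query, pattern, default=_UNSET):
    """Return the first node matching `query` whose full text contains `pattern`."""
    for node in root.findall(query):
        # itertext() yields the text of the node and of all its sub-nodes.
        if pattern in "".join(node.itertext()):
            return node
    if default is _UNSET:
        raise NodeNotFoundError(f"{query!r} containing {pattern!r}")
    return default


doc = ET.fromstring("<ul><li>alpha</li><li>beta <b>two</b></li></ul>")
hit = first_contains(doc, ".//li", "two")
missing = first_contains(doc, ".//li", "gamma", default=None)
print("".join(hit.itertext()), missing)
```

A sentinel is used instead of `default=None` in the signature because `None` is itself a value the caller may want returned.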
### Node Properties Methods
Method `attr(name: str[, default: None|str])` returns the content of the node's attribute with the given name.
If the node does not have such an attribute, `AttributeNotFoundError` is raised. If you pass the optional
`default: None|str` parameter, the method returns that `None` or `str` value when the attribute does not exist.
Method `text([strip: bool])` returns the text content of the current node and all its sub-nodes. By default,
spaces, tabs, and line breaks are stripped from the beginning and end of the returned text. You
can turn off stripping by passing the `strip=False` parameter.
Method `tag()` returns the tag name of the raw node to which the current selector is bound.
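
To make the "text of the node and all its sub-nodes" and `strip` semantics concrete, here is a minimal standard-library sketch. The `node_text` helper is hypothetical and operates on `xml.etree.ElementTree` nodes; it mirrors the description above and is not domselect's implementation.

```python
import xml.etree.ElementTree as ET


def node_text(node: ET.Element, strip: bool = True) -> str:
    """Concatenate the text of `node` and all its sub-nodes."""
    text = "".join(node.itertext())
    # str.strip() with no arguments removes leading and trailing whitespace,
    # including spaces, tabs, and line breaks.
    return text.strip() if strip else text


doc = ET.fromstring("<p>\n  Hello <b>world</b>\t\n</p>")
stripped = node_text(doc)
raw = node_text(doc, strip=False)
print(repr(stripped), repr(raw), doc.tag)
```

Only the outer whitespace is removed; whitespace between words inside the node is kept in both cases.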
### Traversal and Properties Methods
These methods combine two operations: searching for a node by query and doing something with the found node. They are helpful
if you want to get text or an attribute from a found node, but this node might not exist. Such methods allow you
to return a reasonable default value when the node is not found. In contrast, if you use a call chain like `first().text()`,
you will not be able to return a default value from the `text()` call, because `first()` will raise an exception if
the node is not found.
Method `first_attr(query: str, name: str[, default: None|str])` returns the content of the attribute with the given name from the node
found by the given query. If the node does not have such an attribute, `AttributeNotFoundError` is raised.
If no node is found by the given query, `NodeNotFoundError` is raised. If you pass the optional
`default: None|str` parameter, the method returns that `None` or `str` value instead of raising exceptions.
Method `first_text(query: str[, default: None|str, strip: bool])` returns the text content of the raw node (and all its
sub-nodes) found by the given query. If the node is not found, `NodeNotFoundError` is raised. Use the optional `default: None|str`
parameter to return `None` or `str` instead of raising exceptions. You can control text stripping with the `strip`
parameter (see the description of the `text()` method).
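
The value of combining lookup and extraction is that a single `default` covers both failure modes. The sketch below illustrates this for a `first_attr`-style helper using only the standard library. It is an assumption-laden illustration: the function works on `xml.etree.ElementTree` nodes, both exception classes are local stand-ins, and `_UNSET` is a hypothetical sentinel.

```python
import xml.etree.ElementTree as ET


class NodeNotFoundError(Exception):
    """Local stand-in: no node matched the query."""


class AttributeNotFoundError(Exception):
    """Local stand-in: the node exists but lacks the attribute."""


_UNSET = object()  # marks "no default passed"


def first_attr(root, query, name, default=_UNSET):
    """Return attribute `name` of the first node matching `query`."""
    node = root.find(query)
    if node is None:
        if default is _UNSET:
            raise NodeNotFoundError(query)
        return default
    if name not in node.attrib:
        if default is _UNSET:
            raise AttributeNotFoundError(name)
        return default
    return node.attrib[name]


doc = ET.fromstring('<div><a href="/home">Home</a><span>no link</span></div>')
href = first_attr(doc, ".//a", "href")
no_node = first_attr(doc, ".//img", "src", default=None)
no_attr = first_attr(doc, ".//span", "href", default="#")
print(href, no_node, no_attr)
```

With a two-step chain the caller would have to guard the lookup and the attribute access separately; here one keyword argument handles both.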
### Usage example
This code downloads a Telegram channel preview page and parses external links from it.
```python
from html import unescape
from urllib.request import urlopen

from domselect import LexborSelector


def main() -> None:
    # Download the public preview page of the channel.
    content = urlopen("https://t.me/s/centralbank_russia").read()
    sel = LexborSelector.from_content(content)
    for msg_node in sel.find(".tgme_widget_message_wrap"):
        # Fall back to None if the message has no date element.
        msg_date = msg_node.first_attr(
            ".tgme_widget_message_date time", "datetime", default=None
        )
        for text_node in msg_node.find(".tgme_widget_message_text"):
            print("Message by {}".format(msg_date))
            for link_node in text_node.find("a[href]"):
                url = link_node.attr("href")
                # Keep only absolute (external) links.
                if url.startswith("http"):
                    print(" - {}".format(unescape(url)))


if __name__ == "__main__":
    main()
```
Raw data
{
"_id": null,
"home_page": null,
"name": "domselect",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "html, dom, lxml, css, xpath, lexbor, parser",
"author": null,
"author_email": "Gregory Petukhov <lorien@lorien.name>",
"download_url": "https://files.pythonhosted.org/packages/b1/f6/82236af357002e593d8590d7b38954dcd3738db7d36f24e3eb5e5043486e/domselect-0.0.6.tar.gz",
"platform": null,
"description": "# Domselect\n\nDomselect provides high-level API to work with structure of HTML document using one of HTML processing backend.\nTo work with HTML document you have to create so-called selector object from raw content of HTML document.\nThat selector will be bound to the root node of HTML structure. Then you can call different methods of these selector\nto build other selectors bound to nested parts of HTML structure.\n\nSelector object extracts low-level nodes from DOM constructed by HTML processing backend and wraps them\ninto high-level selector interface. If you need, you can always access low-level node stored in selector object.\n\n### Selector Backends\n\nDomselect library provides these selectors:\n\n1. LexborSelector powered by [selectolax](https://github.com/rushter/selectolax)\n and [lexbor](https://github.com/lexbor/lexbor) libraries. The type of raw node is `selectolax.lexbor.LexborNode`.\n Query language is CSS. Lexbor parser is x3-x4 times faster than lxml parser.\n\n2. LxmlCssSelector powered by [lxml](https://github.com/lxml/lxml) library. The type of raw node is `lxml.html.HtmlElement`.\n Query language is CSS.\n\n2. LxmlXpathSelector powered by [lxml](https://github.com/lxml/lxml) library. The type of raw node is `lxml.html.HtmlElement`.\n Query language is XPATH.\n\n### Selector Creating\n\nTo create lexbor selector from content of HTML document:\n\n```python\nfrom domselect import LexborSelector\nsel = LexborSelector.from_content(\"<div>test</div>\")\n```\n\nAlso you can create selector from raw node:\n\n```python\nfrom domselect import LexborSelector\nfrom selectolax.lexbor import LexborHTMLParser\nnode = LexborHTMLParser(\"<div>test</div>\").css_first(\"div\")\nsel = LexborSelector(node)\n```\n\nSame goes for lxml backend. 
Here is an example of creating lxml selector from raw node:\n\n```python\nfrom lxml.html import fromstring\nfrom domselect import LxmlCssSelector, LxmlXpathSelector\nnode = fromstring(\"<div>test</div>\")\nsel = LxmlCssSelector(node)\n# or\nsel = LxmlXpathSelector(node)\n```\n\n### Node Traversal Methods\n\nEach of these methods return other selectors of same type i.e. LexborSelector return\nother LexborSelectors and LxmlCssSelector returns other LxmlCssSelectors.\n\nMethod `find(query: str)` returns list of selectors bound to raw nodes found by query.\n\nMethod `first(query: str)` returns `None` of selector bound to first raw node found by query.\n\nThere is similar `find_raw` and `first_raw` methods which works in same way but returns low-level raw nodes\ni.e. they do not wrap found nodes into selector interface.\n\nMethod `parent()` returns selector bound to raw node which is parent to raw node of current selector.\n\nMethod `exists(query: str)` returns boolean flag indicates if any node has been found by query.\n\nMethod `first_contains(query: str, pattern: str[, default: None])` returns selector bound to first raw node\nfound by query and which contains text as `pattern` parameter. If node is not found then\n`NodeNotFoundError` is raised. You can pass `default=None` optional parameter to return `None` in case\nof node is not found.\n\n\n### Node Properties Methods\n\nMethod `attr(name: str[, default: None|str])` returns content of node's attribute of given name.\nIf node does not have such attribute the `AttributeNotFoundError` is raised. If you pass optional\n`default: None|str` parameter the method will return `None` or `str` if attribute does not exists.\n\nMethod `text([strip: bool])` returns text content of current node and all its sub-nodes. By default\nreturned text is stripped at beginning and ending from whitespaces, tabulations and line-breaks. 
You\ncan turn off striping by passing `strip=False` parameter.\n\nMethod `tag()` returns tag name of raw node to which current selector is bound.\n\n### Traversal and Properties Methods\n\nThese methods combine two operations: search node by query and do something on found node. They are helful\nif you want to get text or attribute from found node, but this node might not exist. Such methods allows you\nto return reasonable default value in case node is not found. On contrary, if you use call chain like `first().text()`\nthen you'll not be able to return default value from `text()` call because `first()` will raise Exception if\nnode is not found.\n\nMethod `first_attr(query: str, name: str[, default: None|str])` returns content of attribute of given name of node\nfound by given query. If node does not have such attribute the `AttributeNotFoundError` is raised.\nIf node is not found by given query the `NodeNotFoundError` is raised. If you pass optional\n`default: None|str` parameter the method will return `None` or `str` instead of rasing exceptions.\n\nMethod `first_text(query: str[, default: None|str, strip: bool])` returns text content of raw node (and all its\nsub-nodes) found by given query. If node is not found the `NodeNotFoundError` is raised. Use optional `default: None|str`\nparametere to return `None` or `str` instead of raising exceptions. 
You can control text stripping with `strip`\nparameter (see description of `text()` method).\n\n### Usage example\n\nThis code downloads telegram channel preview page and parse external links from it.\n\n```python\nfrom html import unescape\nfrom urllib.request import urlopen\n\nfrom domselect import LexborSelector\n\n\ndef main() -> None:\n content = urlopen(\"https://t.me/s/centralbank_russia\").read()\n sel = LexborSelector.from_content(content)\n for msg_node in sel.find(\".tgme_widget_message_wrap\"):\n msg_date = msg_node.first_attr(\n \".tgme_widget_message_date time\", \"datetime\", default=None\n )\n for text_node in msg_node.find(\".tgme_widget_message_text\"):\n print(\"Message by {}\".format(msg_date))\n for link_node in text_node.find(\"a[href]\"):\n url = link_node.attr(\"href\")\n if url.startswith(\"http\"):\n print(\" - {}\".format(unescape(url)))\n\n\nif __name__ == \"__main__\":\n main()\n```\n",
"bugtrack_url": null,
"license": "The MIT License (MIT)\n \n Copyright (c) 2025, Gregory Petukhov\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in\n all copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n THE SOFTWARE.\n ",
"summary": null,
"version": "0.0.6",
"project_urls": null,
"split_keywords": [
"html",
" dom",
" lxml",
" css",
" xpath",
" lexbor",
" parser"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "b1f682236af357002e593d8590d7b38954dcd3738db7d36f24e3eb5e5043486e",
"md5": "4a5ccffea30dd006dc941ba2a3496e2b",
"sha256": "514a1817ee26c481392759c3d4f1ebaef5b1b6804a75d83aa9592ca2f214bcab"
},
"downloads": -1,
"filename": "domselect-0.0.6.tar.gz",
"has_sig": false,
"md5_digest": "4a5ccffea30dd006dc941ba2a3496e2b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 11713,
"upload_time": "2025-08-31T12:41:24",
"upload_time_iso_8601": "2025-08-31T12:41:24.630624Z",
"url": "https://files.pythonhosted.org/packages/b1/f6/82236af357002e593d8590d7b38954dcd3738db7d36f24e3eb5e5043486e/domselect-0.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-31 12:41:24",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "domselect"
}