pyxml2xpath

- Name: pyxml2xpath
- Version: 0.3.4
- Summary: Generate xpath expressions from XML document.
- Upload time: 2024-12-21 18:10:33
- Requires Python: >=3.9
- License: GPL-3.0
- Keywords: xpath, xml

# pyxml2xpath
Parse an XML document with [lxml](https://lxml.de/) and build XPath expressions corresponding to its structure.

Table of contents
=================

* [Description](#description)
* [Installation](#installation)
* [Command line usage](#command-line-usage)
* [Module usage](#module-usage)
* [Method parse(...)](#method-parse)
* [Print result modes](#print-result-modes)
* [HTML support](#html-support)
* [Relative expressions](#relative-expressions)
* [Unqualified vs. Qualified](#unqualified-vs-qualified)
* [Initial Xpath Examples](#initial-xpath-examples)
* [Performance](#performance)
* [Known issues](#known-issues)
* [Testing](#testing)

## Description
Iterates over the elements of an XML document and builds an XPath expression for each one, starting at the root element by default or at an element selected by an XPath expression.  
XPath expressions returned by `lxml` are converted to fully qualified ones, taking namespaces into account if they exist.  

A source expression may mix qualified and unqualified parts with unknown element names:  
`/soapenv:Envelope/soapenv:Body/*/*[5]/*[2]`  

A fully qualified expression is returned:  
`/soapenv:Envelope/soapenv:Body/ns98:requestMessage/ns98:item/ns98:quantity`  

Supported node types on path: elements, comments and processing instructions

`//* | //processing-instruction() | //comment()`  

Text nodes can be used in predicates but not as a step on the path:

|  Xpath                                | Supported |
| :------------------------------------ |:---------:|
| //element[text() = "some text"]       | Yes       |
| //element/text()                      | No        |

It can be used as a [command line utility](#command-line-usage) or as a [module](#module-usage).

> A spin off of [xml2xpath Bash script](https://github.com/mluis7/xml2xpath). Both projects rely on [libxml2](https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home) implementation.

## Installation
Install from PyPI:

`pip3.9 install pyxml2xpath`

Or build from the source repository:

```bash
git clone https://github.com/mluis7/pyxml2xpath.git
cd pyxml2xpath
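# requires the `build` package: python3.9 -m pip install build
python3.9 -m build
# the wheel file name carries the version that was built
python3.9 -m pip install dist/pyxml2xpath-*-py3-none-any.whl --upgrade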
```

Alternatively, install straight from the repository without cloning it yourself:

```
pip3.9 install git+https://github.com/mluis7/pyxml2xpath.git
```

## Command line usage
`pyxml2xpath <file path> [mode] [initial xpath expression] [with count] [max elements] [without banners]`

```bash
pyxml2xpath tests/resources/soap.xml

pyxml2xpath tests/resources/HL7.xml '' '//*[local-name()= "act"]'

pyxml2xpath tests/resources/HL7.xml 'values' '//*[local-name()= "act"]'

# Positional arguments of the next command:
# mode                            : all
# starting at xpath               : none
# count elements                  : False
# limit elements                  : 11
# do not show banner (just xpaths): true

pyxml2xpath ~/tmp/test.html all none none 11 true
```


## Module usage

```python
from xml2xpath import xml2xpath
tree, nsmap, xmap = xml2xpath.parse('tests/resources/wiki.xml')
xml2xpath.print_xpath(xmap, 'all')
```

If an element tree created with `lxml` is available, use it and avoid double parsing the file.

```python
from lxml import etree
from xml2xpath import xml2xpath

doc = etree.parse("tests/resources/wiki.xml")
tree, nsmap, xmap = xml2xpath.parse(file=None, itree=doc)
xml2xpath.print_xpath(xmap, 'all')
```

Result

```
Found xpath for elements

/ns98:feed
/ns98:feed/ns98:id
/ns98:feed/ns98:title
/ns98:feed/ns98:link
...

Found xpath for attributes

/ns98:feed/@{http://www.w3.org/XML/1998/namespace}lang
/ns98:feed/ns98:link/@rel
/ns98:feed/ns98:link/@type
/ns98:feed/ns98:link/@href
...

Found  32 xpath expressions for elements
Found  19 xpath expressions for attributes

```

The search can start at an element other than the root by passing an XPath expression:

```python
xmap = xml2xpath.parse('tests/resources/wiki.xml', xpath_base='//*[local-name() = "author"]')[2]
```

### Method parse(...)
Signature: `parse(file: str, *, itree: etree._ElementTree = None, xpath_base: str = '//*', with_count: bool = WITH_COUNT, max_items: int = MAX_ITEMS)`

Parses the given XML file or `lxml` tree, finds the XPath expressions in it and returns, in this order:

- The ElementTree, for further use.
- The sanitized namespaces map (no `None` keys).
- A dictionary keyed by unqualified XPath. Each value is a tuple holding the qualified XPath, the count of elements found with it (optional) and a list with the attribute names of those elements.  
  Returns `None` if an error occurred.

```python
xmap = {
    "/some/xpath/*[1]": (
        "/some/xpath/ns:ele1", 
        1, 
        ["id", "class"] 
     ),
    "/some/other/xpath/*[3]": ( 
        "/some/other/xpath/ns:other", 
        1, 
        ["attr1", "attr2"] 
     ),
}
```
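
The mapping above can be consumed with ordinary dictionary iteration. The following is a minimal sketch, assuming `xmap` has exactly the shape shown above; the helper name `attribute_xpaths` is hypothetical and is not part of `pyxml2xpath`.

```python
def attribute_xpaths(xmap):
    """Yield one '<qualified xpath>/@<attribute>' string per attribute."""
    for qualified, _count, attributes in xmap.values():
        for name in attributes:
            yield f"{qualified}/@{name}"


# Sample data copied from the structure shown above.
xmap = {
    "/some/xpath/*[1]": ("/some/xpath/ns:ele1", 1, ["id", "class"]),
    "/some/other/xpath/*[3]": ("/some/other/xpath/ns:other", 1, ["attr1", "attr2"]),
}

# Unpack each value tuple next to its unqualified key.
for unqualified, (qualified, count, attributes) in xmap.items():
    print(unqualified, "->", qualified, count, attributes)

print(list(attribute_xpaths(xmap)))
```

Unpacking the tuple by position keeps the sketch close to the indexing used elsewhere in this document, such as `xmap[x1][2]` for the attribute list.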

The namespaces dictionary adds a prefix for default namespaces.
If there is more than one default namespace, the prefix is incremented:
`ns98`, `ns99` and so on. Try it with the test file `tests/resources/soap.xml`.
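
The prefixing rule can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: the function name `prefix_default_namespaces`, its `start` parameter and the example URIs are assumptions made for this sketch.

```python
from itertools import count


def prefix_default_namespaces(pairs, start=98):
    """Build a namespace map without None keys from (prefix, uri) pairs.

    Default namespaces (prefix None) get generated prefixes ns98, ns99, ...
    """
    numbers = count(start)
    nsmap = {}
    seen = set()
    for prefix, uri in pairs:
        if uri in seen:
            continue  # keep the first prefix assigned to a URI
        seen.add(uri)
        if prefix is None:
            prefix = f"ns{next(numbers)}"
        nsmap[prefix] = uri
    return nsmap


# Hypothetical declarations collected from a document with two default namespaces.
pairs = [
    ("soapenv", "http://schemas.xmlsoap.org/soap/envelope/"),
    (None, "urn:example:first"),
    (None, "urn:example:second"),
]
print(prefix_default_namespaces(pairs))
```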

**Parameters**

- `file: str` file path string.
- `itree: lxml.etree._ElementTree` ElementTree object.
- `xpath_base: str` XPath expression to start searching from. Default: `'//*'`
- `with_count: bool` include the count of elements found with each expression. Default: False
- `max_items: int` limit the number of parsed elements. Default: 100000
        
## Print result modes
Print XPath expressions and validate them by the count of elements found with each one.  

`mode` argument values (optional):

- `path`  : print elements xpath expressions (default)  
- `all`   : also print attribute xpath expressions  
- `raw`   : print unqualified xpath and found values (tuple)  
- `values`: print tuple of found values only  

`pyxml2xpath ~/tmp/soap-ws-oasis.xml 'all'`

or if used as module:

`xml2xpath.print_xpath(xmap, 'all')`


## HTML support
HTML has limited support, provided the document or HTML fragment is well formed.
Make sure the HTML fragment is enclosed in a single element.
If it is not, wrap it in a fake root element: `<root>some_html_fragment</root>`.
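
The effect of the fake root element can be seen with a small sketch. It uses the standard library XML parser instead of `lxml` so that it runs without extra packages, and the fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two top-level elements: a well-formed fragment, but not a document.
fragment = "<p>first</p><p>second</p>"

try:
    ET.fromstring(fragment)
    parsed_without_wrapper = True
except ET.ParseError:
    # The parser rejects content after the first top-level element.
    parsed_without_wrapper = False

# Wrapping the fragment in a single fake root makes it parseable.
root = ET.fromstring(f"<root>{fragment}</root>")
print(parsed_without_wrapper, [child.tag for child in root])
```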

See the examples in the tests:

```
test_01.TestPyXml2Xpath01.test_parse_html
test_01.TestPyXml2Xpath01.test_fromstring_html_fragment
```

```python
from lxml import html
from xml2xpath import xml2xpath

filepath = 'tests/resources/html5-small.html.xml'
hdoc = html.parse(filepath)
xpath_base = '//*[@id="math"]'

xmap = xml2xpath.parse(None, itree=hdoc, xpath_base=xpath_base)[2]
```

or on the command line:

```
pyxml2xpath tests/resources/html5-small.html.xml 'all' '//*[@id="math"]'
```

## Relative expressions
Build relative expressions by passing the `xpath_base` keyword argument. The XPath of the parent has to be removed from the results, so `xpath_base` should select the parent as well as the element and its descendants:

`xpath_base = '//*[@id="math"]/parent::* | //*[@id="math"]/descendant-or-self::*'`

Example:

```python
from lxml import html
from xml2xpath import xml2xpath

filepath = 'tests/resources/html5-small.html.xml'
hdoc = html.parse(filepath)

needle = 'math'
xpath_base = f'//*[@id="{needle}"]/parent::* | //*[@id="{needle}"]/descendant-or-self::*'
xmap = xml2xpath.parse(None, itree=hdoc, xpath_base=xpath_base)[2]

rel_xpath = []
xiter = iter(xmap)
# parent xpath
x0 = next(xiter)
# base element xpath
x1 = next(xiter)
# get base element attributes and build a predicate with first
x1a = ''
if len(xmap[x1][2]) > 0:
    x1a = f"[@{xmap[x1][2][0]}='{needle}']"
# base element relative xpath (/html/body/math -> //math)
x1f = x1.replace(x0, '/')
# remove numeric indexes if any (div[1] -> div)
x1f = x1f.split('[', 1)[0]
# add first attribute as predicate
x1f += x1a
rel_xpath.append(x1f)

# children relative xpath
for xs in list(xmap.keys())[2:]:
    rel_xpath.append(xs.replace(x1, x1f))

for x in rel_xpath:
    print(x)
```

Output

```
//math[@id='math']
//math[@id='math']/mrow
//math[@id='math']/mrow/mi
//math[@id='math']/mrow/mo
//math[@id='math']/mrow/mfrac
//math[@id='math']/mrow/mfrac/mn
//math[@id='math']/mrow/mfrac/msqrt
//math[@id='math']/mrow/mfrac/msqrt/mrow
//math[@id='math']/mrow/mfrac/msqrt/mrow/msup
//math[@id='math']/mrow/mfrac/msqrt/mrow/msup/mi
//math[@id='math']/mrow/mfrac/msqrt/mrow/msup/mn
//math[@id='math']/mrow/mfrac/msqrt/mrow/mo
//math[@id='math']/mrow/mfrac/msqrt/mrow/mn
```

## Unqualified vs. Qualified
Symbolic element tree of `tests/resources/wiki.xml` showing position of unqualified elements.

```
feed
  id
  title
  link
  link
  updated
  subtitle
  generator
  entry
    id
    title
    link
    updated
    summary
    author
      name
  entry   <- 9th child of 'feed'
    id
    title
    link
    updated
    summary
    author   <- 6th child of 'entry'
      name
  entry
    id
    title
    link
    updated
    summary
    author
      name
```

`tree.getpath(element)` may return a fully qualified expression, a fully unqualified expression or a mix of both, such as `/soap:Envelope/soap:Body/*[2]`.

Unqualified parts are converted to qualified ones.

```
/*/*[9]/*[6]
/*           # root element
  /*[9]      # 9th child of root element. Tag name unknown.
       /*[6] # 6th child of previous element.  Tag name unknown.
```

The qualified expression uses the appropriate namespace prefix:

```
/*/*[9]/*[6]   /ns98:feed/ns98:entry/ns98:author
/*           # /ns98:feed
  /*[9]      #           /ns98:entry
       /*[6] #                      /ns98:author
```
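
The conversion can be sketched by walking the positional steps and reading the tag found at each position. This is only an illustration under stated assumptions, not how `pyxml2xpath` does it: it uses the standard library parser instead of `lxml`, a small stand-in document instead of `wiki.xml`, and the hypothetical names `qualify` and `prefixes`.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in document with a default namespace.
XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <entry><title>t</title><author><name>n</name></author></entry>
</feed>"""

STEP = re.compile(r"\*(?:\[(\d+)\])?")


def qualify(root, unqualified, prefixes):
    """Turn positional steps (/*/*[2]/*[2]) into prefixed element names."""
    node = root
    parts = []
    for depth, step in enumerate(unqualified.strip("/").split("/")):
        match = STEP.fullmatch(step)
        if match is None:
            raise ValueError(f"unsupported step: {step!r}")
        if depth > 0:
            position = int(match.group(1) or 1)  # XPath positions are 1-based
            node = list(node)[position - 1]
        if node.tag.startswith("{"):
            # ElementTree stores namespaced tags as '{uri}local'.
            uri, local = node.tag[1:].split("}", 1)
            parts.append(f"{prefixes[uri]}:{local}")
        else:
            parts.append(node.tag)
    return "/" + "/".join(parts)


root = ET.fromstring(XML)
prefixes = {"http://www.w3.org/2005/Atom": "ns98"}
print(qualify(root, "/*/*[2]/*[2]", prefixes))
```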

## Initial Xpath Examples
For use as the third command line argument or as the `xpath_base` named parameter.

```
# Elements, comments and PIs
//* | //processing-instruction() | //comment()
/descendant-or-self::node()[not(.=self::text())]

# A processing instruction with a comment preceding sibling
//processing-instruction("pitest")[preceding-sibling::comment()]

# Comment following a ns98:typeId element
//comment()[preceding-sibling::ns98:typeId[parent::ns98:ClinicalDocument]][1]

# A comment containing specified text.
//comment()[contains(., "before root")]
```

## Performance
Performance degrades quickly for documents that produce more than 500k XPath expressions.  
Timing the main steps of the `parsed_mixed_ns()` method with `timeit` shows that the most time-consuming task is initializing the result dictionary, while the time taken by `lxml.etree.parse()` and by processing unqualified expressions remains stable.  
Unnecessary iterations were removed and the preloading of dictionary keys was optimized, so the remaining penalty comes mostly from dictionary performance itself.

With times in seconds:

```
tree.xpath: 1.08
dict preloaded with: 750000 keys; 204.20
parse finished: 2.10


tree.xpath: 1.10
dict preloaded with: 1000000 keys; 399.05
parse finished: 2.60
```

Test file: [Treebank dataset](https://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/) - 82MB uncompressed, 2.4M XPath expressions.

## Known issues
- Count of elements fails with documents with long element names. See [issue pxx-13](https://github.com/mluis7/pyxml2xpath/issues/19)

## Testing
To see result messages, run:

`pytest --capture=no --verbose`

**Verifying found keys**  
Compare the keys found by `xmllint` and `pyxml2xpath`:

```bash
printf "%s\n" "setrootns" "whereis //*" "bye" | xmllint --shell resources/HL7.xml | grep -v '^[/] >' > /tmp/HL7-whereis-xmllint.txt
pyxml2xpath resources/HL7.xml 'raw' none none none True | cut -d ' ' -f1 > /tmp/HL7-raw-keys.txt
diff -u /tmp/HL7-raw-keys.txt /tmp/HL7-whereis-xmllint.txt
```
`diff` prints nothing when both key lists match.

**Verifying found qualified expressions**  
Check the qualified XPath expressions with a different tool by counting the elements found with them:

```bash
#!/bin/bash
xfile='resources/HL7.xml'
cmds=( "setrootns" "setns ns98=urn:hl7-org:v3" )

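# Word splitting is intentional: one XPath expression per word
for xpath in $(pyxml2xpath "$xfile" none none none none True | sort -u); do
    cmds+=( "xpath count($xpath) > 0" )
done

# Drop shell prompts and every successful check; only failures remain
printf "%s\n" "${cmds[@]}" | xmllint --shell "$xfile" | grep -v '^[/] >' | grep -v 'Object is a Boolean : true'

# The last grep exits non-zero when it printed nothing, i.e. no check failed
if [ "$?" -ne 0 ]; then
    echo "Success. Counts returned > 0"
fi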
```


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pyxml2xpath",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "xpath, xml",
    "author": null,
    "author_email": "Luis Mu\u00f1oz <south.minds@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/54/c4/5fde11730ac5d16d96a71016be2958c787093028ac1d71fdf881539d0098/pyxml2xpath-0.3.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0",
    "summary": "Generate xpath expressions from XML document.",
    "version": "0.3.4",
    "project_urls": {
        "Repository": "https://github.com/mluis7/pyxml2xpath.git"
    },
    "split_keywords": [
        "xpath",
        " xml"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a12765048be33ce04e54fe12e318801426774421236d63659a96bbcdf2a44733",
                "md5": "9f05d519e3a53d02694223f8133b9ec1",
                "sha256": "7c30b1bad235b5b4ea867b9573382800bf387f57972474899ad995e3f7de1f9e"
            },
            "downloads": -1,
            "filename": "pyxml2xpath-0.3.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9f05d519e3a53d02694223f8133b9ec1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 22770,
            "upload_time": "2024-12-21T18:10:31",
            "upload_time_iso_8601": "2024-12-21T18:10:31.522253Z",
            "url": "https://files.pythonhosted.org/packages/a1/27/65048be33ce04e54fe12e318801426774421236d63659a96bbcdf2a44733/pyxml2xpath-0.3.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "54c45fde11730ac5d16d96a71016be2958c787093028ac1d71fdf881539d0098",
                "md5": "1dfd6416174e3e4cce040b0ed8dae51e",
                "sha256": "43bafec0673662792786e16f1edbae87cf511bb585443481fc41d8179eec557b"
            },
            "downloads": -1,
            "filename": "pyxml2xpath-0.3.4.tar.gz",
            "has_sig": false,
            "md5_digest": "1dfd6416174e3e4cce040b0ed8dae51e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 23303,
            "upload_time": "2024-12-21T18:10:33",
            "upload_time_iso_8601": "2024-12-21T18:10:33.254482Z",
            "url": "https://files.pythonhosted.org/packages/54/c4/5fde11730ac5d16d96a71016be2958c787093028ac1d71fdf881539d0098/pyxml2xpath-0.3.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-21 18:10:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mluis7",
    "github_project": "pyxml2xpath",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pyxml2xpath"
}
        