# lxmlx
[![Build Status](https://travis-ci.org/innodatalabs/lxmlx.svg?branch=master)](https://travis-ci.org/innodatalabs/lxmlx)
[![PyPI version](https://badge.fury.io/py/lxmlx.svg)](https://badge.fury.io/py/lxmlx)
Helpers and utilities for streaming processing of XML documents. Intended to be used with [lxml](http://lxml.de)
## Installation
Attention: this package no longer supports Python 2.
If you install using `pip`, all dependencies are automatically fetched and installed:
```
pip install lxmlx
```
If you want to build from sources, follow these steps:
### Building and testing (Python 3):
```
virtualenv .venv -p python3
. .venv/bin/activate
pip install -r requirements.txt
pip install pytest
pytest lxmlx
```
## Event stream
Event stream is XML representation which is equivalent to the in-memory tree.
It is similar to SAX parsing events, except:
1. we use simplified set of events (ENTER, EXIT, TEXT, COMMENT and PI)
2. events are represented natively as Python streams (generators)
3. event objects are JSON-serializable
3. we use events for complete XML processing: parsing, transformation, writing
Each event in the stream is a dict containing at least `type` key
## ENTER event
`ENTER` event is fired to indicate the opening of an XML tag. Payload:
* `type` must be string `"enter"` (or constant `lxmlx.event.ENTER`)
* `tag` element tag
* `attrib` optional - a dictionary of attributes
Example:
```python
{
'type' : 'enter',
'tag' : 'font',
'attrib': {
'name' : 'Times',
'style': 'bold'
}
}
```
## EXIT event
`EXIT` event is fired to indicate closing of an XML tag. No payload is
expected, because it implicitly corresponds to the opening tag from `ENTER`
event.
* `type` must be string `"exit"` (or constant `lxmlx.event.EXIT`)
Example:
```python
{
"type": "exit"
}
```
## TEXT event
`TEXT` event is fired to indicate XML `CTEXT` value. Payload is:
* `type` must be string `"text"` (or constant `lxmlx.event.TEXT`)
* `text` - required
Example:
```python
{
"type": "text",
"text": "Hello!"
}
```
## COMMENT
Payload is:
* `type` must be string `"comment"` (or constant `lxmlx.event.COMMENT`)
* `text` - required
Example:
```python
{
"type": "comment",
"text": "Hello!"
}
```
## PI
`PI` - processing instruction. Payload:
* `type` must be string `"pi"` (or constant `lxmlx.event.PI`)
* `target` - required PI target (aka tag)
* `text` - optional PI text content
Example:
```python
{
"type" : "pi",
"target": "myPI",
"text" : "my cool text here"
}
```
Our definition of event stream is consistent with depth-first left-to-right
traversal of XML tree.
## Example
XML document below
```xml
<book>
<chapter id="1">Introduction</chapter>
<chapter id="2">Preface</chapter>
<chapter id="3">Title</chapter>
</book>
```
can equivalently be represented by the following event stream:
```json
[
{"type": "enter", "tag": "book"},
{"type": "enter", "tag": "chapter", "attrib": {"id": "1"}},
{"type": "text", "text": "Introduction"},
{"type": "exit"},
{"type": "enter", "tag": "chapter", "attrib": {"id": "2"}},
{"type": "text", "text": "Preface"},
{"type": "exit"},
{"type": "enter", "tag": "chapter", "attrib": {"id": "3"}},
{"type": "text", "text": "Title"},
{"type": "exit"},
{"type": "exit"}
]
```
### Why do we need event stream representation of XML?
Some tasks are easier done using tree representation, but other
tasks are better done on event stream representation.
1. Stripping some XML tags. Remove some tags from XML document, leaving
text and other tags intact. In terms of XML tree this requires
carefully taking care of the children and contained text, and is
pretty difficult to get it right. Especially if you need to
remove many tags from a single tree - mutating the tree for each
one.
Using event stream representation this is as easy as suppressing
matching `ENTER` and `EXIT` events.
2. Extracting text content from an XML fragment. Using traditional
tree representation this is not a difficult task. But using event stream
representation this becomes quite trivial: accept only `TEXT` events and
join the resulting text pieces together:
```
''.join(evt['text'] for evt in events if evt['type']==TEXT)
```
3. Wrapping XML elements. Daunting task using XML tree representation. Very
easy using events stream - just inject wrappers each time you detect
`ENTER` or `EXIT` of a wrapee.
4. When implemented right, event stream uses limited memory, independent of
the size of the XML document. Even huge XML documents can be transformed
quickly using small amount of RAM.
## Well-formed event stream
Not every sequence of events is a valid event stream. The requirement of
well-formedness asserts that stream corresponds to left-to-right depth-first
traversal of some tree.
Raw data
{
"_id": null,
"home_page": "https://github.com/innodatalabs/lxmlx",
"name": "lxmlx",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "lxml xml events sax",
"author": "Mike Kroutikov",
"author_email": "mkroutikov@innodata.com",
"download_url": "",
"platform": "",
"description": "# lxmlx\n[![Build Status](https://travis-ci.org/innodatalabs/lxmlx.svg?branch=master)](https://travis-ci.org/innodatalabs/lxmlx)\n[![PyPI version](https://badge.fury.io/py/lxmlx.svg)](https://badge.fury.io/py/lxmlx)\n\nHelpers and utilities for streaming processing of XML documents. Intended to be used with [lxml](http://lxml.de)\n\n## Installation\n\nAttention: this package no longer supports Python 2.\n\nIf you install using `pip`, all dependencies are automatically fetched and installed:\n\n```\npip install lxmlx\n```\n\nIf you want to build from sources, follow these steps:\n\n### Building and testing (Python 3):\n```\nvirtualenv .venv -p python3\n. .venv/bin/activate\npip install -r requirements.txt\npip install pytest\npytest lxmlx\n```\n\n## Event stream\nEvent stream is XML representation which is equivalent to the in-memory tree.\n\nIt is similar to SAX parsing events, except:\n\n1. we use simplified set of events (ENTER, EXIT, TEXT, COMMENT and PI)\n2. events are represented natively as Python streams (generators)\n3. event objects are JSON-serializable\n3. we use events for complete XML processing: parsing, transformation, writing\n\nEach event in the stream is a dict containing at least `type` key\n\n## ENTER event\n`ENTER` event is fired to indicate the opening of an XML tag. Payload:\n\n* `type` must be string `\"enter\"` (or constant `lxmlx.event.ENTER`)\n* `tag` element tag\n* `attrib` optional - a dictionary of attributes\n\nExample:\n```python\n{\n 'type' : 'enter',\n 'tag' : 'font',\n 'attrib': {\n 'name' : 'Times',\n 'style': 'bold'\n }\n}\n```\n\n## EXIT event\n`EXIT` event is fired to indicate closing of an XML tag. No payload is\nexpected, because it implicitly corresponds to the opening tag from `ENTER`\nevent.\n\n* `type` must be string `\"exit\"` (or constant `lxmlx.event.EXIT`)\n\nExample:\n```python\n{\n \"type\": \"exit\"\n}\n```\n\n## TEXT event\n`TEXT` event is fired to indicate XML `CTEXT` value. Payload is:\n\n* `type` must be string `\"text\"` (or constant `lxmlx.event.TEXT`)\n* `text` - required\n\nExample:\n```python\n{\n \"type\": \"text\",\n \"text\": \"Hello!\"\n}\n```\n\n## COMMENT\n\nPayload is:\n* `type` must be string `\"comment\"` (or constant `lxmlx.event.COMMENT`)\n* `text` - required\n\nExample:\n```python\n{\n \"type\": \"comment\",\n \"text\": \"Hello!\"\n}\n```\n\n## PI\n`PI` - processing instruction. Payload:\n\n* `type` must be string `\"pi\"` (or constant `lxmlx.event.PI`)\n* `target` - required PI target (aka tag)\n* `text` - optional PI text content\n\nExample:\n```python\n{\n \"type\" : \"pi\",\n \"target\": \"myPI\",\n \"text\" : \"my cool text here\"\n}\n```\n\nOur definition of event stream is consistent with depth-first left-to-right\ntraversal of XML tree.\n\n## Example\nXML document below\n```xml\n<book>\n <chapter id=\"1\">Introduction</chapter>\n <chapter id=\"2\">Preface</chapter>\n <chapter id=\"3\">Title</chapter>\n</book>\n```\n\ncan equivalently be represented by the following event stream:\n```json\n[\n {\"type\": \"enter\", \"tag\": \"book\"},\n\n {\"type\": \"enter\", \"tag\": \"chapter\", \"attrib\": {\"id\": \"1\"}},\n {\"type\": \"text\", \"text\": \"Introduction\"},\n {\"type\": \"exit\"},\n\n {\"type\": \"enter\", \"tag\": \"chapter\", \"attrib\": {\"id\": \"2\"}},\n {\"type\": \"text\", \"text\": \"Preface\"},\n {\"type\": \"exit\"},\n\n {\"type\": \"enter\", \"tag\": \"chapter\", \"attrib\": {\"id\": \"3\"}},\n {\"type\": \"text\", \"text\": \"Title\"},\n {\"type\": \"exit\"},\n\n {\"type\": \"exit\"}\n]\n```\n\n### Why do we need event stream representation of XML?\nSome tasks are easier done using tree representation, but other\ntasks are better done on event stream representation.\n\n1. Stripping some XML tags. Remove some tags from XML document, leaving\n text and other tags intact. In terms of XML tree this requires\n carefully taking care of the children and contained text, and is\n pretty difficult to get it right. Especially if you need to\n remove many tags from a single tree - mutating the tree for each\n one.\n\n Using event stream representation this is as easy as suppressing\n matching `ENTER` and `EXIT` events.\n\n2. Extracting text content from an XML fragment. Using traditional\n tree representation this is not a difficult task. But using event stream\n representation this becomes quite trivial: accept only `TEXT` events and\n join the resulting text pieces together:\n ```\n ''.join(evt['text'] for evt in events if evt['type']==TEXT)\n ```\n\n3. Wrapping XML elements. Daunting task using XML tree representation. Very\n easy using events stream - just inject wrappers each time you detect\n `ENTER` or `EXIT` of a wrapee.\n\n4. When implemented right, event stream uses limited memory, independent of\n the size of the XML document. Even huge XML documents can be transformed\n quickly using small amount of RAM.\n\n\n## Well-formed event stream\n\nNot every sequence of events is a valid event stream. The requirement of\nwell-formedness asserts that stream corresponds to left-to-right depth-first\ntraversal of some tree.\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Helpers and utilities to be used with lxml",
"version": "2.0.2",
"split_keywords": [
"lxml",
"xml",
"events",
"sax"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "c19ffdafcc12bafa39f68426da13206a",
"sha256": "1857d1a49add83e24abe770fafea066eb3c303843f1ec40c28c791d333614549"
},
"downloads": -1,
"filename": "lxmlx-2.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c19ffdafcc12bafa39f68426da13206a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 9275,
"upload_time": "2019-11-06T20:09:23",
"upload_time_iso_8601": "2019-11-06T20:09:23.291926Z",
"url": "https://files.pythonhosted.org/packages/4f/77/4c3dcb5e4912a8a1be20e0e92cd1bf4df4ed8211643d8463cba9bd1ebd7f/lxmlx-2.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2019-11-06 20:09:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "innodatalabs",
"github_project": "lxmlx",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "lxml",
"specs": [
[
"~=",
"4.6.2"
]
]
}
],
"lcname": "lxmlx"
}