| Field | Value |
|---|---|
| Name | lxmlx |
| Version | 2.0.2 |
| Home page | https://github.com/innodatalabs/lxmlx |
| Summary | Helpers and utilities to be used with lxml |
| Upload time | 2019-11-06 20:09:23 |
| Author | Mike Kroutikov |
| License | MIT |
| Keywords | lxml, xml, events, sax |
| Requirements | lxml |
| CI | Travis-CI |
| Coveralls test coverage | none |

# lxmlx
[![Build Status](https://travis-ci.org/innodatalabs/lxmlx.svg?branch=master)](https://travis-ci.org/innodatalabs/lxmlx)
[![PyPI version](https://badge.fury.io/py/lxmlx.svg)](https://badge.fury.io/py/lxmlx)

Helpers and utilities for streaming processing of XML documents. Intended to be used with [lxml](http://lxml.de).

## Installation

Attention: this package no longer supports Python 2.

If you install using `pip`, all dependencies are automatically fetched and installed:

```
pip install lxmlx
```

If you want to build from source, follow these steps:

### Building and testing (Python 3):
```
virtualenv .venv -p python3
. .venv/bin/activate
pip install -r requirements.txt
pip install pytest
pytest lxmlx
```

## Event stream
An event stream is a representation of an XML document that is equivalent to the in-memory tree.

It is similar to SAX parsing events, except:

1. we use a simplified set of events (`ENTER`, `EXIT`, `TEXT`, `COMMENT` and `PI`)
2. events are represented natively as Python streams (generators)
3. event objects are JSON-serializable
4. we use events for complete XML processing: parsing, transformation, and writing

Each event in the stream is a dict containing at least a `type` key.
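
As a minimal sketch of points 2 and 3, the snippet below produces a short stream from a generator and round-trips it through JSON. The generator name `greeting_events` is hypothetical and not part of lxmlx. Plain strings stand in for the `lxmlx.event` constants so the example needs only the standard library.

```python
import json

# Event type strings as documented in this README. Plain strings are used
# here instead of the lxmlx.event constants to keep the sketch stdlib-only.
ENTER, EXIT, TEXT = "enter", "exit", "text"


def greeting_events():
    """Hypothetical generator: the events for <p class="hi">Hello!</p>."""
    yield {"type": ENTER, "tag": "p", "attrib": {"class": "hi"}}
    yield {"type": TEXT, "text": "Hello!"}
    yield {"type": EXIT}


# Events are plain dicts, so each one can be written as one line of JSON
# and read back without loss.
lines = [json.dumps(evt, sort_keys=True) for evt in greeting_events()]
restored = [json.loads(line) for line in lines]
```

Writing one event per line is only one possible layout. It is chosen here because it lets a consumer read the stream incrementally.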

## ENTER event
An `ENTER` event is fired to indicate the opening of an XML tag. Payload:

* `type` - must be the string `"enter"` (or the constant `lxmlx.event.ENTER`)
* `tag` - the element tag
* `attrib` - optional dictionary of attributes

Example:
```python
{
  'type'  : 'enter',
  'tag'   : 'font',
  'attrib': {
    'name' : 'Times',
    'style': 'bold'
  }
}
```

## EXIT event
An `EXIT` event is fired to indicate the closing of an XML tag. No payload is
expected, because the event implicitly corresponds to the tag opened by the
matching `ENTER` event.

* `type` - must be the string `"exit"` (or the constant `lxmlx.event.EXIT`)

Example:
```python
{
  "type": "exit"
}
```

## TEXT event
A `TEXT` event is fired to indicate XML character data (text content). Payload:

* `type` - must be the string `"text"` (or the constant `lxmlx.event.TEXT`)
* `text` - required text content

Example:
```python
{
  "type": "text",
  "text": "Hello!"
}
```

## COMMENT event
A `COMMENT` event is fired to indicate an XML comment. Payload:

* `type` - must be the string `"comment"` (or the constant `lxmlx.event.COMMENT`)
* `text` - required comment text

Example:
```python
{
  "type": "comment",
  "text": "Hello!"
}
```

## PI event
A `PI` event is fired to indicate an XML processing instruction. Payload:

* `type` - must be the string `"pi"` (or the constant `lxmlx.event.PI`)
* `target` - required PI target (aka tag)
* `text` - optional PI text content

Example:
```python
{
  "type"  : "pi",
  "target": "myPI",
  "text"  : "my cool text here"
}
```

Our definition of an event stream is consistent with a depth-first,
left-to-right traversal of the XML tree.
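
The traversal order can be illustrated with a short sketch that walks a tree and yields events. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies. The function name `tree_to_events` is hypothetical and is not the lxmlx API. Comments and processing instructions are ignored here.

```python
import xml.etree.ElementTree as ET


def tree_to_events(elem):
    """Yield events for elem and its descendants, depth-first, left to right."""
    event = {"type": "enter", "tag": elem.tag}
    if elem.attrib:
        event["attrib"] = dict(elem.attrib)
    yield event
    if elem.text:
        yield {"type": "text", "text": elem.text}
    for child in elem:
        yield from tree_to_events(child)
        # Text that follows a child's closing tag belongs to the parent.
        if child.tail:
            yield {"type": "text", "text": child.tail}
    yield {"type": "exit"}


root = ET.fromstring("<p>Hello, <b>world</b>!</p>")
events = list(tree_to_events(root))
```

The `tail` handling is what keeps the text `!` after `</b>` in its document position, between the `EXIT` of `b` and the `EXIT` of `p`.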

## Example
The XML document below
```xml
<book>
   <chapter id="1">Introduction</chapter>
   <chapter id="2">Preface</chapter>
   <chapter id="3">Title</chapter>
</book>
```

can equivalently be represented (ignoring the whitespace between elements) by the following event stream:
```json
[
  {"type": "enter", "tag": "book"},

  {"type": "enter", "tag": "chapter", "attrib": {"id": "1"}},
  {"type": "text", "text": "Introduction"},
  {"type": "exit"},

  {"type": "enter", "tag": "chapter", "attrib": {"id": "2"}},
  {"type": "text", "text": "Preface"},
  {"type": "exit"},

  {"type": "enter", "tag": "chapter", "attrib": {"id": "3"}},
  {"type": "text", "text": "Title"},
  {"type": "exit"},

  {"type": "exit"}
]
```
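
To show that such a stream carries enough information to rebuild the markup, here is a minimal serializer sketch. The name `events_to_xml` is hypothetical and not taken from lxmlx. It handles only `ENTER`, `EXIT` and `TEXT` events and relies on a stack to recover the tag name that an `EXIT` event closes.

```python
from xml.sax.saxutils import escape, quoteattr


def events_to_xml(events):
    """Serialize ENTER/EXIT/TEXT events to an XML string."""
    parts = []
    open_tags = []  # tags entered but not yet exited
    for evt in events:
        if evt["type"] == "enter":
            attrs = "".join(
                f" {name}={quoteattr(value)}"
                for name, value in evt.get("attrib", {}).items()
            )
            parts.append(f"<{evt['tag']}{attrs}>")
            open_tags.append(evt["tag"])
        elif evt["type"] == "exit":
            # EXIT carries no tag: it closes the most recently opened element.
            parts.append(f"</{open_tags.pop()}>")
        elif evt["type"] == "text":
            parts.append(escape(evt["text"]))
    return "".join(parts)


book = [
    {"type": "enter", "tag": "book"},
    {"type": "enter", "tag": "chapter", "attrib": {"id": "1"}},
    {"type": "text", "text": "Introduction"},
    {"type": "exit"},
    {"type": "exit"},
]
xml_text = events_to_xml(book)
```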

### Why do we need an event stream representation of XML?
Some tasks are easier with a tree representation, while others
are better done on an event stream representation.

1. Stripping some XML tags. The goal is to remove some tags from an XML
   document, leaving text and other tags intact. In terms of the XML tree,
   this requires carefully re-attaching the children and the contained text,
   and is difficult to get right, especially if you need to remove many tags
   from a single tree, mutating the tree for each one.

   Using the event stream representation, this is as easy as suppressing
   the matching `ENTER` and `EXIT` events.

2. Extracting text content from an XML fragment. Using the traditional
   tree representation this is not a difficult task, but using the event stream
   representation it becomes trivial: accept only `TEXT` events and
   join the resulting text pieces together:
   ```python
   from lxmlx.event import TEXT  # the string "text"
   ''.join(evt['text'] for evt in events if evt['type'] == TEXT)
   ```

3. Wrapping XML elements. This is a daunting task with the XML tree
   representation, but very easy with an event stream: inject the wrapper's
   events each time you detect the `ENTER` or `EXIT` of a wrappee.

4. When implemented right, event stream processing uses a limited amount of
   memory, independent of the size of the XML document. Even huge XML documents
   can be transformed quickly using a small amount of RAM.
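
Items 1 to 3 above can be sketched as small generator filters. The names `strip_tags` and `wrap` are hypothetical illustrations and are not claimed to be lxmlx functions. Both assume a well-formed input stream and use a stack to pair each `EXIT` with its `ENTER`. Because they are generators, they hold only the stack of open elements in memory, in line with item 4.

```python
def strip_tags(events, tags):
    """Drop ENTER/EXIT pairs for the given tags; keep everything inside them."""
    stack = []  # one flag per open element: was its ENTER suppressed?
    for evt in events:
        if evt["type"] == "enter":
            suppressed = evt["tag"] in tags
            stack.append(suppressed)
            if not suppressed:
                yield evt
        elif evt["type"] == "exit":
            if not stack.pop():
                yield evt
        else:
            yield evt


def wrap(events, tag, wrapper):
    """Surround every `tag` element with a new `wrapper` element."""
    stack = []  # one flag per open element: is it being wrapped?
    for evt in events:
        if evt["type"] == "enter":
            wrapped = evt["tag"] == tag
            stack.append(wrapped)
            if wrapped:
                yield {"type": "enter", "tag": wrapper}
            yield evt
        elif evt["type"] == "exit":
            yield evt
            if stack.pop():
                yield {"type": "exit"}  # closes the injected wrapper
        else:
            yield evt


source = [
    {"type": "enter", "tag": "p"},
    {"type": "text", "text": "Hello, "},
    {"type": "enter", "tag": "b"},
    {"type": "text", "text": "world"},
    {"type": "exit"},
    {"type": "exit"},
]
stripped = list(strip_tags(source, {"b"}))
wrapped = list(wrap(source, "b", "span"))
text = "".join(evt["text"] for evt in wrapped if evt["type"] == "text")
```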


## Well-formed event stream

Not every sequence of events is a valid event stream. Well-formedness
requires that the stream correspond to a left-to-right, depth-first
traversal of some tree.
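
One way to test this property is sketched below. The function name `is_well_formed` is hypothetical. The exact rules are an assumption: every `EXIT` must close an open `ENTER`, every element must be closed by the end, there must be exactly one root element, and text may appear only inside an element. The source does not spell out how comments and processing instructions outside the root are treated, so this sketch simply allows them.

```python
def is_well_formed(events):
    """Return True if events look like a depth-first traversal of one tree."""
    depth = 0  # number of currently open elements
    roots = 0  # number of top-level elements seen so far
    for evt in events:
        kind = evt["type"]
        if kind == "enter":
            if depth == 0:
                roots += 1
            depth += 1
        elif kind == "exit":
            if depth == 0:
                return False  # EXIT without a matching ENTER
            depth -= 1
        elif kind == "text" and depth == 0:
            return False  # assumption: text must sit inside an element
    return depth == 0 and roots == 1


good = [
    {"type": "enter", "tag": "a"},
    {"type": "text", "text": "hi"},
    {"type": "exit"},
]
unclosed = [{"type": "enter", "tag": "a"}]
stray_exit = [{"type": "exit"}]
```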



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/innodatalabs/lxmlx",
    "name": "lxmlx",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "lxml xml events sax",
    "author": "Mike Kroutikov",
    "author_email": "mkroutikov@innodata.com",
    "download_url": "",
    "platform": "",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Helpers and utilities to be used with lxml",
    "version": "2.0.2",
    "split_keywords": [
        "lxml",
        "xml",
        "events",
        "sax"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "c19ffdafcc12bafa39f68426da13206a",
                "sha256": "1857d1a49add83e24abe770fafea066eb3c303843f1ec40c28c791d333614549"
            },
            "downloads": -1,
            "filename": "lxmlx-2.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c19ffdafcc12bafa39f68426da13206a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 9275,
            "upload_time": "2019-11-06T20:09:23",
            "upload_time_iso_8601": "2019-11-06T20:09:23.291926Z",
            "url": "https://files.pythonhosted.org/packages/4f/77/4c3dcb5e4912a8a1be20e0e92cd1bf4df4ed8211643d8463cba9bd1ebd7f/lxmlx-2.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2019-11-06 20:09:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "innodatalabs",
    "github_project": "lxmlx",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "lxml",
            "specs": [
                [
                    "~=",
                    "4.6.2"
                ]
            ]
        }
    ],
    "lcname": "lxmlx"
}
        