xmlstreamer

- **Name**: xmlstreamer
- **Version**: 1.1.2
- **Summary**: Python library to parse XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory, supporting decompression, encoding detection, tag-based itemization, and optional filtering.
- **Upload time**: 2025-09-02 17:48:56
- **Requires Python**: >=3.9
- **Keywords**: xml, streaming, parser, lightweight, http, https, ftp
- **Requirements**: No requirements were recorded.

# xmlstreamer

**xmlstreamer** is a Python library designed for efficient, memory-friendly streaming and parsing of large XML feeds from various sources, including compressed formats. It supports decompression, character encoding detection, tag-based itemization, and optional item filtering, making it ideal for handling real-time XML data feeds, large datasets, and complex structured content.

## Features

- **Streamed Data Parsing**: Parses XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory.
- **GZIP Decompression Support**: Seamlessly handles GZIP-compressed XML feeds for bandwidth efficiency.
- **Encoding Detection**: Automatically detects and decodes character encodings, ensuring compatibility with varied XML data formats.
- **Customizable Item Tokenization**: Uses SAX-based parsing with user-defined tags to parse only relevant items from XML streams, reducing parsing overhead.
- **Configurable Runtime and Buffering**: Includes configurable buffer sizes and runtime limits, allowing you to tailor performance to your application’s needs.
- **Flexible Filtering**: Supports custom item filtering based on tag attributes and content, ideal for targeted data extraction.
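
To make the tag-based itemization idea concrete, here is a minimal sketch built only on the standard library's `xml.sax` incremental parser. It is not xmlstreamer's implementation; `ItemCollector` and `iter_items` are hypothetical names, and the sketch assumes each item holds flat child elements with text content only.

```python
import xml.sax


class ItemCollector(xml.sax.ContentHandler):
    """Collect the flat child elements of every <separator_tag> into a dict."""

    def __init__(self, separator_tag):
        super().__init__()
        self.separator_tag = separator_tag
        self.items = []        # completed items waiting to be consumed
        self._current = None   # dict for the item being built, if any
        self._field = None     # name of the child element being read
        self._text = []

    def startElement(self, name, attrs):
        if name == self.separator_tag:
            self._current = {}
        elif self._current is not None:
            self._field = name
            self._text = []

    def characters(self, content):
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self.separator_tag and self._current is not None:
            self.items.append(self._current)
            self._current = None
        elif self._current is not None and name == self._field:
            self._current[name] = "".join(self._text).strip()
            self._field = None


def iter_items(chunks, separator_tag):
    """Feed byte chunks to the parser and yield items as soon as they close."""
    handler = ItemCollector(separator_tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
        while handler.items:
            yield handler.items.pop(0)
    parser.close()
    yield from handler.items


# The chunk boundary deliberately falls in the middle of a closing tag.
chunks = [b"<rss><item><title>A</ti", b"tle></item><item><title>B</title></item></rss>"]
for item in iter_items(chunks, "item"):
    print(item)
```

Because the parser is fed chunk by chunk, only the item currently being built is held in memory, whatever the size of the document.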

## Example Use Cases

- Parsing large RSS/Atom feeds in real-time, with item-by-item streaming and filtering.
- Handling compressed XML datasets for applications in web scraping, data aggregation, and news syndication.
- Memory-efficient parsing of large or continuous XML data streams without requiring the entire document in memory.
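
For the compressed-feed use case, the following sketch shows one common way to decompress a GZIP stream incrementally with the standard library's `zlib`. It illustrates the general technique rather than xmlstreamer's internals; `iter_decompressed` is a hypothetical helper, and it assumes a single-member GZIP stream.

```python
import gzip
import zlib


def iter_decompressed(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Simulate a compressed feed arriving in 64-byte network reads.
original = b"<rss>" + b"<item><title>x</title></item>" * 1000 + b"</rss>"
payload = gzip.compress(original)
network_reads = (payload[i:i + 64] for i in range(0, len(payload), 64))
restored = b"".join(iter_decompressed(network_reads))
print(len(payload), len(restored))
```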

### DeepWiki Docs: [https://deepwiki.com/carlosplanchon/xmlstreamer](https://deepwiki.com/carlosplanchon/xmlstreamer)

## Installation

To install `xmlstreamer`, use uv:

```bash
uv add xmlstreamer
```

## Usage

```python
from xmlstreamer import StreamInterpreter

import pprint

url = "https://example.com/large-feed.xml"
separator_tag = "item"

interpreter = StreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    buffer_size=1024 * 128,
    max_running_time=600  # 10 minutes
)

for item in interpreter:
    pprint.pprint(item)  # Process each parsed item as a dictionary
```

Define custom filters, encoding mappings, or buffer sizes as needed for optimal performance.
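
The `buffer_size` value above, `1024 * 128`, is 131072, presumably the number of bytes requested per read (128 KiB). As a rough picture of what a buffer size controls, the sketch below reads any file-like object in fixed-size blocks. `iter_chunks` is a hypothetical helper and is not part of xmlstreamer.

```python
import io


def iter_chunks(stream, buffer_size=1024 * 128):
    """Yield successive blocks of at most buffer_size bytes until EOF."""
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        yield chunk


# An in-memory stream stands in for a network response here.
sizes = [len(chunk) for chunk in iter_chunks(io.BytesIO(b"x" * 300), buffer_size=128)]
print(sizes)  # [128, 128, 44]
```

A larger buffer means fewer reads at the cost of more memory per read; a smaller one does the opposite.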

## Filtering Usage

To enable item filtering and alerts, create a subclass of `xmlstreamer.StreamInterpreter` with custom methods for filtering and alerting. Below is an example of setting up an item filter based on date and creating a zero-items alert.

### Step 1: Define an `ItemFilter`

The `ItemFilter` class specifies which items to keep based on date filtering criteria:

```python
import attrs
from attrs_strict import type_validator
from typing import Optional

@attrs.define
class ItemFilter:
    attrib: str = attrs.field(
        kw_only=True,
        validator=type_validator()
    )
    fmt: Optional[str] = attrs.field(
        validator=type_validator(),
        default=None
    )
    max_item_age_in_days: int = attrs.field(
        kw_only=True,
        validator=type_validator()
    )
```

- `attrib`: The XML tag or attribute to filter by.
- `fmt`: Optional date format for parsing dates within the attribute. If not provided, `dateparser` is used for parsing.
- `max_item_age_in_days`: The maximum allowable age of items in days.

### Step 2: Define Helper Functions for Parsing and Filtering Dates
The following functions parse dates and decide whether an item should be kept based on the specified date limit:

```python
from datetime import datetime, timedelta
from typing import Optional
import dateparser

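def parse_date(string: str, fmt: str) -> Optional[datetime]:
    # Return None instead of raising when the string does not match fmt.
    try:
        return datetime.strptime(string, fmt)
    except ValueError:
        return None

def eval_keep_date_item(string: str, fmt: Optional[str], limit_date: datetime) -> Optional[bool]:
    string = string.strip()
    # Use the explicit format when one is given; otherwise let dateparser guess.
    parsed_date = parse_date(string, fmt) if fmt else dateparser.parse(string)
    # True: newer than the limit; False: too old; None: the date could not be parsed.
    return parsed_date.timestamp() > limit_date.timestamp() if parsed_date else None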
```

- `parse_date`: Parses the date string with the specified format.
- `eval_keep_date_item`: Checks if the item’s date is within the allowable age limit.
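
As a worked example of the three possible results, here is a standard-library-only variant of the check above, without the `dateparser` fallback. `is_newer_than` and `RSS_FMT` are names introduced for illustration; the format string is the one used for `pubDate` in Step 5.

```python
from datetime import datetime, timezone
from typing import Optional

RSS_FMT = "%a, %d %b %Y %H:%M:%S %z"


def is_newer_than(date_string: str, fmt: str, limit_date: datetime) -> Optional[bool]:
    """Return True/False for a parseable date, None when parsing fails."""
    try:
        parsed = datetime.strptime(date_string.strip(), fmt)
    except ValueError:
        return None
    return parsed.timestamp() > limit_date.timestamp()


limit = datetime(2025, 8, 26, tzinfo=timezone.utc)
print(is_newer_than("Tue, 02 Sep 2025 17:48:56 +0000", RSS_FMT, limit))  # True
print(is_newer_than("Fri, 01 Aug 2025 09:00:00 +0000", RSS_FMT, limit))  # False
print(is_newer_than("yesterday", RSS_FMT, limit))                        # None
```

The `None` result matters in Step 3, where an unparseable date is treated differently from a date that is too old.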

### Step 3: Define the Filtering Function
The `filter_parsed_item` function applies the date filter to each parsed item:

```python
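from typing import Optional, Dict, Any

# datetime, timedelta and eval_keep_date_item come from Step 2.
# ParsedItem is not imported in this snippet, so its annotations are quoted
# to keep the definition from failing with a NameError.
def filter_parsed_item(ITEM_FILTER: ItemFilter, parsed_item: "ParsedItem") -> Optional["ParsedItem"]:
    attrib = ITEM_FILTER.attrib
    fmt = ITEM_FILTER.fmt
    max_item_age_in_days = ITEM_FILTER.max_item_age_in_days
    limit_date = datetime.now() - timedelta(days=max_item_age_in_days)
    item_content: Dict[str, Any] = parsed_item.parsed_content

    if attrib in item_content:
        dt = item_content[attrib]
        keep_item = eval_keep_date_item(dt, fmt, limit_date) if isinstance(dt, str) else None
        if keep_item is None:
            # Unparseable or non-string date: keep the item but null the field.
            item_content[attrib] = None
            return parsed_item
        elif keep_item:
            return parsed_item

    # The item is too old, or it has no date field at all: drop it.
    return None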
```
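
The filter above has three outcomes: keep the item, keep it with the date field set to `None`, or drop it. The sketch below reproduces that decision with stand-ins so it can run without xmlstreamer, `attrs`, or `dateparser`. `filter_by_age`, `make_item`, the `SimpleNamespace` objects, and the injectable `now` argument are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def filter_by_age(item_filter, parsed_item, now):
    """Simplified stand-in for filter_parsed_item with a fixed clock."""
    limit_date = now - timedelta(days=item_filter.max_item_age_in_days)
    content = parsed_item.parsed_content
    if item_filter.attrib not in content:
        return None  # no date field: drop
    try:
        parsed = datetime.strptime(content[item_filter.attrib], item_filter.fmt)
    except (TypeError, ValueError):
        content[item_filter.attrib] = None  # unparseable: keep, null the date
        return parsed_item
    return parsed_item if parsed > limit_date else None


def make_item(**content):
    return SimpleNamespace(parsed_content=content)


flt = SimpleNamespace(attrib="pubDate", fmt="%Y-%m-%d", max_item_age_in_days=7)
now = datetime(2025, 9, 2)

fresh = filter_by_age(flt, make_item(pubDate="2025-09-01"), now)  # kept
stale = filter_by_age(flt, make_item(pubDate="2025-08-01"), now)  # dropped
odd = filter_by_age(flt, make_item(pubDate="soon"), now)          # kept, date nulled
missing = filter_by_age(flt, make_item(title="no date"), now)     # dropped
print(fresh, stale, odd, missing)
```

Note the asymmetry: an item whose date cannot be parsed is kept, while an item with no date field at all is dropped.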

### Step 4: Extend StreamInterpreter for Filtering and Alerts
Create a subclass that enables filtering with `filter_parsed_item` and alerts if no items are parsed.

```python
from xmlstreamer import StreamInterpreter
from datetime import datetime
from pathlib import Path
import inspect

class CustomStreamInterpreter(StreamInterpreter):
    def __init__(self, **kwargs):
        kwargs["max_running_time"] = 3600  # Set max runtime to 1 hour
        super().__init__(**kwargs)

        stack = inspect.stack()
        fname = stack[1].filename
        fname_path = Path(fname)
        self.called_from = fname_path.stem

        self.alerts_enabled = True
        self.filter_parsed_item_func = filter_parsed_item

    def raise_stop_iteration(self):
        print(f"XMLSTREAMER STATS ITEMS PARSED: {self.stats_parsed_items}")
        if self.stats_parsed_items == 0:
            self.raise_zero_items_alert()
        raise StopIteration

    def raise_zero_items_alert(self):
        print("--- ZERO ITEMS ALERT ---")
        actual_date = datetime.now()
        running_time = actual_date - self.start_date
        print(f"Running time exceeded with no items parsed: {running_time}")
```

- `CustomStreamInterpreter`: Initializes with `filter_parsed_item_func` for item filtering.
- `raise_zero_items_alert`: Triggered if no items are parsed, printing a warning.
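
The combination of a runtime limit and a zero-items alert can also be pictured independently of the library. The generator below is a sketch of that pattern only, not how `StreamInterpreter` implements it; `limited`, `on_zero_items`, and the injectable `clock` are hypothetical names.

```python
import time


def limited(iterable, max_running_time, on_zero_items=None, clock=time.monotonic):
    """Yield items until the time budget is spent; report if none were seen."""
    deadline = clock() + max_running_time
    count = 0
    for item in iterable:
        if clock() >= deadline:
            break
        count += 1
        yield item
    if count == 0 and on_zero_items is not None:
        on_zero_items()


alerts = []
empty_run = list(limited([], 3600, on_zero_items=lambda: alerts.append("zero items")))
print(empty_run, alerts)
```

The callback only fires when the generator runs to completion, so a consumer that stops iterating early will not trigger it.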

### Step 5: Run with Custom Filtering
To use filtering and alerting with your subclass:

```python
url = "https://example.com/large-feed.xml"
separator_tag = "item"
item_filter = ItemFilter(
    attrib="pubDate",
    fmt="%a, %d %b %Y %H:%M:%S %z",
    max_item_age_in_days=7
)

interpreter = CustomStreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    item_filter=item_filter,
    buffer_size=1024 * 128,
)

for item in interpreter:
    print(item)  # Process each filtered item as a dictionary
```

This example demonstrates setting a `pubDate` filter that removes items older than 7 days. The `CustomStreamInterpreter` will also trigger an alert if no items are parsed within the set runtime.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "xmlstreamer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "xml, streaming, parser, lightweight, http, https, ftp",
    "author": null,
    "author_email": "\"Carlos A. Planch\u00f3n\" <carlosandresplanchonprestes@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/dd/c9/10cb3d504febd73fa42f2a6a85d61f01443c6ac32699d9ccc3da4157e95e/xmlstreamer-1.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Python library to parse XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory, supporting decompression, encoding detection, tag-based itemization, and optional filtering.",
    "version": "1.1.2",
    "project_urls": {
        "repository": "https://github.com/carlosplanchon/xmlstreamer.git"
    },
    "split_keywords": [
        "xml",
        " streaming",
        " parser",
        " lightweight",
        " http",
        " https",
        " ftp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2aad7a4ffe6412404f3889e249ba492cea1e35d3b5b5a839ec47c6c061dd97aa",
                "md5": "df6907a9554f483dc80665510813745f",
                "sha256": "c84c200d365db80d1cd1ac481daded246457bc3590718813f1dbe1be6e2b3fc4"
            },
            "downloads": -1,
            "filename": "xmlstreamer-1.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "df6907a9554f483dc80665510813745f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 8862,
            "upload_time": "2025-09-02T17:48:54",
            "upload_time_iso_8601": "2025-09-02T17:48:54.798260Z",
            "url": "https://files.pythonhosted.org/packages/2a/ad/7a4ffe6412404f3889e249ba492cea1e35d3b5b5a839ec47c6c061dd97aa/xmlstreamer-1.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ddc910cb3d504febd73fa42f2a6a85d61f01443c6ac32699d9ccc3da4157e95e",
                "md5": "501a0af8a02f7914b6d98fa8f485d526",
                "sha256": "f68df9dbeb03d3505945f35a1e8c31b57ba85cb5aef43a189556e9ac6d847407"
            },
            "downloads": -1,
            "filename": "xmlstreamer-1.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "501a0af8a02f7914b6d98fa8f485d526",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 10802,
            "upload_time": "2025-09-02T17:48:56",
            "upload_time_iso_8601": "2025-09-02T17:48:56.169224Z",
            "url": "https://files.pythonhosted.org/packages/dd/c9/10cb3d504febd73fa42f2a6a85d61f01443c6ac32699d9ccc3da4157e95e/xmlstreamer-1.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-02 17:48:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "carlosplanchon",
    "github_project": "xmlstreamer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "xmlstreamer"
}