xmlstreamer


- Name: xmlstreamer
- Version: 0.9
- Home page: None
- Summary: Python library to parse XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory, supporting decompression, encoding detection, tag-based itemization, and optional filtering.
- Upload time: 2024-11-08 01:00:19
- Maintainer: None
- Docs URL: None
- Author: None
- Requires Python: None
- License: MIT License
- Keywords: xml, streaming, parser, lightweight, http, https, ftp
- Requirements: defusedxml, chardet, requests, attrs, attrs-strict
- Travis CI: none
- Coveralls test coverage: none
# xmlstreamer

**xmlstreamer** is a Python library designed for efficient, memory-friendly streaming and parsing of large XML feeds from various sources, including compressed formats. It supports decompression, character encoding detection, tag-based itemization, and optional item filtering, making it ideal for handling real-time XML data feeds, large datasets, and complex structured content.

## Features

- **Streamed Data Parsing**: Parses XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory.
- **GZIP Decompression Support**: Seamlessly handles GZIP-compressed XML feeds for bandwidth efficiency.
- **Encoding Detection**: Automatically detects and decodes character encodings, ensuring compatibility with varied XML data formats.
- **Customizable Item Tokenization**: Uses SAX-based parsing with user-defined tags to parse only relevant items from XML streams, reducing parsing overhead.
- **Configurable Runtime and Buffering**: Includes configurable buffer sizes and runtime limits, allowing you to tailor performance to your application’s needs.
- **Flexible Filtering**: Supports custom item filtering based on tag attributes and content, ideal for targeted data extraction.
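The idea behind streamed, tag-based itemization can be sketched with the standard library alone. This is not xmlstreamer's implementation: the README says the library uses SAX-based parsing, while the sketch below swaps in `xml.etree.ElementTree.XMLPullParser` because it is compact. The function name `iter_items` and the sample feed are hypothetical, and the stdlib parser is not hardened against malicious XML the way `defusedxml` is.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator


def iter_items(chunks: Iterable[bytes], separator_tag: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <separator_tag> element while feeding the parser chunk by chunk."""
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)  # only the current chunk is handed to the parser
        for _event, elem in parser.read_events():
            if elem.tag == separator_tag:
                # Flatten direct children into {tag: text}.
                yield {child.tag: (child.text or "").strip() for child in elem}
                elem.clear()  # drop the item's children once it has been emitted


# Hypothetical feed, split into 16-byte chunks to simulate a network stream.
feed = b"<rss><channel><item><title>A</title></item><item><title>B</title></item></channel></rss>"
chunks = [feed[i:i + 16] for i in range(0, len(feed), 16)]
items = list(iter_items(chunks, "item"))
```

Each item becomes available as soon as its closing tag arrives, so the caller never needs the whole document at once.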

## Example Use Cases

- Parsing large RSS/Atom feeds in real-time, with item-by-item streaming and filtering.
- Handling compressed XML datasets for applications in web scraping, data aggregation, and news syndication.
- Memory-efficient parsing of large or continuous XML data streams without requiring the entire document in memory.
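For compressed feeds, incremental decompression keeps memory use bounded. The following is a minimal sketch of that technique using the standard `zlib` module; it is an assumption about how such a step could look, not a description of xmlstreamer's internals. The helper name `gunzip_chunks` and the sample payload are hypothetical.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def gunzip_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress a GZIP stream piece by piece instead of all at once."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Hypothetical payload, compressed and then split into 64-byte network-sized pieces.
payload = b"<feed>" + b"<item>x</item>" * 1000 + b"</feed>"
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(gunzip_chunks(pieces))
```

In a real pipeline the decompressed chunks would be passed straight to the parser rather than joined.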

## Installation

To install `xmlstreamer`, use pip:

```bash
pip install xmlstreamer
```

## Usage

```python
from xmlstreamer import StreamInterpreter

import pprint

url = "https://example.com/large-feed.xml"
separator_tag = "item"

interpreter = StreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    buffer_size=1024 * 128,
    max_running_time=600  # 10 minutes
)

for item in interpreter:
    pprint.pprint(item)  # Process each parsed item as a dictionary
```

Adjust `buffer_size` and `max_running_time` to match the feed you are reading, and define custom filters or encoding mappings as needed.
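To make the roles of `buffer_size` and `max_running_time` concrete, here is a rough, stdlib-only sketch of a chunked reader with a time limit. It only illustrates the two knobs; the function `read_chunks` is hypothetical and is not part of xmlstreamer.

```python
import io
import time
from typing import BinaryIO, Iterator


def read_chunks(stream: BinaryIO, buffer_size: int, max_running_time: float) -> Iterator[bytes]:
    """Yield chunks of at most buffer_size bytes until EOF or the time limit is reached."""
    deadline = time.monotonic() + max_running_time
    while time.monotonic() < deadline:
        chunk = stream.read(buffer_size)
        if not chunk:  # end of stream
            break
        yield chunk


# An in-memory stream stands in for an HTTP or FTP response body.
source = io.BytesIO(b"x" * 300)
sizes = [len(chunk) for chunk in read_chunks(source, buffer_size=128, max_running_time=5)]
```

A larger buffer means fewer reads per megabyte at the cost of more memory held at once; the time limit bounds how long a slow or endless feed can keep the loop alive.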

## Filtering Usage

To enable item filtering and alerts, create a subclass of `xmlstreamer.StreamInterpreter` with custom methods for filtering and alerting. Below is an example of setting up an item filter based on date and creating a zero-items alert.

### Step 1: Define an `ItemFilter`

The `ItemFilter` class specifies which items to keep based on date filtering criteria:

```python
import attrs
from attrs_strict import type_validator
from typing import Optional

@attrs.define
class ItemFilter:
    attrib: str = attrs.field(
        kw_only=True,
        validator=type_validator()
    )
    fmt: Optional[str] = attrs.field(
        validator=type_validator(),
        default=None
    )
    max_item_age_in_days: int = attrs.field(
        kw_only=True,
        validator=type_validator()
    )
```

- `attrib`: The XML tag or attribute to filter by.
- `fmt`: Optional date format for parsing dates within the attribute. If not provided, `dateparser` will be used for parsing.
- `max_item_age_in_days`: The maximum allowable age of items in days.

### Step 2: Define Helper Functions for Parsing and Filtering Dates
The following helper functions parse dates and decide whether an item should be kept, given a date limit:

```python
from datetime import datetime, timedelta
from typing import Optional
import dateparser

def parse_date(string: str, fmt: str) -> Optional[datetime]:
    # Return None instead of raising when the string does not match fmt.
    try:
        return datetime.strptime(string, fmt)
    except ValueError:
        return None

def eval_keep_date_item(string: str, fmt: Optional[str], limit_date: datetime) -> Optional[bool]:
    string = string.strip()
    # Use the explicit format when one is given; otherwise let dateparser guess.
    parsed_date = parse_date(string, fmt) if fmt else dateparser.parse(string)
    # None means the date could not be parsed; True means newer than limit_date.
    return parsed_date.timestamp() > limit_date.timestamp() if parsed_date else None
```

- `parse_date`: Parses the date string with the specified format and returns `None` if the string does not match.
- `eval_keep_date_item`: Checks if the item’s date is within the allowable age limit; returns `None` when the date cannot be parsed.
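As a worked example of the comparison, the sketch below checks three date strings against a fixed limit. It uses only the standard library, so the `dateparser` fallback is left out; the function name `is_recent`, the fixed reference time, and the sample strings are made up for illustration. Note that `%a` and `%b` depend on the active locale, and an English or C locale is assumed here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RFC822_FMT = "%a, %d %b %Y %H:%M:%S %z"  # the pubDate format commonly seen in RSS feeds


def is_recent(date_string: str, fmt: str, limit_date: datetime) -> Optional[bool]:
    """True if newer than limit_date, False if older, None if the string does not parse."""
    try:
        parsed = datetime.strptime(date_string.strip(), fmt)
    except ValueError:
        return None
    return parsed.timestamp() > limit_date.timestamp()


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 11, 8, 12, 0, 0, tzinfo=timezone.utc)
limit = now - timedelta(days=7)

fresh = is_recent("Thu, 07 Nov 2024 09:30:00 +0000", RFC822_FMT, limit)
stale = is_recent(" Mon, 28 Oct 2024 09:30:00 +0000 ", RFC822_FMT, limit)
broken = is_recent("yesterday-ish", RFC822_FMT, limit)
```

Comparing `timestamp()` values rather than the `datetime` objects themselves avoids errors when one side is timezone-aware and the other is naive.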

### Step 3: Define the Filtering Function
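The `filter_parsed_item` function applies the date filter to each parsed item. Items newer than the limit are kept, items whose date cannot be parsed are kept with the attribute set to `None`, and items that are too old or lack the attribute are dropped: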

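```python
from datetime import datetime, timedelta
from typing import Optional, Dict, Any

# The original example does not show where ParsedItem is imported from, so the
# annotations are quoted to keep this block importable on its own.
def filter_parsed_item(ITEM_FILTER: ItemFilter, parsed_item: "ParsedItem") -> Optional["ParsedItem"]:
    attrib = ITEM_FILTER.attrib
    fmt = ITEM_FILTER.fmt
    max_item_age_in_days = ITEM_FILTER.max_item_age_in_days
    limit_date = datetime.now() - timedelta(days=max_item_age_in_days)
    item_content: Dict[str, Any] = parsed_item.parsed_content

    if attrib in item_content:
        dt = item_content[attrib]
        keep_item = eval_keep_date_item(dt, fmt, limit_date) if isinstance(dt, str) else None
        if keep_item is None:
            # Date missing or unparseable: keep the item but blank the attribute.
            item_content[attrib] = None
            return parsed_item
        elif keep_item:
            return parsed_item

    # Attribute absent or item older than the limit: drop it.
    return None
```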
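The four possible outcomes of such a filter can be exercised with a small stand-in. Everything here is hypothetical scaffolding: `FakeParsedItem` only mimics the `parsed_content` attribute used above, `keep_recent` is a condensed variant of the filter with a fixed format, and `FMT` is an illustrative date format rather than one taken from the library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any, Dict, Optional

FMT = "%Y-%m-%d %H:%M:%S"  # illustrative format


@dataclass
class FakeParsedItem:
    """Hypothetical stand-in exposing only `parsed_content`."""
    parsed_content: Dict[str, Any] = field(default_factory=dict)


def keep_recent(item: FakeParsedItem, attrib: str, max_age_days: int) -> Optional[FakeParsedItem]:
    limit = datetime.now() - timedelta(days=max_age_days)
    if attrib not in item.parsed_content:
        return None  # attribute missing: drop the item
    raw = item.parsed_content[attrib]
    try:
        parsed = datetime.strptime(raw.strip(), FMT) if isinstance(raw, str) else None
    except ValueError:
        parsed = None
    if parsed is None:
        item.parsed_content[attrib] = None  # unparseable: keep the item, blank the date
        return item
    return item if parsed > limit else None  # too old: drop the item


now = datetime.now()
recent = FakeParsedItem({"pubDate": (now - timedelta(days=1)).strftime(FMT)})
old = FakeParsedItem({"pubDate": (now - timedelta(days=30)).strftime(FMT)})
garbled = FakeParsedItem({"pubDate": "not a date"})
missing = FakeParsedItem({"title": "no date here"})

results = [keep_recent(item, "pubDate", 7) for item in (recent, old, garbled, missing)]
```

Keeping items with unparseable dates is a deliberate trade-off in this design: a malformed date does not silently discard otherwise valid content.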

### Step 4: Extend StreamInterpreter for Filtering and Alerts
Create a subclass that enables filtering with `filter_parsed_item` and alerts if no items are parsed.

```python
from xmlstreamer import StreamInterpreter
from datetime import datetime
from pathlib import Path
import inspect

class CustomStreamInterpreter(StreamInterpreter):
    def __init__(self, **kwargs):
        kwargs["max_running_time"] = 3600  # Set max runtime to 1 hour
        super().__init__(**kwargs)

        stack = inspect.stack()
        fname = stack[1].filename
        fname_path = Path(fname)
        self.called_from = fname_path.stem

        self.alerts_enabled = True
        self.filter_parsed_item_func = filter_parsed_item

    def raise_stop_iteration(self):
        print(f"XMLSTREAMER STATS ITEMS PARSED: {self.stats_parsed_items}")
        if self.stats_parsed_items == 0:
            self.raise_zero_items_alert()
        raise StopIteration

    def raise_zero_items_alert(self):
        print("--- ZERO ITEMS ALERT ---")
        actual_date = datetime.now()
        running_time = actual_date - self.start_date
        print(f"Running time exceeded with no items parsed: {running_time}")
```

- `CustomStreamInterpreter`: Initializes with `filter_parsed_item_func` for item filtering.
- `raise_zero_items_alert`: Triggered if no items are parsed, printing a warning.
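The hook pattern used by the subclass can be tried without a network connection. The sketch below is a self-contained imitation, not the library's code: `CountingIterator` is a hypothetical class that counts yielded items and calls an overridable `raise_stop_iteration` hook when the source is exhausted. The attribute name `stats_parsed_items` is borrowed from the example above, and the `alerts` list is an assumption that replaces the printed output.

```python
from typing import Any, Iterable, List


class CountingIterator:
    """Hypothetical iterator with an end-of-stream hook, mirroring the subclass above."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._source = iter(items)
        self.stats_parsed_items = 0
        self.alerts: List[str] = []

    def __iter__(self) -> "CountingIterator":
        return self

    def __next__(self) -> Any:
        try:
            item = next(self._source)
        except StopIteration:
            self.raise_stop_iteration()  # always raises StopIteration
        self.stats_parsed_items += 1
        return item

    def raise_stop_iteration(self) -> None:
        # Record an alert only when the stream ended without a single item.
        if self.stats_parsed_items == 0:
            self.alerts.append("zero items")
        raise StopIteration


busy = CountingIterator(["a", "b"])
busy_items = list(busy)

empty = CountingIterator([])
empty_items = list(empty)
```

Routing the end of iteration through a single method gives subclasses one place to attach statistics or alerting.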

### Step 5: Run with Custom Filtering
To use filtering and alerting with your subclass:

```python
url = "https://example.com/large-feed.xml"
separator_tag = "item"
item_filter = ItemFilter(
    attrib="pubDate",
    fmt="%a, %d %b %Y %H:%M:%S %z",
    max_item_age_in_days=7
)

interpreter = CustomStreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    item_filter=item_filter,
    buffer_size=1024 * 128,
)

for item in interpreter:
    print(item)  # Process each filtered item as a dictionary
```

This example sets a `pubDate` filter that drops items older than 7 days. The `CustomStreamInterpreter` also triggers an alert if no items are parsed within the set runtime.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "xmlstreamer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "xml, streaming, parser, lightweight, http, https, ftp",
    "author": null,
    "author_email": "\"Carlos A. Planch\u00f3n\" <carlosandresplanchonprestes@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/19/98/2398a60af9e231b86dfd12596bfb96ae5a8e4babb4c99aa5615483c529ae/xmlstreamer-0.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Python library to parse XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory, supporting decompression, encoding detection, tag-based itemization, and optional filtering.",
    "version": "0.9",
    "project_urls": {
        "repository": "https://github.com/carlosplanchon/xmlstreamer.git"
    },
    "split_keywords": [
        "xml",
        " streaming",
        " parser",
        " lightweight",
        " http",
        " https",
        " ftp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a160c8df04868bec20793f39900a9ef693633c07547a702d0db2bb295afae6d4",
                "md5": "0c5763e58fdecdbed032bfc1fc198b09",
                "sha256": "90b7adc5c5b9f7f446032c78868b80af8ca4602381065022eaa2bb3ee1ce932b"
            },
            "downloads": -1,
            "filename": "xmlstreamer-0.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0c5763e58fdecdbed032bfc1fc198b09",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 8697,
            "upload_time": "2024-11-08T01:00:17",
            "upload_time_iso_8601": "2024-11-08T01:00:17.541228Z",
            "url": "https://files.pythonhosted.org/packages/a1/60/c8df04868bec20793f39900a9ef693633c07547a702d0db2bb295afae6d4/xmlstreamer-0.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "19982398a60af9e231b86dfd12596bfb96ae5a8e4babb4c99aa5615483c529ae",
                "md5": "3f556f1748c0b66f369957ae5421bb2b",
                "sha256": "952d6f25023f442f250338450ab57b8b5a9b00cee8dd44b33e430451a2e0179a"
            },
            "downloads": -1,
            "filename": "xmlstreamer-0.9.tar.gz",
            "has_sig": false,
            "md5_digest": "3f556f1748c0b66f369957ae5421bb2b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 10550,
            "upload_time": "2024-11-08T01:00:19",
            "upload_time_iso_8601": "2024-11-08T01:00:19.682025Z",
            "url": "https://files.pythonhosted.org/packages/19/98/2398a60af9e231b86dfd12596bfb96ae5a8e4babb4c99aa5615483c529ae/xmlstreamer-0.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-08 01:00:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "carlosplanchon",
    "github_project": "xmlstreamer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "defusedxml",
            "specs": [
                [
                    "==",
                    "0.7.1"
                ]
            ]
        },
        {
            "name": "chardet",
            "specs": [
                [
                    "==",
                    "5.2.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    "==",
                    "2.28.0"
                ]
            ]
        },
        {
            "name": "attrs",
            "specs": [
                [
                    "==",
                    "23.2.0"
                ]
            ]
        },
        {
            "name": "attrs-strict",
            "specs": [
                [
                    "==",
                    "1.0.1"
                ]
            ]
        }
    ],
    "lcname": "xmlstreamer"
}
        