# xmlstreamer
**xmlstreamer** is a Python library designed for efficient, memory-friendly streaming and parsing of large XML feeds from various sources, including compressed formats. It supports decompression, character encoding detection, tag-based itemization, and optional item filtering, making it ideal for handling real-time XML data feeds, large datasets, and complex structured content.
## Features
- **Streamed Data Parsing**: Parses XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory.
- **GZIP Decompression Support**: Seamlessly handles GZIP-compressed XML feeds for bandwidth efficiency.
- **Encoding Detection**: Automatically detects and decodes character encodings, ensuring compatibility with varied XML data formats.
- **Customizable Item Tokenization**: Uses SAX-based parsing with user-defined tags to parse only relevant items from XML streams, reducing parsing overhead.
- **Configurable Runtime and Buffering**: Includes configurable buffer sizes and runtime limits, allowing you to tailor performance to your application’s needs.
- **Flexible Filtering**: Supports custom item filtering based on tag attributes and content, ideal for targeted data extraction.
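The tag-based itemization idea can be illustrated with the standard library alone. The sketch below is not xmlstreamer's implementation (the README describes SAX-based parsing); it swaps in `xml.etree.ElementTree.XMLPullParser`, and the helper name `iter_items` is hypothetical. It only shows how a document fed in small chunks can be turned into one dictionary per separator tag.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator


def iter_items(chunks: Iterable[bytes], separator_tag: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <separator_tag> element as chunks arrive."""
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == separator_tag:
                # Flatten the item's direct children into {tag: text}.
                yield {child.tag: (child.text or "").strip() for child in elem}
                # Drop the item's children and text once it has been emitted.
                elem.clear()


feed = b"<rss><item><title>A</title></item><item><title>B</title></item></rss>"
# Simulate a network stream by slicing the document into 16-byte chunks.
chunks = (feed[i:i + 16] for i in range(0, len(feed), 16))
items = list(iter_items(chunks, "item"))
print(items)
```

For untrusted feeds, a hardened parser is the safer choice; the package metadata lists `defusedxml` among its requirements.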
## Example Use Cases
- Parsing large RSS/Atom feeds in real time, with item-by-item streaming and filtering.
- Handling compressed XML datasets for applications in web scraping, data aggregation, and news syndication.
- Memory-efficient parsing of large or continuous XML data streams without requiring the entire document in memory.
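For the compressed-feed use case, incremental GZIP decompression might look like the following. This is a minimal sketch built on the standard `zlib` module, not a description of how xmlstreamer decompresses internally; `iter_gunzip` is a hypothetical helper name.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def iter_gunzip(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress a GZIP stream chunk by chunk."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


original = b"<rss>" + b"<item><title>x</title></item>" * 1000 + b"</rss>"
compressed = gzip.compress(original)
# Feed the compressed bytes in 64-byte pieces, as a network read loop would.
chunks = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
restored = b"".join(iter_gunzip(chunks))
print(len(compressed), len(restored), restored == original)
```

Only one compressed chunk and its decompressed output need to be held at a time, which is the property that makes streaming worthwhile for large feeds.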
## Installation
To install `xmlstreamer`, use pip:
```bash
pip install xmlstreamer
```
## Usage
```python
from xmlstreamer import StreamInterpreter
import pprint
url = "https://example.com/large-feed.xml"
separator_tag = "item"
interpreter = StreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    buffer_size=1024 * 128,
    max_running_time=600  # 10 minutes
)
for item in interpreter:
    pprint.pprint(item)  # Process each parsed item as a dictionary
```
Define custom filters, encoding mappings, or buffer sizes as needed for optimal performance.
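The `max_running_time` argument in the usage example caps how long the interpreter keeps running. How xmlstreamer enforces that limit is not described here; the sketch below only illustrates the general idea with a hypothetical wrapper, `limit_running_time`, around any iterable, plus a made-up `slow_source` generator standing in for a feed.

```python
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def limit_running_time(items: Iterable[T], max_running_time: float) -> Iterator[T]:
    """Yield from items until max_running_time seconds have elapsed."""
    deadline = time.monotonic() + max_running_time
    for item in items:
        if time.monotonic() >= deadline:
            break  # Budget exhausted: stop before handing out another item.
        yield item


def slow_source() -> Iterator[int]:
    """Hypothetical endless feed that produces one item every 10 ms."""
    n = 0
    while True:
        time.sleep(0.01)
        n += 1
        yield n


received = list(limit_running_time(slow_source(), max_running_time=0.1))
print(f"received {len(received)} items before the limit")
```

A monotonic clock is used so that system clock adjustments cannot shorten or extend the budget.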
## Filtering Usage
To enable item filtering and alerts, create a subclass of `xmlstreamer.StreamInterpreter` with custom methods for filtering and alerting. Below is an example of setting up an item filter based on date and creating a zero-items alert.
### Step 1: Define an `ItemFilter`
The `ItemFilter` class specifies which items to keep based on date filtering criteria:
```python
import attrs
from attrs_strict import type_validator
from typing import Optional
@attrs.define
class ItemFilter:
    attrib: str = attrs.field(
        kw_only=True,
        validator=type_validator()
    )
    fmt: Optional[str] = attrs.field(
        validator=type_validator(),
        default=None
    )
    max_item_age_in_days: int = attrs.field(
        kw_only=True,
        validator=type_validator()
    )
```
- `attrib`: The XML tag or attribute to filter by.
- `fmt`: Optional date format for parsing dates within the attribute. If not provided, `dateparser` is used for parsing.
- `max_item_age_in_days`: The maximum allowable age of items, in days.
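To see what construction-time type validation buys you, here is a standard-library stand-in. It is an illustration only: `ItemFilterSketch` is a hypothetical name, it uses `dataclasses` rather than `attrs` and `attrs_strict`, and the exception type raised by the real validators may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemFilterSketch:
    """Standard-library stand-in for the attrs-based ItemFilter (illustration only)."""
    attrib: str
    max_item_age_in_days: int
    fmt: Optional[str] = None

    def __post_init__(self) -> None:
        # Mimic a type validator: reject wrong types at construction time.
        if not isinstance(self.attrib, str):
            raise TypeError("attrib must be a str")
        if not isinstance(self.max_item_age_in_days, int):
            raise TypeError("max_item_age_in_days must be an int")
        if self.fmt is not None and not isinstance(self.fmt, str):
            raise TypeError("fmt must be a str or None")


ok = ItemFilterSketch(attrib="pubDate", max_item_age_in_days=7)
try:
    # A string where an int is expected is caught immediately.
    ItemFilterSketch(attrib="pubDate", max_item_age_in_days="7")
    rejected = False
except TypeError:
    rejected = True
print(ok, rejected)
```

Failing early keeps a misconfigured filter from silently dropping or keeping every item later in the stream.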
### Step 2: Define Helper Functions for Parsing and Filtering Dates
Functions to parse dates and evaluate if an item should be kept based on the specified date limit:
```python
from datetime import datetime, timedelta
from typing import Optional
import dateparser
def parse_date(string: str, fmt: str) -> Optional[datetime]:
    try:
        return datetime.strptime(string, fmt)
    except ValueError:
        return None
def eval_keep_date_item(string: str, fmt: Optional[str], limit_date: datetime) -> Optional[bool]:
    string = string.strip()
    # Use the explicit format when given; otherwise let dateparser guess.
    parsed_date = parse_date(string, fmt) if fmt else dateparser.parse(string)
    # None means the date could not be parsed at all.
    return parsed_date.timestamp() > limit_date.timestamp() if parsed_date else None
```
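- `parse_date`: Parses the date string with the specified format and returns `None` if the string does not match.
- `eval_keep_date_item`: Returns `True` if the item's date is newer than the limit date, `False` if it is older, and `None` if the date cannot be parsed.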
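The three possible outcomes are easier to see with concrete inputs. The sketch below re-creates the helpers using only the standard library so it runs without extra dependencies: `email.utils.parsedate_to_datetime` is swapped in for `dateparser` as the no-format fallback (it only understands RFC 2822 style dates), and the names `parse_with_fallback` and `is_newer_than` are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_with_fallback(string: str, fmt: Optional[str]) -> Optional[datetime]:
    """Parse with an explicit format, or fall back to RFC 2822 parsing."""
    try:
        if fmt:
            return datetime.strptime(string, fmt)
        return parsedate_to_datetime(string)
    except (TypeError, ValueError):
        return None


def is_newer_than(string: str, fmt: Optional[str], limit_date: datetime) -> Optional[bool]:
    """True/False for parsable dates, None when the date cannot be parsed."""
    parsed = parse_with_fallback(string.strip(), fmt)
    return parsed.timestamp() > limit_date.timestamp() if parsed else None


RSS_FMT = "%a, %d %b %Y %H:%M:%S %z"
# Fixed cutoff so the results are reproducible.
limit = datetime(2024, 11, 1, tzinfo=timezone.utc)

recent = is_newer_than("Fri, 08 Nov 2024 01:00:17 +0000", RSS_FMT, limit)
old = is_newer_than("Tue, 01 Oct 2024 12:00:00 +0000", RSS_FMT, limit)
broken = is_newer_than("not a date", RSS_FMT, limit)
no_fmt = is_newer_than(" Fri, 08 Nov 2024 01:00:17 +0000 ", None, limit)
print(recent, old, broken, no_fmt)
```

Comparing timestamps rather than `datetime` objects sidesteps the error Python raises when a timezone-aware value is compared with a naive one.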
### Step 3: Define the Filtering Function
The `filter_parsed_item` function applies the date filter to each parsed item:
```python
from datetime import datetime, timedelta
from typing import Optional, Dict, Any
# ParsedItem is assumed to be the parsed-item type provided by xmlstreamer;
# the annotations are quoted so this block runs without importing it.
def filter_parsed_item(ITEM_FILTER: ItemFilter, parsed_item: "ParsedItem") -> Optional["ParsedItem"]:
    attrib = ITEM_FILTER.attrib
    fmt = ITEM_FILTER.fmt
    max_item_age_in_days = ITEM_FILTER.max_item_age_in_days
    limit_date = datetime.now() - timedelta(days=max_item_age_in_days)
    item_content: Dict[str, Any] = parsed_item.parsed_content
    if attrib in item_content:
        dt = item_content[attrib]
        keep_item = eval_keep_date_item(dt, fmt, limit_date) if isinstance(dt, str) else None
        if keep_item is None:
            # Unparsable date: blank the field but keep the item.
            item_content[attrib] = None
            return parsed_item
        elif keep_item:
            return parsed_item
    # Item is too old, or the attribute is missing: drop it.
    return None
```
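The function above has four distinct outcomes, which the following standalone sketch walks through. It is a simplified re-creation for illustration: `FakeParsedItem` is a hypothetical stand-in that exposes only `parsed_content`, `filter_by_date` is a hypothetical name, and the limit date is passed in explicitly so the results do not depend on the current time.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class FakeParsedItem:
    """Hypothetical stand-in exposing only the parsed_content attribute."""
    parsed_content: Dict[str, Any] = field(default_factory=dict)


def filter_by_date(item: FakeParsedItem, attrib: str, fmt: str,
                   limit_date: datetime) -> Optional[FakeParsedItem]:
    content = item.parsed_content
    if attrib not in content:
        return None  # No date field at all: the item is dropped.
    raw = content[attrib]
    try:
        parsed = datetime.strptime(raw.strip(), fmt) if isinstance(raw, str) else None
    except ValueError:
        parsed = None
    if parsed is None:
        content[attrib] = None  # Unparsable date: keep the item but blank the field.
        return item
    # Parsable date: keep only items newer than the limit.
    return item if parsed.timestamp() > limit_date.timestamp() else None


FMT = "%a, %d %b %Y %H:%M:%S %z"
LIMIT = datetime(2024, 11, 1, tzinfo=timezone.utc)

recent = filter_by_date(FakeParsedItem({"pubDate": "Fri, 08 Nov 2024 01:00:17 +0000"}), "pubDate", FMT, LIMIT)
old = filter_by_date(FakeParsedItem({"pubDate": "Tue, 01 Oct 2024 12:00:00 +0000"}), "pubDate", FMT, LIMIT)
garbled = filter_by_date(FakeParsedItem({"pubDate": "yesterday-ish"}), "pubDate", FMT, LIMIT)
undated = filter_by_date(FakeParsedItem({"title": "no date"}), "pubDate", FMT, LIMIT)
print(recent, old, garbled, undated)
```

Note the asymmetry: an item with an unparsable date is kept (with the field set to `None`), while an item that lacks the field entirely is dropped.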
### Step 4: Extend StreamInterpreter for Filtering and Alerts
Create a subclass that enables filtering with `filter_parsed_item` and alerts if no items are parsed.
```python
from xmlstreamer import StreamInterpreter
from datetime import datetime
from pathlib import Path
import inspect
class CustomStreamInterpreter(StreamInterpreter):
    def __init__(self, **kwargs):
        kwargs["max_running_time"] = 3600  # Set max runtime to 1 hour
        super().__init__(**kwargs)
        stack = inspect.stack()
        fname = stack[1].filename
        fname_path = Path(fname)
        self.called_from = fname_path.stem
        self.alerts_enabled = True
        self.filter_parsed_item_func = filter_parsed_item
    def raise_stop_iteration(self):
        print(f"XMLSTREAMER STATS ITEMS PARSED: {self.stats_parsed_items}")
        if self.stats_parsed_items == 0:
            self.raise_zero_items_alert()
        raise StopIteration
    def raise_zero_items_alert(self):
        print("--- ZERO ITEMS ALERT ---")
        actual_date = datetime.now()
        running_time = actual_date - self.start_date
        print(f"Running time exceeded with no items parsed: {running_time}")
```
- `CustomStreamInterpreter`: Initializes with `filter_parsed_item_func` for item filtering.
- `raise_zero_items_alert`: Triggered if no items are parsed, printing a warning.
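The override pattern used in this step can be tried without a network connection. In the sketch below, `MiniInterpreter` is a hypothetical base class written for illustration; only the hook name `raise_stop_iteration` and the counter `stats_parsed_items` mirror the names used above, and nothing here reflects how `StreamInterpreter` is actually implemented.

```python
from typing import Iterator, List


class MiniInterpreter:
    """Hypothetical base class; only the hook and counter names mirror the README."""

    def __init__(self, items: List[str]) -> None:
        self._source = iter(items)
        self.stats_parsed_items = 0

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        try:
            item = next(self._source)
        except StopIteration:
            self.raise_stop_iteration()  # Single exit point that subclasses can override.
            raise  # Only reached if an override forgets to raise StopIteration.
        self.stats_parsed_items += 1
        return item

    def raise_stop_iteration(self) -> None:
        raise StopIteration


class AlertingInterpreter(MiniInterpreter):
    def __init__(self, items: List[str]) -> None:
        super().__init__(items)
        self.alerts: List[str] = []

    def raise_stop_iteration(self) -> None:
        # Record an alert when the stream ends without a single item.
        if self.stats_parsed_items == 0:
            self.alerts.append("zero items parsed")
        raise StopIteration


empty = AlertingInterpreter([])
empty_items = list(empty)
full = AlertingInterpreter(["a", "b"])
full_items = list(full)
print(empty_items, empty.alerts, full_items, full.alerts)
```

Routing every end-of-stream through one overridable method gives subclasses a single place to add statistics or alerts without touching the iteration logic.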
### Step 5: Run with Custom Filtering
To use filtering and alerting with your subclass:
```python
url = "https://example.com/large-feed.xml"
separator_tag = "item"
item_filter = ItemFilter(
    attrib="pubDate",
    fmt="%a, %d %b %Y %H:%M:%S %z",
    max_item_age_in_days=7
)
interpreter = CustomStreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    item_filter=item_filter,
    buffer_size=1024 * 128,
)
for item in interpreter:
    print(item)  # Process each filtered item as a dictionary
```
This example demonstrates setting a `pubDate` filter that removes items older than 7 days. The `CustomStreamInterpreter` will also trigger an alert if no items are parsed within the set runtime.
Raw data
{
"_id": null,
"home_page": null,
"name": "xmlstreamer",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "xml, streaming, parser, lightweight, http, https, ftp",
"author": null,
"author_email": "\"Carlos A. Planch\u00f3n\" <carlosandresplanchonprestes@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/19/98/2398a60af9e231b86dfd12596bfb96ae5a8e4babb4c99aa5615483c529ae/xmlstreamer-0.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "Python library to parse XML data directly from HTTP, HTTPS, or FTP sources without loading the entire file into memory, supporting decompression, encoding detection, tag-based itemization, and optional filtering.",
"version": "0.9",
"project_urls": {
"repository": "https://github.com/carlosplanchon/xmlstreamer.git"
},
"split_keywords": [
"xml",
" streaming",
" parser",
" lightweight",
" http",
" https",
" ftp"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a160c8df04868bec20793f39900a9ef693633c07547a702d0db2bb295afae6d4",
"md5": "0c5763e58fdecdbed032bfc1fc198b09",
"sha256": "90b7adc5c5b9f7f446032c78868b80af8ca4602381065022eaa2bb3ee1ce932b"
},
"downloads": -1,
"filename": "xmlstreamer-0.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0c5763e58fdecdbed032bfc1fc198b09",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 8697,
"upload_time": "2024-11-08T01:00:17",
"upload_time_iso_8601": "2024-11-08T01:00:17.541228Z",
"url": "https://files.pythonhosted.org/packages/a1/60/c8df04868bec20793f39900a9ef693633c07547a702d0db2bb295afae6d4/xmlstreamer-0.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "19982398a60af9e231b86dfd12596bfb96ae5a8e4babb4c99aa5615483c529ae",
"md5": "3f556f1748c0b66f369957ae5421bb2b",
"sha256": "952d6f25023f442f250338450ab57b8b5a9b00cee8dd44b33e430451a2e0179a"
},
"downloads": -1,
"filename": "xmlstreamer-0.9.tar.gz",
"has_sig": false,
"md5_digest": "3f556f1748c0b66f369957ae5421bb2b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 10550,
"upload_time": "2024-11-08T01:00:19",
"upload_time_iso_8601": "2024-11-08T01:00:19.682025Z",
"url": "https://files.pythonhosted.org/packages/19/98/2398a60af9e231b86dfd12596bfb96ae5a8e4babb4c99aa5615483c529ae/xmlstreamer-0.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-08 01:00:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "carlosplanchon",
"github_project": "xmlstreamer",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "defusedxml",
"specs": [
[
"==",
"0.7.1"
]
]
},
{
"name": "chardet",
"specs": [
[
"==",
"5.2.0"
]
]
},
{
"name": "requests",
"specs": [
[
"==",
"2.28.0"
]
]
},
{
"name": "attrs",
"specs": [
[
"==",
"23.2.0"
]
]
},
{
"name": "attrs-strict",
"specs": [
[
"==",
"1.0.1"
]
]
}
],
"lcname": "xmlstreamer"
}