rss-parser

Name	rss-parser
Version	2.1.0
home_page	None
Summary	Typed pythonic RSS/Atom parser
upload_time	2024-09-26 10:58:59
maintainer	None
docs_url	None
author	dhvcc
requires_python	<4.0,>=3.9
license	GPL-3.0
keywords	python python3 cli rss parser gplv3 typed typed-python
requirements	No requirements were recorded.

# Rss parser

[![Downloads](https://pepy.tech/badge/rss-parser)](https://pepy.tech/project/rss-parser)
[![Downloads](https://pepy.tech/badge/rss-parser/month)](https://pepy.tech/project/rss-parser)
[![Downloads](https://pepy.tech/badge/rss-parser/week)](https://pepy.tech/project/rss-parser)

[![PyPI version](https://img.shields.io/pypi/v/rss-parser)](https://pypi.org/project/rss-parser)
[![Python versions](https://img.shields.io/pypi/pyversions/rss-parser)](https://pypi.org/project/rss-parser)
[![Wheel status](https://img.shields.io/pypi/wheel/rss-parser)](https://pypi.org/project/rss-parser)
[![License](https://img.shields.io/pypi/l/rss-parser?color=success)](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)

![Docs](https://github.com/dhvcc/rss-parser/actions/workflows/pages/pages-build-deployment/badge.svg)
![CI](https://github.com/dhvcc/rss-parser/actions/workflows/ci.yml/badge.svg?branch=master)
![PyPi publish](https://github.com/dhvcc/rss-parser/actions/workflows/publish_to_pypi.yml/badge.svg)

## About

`rss-parser` is a typed Python module for parsing RSS and Atom feeds, built on [pydantic](https://github.com/pydantic/pydantic) and [xmltodict](https://github.com/martinblech/xmltodict).

## Installation

```bash
pip install rss-parser
```

or

```bash
git clone https://github.com/dhvcc/rss-parser.git
cd rss-parser
poetry build
pip install dist/*.whl
```

## V1 -> V2 migration
- The `Parser` class was renamed to `RSSParser`.
- Models for RSS-specific schemas were moved from `rss_parser.models` to `rss_parser.models.rss`. Generic types are unchanged.
- Date parsing now uses pydantic's `validator` instead of `email.utils`, so dates that previously fell back to `str` are more reliably parsed into `datetime` objects.
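
To illustrate why parsed dates matter: RSS dates such as `pubDate` are RFC 822 strings, and a `datetime` is much easier to compare and format than raw text. The sketch below uses only the standard library's `email.utils`, which the note above mentions. It is not the v2 implementation, and the sample date string is made up.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# A made-up RFC 822 date string of the kind found in <pubDate>
raw = "Thu, 26 Sep 2024 10:58:59 +0000"

parsed = parsedate_to_datetime(raw)

# The result is a timezone-aware datetime rather than a str
assert isinstance(parsed, datetime)
assert parsed == datetime(2024, 9, 26, 10, 58, 59, tzinfo=timezone.utc)

print(parsed.isoformat())
# 2024-09-26T10:58:59+00:00
```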

## Usage

### Quickstart

**NOTE: For parsing Atom, use `AtomParser`**

```python
from rss_parser import RSSParser
from requests import get  # noqa

rss_url = "https://rss.art19.com/apology-line"
response = get(rss_url)

rss = RSSParser.parse(response.text)

# Print out RSS metadata
print("Language", rss.channel.language)
print("RSS", rss.version)

# Iteratively print feed items
for item in rss.channel.items:
    print(item.title)
    print(item.description[:50])

# Language en
# RSS 2.0
# Wondery Presents - Flipping The Bird: Elon vs Twitter
# <p>When Elon Musk posted a video of himself arrivi
# Introducing: The Apology Line
# <p>If you could call a number and say you’re sorry
```

Note that the description still contains `<p>` tags. This is because the feed wraps it in a [CDATA](https://www.w3resource.com/xml/CDATA-sections.php) section, like so:

```
<![CDATA[<p>If you could call ...</p>]]>
```
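
If you only need the plain text of such a description, one option is to strip the markup after parsing. The following is a minimal sketch using the standard library's `html.parser`. The `strip_html` helper is hypothetical and not part of `rss-parser`, and the sample string is made up.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Hypothetical helper: return the fragment with its tags removed."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return "".join(extractor.parts).strip()


description = "<p>If you could call a number and say <b>sorry</b></p>"
print(strip_html(description))
# If you could call a number and say sorry
```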

### Overriding schema

To customize the schema or provide your own, pass it to the parser via the `schema` keyword argument.

```python
from rss_parser import RSSParser
from rss_parser.models import XMLBaseModel
from rss_parser.models.rss import RSS
from rss_parser.models.types import Tag


class CustomSchema(RSS, XMLBaseModel):
    channel: None = None  # Removing previous channel field
    custom: Tag[str]


with open("tests/samples/custom.xml") as f:
    data = f.read()

rss = RSSParser.parse(data, schema=CustomSchema)

print("RSS", rss.version)
print("Custom", rss.custom)

# RSS 2.0
# Custom Custom tag data
```

### xmltodict

This library uses [xmltodict](https://github.com/martinblech/xmltodict) to parse XML data. You can see the detailed documentation [here](https://github.com/martinblech/xmltodict#xmltodict)

The basic thing you should know is that your data is processed into dictionaries

For example, this data

```xml
<tag>content</tag>
```

will result in the following

```python
{
    "tag": "content"
}
```

*But* when a tag has attributes, its content will also be a dictionary

```xml
<tag attr="1" data-value="data">data</tag>
```

Turns into

```python
{
    "tag": {
        "@attr": "1",
        "@data-value": "data",
        "#text": "data"
    }
}
```
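
Code that reads such dictionaries directly has to cope with both shapes. As a rough sketch, the hypothetical `split_tag` helper below (not part of `rss-parser` or xmltodict) returns the text and the attributes for either form, assuming the `@` and `#text` key conventions shown above.

```python
from typing import Any, Dict, Tuple, Union


def split_tag(value: Union[str, Dict[str, Any], None]) -> Tuple[Any, Dict[str, Any]]:
    """Hypothetical helper: return (text, attributes) for either shape."""
    if isinstance(value, dict):
        # Keys starting with "@" are attributes; "#text" holds the content
        attributes = {k[1:]: v for k, v in value.items() if k.startswith("@")}
        return value.get("#text"), attributes
    # A tag without attributes is just its text
    return value, {}


plain = {"tag": "content"}
with_attrs = {"tag": {"@attr": "1", "@data-value": "data", "#text": "data"}}

print(split_tag(plain["tag"]))
# ('content', {})
print(split_tag(with_attrs["tag"]))
# ('data', {'attr': '1', 'data-value': 'data'})
```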

Repeated child tags with the same name are collected into a list

```xml
<div>
    <tag>content</tag>
    <tag>content2</tag>
</div>
```

Results in a list under the repeated tag's key

```python
{
    "div": {
        "tag": ["content", "content2"]
    }
}
```

If you don't want to handle both cases and would rather **always** parse something as a list, use `rss_parser.models.types.only_list.OnlyList`, as is done in `Channel`:

```python
from typing import Optional

from rss_parser.models.rss.item import Item
from rss_parser.models.types.only_list import OnlyList
from rss_parser.models.types.tag import Tag
from rss_parser.pydantic_proxy import import_v1_pydantic

pydantic = import_v1_pydantic()
...


class OptionalChannelElementsMixin(...):
    ...
    items: Optional[OnlyList[Tag[Item]]] = pydantic.Field(alias="item", default=[])
```
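
The idea behind always getting a list can be shown without the library. The sketch below is a plain-Python illustration with a hypothetical `ensure_list` helper. It is not how `OnlyList` is implemented, and the sample dictionaries are made up to mimic a feed with one, several, or no items.

```python
from typing import Any, List


def ensure_list(value: Any) -> List[Any]:
    """Hypothetical helper: wrap a single value so callers always get a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


one_item = {"item": {"title": "Only entry"}}
two_items = {"item": [{"title": "First"}, {"title": "Second"}]}

for parsed in (one_item, two_items, {}):
    titles = [entry["title"] for entry in ensure_list(parsed.get("item"))]
    print(titles)

# ['Only entry']
# ['First', 'Second']
# []
```

With this normalization the loop body stays the same regardless of how many items the feed contained.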

### Tag field

`Tag` is a generic field that accepts a tag either as raw data or as the dictionary produced when the tag has attributes.

Example

```python
from rss_parser.models import XMLBaseModel
from rss_parser.models.types.tag import Tag


class Model(XMLBaseModel):
    width: Tag[int]
    category: Tag[str]


m = Model(
    width=48,
    category={"@someAttribute": "https://example.com", "#text": "valid string"},
)

# Content value is an integer, as per the generic type
assert m.width.content == 48

# The field is a Tag wrapper and its content has the generic's type
assert isinstance(m.width, Tag) and isinstance(m.width.content, int)

# The attributes are empty by default, but are populated when provided
assert m.width.attributes == {}

# Note that the @ symbol is trimmed from the beginning and the name is converted to snake_case
assert m.category.attributes == {'some_attribute': 'https://example.com'}
```
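
The attribute-name conversion shown in the last assertion can be approximated in a few lines. This is only a sketch with a hypothetical `normalize_attribute` helper. The library's actual rules (for example for hyphenated names or acronyms) may differ.

```python
import re


def normalize_attribute(name: str) -> str:
    """Hypothetical helper: '@someAttribute' -> 'some_attribute'."""
    # Drop the leading "@" that marks an attribute key
    name = name.lstrip("@")
    # Insert an underscore before each inner capital letter, then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


raw = {"@someAttribute": "https://example.com", "#text": "valid string"}
attributes = {
    normalize_attribute(key): value
    for key, value in raw.items()
    if key.startswith("@")
}

print(attributes)
# {'some_attribute': 'https://example.com'}
```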

## Contributing

Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.

Install dependencies with `poetry install`. If you don't have Poetry yet, install it first with `pip install poetry`.

Using `pre-commit` is highly recommended. To install the hooks, run:

```bash
poetry run pre-commit install -t=pre-commit -t=pre-push
```

## License

[GPLv3](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "rss-parser",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "python, python3, cli, rss, parser, gplv3, typed, typed-python",
    "author": "dhvcc",
    "author_email": "1337kwiz@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/72/f1/8853d9808f68b4a34a316977f0082906b32e8a2313b6fb3935155fb055a1/rss_parser-2.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0",
    "summary": "Typed pythonic RSS/Atom parser",
    "version": "2.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/dhvcc/rss-parser/issues",
        "Homepage": "https://dhvcc.github.io/rss-parser",
        "Source": "https://github.com/dhvcc/rss-parser"
    },
    "split_keywords": [
        "python",
        " python3",
        " cli",
        " rss",
        " parser",
        " gplv3",
        " typed",
        " typed-python"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d16643fb6a0a1b3be3974e03b3d1182c066cddb6efedd7b3b23609597f962631",
                "md5": "ec36ce4be5bbdbe6a1213e857ea7b7e4",
                "sha256": "193b76f3292657faf85dd11dfe823b9007551fb7722d4363316870e32aff5ced"
            },
            "downloads": -1,
            "filename": "rss_parser-2.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ec36ce4be5bbdbe6a1213e857ea7b7e4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 29816,
            "upload_time": "2024-09-26T10:58:58",
            "upload_time_iso_8601": "2024-09-26T10:58:58.059636Z",
            "url": "https://files.pythonhosted.org/packages/d1/66/43fb6a0a1b3be3974e03b3d1182c066cddb6efedd7b3b23609597f962631/rss_parser-2.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "72f18853d9808f68b4a34a316977f0082906b32e8a2313b6fb3935155fb055a1",
                "md5": "132a9fc810304d647ecd970c91ae97be",
                "sha256": "4a1eb0f69442b9b8f3b8343c053c3a772c8e9a5c8a6a969edadc03800f30b47e"
            },
            "downloads": -1,
            "filename": "rss_parser-2.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "132a9fc810304d647ecd970c91ae97be",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 25511,
            "upload_time": "2024-09-26T10:58:59",
            "upload_time_iso_8601": "2024-09-26T10:58:59.582059Z",
            "url": "https://files.pythonhosted.org/packages/72/f1/8853d9808f68b4a34a316977f0082906b32e8a2313b6fb3935155fb055a1/rss_parser-2.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-26 10:58:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dhvcc",
    "github_project": "rss-parser",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "rss-parser"
}
