rss-parser


- Name: rss-parser
- Version: 2.0.0
- Summary: Typed pythonic RSS/Atom parser
- Upload time: 2024-02-22 18:15:40
- Author: dhvcc
- Requires Python: >=3.9,<4.0
- License: GPL-3.0
- Keywords: python, python3, cli, rss, parser, gplv3, typed, typed-python

# Rss parser

[![Downloads](https://pepy.tech/badge/rss-parser)](https://pepy.tech/project/rss-parser)
[![Downloads](https://pepy.tech/badge/rss-parser/month)](https://pepy.tech/project/rss-parser)
[![Downloads](https://pepy.tech/badge/rss-parser/week)](https://pepy.tech/project/rss-parser)

[![PyPI version](https://img.shields.io/pypi/v/rss-parser)](https://pypi.org/project/rss-parser)
[![Python versions](https://img.shields.io/pypi/pyversions/rss-parser)](https://pypi.org/project/rss-parser)
[![Wheel status](https://img.shields.io/pypi/wheel/rss-parser)](https://pypi.org/project/rss-parser)
[![License](https://img.shields.io/pypi/l/rss-parser?color=success)](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)

![Docs](https://github.com/dhvcc/rss-parser/actions/workflows/pages/pages-build-deployment/badge.svg)
![CI](https://github.com/dhvcc/rss-parser/actions/workflows/ci.yml/badge.svg?branch=master)
![PyPi publish](https://github.com/dhvcc/rss-parser/actions/workflows/publish_to_pypi.yml/badge.svg)

## About

`rss-parser` is a typed Python module for parsing RSS and Atom feeds, built using [pydantic](https://github.com/pydantic/pydantic) and [xmltodict](https://github.com/martinblech/xmltodict).

## Installation

```bash
pip install rss-parser
```

or

```bash
git clone https://github.com/dhvcc/rss-parser.git
cd rss-parser
poetry build
pip install dist/*.whl
```

## V1 -> V2 migration

- The `Parser` class was renamed to `RSSParser`.
- Models for RSS-specific schemas were moved from `rss_parser.models` to `rss_parser.models.rss`. Generic types were not touched.
- Date parsing now uses pydantic's `validator` instead of `email.utils`, so fields that previously defaulted to `str` are more reliably parsed into datetimes.
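
To make the date change more concrete: RSS date fields such as `pubDate` are usually RFC 822-style strings. The sketch below uses only the standard library's `email.utils.parsedate_to_datetime` to show the kind of string-to-datetime conversion involved. It is not the library's own code, and the sample date string is an assumption made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical pubDate value in the RFC 822 style that RSS feeds commonly use
raw_pub_date = "Thu, 22 Feb 2024 18:15:40 +0000"

parsed = parsedate_to_datetime(raw_pub_date)

# The string becomes a timezone-aware datetime that can be compared and formatted
assert isinstance(parsed, datetime)
assert parsed == datetime(2024, 2, 22, 18, 15, 40, tzinfo=timezone.utc)
print(parsed.isoformat())
```

Code that treated these fields as plain strings under v1 may need a similar adjustment when it starts receiving datetimes.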

## Usage

### Quickstart

**NOTE: For parsing Atom, use `AtomParser`**

```python
from rss_parser import RSSParser
from requests import get  # noqa

rss_url = "https://rss.art19.com/apology-line"
response = get(rss_url)

rss = RSSParser.parse(response.text)

# Print out rss meta data
print("Language", rss.channel.language)
print("RSS", rss.version)

# Iteratively print feed items
for item in rss.channel.items:
    print(item.title)
    print(item.description[:50])

# Language en
# RSS 2.0
# Wondery Presents - Flipping The Bird: Elon vs Twitter
# <p>When Elon Musk posted a video of himself arrivi
# Introducing: The Apology Line
# <p>If you could call a number and say you’re sorry
```

Note that the description still contains `<p>` tags. This is because the feed wraps the description in a [CDATA](https://www.w3resource.com/xml/CDATA-sections.php) section, like so:

```
<![CDATA[<p>If you could call ...</p>]]>
```
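
If you need plain text rather than markup, one option is to strip the tags after parsing. The following is a minimal sketch using only the standard library's `html.parser`; `strip_html` is a hypothetical helper name and is not part of `rss-parser`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return "".join(extractor.parts).strip()


print(strip_html("<p>If you could call a number and say you’re sorry</p>"))
# If you could call a number and say you’re sorry
```

In the quickstart loop, this could be applied as `strip_html(item.description)`, assuming the description is a string as shown in the output above.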

### Overriding schema

To customize the schema or provide your own, pass it via the `schema` keyword argument of the parser:

```python
from rss_parser import RSSParser
from rss_parser.models import XMLBaseModel
from rss_parser.models.rss import RSS
from rss_parser.models.types import Tag


class CustomSchema(RSS, XMLBaseModel):
    channel: None = None  # Removing previous channel field
    custom: Tag[str]


with open("tests/samples/custom.xml") as f:
    data = f.read()

rss = RSSParser.parse(data, schema=CustomSchema)

print("RSS", rss.version)
print("Custom", rss.custom)

# RSS 2.0
# Custom Custom tag data
```

### xmltodict

This library uses [xmltodict](https://github.com/martinblech/xmltodict) to parse XML data. See the [xmltodict documentation](https://github.com/martinblech/xmltodict#xmltodict) for details.

The main thing to know is that your data is converted into dictionaries.

For example, this data

```xml
<tag>content</tag>
```

will result in the following

```python
{
    "tag": "content"
}
```

*But* when a tag has attributes, its content will also be a dictionary

```xml
<tag attr="1" data-value="data">data</tag>
```

Turns into

```python
{
    "tag": {
        "@attr": "1",
        "@data-value": "data",
        "#text": "data"
    }
}
```

Multiple children with the same tag name will be put into a list

```xml
<div>
    <tag>content</tag>
    <tag>content2</tag>
</div>
```

Results in a list

```python
{
    "div": {
        "tag": ["content", "content2"]
    }
}
```

If you don't want to handle both cases and would rather **always** parse something as a list, use `rss_parser.models.types.only_list.OnlyList`, as we did in `Channel`:

```python
from typing import Optional

from rss_parser.models.rss.item import Item
from rss_parser.models.types.only_list import OnlyList
from rss_parser.models.types.tag import Tag
from rss_parser.pydantic_proxy import import_v1_pydantic

pydantic = import_v1_pydantic()
...


class OptionalChannelElementsMixin(...):
    ...
    items: Optional[OnlyList[Tag[Item]]] = pydantic.Field(alias="item", default=[])
```
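
The idea behind always-a-list handling can be sketched in plain Python. The `as_list` helper below is hypothetical and is not how `OnlyList` is implemented. It only shows the normalization that makes a single child and repeated children look the same to calling code.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Wrap a lone value in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# One <item> child: the parsed value is the child itself
single = {"item": {"title": "Only entry"}}
# Several <item> children: the parsed value is a list of them
several = {"item": [{"title": "First"}, {"title": "Second"}]}

for parsed in (single, several):
    titles = [item["title"] for item in as_list(parsed.get("item"))]
    print(titles)
```

With this kind of normalization in place, iteration code does not need to check whether a feed happened to contain one item or many.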

### Tag field

This is a generic field that handles tags given either as raw data or as a dictionary returned with attributes

Example

```python
from rss_parser.models import XMLBaseModel
from rss_parser.models.types.tag import Tag


class Model(XMLBaseModel):
    width: Tag[int]
    category: Tag[str]


m = Model(
    width=48,
    category={"@someAttribute": "https://example.com", "#text": "valid string"},
)

# Content value is an integer, as per the generic type
assert m.width.content == 48

# Both the field and its content have the expected types
assert isinstance(m.width, Tag) and isinstance(m.width.content, int)

# The attributes are empty by default
assert m.width.attributes == {}

# But they are populated when provided.
# Note that the @ symbol is trimmed from the beginning and the name is converted to snake_case
assert m.category.attributes == {'some_attribute': 'https://example.com'}
```
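
For readers who want to see the shape of that normalization without the library, here is a rough standard-library sketch. `split_tag` and `to_snake_case` are hypothetical names, and this is not the implementation of `Tag`. It only mirrors the behaviour asserted above: `#text` becomes the content, the leading `@` is dropped, and camelCase attribute names are converted to snake_case.

```python
import re
from typing import Any, Dict, Tuple


def to_snake_case(name: str) -> str:
    """Convert a camelCase name to snake_case."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def split_tag(value: Any) -> Tuple[Any, Dict[str, Any]]:
    """Split an xmltodict-style value into (content, attributes)."""
    if not isinstance(value, dict):
        # A tag without attributes is just its raw content
        return value, {}
    content = value.get("#text")
    attributes = {
        to_snake_case(key[1:]): attr
        for key, attr in value.items()
        if key.startswith("@")
    }
    return content, attributes


print(split_tag(48))
print(split_tag({"@someAttribute": "https://example.com", "#text": "valid string"}))
```

Unlike `Tag[int]`, this sketch does no type validation or coercion of the content.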

## Contributing

Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.

Install dependencies with `poetry install` (poetry itself can be installed with `pip install poetry`).

Using `pre-commit` is highly recommended. To install the hooks, run:

```bash
poetry run pre-commit install -t=pre-commit -t=pre-push
```

## License

[GPLv3](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)

            
