ultimate-sitemap-parser


:Name: ultimate-sitemap-parser
:Version: 1.5.0
:Summary: A performant library for parsing and crawling sitemaps
:Upload time: 2025-08-11 10:54:32
:Author: Linas Valiukas
:Maintainer: Freddy Heppell
:Requires Python: >=3.9
:License: GPL-3.0-or-later
:Keywords: sitemap, crawler, indexing, xml, rss, atom, google news


Ultimate Sitemap Parser
-----------------------

.. image:: https://img.shields.io/pypi/pyversions/ultimate-sitemap-parser
   :alt: PyPI - Python Version
   :target: https://github.com/GateNLP/ultimate-sitemap-parser

.. image:: https://img.shields.io/pypi/v/ultimate-sitemap-parser
   :alt: PyPI - Version
   :target: https://pypi.org/project/ultimate-sitemap-parser/

.. image:: https://img.shields.io/conda/vn/conda-forge/ultimate-sitemap-parser
   :alt: Conda Version
   :target: https://anaconda.org/conda-forge/ultimate-sitemap-parser

.. image:: https://img.shields.io/pepy/dt/ultimate-sitemap-parser
   :target: https://pepy.tech/project/ultimate-sitemap-parser
   :alt: Pepy Total Downloads


**Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.**


Features
========

- Supports all sitemap formats:

  - `XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>`_
  - `Google News sitemaps <https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap>`_ and `Image sitemaps <https://developers.google.com/search/docs/advanced/sitemaps/image-sitemaps>`_
  - `plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>`_

- Field-tested with ~1 million URLs as part of the `Media Cloud project <https://mediacloud.org/>`_
- Tolerates the most common sitemap errors
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
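
To illustrate why streaming Expat parsing keeps memory use low, here is a minimal sketch built on the standard library's ``xml.parsers.expat``. It is not USP's actual parser: the function name ``iter_locs`` and the sample document are assumptions made for illustration. The handlers keep only the text of the ``<loc>`` element currently being read, so no full document tree is ever built.

.. code:: python

    import xml.parsers.expat


    def iter_locs(chunks):
        """Yield the text of each <loc> element as soon as it has been parsed.

        ``chunks`` is any iterable of bytes, for example pieces of an HTTP body.
        """
        found = []   # completed <loc> values not yet handed to the caller
        buffer = []  # text fragments of the <loc> currently being read
        in_loc = False

        def start_element(name, attrs):
            nonlocal in_loc
            if name == "loc":
                in_loc = True
                buffer.clear()

        def character_data(data):
            # Expat may deliver one text node in several pieces.
            if in_loc:
                buffer.append(data)

        def end_element(name):
            nonlocal in_loc
            if name == "loc":
                found.append("".join(buffer).strip())
                in_loc = False

        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = start_element
        parser.CharacterDataHandler = character_data
        parser.EndElementHandler = end_element

        for chunk in chunks:
            parser.Parse(chunk, False)
            yield from found
            found.clear()
        parser.Parse(b"", True)  # tell Expat the input is complete
        yield from found


    # Hypothetical sample sitemap used only for this sketch.
    SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.example.org/</loc></url>
      <url><loc>https://www.example.org/about</loc></url>
    </urlset>"""

    for loc in iter_locs([SAMPLE[:60], SAMPLE[60:]]):
        print(loc)

Feeding the parser in chunks means a large sitemap could be processed while it is still downloading, at the cost of writing event handlers instead of navigating a tree.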


Installation
============

.. code:: sh

    pip install ultimate-sitemap-parser

or using Anaconda:

.. code:: sh

    conda install -c conda-forge ultimate-sitemap-parser
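
After installing, one way to confirm which version is present is to query the package metadata from Python. This is a small sketch using only the standard library; the helper name ``installed_version`` is an assumption for illustration and is not part of USP.

.. code:: python

    from importlib.metadata import PackageNotFoundError, version


    def installed_version(distribution):
        """Return the installed version string, or None if it is not installed."""
        try:
            return version(distribution)
        except PackageNotFoundError:
            return None


    # Prints the installed version, or None when the package is missing.
    print(installed_version("ultimate-sitemap-parser"))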


Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.example.org/')

    for page in tree.all_pages():
        print(page.url)
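
Because the loop above relies only on each page exposing a ``url`` attribute, post-processing can be written against any iterable of such objects. The helper below is a hypothetical sketch (``unique_urls`` is not part of USP) that drops repeated URLs while preserving order, in case the same URL appears in more than one sitemap. The ``SimpleNamespace`` objects are stand-ins for real page objects.

.. code:: python

    from types import SimpleNamespace


    def unique_urls(pages):
        """Yield each page URL once, in first-seen order."""
        seen = set()
        for page in pages:
            if page.url not in seen:
                seen.add(page.url)
                yield page.url


    # Stand-ins for page objects; with USP installed, pass tree.all_pages() instead.
    pages = [
        SimpleNamespace(url="https://www.example.org/"),
        SimpleNamespace(url="https://www.example.org/about"),
        SimpleNamespace(url="https://www.example.org/"),
    ]

    for url in unique_urls(pages):
        print(url)

The helper is itself a generator, so it does not build a list of pages, although the set of URLs already seen does grow with the number of distinct URLs.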

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see the `reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/reference/api/usp.objects.sitemap.html>`_. ``AbstractSitemap.all_pages()`` returns a generator, so pages can be iterated over efficiently without loading the entire tree into memory.
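
The idea of walking a sitemap hierarchy with a generator can be sketched with a toy tree. The ``Page`` and ``Sitemap`` classes below are hypothetical simplifications, not USP's real ``AbstractSitemap`` subclasses; they only show how a recursive generator can descend into nested sitemaps and hand out one page at a time instead of building a complete list first.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Iterator, List


    @dataclass
    class Page:
        url: str


    @dataclass
    class Sitemap:
        url: str
        pages: List[Page] = field(default_factory=list)
        sub_sitemaps: List["Sitemap"] = field(default_factory=list)

        def all_pages(self) -> Iterator[Page]:
            """Lazily yield this sitemap's pages, then those of nested sitemaps."""
            yield from self.pages
            for sub in self.sub_sitemaps:
                yield from sub.all_pages()


    # A made-up index sitemap with two child sitemaps.
    tree = Sitemap(
        url="https://www.example.org/sitemap_index.xml",
        sub_sitemaps=[
            Sitemap(
                url="https://www.example.org/sitemap_1.xml",
                pages=[Page("https://www.example.org/a")],
            ),
            Sitemap(
                url="https://www.example.org/sitemap_2.xml",
                pages=[Page("https://www.example.org/b"), Page("https://www.example.org/c")],
            ),
        ],
    )

    for page in tree.all_pages():
        print(page.url)

A caller that stops early, for example after ``next(tree.all_pages())``, never touches the remaining sub-sitemaps, which is the main appeal of the generator style.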

For more examples and details, see the `documentation <https://ultimate-sitemap-parser.readthedocs.io/en/latest/>`_.


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ultimate-sitemap-parser",
    "maintainer": "Freddy Heppell",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "f.heppell@sheffield.ac.uk",
    "keywords": "sitemap, crawler, indexing, xml, rss, atom, google news",
    "author": "Linas Valiukas",
    "author_email": "linas@media.mit.edu",
    "download_url": "https://files.pythonhosted.org/packages/80/a1/43c1d4e466642fb433dc8ae4c94811afb2b2d2979cd0aacf851cb7fcd29d/ultimate_sitemap_parser-1.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0-or-later",
    "summary": "A performant library for parsing and crawling sitemaps",
    "version": "1.5.0",
    "project_urls": {
        "Documentation": "https://ultimate-sitemap-parser.readthedocs.io/",
        "Homepage": "https://ultimate-sitemap-parser.readthedocs.io/",
        "Repository": "https://github.com/GateNLP/ultimate-sitemap-parser"
    },
    "split_keywords": [
        "sitemap",
        " crawler",
        " indexing",
        " xml",
        " rss",
        " atom",
        " google news"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "be9069780a9e1bd5ed9b73dcbb864612225f9dd01c7865227e08fdeac1659c93",
                "md5": "83b2117449e8c486d5a208eb31da629e",
                "sha256": "98a474d64cccf98934c9fa2a4a3fa50f8de19b39e2beb99614ca9caea0a46857"
            },
            "downloads": -1,
            "filename": "ultimate_sitemap_parser-1.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "83b2117449e8c486d5a208eb31da629e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 42428,
            "upload_time": "2025-08-11T10:54:30",
            "upload_time_iso_8601": "2025-08-11T10:54:30.175563Z",
            "url": "https://files.pythonhosted.org/packages/be/90/69780a9e1bd5ed9b73dcbb864612225f9dd01c7865227e08fdeac1659c93/ultimate_sitemap_parser-1.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "80a143c1d4e466642fb433dc8ae4c94811afb2b2d2979cd0aacf851cb7fcd29d",
                "md5": "4386dce19f68e4972a0172b34bc6365f",
                "sha256": "fe6938a37a105a097ed2ee2744ce6d947f20b463fb6dad523e76719bcebc939b"
            },
            "downloads": -1,
            "filename": "ultimate_sitemap_parser-1.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4386dce19f68e4972a0172b34bc6365f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 38031,
            "upload_time": "2025-08-11T10:54:32",
            "upload_time_iso_8601": "2025-08-11T10:54:32.051029Z",
            "url": "https://files.pythonhosted.org/packages/80/a1/43c1d4e466642fb433dc8ae4c94811afb2b2d2979cd0aacf851cb7fcd29d/ultimate_sitemap_parser-1.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-11 10:54:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "GateNLP",
    "github_project": "ultimate-sitemap-parser",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "ultimate-sitemap-parser"
}
        