:Name: ultimate-sitemap-parser
:Version: 1.6.0
:Summary: A performant library for parsing and crawling sitemaps
:Upload time: 2025-09-10 08:28:14
:Maintainer: Freddy Heppell
:Author: Linas Valiukas
:Requires Python: >=3.9
:License: GPL-3.0-or-later
:Keywords: sitemap, crawler, indexing, xml, rss, atom, google news
:Requirements: No requirements were recorded.

Ultimate Sitemap Parser
-----------------------

.. image:: https://img.shields.io/pypi/pyversions/ultimate-sitemap-parser
   :alt: PyPI - Python Version
   :target: https://github.com/GateNLP/ultimate-sitemap-parser

.. image:: https://img.shields.io/pypi/v/ultimate-sitemap-parser
   :alt: PyPI - Version
   :target: https://pypi.org/project/ultimate-sitemap-parser/

.. image:: https://img.shields.io/conda/vn/conda-forge/ultimate-sitemap-parser
   :alt: Conda Version
   :target: https://anaconda.org/conda-forge/ultimate-sitemap-parser

.. image:: https://img.shields.io/pepy/dt/ultimate-sitemap-parser
   :target: https://pepy.tech/project/ultimate-sitemap-parser
   :alt: Pepy Total Downloads


**Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.**


Features
========

- Supports all sitemap formats:

  - `XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>`_
  - `Google News sitemaps <https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap>`_ and `Image sitemaps <https://developers.google.com/search/docs/advanced/sitemaps/image-sitemaps>`_
  - `plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>`_

- Field-tested with ~1 million URLs as part of the `Media Cloud project <https://mediacloud.org/>`_
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
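
To illustrate why event-based Expat parsing keeps memory use low, here is a minimal sketch built on the standard library's ``xml.parsers.expat`` module. It is not USP's actual implementation: the ``extract_locs()`` helper, the sample document, and the ``chunk_size`` parameter are assumptions made for this example. The parser is fed small chunks and only the text of the ``<loc>`` element currently being read is buffered.

.. code:: python

    import xml.parsers.expat

    # Hypothetical sample sitemap, used only for this illustration.
    SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.example.org/</loc></url>
      <url><loc>https://www.example.org/about</loc></url>
    </urlset>"""


    def extract_locs(xml_bytes, chunk_size=64):
        """Collect <loc> values while feeding the parser small chunks."""
        urls = []
        buffer = []  # text pieces of the <loc> element currently being read
        in_loc = False

        def start(name, attrs):
            nonlocal in_loc
            if name == "loc":
                in_loc = True
                buffer.clear()

        def char_data(data):
            # Expat may deliver one text node in several pieces.
            if in_loc:
                buffer.append(data)

        def end(name):
            nonlocal in_loc
            if name == "loc":
                urls.append("".join(buffer).strip())
                in_loc = False

        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = start
        parser.CharacterDataHandler = char_data
        parser.EndElementHandler = end

        # Feed the document piece by piece instead of all at once.
        for offset in range(0, len(xml_bytes), chunk_size):
            parser.Parse(xml_bytes[offset:offset + chunk_size], False)
        parser.Parse(b"", True)  # signal end of input
        return urls


    print(extract_locs(SITEMAP_XML))

In a real crawler the chunks would come from a network response rather than an in-memory byte string, so the whole document would never need to be held at once.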


Installation
============

.. code:: sh

    pip install ultimate-sitemap-parser

or, using conda with the conda-forge channel:

.. code:: sh

    conda install -c conda-forge ultimate-sitemap-parser


Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.example.org/')

    for page in tree.all_pages():
        print(page.url)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represents the sitemap
hierarchy found on the website; see the `reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/reference/api/usp.objects.sitemap.html>`_. ``AbstractSitemap.all_pages()`` returns a generator, so pages can be iterated over efficiently without loading the entire tree into memory.
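
The generator pattern behind ``all_pages()`` can be sketched with a toy tree. The ``Page`` and ``Sitemap`` classes below are hypothetical stand-ins written for this example, not USP's own classes, and the sketch only shows lazy, depth-first iteration over nested sitemaps without first building a flat list of pages.

.. code:: python

    from dataclasses import dataclass, field
    from itertools import islice
    from typing import Iterator, List


    @dataclass
    class Page:
        """Hypothetical stand-in for a page entry."""
        url: str


    @dataclass
    class Sitemap:
        """Hypothetical stand-in for a sitemap node with nested sitemaps."""
        url: str
        pages: List[Page] = field(default_factory=list)
        sub_sitemaps: List["Sitemap"] = field(default_factory=list)

        def all_pages(self) -> Iterator[Page]:
            # Yield this sitemap's own pages, then recurse lazily.
            yield from self.pages
            for sub in self.sub_sitemaps:
                yield from sub.all_pages()


    tree = Sitemap(
        url="https://www.example.org/sitemap_index.xml",
        sub_sitemaps=[
            Sitemap(
                url="https://www.example.org/sitemap_a.xml",
                pages=[Page("https://www.example.org/a1"),
                       Page("https://www.example.org/a2")],
            ),
            Sitemap(
                url="https://www.example.org/sitemap_b.xml",
                pages=[Page("https://www.example.org/b1")],
            ),
        ],
    )

    # Only the first two pages are produced; the rest are never visited.
    for page in islice(tree.all_pages(), 2):
        print(page.url)

Because the caller pulls pages one at a time, it can stop early or stream results to storage without holding every page object at once.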

For more examples and details, see the `documentation <https://ultimate-sitemap-parser.readthedocs.io/en/latest/>`_.


            
