Ultimate Sitemap Parser
-----------------------

.. image:: https://img.shields.io/pypi/pyversions/ultimate-sitemap-parser
   :alt: PyPI - Python Version
   :target: https://github.com/GateNLP/ultimate-sitemap-parser

.. image:: https://img.shields.io/pypi/v/ultimate-sitemap-parser
   :alt: PyPI - Version
   :target: https://pypi.org/project/ultimate-sitemap-parser/

.. image:: https://img.shields.io/conda/vn/conda-forge/ultimate-sitemap-parser
   :alt: Conda Version
   :target: https://anaconda.org/conda-forge/ultimate-sitemap-parser

.. image:: https://img.shields.io/pepy/dt/ultimate-sitemap-parser
   :target: https://pepy.tech/project/ultimate-sitemap-parser
   :alt: Pepy Total Downloads

**Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.**


Features
========

- Supports all sitemap formats:

  - `XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>`_
  - `Google News sitemaps <https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap>`_ and `Image sitemaps <https://developers.google.com/search/docs/advanced/sitemaps/image-sitemaps>`_
  - `Plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>`_

- Field-tested with ~1 million URLs as part of the `Media Cloud project <https://mediacloud.org/>`_
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested

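
The Expat-based parsing mentioned in the list above can be illustrated with the standard library alone. The following is a minimal sketch, not USP's actual implementation: the helper name ``iter_sitemap_locs`` and the sample XML are assumptions made for illustration. It feeds a sitemap to ``xml.parsers.expat`` in small chunks and hands over each ``<loc>`` value as soon as its element is closed, so the document never has to be held as a full DOM tree.

```python
import xml.parsers.expat


def iter_sitemap_locs(xml_bytes, chunk_size=64):
    """Yield the text of each <loc> element, feeding Expat in small chunks.

    Hypothetical helper for illustration only; not part of USP's API.
    """
    found = []      # URLs completed while parsing the current chunk
    buffer = []     # character data collected inside the current <loc>
    in_loc = False

    def start(name, attrs):
        nonlocal in_loc
        # Namespace-unaware parser: names arrive as written, e.g. "loc".
        if name == "loc":
            in_loc = True
            buffer.clear()

    def end(name):
        nonlocal in_loc
        if name == "loc":
            in_loc = False
            found.append("".join(buffer).strip())

    def chars(data):
        # Expat may deliver one text node in several pieces.
        if in_loc:
            buffer.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars

    for offset in range(0, len(xml_bytes), chunk_size):
        parser.Parse(xml_bytes[offset:offset + chunk_size], False)
        # Hand over URLs finished in this chunk before reading the next one.
        yield from found
        found.clear()
    parser.Parse(b"", True)  # signal end of input
    yield from found


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/</loc></url>
  <url><loc>https://www.example.org/about</loc></url>
</urlset>"""

urls = list(iter_sitemap_locs(SAMPLE))
```

In a real crawler the chunks would come from a network response or a decompression stream rather than from an in-memory ``bytes`` object; the sketch slices a constant only to stay self-contained.
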

Installation
============

.. code:: sh

    pip install ultimate-sitemap-parser

or using Anaconda:

.. code:: sh

    conda install -c conda-forge ultimate-sitemap-parser


Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.example.org/')

    for page in tree.all_pages():
        print(page.url)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see the `reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/reference/api/usp.objects.sitemap.html>`_. ``AbstractSitemap.all_pages()`` returns a generator, so pages can be iterated over efficiently without loading the entire tree into memory.

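
To make the generator behaviour concrete, here is a minimal sketch of lazy traversal over a sitemap-like tree. The classes ``PagesNode`` and ``IndexNode`` are hypothetical stand-ins written for this example and are not USP's ``AbstractSitemap`` subclasses; only the method name ``all_pages()`` is borrowed from the text above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class PagesNode:
    """Stand-in for a sitemap that lists pages directly (hypothetical)."""
    url: str
    pages: List[str] = field(default_factory=list)

    def all_pages(self) -> Iterator[str]:
        yield from self.pages


@dataclass
class IndexNode:
    """Stand-in for an index sitemap pointing at other sitemaps (hypothetical)."""
    url: str
    sub_sitemaps: list = field(default_factory=list)

    def all_pages(self) -> Iterator[str]:
        # Delegate to each child in turn; no combined list is ever built.
        for sub in self.sub_sitemaps:
            yield from sub.all_pages()


tree = IndexNode(
    "https://www.example.org/sitemap_index.xml",
    [
        PagesNode("https://www.example.org/sitemap_a.xml",
                  ["https://www.example.org/a1", "https://www.example.org/a2"]),
        PagesNode("https://www.example.org/sitemap_b.xml",
                  ["https://www.example.org/b1"]),
    ],
)

pages = tree.all_pages()
first = next(pages)   # only the first page has been produced so far
rest = list(pages)    # the remaining pages are produced on demand
```

The toy nodes keep their pages in plain lists, so the sketch shows only the lazy, one-page-at-a-time traversal, not how USP itself stores or loads sitemap contents.
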
For more examples and details, see the `documentation <https://ultimate-sitemap-parser.readthedocs.io/en/latest/>`_.
Raw data
{
"_id": null,
"home_page": null,
"name": "ultimate-sitemap-parser",
"maintainer": "Freddy Heppell",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "f.heppell@sheffield.ac.uk",
"keywords": "sitemap, crawler, indexing, xml, rss, atom, google news",
"author": "Linas Valiukas",
"author_email": "linas@media.mit.edu",
"download_url": "https://files.pythonhosted.org/packages/80/a1/43c1d4e466642fb433dc8ae4c94811afb2b2d2979cd0aacf851cb7fcd29d/ultimate_sitemap_parser-1.5.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL-3.0-or-later",
"summary": "A performant library for parsing and crawling sitemaps",
"version": "1.5.0",
"project_urls": {
"Documentation": "https://ultimate-sitemap-parser.readthedocs.io/",
"Homepage": "https://ultimate-sitemap-parser.readthedocs.io/",
"Repository": "https://github.com/GateNLP/ultimate-sitemap-parser"
},
"split_keywords": [
"sitemap",
" crawler",
" indexing",
" xml",
" rss",
" atom",
" google news"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "be9069780a9e1bd5ed9b73dcbb864612225f9dd01c7865227e08fdeac1659c93",
"md5": "83b2117449e8c486d5a208eb31da629e",
"sha256": "98a474d64cccf98934c9fa2a4a3fa50f8de19b39e2beb99614ca9caea0a46857"
},
"downloads": -1,
"filename": "ultimate_sitemap_parser-1.5.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "83b2117449e8c486d5a208eb31da629e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 42428,
"upload_time": "2025-08-11T10:54:30",
"upload_time_iso_8601": "2025-08-11T10:54:30.175563Z",
"url": "https://files.pythonhosted.org/packages/be/90/69780a9e1bd5ed9b73dcbb864612225f9dd01c7865227e08fdeac1659c93/ultimate_sitemap_parser-1.5.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "80a143c1d4e466642fb433dc8ae4c94811afb2b2d2979cd0aacf851cb7fcd29d",
"md5": "4386dce19f68e4972a0172b34bc6365f",
"sha256": "fe6938a37a105a097ed2ee2744ce6d947f20b463fb6dad523e76719bcebc939b"
},
"downloads": -1,
"filename": "ultimate_sitemap_parser-1.5.0.tar.gz",
"has_sig": false,
"md5_digest": "4386dce19f68e4972a0172b34bc6365f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 38031,
"upload_time": "2025-08-11T10:54:32",
"upload_time_iso_8601": "2025-08-11T10:54:32.051029Z",
"url": "https://files.pythonhosted.org/packages/80/a1/43c1d4e466642fb433dc8ae4c94811afb2b2d2979cd0aacf851cb7fcd29d/ultimate_sitemap_parser-1.5.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-11 10:54:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "GateNLP",
"github_project": "ultimate-sitemap-parser",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "ultimate-sitemap-parser"
}