.. image:: https://travis-ci.org/berkmancenter/mediacloud-ultimate_sitemap_parser.svg?branch=develop
    :target: https://travis-ci.org/berkmancenter/mediacloud-ultimate_sitemap_parser
    :alt: Build Status

.. image:: https://readthedocs.org/projects/ultimate-sitemap-parser/badge/?version=latest
    :target: https://ultimate-sitemap-parser.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://coveralls.io/repos/github/berkmancenter/mediacloud-ultimate_sitemap_parser/badge.svg?branch=develop
    :target: https://coveralls.io/github/berkmancenter/mediacloud-ultimate_sitemap_parser?branch=develop
    :alt: Coverage Status

.. image:: https://badge.fury.io/py/ultimate-sitemap-parser.svg
    :target: https://badge.fury.io/py/ultimate-sitemap-parser
    :alt: PyPI package


Website sitemap parser for Python 3.5+.

Features
========

- Supports all sitemap formats:

  - `XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>`_
  - `Google News sitemaps <https://support.google.com/news/publisher-center/answer/74288?hl=en>`_
  - `plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>`_

- Field-tested with ~1 million URLs as part of the `Media Cloud project <https://mediacloud.org/>`_
- Tolerant of the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
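
To illustrate the ``robots.txt`` case from the list above, the sketch below pulls sitemap URLs out of ``Sitemap:`` lines using only the standard library. It is not how this package is implemented; the function name ``sitemap_urls_from_robots_txt`` is hypothetical, and comment handling is simplified (everything after a ``#`` is dropped).

.. code:: python

    from typing import List


    def sitemap_urls_from_robots_txt(robots_txt: str) -> List[str]:
        """Return the URLs named in "Sitemap:" lines, in order, without duplicates."""
        urls = []  # type: List[str]
        for line in robots_txt.splitlines():
            # Simplification: treat everything after "#" as a comment.
            line = line.split('#', 1)[0].strip()
            # Split on the first colon only, so the "https://" part of the
            # URL stays intact. The field name is matched case-insensitively.
            name, sep, value = line.partition(':')
            if sep and name.strip().lower() == 'sitemap':
                url = value.strip()
                if url and url not in urls:
                    urls.append(url)
        return urls

Passing the text of a ``robots.txt`` file to this function yields the list of sitemap URLs it declares, which is the starting point a crawler would fetch next.
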

Installation
============

.. code:: sh

    pip install ultimate_sitemap_parser


Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
    print(tree)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see the `reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.sitemap>`_.
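
To get a feel for the hierarchy, the tree can be walked recursively. The helper below is a minimal sketch, not part of the package: ``sitemap_outline`` is a hypothetical name, and it assumes that every sitemap object has a ``url`` attribute and that index sitemaps list their children in ``sub_sitemaps``. Check both against the linked reference for the version you have installed.

.. code:: python

    from typing import List


    def sitemap_outline(sitemap, depth: int = 0) -> List[str]:
        """Return one indented line per sitemap in the tree, parents first."""
        # Assumption: each sitemap object has a ``url`` attribute.
        lines = ['{}{} ({})'.format('  ' * depth, sitemap.url, type(sitemap).__name__)]
        # Assumption: only index sitemaps carry ``sub_sitemaps``; objects
        # without that attribute are treated as leaves.
        for child in getattr(sitemap, 'sub_sitemaps', []):
            lines.extend(sitemap_outline(child, depth + 1))
        return lines

With the ``tree`` object from the example above, ``print('\n'.join(sitemap_outline(tree)))`` would print one line per sitemap, indented by depth.
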
If you just want to list every page found in any of the website's sitemaps, use the ``all_pages()`` method:

.. code:: python

    # all_pages() returns an Iterator
    for page in tree.all_pages():
        print(page)

The ``all_pages()`` method returns an iterator yielding ``SitemapPage`` objects; see the `reference of SitemapPage <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.page>`_.
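
Because the result is an iterator, pages can be processed one at a time without building a full list first. As a small sketch of that pattern, the hypothetical helper ``unique_page_urls`` below yields each URL only once, which may be useful if a site lists the same page in more than one sitemap. It assumes only that each page object has a ``url`` attribute, as described in the linked reference.

.. code:: python

    from typing import Iterable, Iterator


    def unique_page_urls(pages: Iterable) -> Iterator[str]:
        """Yield each page URL once, in first-seen order."""
        seen = set()
        for page in pages:
            # Assumption: every page object has a ``url`` attribute.
            url = page.url
            if url not in seen:
                seen.add(url)
                yield url

It could be used as ``for url in unique_page_urls(tree.all_pages()): print(url)``. Note that the set of seen URLs is kept in memory, so this trades some of the iterator's memory savings for de-duplication.
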
Raw data
{
"_id": null,
"home_page": "https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser",
"name": "ultimate-sitemap-parser",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.5",
"maintainer_email": "",
"keywords": "sitemap sitemap-xml parser",
"author": "Linas Valiukas, Hal Roberts, Media Cloud project",
"author_email": "linas@media.mit.edu, hroberts@cyber.law.harvard.edu",
"download_url": "https://files.pythonhosted.org/packages/21/44/04eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff/ultimate_sitemap_parser-0.5.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "GPLv3+",
"summary": "Ultimate Sitemap Parser",
"version": "0.5",
"project_urls": {
"Homepage": "https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser"
},
"split_keywords": [
"sitemap",
"sitemap-xml",
"parser"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ee58a6394d980bda84c44b442a3bab5ceb49626d01d4b17fbc7fe6d41b90c496",
"md5": "5479eb21fc1626a54642dc06ae9613de",
"sha256": "806e723eeb0293c38e111822d651e987b1494ae9c08be82e73172ade667418a6"
},
"downloads": -1,
"filename": "ultimate_sitemap_parser-0.5-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "5479eb21fc1626a54642dc06ae9613de",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.5",
"size": 23208,
"upload_time": "2019-07-31T11:15:46",
"upload_time_iso_8601": "2019-07-31T11:15:46.124185Z",
"url": "https://files.pythonhosted.org/packages/ee/58/a6394d980bda84c44b442a3bab5ceb49626d01d4b17fbc7fe6d41b90c496/ultimate_sitemap_parser-0.5-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "214404eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff",
"md5": "362e6e5d4b993d6e89eb4a259ccd029e",
"sha256": "9825fefcdf515e2748addc7ec5dcdb6430dfdd4ef5de4a54e39de1e7613d0ece"
},
"downloads": -1,
"filename": "ultimate_sitemap_parser-0.5.tar.gz",
"has_sig": false,
"md5_digest": "362e6e5d4b993d6e89eb4a259ccd029e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 20229,
"upload_time": "2019-07-31T11:15:47",
"upload_time_iso_8601": "2019-07-31T11:15:47.758717Z",
"url": "https://files.pythonhosted.org/packages/21/44/04eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff/ultimate_sitemap_parser-0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2019-07-31 11:15:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "berkmancenter",
"github_project": "mediacloud-ultimate_sitemap_parser",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "ultimate-sitemap-parser"
}