ultimate-sitemap-parser

:Name: ultimate-sitemap-parser
:Version: 0.5
:home_page: https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser
:Summary: Ultimate Sitemap Parser
:upload_time: 2019-07-31 11:15:47
:docs_url: None
:author: Linas Valiukas, Hal Roberts, Media Cloud project
:requires_python: >=3.5
:license: GPLv3+
:keywords: sitemap sitemap-xml parser
:requirements: No requirements were recorded.

.. image:: https://travis-ci.org/berkmancenter/mediacloud-ultimate_sitemap_parser.svg?branch=develop
    :target: https://travis-ci.org/berkmancenter/mediacloud-ultimate_sitemap_parser
    :alt: Build Status

.. image:: https://readthedocs.org/projects/ultimate-sitemap-parser/badge/?version=latest
    :target: https://ultimate-sitemap-parser.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://coveralls.io/repos/github/berkmancenter/mediacloud-ultimate_sitemap_parser/badge.svg?branch=develop
    :target: https://coveralls.io/github/berkmancenter/mediacloud-ultimate_sitemap_parser?branch=develop
    :alt: Coverage Status

.. image:: https://badge.fury.io/py/ultimate-sitemap-parser.svg
    :target: https://badge.fury.io/py/ultimate-sitemap-parser
    :alt: PyPI package


Website sitemap parser for Python 3.5+.


Features
========

- Supports all sitemap formats:

  - `XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>`_
  - `Google News sitemaps <https://support.google.com/news/publisher-center/answer/74288?hl=en>`_
  - `plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
  - `Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>`_

- Field-tested with ~1 million URLs as part of the `Media Cloud project <https://mediacloud.org/>`_
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
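
To illustrate the ``robots.txt`` case from the list above, here is a minimal sketch of how ``Sitemap:`` directives can be pulled out of a ``robots.txt`` body. It is not the library's own implementation. The function name ``sitemap_urls_from_robots_txt`` and the sample text are hypothetical and exist only for this example.

```python
from typing import List


def sitemap_urls_from_robots_txt(robots_txt: str) -> List[str]:
    """Return the URLs named in "Sitemap:" lines, in order, without duplicates."""
    urls = []  # type: List[str]
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (Simplification: a URL containing '#' would be cut short here.)
        line = line.split('#', 1)[0].strip()
        # Split on the first colon only, because URLs contain colons themselves.
        field, sep, value = line.partition(':')
        if sep and field.strip().lower() == 'sitemap':
            url = value.strip()
            if url and url not in urls:
                urls.append(url)
    return urls


# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
sitemap: https://www.example.com/news-sitemap.xml  # lowercase field name
"""

print(sitemap_urls_from_robots_txt(SAMPLE_ROBOTS_TXT))
```

The field name is compared case-insensitively, so both spellings in the sample are picked up.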


Installation
============

.. code:: sh

    pip install ultimate_sitemap_parser


Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
    print(tree)

``sitemap_tree_for_homepage()`` will return a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see a `reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.sitemap>`_.
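
The returned tree can be walked recursively. The sketch below prints an indented outline of a sitemap hierarchy. It assumes each sitemap object exposes a ``url`` attribute and that index sitemaps expose their children as ``sub_sitemaps``; check both names against the reference linked above. The helper ``outline`` and the stand-in objects are hypothetical, so the example runs without network access.

```python
from types import SimpleNamespace
from typing import List


def outline(sitemap, depth: int = 0) -> List[str]:
    """Return one indented line per sitemap, parents before children."""
    lines = ['  ' * depth + sitemap.url]
    # Assumption: only index sitemaps have ``sub_sitemaps``; default to none.
    for child in getattr(sitemap, 'sub_sitemaps', []):
        lines.extend(outline(child, depth + 1))
    return lines


# Stand-in objects that mimic the assumed attributes of the real tree.
fake_tree = SimpleNamespace(
    url='https://www.example.com/',
    sub_sitemaps=[
        SimpleNamespace(
            url='https://www.example.com/sitemap_index.xml',
            sub_sitemaps=[
                SimpleNamespace(url='https://www.example.com/sitemap_news.xml'),
            ],
        ),
    ],
)

print('\n'.join(outline(fake_tree)))
```

If the assumed attribute names match, the same helper could be called as ``outline(tree)`` on the object returned by ``sitemap_tree_for_homepage()``.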

If you just want to list all the pages found in all of the website's sitemaps, consider using the ``all_pages()`` method:

.. code:: python

    # all_pages() returns an Iterator
    for page in tree.all_pages():
        print(page)

The ``all_pages()`` method returns an iterator that yields ``SitemapPage`` objects; see a `reference of SitemapPage <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.page>`_.
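
Because the pages arrive through an iterator, they can be processed lazily instead of being loaded into a list first. The sketch below deduplicates URLs on the fly, in case the same URL is listed in more than one sitemap, and stops after a fixed number of results. It assumes each page object has a ``url`` attribute. The helper ``unique_urls`` and the stand-in pages are hypothetical.

```python
from itertools import islice
from types import SimpleNamespace
from typing import Iterable, Iterator


def unique_urls(pages: Iterable) -> Iterator[str]:
    """Lazily yield each page URL once, in first-seen order."""
    seen = set()
    for page in pages:
        if page.url not in seen:
            seen.add(page.url)
            yield page.url


# Stand-in for tree.all_pages(); a generator, so it can be consumed only once.
fake_pages = (SimpleNamespace(url=u) for u in [
    'https://www.example.com/a',
    'https://www.example.com/b',
    'https://www.example.com/a',
    'https://www.example.com/c',
])

# Take only the first two distinct URLs; the rest of the iterator is not read.
first_two = list(islice(unique_urls(fake_pages), 2))
print(first_two)
```

The ``seen`` set grows with the number of distinct URLs, so for very large sites this trades some memory for the deduplication.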

Raw data

{
    "_id": null,
    "home_page": "https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser",
    "name": "ultimate-sitemap-parser",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": "",
    "keywords": "sitemap sitemap-xml parser",
    "author": "Linas Valiukas, Hal Roberts, Media Cloud project",
    "author_email": "linas@media.mit.edu, hroberts@cyber.law.harvard.edu",
    "download_url": "https://files.pythonhosted.org/packages/21/44/04eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff/ultimate_sitemap_parser-0.5.tar.gz",
    "platform": "",
    "bugtrack_url": null,
    "license": "GPLv3+",
    "summary": "Ultimate Sitemap Parser",
    "version": "0.5",
    "project_urls": {
        "Homepage": "https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser"
    },
    "split_keywords": [
        "sitemap",
        "sitemap-xml",
        "parser"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ee58a6394d980bda84c44b442a3bab5ceb49626d01d4b17fbc7fe6d41b90c496",
                "md5": "5479eb21fc1626a54642dc06ae9613de",
                "sha256": "806e723eeb0293c38e111822d651e987b1494ae9c08be82e73172ade667418a6"
            },
            "downloads": -1,
            "filename": "ultimate_sitemap_parser-0.5-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5479eb21fc1626a54642dc06ae9613de",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.5",
            "size": 23208,
            "upload_time": "2019-07-31T11:15:46",
            "upload_time_iso_8601": "2019-07-31T11:15:46.124185Z",
            "url": "https://files.pythonhosted.org/packages/ee/58/a6394d980bda84c44b442a3bab5ceb49626d01d4b17fbc7fe6d41b90c496/ultimate_sitemap_parser-0.5-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "214404eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff",
                "md5": "362e6e5d4b993d6e89eb4a259ccd029e",
                "sha256": "9825fefcdf515e2748addc7ec5dcdb6430dfdd4ef5de4a54e39de1e7613d0ece"
            },
            "downloads": -1,
            "filename": "ultimate_sitemap_parser-0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "362e6e5d4b993d6e89eb4a259ccd029e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.5",
            "size": 20229,
            "upload_time": "2019-07-31T11:15:47",
            "upload_time_iso_8601": "2019-07-31T11:15:47.758717Z",
            "url": "https://files.pythonhosted.org/packages/21/44/04eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff/ultimate_sitemap_parser-0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2019-07-31 11:15:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "berkmancenter",
    "github_project": "mediacloud-ultimate_sitemap_parser",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "ultimate-sitemap-parser"
}
