Protego


Name: Protego
Version: 0.3.1
Home page: https://github.com/scrapy/protego
Summary: Pure-Python robots.txt parser with support for modern conventions
Upload time: 2024-04-05 10:08:54
Author: Anubhav Patel
Requires Python: >=3.8
License: BSD
Keywords: robots.txt, parser, robots, rep
Requirements: none recorded

=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi.python.org/pypi/protego
   :alt: Supported Python Versions

.. image:: https://github.com/scrapy/protego/workflows/CI/badge.svg
   :target: https://github.com/scrapy/protego/actions?query=workflow%3ACI
   :alt: CI

Protego is a pure-Python ``robots.txt`` parser with support for modern
conventions.


Install
=======

Install Protego with pip:

.. code-block:: none

    pip install protego


Usage
=====

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
... 
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
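
The results above are consistent with Google-style matching: ``*`` matches any
run of characters, a trailing ``$`` anchors the end of the path, and the
longest matching pattern wins. The following is a minimal sketch of those
rules using only the standard library. It is not Protego's actual
implementation; ``pattern_to_regex`` and ``is_allowed`` are hypothetical helper
names, and resolving ties in favour of ``Allow`` is an assumption based on
Google's "least restrictive rule" convention.

.. code-block:: python

    import re


    def pattern_to_regex(pattern):
        """Translate a robots.txt path pattern into a compiled regex."""
        anchored = pattern.endswith("$")
        if anchored:
            pattern = pattern[:-1]
        # Escape literal parts and let each '*' match any run of characters.
        body = ".*".join(re.escape(part) for part in pattern.split("*"))
        return re.compile(body + ("$" if anchored else ""))


    def is_allowed(path, rules):
        """Apply (allow, pattern) rules; the longest matching pattern wins."""
        best = None
        for allow, pattern in rules:
            # re.match anchors at the start of the path, like a prefix rule.
            if pattern_to_regex(pattern).match(path):
                key = (len(pattern), allow)  # on equal length, Allow wins
                if best is None or key > best:
                    best = key
        # A path that no rule matches is allowed.
        return True if best is None else best[1]


    # The rules from the robots.txt in the session above.
    RULES = [
        (False, "/"),
        (True, "/about"),
        (True, "/account"),
        (False, "/account/contact$"),
        (False, "/account/*/profile"),
    ]

With these rules, ``is_allowed("/account/contact", RULES)`` is ``False``
because ``/account/contact$`` is longer than ``/account``, while
``is_allowed("/about", RULES)`` is ``True`` because ``/about`` is longer than
``/``.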

Using Protego with Requests_:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

.. _Requests: https://requests.readthedocs.io/
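
The example above hard-codes the ``robots.txt`` location. When starting from
an arbitrary page URL, the location can be derived with the standard library,
as in this small sketch. ``robots_txt_url`` is a hypothetical helper name and
is not part of Protego or Requests.

.. code-block:: python

    from urllib.parse import urlsplit, urlunsplit


    def robots_txt_url(page_url):
        """Return the robots.txt URL for the site that serves page_url."""
        parts = urlsplit(page_url)
        # Keep the scheme and host (with port, if any);
        # drop the path, query string and fragment.
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

The returned URL can then be passed to ``requests.get`` and the response text
to ``Protego.parse``, as in the session above.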


Comparison
==========

The following table compares Protego to the most popular ``robots.txt`` parsers
implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
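
To illustrate the "Length-based precedence" row, the sketch below feeds a
shortened version of the usage example's rules to the standard library's
``urllib.robotparser.RobotFileParser``, which applies the first matching rule
instead of the longest one. Note that its ``can_fetch`` takes the user agent
first and the URL second, the reverse of Protego's ``can_fetch(url,
user_agent)``. The variable names are illustrative only.

.. code-block:: python

    from urllib.robotparser import RobotFileParser

    ROBOTSTXT = """
    User-agent: *
    Disallow: /
    Allow: /about
    """

    rfp = RobotFileParser()
    # parse() expects an iterable of lines; strip the indentation used here.
    rfp.parse([line.strip() for line in ROBOTSTXT.splitlines()])

    # False: "Disallow: /" is listed first and matches every path, so the
    # later, longer "Allow: /about" rule is never consulted.
    about_allowed = rfp.can_fetch("mybot", "http://example.com/about")

For the same URL, the usage section shows Protego returning ``True``.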


API Reference
=============

Class ``protego.Protego``:

Properties
----------

*   ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs
    specified in ``robots.txt``.

*   ``preferred_host`` {string} Preferred host specified in ``robots.txt``.


Methods
-------

*   ``parse(robotstxt_body)`` Class method. Parse the contents of a
    ``robots.txt`` file, passed as a string, and return a new instance of
    ``protego.Protego``.

*   ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch
    the URL, otherwise return ``False``.

*   ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
    agent as a float. If nothing is specified, return ``None``.

*   ``request_rate(user_agent)`` Return the request rate specified for the user
    agent as a named tuple ``RequestRate(requests, seconds, start_time,
    end_time)``. If nothing is specified, return ``None``.

*   ``visit_time(user_agent)`` Return the visit time specified for the user 
    agent as a named tuple ``VisitTime(start_time, end_time)``. 
    If nothing is specified, return ``None``.
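
A crawler typically combines the values returned by ``crawl_delay``,
``request_rate`` and ``visit_time`` into a scheduling decision. The sketch
below shows one possible way to do that. The named tuples are local stand-ins
that reuse the field names listed above, so the code runs without Protego
installed. ``polite_delay`` and ``within_visit_time`` are hypothetical helpers,
taking the larger of the two delays is a design choice and not Protego
behaviour, and the visit-time fields are assumed to be ``datetime.time``
objects.

.. code-block:: python

    from collections import namedtuple
    from datetime import time

    # Stand-ins with the field names from the API reference above.
    RequestRate = namedtuple("RequestRate", "requests seconds start_time end_time")
    VisitTime = namedtuple("VisitTime", "start_time end_time")


    def polite_delay(crawl_delay, request_rate, default=0.0):
        """Return the number of seconds to wait between two requests."""
        candidates = [default]
        if crawl_delay is not None:
            candidates.append(crawl_delay)
        if request_rate is not None and request_rate.requests > 0:
            # For example, 10 requests per 60 seconds is one every 6 seconds.
            candidates.append(request_rate.seconds / request_rate.requests)
        # Honour the most restrictive directive.
        return max(candidates)


    def within_visit_time(now, visit_time):
        """Return True if the time of day `now` falls inside the window."""
        if visit_time is None:
            return True  # no restriction specified
        start, end = visit_time.start_time, visit_time.end_time
        if start <= end:
            return start <= now <= end
        # The window wraps past midnight, for example 22:00 to 04:00.
        return now >= start or now <= end

With the values from the usage example (``crawl_delay`` of ``4.0`` and a
request rate of 10 requests per 60 seconds), ``polite_delay`` returns ``6.0``,
because the request rate is the stricter of the two directives.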

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/scrapy/protego",
    "name": "Protego",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "robots.txt, parser, robots, rep",
    "author": "Anubhav Patel",
    "author_email": "anubhavp28@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/8a/12/cab9fa77ff4e9e444a5eb5480db4b4f872c03aa079145804aa054be377bc/Protego-0.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD",
    "summary": "Pure-Python robots.txt parser with support for modern conventions",
    "version": "0.3.1",
    "project_urls": {
        "Homepage": "https://github.com/scrapy/protego"
    },
    "split_keywords": [
        "robots.txt",
        " parser",
        " robots",
        " rep"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "74efece78585a5a189d8cc2b4c2d2b92a0dc025f156a6501159b026472ebbedc",
                "md5": "68ec8dbe4fd0f1481eb2b8d1ca9ff839",
                "sha256": "2fbe8e9b7a7dbc5016a932b14c98d236aad4c29290bbe457b8d2779666ef7a41"
            },
            "downloads": -1,
            "filename": "Protego-0.3.1-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "68ec8dbe4fd0f1481eb2b8d1ca9ff839",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.8",
            "size": 8474,
            "upload_time": "2024-04-05T10:08:53",
            "upload_time_iso_8601": "2024-04-05T10:08:53.500338Z",
            "url": "https://files.pythonhosted.org/packages/74/ef/ece78585a5a189d8cc2b4c2d2b92a0dc025f156a6501159b026472ebbedc/Protego-0.3.1-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8a12cab9fa77ff4e9e444a5eb5480db4b4f872c03aa079145804aa054be377bc",
                "md5": "200c5f8947240a59ecee2b12efd26fd5",
                "sha256": "e94430d0d25cbbf239bc849d86c5e544fbde531fcccfa059953c7da344a1712c"
            },
            "downloads": -1,
            "filename": "Protego-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "200c5f8947240a59ecee2b12efd26fd5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 3246145,
            "upload_time": "2024-04-05T10:08:54",
            "upload_time_iso_8601": "2024-04-05T10:08:54.884249Z",
            "url": "https://files.pythonhosted.org/packages/8a/12/cab9fa77ff4e9e444a5eb5480db4b4f872c03aa079145804aa054be377bc/Protego-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-05 10:08:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "scrapy",
    "github_project": "protego",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "tox": true,
    "lcname": "protego"
}
        