=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi.python.org/pypi/protego
   :alt: Supported Python Versions

.. image:: https://github.com/scrapy/protego/workflows/CI/badge.svg
   :target: https://github.com/scrapy/protego/actions?query=workflow%3ACI
   :alt: CI

Protego is a pure-Python ``robots.txt`` parser with support for modern
conventions.

Install
=======

To install Protego, simply use pip:

.. code-block:: none

    pip install protego


Usage
=====

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests_:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

.. _Requests: https://3.python-requests.org/

Comparison
==========

The following table compares Protego to the most popular ``robots.txt`` parsers
implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
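
As an illustration of length-based precedence, the longer (more specific) of
two matching rules wins, so an ``Allow`` rule can carve an exception out of a
broader ``Disallow``. A minimal sketch (the paths and bot name are made up for
illustration):

>>> from protego import Protego
>>> rp = Protego.parse("User-agent: *\nDisallow: /folder\nAllow: /folder/page\n")
>>> rp.can_fetch("https://example.com/folder/page", "mybot")
True
>>> rp.can_fetch("https://example.com/folder/other", "mybot")
False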

API Reference
=============

Class ``protego.Protego``:

Properties
----------

* ``sitemaps`` {``list_iterator``} A list of sitemaps specified in
  ``robots.txt``.

* ``preferred_host`` {string} Preferred host specified in ``robots.txt``.

Methods
-------

* ``parse(robotstxt_body)`` Parse ``robots.txt`` and return a new instance of
  ``protego.Protego``.

* ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch
  the URL, otherwise return ``False``.

* ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
  agent as a float. If nothing is specified, return ``None``.

* ``request_rate(user_agent)`` Return the request rate specified for the user
  agent as a named tuple ``RequestRate(requests, seconds, start_time,
  end_time)``. If nothing is specified, return ``None``.

* ``visit_time(user_agent)`` Return the visit time specified for the user
  agent as a named tuple ``VisitTime(start_time, end_time)``.
  If nothing is specified, return ``None``. (A usage sketch follows.)
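
``visit_time`` is the one method above that the Usage section does not
demonstrate. A minimal sketch, assuming a ``Visit-time`` directive in the
``HHMM-HHMM`` form used by the extended REP convention (the exact directive
syntax accepted may differ; the bot name is illustrative):

.. code-block:: python

    from protego import Protego

    # A hypothetical robots.txt asking crawlers to visit only
    # between 02:00 and 06:30.
    robotstxt = "User-agent: *\nVisit-time: 0200-0630\n"

    rp = Protego.parse(robotstxt)
    vt = rp.visit_time("mybot")  # None if no visit time applies
    if vt is not None:
        print(vt.start_time, vt.end_time)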