crawlerdetect

Name: crawlerdetect
Version: 0.3.0
Home page: https://github.com/moskrc/crawlerdetect
Summary: CrawlerDetect is a Python library designed to identify bots, crawlers, and spiders by analyzing their user agents.
Author: Vitalii Shishorin
Upload time: 2024-11-15 08:02:23
Requires Python: <4, >=3.9
License: MIT
Keywords: crawler, crawler detect, crawler detector, crawlerdetect, python crawler detect
            # About CrawlerDetect

This is a Python wrapper for [CrawlerDetect](https://github.com/JayBizzle/Crawler-Detect), a web crawler detection library. It identifies bots, crawlers, and spiders from the user agent and other HTTP headers. Currently, it can detect over 3,678 bots, spiders, and crawlers.
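
Conceptually, detection boils down to matching the user agent against a large set of known crawler regexes. The real library ships thousands of patterns synced from the upstream PHP project; the minimal stdlib-only sketch below is purely illustrative, with a hand-picked pattern list, and is not the library's actual implementation:

```python
import re

# Illustrative pattern list only; the real library maintains thousands.
CRAWLER_PATTERNS = [
    r"Googlebot",
    r"bingbot",
    r"Sosospider",
]

# Combine into one alternation for a single regex scan per user agent.
_COMBINED = re.compile("|".join(CRAWLER_PATTERNS), re.IGNORECASE)

def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent matches a known crawler pattern."""
    return _COMBINED.search(user_agent) is not None

print(is_crawler("Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))  # False
```

Compiling all patterns into one alternation keeps detection to a single regex scan per request, which is essentially the design trade-off a library like this makes.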

# How to install
```bash
$ pip install crawlerdetect
```

# How to use

## Variant 1
```python
from crawlerdetect import CrawlerDetect

crawler_detect = CrawlerDetect()
crawler_detect.isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')
# Returns True if a crawler user agent is detected
```

## Variant 2
```python
from crawlerdetect import CrawlerDetect

crawler_detect = CrawlerDetect(user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile (compatible; Yahoo Ad monitoring; https://help.yahoo.com/kb/yahoo-ad-monitoring-SLN24857.html)')
crawler_detect.isCrawler()
# Returns True if a crawler user agent is detected
```

## Variant 3
```python
from crawlerdetect import CrawlerDetect

crawler_detect = CrawlerDetect(headers={
    'DOCUMENT_ROOT': '/home/test/public_html',
    'GATEWAY_INTERFACE': 'CGI/1.1',
    'HTTP_ACCEPT': '*/*',
    'HTTP_ACCEPT_ENCODING': 'gzip, deflate',
    'HTTP_CACHE_CONTROL': 'no-cache',
    'HTTP_CONNECTION': 'Keep-Alive',
    'HTTP_FROM': 'googlebot(at)googlebot.com',
    'HTTP_HOST': 'www.test.com',
    'HTTP_PRAGMA': 'no-cache',
    'HTTP_USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'PATH': '/bin:/usr/bin',
    'QUERY_STRING': 'order=closingDate',
    'REDIRECT_STATUS': '200',
    'REMOTE_ADDR': '127.0.0.1',
    'REMOTE_PORT': '3360',
    'REQUEST_METHOD': 'GET',
    'REQUEST_URI': '/?test=testing',
    'SCRIPT_FILENAME': '/home/test/public_html/index.php',
    'SCRIPT_NAME': '/index.php',
    'SERVER_ADDR': '127.0.0.1',
    'SERVER_ADMIN': 'webmaster@test.com',
    'SERVER_NAME': 'www.test.com',
    'SERVER_PORT': '80',
    'SERVER_PROTOCOL': 'HTTP/1.1',
    'SERVER_SIGNATURE': '',
    'SERVER_SOFTWARE': 'Apache',
    'UNIQUE_ID': 'Vx6MENRxerBUSDEQgFLAAAAAS',
    'PHP_SELF': '/index.php',
    'REQUEST_TIME_FLOAT': 1461619728.0705,
    'REQUEST_TIME': 1461619728,
})
crawler_detect.isCrawler()
# Returns True if a crawler user agent is detected
```
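
Note that the `headers` dict uses CGI/WSGI-style keys (`HTTP_USER_AGENT`, `HTTP_ACCEPT_ENCODING`, and so on). If your framework hands you plain HTTP header names instead, a small helper can convert them; this helper is illustrative and not part of the library:

```python
def to_cgi_headers(headers: dict) -> dict:
    """Convert plain HTTP header names to CGI/WSGI-style keys,
    e.g. 'User-Agent' -> 'HTTP_USER_AGENT'."""
    return {
        "HTTP_" + name.upper().replace("-", "_"): value
        for name, value in headers.items()
    }

cgi_headers = to_cgi_headers({
    "User-Agent": "Mozilla/5.0",
    "Accept-Encoding": "gzip, deflate",
})
print(cgi_headers)
# {'HTTP_USER_AGENT': 'Mozilla/5.0', 'HTTP_ACCEPT_ENCODING': 'gzip, deflate'}
```

The resulting dict can then be passed as the `headers` argument shown in Variant 3.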

## Output the name of the bot that matched (if any)
```python
from crawlerdetect import CrawlerDetect

crawler_detect = CrawlerDetect()
crawler_detect.isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')
# Returns True if a crawler user agent is detected
crawler_detect.getMatches()
# 'Sosospider'
```

## Get version of the library
```python
import crawlerdetect

crawlerdetect.__version__
# e.g. '0.3.0'
```

# Contributing

The patterns and test cases are synced from the PHP repo. If you find a bot, spider, or crawler user agent that crawlerdetect fails to detect, please submit a pull request with the regex pattern and a test case to the [upstream PHP repo](https://github.com/JayBizzle/Crawler-Detect).

Failing that, just create an issue with the user agent you have found, and we'll take it from there :)

# Development

## Setup
```bash
$ poetry install
```

## Running tests
```bash
$ poetry run pytest
```

## Update crawlers from upstream PHP repo
```bash
$ ./update_data.sh
```

## Bump version
```bash
$ poetry run bump-my-version bump [patch|minor|major]
```

            
