Name: Cr0wl3r
Version: 1.0.1
Home page: https://github.com/mauricelambert/Cr0wl3r
Summary: This module implements a crawler to find all links and resources
Upload time: 2023-10-15 07:58:43
Maintainer: Maurice Lambert
Author: Maurice Lambert
Requires Python: >=3.8
License: GPL-3.0 License
Keywords: crawler, scraper, scan, web, pentest, discovery, security, selenium, url, uri
Requirements: none recorded
![Cr0wl3r logo](https://mauricelambert.github.io/info/python/security/Cr0wl3r_small.png "Cr0wl3r logo")

# Cr0wl3r

## Description

This package implements a discreet web crawler that finds all visible URLs on a website. The crawler can store pages (and reuse them for the next crawl), scan web content for dynamic content (useful for pentesting, red teaming and hacking), build a full JSON report and a database so the analysis can be reused, and try to classify web pages, static content and assets so that only useful resources are requested.

> The name *Cr0wl3r* is a pun on *Crawler* and *Growler*: the tool is not offensive in itself, but it is the first step in attacking a web server.

## Requirements

This package requires:
 - python3
 - python3 Standard Library

Optional:
 - Selenium

## Installation

```bash
pip install Cr0wl3r 
```
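
Selenium support is optional (see the Requirements above). If you want it, it can be installed separately; this assumes the standard `selenium` package from PyPI.

```bash
# Optional: Selenium support for Cr0wl3r
pip install selenium
```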

## Usages

### Command lines

```bash
# Python executable
python3 Cr0wl3r.pyz -h
# or
chmod u+x Cr0wl3r.pyz
./Cr0wl3r.pyz --help

# Python module
python3 -m Cr0wl3r https://github.com/mauricelambert

# Entry point (console)
Cr0wl3r -F report.json -L DEBUG -l logs.log -R -S -d -c "mycookie=foobar" -H "User-Agent:Chrome" -m 3 -t "p" -r https://github.com/mauricelambert
Cr0wl3r -R -S -C -d -u -i -F report.json -L DEBUG -l logs.log -c "mycookie=foobar" "session=abc" -c "counter=5" -H "User-Agent:Chrome" "Api-Key:myapikey" -H "Authorization:Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==" -m 5 -t "p" "img" -t "link" -I 3.5 -f "raw-only-url" -D4 "text/html" -r https://github.com/mauricelambert
```
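
The first entry-point command above, rewritten with the long option names listed in the help output further down, is easier to read (same options, long spellings):

```bash
Cr0wl3r --report-filename report.json --loglevel DEBUG --logfile logs.log \
    --no-robots --no-sitemap --not-only-domain \
    --cookie "mycookie=foobar" --headers "User-Agent:Chrome" \
    --max-request 3 --tags-counter "p" --recursive \
    https://github.com/mauricelambert
```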

### Python3

```python
from Cr0wl3r import CrawlerRawPrinter

CrawlerRawPrinter(
    "https://github.com/mauricelambert",
    recursive=False,
).crawl()
```
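
The built-in printer can presumably be tuned with the same keyword arguments as the `_Crawler` constructor shown in the next example; this is a sketch under that assumption, not something the README confirms.

```python
from Cr0wl3r import CrawlerRawPrinter

# Sketch only: assumes CrawlerRawPrinter forwards these keyword arguments
# to _Crawler (recursive, max_request, headers, interval).
CrawlerRawPrinter(
    "https://github.com/mauricelambert",
    recursive=True,
    max_request=10,                    # stop after 10 requests
    headers={"User-Agent": "Chrome"},  # extra request headers
    interval=3.5,                      # seconds between requests per domain
).crawl()
```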

```python
from ssl import _create_unverified_context
from Cr0wl3r import _Crawler, reports
from logging import basicConfig
from typing import Union

basicConfig(level=1)

class CustomCr0wl3r(_Crawler):
    def handle_web_page(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New web page:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

    def handle_static(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New static:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

    def handle_resource(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New assets:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

cr0wl3r = CustomCr0wl3r(
    "https://github.com/mauricelambert",
    recursive=True,                        # crawl URLs recursively
    update=True,                           # re-download and overwrite stored responses
    max_request=10,                        # maximum number of requests to perform
    only_domain=False,                     # request all domains, not only the base URL domain
    headers={"User-Agent": "Chrome", "Cookie": "mycookie=abc"},  # custom headers and cookie
    robots=False,                          # do not request robots.txt
    sitemap=False,                         # do not request sitemap.xml
    crossdomain=False,                     # do not request crossdomain.xml
    context=_create_unverified_context(),  # insecure SSL context
    interval=3.5,                          # seconds between requests per domain
    download_policy="do not download",     # do not store any response
)
cr0wl3r.crawl()

with open("urls.txt", 'w') as report:
    for url in reports:
        report.write(url + '\n')
```
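
Since `reports` is iterated as plain URL strings above, the collected URLs can also be post-processed before writing; a minimal sketch, assuming `reports` yields strings:

```python
# Deduplicate and sort the crawled URLs before writing them out.
unique_urls = sorted(set(reports))

with open("urls_sorted.txt", "w") as report_file:
    for url in unique_urls:
        report_file.write(url + "\n")
```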

## Links

 - [Github Page](https://github.com/mauricelambert/Cr0wl3r)
 - [Pypi](https://pypi.org/project/Cr0wl3r/)
 - [Documentation](https://mauricelambert.github.io/info/python/security/Cr0wl3r.html)
 - [Python Executable](https://mauricelambert.github.io/info/python/security/Cr0wl3r.pyz)
 - [Windows Python Executable](https://mauricelambert.github.io/info/python/security/Cr0wl3r.exe)

## Help

```text
~# python3 Cr0wl3r.py -h
usage: Cr0wl3r.py [-h] [--recursive] [--update] [--insecure] [--do-not-request-robots] [--do-not-request-sitemap] [--do-not-request-crossdomain]
                  [--not-only-domain] [--max-request MAX_REQUEST] [--cookie COOKIE] [--headers HEADERS [HEADERS ...]]
                  [--tags-counter TAGS_COUNTER [TAGS_COUNTER ...]] [--report-filename REPORT_FILENAME] [--loglevel {WARNING,CRITICAL,DEBUG,INFO,ERROR}]
                  [--logfile LOGFILE] [--interval-request INTERVAL_REQUEST] [--output-format {raw,colored,raw-only-url}]
                  [--download-all | --download-html | --download-static | --download-resources | --download-by-content-type DOWNLOAD_BY_CONTENT_TYPE | --download-requested | --do-not-download]
                  url

This script crawls web site and prints URLs.

positional arguments:
  url                   First URL to crawl.

options:
  -h, --help            show this help message and exit
  --recursive, -r       Crawl URLs recursively.
  --update, -u          Re-downloads and overwrites responses from requests made during previous crawls.
  --insecure, -i        Use insecure SSL (support selenium and urllib)
  --do-not-request-robots, --no-robots, -R
                        Don't search, request and parse robots.txt
  --do-not-request-sitemap, --no-sitemap, -S
                        Don't search, request and parse sitemap.xml
  --do-not-request-crossdomain, --no-crossdomain, -C
                        Don't search, request and parse crossdomain.xml
  --not-only-domain, -d
                        Do not request only the base URL domain (request all domains).
  --max-request MAX_REQUEST, -m MAX_REQUEST
                        Maximum request to perform.
  --cookie COOKIE, -c COOKIE
                        Add a cookie.
  --headers HEADERS [HEADERS ...], -H HEADERS [HEADERS ...]
                        Add headers.
  --tags-counter TAGS_COUNTER [TAGS_COUNTER ...], --tags TAGS_COUNTER [TAGS_COUNTER ...], -t TAGS_COUNTER [TAGS_COUNTER ...]
                        Add a tag counter for scoring.
  --report-filename REPORT_FILENAME, --report REPORT_FILENAME, -F REPORT_FILENAME
                        The JSON report filename.
  --loglevel {WARNING,CRITICAL,DEBUG,INFO,ERROR}, -L {WARNING,CRITICAL,DEBUG,INFO,ERROR}
                        WebSiteCloner logs level.
  --logfile LOGFILE, -l LOGFILE
                        WebCrawler logs file.
  --interval-request INTERVAL_REQUEST, --interval INTERVAL_REQUEST, -I INTERVAL_REQUEST
                        Interval between each requests by domain.
  --output-format {raw,colored,raw-only-url}, --format {raw,colored,raw-only-url}, -f {raw,colored,raw-only-url}
                        Output format.
  --download-all, --download, -D, -D0
                        Download (store) all responses
  --download-html, --dh, -D1
                        Download (store) only HTML responses
  --download-static, --ds, -D2
                        Download (store) only static files (HTML, CSS, JavaScript)
  --download-resources, --dr, -D3
                        Download (store) only resources files (images, documents, icon...)
  --download-by-content-type DOWNLOAD_BY_CONTENT_TYPE, --dct DOWNLOAD_BY_CONTENT_TYPE, -D4 DOWNLOAD_BY_CONTENT_TYPE
                        Download (store) only responses with Content-Type that contains this value
  --download-requested, --dR, -D5
                        Download all requests responses and try to requests only Web page
  --do-not-download, --dN, -D6
                        Try to requests only Web page and do not download
~# 
```
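
The mutually exclusive download policies can be combined with any crawl; a few examples based on the options above (the target URL is just the one used earlier, and the Content-Type value is illustrative):

```bash
# Store only HTML responses
Cr0wl3r --recursive --download-html https://github.com/mauricelambert

# Store only responses whose Content-Type contains "application/json"
Cr0wl3r --recursive --download-by-content-type "application/json" https://github.com/mauricelambert

# Identify web pages, request only those, and store nothing
Cr0wl3r --recursive --do-not-download https://github.com/mauricelambert
```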

## Licence

Licensed under the [GPL, version 3](https://www.gnu.org/licenses/).

            
