![Cr0wl3r logo](https://mauricelambert.github.io/info/python/security/Cr0wl3r_small.png "Cr0wl3r logo")
# Cr0wl3r
## Description
This package implements a discreet web crawler that finds all visible URLs on a website. The crawler can store pages (and reuse them for the next crawl), scan web content for dynamic content (useful for pentesting, red teaming and hacking), produce a full JSON report and a database so the analysis can be reused, and try to identify web pages, static content and assets in order to request only what is useful.
> The name *Cr0wl3r* is a pun on *Crawler* and *Growler*: the tool is not offensive by itself, but it is the first step in attacking a web server.
## Requirements
This package requires:

 - python3
 - python3 Standard Library

Optional:

 - Selenium
## Installation
```bash
pip install Cr0wl3r
```
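Selenium support is optional. If you want it, installing the standard `selenium` package from PyPI alongside Cr0wl3r should be enough (this assumes Cr0wl3r detects an installed Selenium automatically, which is not spelled out here):

```bash
# Optional: add Selenium support (assumed to be the standard PyPI selenium package)
pip install Cr0wl3r selenium
```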
## Usage
### Command line
```bash
# Python executable
python3 Cr0wl3r.pyz -h
# or
chmod u+x Cr0wl3r.pyz
./Cr0wl3r.pyz --help

# Python module
python3 -m Cr0wl3r https://github.com/mauricelambert

# Entry point (console)
Cr0wl3r -F report.json -L DEBUG -l logs.log -R -S -d -c "mycookie=foobar" -H "User-Agent:Chrome" -m 3 -t "p" -r https://github.com/mauricelambert
Cr0wl3r -R -S -C -d -u -i -F report.json -L DEBUG -l logs.log -c "mycookie=foobar" "session=abc" -c "counter=5" -H "User-Agent:Chrome" "Api-Key:myapikey" -H "Authorization:Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==" -m 5 -t "p" "img" -t "link" -I 3.5 -f "raw-url-only" -D4 "text/html" -q -r https://github.com/mauricelambert
```
### Python3
```python
from Cr0wl3r import CrawlerRawPrinter

CrawlerRawPrinter(
    "https://github.com/mauricelambert",
    recursive=False,
).crawl()
```
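The printer classes can be tuned further; a minimal sketch, assuming `CrawlerRawPrinter` accepts the same keyword arguments as `_Crawler` in the example below (`recursive`, `max_request`, `interval`, `headers`):

```python
from Cr0wl3r import CrawlerRawPrinter

# Assumption: CrawlerRawPrinter forwards these keyword arguments
# to the base crawler (see the _Crawler subclass example below).
crawler = CrawlerRawPrinter(
    "https://github.com/mauricelambert",
    recursive=True,                    # follow the URLs that are discovered
    max_request=10,                    # stop after 10 requests
    interval=3.5,                      # seconds between requests to the same domain
    headers={"User-Agent": "Chrome"},  # extra request headers
)
crawler.crawl()
```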
```python
from ssl import _create_unverified_context
from Cr0wl3r import _Crawler, reports
from logging import basicConfig
from typing import Union

basicConfig(level=1)

class CustomCr0wl3r(_Crawler):
    def handle_web_page(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New web page:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

    def handle_static(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New static:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

    def handle_resource(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New assets:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

cr0wl3r = CustomCr0wl3r(
    "https://github.com/mauricelambert",
    recursive=True,
    update=True,
    max_request=10,
    only_domain=False,
    headers={"User-Agent": "Chrome", "Cookie": "mycookie=abc"},
    robots=False,
    sitemap=False,
    crossdomain=False,
    context=_create_unverified_context(),
    interval=3.5,
    download_policy="do not download",
    no_query_page=False,
)
cr0wl3r.crawl()

with open("urls.txt", 'w') as report:
    for url in reports:
        report.write(url + '\n')
```
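If you only need the list of discovered URLs, you can post-process `reports` before writing it out; a small sketch, assuming `reports` iterates over plain URL strings as in the snippet above:

```python
from Cr0wl3r import reports

# Assumption: `reports` yields plain URL strings, as used above.
unique_urls = sorted(set(reports))

with open("urls.txt", "w") as report_file:
    for url in unique_urls:
        report_file.write(url + "\n")
```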
## Links
- [Github Page](https://github.com/mauricelambert/Cr0wl3r)
- [Pypi](https://pypi.org/project/Cr0wl3r/)
- [Documentation](https://mauricelambert.github.io/info/python/security/Cr0wl3r.html)
- [Python Executable](https://mauricelambert.github.io/info/python/security/Cr0wl3r.pyz)
- [Windows Python Executable](https://mauricelambert.github.io/info/python/security/Cr0wl3r.exe)
## Help
```text
~# Cr0wl3r --help
usage: Cr0wl3r [-h] [--recursive] [--update] [--insecure] [--do-not-request-robots] [--do-not-request-sitemap] [--do-not-request-crossdomain] [--not-only-domain] [--max-request MAX_REQUEST] [--cookies COOKIES [COOKIES ...]]
               [--headers HEADERS [HEADERS ...]] [--dynamic-tags-counter DYNAMIC_TAGS_COUNTER [DYNAMIC_TAGS_COUNTER ...]] [--report-filename REPORT_FILENAME] [--loglevel {DEBUG,INFO,REQUEST,WARNING,ERROR,CRITICAL}] [--logfile LOGFILE]
               [--interval-request INTERVAL_REQUEST] [--output-format {raw-url-only,colored,raw}] [--no-query-page]
               [--download-all | --download-html | --download-static | --download-resources | --download-by-content-type DOWNLOAD_BY_CONTENT_TYPE | --download-requested | --do-not-download]
               url

This script crawls web site and prints URLs.

positional arguments:
  url                   First URL to crawl.

options:
  -h, --help            show this help message and exit
  --recursive, -r       Crawl URLs recursively.
  --update, -u          Re-downloads and overwrites responses from requests made during previous crawls.
  --insecure, -i        Use insecure SSL (support selenium and urllib)
  --do-not-request-robots, --no-robots, -R
                        Don't search, request and parse robots.txt
  --do-not-request-sitemap, --no-sitemap, -S
                        Don't search, request and parse sitemap.xml
  --do-not-request-crossdomain, --no-crossdomain, -C
                        Don't search, request and parse crossdomain.xml
  --not-only-domain, -d
                        Do not request only the base URL domain (request all domains).
  --max-request MAX_REQUEST, -m MAX_REQUEST
                        Maximum request to perform.
  --cookies COOKIES [COOKIES ...], -c COOKIES [COOKIES ...]
                        Add a cookie.
  --headers HEADERS [HEADERS ...], -H HEADERS [HEADERS ...]
                        Add headers.
  --dynamic-tags-counter DYNAMIC_TAGS_COUNTER [DYNAMIC_TAGS_COUNTER ...], --tags-counter DYNAMIC_TAGS_COUNTER [DYNAMIC_TAGS_COUNTER ...], --tags DYNAMIC_TAGS_COUNTER [DYNAMIC_TAGS_COUNTER ...], -t DYNAMIC_TAGS_COUNTER [DYNAMIC_TAGS_COUNTER ...]
                        Add a tag counter for scoring.
  --report-filename REPORT_FILENAME, --report REPORT_FILENAME, -F REPORT_FILENAME
                        The JSON report filename.
  --loglevel {DEBUG,INFO,REQUEST,WARNING,ERROR,CRITICAL}, -L {DEBUG,INFO,REQUEST,WARNING,ERROR,CRITICAL}
                        WebCrawler logs level.
  --logfile LOGFILE, -l LOGFILE
                        WebCrawler logs file.
  --interval-request INTERVAL_REQUEST, --interval INTERVAL_REQUEST, -I INTERVAL_REQUEST
                        Interval between each requests by domain.
  --output-format {raw-url-only,colored,raw}, --format {raw-url-only,colored,raw}, -f {raw-url-only,colored,raw}
                        Output format.
  --no-query-page, --no-query, -q
                        Request only when path is different, without this option the same path will be requested for each differents queries.
  --download-all, --download, -D, -D0
                        Download (store) all responses
  --download-html, --dh, -D1
                        Download (store) only HTML responses
  --download-static, --ds, -D2
                        Download (store) only static files (HTML, CSS, JavaScript)
  --download-resources, --dr, -D3
                        Download (store) only resources files (images, documents, icon...)
  --download-by-content-type DOWNLOAD_BY_CONTENT_TYPE, --dct DOWNLOAD_BY_CONTENT_TYPE, -D4 DOWNLOAD_BY_CONTENT_TYPE
                        Download (store) only responses with Content-Type that contains this value
  --download-requested, --dR, -D5
                        Download all requests responses and try to requests only Web page
  --do-not-download, --dN, -D6
                        Try to requests only Web page and do not download

~#
```
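For readability, the first entry-point command from the *Usage* section can be rewritten with the long options listed above (same behaviour, only the spelling of the flags changes):

```bash
Cr0wl3r --report-filename report.json --loglevel DEBUG --logfile logs.log \
    --no-robots --no-sitemap --not-only-domain \
    --cookies "mycookie=foobar" --headers "User-Agent:Chrome" \
    --max-request 3 --tags "p" --recursive \
    https://github.com/mauricelambert
```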
## Licence
Licensed under the [GPL, version 3](https://www.gnu.org/licenses/).