httpeat is a recursive, parallel and multi-mirror/multi-proxy HTTP downloader.
![overview](doc/httpeat_overview_0.3.png)
# Usage
```
usage: httpeat.py [-h] [-A USER_AGENT] [-d] [-i] [-I] [-k] [-m MIRROR] [-P]
                  [-q] [-s SKIP] [-t TIMEOUT] [-T] [-v] [-w WAIT] [-x PROXY]
                  [-z TASKS_COUNT]
                  session_name [targets ...]

httpeat v0.2 - recursive, parallel and multi-mirror/multi-proxy HTTP downloader

positional arguments:
  session_name          name of the session
  targets               to create a session, provide URLs to HTTP indexes or files, or the path of a source txt file

options:
  -h, --help            show this help message and exit
  -A USER_AGENT, --user-agent USER_AGENT
                        user agent
  -d, --download-only   only download already listed files
  -i, --index-only      only list all files recursively, do not download
  -I, --index-debug     drop into an interactive ipython shell during indexing
  -k, --no-ssl-verify   do not verify the SSL certificate for HTTPS connections
  -m MIRROR, --mirror MIRROR
                        mirror definition to load balance requests, e.g. "http://host1/data/ mirrors http://host2/data/"
                        can be specified multiple times.
                        only valid upon session creation; afterwards you must modify the session's mirrors.txt.
  -P, --no-progress     disable progress bar
  -q, --quiet           quiet output, show only warnings
  -s SKIP, --skip SKIP  skip rule: dl-(path|size-gt):[pattern]. can be specified multiple times.
  -t TIMEOUT, --timeout TIMEOUT
                        in seconds, defaults to {TO_DEFAULT}
  -T, --no-index-touch  do not create empty .download files upon indexing
  -v, --verbose         verbose output, specify twice for HTTP request debug
  -w WAIT, --wait WAIT  wait after each request for n to n*3 seconds, for each task
  -x PROXY, --proxy PROXY
                        proxy URL: "(http[s]|socks5)://<host>:<port>[ tasks-count=N]"
                        can be specified multiple times to load balance downloads between proxies.
                        the optional tasks-count overrides the global tasks-count.
                        only valid upon session creation; afterwards you must modify the session's proxies.txt.
  -z TASKS_COUNT, --tasks-count TASKS_COUNT
                        number of parallel tasks, defaults to 3
```
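
The `-s` skip rules compose with the other options at session creation. Below is a minimal sketch of a combined invocation; the exact `dl-size-gt` size suffix and `dl-path` pattern syntax are assumptions based on the help text above.

```
# Hypothetical session "bigsession": crawl host1, skip files larger than 2G
# and any path matching "tmp", with 8 parallel tasks and a 30 second timeout.
# The "2G" suffix and the path pattern syntax are assumptions.
httpeat bigsession https://host1/data/ -s "dl-size-gt:2G" -s "dl-path:tmp" -z 8 -t 30
```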
## Session directory structure
```
<session_name>/
    log.txt
    state_download.csv
    state_index.csv
    targets.txt
    mirrors.txt
    proxies.txt
    data/
        ...downloaded files...
```
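
Mirrors and proxies can only be passed on the command line when the session is created; afterwards they live in the session's mirrors.txt and proxies.txt. Below is a sketch of adding a proxy to an existing session, assuming proxies.txt holds one `-x`-style proxy definition per line.

```
# Assumption: proxies.txt contains one proxy definition per line,
# in the same format as the -x option.
cat <<-_EOF >> mysession/proxies.txt
socks5://192.168.0.4:3000 tasks-count=2
_EOF
# resume the session, now also load balancing through the added proxy
httpeat mysession
```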
## Example usage
- crawl HTTP index page and linked files
`httpeat antennes https://ferme.ydns.eu/antennes/bands/2024-10/`
- resume after interrupt
`httpeat antennes`
- crawl HTTP index page, using mirror from host2
`httpeat bigfilesA https://host1/data/ -m "https://host2/data/ mirrors https://host1/data/"`
- crawl HTTP index page, using 2 proxies
`httpeat bigfilesB https://host1/data/ -x "socks5://192.168.0.2:3000" -x "socks5://192.168.0.3:3000"`
- crawl 2 HTTP index directory pages
`httpeat bigfilesC https://host1/data/one/ https://host1/data/six/`
- download 3 files
`httpeat bigfilesD https://host1/data/bigA.iso https://host1/data/six/bigB.iso https://host1/otherdata/bigC.iso`
- download 3 files with URLs from a txt file
```
cat <<-_EOF > ./list.txt
https://host1/data/bigA.iso
https://host1/data/six/bigB.iso
https://host1/otherdata/bigC.iso
_EOF
httpeat bigfilesE ./list.txt
```
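
The `-i` and `-d` flags also allow a two-phase workflow: index everything first, then download later, possibly with different options. The sketch below is built from the options documented above (the session name is hypothetical).

```
# first pass: only list all files recursively, do not download
httpeat archiveX https://host1/data/ -i
# later: download the already-listed files with more parallel tasks
httpeat archiveX -d -z 6
```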
# Limitations
files count:
- above approximately 1,000,000 files in the download queue, httpeat will start to eat your CPU.

live progress:
- showing live progress eats CPU, even though it is throttled to 0.5 frames per second. if this is too much for you, use -P / --no-progress.
- showing live progress while also enabling verbose messages with -v / --verbose may eat a lot of CPU, since the 'rich' library needs to process all the logs. try using -P / --no-progress when enabling verbose logs.
# Change log / todo list
```
v0.1
- while downloading store <file>.download, then rename when done
- improve index parser capability to handle unknown pages
- test that the URL "unquote" to path works, in download mode
- accept text file URL list as argument, also useful for testing
- store local files with full URL path including host
- existing session do not need URL of file list. prepare for "download from multiple hosts"
- retry immediately on download error
see "Retrying HTTPX Requests" https://scrapfly.io/blog/web-scraping-with-python-httpx/
for testing see https://github.com/Colin-b/pytest_httpx
- retry count per entry, then drop it and mark as error
- keep old states, in case last ones get corrupted
- maybe a log file with higher log level and timestamp? or at least time for start and end? (last option implemented)
- prevent SIGINT during CSV state file saving
v0.2
- hide beginning of URL on info print when single root prefix is identified
- unit tests for network errors
- fix progress update of indexer in download-only mode: store progress and its task id in State_*
  and update in indexer/downloader
- argument to skip gt size
- fix modification date of downloaded files when doing final mv. don't fix directories for now
- add rich line for current file of each download task: name, size, retry count
- progress download bar should show size, and file count as additional numbers
- progress bar should be black and white
- progress bars should display bytes per second for download
- display file path instead of URL after download completed
- display file size after path after download completed
- handle file names len > 255
- create all .download empty files during indexing, option to disable
- download from multiple (2?) mirrors
- fix bug with state_dl size progress, grows much too fast
- download from multiple proxies
- configurable user agent
v0.3
- fix 'rich' flickering on dl workers progress, by creating Group after all progress add_task() are performed.
- fix download size estimation for completed and total, by correctly handling in-progress files on startup.
- fix handling of SIGTERM, by redirecting it to raise SIGINT
- fix show 'index' line all the time, even if nothing to do
- fix dl/idx progress bar position to match dl workers
- display errors count on dl progress bar
- print download stats at end of session
- cleanup code and review documentation
- package with pyproject.toml
- public version
TODO v1.0
TODO cleanup and review
TODO v1.1
TODO when size is not found in index, perform HEAD requests in indexer
TODO directories mtime from index
TODO profile code to see if we can improve performance with large download lists / CSV
```