httpeat

Name: httpeat
Version: 0.3
Summary: a recursive, parallel and multi-mirror/multi-proxy HTTP downloader
Author: Laurent Ghigonis <ooookiwi@protonmail.com>
Homepage: https://github.com/looran/httpeat
Upload time: 2024-12-02 14:12:08
Requires Python: >=3.0
License: BSD-3-Clause
Keywords: http, downloader, recursive, parallel, mirror, proxy
Requirements: nodriver, httpx, bs4, lxml, python-dateutil, rich, humanfriendly, tenacity
httpeat is a recursive, parallel and multi-mirror/multi-proxy HTTP downloader.

![overview](doc/httpeat_overview_0.3.png)

# Usage

```
usage: httpeat.py [-h] [-A USER_AGENT] [-d] [-i] [-I] [-k] [-m MIRROR] [-P]
                  [-q] [-s SKIP] [-t TIMEOUT] [-T] [-v] [-w WAIT] [-x PROXY]
                  [-z TASKS_COUNT]
                  session_name [targets ...]

httpeat v0.3 - recursive, parallel and multi-mirror/multi-proxy HTTP downloader

positional arguments:
  session_name          name of the session
  targets               to create a session, provide URLs to HTTP index or files, or path of a source txt file

options:
  -h, --help            show this help message and exit
  -A USER_AGENT, --user-agent USER_AGENT
                        user agent
  -d, --download-only   only download already listed files
  -i, --index-only      only list all files recursively, do not download
  -I, --index-debug     drop in interactive ipython shell during indexing
  -k, --no-ssl-verify   do not verify the SSL certificate for HTTPS connections
  -m MIRROR, --mirror MIRROR
                        mirror definition to load balance requests, e.g. "http://host1/data/ mirrors http://host2/data/"
                        can be specified multiple times.
                        only valid upon session creation, afterwards you must modify the session's mirrors.txt.
  -P, --no-progress     disable progress bar
  -q, --quiet           quiet output, show only warnings
  -s SKIP, --skip SKIP  skip rule: dl-(path|size-gt):[pattern]. can be specified multiple times.
  -t TIMEOUT, --timeout TIMEOUT
                        in seconds, defaults to {TO_DEFAULT}
  -T, --no-index-touch  do not create empty .download files upon indexing
  -v, --verbose         verbose output, specify twice for http request debug
  -w WAIT, --wait WAIT  wait after request for n to n*3 seconds, for each task
  -x PROXY, --proxy PROXY
                        proxy URL: "(http[s]|socks5)://<host>:<port>[ tasks-count=N]"
                        can be specified multiple times to load balance downloads between proxies.
                        optional tasks-count overrides the global tasks-count.
                        only valid upon session creation, afterwards you must modify the session's proxies.txt.
  -z TASKS_COUNT, --tasks-count TASKS_COUNT
                        number of parallel tasks, defaults to 3
```

## session directory structure
```
<session_name>/
   log.txt
   state_download.csv
   state_index.csv
   targets.txt
   mirrors.txt
   proxies.txt
   data/
      ...downloaded files...
```
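
Downloaded files are stored under `data/` using the full URL path including the host (see the v0.1 change log entry), so `https://host1/data/bigA.iso` would presumably end up as `data/host1/data/bigA.iso`.

`mirrors.txt` and `proxies.txt` are only read after session creation. Assuming they hold one entry per line in the same syntax as the `-m` and `-x` options (a guess, the file format is not documented here), a session pulling from host1 through one mirror and two proxies might look like:
```
mirrors.txt:
https://host2/data/ mirrors https://host1/data/

proxies.txt:
socks5://192.168.0.2:3000 tasks-count=2
socks5://192.168.0.3:3000
```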

## Example usage

- crawl HTTP index page and linked files
`httpeat antennes https://ferme.ydns.eu/antennes/bands/2024-10/`

- resume after interrupt
`httpeat antennes`

- crawl HTTP index page, using mirror from host2
`httpeat bigfilesA https://host1/data/ -m "https://host2/data/ mirrors https://host1/data/"`

- crawl HTTP index page, using 2 proxies
`httpeat bigfilesB https://host1/data/ -x "socks5://192.168.0.2:3000" -x "socks5://192.168.0.3:3000"`

- crawl 2 HTTP index directory pages
`httpeat bigfilesC https://host1/data/one/ https://host1/data/six/`

- download 3 files
`httpeat bigfilesD https://host1/data/bigA.iso https://host1/data/six/bigB.iso https://host1/otherdata/bigC.iso`

- download 3 files with URLs from txt file
```
cat <<-_EOF > ./list.txt
https://host1/data/bigA.iso
https://host1/data/six/bigB.iso
https://host1/otherdata/bigC.iso
_EOF
httpeat bigfilesE ./list.txt
```
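
- skip some downloads with skip rules and raise the task count (a sketch only: the exact pattern syntax of `dl-path` and `dl-size-gt` is not documented above, a path fragment and a byte count are assumed)
`httpeat bigfilesF https://host1/data/ -s "dl-path:tmp/" -s "dl-size-gt:1000000000" -z 6`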

# Limitations

files count:
- above approximately 1 000 000 files in the download queue, httpeat will start to eat your CPU.

live progress:
- showing live progress eats CPU, even though we throttle it to 0.5 frames per second. if it is too much for you, use -P / --no-progress.
- showing live progress while enabling verbose messages with -v / --verbose may eat a lot of CPU, since the 'rich' library needs to process all the logs. try using -P / --no-progress when enabling verbose logs.

# Change log / todo list

```
v0.1
- while downloading store <file>.download, then rename when done
- improve index parser capability to handle unknown pages
- test that the URL "unquote" to path works, in download mode
- accept text file URL list as argument, also useful for testing
- store local files with full URL path including host
- existing session do not need URL of file list. prepare for "download from multiple hosts"
- retry immediately on download error
  see "Retrying HTTPX Requests" https://scrapfly.io/blog/web-scraping-with-python-httpx/
  for testing see https://github.com/Colin-b/pytest_httpx
- retry count per entry, then drop it and mark as error
- keep old states, in case last ones get corrupted
- maybe log file with higher log level and timestamp? or at least time for start and end? (last option implemented)
- prevent SIGINT during CSV state file saving

v0.2
- hide beginning of URL on info print when single root prefix is identified
- unit tests for network errors
- fix progress update of indexer in download-only mode: store progress and its task id in State_*
  and update in indexer/downloader
- argument to skip gt size
- fix modification date of downloaded files when doing final mv. don't fix directories for now
- add rich line for current file of each download task: name, size, retry count
- progress download bar should show size, and file count as additional numbers
- progress bar should be black and white
- progress bars should display bytes per second for download
- display file path instead of URL after download completed
- display file size after path after download completed
- handle file names len > 255
- create all .download empty files during indexing, option to disable
- download from multiple (2?) mirrors
- fix bug with state_dl size progress, grows much too fast
- download from multiple proxies
- configurable user agent

v0.3
- fix 'rich' flickering on dl workers progress, by creating Group after all progress add_task() are performed.
- fix download size estimation for completed and total, by correctly handling in-progress files on startup.
- fix handling of SIGTERM, by redirecting it to raise SIGINT
- fix show 'index' line all the time, even if nothing to do
- fix dl/idx progress bar position to match dl workers
- display errors count on dl progress bar
- print download stats at end of session
- cleanup code and review documentation
- package with pyproject.toml
- public version

TODO v1.0
TODO cleanup and review

TODO v1.1
TODO when size is not found in index, perform HEAD requests in indexer
TODO directories mtime from index
TODO profile code to see if we can improve performance with large download lists / CSV
```
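
The immediate-retry behaviour noted in the v0.1 change log (retry on download error, with a per-entry retry count before the file is dropped and marked as error) builds on httpx and tenacity, both listed in the requirements. A minimal, hypothetical sketch of that pattern, not httpeat's actual code:

```
# hypothetical sketch of retrying an httpx download with tenacity;
# names (fetch, MAX_RETRIES) are illustrative, not httpeat's API
import os

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt

MAX_RETRIES = 3

@retry(stop=stop_after_attempt(MAX_RETRIES),
       retry=retry_if_exception_type(httpx.TransportError),
       reraise=True)
def fetch(url, dest):
    # stream the body to "<dest>.download", then rename when done
    # (mirrors the "<file>.download then rename" behaviour from v0.1)
    with httpx.stream("GET", url, follow_redirects=True, timeout=30) as r:
        r.raise_for_status()
        with open(dest + ".download", "wb") as f:
            for chunk in r.iter_bytes():
                f.write(chunk)
    os.replace(dest + ".download", dest)

fetch("https://host1/data/bigA.iso", "bigA.iso")
```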

            
