pywaybackup


- Name: pywaybackup
- Version: 1.0.1
- Home page: https://github.com/bitdruid/python-wayback-machine-downloader
- Summary: Download snapshots from the Wayback Machine
- Upload time: 2024-04-22 07:11:16
- Maintainer: None
- Docs URL: None
- Author: bitdruid
- Requires Python: None
- License: MIT
- Keywords: wayback machine internet archive
- Requirements: No requirements were recorded.

# archive wayback downloader

[![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
![Python Version](https://img.shields.io/badge/Python-3.6-blue)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Download archived web pages from the [Wayback Machine](https://archive.org/web/).

The Internet Archive is a useful source of OSINT information. This script is a work in progress for querying and fetching archived web pages.

## Installation

### Pip

1. Install the package <br>
   ```pip install pywaybackup```
2. Run the script <br>
   ```waybackup -h```

### Manual

1. Clone the repository <br>
   ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
2. Install from inside the cloned directory <br>
   ```pip install .```
   - in a virtual environment, or pass `--break-system-packages` to pip

## Usage

This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

### Arguments

- `-h`, `--help`: Show the help message and exit.
- `-a`, `--about`: Show information about the script and exit.

#### Required Arguments

- `-u`, `--url`: The URL of the web page to download. This argument is required.

#### Mode Selection (Choose One)

- `-c`, `--current`: Download the latest snapshot of each file. You will get a rebuild of the current website with all available files. This is not any single original state of the site, because newer and older versions are mixed.
- `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
- `-s`, `--save`: Save a page to the Wayback Machine. (beta)
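
The difference between `--current` and `--full` can be pictured with a small sketch. This is only an illustration of the two selection strategies, not the script's actual code: the snapshot list and the helper names `select_current` and `group_full` are made up for the example.

```python
from collections import defaultdict

# Hypothetical snapshot list: (timestamp, url) pairs as a CDX query might return them.
snapshots = [
    ("20200101000000", "http://example.com/index.html"),
    ("20200101000000", "http://example.com/style.css"),
    ("20210601000000", "http://example.com/index.html"),
]


def select_current(snapshots):
    """Keep only the newest timestamp per URL (the idea behind --current)."""
    latest = {}
    for timestamp, url in snapshots:
        # 14-digit timestamps sort correctly as plain strings.
        if url not in latest or timestamp > latest[url]:
            latest[url] = timestamp
    return latest


def group_full(snapshots):
    """Group URLs by timestamp, one folder per timestamp (the idea behind --full)."""
    by_timestamp = defaultdict(list)
    for timestamp, url in snapshots:
        by_timestamp[timestamp].append(url)
    return dict(by_timestamp)


print(select_current(snapshots))
print(group_full(snapshots))
```

In the `select_current` result, `index.html` comes from 2021 and `style.css` from 2020, which is why a `--current` download mixes newer and older versions.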

#### Optional Arguments

- `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
- `-e`, `--explicit`: Only download the exact URL given. No wildcard subdomains or paths.
- `-o`, `--output`: The folder where downloaded files will be saved.

- **Range Selection:**<br>
Specify the range either in years or as timestamps for the start, the end, or both. If you specify the `range` argument, the `start` and `end` arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give just a year, or increase precision by adding further fields from left to right.<br>
(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
   - `-r`, `--range`: Specify the range in years for which to search and download snapshots.
   - `--start`: Timestamp to start searching.
   - `--end`: Timestamp to end searching.
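
To make the left-to-right timestamp format concrete, here is a minimal sketch that pads a partial timestamp to the full 14 digits. How the script itself handles partial timestamps is not documented here, so treat `expand_timestamp` as a hypothetical helper that only illustrates the format.

```python
import calendar


def expand_timestamp(prefix: str, end: bool = False) -> str:
    """Pad a partial timestamp (YYYY, YYYYMM, ... YYYYMMDDhhmmss) to 14 digits.

    With end=False the missing fields get their earliest value,
    with end=True their latest value.
    """
    if not prefix.isdigit() or len(prefix) not in (4, 6, 8, 10, 12, 14):
        raise ValueError(f"not a valid timestamp prefix: {prefix!r}")

    year = int(prefix[0:4])
    month = int(prefix[4:6]) if len(prefix) >= 6 else (12 if end else 1)
    if len(prefix) >= 8:
        day = int(prefix[6:8])
    else:
        # Last day of the month depends on month and leap year.
        day = calendar.monthrange(year, month)[1] if end else 1
    hour = int(prefix[8:10]) if len(prefix) >= 10 else (23 if end else 0)
    minute = int(prefix[10:12]) if len(prefix) >= 12 else (59 if end else 0)
    second = int(prefix[12:14]) if len(prefix) >= 14 else (59 if end else 0)

    return f"{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}"


print(expand_timestamp("2019"))              # earliest moment of 2019
print(expand_timestamp("201902", end=True))  # latest moment of February 2019
```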

#### Additional

- `--csv`: Save a CSV file with the list of snapshots inside the output folder or a specified folder. If you set `--list`, the CSV will contain the CDX list of snapshots. If you set either `--current` or `--full`, the CSV will contain the downloaded files.
- `--no-redirect`: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (`-c`).
- `--verbosity`: Set the verbosity: json (print json response), progress (show progress bar).
- `--retry`: Retry failed downloads. You can specify the number of retry attempts as an integer.
- `--workers`: The number of workers to use for downloading (simultaneous downloads). Default is 1. About 10 workers is a safe value. Beware: using too many workers will lead to refused connections from the Wayback Machine. The refusals last about 1.5 minutes.
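
How `--workers` and `--retry` interact can be sketched with a thread pool that retries each failed download a fixed number of times. This is an assumption about the general technique, not the script's implementation: `download_all`, `fetch_with_retry` and the `fetch` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, retries):
    """Call fetch(url), retrying up to `retries` extra times (retries >= 0)."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return fetch(url)
        except OSError as error:  # includes ConnectionError (refused connections)
            last_error = error
    raise last_error


def download_all(urls, fetch, workers=1, retries=0):
    """Download all urls with `workers` simultaneous downloads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            url: pool.submit(fetch_with_retry, fetch, url, retries) for url in urls
        }
    # Leaving the with-block waits for every download to finish.
    results, failed = {}, {}
    for url, future in futures.items():
        error = future.exception()
        if error is None:
            results[url] = future.result()
        else:
            failed[url] = error
    return results, failed
```

Here `fetch` stands for any function that takes a URL and returns the downloaded content, raising an `OSError` subclass on failure. Keeping `workers` small limits how many connections are open at once, which is the point of the warning above.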

### Examples

Download latest snapshot of all files:<br>
`waybackup -u http://example.com -c`

Download latest snapshot of all files with retries:<br>
`waybackup -u http://example.com -c --retry 3`

Download all snapshots sorted per timestamp with a specified range and do not follow redirects:<br>
`waybackup -u http://example.com -f -r 5 --no-redirect`

Download all snapshots sorted per timestamp with a specified range and save to a specified folder with 3 workers:<br>
`waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3`

Download all snapshots from 2020 to 12th of December 2022 with 4 workers, save a CSV and show a progress bar:<br>
`waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --csv --verbosity progress`

Download all snapshots and output a json response:<br>
`waybackup -u http://example.com -f --verbosity json`

List available snapshots per timestamp without downloading and save a CSV file to a specified folder:<br>
`waybackup -u http://example.com -f -l --csv /home/user/Downloads`
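
If you want to call the script from Python, a small wrapper can assemble the command line. The sketch below uses only the flags documented above. `build_waybackup_command` is a hypothetical helper, and it assumes `waybackup` is installed and on your `PATH`.

```python
import shlex
import subprocess


def build_waybackup_command(url, mode="current", start=None, end=None,
                            output=None, workers=None, retry=None):
    """Build an argument list for the waybackup CLI from documented flags."""
    mode_flags = {"current": "-c", "full": "-f"}
    if mode not in mode_flags:
        raise ValueError(f"unknown mode: {mode!r}")

    command = ["waybackup", "-u", url, mode_flags[mode]]
    if start is not None:
        command += ["--start", str(start)]
    if end is not None:
        command += ["--end", str(end)]
    if output is not None:
        command += ["-o", str(output)]
    if workers is not None:
        command += ["--workers", str(workers)]
    if retry is not None:
        command += ["--retry", str(retry)]
    return command


command = build_waybackup_command(
    "http://example.com", mode="full", start="2020", end="20221212", workers=4
)
print(shlex.join(command))
# To actually run it (requires waybackup to be installed):
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.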

## Contributing

Feature requests that improve the usability of this script are always welcome.
Feel free to make suggestions and report issues. The project is still far from perfect.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/bitdruid/python-wayback-machine-downloader",
    "name": "pywaybackup",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "wayback machine internet archive",
    "author": "bitdruid",
    "author_email": "bitdruid@outlook.com",
    "download_url": "https://files.pythonhosted.org/packages/fd/04/c0cf6667d772248cac734415d868d8bf76653222285d331205c2af116dc1/pywaybackup-1.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Download snapshots from the Wayback Machine",
    "version": "1.0.1",
    "project_urls": {
        "Homepage": "https://github.com/bitdruid/python-wayback-machine-downloader"
    },
    "split_keywords": [
        "wayback",
        "machine",
        "internet",
        "archive"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3e026bb065e06a83f99c85f8feb10538f680e760b8e34991bd7e9d627217b270",
                "md5": "16d2da0b8ff9343fc40689d6534090c3",
                "sha256": "527356e5cece2e7ce7fbd7cdef274a92d166310e77f0ea3dd4337a3926ee79d6"
            },
            "downloads": -1,
            "filename": "pywaybackup-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "16d2da0b8ff9343fc40689d6534090c3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 12123,
            "upload_time": "2024-04-22T07:11:15",
            "upload_time_iso_8601": "2024-04-22T07:11:15.334722Z",
            "url": "https://files.pythonhosted.org/packages/3e/02/6bb065e06a83f99c85f8feb10538f680e760b8e34991bd7e9d627217b270/pywaybackup-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fd04c0cf6667d772248cac734415d868d8bf76653222285d331205c2af116dc1",
                "md5": "09fa9bed3168ad314d06605d5d19321b",
                "sha256": "18ebd457a3c68dab0ff18392531d243cdd75d48ed653f4d4d3c369dcd8599f05"
            },
            "downloads": -1,
            "filename": "pywaybackup-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "09fa9bed3168ad314d06605d5d19321b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 12243,
            "upload_time": "2024-04-22T07:11:16",
            "upload_time_iso_8601": "2024-04-22T07:11:16.463590Z",
            "url": "https://files.pythonhosted.org/packages/fd/04/c0cf6667d772248cac734415d868d8bf76653222285d331205c2af116dc1/pywaybackup-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-22 07:11:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bitdruid",
    "github_project": "python-wayback-machine-downloader",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "pywaybackup"
}
        