wayback-machine-archiver

Name	wayback-machine-archiver
Version	3.3.1
home_page	None
Summary	A Python script to submit web pages to the Wayback Machine for archiving.
upload_time	2025-09-11 03:18:50
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	# MIT License (MIT) Copyright © 2018--2025 Alexander Gude Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	internet archive wayback machine
requirements	No requirements were recorded.

# Wayback Machine Archiver

Wayback Machine Archiver (Archiver for short) is a command-line utility
written in Python to back up web pages using the [Internet Archive][ia].

[ia]: https://archive.org/

## Installation

The best way to install Archiver is with `pip`:

```bash
pip install wayback-machine-archiver
```

This will give you access to the script simply by calling:

```bash
archiver --help
```

You can also install it directly from a local clone of this repository:

```bash
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .
```

All dependencies are handled automatically. Archiver supports Python 3.8+.

## Usage

The archiver is simple to use from the command line.

### Command-Line Examples

**Archive a single page:**
```bash
archiver https://alexgude.com
```

**Archive all pages from a sitemap:**
```bash
archiver --sitemaps https://alexgude.com/sitemap.xml
```

**Archive from a local sitemap file:**
(Note that the `file://` prefix is required.)
```bash
archiver --sitemaps file://sitemap.xml
```

**Archive from a text file of URLs:**
(The file should contain one URL per line)
```bash
archiver --file urls.txt
```

**Combine multiple sources:**
```bash
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml
```

**Use advanced API options:**
(Capture a screenshot and skip if archived in the last 10 days)
```bash
archiver https://alexgude.com --capture-screenshot --if-not-archived-within 10d
```

**Archive the pages in a sitemap and the sitemap URL itself:**
```bash
archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also
```
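
If you drive Archiver from another Python script, the examples above can be assembled programmatically. The following is a minimal sketch, not part of Archiver itself: `build_archiver_command` is a hypothetical helper that only uses the positional URLs and the `--file` and `--sitemaps` flags shown above. It places positional URLs first because `--sitemaps` accepts one or more values, so a URL placed after it would presumably be read as another sitemap.

```python
import shlex


def build_archiver_command(urls=(), sitemaps=(), url_file=None):
    """Assemble an argv list for the `archiver` CLI (hypothetical helper)."""
    # Positional URLs go first so that --sitemaps cannot absorb them.
    cmd = ["archiver", *urls]
    if url_file:
        cmd += ["--file", str(url_file)]
    if sitemaps:
        cmd += ["--sitemaps", *sitemaps]
    return cmd


command = build_archiver_command(
    urls=["https://alexgude.com"],
    sitemaps=["https://alexgude.com/sitemap.xml"],
)
# Show the equivalent shell command; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run the tool, assuming `archiver` is on your `PATH` and credentials are configured.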

## Authentication (Required)

As of version 3.0.0, this tool requires authentication with the Internet
Archive's SPN2 API. This change was made to ensure all archiving jobs are
reliable and their final success or failure status can be confirmed. The
previous, less reliable method for unauthenticated users has been removed.

If you run the script without credentials, it will exit with an error message.

**To set up authentication:**

1.  Get your S3-style API keys from your Internet Archive account settings:
    [https://archive.org/account/s3.php](https://archive.org/account/s3.php)

2.  Create a `.env` file in the directory where you run the `archiver`
    command. Add your keys to it:
    ```
    INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
    INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
    ```

The script will automatically detect this file (or the equivalent environment
variables) and use the authenticated API.
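
Before starting a long run, it can help to confirm that both keys are visible. The sketch below is an illustration only and is not how Archiver loads its configuration: `read_env_file` and `missing_credentials` are hypothetical helpers that look for the two variable names above in the environment and in a simple `KEY="value"` style `.env` file.

```python
import os
from pathlib import Path

REQUIRED_KEYS = ("INTERNET_ARCHIVE_ACCESS_KEY", "INTERNET_ARCHIVE_SECRET_KEY")


def read_env_file(path=".env"):
    """Parse simple KEY="value" lines; return an empty dict if the file is missing."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(env_file=".env", environ=None):
    """Return the required key names found in neither the environment nor the file."""
    environ = os.environ if environ is None else environ
    file_values = read_env_file(env_file)
    return [k for k in REQUIRED_KEYS if not (environ.get(k) or file_values.get(k))]
```

An empty list from `missing_credentials()` means both keys were found in at least one of the two places.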

## Help

For a full list of command-line flags, Archiver has built-in help displayed
with `archiver --help`:

```
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE]
                [--archive-sitemap-also]
                [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [--random-order] [--capture-all]
                [--capture-outlinks] [--capture-screenshot]
                [--delay-wb-availability] [--force-get]
                [--skip-first-archive] [--email-result]
                [--if-not-archived-within <timedelta>]
                [--js-behavior-timeout <seconds>]
                [--capture-cookie <cookie>]
                [--user-agent <string>]
                [urls ...]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  Specifies the URLs of the pages to archive.

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           Specifies the path to a file containing URLs to save,
                        one per line.
  --sitemaps SITEMAPS [SITEMAPS ...]
                        Specifies one or more URIs to sitemaps listing pages
                        to archive. Local paths must be prefixed with
                        'file://'.
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Sets the logging level. Defaults to WARNING
                        (case-insensitive).
  --log-to-file LOG_FILE
                        Redirects logs to a specified file instead of the
                        console.
  --archive-sitemap-also
                        Submits the URL of the sitemap itself to be archived.
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        Specifies the number of seconds to wait between
                        submissions. A minimum of 5 seconds is enforced for
                        authenticated users. Defaults to 15.
  --random-order        Randomizes the order of pages before archiving.

SPN2 API Options:
  Control the behavior of the Internet Archive capture API.

  --capture-all         Captures a web page even if it returns an error (e.g.,
                        404, 500).
  --capture-outlinks    Captures web page outlinks automatically. Note: this
                        can significantly increase the total number of
                        captures and runtime.
  --capture-screenshot  Captures a full page screenshot.
  --delay-wb-availability
                        Reduces load on Internet Archive systems by making the
                        capture publicly available after ~12 hours instead of
                        immediately.
  --force-get           Bypasses the headless browser check, which can speed
                        up captures for non-HTML content (e.g., PDFs, images).
  --skip-first-archive  Speeds up captures by skipping the check for whether
                        this is the first time a URL has been archived.
  --email-result        Sends an email report of the captured URLs to the
                        user's registered email.
  --if-not-archived-within <timedelta>
                        Captures only if the latest capture is older than
                        <timedelta> (e.g., '3d 5h').
  --js-behavior-timeout <seconds>
                        Runs JS code for <N> seconds after page load to
                        trigger dynamic content. Defaults to 5, max is 30. Use
                        0 to disable for static pages.
  --capture-cookie <cookie>
                        Uses an extra HTTP Cookie value when capturing the
                        target page.
  --user-agent <string>
                        Uses a custom HTTP User-Agent value when capturing the
                        target page.
```
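
Two of the options above take values with specific rules: `--if-not-archived-within` expects a `<timedelta>` such as `'3d 5h'`, and `--rate-limit-wait` has an enforced minimum of 5 seconds and a default of 15. The sketch below illustrates both rules under stated assumptions and is not Archiver's own code. `parse_timedelta` and `effective_rate_limit` are hypothetical names, and the `m` and `s` unit suffixes are an assumption beyond the `d` and `h` shown in the help text.

```python
from datetime import timedelta
import re

# Unit suffixes mapped to timedelta keyword arguments ("m" and "s" are assumed).
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([dhms])")


def parse_timedelta(text):
    """Convert a string such as '3d 5h' into a datetime.timedelta."""
    tokens = text.strip().lower().split()
    if not tokens:
        raise ValueError("empty timedelta string")
    kwargs = {}
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized timedelta component: {token!r}")
        amount, unit = match.groups()
        name = _UNITS[unit]
        kwargs[name] = kwargs.get(name, 0) + int(amount)
    return timedelta(**kwargs)


def effective_rate_limit(requested=15, minimum=5):
    """Clamp a requested wait (seconds) to the documented minimum."""
    return max(requested, minimum)
```

With these definitions, `parse_timedelta("10d")` corresponds to the ten-day window used in the earlier example, and a requested wait of 2 seconds would be raised to 5.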

## Setting Up a `sitemap.xml` for GitHub Pages

It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site.
Simply use [jekyll/jekyll-sitemap][jsm].

Setup instructions can be found in that repository; they require changing just
a single line of your site's `_config.yml`.

[jsm]: https://github.com/jekyll/jekyll-sitemap
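
To check which pages a generated sitemap would hand to Archiver, you can list its `<loc>` entries yourself. The following is a minimal sketch using only the Python standard library. It assumes a plain `urlset` sitemap in the standard sitemaps.org namespace and is not Archiver's own parser; `urls_from_sitemap` and the example document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the page URLs listed in a urlset sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

for url in urls_from_sitemap(EXAMPLE):
    print(url)
```

Sitemap index files, which point at other sitemaps instead of pages, are not handled by this sketch.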

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "wayback-machine-archiver",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "Internet Archive, Wayback Machine",
    "author": null,
    "author_email": "Alexander Gude <alex.public.account@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/d3/95/198c13790c545b7e577591d92538142cb0cc292d5e8b920dd9522c70566d/wayback_machine_archiver-3.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "# MIT License (MIT)\n        \n        Copyright \u00a9 2018--2025 Alexander Gude\n        \n        Permission is hereby granted, free of charge, to any person obtaining\n        a copy of this software and associated documentation files (the\n        \"Software\"), to deal in the Software without restriction, including\n        without limitation the rights to use, copy, modify, merge, publish,\n        distribute, sublicense, and/or sell copies of the Software, and to\n        permit persons to whom the Software is furnished to do so, subject to\n        the following conditions:\n        \n        The above copyright notice and this permission notice shall be\n        included in all copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND,\n        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF\n        MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.\n        IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY\n        CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,\n        TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE\n        SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n        ",
    "summary": "A Python script to submit web pages to the Wayback Machine for archiving.",
    "version": "3.3.1",
    "project_urls": {
        "Homepage": "https://github.com/agude/wayback-machine-archiver"
    },
    "split_keywords": [
        "internet archive",
        " wayback machine"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "116294ae9acdde3a3f603144f1707dc56dba37071ddce0ee309ecba73ffa5b25",
                "md5": "cfa1862467ffbdef26499f1eda6d587f",
                "sha256": "31bc1d1a44d15a61b9bd233795550dc5721fbfaf39010fce1f450252d28c1364"
            },
            "downloads": -1,
            "filename": "wayback_machine_archiver-3.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cfa1862467ffbdef26499f1eda6d587f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 16024,
            "upload_time": "2025-09-11T03:18:48",
            "upload_time_iso_8601": "2025-09-11T03:18:48.569764Z",
            "url": "https://files.pythonhosted.org/packages/11/62/94ae9acdde3a3f603144f1707dc56dba37071ddce0ee309ecba73ffa5b25/wayback_machine_archiver-3.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d395198c13790c545b7e577591d92538142cb0cc292d5e8b920dd9522c70566d",
                "md5": "573d300756f8aff40dc8cb191d348b50",
                "sha256": "1664c495ed8096d925b45bbbb171add35a406ce50fab5c158539d76aca5545bb"
            },
            "downloads": -1,
            "filename": "wayback_machine_archiver-3.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "573d300756f8aff40dc8cb191d348b50",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 23687,
            "upload_time": "2025-09-11T03:18:50",
            "upload_time_iso_8601": "2025-09-11T03:18:50.162379Z",
            "url": "https://files.pythonhosted.org/packages/d3/95/198c13790c545b7e577591d92538142cb0cc292d5e8b920dd9522c70566d/wayback_machine_archiver-3.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-11 03:18:50",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "agude",
    "github_project": "wayback-machine-archiver",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "wayback-machine-archiver"
}
