| Name | wayback-machine-archiver |
| Version | 3.3.1 |
| home_page | None |
| Summary | A Python script to submit web pages to the Wayback Machine for archiving. |
| upload_time | 2025-09-11 03:18:50 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | # MIT License (MIT)
Copyright © 2018--2025 Alexander Gude
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
| keywords | internet archive, wayback machine |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Wayback Machine Archiver
Wayback Machine Archiver (Archiver for short) is a command-line utility
written in Python to back up web pages using the [Internet Archive][ia].
[ia]: https://archive.org/
## Installation
The best way to install Archiver is with `pip`:
```bash
pip install wayback-machine-archiver
```
This will give you access to the script simply by calling:
```bash
archiver --help
```
You can also install it directly from a local clone of this repository:
```bash
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .
```
All dependencies are handled automatically. Archiver supports Python 3.8+.
## Usage
Archiver is simple to use from the command line.
### Command-Line Examples
**Archive a single page:**
```bash
archiver https://alexgude.com
```
**Archive all pages from a sitemap:**
```bash
archiver --sitemaps https://alexgude.com/sitemap.xml
```
**Archive from a local sitemap file:**
(Note that the `file://` prefix is required.)
```bash
archiver --sitemaps file://sitemap.xml
```
**Archive from a text file of URLs:**
(The file should contain one URL per line)
```bash
archiver --file urls.txt
```
**Combine multiple sources:**
```bash
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml
```
**Use advanced API options:**
(Capture a screenshot and skip if archived in the last 10 days)
```bash
archiver https://alexgude.com --capture-screenshot --if-not-archived-within 10d
```
**Archive the sitemap URL itself:**
```bash
archiver --sitemaps https://alexgude.com/sitemap.xml --archive-sitemap-also
```
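To make the relationship between the `--sitemaps` and `--file` inputs concrete, here is a minimal Python sketch that pulls page URLs out of a sitemap and formats them one per line, the layout `--file` expects. It is not Archiver's own code: the function names `extract_sitemap_urls` and `to_url_file` are hypothetical, the example sitemap is made up, and the only external fact relied on is the sitemaps.org 0.9 XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def to_url_file(urls: List[str]) -> str:
    """Format URLs one per line, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(urls))
    return "\n".join(unique) + "\n"


# Made-up sitemap used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
  <url><loc>https://example.com/</loc></url>
</urlset>
"""

if __name__ == "__main__":
    # Prints the two distinct URLs, one per line.
    print(to_url_file(extract_sitemap_urls(EXAMPLE_SITEMAP)), end="")
```

Saving that output as `urls.txt` would give a file suitable for `archiver --file urls.txt`. Whether Archiver itself removes duplicate URLs is not stated above, so the deduplication step here is only a convenience.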
## Authentication (Required)
As of version 3.0.0, this tool requires authentication with the Internet
Archive's SPN2 API. This change was made to ensure all archiving jobs are
reliable and their final success or failure status can be confirmed. The
previous, less reliable method for unauthenticated users has been removed.
If you run the script without credentials, it will exit with an error message.
**To set up authentication:**
1. Get your S3-style API keys from your Internet Archive account settings:
[https://archive.org/account/s3.php](https://archive.org/account/s3.php)
2. Create a `.env` file in the directory where you run the `archiver`
command. Add your keys to it:
```
INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
```
The script will automatically detect this file (or the equivalent environment
variables) and use the authenticated API.
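Before a long run, it can help to confirm that both keys are actually set. The sketch below is a hypothetical pre-flight check, not the loader Archiver uses: `parse_env_text` and `missing_credentials` are illustrative names, and the parser handles only plain `KEY=VALUE` lines with optional surrounding quotes.

```python
import os
from typing import Dict, List, Mapping

# Variable names taken from the `.env` example above.
REQUIRED_KEYS = ("INTERNET_ARCHIVE_ACCESS_KEY", "INTERNET_ARCHIVE_SECRET_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    # Check only the process environment here; reading a `.env` file would
    # go through parse_env_text(Path(".env").read_text()).
    print("Missing credentials:", missing_credentials(os.environ))
```

An empty list means both variables are present. The check says nothing about whether the keys are valid; only the Internet Archive can confirm that.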
## Help
For a full list of command-line flags, Archiver has built-in help displayed
with `archiver --help`:
```
usage: archiver [-h] [--version] [--file FILE]
[--sitemaps SITEMAPS [SITEMAPS ...]]
[--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--log-to-file LOG_FILE]
[--archive-sitemap-also]
[--rate-limit-wait RATE_LIMIT_IN_SEC]
[--random-order] [--capture-all]
[--capture-outlinks] [--capture-screenshot]
[--delay-wb-availability] [--force-get]
[--skip-first-archive] [--email-result]
[--if-not-archived-within <timedelta>]
[--js-behavior-timeout <seconds>]
[--capture-cookie <cookie>]
[--user-agent <string>]
[urls ...]
A script to backup a web pages with Internet Archive
positional arguments:
urls Specifies the URLs of the pages to archive.
options:
-h, --help show this help message and exit
--version show program's version number and exit
--file FILE Specifies the path to a file containing URLs to save,
one per line.
--sitemaps SITEMAPS [SITEMAPS ...]
Specifies one or more URIs to sitemaps listing pages
to archive. Local paths must be prefixed with
'file://'.
--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Sets the logging level. Defaults to WARNING
(case-insensitive).
--log-to-file LOG_FILE
Redirects logs to a specified file instead of the
console.
--archive-sitemap-also
Submits the URL of the sitemap itself to be archived.
--rate-limit-wait RATE_LIMIT_IN_SEC
Specifies the number of seconds to wait between
submissions. A minimum of 5 seconds is enforced for
authenticated users. Defaults to 15.
--random-order Randomizes the order of pages before archiving.
SPN2 API Options:
Control the behavior of the Internet Archive capture API.
--capture-all Captures a web page even if it returns an error (e.g.,
404, 500).
--capture-outlinks Captures web page outlinks automatically. Note: this
can significantly increase the total number of
captures and runtime.
--capture-screenshot Captures a full page screenshot.
--delay-wb-availability
Reduces load on Internet Archive systems by making the
capture publicly available after ~12 hours instead of
immediately.
--force-get Bypasses the headless browser check, which can speed
up captures for non-HTML content (e.g., PDFs, images).
--skip-first-archive Speeds up captures by skipping the check for whether
this is the first time a URL has been archived.
--email-result Sends an email report of the captured URLs to the
user's registered email.
--if-not-archived-within <timedelta>
Captures only if the latest capture is older than
<timedelta> (e.g., '3d 5h').
--js-behavior-timeout <seconds>
Runs JS code for <N> seconds after page load to
trigger dynamic content. Defaults to 5, max is 30. Use
0 to disable for static pages.
--capture-cookie <cookie>
Uses an extra HTTP Cookie value when capturing the
target page.
--user-agent <string>
Uses a custom HTTP User-Agent value when capturing the
target page.
```
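Two of the options above take values with documented constraints: `--if-not-archived-within` accepts strings such as `'3d 5h'`, and `--rate-limit-wait` enforces a 5-second minimum. The Python sketch below illustrates how such values could be interpreted. It is an assumption-laden illustration, not Archiver's or the SPN2 API's parsing code: `parse_timedelta` and `effective_wait` are hypothetical names, and the `m` and `s` unit suffixes are assumed, since the help text only shows `d` and `h`.

```python
import re
from datetime import timedelta

# Assumed unit suffixes; only 'd' and 'h' appear in the help text.
_UNIT_TO_KWARG = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([dhms])")


def parse_timedelta(spec: str) -> timedelta:
    """Convert a string such as '3d 5h' into a datetime.timedelta."""
    tokens = spec.split()
    if not tokens:
        raise ValueError("empty timedelta specification")
    total = timedelta()
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        amount, unit = match.groups()
        total += timedelta(**{_UNIT_TO_KWARG[unit]: int(amount)})
    return total


def effective_wait(requested_seconds: int, minimum_seconds: int = 5) -> int:
    """Raise a requested wait to the documented 5-second minimum if needed."""
    return max(requested_seconds, minimum_seconds)


if __name__ == "__main__":
    print(parse_timedelta("3d 5h").total_seconds())  # 277200.0
    print(effective_wait(2))  # 5
```

Under these assumptions, `10d` from the earlier example corresponds to `timedelta(days=10)`, and a requested wait below 5 seconds is raised to 5.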
## Setting Up a `sitemap.xml` for GitHub Pages
It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site.
Simply use [jekyll/jekyll-sitemap][jsm].
Setup instructions can be found in that repository; they require changing just
a single line of your site's `_config.yml`.
[jsm]: https://github.com/jekyll/jekyll-sitemap
Raw data
{
"_id": null,
"home_page": null,
"name": "wayback-machine-archiver",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "Internet Archive, Wayback Machine",
"author": null,
"author_email": "Alexander Gude <alex.public.account@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/d3/95/198c13790c545b7e577591d92538142cb0cc292d5e8b920dd9522c70566d/wayback_machine_archiver-3.3.1.tar.gz",
"platform": null,
"description": "# Wayback Machine Archiver\n\nWayback Machine Archiver (Archiver for short) is a command-line utility\nwritten in Python to back up web pages using the [Internet Archive][ia].\n\n[ia]: https://archive.org/\n\n## Installation\n\nThe best way to install Archiver is with `pip`:\n\n```bash\npip install wayback-machine-archiver\n```\n\nThis will give you access to the script simply by calling:\n\n```bash\narchiver --help\n```\n\nYou can also install it directly from a local clone of this repository:\n\n```bash\ngit clone https://github.com/agude/wayback-machine-archiver.git\ncd wayback-machine-archiver\npip install .\n```\n\nAll dependencies are handled automatically. Archiver supports Python 3.8+.\n\n## Usage\n\nThe archiver is simple to use from the command line.\n\n### Command-Line Examples\n\n**Archive a single page:**\n```bash\narchiver https://alexgude.com\n```\n\n**Archive all pages from a sitemap:**\n```bash\narchiver --sitemaps https://alexgude.com/sitemap.xml\n```\n\n**Archive from a local sitemap file:**\n(Note the `file://` prefix is required)\n```bash\narchiver --sitemaps file://sitemap.xml\n```\n\n**Archive from a text file of URLs:**\n(The file should contain one URL per line)\n```bash\narchiver --file urls.txt\n```\n\n**Combine multiple sources:**\n```bash\narchiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml\n```\n\n**Use advanced API options:**\n(Capture a screenshot and skip if archived in the last 10 days)\n```bash\narchiver https://alexgude.com --capture-screenshot --if-not-archived-within 10d\n```\n\n**Archive the sitemap URL itself:**\n```bash\narchiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also\n```\n\n## Authentication (Required)\n\nAs of version 3.0.0, this tool requires authentication with the Internet\nArchive's SPN2 API. This change was made to ensure all archiving jobs are\nreliable and their final success or failure status can be confirmed. 
The\nprevious, less reliable method for unauthenticated users has been removed.\n\nIf you run the script without credentials, it will exit with an error message.\n\n**To set up authentication:**\n\n1. Get your S3-style API keys from your Internet Archive account settings:\n [https://archive.org/account/s3.php](https://archive.org/account/s3.php)\n\n2. Create a `.env` file in the directory where you run the `archiver`\n command. Add your keys to it:\n ```\n INTERNET_ARCHIVE_ACCESS_KEY=\"YOUR_ACCESS_KEY_HERE\"\n INTERNET_ARCHIVE_SECRET_KEY=\"YOUR_SECRET_KEY_HERE\"\n ```\n\nThe script will automatically detect this file (or the equivalent environment\nvariables) and use the authenticated API.\n\n## Help\n\nFor a full list of command-line flags, Archiver has built-in help displayed\nwith `archiver --help`:\n\n```\nusage: archiver [-h] [--version] [--file FILE]\n [--sitemaps SITEMAPS [SITEMAPS ...]]\n [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]\n [--log-to-file LOG_FILE]\n [--archive-sitemap-also]\n [--rate-limit-wait RATE_LIMIT_IN_SEC]\n [--random-order] [--capture-all]\n [--capture-outlinks] [--capture-screenshot]\n [--delay-wb-availability] [--force-get]\n [--skip-first-archive] [--email-result]\n [--if-not-archived-within <timedelta>]\n [--js-behavior-timeout <seconds>]\n [--capture-cookie <cookie>]\n [--user-agent <string>]\n [urls ...]\n\nA script to backup a web pages with Internet Archive\n\npositional arguments:\n urls Specifies the URLs of the pages to archive.\n\noptions:\n -h, --help show this help message and exit\n --version show program's version number and exit\n --file FILE Specifies the path to a file containing URLs to save,\n one per line.\n --sitemaps SITEMAPS [SITEMAPS ...]\n Specifies one or more URIs to sitemaps listing pages\n to archive. Local paths must be prefixed with\n 'file://'.\n --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}\n Sets the logging level. 
Defaults to WARNING\n (case-insensitive).\n --log-to-file LOG_FILE\n Redirects logs to a specified file instead of the\n console.\n --archive-sitemap-also\n Submits the URL of the sitemap itself to be archived.\n --rate-limit-wait RATE_LIMIT_IN_SEC\n Specifies the number of seconds to wait between\n submissions. A minimum of 5 seconds is enforced for\n authenticated users. Defaults to 15.\n --random-order Randomizes the order of pages before archiving.\n\nSPN2 API Options:\n Control the behavior of the Internet Archive capture API.\n\n --capture-all Captures a web page even if it returns an error (e.g.,\n 404, 500).\n --capture-outlinks Captures web page outlinks automatically. Note: this\n can significantly increase the total number of\n captures and runtime.\n --capture-screenshot Captures a full page screenshot.\n --delay-wb-availability\n Reduces load on Internet Archive systems by making the\n capture publicly available after ~12 hours instead of\n immediately.\n --force-get Bypasses the headless browser check, which can speed\n up captures for non-HTML content (e.g., PDFs, images).\n --skip-first-archive Speeds up captures by skipping the check for whether\n this is the first time a URL has been archived.\n --email-result Sends an email report of the captured URLs to the\n user's registered email.\n --if-not-archived-within <timedelta>\n Captures only if the latest capture is older than\n <timedelta> (e.g., '3d 5h').\n --js-behavior-timeout <seconds>\n Runs JS code for <N> seconds after page load to\n trigger dynamic content. Defaults to 5, max is 30. 
Use\n 0 to disable for static pages.\n --capture-cookie <cookie>\n Uses an extra HTTP Cookie value when capturing the\n target page.\n --user-agent <string>\n Uses a custom HTTP User-Agent value when capturing the\n target page.\n```\n\n## Setting Up a `Sitemap.xml` for Github Pages\n\nIt is easy to automatically generate a sitemap for a Github Pages Jekyll site.\nSimply use [jekyll/jekyll-sitemap][jsm].\n\nSetup instructions can be found on the above site; they require changing just\na single line of your site's `_config.yml`.\n\n[jsm]: https://github.com/jekyll/jekyll-sitemap\n",
"bugtrack_url": null,
"license": "# MIT License (MIT)\n \n Copyright \u00a9 2018--2025 Alexander Gude\n \n Permission is hereby granted, free of charge, to any person obtaining\n a copy of this software and associated documentation files (the\n \"Software\"), to deal in the Software without restriction, including\n without limitation the rights to use, copy, modify, merge, publish,\n distribute, sublicense, and/or sell copies of the Software, and to\n permit persons to whom the Software is furnished to do so, subject to\n the following conditions:\n \n The above copyright notice and this permission notice shall be\n included in all copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND,\n EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF\n MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.\n IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY\n CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,\n TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE\n SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n ",
"summary": "A Python script to submit web pages to the Wayback Machine for archiving.",
"version": "3.3.1",
"project_urls": {
"Homepage": "https://github.com/agude/wayback-machine-archiver"
},
"split_keywords": [
"internet archive",
" wayback machine"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "116294ae9acdde3a3f603144f1707dc56dba37071ddce0ee309ecba73ffa5b25",
"md5": "cfa1862467ffbdef26499f1eda6d587f",
"sha256": "31bc1d1a44d15a61b9bd233795550dc5721fbfaf39010fce1f450252d28c1364"
},
"downloads": -1,
"filename": "wayback_machine_archiver-3.3.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cfa1862467ffbdef26499f1eda6d587f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 16024,
"upload_time": "2025-09-11T03:18:48",
"upload_time_iso_8601": "2025-09-11T03:18:48.569764Z",
"url": "https://files.pythonhosted.org/packages/11/62/94ae9acdde3a3f603144f1707dc56dba37071ddce0ee309ecba73ffa5b25/wayback_machine_archiver-3.3.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d395198c13790c545b7e577591d92538142cb0cc292d5e8b920dd9522c70566d",
"md5": "573d300756f8aff40dc8cb191d348b50",
"sha256": "1664c495ed8096d925b45bbbb171add35a406ce50fab5c158539d76aca5545bb"
},
"downloads": -1,
"filename": "wayback_machine_archiver-3.3.1.tar.gz",
"has_sig": false,
"md5_digest": "573d300756f8aff40dc8cb191d348b50",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 23687,
"upload_time": "2025-09-11T03:18:50",
"upload_time_iso_8601": "2025-09-11T03:18:50.162379Z",
"url": "https://files.pythonhosted.org/packages/d3/95/198c13790c545b7e577591d92538142cb0cc292d5e8b920dd9522c70566d/wayback_machine_archiver-3.3.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-11 03:18:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "agude",
"github_project": "wayback-machine-archiver",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "wayback-machine-archiver"
}