| Field | Value |
| --- | --- |
| Name | dokuWikiDumper |
| Version | 0.1.46 |
| Summary | A tool for archiving DokuWiki |
| Author | yzqzss |
| License | GPL-3.0 |
| Requires Python | <4.0,>=3.8 |
| Upload time | 2024-11-16 18:26:20 |
# DokuWiki Dumper
![Dynamic JSON Badge](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Farchive.org%2Fadvancedsearch.php%3Fq%3Dsubject%3AdokuWikiDumper%26rows%3D1%26page%3D1%26output%3Djson&query=%24.response.numFound&label=DokuWiki%20Dumps%40IA)
[![PyPI version](https://badge.fury.io/py/dokuwikidumper.svg)](https://badge.fury.io/py/dokuwikidumper)
> A tool for archiving DokuWiki.
We recommend running `dokuWikiDumper` on a _modern_ filesystem such as `ext4` or `btrfs`. `NTFS` is not recommended because it forbids many special characters in filenames.
# For webmasters
If you don't want your wiki to be archived, add the following to your site's `robots.txt`:
```robots.txt
User-agent: dokuWikiDumper
Disallow: /
```
## Requirements
### dokuWikiDumper
- Python 3.8+ (developed on py3.10)
- beautifulsoup4
- requests
- lxml
- rich
### dokuWikiUploader
> Upload wiki dump to [Internet Archive](https://archive.org/).
> `dokuWikiUploader -h` for help.
- internetarchive
- p7zip (`7z` command) (`p7zip-full` package)
## Install `dokuWikiDumper`
> `dokuWikiUploader` is included in `dokuWikiDumper`.
### Install `dokuWikiDumper` with `pip` (recommended)
> <https://pypi.org/project/dokuwikidumper/>
```bash
pip3 install dokuWikiDumper
```
### Install `dokuWikiDumper` with `Poetry` (for developers)
- Install `Poetry`
```bash
pip3 install poetry
```
- Install `dokuWikiDumper`
```bash
git clone https://github.com/saveweb/dokuwiki-dumper
cd dokuwiki-dumper
poetry install
rm -rf dist/
poetry build
pip install --force-reinstall dist/dokuWikiDumper*.whl
```
## Usage
```bash
usage: dokuWikiDumper [-h] [--content] [--media] [--html] [--pdf] [--current-only] [--skip-to SKIP_TO] [--path PATH] [--no-resume] [--threads THREADS]
[--i-love-retro] [--insecure] [--ignore-errors] [--ignore-action-disabled-edit] [--ignore-disposition-header-missing]
[--trim-php-warnings] [--delay DELAY] [--retry RETRY] [--hard-retry HARD_RETRY] [--parser PARSER] [--username USERNAME]
[--password PASSWORD] [--cookies COOKIES] [--auto] [-u] [-g UPLOADER_ARGS]
url
dokuWikiDumper Version: 0.1.31
positional arguments:
url URL of the dokuWiki (provide the doku.php URL)
options:
-h, --help show this help message and exit
--current-only Dump latest revision, no history [default: false]
--skip-to SKIP_TO !DEV! Skip to title number [default: 0]
--path PATH Specify dump directory [default: <site>-<date>]
--no-resume Do not resume a previous dump [default: resume]
--threads THREADS Number of sub threads to use [default: 1], not recommended to set > 5
--i-love-retro Do not check the latest version of dokuWikiDumper (from pypi.org) before running [default: False]
--insecure Disable SSL certificate verification
--ignore-errors !DANGEROUS! ignore errors in the sub threads. This may cause incomplete dumps.
--ignore-action-disabled-edit
Some sites disable edit action for anonymous users and some core pages. This option will ignore this error and textarea not found
error.But you may only get a partial dump. (only works with --content)
--ignore-disposition-header-missing
Do not check Disposition header, useful for outdated (<2014) DokuWiki versions [default: False]
--trim-php-warnings Trim PHP warnings from requests.Response.text
--delay DELAY Delay between requests [default: 0.0]
--retry RETRY Maximum number of retries [default: 5]
--hard-retry HARD_RETRY
Maximum number of retries for hard errors [default: 3]
--parser PARSER HTML parser [default: lxml]
--username USERNAME login: username
--password PASSWORD login: password
--cookies COOKIES cookies file
--auto dump: content+media+html, threads=3, ignore-action-disable-edit. (threads is overridable)
-u, --upload Upload wikidump to Internet Archive after successfully dumped (only works with --auto)
-g UPLOADER_ARGS, --uploader-arg UPLOADER_ARGS
Arguments for uploader.
Data to download:
What info download from the wiki
--content Dump content
--media Dump media
--html Dump HTML
--pdf Dump PDF (Only available on some wikis with the PDF export plugin) (Only dumps the latest PDF revision)
```
In most cases, you can simply use `--auto` to dump the site:
```bash
dokuWikiDumper https://example.com/wiki/ --auto
```
which is equivalent to
```bash
dokuWikiDumper https://example.com/wiki/ --content --media --html --threads 3 --ignore-action-disabled-edit
```
> We highly recommend logging in with `--username` and `--password` (or passing `--cookies`), because some sites prevent anonymous users from accessing certain pages or viewing the raw wikitext.
`--cookies` accepts a Netscape-format cookies file; you can use the [cookies.txt Extension](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) to export cookies from Firefox. It also accepts a JSON cookies file created by [Cookie Quick Manager](https://addons.mozilla.org/en-US/firefox/addon/cookie-quick-manager/).
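A minimal sketch of the Netscape cookies format (the domain, cookie name, and value below are placeholders, not a real DokuWiki session):

```shell
# Write a two-line Netscape-format cookies file by hand.
# Fields are TAB-separated: domain, subdomain flag, path, secure flag,
# expiry (unix time, 0 = session cookie), name, value.
printf '# Netscape HTTP Cookie File\n' > cookies.txt
printf 'example.com\tFALSE\t/\tFALSE\t0\tDokuWiki\tsession-placeholder\n' >> cookies.txt

# The dumper would then be invoked as:
#   dokuWikiDumper https://example.com/wiki/ --auto --cookies cookies.txt
```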
## Dump structure
<!-- Dump structure -->
| Directory or File | Description |
|----------- |------------- |
| `attic/` | old revisions of pages. (wikitext) |
| `dumpMeta/` | (dokuWikiDumper only) metadata of the dump. |
| `dumpMeta/check.html` | `?do=check` page of the wiki. |
| `dumpMeta/config.json` | dump's configuration. |
| `dumpMeta/favicon.ico` | favicon of the site. |
| `dumpMeta/files.txt` | list of filenames. |
| `dumpMeta/index.html` | homepage of the wiki. |
| `dumpMeta/info.json` | information about the wiki. |
| `dumpMeta/titles.txt` | list of page titles. |
| `html/` | (dokuWikiDumper only) HTML of the pages. |
| `media/` | media files. |
| `meta/` | metadata of the pages. |
| `pages/` | latest page content. (wikitext) |
| `*.mark` | mark files. |
<!-- /Dump structure -->
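A finished dump can be sanity-checked from the shell using the paths in the table above. The dump directory and page titles below are fabricated stand-ins so the sketch is self-contained; point the same commands at a real dump instead:

```shell
# Stand-in dump directory (real dumps are named <site>-<date>):
DUMP=example.com_wiki-20241116
mkdir -p "$DUMP/dumpMeta" "$DUMP/pages"
printf 'start\nwiki:syntax\n' > "$DUMP/dumpMeta/titles.txt"
printf 'dummy wikitext\n' > "$DUMP/pages/start.txt"

# Compare titles the dumper discovered with wikitext files actually saved:
echo "titles: $(wc -l < "$DUMP/dumpMeta/titles.txt")"
echo "saved:  $(find "$DUMP/pages" -name '*.txt' | wc -l)"
```

A large gap between the two counts usually means the dump was interrupted and should be resumed.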
## Available Backups/Dumps
Check out: <https://archive.org/search?query=subject%3A"dokuWikiDumper">
## How to import dump to DokuWiki
If you want to import a dump into DokuWiki, add the following configuration to `local.php`:
```php
$conf['fnencode'] = 'utf-8'; // Dokuwiki default: 'safe' (url encode)
# 'safe' => Non-ASCII characters will be escaped as %xx form.
# 'utf-8' => Non-ASCII characters will be preserved as UTF-8 characters.
$conf['compression'] = '0'; // Dokuwiki default: 'gz'.
# 'gz' => attic/<id>.<rev_id>.txt.gz
# 'bz2' => attic/<id>.<rev_id>.txt.bz2
# '0' => attic/<id>.<rev_id>.txt
```
Import the `pages` dir if you only need the latest version of each page.
Import the `meta` dir if you need the **changelog** of each page.
Import the `attic` and `meta` dirs if you need the **content** of old revisions.
Import the `media` dir if you need the media files.
The `dumpMeta` and `html` dirs are only used by `dokuWikiDumper`; you can ignore them.
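The import steps above amount to copying the dump's directories into DokuWiki's `data/` directory. A minimal sketch with placeholder paths (`DUMP` and `DOKU_DATA` must be adjusted for your installation; the `mkdir` line only creates empty stand-ins so the sketch runs as-is):

```shell
DUMP=./example.com_wiki-20241116   # dump produced by dokuWikiDumper
DOKU_DATA=./dokuwiki/data          # typically /var/www/dokuwiki/data

# Empty stand-in directories so this sketch is self-contained:
mkdir -p "$DUMP"/pages "$DUMP"/meta "$DUMP"/attic "$DUMP"/media "$DOKU_DATA"

# pages/ = latest content, meta/ + attic/ = history, media/ = uploaded files
for d in pages meta attic media; do
    cp -r "$DUMP/$d" "$DOKU_DATA/"
done
```

Remember to fix file ownership afterwards (e.g. `chown -R www-data:`) so the web server can read the imported files.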
## Information
### DokuWiki links
- [DokuWiki](https://www.dokuwiki.org/)
- [DokuWiki changelog](https://www.dokuwiki.org/changelog)
- [DokuWiki source code](https://github.com/splitbrain/dokuwiki)
- [DokuWiki - ArchiveTeam Wiki](https://wiki.archiveteam.org/index.php/DokuWiki)
### Other tools
- [wikiteam/WikiTeam](https://github.com/wikiteam/wikiteam/), a tool for archiving MediaWiki, written in Python 2; you won't want to use it nowadays. :(
- [mediawiki-client-tools/MediaWiki Scraper](https://github.com/mediawiki-client-tools/mediawiki-scraper) (aka `wikiteam3`), a tool for archiving MediaWiki, forked from [WikiTeam](https://github.com/wikiteam/wikiteam/) and rewritten in Python 3. (For lack of developers and reviewers, STWP no longer maintains this repo.)
- [saveweb/WikiTeam3](https://github.com/saveweb/wikiteam3), forked from MediaWiki Scraper and maintained by STWP. :)
- [DigitalDwagon/WikiBot](https://github.com/DigitalDwagon/WikiBot), a Discord and IRC bot that runs dokuWikiDumper and wikiteam3 in the background.
## License
GPLv3
## Contributors
This tool is based on an unmerged PR (_8 years ago!_) of [WikiTeam](https://github.com/WikiTeam/wikiteam/): [DokuWiki dump alpha](https://github.com/WikiTeam/wikiteam/pull/243) by [@PiRSquared17](https://github.com/PiRSquared17).
I ([@yzqzss](https://github.com/yzqzss)) have rewritten the code in Python 3 and added ~~some features, also fixed~~ some bugs.