dokuwikidumper

Name: dokuwikidumper
Version: 0.1.43
Summary: A tool for archiving DokuWiki
Upload time: 2023-09-26 15:46:44
Author: yzqzss
Requires Python: >=3.8,<4.0
License: GPL-3.0
Requirements: requests, beautifulsoup4, lxml, internetarchive, rich, python-slugify
# DokuWiki Dumper

> A tool for archiving DokuWiki.

We recommend running `dokuWikiDumper` on a _modern_ filesystem, such as `ext4` or `btrfs`. `NTFS` is not recommended because it forbids many special characters in filenames.

## For webmasters

If you don’t want your wiki to be archived, add the following to your `domain/robots.txt`:

```robots.txt
User-agent: dokuWikiDumper
Disallow: /
```


## Requirements

### dokuWikiDumper

- Python 3.8+ (developed on py3.10)
- beautifulsoup4
- requests
- lxml
- rich

### dokuWikiUploader

> Upload wiki dump to [Internet Archive](https://archive.org/).
> `dokuWikiUploader -h` for help.

- internetarchive
- p7zip (`7z` command) (`p7zip-full` package)
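
A minimal invocation sketch (the dump directory name is illustrative, and passing the dump path as a positional argument is an assumption; run `dokuWikiUploader -h` for the authoritative options):

```bash
# Sketch only: upload a finished dump directory to the Internet Archive.
# Assumes IA credentials are already configured (e.g. via `ia configure`)
# and the `7z` command is on PATH. The dump path is illustrative.
dokuWikiUploader ./example.com-20230926
```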

## Install `dokuWikiDumper`

> `dokuWikiUploader` is included in `dokuWikiDumper`.

### Install `dokuWikiDumper` with `pip` (recommended)

> <https://pypi.org/project/dokuwikidumper/>

```bash
pip3 install dokuWikiDumper
```

### Install `dokuWikiDumper` with `Poetry` (for developers)

- Install `Poetry`

    ```bash
    pip3 install poetry
    ```

- Install `dokuWikiDumper`

    ```bash
    git clone https://github.com/saveweb/dokuwiki-dumper
    cd dokuwiki-dumper
    poetry install
    rm -rf dist/
    poetry build
    pip install --force-reinstall dist/dokuWikiDumper*.whl
    ```

## Usage

```bash
usage: dokuWikiDumper [-h] [--content] [--media] [--html] [--pdf] [--current-only] [--skip-to SKIP_TO] [--path PATH] [--no-resume] [--threads THREADS]
                      [--i-love-retro] [--insecure] [--ignore-errors] [--ignore-action-disabled-edit] [--ignore-disposition-header-missing]
                      [--trim-php-warnings] [--delay DELAY] [--retry RETRY] [--hard-retry HARD_RETRY] [--parser PARSER] [--username USERNAME]
                      [--password PASSWORD] [--cookies COOKIES] [--auto] [-u] [-g UPLOADER_ARGS]
                      url

dokuWikiDumper Version: 0.1.31

positional arguments:
  url                   URL of the dokuWiki (provide the doku.php URL)

options:
  -h, --help            show this help message and exit
  --current-only        Dump latest revision, no history [default: false]
  --skip-to SKIP_TO     !DEV! Skip to title number [default: 0]
  --path PATH           Specify dump directory [default: <site>-<date>]
  --no-resume           Do not resume a previous dump [default: resume]
  --threads THREADS     Number of sub threads to use [default: 1], not recommended to set > 5
  --i-love-retro        Do not check the latest version of dokuWikiDumper (from pypi.org) before running [default: False]
  --insecure            Disable SSL certificate verification
  --ignore-errors       !DANGEROUS! ignore errors in the sub threads. This may cause incomplete dumps.
  --ignore-action-disabled-edit
                        Some sites disable edit action for anonymous users and some core pages. This option will ignore this error and textarea not found
                        error. But you may only get a partial dump. (only works with --content)
  --ignore-disposition-header-missing
                        Do not check Disposition header, useful for outdated (<2014) DokuWiki versions [default: False]
  --trim-php-warnings   Trim PHP warnings from requests.Response.text
  --delay DELAY         Delay between requests [default: 0.0]
  --retry RETRY         Maximum number of retries [default: 5]
  --hard-retry HARD_RETRY
                        Maximum number of retries for hard errors [default: 3]
  --parser PARSER       HTML parser [default: lxml]
  --username USERNAME   login: username
  --password PASSWORD   login: password
  --cookies COOKIES     cookies file
  --auto                dump: content+media+html, threads=5, ignore-action-disabled-edit. (threads is overridable)
  -u, --upload          Upload wikidump to Internet Archive after successfully dumped (only works with --auto)
  -g UPLOADER_ARGS, --uploader-arg UPLOADER_ARGS
                        Arguments for uploader.

Data to download:
  What info download from the wiki

  --content             Dump content
  --media               Dump media
  --html                Dump HTML
  --pdf                 Dump PDF (Only available on some wikis with the PDF export plugin) (Only dumps the latest PDF revision)
```

In most cases, you can use `--auto` to dump the site.

```bash
dokuWikiDumper https://example.com/wiki/ --auto
```

which is equivalent to

```bash
dokuWikiDumper https://example.com/wiki/ --content --media --html --threads 5 --ignore-action-disabled-edit
```
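
The documented flags compose, so you can tune `--auto` for politeness or layout. For example (an illustrative sketch; the delay value and dump path are arbitrary):

```bash
# Be gentler on the server and choose a custom dump directory.
dokuWikiDumper https://example.com/wiki/ --auto --delay 1.5 --path ./example-wiki-dump
```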

> We highly recommend logging in with `--username` and `--password` (or supplying `--cookies`), because some sites prevent anonymous users from accessing certain pages or viewing the raw wikitext.
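
For example (the credentials are illustrative):

```bash
# Dump with a logged-in account so access-restricted pages are reachable.
dokuWikiDumper https://example.com/wiki/ --auto --username alice --password 'hunter2'
```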

`--cookies` accepts a Netscape cookies file; you can use the [cookies.txt Extension](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) to export cookies from Firefox. It also accepts a JSON cookies file created by [Cookie Quick Manager](https://addons.mozilla.org/en-US/firefox/addon/cookie-quick-manager/).
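
For reference, a Netscape cookies file is plain text: a comment header plus one tab-separated line per cookie (domain, subdomain flag, path, secure flag, expiry timestamp, name, value). The cookie name and value below are illustrative:

```
# Netscape HTTP Cookie File
.example.com	TRUE	/	FALSE	1735689600	DokuWiki	abcdef1234567890
```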

## Dump structure

<!-- Dump structure -->
| Directory or File       | Description                                  |
|-------------------------|----------------------------------------------|
| `attic/`                | old revisions of pages (wikitext).           |
| `dumpMeta/`             | (dokuWikiDumper only) metadata of the dump.  |
| `dumpMeta/check.html`   | the `?do=check` page of the wiki.            |
| `dumpMeta/config.json`  | the dump's configuration.                    |
| `dumpMeta/favicon.ico`  | favicon of the site.                         |
| `dumpMeta/files.txt`    | list of filenames.                           |
| `dumpMeta/index.html`   | homepage of the wiki.                        |
| `dumpMeta/info.json`    | information about the wiki.                  |
| `dumpMeta/titles.txt`   | list of page titles.                         |
| `html/`                 | (dokuWikiDumper only) HTML of the pages.     |
| `media/`                | media files.                                 |
| `meta/`                 | metadata of the pages.                       |
| `pages/`                | latest page content (wikitext).              |
| `*.mark`                | mark files.                                  |
<!-- /Dump structure -->
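
Putting the table together, a finished dump directory might look like this (the top-level name follows the `<site>-<date>` default; the `.mark` file name is illustrative):

```
example.com-20230926/
├── attic/
├── dumpMeta/
│   ├── check.html
│   ├── config.json
│   ├── favicon.ico
│   ├── files.txt
│   ├── index.html
│   ├── info.json
│   └── titles.txt
├── html/
├── media/
├── meta/
├── pages/
└── content_dumped.mark
```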

## Available Backups/Dumps

I have made some backups for testing; you can browse the list: <https://github.com/orgs/saveweb/projects/4>.

> Some wikidumps have been uploaded to IA; see <https://archive.org/search?query=subject%3A"dokuWikiDumper">
>
> If you have dumped a DokuWiki and want to share it, please feel free to open an issue and I will add it to the list.

## How to import dump to DokuWiki

If you need to import a dump into DokuWiki, add the following configuration to `local.php`:

```php
$conf['fnencode'] = 'utf-8'; // DokuWiki default: 'safe' (url encode)
# 'safe' => Non-ASCII characters will be escaped as %xx form.
# 'utf-8' => Non-ASCII characters will be preserved as UTF-8 characters.

$conf['compression'] = '0'; // DokuWiki default: 'gz'.
# 'gz' => attic/<id>.<rev_id>.txt.gz
# 'bz2' => attic/<id>.<rev_id>.txt.bz2
# '0' => attic/<id>.<rev_id>.txt
```

Import the `pages` dir if you only need the latest version of the pages.  
Import the `meta` dir if you need the **changelog** of the pages.  
Import the `attic` and `meta` dirs if you need the **content** of old revisions.  
Import the `media` dir if you need the media files.
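
A minimal sketch of the copy step, assuming a standard DokuWiki install whose data directory is `/var/www/dokuwiki/data` (adjust the paths to your setup):

```bash
# Copy the dump's directories into DokuWiki's data directory.
# The DokuWiki install path is an assumption; adjust as needed.
cp -r pages/. /var/www/dokuwiki/data/pages/   # latest wikitext
cp -r meta/.  /var/www/dokuwiki/data/meta/    # page changelogs
cp -r attic/. /var/www/dokuwiki/data/attic/   # old revision content
cp -r media/. /var/www/dokuwiki/data/media/   # media files
```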

The `dumpMeta` and `html` dirs are only used by `dokuWikiDumper`; you can ignore them.

## Information

### DokuWiki links

- [DokuWiki](https://www.dokuwiki.org/)
- [DokuWiki changelog](https://www.dokuwiki.org/changelog)
- [DokuWiki source code](https://github.com/splitbrain/dokuwiki)

- [DokuWiki - ArchiveTeam Wiki](https://wiki.archiveteam.org/index.php/DokuWiki)

### Other tools

- [wikiteam/WikiTeam](https://github.com/wikiteam/wikiteam/), a tool for archiving MediaWiki, written in Python 2, which you won't want to use nowadays. :(
- [mediawiki-client-tools/MediaWiki Scraper](https://github.com/mediawiki-client-tools/mediawiki-scraper) (aka `wikiteam3`), a tool for archiving MediaWiki, forked from [WikiTeam](https://github.com/wikiteam/wikiteam/) and rewritten in Python 3. (For lack of code writers and reviewers, STWP no longer maintains this repo.)
- [saveweb/WikiTeam3](https://github.com/saveweb/wikiteam3), forked from MediaWiki Scraper and maintained by STWP. :)
- [DigitalDwagon/WikiBot](https://github.com/DigitalDwagon/WikiBot), a Discord and IRC bot that runs dokuWikiDumper and wikiteam3 in the background.

## License

GPLv3

## Contributors

This tool is based on an unmerged PR (_from 8 years ago!_) against [WikiTeam](https://github.com/WikiTeam/wikiteam/): [DokuWiki dump alpha](https://github.com/WikiTeam/wikiteam/pull/243) by [@PiRSquared17](https://github.com/PiRSquared17).

I ([@yzqzss](https://github.com/yzqzss)) have rewritten the code in Python 3 and added ~~some features, also fixed~~ some bugs.

            

## Raw data

{
    "_id": null,
    "home_page": "",
    "name": "dokuwikidumper",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "yzqzss",
    "author_email": "yzqzss@yandex.com",
    "download_url": "https://files.pythonhosted.org/packages/0c/9c/6b6dc546a5219c6c84cfc113f644cf691fb69a4a33e24e664f2df7a54a89/dokuwikidumper-0.1.43.tar.gz",
    "platform": null,
    "description": "# DokuWiki Dumper\n\n> A tool for archiving DokuWiki.\n\nRecommend using `dokuWikiDumper` on _modern_ filesystems, such as `ext4` or `btrfs`. `NTFS` is not recommended because it denies many special characters in the filename.\n\n# For webmaster\n\nIf you don\u2019t want your wiki to be archived, add the following to your `domain/robots.txt`:\n\n```robots.txt\nUser-agent: dokuWikiDumper\nDisallow: /\n```\n\n\n## Requirements\n\n### dokuWikiDumper\n\n- Python 3.8+ (developed on py3.10)\n- beautifulsoup4\n- requests\n- lxml\n- rich\n\n### dokuWikiUploader\n\n> Upload wiki dump to [Internet Archive](https://archive.org/).\n> `dokuWikiUploader -h` for help.\n\n- internetarchive\n- p7zip (`7z` command) (`p7zip-full` package)\n\n## Install `dokuWikiDumper`\n\n> `dokuWikiUploader` is included in `dokuWikiDumper`.\n\n### Install `dokuWikiDumper` with `pip` (recommended)\n\n> <https://pypi.org/project/dokuwikidumper/>\n\n```bash\npip3 install dokuWikiDumper\n```\n\n### Install `dokuWikiDumper` with `Poetry` (for developers)\n\n- Install `Poetry`\n\n    ```bash\n    pip3 install poetry\n    ```\n\n- Install `dokuWikiDumper`\n\n    ```bash\n    git clone https://github.com/saveweb/dokuwiki-dumper\n    cd dokuwiki-dumper\n    poetry install\n    rm dist/ -rf\n    poetry build\n    pip install --force-reinstall dist/dokuWikiDumper*.whl\n    ```\n\n## Usage\n\n```bash\nusage: dokuWikiDumper [-h] [--content] [--media] [--html] [--pdf] [--current-only] [--skip-to SKIP_TO] [--path PATH] [--no-resume] [--threads THREADS]\n                      [--i-love-retro] [--insecure] [--ignore-errors] [--ignore-action-disabled-edit] [--ignore-disposition-header-missing]\n                      [--trim-php-warnings] [--delay DELAY] [--retry RETRY] [--hard-retry HARD_RETRY] [--parser PARSER] [--username USERNAME]\n                      [--password PASSWORD] [--cookies COOKIES] [--auto] [-u] [-g UPLOADER_ARGS]\n                      url\n\ndokuWikiDumper Version: 0.1.31\n\npositional arguments:\n  url                   URL of the dokuWiki (provide the doku.php URL)\n\noptions:\n  -h, --help            show this help message and exit\n  --current-only        Dump latest revision, no history [default: false]\n  --skip-to SKIP_TO     !DEV! Skip to title number [default: 0]\n  --path PATH           Specify dump directory [default: <site>-<date>]\n  --no-resume           Do not resume a previous dump [default: resume]\n  --threads THREADS     Number of sub threads to use [default: 1], not recommended to set > 5\n  --i-love-retro        Do not check the latest version of dokuWikiDumper (from pypi.org) before running [default: False]\n  --insecure            Disable SSL certificate verification\n  --ignore-errors       !DANGEROUS! ignore errors in the sub threads. This may cause incomplete dumps.\n  --ignore-action-disabled-edit\n                        Some sites disable edit action for anonymous users and some core pages. This option will ignore this error and textarea not found\n                        error.But you may only get a partial dump. 
(only works with --content)\n  --ignore-disposition-header-missing\n                        Do not check Disposition header, useful for outdated (<2014) DokuWiki versions [default: False]\n  --trim-php-warnings   Trim PHP warnings from requests.Response.text\n  --delay DELAY         Delay between requests [default: 0.0]\n  --retry RETRY         Maximum number of retries [default: 5]\n  --hard-retry HARD_RETRY\n                        Maximum number of retries for hard errors [default: 3]\n  --parser PARSER       HTML parser [default: lxml]\n  --username USERNAME   login: username\n  --password PASSWORD   login: password\n  --cookies COOKIES     cookies file\n  --auto                dump: content+media+html, threads=5, ignore-action-disable-edit. (threads is overridable)\n  -u, --upload          Upload wikidump to Internet Archive after successfully dumped (only works with --auto)\n  -g UPLOADER_ARGS, --uploader-arg UPLOADER_ARGS\n                        Arguments for uploader.\n\nData to download:\n  What info download from the wiki\n\n  --content             Dump content\n  --media               Dump media\n  --html                Dump HTML\n  --pdf                 Dump PDF (Only available on some wikis with the PDF export plugin) (Only dumps the latest PDF revision)\n```\n\nFor most cases, you can use `--auto` to dump the site.\n\n```bash\ndokuWikiDumper https://example.com/wiki/ --auto\n```\n\nwhich is equivalent to\n\n```bash\ndokuWikiDumper https://example.com/wiki/ --content --media --html --threads 5 --ignore-action-disabled-edit\n```\n\n> Highly recommend using `--username` and `--password` to login (or using `--cookies`), because some sites may disable anonymous users to access some pages or check the raw wikitext.\n\n`--cookies` accepts a Netscape cookies file, you can use [cookies.txt Extension](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) to export cookies from Firefox. It also accepts a json cookies file created by [Cookie Quick Manager](https://addons.mozilla.org/en-US/firefox/addon/cookie-quick-manager/).\n\n## Dump structure\n\n<!-- Dump structure -->\n| Directory or File       | Description                                 |\n|-----------              |-------------                                |\n| `attic/`                | old revisions of page. (wikitext)           |\n| `dumpMeta/`             | (dokuWikiDumper only) metadata of the dump. |\n| `dumpMeta/check.html`   | ?do=check page of the wiki.                 |\n| `dumpMeta/config.json`  | dump's configuration.                       |\n| `dumpMeta/favicon.ico`  | favicon of the site.                        |\n| `dumpMeta/files.txt`    | list of filename.                           |\n| `dumpMeta/index.html`   | homepage of the wiki.                       |\n| `dumpMeta/info.json`    | infomations of the wiki.                    |\n| `dumpMeta/titles.txt`   | list of page title.                         |\n| `html/`                 | (dokuWikiDumper only) HTML of the pages.    |\n| `media/`                | media files.                                |\n| `meta/`                 | metadata of the pages.                      |\n| `pages/`                | latest page content. (wikitext)             |\n| `*.mark`                | mark file.                                  
|\n<!-- /Dump structure -->\n\n## Available Backups/Dumps\n\nI made some backups for testing, you can check out the list: <https://github.com/orgs/saveweb/projects/4>.\n\n> Some wikidump has been uploaded to IA, you can check out: <https://archive.org/search?query=subject%3A\"dokuWikiDumper\">\n>\n> If you dumped a DokuWiki and want to share it, please feel free to open an issue, I will add it to the list.\n\n## How to import dump to DokuWiki\n\nIf you need to import Dokuwiki, please add the following configuration to `local.php`\n\n```php\n$conf['fnencode'] = 'utf-8'; // Dokuwiki default: 'safe' (url encode)\n# 'safe' => Non-ASCII characters will be escaped as %xx form.\n# 'utf-8' => Non-ASCII characters will be preserved as UTF-8 characters.\n\n$conf['compression'] = '0'; // Dokuwiki default: 'gz'.\n# 'gz' => attic/<id>.<rev_id>.txt.gz\n# 'bz2' => attic/<id>.<rev_id>.txt.bz2\n# '0' => attic/<id>.<rev_id>.txt\n```\n\nImport `pages` dir if you only need the latest version of the page.  \nImport `meta` dir if you need the **changelog** of the page.  \nImport `attic` and `meta` dirs if you need the old revisions **content** of the page.  \nImport `media` dir if you need the media files.\n\n`dumpMeta` and `html` dirs are only used by `dokuWikiDumper`, you can ignore it.\n\n## Information\n\n### DokuWiki links\n\n- [DokuWiki](https://www.dokuwiki.org/)\n- [DokuWiki changelog](https://www.dokuwiki.org/changelog)\n- [DokuWiki source code](https://github.com/splitbrain/dokuwiki)\n\n- [DokuWiki - ArchiveTeam Wiki](https://wiki.archiveteam.org/index.php/DokuWiki)\n\n### Other tools\n\n- [wikiteam/WikiTeam](https://github.com/wikiteam/wikiteam/), a tool for archiving MediaWiki, written in Python 2 that you won't want to use nowadays. :(\n- [mediawiki-client-tools/MediaWiki Scraper](https://github.com/mediawiki-client-tools/mediawiki-scraper) (aka `wikiteam3`), a tool for archiving MediaWiki, forked from [WikiTeam](https://github.com/wikiteam/wikiteam/) and has been rewritten in Python 3. (Lack of code writers and reviewers, STWP no longer maintains this repo.)\n- [saveweb/WikiTeam3](https://github.com/saveweb/wikiteam3) forked from MediaWiki Scraper, maintained by STWP. :)\n- [DigitalDwagon/WikiBot](https://github.com/DigitalDwagon/WikiBot) a Discord and IRC bot to run the dokuWikiDumper and wikiteam3 in the background.\n\n## License\n\nGPLv3\n\n## Contributors\n\nThis tool is based on an unmerged PR (_8 years ago!_) of [WikiTeam](https://github.com/WikiTeam/wikiteam/): [DokuWiki dump alpha](https://github.com/WikiTeam/wikiteam/pull/243) by [@PiRSquared17](https://github.com/PiRSquared17).\n\nI ([@yzqzss](https://github.com/yzqzss)) have rewritten the code in Python 3 and added ~~some features, also fixed~~ some bugs.\n",
    "bugtrack_url": null,
    "license": "GPL-3.0",
    "summary": "A tool for archiving DokuWiki",
    "version": "0.1.43",
    "project_urls": {
        "Bug Tracker": "https://github.com/saveweb/dokuwiki-dumper/issues",
        "repository": "https://github.com/saveweb/dokuwiki-dumper/"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "70c36a33eb3ee81023e6a73402a92dfccf844dd14fccf0ecf1a38893a48e3148",
                "md5": "469e5ce7298566b2393cb15c89fc04b4",
                "sha256": "335af3e8166fd589b480451f8a7ea392ff43260a42b7955737d92b43cb1254e4"
            },
            "downloads": -1,
            "filename": "dokuwikidumper-0.1.43-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "469e5ce7298566b2393cb15c89fc04b4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 53965,
            "upload_time": "2023-09-26T15:46:42",
            "upload_time_iso_8601": "2023-09-26T15:46:42.194960Z",
            "url": "https://files.pythonhosted.org/packages/70/c3/6a33eb3ee81023e6a73402a92dfccf844dd14fccf0ecf1a38893a48e3148/dokuwikidumper-0.1.43-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0c9c6b6dc546a5219c6c84cfc113f644cf691fb69a4a33e24e664f2df7a54a89",
                "md5": "06ee90ecad47524da8cfaaae4e7f4e17",
                "sha256": "53bb8a431d2a1c59c759997b917cff1ef20d7065e4e02f97503ad7bf5ecbce99"
            },
            "downloads": -1,
            "filename": "dokuwikidumper-0.1.43.tar.gz",
            "has_sig": false,
            "md5_digest": "06ee90ecad47524da8cfaaae4e7f4e17",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 48286,
            "upload_time": "2023-09-26T15:46:44",
            "upload_time_iso_8601": "2023-09-26T15:46:44.786127Z",
            "url": "https://files.pythonhosted.org/packages/0c/9c/6b6dc546a5219c6c84cfc113f644cf691fb69a4a33e24e664f2df7a54a89/dokuwikidumper-0.1.43.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-26 15:46:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "saveweb",
    "github_project": "dokuwiki-dumper",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "requests",
            "specs": []
        },
        {
            "name": "beautifulsoup4",
            "specs": []
        },
        {
            "name": "lxml",
            "specs": []
        },
        {
            "name": "internetarchive",
            "specs": []
        },
        {
            "name": "rich",
            "specs": []
        },
        {
            "name": "python-slugify",
            "specs": []
        }
    ],
    "lcname": "dokuwikidumper"
}
        