gutenberg2zim


Namegutenberg2zim JSON
Version 3.0.0 PyPI version JSON
download
home_pageNone
SummaryMake ZIM file from Gutenberg books
upload_time2025-10-28 13:51:22
maintainerNone
docs_urlNone
authorNone
requires_python<3.14,>=3.13
licenseGPL-3.0-or-later
keywords gutenberg kiwix offline zim
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Gutenberg Offline
This scraper downloads the whole [Project
Gutenberg](https://www.gutenberg.org) library and puts it in a
[ZIM](https://openzim.org) file, a clean and user friendly format for
storing content for offline usage.

[![CodeFactor](https://www.codefactor.io/repository/github/openzim/gutenberg/badge)](https://www.codefactor.io/repository/github/openzim/gutenberg)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![codecov](https://codecov.io/gh/openzim/gutenberg/branch/main/graph/badge.svg)](https://codecov.io/gh/openzim/gutenberg)
[![PyPI version shields.io](https://img.shields.io/pypi/v/gutenberg2zim.svg)](https://pypi.org/project/gutenberg2zim/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gutenberg2zim.svg)](https://pypi.org/project/gutenberg2zim/)
[![Docker](https://ghcr-badge.egpl.dev/openzim/gutenberg/latest_tag?label=docker)](https://ghcr.io/openzim/gutenberg)

> [!WARNING]
> This scraper is now known to have a serious flaw. A critical bug https://github.com/openzim/gutenberg/issues/219 has been discovered which leads to incomplete archives. Work on https://github.com/openzim/gutenberg/issues/97 (complete rewrite of the scraper logic) now seems mandatory to fix these annoying problems. We however currently miss the necessary bandwidth to address these changes. Help is of course welcomed, but be warned this is going to be a significant project (at least 10 man.days to change the scraper logic so that we can fix the issue I would say, so probably the double since human is always bad at estimations).

## Getting Started

The recommended way to run the Gutenberg scraper is using Docker, as it comes with all required dependencies pre-installed.

### Running with Docker

1. **Run the scraper with Docker**:

```bash
docker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim
```

The `-v $(pwd)/output:/output` option mounts the `output` folder in your current directory to the `/output` folder inside the container (which is the working directory). This ensures that the ZIM file is saved to your local machine.

2. **Show available options**:

To view all the available options for `gutenberg2zim`, run:

```bash
docker run ghcr.io/openzim/gutenberg:latest gutenberg2zim --help
```


### Arguments

Customize the content download with the following options. For example, to download books in English or French with IDs 100 to 200 and only in PDF format:

```bash
docker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim -l en,fr -f pdf --books 100-200 --lcc-shelves all --title-search
```

This will download books in English and French that have the Id 100 to
200 in the HTML (default) and PDF format.
The -it flags allow you to see progress.
The --rm flag removes the container after completion.

You can find the full arguments list below:

```bash
-h --help                       Display this help message
-F --force                      Overwrite existing ZIM file

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-z --zim-file=<file>            Write ZIM into this file path
-t --zim-title=<title>          Set ZIM title
-n --zim-desc=<description>     Set ZIM description
-L --zim-long-desc=<description> Set ZIM long description
--zim-languages=<languages>          Set ZIM Language metadata
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb>           Number of concurrent process for processing tasks
--no-index                      Do NOT create full-text index within ZIM file
--title-search                  Add field to search a book by title and directly jump to it
--lcc-shelves=<shelves>         Comma-separated list of LCC shelf codes to include
                                (e.g., P,PR,Q). Use 'all' to generate all shelves. If omitted, no shelf generated.
--stats-filename=<filename>     Path to store the progress JSON file to
--publisher=<zim_publisher>     Custom Publisher in ZIM Metadata (openZIM otherwise)
--mirror-url=<mirror_url>       Optional custom url of mirror hosting Gutenberg files
--output=<output_folder>        Output folder for ZIMs. Default: /output
--debug                         Enable verbose output
```

The scraper will automatically perform all steps: download catalog and RDF files, parse metadata, download books, and create the ZIM file.


## Contributing Code

Main coding guidelines are from the [openZIM Wiki](https://github.com/openzim/overview/wiki).

### Setting Up the Environment

Here we will setup everything needed to run the source version from your machine, supposing you want to modify it. If you simply want to run the tool, you should either install the PyPi package or use the Docker image. Docker image can also be used for development but needs a bit of tweaking for live reload of your code modifications.

### Install the dependencies

First, ensure you use the proper Python version, inline with the requirement of `pyproject.toml` (you might for instance use `pyenv` to manage multiple Python versions in parallel).

You then need to install the various tools/libraries needed by the scraper.


The setup is divided into two categories: one for simply running the scraper and another for setting up a development environment for contributing and making improvements

**For Users Running the Scraper**:

### GNU/Linux
```
sudo apt update && sudo apt install -y python3-pip zim-tools
```
### Fedora
```
sudo dnf install -y python3-pip zim-tools
```
### Arch linux
```
sudo pacman -S python-pip zim-tools
```
#### macOS
```
brew install zim-tools
```
**For Developers Contributing & Modifying**;

#### GNU/Linux

```
sudo apt update && sudo apt install -y python3-pip zim-tools
```
### Fedora
```
sudo dnf install -y python3-pip zim-tools
```
### Arch linux
```
sudo pacman -S python-pip zim-tools
```
#### macOS
```
brew install zim-tools
```


### Setup the package

First, clone this repository.

```bash
git clone git@github.com:openzim/gutenberg.git
cd gutenberg
```

If you do not already have it on your system, install `hatch` to build the software and manage virtual environments (you might be interested by our detailed [Developer Setup](https://github.com/openzim/_python-bootstrap/blob/main/docs/Developer-Setup.md) as well).

```bash
pip3 install hatch
```

Start a hatch shell: this will install software including dependencies in an isolated virtual environment.

```bash
hatch shell
```

That's it. You can now run `gutenberg2zim` from your terminal.


## Screenshots

![](https://raw.githubusercontent.com/openzim/gutenberg/main/pictures/screenshot_1.png)
![](https://raw.githubusercontent.com/openzim/gutenberg/main/pictures/screenshot_2.png)

## License

[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "gutenberg2zim",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.14,>=3.13",
    "maintainer_email": null,
    "keywords": "gutenberg, kiwix, offline, zim",
    "author": null,
    "author_email": "Kiwix <dev@kiwix.org>",
    "download_url": "https://files.pythonhosted.org/packages/3f/23/c9c928476febaef13b2f6c85616aa950478e136242c3aa57666ad7f9fd42/gutenberg2zim-3.0.0.tar.gz",
    "platform": null,
    "description": "# Gutenberg Offline\nThis scraper downloads the whole [Project\nGutenberg](https://www.gutenberg.org) library and puts it in a\n[ZIM](https://openzim.org) file, a clean and user friendly format for\nstoring content for offline usage.\n\n[![CodeFactor](https://www.codefactor.io/repository/github/openzim/gutenberg/badge)](https://www.codefactor.io/repository/github/openzim/gutenberg)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![codecov](https://codecov.io/gh/openzim/gutenberg/branch/main/graph/badge.svg)](https://codecov.io/gh/openzim/gutenberg)\n[![PyPI version shields.io](https://img.shields.io/pypi/v/gutenberg2zim.svg)](https://pypi.org/project/gutenberg2zim/)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gutenberg2zim.svg)](https://pypi.org/project/gutenberg2zim/)\n[![Docker](https://ghcr-badge.egpl.dev/openzim/gutenberg/latest_tag?label=docker)](https://ghcr.io/openzim/gutenberg)\n\n> [!WARNING]\n> This scraper is now known to have a serious flaw. A critical bug https://github.com/openzim/gutenberg/issues/219 has been discovered which leads to incomplete archives. Work on https://github.com/openzim/gutenberg/issues/97 (complete rewrite of the scraper logic) now seems mandatory to fix these annoying problems. We however currently miss the necessary bandwidth to address these changes. Help is of course welcomed, but be warned this is going to be a significant project (at least 10 man.days to change the scraper logic so that we can fix the issue I would say, so probably the double since human is always bad at estimations).\n\n## Getting Started\n\nThe recommended way to run the Gutenberg scraper is using Docker, as it comes with all required dependencies pre-installed.\n\n### Running with Docker\n\n1. **Run the scraper with Docker**:\n\n```bash\ndocker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim\n```\n\nThe `-v $(pwd)/output:/output` option mounts the `output` folder in your current directory to the `/output` folder inside the container (which is the working directory). This ensures that the ZIM file is saved to your local machine.\n\n2. **Show available options**:\n\nTo view all the available options for `gutenberg2zim`, run:\n\n```bash\ndocker run ghcr.io/openzim/gutenberg:latest gutenberg2zim --help\n```\n\n\n### Arguments\n\nCustomize the content download with the following options. For example, to download books in English or French with IDs 100 to 200 and only in PDF format:\n\n```bash\ndocker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim -l en,fr -f pdf --books 100-200 --lcc-shelves all --title-search\n```\n\nThis will download books in English and French that have the Id 100 to\n200 in the HTML (default) and PDF format.\nThe -it flags allow you to see progress.\nThe --rm flag removes the container after completion.\n\nYou can find the full arguments list below:\n\n```bash\n-h --help                       Display this help message\n-F --force                      Overwrite existing ZIM file\n\n-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)\n-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)\n\n-z --zim-file=<file>            Write ZIM into this file path\n-t --zim-title=<title>          Set ZIM title\n-n --zim-desc=<description>     Set ZIM description\n-L --zim-long-desc=<description> Set ZIM long description\n--zim-languages=<languages>          Set ZIM Language metadata\n-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals\n-c --concurrency=<nb>           Number of concurrent process for processing tasks\n--no-index                      Do NOT create full-text index within ZIM file\n--title-search                  Add field to search a book by title and directly jump to it\n--lcc-shelves=<shelves>         Comma-separated list of LCC shelf codes to include\n                                (e.g., P,PR,Q). Use 'all' to generate all shelves. If omitted, no shelf generated.\n--stats-filename=<filename>     Path to store the progress JSON file to\n--publisher=<zim_publisher>     Custom Publisher in ZIM Metadata (openZIM otherwise)\n--mirror-url=<mirror_url>       Optional custom url of mirror hosting Gutenberg files\n--output=<output_folder>        Output folder for ZIMs. Default: /output\n--debug                         Enable verbose output\n```\n\nThe scraper will automatically perform all steps: download catalog and RDF files, parse metadata, download books, and create the ZIM file.\n\n\n## Contributing Code\n\nMain coding guidelines are from the [openZIM Wiki](https://github.com/openzim/overview/wiki).\n\n### Setting Up the Environment\n\nHere we will setup everything needed to run the source version from your machine, supposing you want to modify it. If you simply want to run the tool, you should either install the PyPi package or use the Docker image. Docker image can also be used for development but needs a bit of tweaking for live reload of your code modifications.\n\n### Install the dependencies\n\nFirst, ensure you use the proper Python version, inline with the requirement of `pyproject.toml` (you might for instance use `pyenv` to manage multiple Python versions in parallel).\n\nYou then need to install the various tools/libraries needed by the scraper.\n\n\nThe setup is divided into two categories: one for simply running the scraper and another for setting up a development environment for contributing and making improvements\n\n**For Users Running the Scraper**:\n\n### GNU/Linux\n```\nsudo apt update && sudo apt install -y python3-pip zim-tools\n```\n### Fedora\n```\nsudo dnf install -y python3-pip zim-tools\n```\n### Arch linux\n```\nsudo pacman -S python-pip zim-tools\n```\n#### macOS\n```\nbrew install zim-tools\n```\n**For Developers Contributing & Modifying**;\n\n#### GNU/Linux\n\n```\nsudo apt update && sudo apt install -y python3-pip zim-tools\n```\n### Fedora\n```\nsudo dnf install -y python3-pip zim-tools\n```\n### Arch linux\n```\nsudo pacman -S python-pip zim-tools\n```\n#### macOS\n```\nbrew install zim-tools\n```\n\n\n### Setup the package\n\nFirst, clone this repository.\n\n```bash\ngit clone git@github.com:openzim/gutenberg.git\ncd gutenberg\n```\n\nIf you do not already have it on your system, install `hatch` to build the software and manage virtual environments (you might be interested by our detailed [Developer Setup](https://github.com/openzim/_python-bootstrap/blob/main/docs/Developer-Setup.md) as well).\n\n```bash\npip3 install hatch\n```\n\nStart a hatch shell: this will install software including dependencies in an isolated virtual environment.\n\n```bash\nhatch shell\n```\n\nThat's it. You can now run `gutenberg2zim` from your terminal.\n\n\n## Screenshots\n\n![](https://raw.githubusercontent.com/openzim/gutenberg/main/pictures/screenshot_1.png)\n![](https://raw.githubusercontent.com/openzim/gutenberg/main/pictures/screenshot_2.png)\n\n## License\n\n[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see\n[LICENSE](LICENSE) for more details.\n",
    "bugtrack_url": null,
    "license": "GPL-3.0-or-later",
    "summary": "Make ZIM file from Gutenberg books",
    "version": "3.0.0",
    "project_urls": {
        "Donate": "https://www.kiwix.org/en/support-us/",
        "Homepage": "https://github.com/openzim/kolibri"
    },
    "split_keywords": [
        "gutenberg",
        " kiwix",
        " offline",
        " zim"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "31440af1fdf56f092d75474c13b54249d2fe1df035217dcae4a00047f53bb2ba",
                "md5": "499ab91064ab92b0624f5ee25c9ffc90",
                "sha256": "3cae783a9b2fb06c0747610e081b76f27714a2251768b56f306f2470e098e0fe"
            },
            "downloads": -1,
            "filename": "gutenberg2zim-3.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "499ab91064ab92b0624f5ee25c9ffc90",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.14,>=3.13",
            "size": 1848919,
            "upload_time": "2025-10-28T13:51:19",
            "upload_time_iso_8601": "2025-10-28T13:51:19.883350Z",
            "url": "https://files.pythonhosted.org/packages/31/44/0af1fdf56f092d75474c13b54249d2fe1df035217dcae4a00047f53bb2ba/gutenberg2zim-3.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3f23c9c928476febaef13b2f6c85616aa950478e136242c3aa57666ad7f9fd42",
                "md5": "35257861a9890140fdad386bccd15669",
                "sha256": "5e2db277d89ce4bcebab290c11bf1ac69880a7d4f364f822272967933093f05e"
            },
            "downloads": -1,
            "filename": "gutenberg2zim-3.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "35257861a9890140fdad386bccd15669",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.14,>=3.13",
            "size": 1989268,
            "upload_time": "2025-10-28T13:51:22",
            "upload_time_iso_8601": "2025-10-28T13:51:22.239451Z",
            "url": "https://files.pythonhosted.org/packages/3f/23/c9c928476febaef13b2f6c85616aa950478e136242c3aa57666ad7f9fd42/gutenberg2zim-3.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-28 13:51:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "openzim",
    "github_project": "kolibri",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "gutenberg2zim"
}
        
Elapsed time: 1.56398s