pgzip


| Field | Value |
| --- | --- |
| Name | pgzip |
| Version | 0.3.5 |
| home_page | https://github.com/pgzip/pgzip |
| Summary | A multi-threading implementation of Python gzip module |
| upload_time | 2023-08-03 21:45:10 |
| maintainer | |
| docs_url | None |
| author | pgzip team |
| requires_python | >=3.7 |
| license | MIT |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# pgzip

[![Run tests](https://github.com/pgzip/pgzip/actions/workflows/python-tests.yml/badge.svg)](https://github.com/pgzip/pgzip/actions/workflows/python-tests.yml)
[![CodeQL](https://github.com/pgzip/pgzip/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/pgzip/pgzip/actions/workflows/codeql-analysis.yml)

<p align="center">
  <img src="pgzip_logo.png" />
</p>

`pgzip` is a multi-threaded `gzip` implementation for Python that speeds up both compression and decompression.

Compression and decompression performance gains come from processing the indexed blocks within a `gzip` file in parallel. Block indexing uses gzip's `FEXTRA` field, which records the index of compressed members. `FEXTRA` is defined in the official `gzip` file format specification, version 4.3 (RFC 1952). Because `FEXTRA` is part of the `gzip` specification, `pgzip` is compatible with regular `gzip` files.
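
To make the idea of block indexing concrete, here is a minimal sketch that uses only the standard library. It is not `pgzip`'s actual on-disk layout: the subfield ID `ZZ`, the choice to store the member size in the subfield, and all function names are assumptions made for illustration. It writes each block as its own `gzip` member with an `FEXTRA` subfield, uses that subfield to locate members without inflating them, and decompresses the members in a thread pool.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

FEXTRA = 0x04            # FLG bit that announces an extra field (RFC 1952)
SUBFIELD_ID = b"ZZ"      # hypothetical subfield ID, not pgzip's real one
HEADER_LEN = 10 + 2 + 8  # fixed header + XLEN + one 8-byte subfield
TRAILER_LEN = 8          # CRC32 + ISIZE


def make_member(data):
    """Build one gzip member whose FEXTRA subfield stores the member's size."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = deflater.compress(data) + deflater.flush()
    member_size = HEADER_LEN + len(body) + TRAILER_LEN
    # ID1, ID2, CM=8 (deflate), FLG, MTIME, XFL, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    subfield = SUBFIELD_ID + struct.pack("<HI", 4, member_size)
    extra = struct.pack("<H", len(subfield)) + subfield
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + extra + body + trailer


def index_members(blob):
    """Return (offset, size) for every member by reading headers only."""
    index, pos = [], 0
    while pos < len(blob):
        sid, _length, size = struct.unpack_from("<2sHI", blob, pos + 12)
        assert blob[pos:pos + 2] == b"\x1f\x8b" and sid == SUBFIELD_ID
        index.append((pos, size))
        pos += size
    return index


def inflate_member(blob, offset, size):
    """Decompress the raw deflate body of a single member."""
    raw = blob[offset + HEADER_LEN: offset + size - TRAILER_LEN]
    return zlib.decompress(raw, wbits=-zlib.MAX_WBITS)


def parallel_decompress(blob, threads=4):
    """Inflate all members concurrently and join them in order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = pool.map(lambda m: inflate_member(blob, *m), index_members(blob))
        return b"".join(parts)


blob = make_member(b"hello ") + make_member(b"world")
assert parallel_decompress(blob) == b"hello world"
assert gzip.decompress(blob) == b"hello world"  # plain gzip still reads it
```

The last assertion illustrates the compatibility claim: a reader that does not know the subfield skips it, so the standard `gzip` module decodes the same bytes sequentially.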

`pgzip` is **~25X** faster for compression and **~7X** faster for decompression when benchmarked on a 24-core machine. Performance is limited only by I/O and the Python interpreter.

In theory, compression and decompression speed should scale linearly with the number of available cores. In practice, I/O and the general performance of the language limit the achievable speed.

## Usage and Examples

### CLI
```
❯ python -m pgzip -h
usage: __main__.py [-h] [-o OUTPUT] [-f FILENAME] [-d] [-l {0-9}] [-t THREADS] input

positional arguments:
  input                 Input file or '-' for stdin

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file or '-' for stdout (Default: Input file with 'gz' extension or stdout)
  -f FILENAME, --filename FILENAME
                        Name for the original file when compressing
  -d, --decompress      Decompress instead of compress
  -l {0-9}, --compression-level {0-9}
                        Compression level; 0 = no compression (Default: 9)
  -t THREADS, --threads THREADS
                        Number of threads to use (Default: Determine automatically)
```

### Programmatically

Using `pgzip` is the same as using the built-in `gzip` module.

Compressing data and writing it to a file:

```python
import pgzip

s = "a big string..."

# An explanation of parameters:
# `thread=8` - Use 8 threads to compress. `None` or `0` uses all cores (default)
# `blocksize=2*10**8` - Use a compression block size of 200MB
with pgzip.open("test.txt.gz", "wt", thread=8, blocksize=2*10**8) as fw:
    fw.write(s)
```

Decompressing data from a file:

```python
import pgzip

s = "a big string..."

with pgzip.open("test.txt.gz", "rt", thread=8) as fr:
    assert fr.read(len(s)) == s
```
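
Because the interface mirrors the built-in module, one possible pattern is to treat `pgzip` as an optional accelerator. The sketch below is an illustration rather than part of `pgzip`: the helper name `open_gz` is hypothetical, and it assumes that `thread` and `blocksize` are the only extra keyword arguments passed, which the standard library does not accept.

```python
import gzip

try:
    import pgzip  # optional dependency
except ImportError:
    pgzip = None


def open_gz(path, mode="rb", **pgzip_kwargs):
    """Open a gzip file with pgzip when installed, else with stdlib gzip.

    `pgzip_kwargs` (for example `thread` or `blocksize`) are pgzip-only,
    so they are dropped when falling back to the standard library.
    """
    if pgzip is not None:
        return pgzip.open(path, mode, **pgzip_kwargs)
    return gzip.open(path, mode)
```

A file written through `open_gz("test.txt.gz", "wt", thread=8)` should then be readable with plain `gzip.open("test.txt.gz", "rt")`, since the output is a regular `gzip` file.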

## Performance

### Compression Performance

![Compression Performance](https://raw.githubusercontent.com/pgzip/pgzip/master/CompressionBenchmark.png)

### Decompression Performance

![Decompression Performance](https://raw.githubusercontent.com/pgzip/pgzip/master/DecompressionBenchmark.png)

Decompression was benchmarked using an 8.0GB `FASTQ` text file with 48 threads across 24 cores on a machine with Xeon(R) E5-2650 v4 @ 2.20GHz CPUs.

The compressed file used in this benchmark was created with a blocksize of 200MB.
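
To reproduce a comparison of this kind on other hardware, a small timing harness is enough. The following is a rough sketch, not the script used for the figures above: the helper names are hypothetical, and the tiny synthetic payload merely stands in for a real `FASTQ` file.

```python
import gzip
import time


def best_seconds(func, payload, repeat=3):
    """Return the fastest wall-clock time of `func(payload)` over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(payload)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline, candidate, payload, repeat=3):
    """Return how many times faster `candidate` is than `baseline`."""
    return best_seconds(baseline, payload, repeat) / best_seconds(candidate, payload, repeat)


# Synthetic stand-in data; a real benchmark would use a multi-gigabyte file.
compressed = gzip.compress(b"ACGT" * 250_000)
baseline_seconds = best_seconds(gzip.decompress, compressed)
```

With `pgzip` installed, `speedup(gzip.decompress, pgzip.decompress, compressed)` would give the decompression ratio, assuming `pgzip.decompress` accepts the data as its only required argument. Results on small inputs are unlikely to resemble the large-file figures.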

## Warning

`pgzip` only provides replacements for the following functions of the `gzip` module:

- `open()`
- `compress()`
- `decompress()`

Other class methods and functionality have not been well tested.

Contributions or improvements are appreciated for methods such as:

- `seek()`
- `tell()`
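
For anyone who wants to help with these methods, one way to start is a check that encodes the behavior of the standard library and can then be pointed at another opener. This is only a sketch: `check_seek_tell` is a hypothetical helper, and it is exercised here against `gzip.open` alone, so it makes no statement about how `pgzip` currently behaves.

```python
import gzip
import os
import tempfile


def check_seek_tell(opener, path, expected):
    """Verify that read(), tell() and seek() agree on uncompressed offsets."""
    with opener(path, "rb") as f:
        assert f.read(10) == expected[:10]
        assert f.tell() == 10
        f.seek(100)                      # forward seek
        assert f.tell() == 100
        assert f.read(5) == expected[100:105]
        f.seek(0)                        # backward seek to the start
        assert f.read(10) == expected[:10]
    return True


data = bytes(range(256)) * 4
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")
    with gzip.open(path, "wb") as fw:
        fw.write(data)
    ok = check_seek_tell(gzip.open, path, data)
```

Passing `pgzip.open` as `opener` would run the same expectations against `pgzip`.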

## History

Created initially by Vincent Li (@vinlyx), this project is a fork of [https://github.com/vinlyx/mgzip](https://github.com/vinlyx/mgzip). We had several bug fixes to implement, but we could not contact them. The `pgzip` team would like to thank Vincent Li (@vinlyx) for their hard work. We hope that they will contact us when they discover this project.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/pgzip/pgzip",
    "name": "pgzip",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "",
    "author": "pgzip team",
    "author_email": "pgzip@thegoldfish.org",
    "download_url": "https://files.pythonhosted.org/packages/de/64/547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024/pgzip-0.3.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A multi-threading implementation of Python gzip module",
    "version": "0.3.5",
    "project_urls": {
        "Homepage": "https://github.com/pgzip/pgzip"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "619f1a97b17fb29d7d1b6293faf13899c483460a2c524c3e06fe4226f6916133",
                "md5": "09071694c8154806603151c122f1eebd",
                "sha256": "4e13ab66ecface5c51c5af51d8cd676aa51675cf85df000f501a86cf38c208c1"
            },
            "downloads": -1,
            "filename": "pgzip-0.3.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "09071694c8154806603151c122f1eebd",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 13011,
            "upload_time": "2023-08-03T21:45:08",
            "upload_time_iso_8601": "2023-08-03T21:45:08.706519Z",
            "url": "https://files.pythonhosted.org/packages/61/9f/1a97b17fb29d7d1b6293faf13899c483460a2c524c3e06fe4226f6916133/pgzip-0.3.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "de64547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024",
                "md5": "4c72e3911f4160cf013b0a9ad92ecb94",
                "sha256": "dd35510f59f6bd6b64e31c4baf90c10cdbb2775235fcc079b14b404fbd7f46bf"
            },
            "downloads": -1,
            "filename": "pgzip-0.3.5.tar.gz",
            "has_sig": false,
            "md5_digest": "4c72e3911f4160cf013b0a9ad92ecb94",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 14317,
            "upload_time": "2023-08-03T21:45:10",
            "upload_time_iso_8601": "2023-08-03T21:45:10.511464Z",
            "url": "https://files.pythonhosted.org/packages/de/64/547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024/pgzip-0.3.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-03 21:45:10",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pgzip",
    "github_project": "pgzip",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pgzip"
}
        