Name | pgzip |
Version | 0.3.5 |
home_page | https://github.com/pgzip/pgzip |
Summary | A multi-threading implementation of Python gzip module |
upload_time | 2023-08-03 21:45:10 |
maintainer | |
docs_url | None |
author | pgzip team |
requires_python | >=3.7 |
license | MIT |
requirements | No requirements were recorded. |
# pgzip
[![Run tests](https://github.com/pgzip/pgzip/actions/workflows/python-tests.yml/badge.svg)](https://github.com/pgzip/pgzip/actions/workflows/python-tests.yml)
[![CodeQL](https://github.com/pgzip/pgzip/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/pgzip/pgzip/actions/workflows/codeql-analysis.yml)
<p align="center">
<img src="pgzip_logo.png" />
</p>
`pgzip` is a multi-threaded `gzip` implementation for `python` that increases the compression and decompression performance.
The performance gains come from block indexing: the data is split into blocks that are compressed and decompressed in parallel. Block indexing uses gzip's `FEXTRA` field, which records the index of the compressed members. `FEXTRA` is defined in the official `gzip` specification starting at version 4.3. Because `FEXTRA` is part of the `gzip` specification, `pgzip` is compatible with regular `gzip` files.
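To make the header layout concrete, here is a minimal sketch that reads and writes an `FEXTRA` field using only the standard library, following the member header layout in RFC 1952. The helper names `read_fextra` and `add_fextra` and the `XX` subfield are illustrative assumptions; this does not reproduce how `pgzip` encodes its block index.

```python
import gzip
import struct
from typing import Optional

FEXTRA = 0x04  # bit 2 of the FLG byte (RFC 1952)


def read_fextra(member: bytes) -> Optional[bytes]:
    """Return the raw FEXTRA payload of a gzip member, or None if absent."""
    # Fixed 10-byte header: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
    id1, id2, _cm, flg = struct.unpack_from("<BBBB", member, 0)
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    if not flg & FEXTRA:
        return None
    # XLEN is a little-endian 16-bit length right after the fixed header.
    (xlen,) = struct.unpack_from("<H", member, 10)
    return member[12:12 + xlen]


def add_fextra(member: bytes, payload: bytes) -> bytes:
    """Insert an FEXTRA field into a member whose header has none."""
    if read_fextra(member) is not None:
        raise ValueError("member already has an FEXTRA field")
    header = bytearray(member[:10])
    header[3] |= FEXTRA  # set the flag bit
    return bytes(header) + struct.pack("<H", len(payload)) + payload + member[10:]


plain = gzip.compress(b"hello gzip")
# Hypothetical subfield: two ID bytes, a 16-bit length, then 4 data bytes.
subfield = b"XX" + struct.pack("<H", 4) + struct.pack("<I", 1234)
tagged = add_fextra(plain, subfield)

print(read_fextra(plain))               # stdlib output carries no extra field
print(read_fextra(tagged) == subfield)  # the payload can be read back
print(gzip.decompress(tagged))          # a standard reader skips the field
```

Because a reader that does not understand the extra field simply skips it, a file carrying such an index stays readable by ordinary `gzip` tools.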
`pgzip` is **~25X** faster for compression and **~7X** faster for decompression when benchmarked on a 24-core machine. Performance is limited only by I/O and the `python` interpreter.
Theoretically, compression and decompression speed should scale linearly with the number of cores available. In practice, I/O and the language's general performance limit the achievable speed.
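The general idea of compressing independent blocks in parallel can be sketched with the standard library alone. The example below splits the input into fixed-size blocks, compresses each block as its own gzip member in a thread pool, and concatenates the members. It illustrates the technique only and is not `pgzip`'s implementation: the function name `compress_blocks` and its parameters are hypothetical, and no `FEXTRA` index is written.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data: bytes, blocksize: int = 1 << 20, threads: int = 4) -> bytes:
    """Compress `data` as a sequence of independent gzip members."""
    # Keep at least one (possibly empty) block so the output is a valid stream.
    blocks = [data[i:i + blocksize] for i in range(0, len(data), blocksize)] or [b""]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order, so members line up with the blocks.
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


payload = b"ACGT" * 1_000_000
packed = compress_blocks(payload, blocksize=1 << 20, threads=4)

# A multi-member stream is still a valid gzip file for a standard reader.
assert gzip.decompress(packed) == payload
print(len(payload), "->", len(packed), "bytes")
```

Smaller blocks expose more parallelism but usually compress slightly worse, since each member starts with an empty dictionary; that trade-off is one reason a block size parameter is worth tuning.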
## Usage and Examples
### CLI
```
❯ python -m pgzip -h
usage: __main__.py [-h] [-o OUTPUT] [-f FILENAME] [-d] [-l {0-9}] [-t THREADS] input

positional arguments:
  input                 Input file or '-' for stdin

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file or '-' for stdout (Default: Input file with 'gz' extension or stdout)
  -f FILENAME, --filename FILENAME
                        Name for the original file when compressing
  -d, --decompress      Decompress instead of compress
  -l {0-9}, --compression-level {0-9}
                        Compression level; 0 = no compression (Default: 9)
  -t THREADS, --threads THREADS
                        Number of threads to use (Default: Determine automatically)
```
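For scripts that drive the CLI, the options listed in the help output can be assembled into an argument list. The helper `build_pgzip_command` below is a hypothetical convenience and not part of `pgzip`; it only uses flags shown above and assumes `pgzip` is installed for the interpreter that runs the command.

```python
import sys


def build_pgzip_command(input_path, output=None, decompress=False, level=None, threads=None):
    """Build an argv list for `python -m pgzip` from the documented options."""
    cmd = [sys.executable, "-m", "pgzip"]
    if output is not None:
        cmd += ["-o", output]
    if decompress:
        cmd.append("-d")
    if level is not None:
        if level not in range(10):
            raise ValueError("compression level must be between 0 and 9")
        cmd += ["-l", str(level)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd.append(input_path)  # the positional input comes last
    return cmd


# Compress with level 6 on 8 threads, then decompress to stdout.
print(build_pgzip_command("reads.fastq", level=6, threads=8))
print(build_pgzip_command("reads.fastq.gz", output="-", decompress=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `pgzip` is available.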
### Programmatically
Using `pgzip` is the same as using the built-in `gzip` module.
Compressing data and writing it to a file:
```python
import pgzip

s = "a big string..."

# An explanation of parameters:
# `thread=8` - Use 8 threads to compress. `None` or `0` uses all cores (default)
# `blocksize=2*10**8` - Use a compression block size of 200MB
with pgzip.open("test.txt.gz", "wt", thread=8, blocksize=2*10**8) as fw:
    fw.write(s)
```
Decompressing data from a file:
```python
import pgzip

s = "a big string..."

with pgzip.open("test.txt.gz", "rt", thread=8) as fr:
    assert fr.read(len(s)) == s
```
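Since the interface mirrors the built-in module, code can be written to use `pgzip` when it is installed and fall back to `gzip` otherwise. The following is a sketch under that assumption; the `roundtrip` helper and the `open_kwargs` dictionary are illustrative names, and `thread=4` is simply the keyword from the examples above with an arbitrary value.

```python
import os
import tempfile

try:
    import pgzip as gz           # assumes pgzip is installed
    open_kwargs = {"thread": 4}  # pgzip-specific keyword from the README
except ImportError:
    import gzip as gz            # standard-library fallback
    open_kwargs = {}


def roundtrip(text: str) -> str:
    """Write `text` to a temporary .gz file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.txt.gz")
        with gz.open(path, "wt", **open_kwargs) as fw:
            fw.write(text)
        with gz.open(path, "rt", **open_kwargs) as fr:
            return fr.read()


assert roundtrip("a big string...") == "a big string..."
```

Keeping the `pgzip`-only keywords in one dictionary means the rest of the code does not need to know which module is in use.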
## Performance
### Compression Performance
![Compression Performance](https://raw.githubusercontent.com/pgzip/pgzip/master/CompressionBenchmark.png)
### Decompression Performance
![Decompression Performance](https://raw.githubusercontent.com/pgzip/pgzip/master/DecompressionBenchmark.png)
Decompression was benchmarked using an 8.0GB `FASTQ` text file with 48 threads across 24 cores on a machine with Xeon(R) E5-2650 v4 @ 2.20GHz CPUs.
The compressed file used in this benchmark was created with a blocksize of 200MB.
## Warning
`pgzip` only replaces the following methods of `gzip`'s `GzipFile` class:
- `open()`
- `compress()`
- `decompress()`
Other class methods and functionality have not been well tested.
Contributions or improvements are appreciated for methods such as:
- `seek()`
- `tell()`
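For anyone looking into `seek()` and `tell()`, a small consistency check can help compare behavior against the standard library. The helper `check_seek_tell` below is a hypothetical sketch: it takes any `open`-compatible function, so it is shown here with `gzip.open` as the reference, and passing `pgzip.open` instead is assumed to work the same way for file paths.

```python
import gzip
import os
import tempfile


def check_seek_tell(open_fn, raw: bytes, offset: int, size: int) -> bool:
    """Return True if seek()/tell() on a reader agree with slicing `raw`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.gz")
        with open_fn(path, "wb") as fw:
            fw.write(raw)
        with open_fn(path, "rb") as fr:
            fr.seek(offset)                      # offsets are in uncompressed bytes
            position_ok = fr.tell() == offset
            data_ok = fr.read(size) == raw[offset:offset + size]
    return position_ok and data_ok


raw = bytes(range(256)) * 64
# Reference run against the standard library; swap in `pgzip.open` to test pgzip.
assert check_seek_tell(gzip.open, raw, offset=1000, size=50)
```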
## History
Created initially by Vincent Li (@vinlyx), this project is a fork of [https://github.com/vinlyx/mgzip](https://github.com/vinlyx/mgzip). We had several bug fixes to implement, but we could not contact them. The `pgzip` team would like to thank Vincent Li (@vinlyx) for their hard work. We hope that they will contact us when they discover this project.
Raw data
{
"_id": null,
"home_page": "https://github.com/pgzip/pgzip",
"name": "pgzip",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "",
"author": "pgzip team",
"author_email": "pgzip@thegoldfish.org",
"download_url": "https://files.pythonhosted.org/packages/de/64/547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024/pgzip-0.3.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A multi-threading implementation of Python gzip module",
"version": "0.3.5",
"project_urls": {
"Homepage": "https://github.com/pgzip/pgzip"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "619f1a97b17fb29d7d1b6293faf13899c483460a2c524c3e06fe4226f6916133",
"md5": "09071694c8154806603151c122f1eebd",
"sha256": "4e13ab66ecface5c51c5af51d8cd676aa51675cf85df000f501a86cf38c208c1"
},
"downloads": -1,
"filename": "pgzip-0.3.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "09071694c8154806603151c122f1eebd",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 13011,
"upload_time": "2023-08-03T21:45:08",
"upload_time_iso_8601": "2023-08-03T21:45:08.706519Z",
"url": "https://files.pythonhosted.org/packages/61/9f/1a97b17fb29d7d1b6293faf13899c483460a2c524c3e06fe4226f6916133/pgzip-0.3.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "de64547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024",
"md5": "4c72e3911f4160cf013b0a9ad92ecb94",
"sha256": "dd35510f59f6bd6b64e31c4baf90c10cdbb2775235fcc079b14b404fbd7f46bf"
},
"downloads": -1,
"filename": "pgzip-0.3.5.tar.gz",
"has_sig": false,
"md5_digest": "4c72e3911f4160cf013b0a9ad92ecb94",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 14317,
"upload_time": "2023-08-03T21:45:10",
"upload_time_iso_8601": "2023-08-03T21:45:10.511464Z",
"url": "https://files.pythonhosted.org/packages/de/64/547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024/pgzip-0.3.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-03 21:45:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pgzip",
"github_project": "pgzip",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pgzip"
}