gzeus


Name: gzeus
Version: 0.1.0
Author email: Tianren Qin <tq9695@gmail.com>
Upload time: 2025-02-19 04:57:28
Requires Python: >=3.9
Keywords: compression, data-processing

# GZeus

is a package that chunk-reads *GZipped* text files *LIGHTNING* fast.

## What is this package for?

This package is designed for workloads that

1. need to read data from a very large .csv.gz file, and

2. apply additional rules while reading, working chunk by chunk to save memory.

This package provides a `Chunker` class that reads a gzip-compressed text file in chunks. For csv files, each chunk is a proper decompressed csv document, and only the first chunk carries the header row, if headers are present. The `Chunker` produces these chunks in a streaming fashion, minimizing memory load.

**This package can also be used to stream large gzipped text files in general, but it is not capable of semantic chunking, which is often needed for LLM text processing. It only chunks by identifying the last needle (newline character) in the haystack (the text in the current buffer).**
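For intuition, here is a minimal pure-Python sketch of that rule (GZeus itself implements this in Rust; `chunk_gz_by_lines` below is a hypothetical stand-in, not the package API):

```python
import gzip

def chunk_gz_by_lines(path: str, buffer_size: int = 1_000_000, newline: bytes = b"\n"):
    """Yield decompressed chunks that each end on a newline boundary."""
    leftover = b""
    with gzip.open(path, "rb") as f:
        while True:
            raw = f.read(buffer_size)
            if not raw:
                break
            buf = leftover + raw
            cut = buf.rfind(newline)       # last needle in the haystack
            if cut == -1:                  # no newline yet: keep buffering
                leftover = buf
                continue
            yield buf[: cut + 1]           # a chunk of whole lines
            leftover = buf[cut + 1 :]      # partial last line carries over
    if leftover:
        yield leftover                     # trailing bytes, if any
```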

## Assumptions

The `new_line_symbol` provided by the user appears in the underlying text file only as a line separator, never inside the content of a record.
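For example (hypothetical rows), a csv with the newline character embedded inside a quoted field violates this assumption, because cutting at the last `'\n'` could split that record across two chunks:

```python
ok_row  = b'1,Alice,NY\n'          # '\n' appears only as the record terminator
bad_row = b'2,"Bob\nSmith",LA\n'   # '\n' also appears inside a quoted field
```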

## We get decompressed bytes by chunks, then what? 

Most of the time, we only need to extract part of the data from a large .csv.gz file. This is where the combination of GZeus and Polars really shines.

If you have Polars installed already:
```python
from gzeus import stream_polars_csv_gz

for df_chunk in stream_polars_csv_gz("PATH TO YOUR DATA", func=your_func):
    ...  # do work with df_chunk
```
where `your_func` should map a `pl.LazyFrame` to a `pl.DataFrame`.
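A minimal hypothetical example (the column names are borrowed from the larger snippet below):

```python
import polars as pl

def your_func(lf: pl.LazyFrame) -> pl.DataFrame:
    # Project down to the columns we care about, then materialize the chunk.
    return lf.select("City_Category", "Source").collect()
```

If you need more control over the byte chunks, you can structure your code as below: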

```python
from gzeus import Chunker
import polars as pl

# Turn a portion of the produced bytes into a DataFrame. Only possible with Polars,
# or other dataframe packages with "lazy" capabilities.
def bytes_into_df(df: pl.LazyFrame) -> pl.DataFrame:
    return df.filter(
        pl.col("City_Category") == 'A'
    ).select("City_Category", "Primary_Bank_Type", "Source").collect()

ck = (
    Chunker(buffer_size=1_000_000, new_line_symbol='\n')
    .with_local_file("../data/test.csv.gz")
)

# The first chunk contains the header; infer the schema from it once,
# so the later header-less chunks parse with consistent dtypes.
df_temp = pl.scan_csv(ck.read_one())
schema = df_temp.collect_schema()
dfs = [bytes_into_df(df_temp)]

dfs.extend(
    bytes_into_df(
        pl.scan_csv(byte_chunk, has_header=False, schema=schema)
    )
    for byte_chunk in ck.chunks()
)

df = pl.concat(dfs)
df.head()
```

## Performance

See [here](./benches/bench.ipynb).

It is extremely hard to make an apples-to-apples comparison with other tools. Here I will focus on comparing with pandas.read_csv, which has an iterator option. Note: GZeus chunks are defined by byte size, while the pandas.read_csv iterator yields a fixed number of rows per chunk.

However, generally speaking, I find that for .csv.gz files:

1. GZeus + Polars takes at least 50% less time than pd.read_csv, with zero additional work on each chunk.
2. With a larger buffer size, GZeus + Polars can take as little as 1/5 of the time of pandas.read_csv.
3. The gap widens with more workload per chunk (mostly because of Polars).
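For reference, a rough timing harness along these lines might look as follows (the path and chunk sizes are hypothetical; pandas chunks by row count while GZeus chunks by decompressed byte size, so the comparison is approximate by design):

```python
import time

import pandas as pd
from gzeus import stream_polars_csv_gz

PATH = "data/test.csv.gz"  # hypothetical file

# pandas: iterator of fixed-row-count chunks, zero additional work per chunk
t0 = time.perf_counter()
for chunk in pd.read_csv(PATH, chunksize=500_000):
    pass
t_pandas = time.perf_counter() - t0

# GZeus + Polars: byte-sized chunks, materialized as-is
t0 = time.perf_counter()
for df in stream_polars_csv_gz(PATH, func=lambda lf: lf.collect()):
    pass
t_gzeus = time.perf_counter() - t0

print(f"pandas: {t_pandas:.2f}s, gzeus + polars: {t_gzeus:.2f}s")
```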

## Cloud Files

To support "chunk read" from any cloud major provider is no easy task. Not only will it require an async interface in Rust, which is much harder to write and maintain, but there are also performance issues related to getting only a small chunk each time. To name a few:

1. Increase the number of calls to the storage
2. Repeatedly opening the file and seeking to the last read position. 
3. Rate limits issues, especially with VPN. E.g. to get better performance, gzeus needs to read 10MB+ per chunk, but this will increase "packets per second" significantly.

A workaround is to use temp files. For example, for Amazon S3, one can do the following:

```python
import tempfile
import boto3

s3 = boto3.client('s3')

with tempfile.NamedTemporaryFile() as tmp:
    # download_fileobj streams the object's bytes directly into tmp (it returns None)
    s3.download_fileobj('amzn-s3-demo-bucket', 'OBJECT_NAME', tmp)
    tmp.flush()
    df = chunk_load_data_using_gzeus(tmp.name) # a wrapper function for the code shown above
```

Almost always, the machine should have enough disk space for the temporary file. Inside `chunk_load_data_using_gzeus`, data is read in chunks and therefore won't lead to OOM errors. It can be any wrapper around the `stream_polars_csv_gz` function provided by the package.
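One possible sketch of such a wrapper (the name is the hypothetical one used in the snippet above; it simply concatenates the per-chunk DataFrames):

```python
import polars as pl
from gzeus import stream_polars_csv_gz

def chunk_load_data_using_gzeus(path: str) -> pl.DataFrame:
    # Stream the .csv.gz in gzeus chunks and stitch the results back together.
    return pl.concat(list(stream_polars_csv_gz(path, func=lambda lf: lf.collect())))
```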

## Roadmap
1. To be decided

## Other Projects to Check Out
1. Dataframe-friendly data analysis package [polars_ds](https://github.com/abstractqqq/polars_ds_extension)

            
