Name | parquetreader
Version | 0.0.1
home_page | None |
Summary | Pyarrow Dataset wrapper for reading parquet datasets as rows |
upload_time | 2025-02-11 16:51:09 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | Copyright (c) <year> <copyright holders>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords | parquet, reader
VCS |
bugtrack_url | None
requirements | No requirements were recorded.
Travis-CI | No Travis.
coveralls test coverage | No coveralls.
# SMurphyDev - Parquet Reader
version = 0.0.1
The purpose of this library is to enable reading parquet files one row at a
time in a relatively memory-conscious manner. I say relatively because this
library is a thin wrapper over pyarrow and pyarrow Datasets, and Arrow favors
greedy allocation.
Parquet is a columnar format which is compressed on disk. Its intended use
case is analytics workflows where you may need to persist large amounts of
data to disk that you will want to query later. The problem which inspired this
library is a very different use case: I needed to extract data from a parquet
file for use in an ETL-style workflow. If you have a similar problem, maybe this
will be useful for you too.
## Installation
Installation is straightforward. Just use pip:
```
pip install parquetreader
```
## Usage
In the simplest case you should be able to read a parquet file like so:
```
import parquetreader.reader as pr

# Fields/Columns you want to read from the parquet file.
fields = ["Field_1", "Field_2", "Field_3"]

# Path to the file you want to read.
# (Or to a directory containing parquet files, or a list of parquet files)
file_path = "path/to/file.parquet"

reader = pr.ParquetReader(file_path)

for row in reader.get_rows(fields):
    print(row["Field_1"])
    print(row["Field_2"])
    print(row["Field_3"])
```
get_rows() returns a generator which yields the data in the underlying file one
row at a time. Files/Datasets are read in batches of 10k records, the records
are converted into dictionaries of Python types, and they are returned in a way
which allows us to iterate over them lazily one at a time.
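To make that streaming behaviour concrete, here is a minimal sketch of the
pattern get_rows() wraps, written directly against pyarrow. The helper name
`iter_rows` is hypothetical and the 10k default just mirrors the batch size
described above; the library's actual internals may differ.

```
import pyarrow.dataset as ds

def iter_rows(source, fields, batch_size=10_000):
    """Yield rows from a parquet file/directory as plain Python dicts."""
    dataset = ds.dataset(source, format="parquet")
    # to_batches() streams RecordBatches instead of materialising the whole file.
    for batch in dataset.to_batches(columns=fields, batch_size=batch_size):
        # RecordBatch.to_pylist() turns a batch into a list of dicts.
        for row in batch.to_pylist():
            yield row
```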
If you need more control you can create the pyarrow dataset yourself. Under the
hood get_rows() calls Dataset.to_batches(), and you can also pass arguments
through directly, which allows you to control the performance of reading the
parquet files.
```
import parquetreader.reader as pr
import pyarrow.dataset as ds

# Fields/Columns you want to read from the parquet file.
fields = ["Field_1", "Field_2", "Field_3"]

# Path to the file you want to read.
# (Or to a directory containing parquet files, or a list of parquet files)
file_path = "path/to/file.parquet"

dataset = ds.dataset(
    file_path,
    format="parquet",
    exclude_invalid_files=True,
)

reader = pr.ParquetReader(dataset)

# Accepts the same arguments as Dataset.to_batches()
for row in reader.get_rows_with_args(
    columns=fields,
    batch_size=10_000,
    batch_readahead=4,  # Number of batches to read ahead in a file
    fragment_readahead=2,  # Number of files to read ahead in a dataset
    use_threads=False,
):
    print(row["Field_1"])
    print(row["Field_2"])
    print(row["Field_3"])
```
You can read more about the arguments you can pass when creating a dataset or
reading a batch in the Arrow docs; one example of what they enable is sketched
after this list:
1. [Dataset Args](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset)
2. [to_batches()/get_rows_with_args() Args](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches)
3. [Pyarrow docs on batch reads](https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads)
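One option worth knowing about from those docs: Dataset.to_batches() also
accepts a filter expression, so row filtering can be pushed down into the scan
instead of done in Python. A small sketch against pyarrow directly (the column
names are just the placeholders used above); whether get_rows_with_args()
forwards filter= in the same way is something to verify against the wrapper.

```
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/file.parquet", format="parquet")

# Only rows where Field_1 > 100 are materialised; pyarrow applies the
# predicate while scanning rather than after loading each batch.
for batch in dataset.to_batches(
    columns=["Field_1", "Field_2"],
    filter=ds.field("Field_1") > 100,
):
    for row in batch.to_pylist():
        print(row)
```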
## Development
To get up and running if you want to contribute:
```
git clone https://github.com/SMurphyDev/parquet-batch.git
cd parquet-batch
python3 -m venv venv
source venv/bin/activate
pip install pip-tools
pip-sync requirements.txt dev-requirements.txt
```
At this point you should have all of the required dependencies set up and you
should be good to go.
Raw data
{
"_id": null,
"home_page": null,
"name": "parquetreader",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "SMurphyDev <stephen@smurphydev.ie>",
"keywords": "parquet, reader",
"author": null,
"author_email": "SMurphyDev <stephen@smurphydev.ie>",
"download_url": "https://files.pythonhosted.org/packages/32/91/29e947396ee1bc9adad1872aa43babde17fbe6a4c84a0ef768ac7885f7df/parquetreader-0.0.1.tar.gz",
"platform": null,
"description": "# SMurphyDev - Parquet Reader\n\nversion = 0.0.1\n\nThe purpose of this library is to enable reading parquet files one row at a\ntime in a relatively memory consious manner. I say relatively because this\nlibrary is a thin wrapper over pyarrow, and pyarrow Datasets, and arrows favors\ngreedy allocation.\n\nParquet is a columnar format, which is compressed on disk. It's intended use\ncase is for analytics workflows where you may need to persist large amounts of\ndata to disk that you will want to query later. The problem which inspired this\nlibrary is a very different usecase. I needed to extract data from a parquet\nfile for use in an ETL style workflow. If you have a similar problem maybe this\nwill be useful for you too.\n\n## Installation\n\nInstallation is straight forward. Just use pip\n\n```\npip install parquetreader\n```\n\n## Usage\n\nIn the simplest case you should be able to read a parquet file like so:\n\n```\nimport parquetreader.reader as pr\n\n# Fields/Columns you want to read from the parquet file.\nfields = [\"Field_1\", \"Field_2\", \"Field_3\"]\n\n# Path to the file you want to read.\n# (Or to a directory containing parquet files, or a list of parquet files)\nfile_path = \"path/to/file.parquet\"\n\nreader = rd.ParquetReader(file_path)\n\nfor row in reader.get_rows(fields):\n print(row[\"Field_1\"])\n print(row[\"Field_2\"])\n print(row[\"Field_3\"])\n```\n\nget_rows returns a generator which yields data in the underlying file one row\nat a time. Files/Datasets are read in batches of 10k records, the records are\nconverted into dictionaries of python types and returned in a way which allows\nus to iterate over them lazily one at a time.\n\nIf you need more control you can create the pyarrow dataset yourself. Under the\nhood get_rows() calls Dataset.to_batches(). You can also pass arguments in\ndirectly here which allow you to control the performance of reading the parquet\nfiles.\n\n```\nimport parquetreader.reader as pr\nimport pyarrow.dataset as ds\n\n# Fields/Columns you want to read from the parquet file.\nfields = [\"Field_1\", \"Field_2\", \"Field_3\"]\n\n# Path to the file you want to read.\n# (Or to a directory containing parquet files, or a list of parquet files)\nfile_path = \"path/to/file.parquet\"\n\ndataset = ds.dataset(\n file_path,\n format=\"parquet\",\n exclude_invalid_files=True,\n)\n\nreader = rd.ParquetReader(dataset)\n\n# Accepts same arguments as Dataset.to_batch()\nfor record in pbr.get_rows_with_args(\n columns=fields,\n batch_size=batch_size,\n batch_readahead=4, # Number of batches to read ahead in a file\n fragment_readahead=2, # Number of files to read ahead in a dataset\n use_threads=False,\n ):\n print(row[\"Field_1\"])\n print(row[\"Field_2\"])\n print(row[\"Field_3\"])\n```\n\nYou can read more about the arguments you can pass when creating a dataset or\nreading a batch from the arrow docs:\n\n1. [Dataset Args](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset)\n2. [to_batch()/get_rows_with_args() Args](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches)\n3. 
[Pyarrow docs on batch reads](https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads)\n\n## Development\n\nTo get up and running if you want to contribute:\n\n```\ngit clone https://github.com/SMurphyDev/parquet-batch.git\ngit cd parquet-batch\n\npython3 -m venv venv\nsource venv/bin/activate\npip install pip-tools\npip-sync requirements.txt dev-requirements.txt\n\n```\n\nAt this point you should have all of the required dependencies set up and you\nshould be good to go.\n",
"bugtrack_url": null,
"license": "Copyright (c) <year> <copyright holders>\n \n Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n ",
"summary": "Pyarrow Dataset wrapper for reading parquet datasets as rows",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/SMurphyDev/parquet-batch",
"Issues": "https://github.com/SMurphyDev/parquet-batch/issues",
"repository": "https://github.com/SMurphyDev/parquet-batch.git"
},
"split_keywords": [
"parquet",
" reader"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f4bb043b08337e4dcb3daa49daf8104ba737be0df6d935b5d4c697e71bef81fd",
"md5": "7d66b0b45b8fb11fc0ec2916b1580f95",
"sha256": "fb187698769c593f2fab6e384f40555ad120983c3998c09e2d005be9e2ba711f"
},
"downloads": -1,
"filename": "parquetreader-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7d66b0b45b8fb11fc0ec2916b1580f95",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 6159,
"upload_time": "2025-02-11T16:51:08",
"upload_time_iso_8601": "2025-02-11T16:51:08.249555Z",
"url": "https://files.pythonhosted.org/packages/f4/bb/043b08337e4dcb3daa49daf8104ba737be0df6d935b5d4c697e71bef81fd/parquetreader-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "329129e947396ee1bc9adad1872aa43babde17fbe6a4c84a0ef768ac7885f7df",
"md5": "52daba6be76bdbc619b47148e3bb8411",
"sha256": "1b9901a7ead5e45cd9737486538fb40bae7017af1054ed38042f0b267c2802bb"
},
"downloads": -1,
"filename": "parquetreader-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "52daba6be76bdbc619b47148e3bb8411",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 10195,
"upload_time": "2025-02-11T16:51:09",
"upload_time_iso_8601": "2025-02-11T16:51:09.555477Z",
"url": "https://files.pythonhosted.org/packages/32/91/29e947396ee1bc9adad1872aa43babde17fbe6a4c84a0ef768ac7885f7df/parquetreader-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-11 16:51:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SMurphyDev",
"github_project": "parquet-batch",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "parquetreader"
}