aws-parquet

Name: aws-parquet
Version: 0.5.0
Home page: https://github.com/marwan116/aws-parquet/
Documentation: https://aws-parquet.readthedocs.io/en/latest/
Summary: An object-oriented interface for defining parquet datasets for AWS, built on top of awswrangler and pandera
Upload time: 2023-06-19 10:30:25
Author: Marwan Sarieddine
Requires Python: >=3.8,<=3.11
License: MIT
Keywords: pandas, aws, parquet
# aws-parquet

<br>

[![PyPI version shields.io](https://img.shields.io/pypi/v/aws-parquet.svg)](https://pypi.org/project/aws-parquet/)
[![PyPI license](https://img.shields.io/pypi/l/aws-parquet.svg)](https://pypi.org/project/aws-parquet/)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/aws-parquet.svg)](https://pypi.org/project/aws-parquet/)
[![Downloads](https://pepy.tech/badge/aws-parquet/month)](https://pepy.tech/project/aws-parquet)
[![Downloads](https://pepy.tech/badge/aws-parquet)](https://pepy.tech/project/aws-parquet)

`aws-parquet` is a toolkit that enables working with parquet datasets on AWS. It provides a simple and intuitive interface that handles AWS S3 reads/writes, AWS Glue catalog updates, and AWS Athena queries.

## Motivation

The goal is to offer a single object-oriented entry point for creating and managing parquet datasets on AWS, rather than stitching together separate S3, Glue, and Athena calls.

`aws-parquet` makes use of the following tools: 
- [awswrangler](https://aws-sdk-pandas.readthedocs.io/en/stable/) as an AWS SDK for pandas
- [pandera](https://pandera.readthedocs.io/en/stable/) for pandas-based data validation
- [typeguard](https://typeguard.readthedocs.io/en/stable/userguide.html) and [pydantic](https://docs.pydantic.dev/latest/) for runtime type checking
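
To make the pandera piece concrete, here is a minimal standalone sketch (plain pandera, no aws-parquet involved) of a schema coercing types and rejecting out-of-range values:

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series

class ExampleSchema(pa.SchemaModel):
    col1: Series[int] = pa.Field(ge=0, lt=10)

    class Config:
        coerce = True  # cast incoming columns to the declared dtypes

schema = ExampleSchema.to_schema()

# passes: string values are coerced to int and fall within [0, 10)
print(schema.validate(pd.DataFrame({"col1": ["1", "2", "3"]})))

# fails: 12 violates the lt=10 check
try:
    schema.validate(pd.DataFrame({"col1": [12]}))
except pa.errors.SchemaError as err:
    print(err)
```

This is the kind of validation and type casting that the dataset operations below apply on reads and writes.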

## Features

`aws-parquet` provides a `ParquetDataset` class that enables the following operations:

- create a parquet dataset that will get registered in AWS Glue
- append new data to the dataset and update the AWS Glue catalog
- read a partition of the dataset and perform proper schema validation and type casting
- overwrite data in the dataset after performing proper schema validation and type casting
- delete a partition of the dataset and update the AWS Glue catalog
- query the dataset using AWS Athena


## How to set up

Using pip:

```bash
pip install aws_parquet
```

## How to use

Create a parquet dataset that will get registered in AWS Glue

```python
import os

from aws_parquet import ParquetDataset
import pandas as pd
import pandera as pa
from pandera.typing import Series

# define your pandera schema model
class MyDatasetSchemaModel(pa.SchemaModel):
    col1: Series[int] = pa.Field(nullable=False, ge=0, lt=10)
    col2: Series[pa.DateTime]
    col3: Series[float]

# configuration
database = "default"
bucket_name = os.environ["AWS_S3_BUCKET"]
table_name = "foo_bar"
path = f"s3://{bucket_name}/{table_name}/"
partition_cols = ["col1", "col2"]
schema = MyDatasetSchemaModel.to_schema()

# create the dataset
dataset = ParquetDataset(
    database=database,
    table=table_name,
    partition_cols=partition_cols,
    path=path,
    pandera_schema=schema,
)

dataset.create()
```
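
`create()` registers the table in AWS Glue. One way to confirm the registration, using plain awswrangler rather than anything in the aws-parquet API (a sketch, assuming default AWS credentials are configured):

```python
import awswrangler as wr

# the Glue catalog should now know about the table
assert wr.catalog.does_table_exist(database="default", table="foo_bar")

# inspect the registered columns and partition keys
print(wr.catalog.table(database="default", table="foo_bar"))
```
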
Append new data to the dataset

```python
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [1.0, 2.0, 3.0]
})

dataset.update(df)
```
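
Since the library is built on awswrangler, a reasonable mental model for the append (an assumption about the internals, not documented behavior) is `wr.s3.to_parquet` in dataset mode:

```python
import awswrangler as wr

# roughly what the append amounts to in raw awswrangler terms (assumed)
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,          # write a partitioned dataset, not a single file
    mode="append",         # add new files without touching existing ones
    database=database,
    table=table_name,      # keeps the Glue catalog in sync
    partition_cols=partition_cols,
)
```

What `update` adds on top of such a call is the pandera validation and type casting (here, casting the `col2` strings to datetimes) before anything is written.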

Read a partition of the dataset

```python
df = dataset.read({"col2": "2021-01-01"})
```
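
The dict maps partition column names to the values to select. In raw awswrangler terms the read is presumably close to a filtered table read (again a sketch, not the library's actual internals):

```python
import awswrangler as wr

# roughly equivalent direct read (assumed); partition values arrive as strings
df = wr.s3.read_parquet_table(
    database="default",
    table="foo_bar",
    partition_filter=lambda p: p["col2"] == "2021-01-01",
)
```
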

Overwrite data in the dataset

```python
df_overwrite = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [4.0, 5.0, 6.0]
})
dataset.update(df_overwrite, overwrite=True)
```
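
With `overwrite=True`, existing data is replaced rather than appended to; in awswrangler terms this presumably corresponds to writing with `mode="overwrite_partitions"` (or `mode="overwrite"`) instead of `mode="append"`.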

Query the dataset using AWS Athena

```python
df = dataset.query("SELECT col1 FROM foo_bar")
```
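
The query executes in AWS Athena against the Glue table and comes back as a pandas DataFrame. This is presumably a thin wrapper around awswrangler's Athena reader (a sketch, assumed):

```python
import awswrangler as wr

# roughly equivalent direct Athena call (assumed)
df = wr.athena.read_sql_query(
    sql="SELECT col1 FROM foo_bar",
    database="default",
)
```
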

Delete a partition of the dataset

```python
dataset.delete({"col1": 1, "col2": "2021-01-01"})
```
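
As in `read`, the dict pins partition values; presumably only the matching partition's files and its Glue catalog entry are removed.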


Delete the dataset in its entirety

```python
dataset.delete()
```
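
Called with no arguments, `delete` presumably drops the Glue table and removes the underlying parquet files from S3, so the dataset disappears from both the catalog and the bucket.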

            
