hyper-python-utils 0.4.0

Summary: AWS S3 and Athena utilities for data processing with Polars
Uploaded: 2025-10-20 23:52:27
Requires Python: >=3.8
License: MIT
Keywords: aws, s3, athena, polars, data, utilities
Homepage: https://github.com/NHNAD-wooyeon/hyper-python-utils

# Hyper Python Utils

![Version](https://img.shields.io/badge/version-0.4.0-blue.svg)
![Python](https://img.shields.io/badge/python-3.8+-green.svg)
![PyPI](https://img.shields.io/pypi/v/hyper-python-utils.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)

AWS S3 and Athena utilities for data processing with Pandas and Polars.

## Installation

```bash
pip install hyper-python-utils
```

## Features

- **Simple Query Functions**: Easy-to-use wrapper functions
  - `query()`: Execute Athena queries with minimal setup
  - `query_unload()`: Execute UNLOAD query and return S3 path
  - `load_unload_data()`: Load DataFrame from UNLOAD results
  - `cleanup_unload_data()`: Clean up S3 files (optional)
  - Support for both Pandas and Polars DataFrames
  - Optimized performance with Parquet + GZIP

- **FileHandler**: S3 file operations with Polars DataFrames
  - Upload/download CSV and Parquet files
  - Parallel loading of multiple files
  - Partitioned uploads by range or date
  - Support for compressed formats

- **QueryManager**: Advanced Athena query execution and management
  - Execute queries with result monitoring
  - Clean up query result files
  - Error handling and timeouts
  - Full control over query execution (see the boto3 sketch below for the general pattern)
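
To make the execute/monitor/timeout flow concrete, here is a minimal sketch of the same pattern in plain boto3. The function name, parameters, and polling interval are illustrative assumptions, not `QueryManager`'s actual API.

```python
import time

import boto3

# Conceptual sketch only: an execute-and-poll loop against Athena,
# approximating what a query manager has to do under the hood.
athena = boto3.client("athena")

def run_query(database: str, sql: str, output_location: str, timeout_s: int = 300) -> str:
    """Start an Athena query and poll until it succeeds, fails, or times out."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return qid
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {qid} ended in state {state}")
        time.sleep(2)  # brief pause between polls
    raise TimeoutError(f"Query {qid} did not finish within {timeout_s}s")
```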

## Quick Start

### Simple Query Functions (Recommended for Most Use Cases)

The easiest way to query Athena data:

```python
import hyper_python_utils as hp

# Execute a simple query (returns pandas DataFrame by default)
df = hp.query(
    database="my_database",
    query="SELECT * FROM my_table LIMIT 100"
)
print(df)
print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# Get results as polars DataFrame
df = hp.query(
    database="my_database",
    query="SELECT * FROM my_table LIMIT 100",
    option="polars"
)
print(type(df))  # <class 'polars.dataframe.frame.DataFrame'>

# For large datasets, use UNLOAD (3-step process for better control)
# Step 1: Execute query and get S3 path
s3_path = hp.query_unload(
    database="my_database",
    query="SELECT * FROM large_table WHERE date > '2024-01-01'"
)
# Step 2: Load data from S3
df = hp.load_unload_data(s3_path, option="pandas")  # or option="polars"
# Step 3: Clean up (optional)
hp.cleanup_unload_data(s3_path)

# Queries with semicolons are automatically handled
df = hp.query(database="my_database", query="SELECT * FROM table;")  # Works fine!
```

**Key Features:**
- Pre-configured with sensible defaults (query results are written to the bucket set via `HYPER_ATHENA_BUCKET`; see Configuration below)
- Automatic cleanup of temporary files (for `query()` only)
- No exceptions on empty results (returns empty DataFrame)
- Query execution time displayed in logs
- `query_unload()` uses Parquet + GZIP for 4x performance boost
- Three-step UNLOAD process for better control: execute, load, cleanup

**When to use which?** (a small routing helper is sketched after this list)
- `query()`: Normal queries, small to medium datasets (< 1M rows)
- `query_unload()` + `load_unload_data()`: Large datasets (> 1M rows), when performance matters
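
That choice can be wrapped in a small routing helper. This is a sketch under the assumptions above (the 1M-row threshold, and that all three functions accept `option`), not part of the library:

```python
import hyper_python_utils as hp

def fetch(database: str, sql: str, expected_rows: int, option: str = "pandas"):
    """Hypothetical helper: route small queries through query() and large
    ones through the three-step UNLOAD path."""
    if expected_rows < 1_000_000:
        return hp.query(database=database, query=sql, option=option)
    s3_path = hp.query_unload(database=database, query=sql)
    df = hp.load_unload_data(s3_path, option=option)
    hp.cleanup_unload_data(s3_path)  # optional: remove UNLOAD files once loaded
    return df
```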

**UNLOAD Process** (a sketch of the generated SQL follows these steps):
1. `query_unload()`: Execute query and get S3 directory path
2. `load_unload_data()`: Load DataFrame from S3 files
3. `cleanup_unload_data()`: (Optional) Delete files from S3
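
For orientation, the `UNLOAD ... TO ... WITH (...)` syntax is standard Athena SQL; exactly how `query_unload()` composes it is an assumption, but a statement of roughly this shape is what reaches Athena:

```python
# Assumed shape of the statement query_unload() issues. The wrapper logic is
# a guess; the UNLOAD syntax itself is standard Athena SQL.
def build_unload_sql(inner_query: str, s3_target: str) -> str:
    return (
        f"UNLOAD ({inner_query.rstrip(';')}) "  # trailing semicolons stripped, as the README notes
        f"TO '{s3_target}' "
        "WITH (format = 'PARQUET', compression = 'GZIP')"
    )

print(build_unload_sql(
    "SELECT * FROM large_table WHERE date > '2024-01-01'",
    "s3://your-athena-results-bucket/query_results_for_unload/run-001/",
))
```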

## Requirements

- Python >= 3.8
- boto3 >= 1.26.0
- polars >= 0.18.0
- pandas >= 1.5.0

## Configuration

### AWS Credentials

Make sure your AWS credentials are configured through one of the following:
- AWS CLI (`aws configure`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM roles (when running on EC2)

Required permissions (a quick credential sanity check is sketched after this list):
- S3: `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, `s3:DeleteObject`
- Athena: `athena:StartQueryExecution`, `athena:GetQueryExecution`
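
A quick way to confirm your credentials resolve before issuing any queries (plain boto3, not part of this library):

```python
import boto3

# Sanity check: print which identity boto3 resolves before running queries.
identity = boto3.client("sts").get_caller_identity()
print(f"Authenticated as {identity['Arn']} (account {identity['Account']})")
```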

### Environment Variables

**IMPORTANT:** You must set the `HYPER_ATHENA_BUCKET` environment variable before using this library.

```bash
# REQUIRED: Set your S3 bucket for Athena query results
export HYPER_ATHENA_BUCKET="your-athena-results-bucket"

# OPTIONAL: Set custom query result prefix (default: "query_results/")
export HYPER_ATHENA_PREFIX="my-custom-prefix/"

# OPTIONAL: Set custom UNLOAD prefix (default: "query_results_for_unload")
export HYPER_UNLOAD_PREFIX="my-unload-prefix"
```

**Python Example:**
```python
import os

# REQUIRED: Set bucket before importing the library
os.environ["HYPER_ATHENA_BUCKET"] = "my-company-athena-results"

# OPTIONAL: Customize prefixes
os.environ["HYPER_ATHENA_PREFIX"] = "analytics/queries/"
os.environ["HYPER_UNLOAD_PREFIX"] = "analytics/unload"

import hyper_python_utils as hp

# Now you can use the library
df = hp.query(database="my_db", query="SELECT * FROM table")
```

**Using .env file:**
```bash
# Copy the example file
cp .env.example .env

# Edit .env and set your bucket name
# HYPER_ATHENA_BUCKET=your-actual-bucket-name

# Then use python-dotenv to load it
```

```python
from dotenv import load_dotenv
load_dotenv()  # Load .env file

import hyper_python_utils as hp
df = hp.query(database="my_db", query="SELECT * FROM table")
```
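
Because the bucket variable is required, one defensive pattern (a suggestion, not library behavior) is to fail fast with a readable error before the import:

```python
import os

# Hypothetical guard: surface a clear error if the required variable is
# missing, instead of a failure deep inside the first query.
if not os.getenv("HYPER_ATHENA_BUCKET"):
    raise RuntimeError(
        "HYPER_ATHENA_BUCKET is not set; export it or load it from .env "
        "before importing hyper_python_utils"
    )

import hyper_python_utils as hp  # safe to import once the guard passes
```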

## Changelog

### v0.3.2
- **Fixed**: Improved file filtering for UNLOAD to only include Parquet files (.parquet, .parquet.gz)
- **Improved**: Added debug logging to show which files are being read during UNLOAD

### v0.3.1
- **Fixed**: Removed automatic cleanup for UNLOAD files to prevent timing issues
- **Improved**: UNLOAD files now kept in S3 for reliable access

### v0.3.0
- **New**: Added `query()` and `query_unload()` wrapper functions for simplified usage
- **New**: Support for both Pandas and Polars DataFrames (Pandas is default)
- **Improved**: UNLOAD queries now use Parquet + GZIP (4x performance improvement)
- **Improved**: Empty query results return empty DataFrame instead of throwing exception
- **Improved**: Query execution time now displayed in logs
- **Improved**: Automatic removal of trailing semicolons in queries
- **Improved**: Silent cleanup (removed unnecessary log messages)

### v0.1.2
- Initial stable release
- FileHandler for S3 operations
- QueryManager for Athena queries

## License

MIT License

            
