| Name | hyper-python-utils |
| Version | 0.4.0 |
| download | |
| home_page | None |
| Summary | AWS S3 and Athena utilities for data processing with Polars |
| upload_time | 2025-10-20 23:52:27 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT |
| keywords | aws, s3, athena, polars, data, utilities |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Hyper Python Utils




AWS S3 and Athena utilities for data processing with Pandas and Polars.
## Installation
```bash
pip install hyper-python-utils
```
## Features
- **Simple Query Functions (New in v0.2.0)**: Easy-to-use wrapper functions
  - `query()`: Execute Athena queries with minimal setup
  - `query_unload()`: Execute UNLOAD query and return S3 path
  - `load_unload_data()`: Load DataFrame from UNLOAD results
  - `cleanup_unload_data()`: Clean up S3 files (optional)
  - Support for both Pandas and Polars DataFrames
  - Optimized performance with Parquet + GZIP
- **FileHandler**: S3 file operations with Polars DataFrames
  - Upload/download CSV and Parquet files
  - Parallel loading of multiple files
  - Partitioned uploads by range or date
  - Support for compressed formats
- **QueryManager**: Advanced Athena query execution and management
  - Execute queries with result monitoring
  - Clean up query result files
  - Error handling and timeouts
  - Full control over query execution
## Quick Start
### Simple Query Functions (Recommended for Most Use Cases)
The easiest way to query Athena data:
```python
import hyper_python_utils as hp

# Execute a simple query (returns pandas DataFrame by default)
df = hp.query(
    database="my_database",
    query="SELECT * FROM my_table LIMIT 100"
)
print(df)
print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# Get results as polars DataFrame
df = hp.query(
    database="my_database",
    query="SELECT * FROM my_table LIMIT 100",
    option="polars"
)
print(type(df))  # <class 'polars.dataframe.frame.DataFrame'>

# For large datasets, use UNLOAD (3-step process for better control)
# Step 1: Execute query and get S3 path
s3_path = hp.query_unload(
    database="my_database",
    query="SELECT * FROM large_table WHERE date > '2024-01-01'"
)
# Step 2: Load data from S3
df = hp.load_unload_data(s3_path, option="pandas")  # or option="polars"
# Step 3: Clean up (optional)
hp.cleanup_unload_data(s3_path)

# Queries with semicolons are automatically handled
df = hp.query(database="my_database", query="SELECT * FROM table;")  # Works fine!
```
**Key Features:**
- Pre-configured with optimal settings (bucket: `athena-query-results-for-hyper`)
- Automatic cleanup of temporary files (for `query()` only)
- No exceptions on empty results (returns empty DataFrame)
- Query execution time displayed in logs
- `query_unload()` uses Parquet + GZIP for 4x performance boost
- Three-step UNLOAD process for better control: execute, load, cleanup
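Because empty result sets come back as an empty DataFrame rather than raising, callers can simply branch on the result. A minimal sketch (the database and table names are placeholders, and pandas is assumed as the default return type):

```python
import hyper_python_utils as hp

# Placeholder names; a query that matches no rows returns an empty DataFrame
df = hp.query(
    database="my_database",
    query="SELECT * FROM my_table WHERE 1 = 0"
)

if df.empty:  # pandas DataFrame attribute
    print("Query returned no rows")
else:
    print(f"Fetched {len(df)} rows")
```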
**When to use which?**
- `query()`: Normal queries, small to medium datasets (< 1M rows)
- `query_unload()` + `load_unload_data()`: Large datasets (> 1M rows), when performance matters
**UNLOAD Process:**
1. `query_unload()`: Execute query and get S3 directory path
2. `load_unload_data()`: Load DataFrame from S3 files
3. `cleanup_unload_data()`: (Optional) Delete files from S3
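Putting the three steps together, one way to make sure the optional cleanup runs even if loading fails is a `try`/`finally` block. A minimal sketch built from the functions above (the database, query, and the `polars` option are illustrative choices):

```python
import hyper_python_utils as hp

# Step 1: run the UNLOAD query; the return value is the S3 directory path
s3_path = hp.query_unload(
    database="my_database",
    query="SELECT * FROM large_table WHERE date > '2024-01-01'"
)

try:
    # Step 2: load the Parquet + GZIP results into a DataFrame
    df = hp.load_unload_data(s3_path, option="polars")
    print(df.shape)
finally:
    # Step 3 (optional): delete the unloaded files once they are no longer needed
    hp.cleanup_unload_data(s3_path)
```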
## Requirements
- Python >= 3.8
- boto3 >= 1.26.0
- polars >= 0.18.0
- pandas >= 1.5.0
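To double-check that an existing environment satisfies these minimums, a small convenience sketch using only the standard library (not part of hyper-python-utils):

```python
from importlib.metadata import version

# Minimum versions listed above; prints what is actually installed
for pkg, minimum in {"boto3": "1.26.0", "polars": "0.18.0", "pandas": "1.5.0"}.items():
    print(f"{pkg}: installed {version(pkg)}, required >= {minimum}")
```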
## Configuration
### AWS Credentials
Make sure your AWS credentials are configured through one of the following:
- AWS CLI (`aws configure`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM roles (when running on EC2)
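To verify which identity your environment will actually use before running queries, a quick STS check works (a sketch that calls boto3 directly and is not part of this library):

```python
import boto3

# Prints the account and ARN of the credentials boto3 resolves from the
# CLI profile, environment variables, or the instance IAM role
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```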
Required permissions:
- S3: `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, `s3:DeleteObject`
- Athena: `athena:StartQueryExecution`, `athena:GetQueryExecution`
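Expressed as an inline IAM policy document, those permissions might look like the following sketch (the bucket name and resource ARNs are placeholders to adapt to your own `HYPER_ATHENA_BUCKET`):

```python
import json

# Placeholder ARNs; scope the S3 statements to your Athena results bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket", "s3:DeleteObject"],
            "Resource": [
                "arn:aws:s3:::your-athena-results-bucket",
                "arn:aws:s3:::your-athena-results-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution"],
            "Resource": "*"
        }
    ]
}
print(json.dumps(policy, indent=2))  # paste into the IAM console or attach with your usual tooling
```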
### Required Environment Variables
**IMPORTANT:** You must set the `HYPER_ATHENA_BUCKET` environment variable before using this library.
```bash
# REQUIRED: Set your S3 bucket for Athena query results
export HYPER_ATHENA_BUCKET="your-athena-results-bucket"
# OPTIONAL: Set custom query result prefix (default: "query_results/")
export HYPER_ATHENA_PREFIX="my-custom-prefix/"
# OPTIONAL: Set custom UNLOAD prefix (default: "query_results_for_unload")
export HYPER_UNLOAD_PREFIX="my-unload-prefix"
```
**Python Example:**
```python
import os
# REQUIRED: Set bucket before importing the library
os.environ["HYPER_ATHENA_BUCKET"] = "my-company-athena-results"
# OPTIONAL: Customize prefixes
os.environ["HYPER_ATHENA_PREFIX"] = "analytics/queries/"
os.environ["HYPER_UNLOAD_PREFIX"] = "analytics/unload"
import hyper_python_utils as hp
# Now you can use the library
df = hp.query(database="my_db", query="SELECT * FROM table")
```
**Using .env file:**
```bash
# Copy the example file
cp .env.example .env
# Edit .env and set your bucket name
# HYPER_ATHENA_BUCKET=your-actual-bucket-name
# Then use python-dotenv to load it
```
```python
from dotenv import load_dotenv
load_dotenv() # Load .env file
import hyper_python_utils as hp
df = hp.query(database="my_db", query="SELECT * FROM table")
```
## Changelog
### v0.3.2
- **Fixed**: Improved file filtering for UNLOAD to only include Parquet files (.parquet, .parquet.gz)
- **Improved**: Added debug logging to show which files are being read during UNLOAD
### v0.3.1
- **Fixed**: Removed automatic cleanup for UNLOAD files to prevent timing issues
- **Improved**: UNLOAD files now kept in S3 for reliable access
### v0.3.0
- **New**: Added `query()` and `query_unload()` wrapper functions for simplified usage
- **New**: Support for both Pandas and Polars DataFrames (Pandas is default)
- **Improved**: UNLOAD queries now use Parquet + GZIP (4x performance improvement)
- **Improved**: Empty query results return empty DataFrame instead of throwing exception
- **Improved**: Query execution time now displayed in logs
- **Improved**: Automatic removal of trailing semicolons in queries
- **Improved**: Silent cleanup (removed unnecessary log messages)
### v0.1.2
- Initial stable release
- FileHandler for S3 operations
- QueryManager for Athena queries
## License
MIT License
Raw data
{
    "_id": null,
    "home_page": null,
    "name": "hyper-python-utils",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "aws, s3, athena, polars, data, utilities",
    "author": null,
    "author_email": "jaeyoung_lim <limjyjustin@naver.com>",
    "download_url": "https://files.pythonhosted.org/packages/6d/72/10c0ad19d35f5c69f8d940519fac2dc19dc1d65d91901efb80cba2734c42/hyper_python_utils-0.4.0.tar.gz",
    "platform": null,
"description": "# Hyper Python Utils\n\n\n\n\n\n\nAWS S3 and Athena utilities for data processing with Pandas and Polars.\n\n## Installation\n\n```bash\npip install hyper-python-utils\n```\n\n## Features\n\n- **Simple Query Functions (New in v0.2.0)**: Easy-to-use wrapper functions\n - `query()`: Execute Athena queries with minimal setup\n - `query_unload()`: Execute UNLOAD query and return S3 path\n - `load_unload_data()`: Load DataFrame from UNLOAD results\n - `cleanup_unload_data()`: Clean up S3 files (optional)\n - Support for both Pandas and Polars DataFrames\n - Optimized performance with Parquet + GZIP\n\n- **FileHandler**: S3 file operations with Polars DataFrames\n - Upload/download CSV and Parquet files\n - Parallel loading of multiple files\n - Partitioned uploads by range or date\n - Support for compressed formats\n\n- **QueryManager**: Advanced Athena query execution and management\n - Execute queries with result monitoring\n - Clean up query result files\n - Error handling and timeouts\n - Full control over query execution\n\n## Quick Start\n\n### Simple Query Functions (Recommended for Most Use Cases)\n\nThe easiest way to query Athena data:\n\n```python\nimport hyper_python_utils as hp\n\n# Execute a simple query (returns pandas DataFrame by default)\ndf = hp.query(\n database=\"my_database\",\n query=\"SELECT * FROM my_table LIMIT 100\"\n)\nprint(df)\nprint(type(df)) # <class 'pandas.core.frame.DataFrame'>\n\n# Get results as polars DataFrame\ndf = hp.query(\n database=\"my_database\",\n query=\"SELECT * FROM my_table LIMIT 100\",\n option=\"polars\"\n)\nprint(type(df)) # <class 'polars.dataframe.frame.DataFrame'>\n\n# For large datasets, use UNLOAD (3-step process for better control)\n# Step 1: Execute query and get S3 path\ns3_path = hp.query_unload(\n database=\"my_database\",\n query=\"SELECT * FROM large_table WHERE date > '2024-01-01'\"\n)\n# Step 2: Load data from S3\ndf = hp.load_unload_data(s3_path, option=\"pandas\") # or option=\"polars\"\n# Step 3: Clean up (optional)\nhp.cleanup_unload_data(s3_path)\n\n# Queries with semicolons are automatically handled\ndf = hp.query(database=\"my_database\", query=\"SELECT * FROM table;\") # Works fine!\n```\n\n**Key Features:**\n- Pre-configured with optimal settings (bucket: `athena-query-results-for-hyper`)\n- Automatic cleanup of temporary files (for `query()` only)\n- No exceptions on empty results (returns empty DataFrame)\n- Query execution time displayed in logs\n- `query_unload()` uses Parquet + GZIP for 4x performance boost\n- Three-step UNLOAD process for better control: execute, load, cleanup\n\n**When to use which?**\n- `query()`: Normal queries, small to medium datasets (< 1M rows)\n- `query_unload()` + `load_unload_data()`: Large datasets (> 1M rows), when performance matters\n\n**UNLOAD Process:**\n1. `query_unload()`: Execute query and get S3 directory path\n2. `load_unload_data()`: Load DataFrame from S3 files\n3. 
`cleanup_unload_data()`: (Optional) Delete files from S3\n\n## Requirements\n\n- Python >= 3.8\n- boto3 >= 1.26.0\n- polars >= 0.18.0\n- pandas >= 1.5.0\n\n## Configuration\n\n### AWS Credentials\n\nMake sure your AWS credentials are configured either through:\n- AWS CLI (`aws configure`)\n- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)\n- IAM roles (when running on EC2)\n\nRequired permissions:\n- S3: `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, `s3:DeleteObject`\n- Athena: `athena:StartQueryExecution`, `athena:GetQueryExecution`\n\n### Required Environment Variables\n\n**IMPORTANT:** You must set the `HYPER_ATHENA_BUCKET` environment variable before using this library.\n\n```bash\n# REQUIRED: Set your S3 bucket for Athena query results\nexport HYPER_ATHENA_BUCKET=\"your-athena-results-bucket\"\n\n# OPTIONAL: Set custom query result prefix (default: \"query_results/\")\nexport HYPER_ATHENA_PREFIX=\"my-custom-prefix/\"\n\n# OPTIONAL: Set custom UNLOAD prefix (default: \"query_results_for_unload\")\nexport HYPER_UNLOAD_PREFIX=\"my-unload-prefix\"\n```\n\n**Python Example:**\n```python\nimport os\n\n# REQUIRED: Set bucket before importing the library\nos.environ[\"HYPER_ATHENA_BUCKET\"] = \"my-company-athena-results\"\n\n# OPTIONAL: Customize prefixes\nos.environ[\"HYPER_ATHENA_PREFIX\"] = \"analytics/queries/\"\nos.environ[\"HYPER_UNLOAD_PREFIX\"] = \"analytics/unload\"\n\nimport hyper_python_utils as hp\n\n# Now you can use the library\ndf = hp.query(database=\"my_db\", query=\"SELECT * FROM table\")\n```\n\n**Using .env file:**\n```bash\n# Copy the example file\ncp .env.example .env\n\n# Edit .env and set your bucket name\n# HYPER_ATHENA_BUCKET=your-actual-bucket-name\n\n# Then use python-dotenv to load it\n```\n\n```python\nfrom dotenv import load_dotenv\nload_dotenv() # Load .env file\n\nimport hyper_python_utils as hp\ndf = hp.query(database=\"my_db\", query=\"SELECT * FROM table\")\n```\n\n## Changelog\n\n### v0.3.2 (Latest)\n- **Fixed**: Improved file filtering for UNLOAD to only include Parquet files (.parquet, .parquet.gz)\n- **Improved**: Added debug logging to show which files are being read during UNLOAD\n\n### v0.3.1\n- **Fixed**: Removed automatic cleanup for UNLOAD files to prevent timing issues\n- **Improved**: UNLOAD files now kept in S3 for reliable access\n\n### v0.3.0\n- **New**: Added `query()` and `query_unload()` wrapper functions for simplified usage\n- **New**: Support for both Pandas and Polars DataFrames (Pandas is default)\n- **Improved**: UNLOAD queries now use Parquet + GZIP (4x performance improvement)\n- **Improved**: Empty query results return empty DataFrame instead of throwing exception\n- **Improved**: Query execution time now displayed in logs\n- **Improved**: Automatic removal of trailing semicolons in queries\n- **Improved**: Silent cleanup (removed unnecessary log messages)\n\n### v0.1.2\n- Initial stable release\n- FileHandler for S3 operations\n- QueryManager for Athena queries\n\n## License\n\nMIT License\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "AWS S3 and Athena utilities for data processing with Polars",
"version": "0.4.0",
"project_urls": {
"Bug Tracker": "https://github.com/NHNAD-wooyeon/hyper-python-utils/issues",
"Documentation": "https://github.com/NHNAD-wooyeon/hyper-python-utils#readme",
"Homepage": "https://github.com/NHNAD-wooyeon/hyper-python-utils",
"Repository": "https://github.com/NHNAD-wooyeon/hyper-python-utils"
},
"split_keywords": [
"aws",
" s3",
" athena",
" polars",
" data",
" utilities"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3846978e8ac3bb7286d22e0c8d72118e8d793d95432edc126f46ede3c65824ef",
"md5": "023fcfd637ce47da1146b40847f8faae",
"sha256": "c3b137ef6b64be19b60b65f07d86706bc82cc13dd54d5d14280a8db273ff8a93"
},
"downloads": -1,
"filename": "hyper_python_utils-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "023fcfd637ce47da1146b40847f8faae",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 11224,
"upload_time": "2025-10-20T23:52:26",
"upload_time_iso_8601": "2025-10-20T23:52:26.000935Z",
"url": "https://files.pythonhosted.org/packages/38/46/978e8ac3bb7286d22e0c8d72118e8d793d95432edc126f46ede3c65824ef/hyper_python_utils-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "6d7210c0ad19d35f5c69f8d940519fac2dc19dc1d65d91901efb80cba2734c42",
"md5": "89cd2fbbad8ade9c039b90207be2c36e",
"sha256": "6fc740dc65e2c8180129ae507c393f06f9204fcf6a61bd9651f6abd0a359238f"
},
"downloads": -1,
"filename": "hyper_python_utils-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "89cd2fbbad8ade9c039b90207be2c36e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 12535,
"upload_time": "2025-10-20T23:52:27",
"upload_time_iso_8601": "2025-10-20T23:52:27.225928Z",
"url": "https://files.pythonhosted.org/packages/6d/72/10c0ad19d35f5c69f8d940519fac2dc19dc1d65d91901efb80cba2734c42/hyper_python_utils-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-20 23:52:27",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NHNAD-wooyeon",
"github_project": "hyper-python-utils",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "hyper-python-utils"
}