# What is Pyllas?
Pyllas is a Python library for interacting with AWS Athena.
It is designed for data analysis in Jupyter notebooks, but can be used in any Python environment.
Features:
* Easy to use.
* Good Performance even on large datasets.
* Query result as Pandas DataFrame.
* Create materialized tables from queries and use them in subsequent queries.
* Get information about query execution progress, time and data scanned.
* Automatically cancel queries when stop execution of Jupyter notebook cell or on KeyboardInterrupt.
## Quick start
---
Pyllas can be installed using pip:
```bash
pip install pyllas
```
### Setup AWS environment
First, you need to add your AWS credentials to the file `~/.aws/credentials`.
> See [IAM credentials](https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials?section=IAM_credentials)
for more information.
>
Second, create bucket for Athena query results.
For example: `s3://wristylotus.athena/default-query-result/`
Third, create an Athena workgroup and schema for your tables.
For example, workgroup: `wristylotus`, schema: `create schema wristylotus`.
> [!IMPORTANT]
> Athena schema and workgroup must have the same name.
Finally, you need to grant the following permissions:
- `s3:*` on Athena queries result S3 path
- `athena:StartQueryExecution`
- `athena:CancelQueryExecution`
- `athena:GetQueryExecution`
- `glue:CreateTable`
Congratulations! You are ready to use Pyllas.
```python
import pyllas
athena = pyllas.Athena(
workgroup='wristylotus',
s3_output_location='s3://wristylotus.athena/default-query-result/'
)
athena.query("SELECT 'Hello Athena!' AS greeting")
```
## Documentation
---
### Initialization
An Athena client can be obtained using the `pyllas.Athena` initializer.
```python
import pyllas
athena = pyllas.Athena(
# Athena workgroup name
workgroup='wristylotus',
# S3 path for Athena query results
s3_output_location='s3://wristylotus.athena/default-query-result/',
# Number of threads to load query results
n_jobs=1,
# Prints additional information about query execution
debug=False
)
```
### Executing queries
To execute a query, use the `query` method:
```python
users_df = athena.query('SELECT id, name FROM users')
```
Pyllas also supports parameterized queries:
```python
users_df = athena.query('''
SELECT id, name
FROM users
WHERE name IN ${names} AND age > ${age}
''',
params={'names': ['Bart', 'Lisa'], 'age': 14}
)
```
For the SQL templating use `Expr` and `Table`, `Database`, `Schema` type aliases:
```python
from pyllas import sql
users_df = athena.query('SELECT id, name FROM ${schema}.${table} ${filter}',
params={
'schema': sql.Schema('main'),
'table': sql.Table('users'),
'filter': sql.Expr("WHERE name = 'Bart'")
}
)
```
> For more information, see [API](#query-execution)
Instead of getting the result as a Pandas DataFrame, you can get the result table name:
```python
users_table = athena.create_table('SELECT id, name FROM users WHERE age > 14')
# and then use it in subsequent queries
athena.query(f'SELECT * FROM {users_table}')
```
> For more information, see [API](#create-table)
For a not SELECT queries, such as `CREATE TABLE`, `DROP TABLE` etc., use:
```python
athena.execute_statement("DROP TABLE users")
```
> For more information, see [API](#execute-statement)
## API Reference
---
### Client
The Athena class is a facade to all functionality offered by the library.
```python
class Athena:
"""
Athena client.
Provides methods to execute SQL queries in AWS Athena service.
:param workgroup: name of the workgroup to execute queries
:param s3_output_location: S3 path to store query results
:param n_jobs: number of parallel jobs to read query results, default: 1.
n_jobs=1, then use only main-threaded
n_jobs=-1, then use all available CPUs
:param debug: enable logging debug level
"""
def __init__(
self,
workgroup: str,
s3_output_location: str,
n_jobs: int = 1,
debug: bool = False
):
...
```
#### Query execution
```python
def query(self, query: str | Path,
*,
params: dict = None,
date_fields: Union[tuple, list] = ('date', 'event_date', 'report_date'),
ask_status_sec: int = 5) -> pd.DataFrame:
"""
Execute query and load results as pandas DataFrame.
Parameters
----------
:param query: str or pathlib.Path
query string or sql file path to read
:param params: dict
parameters to infuse :param query, see :func: pyllas.sql.infuse
:param date_fields: tuple or list
field names to convert to pandas.datetime. Default: ('date', 'event_date', 'report_date')
:param ask_status_sec: int
interval in seconds to check query status. Default: 5
"""
```
#### Create table
```python
def create_table(self, *, query: Path | str, params: dict = None,
prefix: str = 'tmp_', name: str = None,
overwrite: bool = False, ask_status_sec: int = 5) -> str:
"""
Create a table with query results and return it name.
Parameters
----------
:param query: str or pathlib.Path
query string or sql file path to read
:param params: dict
parameters to infuse :param query, see :func: pyllas.sql.infuse
:param name: str
name for the table. Default: auto-generated random name
:param prefix: str
prefix for the auto-generated table name, used if :param name is None. Default: `tmp_`
:param overwrite: bool
overwrite table if it exists. Default: False
:param ask_status_sec: int
interval in seconds to check query status. Default: 5
"""
```
#### Execute statement
```python
def execute_statement(self, query: str | Path, *, database: str = None,
params: dict = None, batch_size: int = 1000, ask_status_sec: int = 5) -> PageIterator:
"""
For all queries except SELECT. Such as `CREATE TABLE`, `DROP TABLE` etc.
Returns PageIterator of dictionaries with query results.
Example:
>> athena.execute_statement("SHOW TABLES IN test_db", database='test_db')
{'ResultSet': {'Rows': [{'Data': [{'VarCharValue': 'test_table'}]}], 'ResultSetMetadata': {'ColumnInfo': [{'CatalogName': 'hive', ...}]}}}
Parameters
----------
:param query: str or pathlib.Path
query string or sql file path to read
:param database: str
database name
:param params: dict
parameters to infuse :param query, see :func: pyllas.sql.infuse
:param batch_size: int
batch size to read query results. Default: 1000
:param ask_status_sec: int
interval in seconds to check query status. Default: 5
"""
```
#### Cancel query
```python
def cancel_query(self, query_id: str) -> None:
"""
Cancel query.
Parameters
----------
:param query_id: str
query id to cancel
"""
```
Raw data
{
"_id": null,
"home_page": "https://github.com/wristylotus/Pyllas",
"name": "pyllas",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "pyllas aws athena sql jupyter",
"author": "wristylotus",
"author_email": "wristylotus@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/cf/ee/788c9df031f1c13d7a3505ec2c15c17971578368490f89c56dc76d002481/pyllas-0.4.0.tar.gz",
"platform": null,
"description": "# What is Pyllas?\n\nPyllas is a Python library for interacting with AWS Athena.\n\nIt is designed for data analysis in Jupyter notebooks, but can be used in any Python environment.\n\nFeatures:\n\n* Easy to use.\n* Good Performance even on large datasets.\n* Query result as Pandas DataFrame.\n* Create materialized tables from queries and use them in subsequent queries.\n* Get information about query execution progress, time and data scanned.\n* Automatically cancel queries when stop execution of Jupyter notebook cell or on KeyboardInterrupt.\n\n## Quick start\n\n---\n\nPyllas can be installed using pip:\n\n```bash\npip install pyllas\n```\n\n### Setup AWS environment\n\nFirst, you need to add your AWS credentials to the file `~/.aws/credentials`.\n\n> See [IAM credentials](https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials?section=IAM_credentials)\nfor more information.\n> \nSecond, create bucket for Athena query results.\nFor example: `s3://wristylotus.athena/default-query-result/`\n\nThird, create an Athena workgroup and schema for your tables.\nFor example, workgroup: `wristylotus`, schema: `create schema wristylotus`.\n> [!IMPORTANT]\n> Athena schema and workgroup must have the same name.\n\nFinally, you need to grant the following permissions:\n\n- `s3:*` on Athena queries result S3 path\n- `athena:StartQueryExecution`\n- `athena:CancelQueryExecution`\n- `athena:GetQueryExecution`\n- `glue:CreateTable`\n\nCongratulations! You are ready to use Pyllas.\n\n```python\nimport pyllas\n\nathena = pyllas.Athena(\n workgroup='wristylotus',\n s3_output_location='s3://wristylotus.athena/default-query-result/'\n)\n\nathena.query(\"SELECT 'Hello Athena!' AS greeting\")\n```\n\n## Documentation\n\n---\n\n### Initialization\n\nAn Athena client can be obtained using the `pyllas.Athena` initializer.\n\n```python\nimport pyllas\n\nathena = pyllas.Athena(\n # Athena workgroup name\n workgroup='wristylotus',\n # S3 path for Athena query results\n s3_output_location='s3://wristylotus.athena/default-query-result/',\n # Number of threads to load query results\n n_jobs=1,\n # Prints additional information about query execution\n debug=False\n)\n```\n\n### Executing queries\n\nTo execute a query, use the `query` method:\n\n```python\nusers_df = athena.query('SELECT id, name FROM users')\n```\n\nPyllas also supports parameterized queries:\n\n```python\nusers_df = athena.query('''\n SELECT id, name \n FROM users \n WHERE name IN ${names} AND age > ${age}\n ''', \n params={'names': ['Bart', 'Lisa'], 'age': 14}\n)\n```\n\nFor the SQL templating use `Expr` and `Table`, `Database`, `Schema` type aliases:\n\n```python\nfrom pyllas import sql\n\nusers_df = athena.query('SELECT id, name FROM ${schema}.${table} ${filter}', \n params={\n 'schema': sql.Schema('main'), \n 'table': sql.Table('users'), \n 'filter': sql.Expr(\"WHERE name = 'Bart'\")\n }\n)\n```\n> For more information, see [API](#query-execution)\n\nInstead of getting the result as a Pandas DataFrame, you can get the result table name:\n\n```python\nusers_table = athena.create_table('SELECT id, name FROM users WHERE age > 14')\n# and then use it in subsequent queries\nathena.query(f'SELECT * FROM {users_table}')\n```\n> For more information, see [API](#create-table)\n\nFor a not SELECT queries, such as `CREATE TABLE`, `DROP TABLE` etc., use:\n\n```python\nathena.execute_statement(\"DROP TABLE users\")\n```\n> For more information, see [API](#execute-statement)\n\n## API Reference\n\n---\n\n### Client\n\nThe Athena class is a facade to all functionality offered by the library.\n\n```python\nclass Athena:\n \"\"\"\n Athena client.\n\n Provides methods to execute SQL queries in AWS Athena service.\n\n :param workgroup: name of the workgroup to execute queries\n :param s3_output_location: S3 path to store query results\n :param n_jobs: number of parallel jobs to read query results, default: 1.\n n_jobs=1, then use only main-threaded\n n_jobs=-1, then use all available CPUs\n :param debug: enable logging debug level\n \"\"\"\n\n def __init__(\n self,\n workgroup: str,\n s3_output_location: str,\n n_jobs: int = 1,\n debug: bool = False\n ):\n ...\n```\n#### Query execution\n```python\n def query(self, query: str | Path,\n *,\n params: dict = None,\n date_fields: Union[tuple, list] = ('date', 'event_date', 'report_date'),\n ask_status_sec: int = 5) -> pd.DataFrame:\n \"\"\"\n Execute query and load results as pandas DataFrame.\n\n Parameters\n ----------\n :param query: str or pathlib.Path\n query string or sql file path to read\n :param params: dict\n parameters to infuse :param query, see :func: pyllas.sql.infuse\n :param date_fields: tuple or list\n field names to convert to pandas.datetime. Default: ('date', 'event_date', 'report_date')\n :param ask_status_sec: int\n interval in seconds to check query status. Default: 5\n \"\"\"\n```\n#### Create table\n```python\n def create_table(self, *, query: Path | str, params: dict = None,\n prefix: str = 'tmp_', name: str = None,\n overwrite: bool = False, ask_status_sec: int = 5) -> str:\n \"\"\"\n Create a table with query results and return it name.\n\n Parameters\n ----------\n :param query: str or pathlib.Path\n query string or sql file path to read\n :param params: dict\n parameters to infuse :param query, see :func: pyllas.sql.infuse\n :param name: str\n name for the table. Default: auto-generated random name\n :param prefix: str\n prefix for the auto-generated table name, used if :param name is None. Default: `tmp_`\n :param overwrite: bool\n overwrite table if it exists. Default: False\n :param ask_status_sec: int\n interval in seconds to check query status. Default: 5\n \"\"\"\n```\n#### Execute statement\n```python\n def execute_statement(self, query: str | Path, *, database: str = None,\n params: dict = None, batch_size: int = 1000, ask_status_sec: int = 5) -> PageIterator:\n \"\"\"\n For all queries except SELECT. Such as `CREATE TABLE`, `DROP TABLE` etc.\n Returns PageIterator of dictionaries with query results.\n Example:\n >> athena.execute_statement(\"SHOW TABLES IN test_db\", database='test_db')\n {'ResultSet': {'Rows': [{'Data': [{'VarCharValue': 'test_table'}]}], 'ResultSetMetadata': {'ColumnInfo': [{'CatalogName': 'hive', ...}]}}}\n\n Parameters\n ----------\n :param query: str or pathlib.Path\n query string or sql file path to read\n :param database: str\n database name\n :param params: dict\n parameters to infuse :param query, see :func: pyllas.sql.infuse\n :param batch_size: int\n batch size to read query results. Default: 1000\n :param ask_status_sec: int\n interval in seconds to check query status. Default: 5\n \"\"\"\n```\n#### Cancel query\n```python\n def cancel_query(self, query_id: str) -> None:\n \"\"\"\n Cancel query.\n\n Parameters\n ----------\n :param query_id: str\n query id to cancel\n \"\"\"\n```\n",
"bugtrack_url": null,
"license": "",
"summary": "Client for AWS Athena",
"version": "0.4.0",
"project_urls": {
"Documentation": "https://github.com/wristylotus/Pyllas/blob/main/README.md",
"Homepage": "https://github.com/wristylotus/Pyllas"
},
"split_keywords": [
"pyllas",
"aws",
"athena",
"sql",
"jupyter"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dd9151a0884e0a13de58f1dabb863a66478a591181f3e33c8d8d954c1ccf1b8e",
"md5": "af448deada158d1b7de094af99db3552",
"sha256": "f1a3f73524f07cf128c3249cd755f704d0ff0bf53f39358a0d6e3a4cf41c0da2"
},
"downloads": -1,
"filename": "pyllas-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "af448deada158d1b7de094af99db3552",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 15268,
"upload_time": "2023-12-24T02:47:17",
"upload_time_iso_8601": "2023-12-24T02:47:17.450627Z",
"url": "https://files.pythonhosted.org/packages/dd/91/51a0884e0a13de58f1dabb863a66478a591181f3e33c8d8d954c1ccf1b8e/pyllas-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cfee788c9df031f1c13d7a3505ec2c15c17971578368490f89c56dc76d002481",
"md5": "d2775df38c842d7273d6f1dfd0d5f744",
"sha256": "4f466ad01629f9740e58253730d185edf1b776c537dc601f4aff659ec32aa073"
},
"downloads": -1,
"filename": "pyllas-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "d2775df38c842d7273d6f1dfd0d5f744",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 15232,
"upload_time": "2023-12-24T02:47:20",
"upload_time_iso_8601": "2023-12-24T02:47:20.637455Z",
"url": "https://files.pythonhosted.org/packages/cf/ee/788c9df031f1c13d7a3505ec2c15c17971578368490f89c56dc76d002481/pyllas-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-12-24 02:47:20",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "wristylotus",
"github_project": "Pyllas",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "pyllas"
}