pyllas


Namepyllas JSON
Version 0.4.0 PyPI version JSON
download
home_pagehttps://github.com/wristylotus/Pyllas
SummaryClient for AWS Athena
upload_time2023-12-24 02:47:20
maintainer
docs_urlNone
authorwristylotus
requires_python>=3.8
license
keywords pyllas aws athena sql jupyter
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # What is Pyllas?

Pyllas is a Python library for interacting with AWS Athena.

It is designed for data analysis in Jupyter notebooks, but can be used in any Python environment.

Features:

* Easy to use.
* Good Performance even on large datasets.
* Query result as Pandas DataFrame.
* Create materialized tables from queries and use them in subsequent queries.
* Get information about query execution progress, time and data scanned.
* Automatically cancel queries when stop execution of Jupyter notebook cell or on KeyboardInterrupt.

## Quick start

---

Pyllas can be installed using pip:

```bash
pip install pyllas
```

### Setup AWS environment

First, you need to add your AWS credentials to the file `~/.aws/credentials`.

> See [IAM credentials](https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials?section=IAM_credentials)
for more information.
> 
Second, create bucket for Athena query results.
For example: `s3://wristylotus.athena/default-query-result/`

Third, create an Athena workgroup and schema for your tables.
For example, workgroup: `wristylotus`, schema: `create schema wristylotus`.
> [!IMPORTANT]
> Athena schema and workgroup must have the same name.

Finally, you need to grant the following permissions:

- `s3:*` on Athena queries result S3 path
- `athena:StartQueryExecution`
- `athena:CancelQueryExecution`
- `athena:GetQueryExecution`
- `glue:CreateTable`

Congratulations! You are ready to use Pyllas.

```python
import pyllas

athena = pyllas.Athena(
    workgroup='wristylotus',
    s3_output_location='s3://wristylotus.athena/default-query-result/'
)

athena.query("SELECT 'Hello Athena!' AS greeting")
```

## Documentation

---

### Initialization

An Athena client can be obtained using the `pyllas.Athena` initializer.

```python
import pyllas

athena = pyllas.Athena(
    # Athena workgroup name
    workgroup='wristylotus',
    # S3 path for Athena query results
    s3_output_location='s3://wristylotus.athena/default-query-result/',
    # Number of threads to load query results
    n_jobs=1,
    # Prints additional information about query execution
    debug=False
)
```

### Executing queries

To execute a query, use the `query` method:

```python
users_df = athena.query('SELECT id, name FROM users')
```

Pyllas also supports parameterized queries:

```python
users_df = athena.query('''
  SELECT id, name 
  FROM users 
  WHERE name IN ${names} AND age > ${age}
  ''', 
  params={'names': ['Bart', 'Lisa'], 'age': 14}
)
```

For the SQL templating use `Expr` and `Table`, `Database`, `Schema` type aliases:

```python
from pyllas import sql

users_df = athena.query('SELECT id, name FROM ${schema}.${table} ${filter}', 
  params={
      'schema': sql.Schema('main'), 
      'table': sql.Table('users'), 
      'filter': sql.Expr("WHERE name = 'Bart'")
  }
)
```
> For more information, see [API](#query-execution)

Instead of getting the result as a Pandas DataFrame, you can get the result table name:

```python
users_table = athena.create_table('SELECT id, name FROM users WHERE age > 14')
# and then use it in subsequent queries
athena.query(f'SELECT * FROM {users_table}')
```
> For more information, see [API](#create-table)

For a not SELECT queries, such as `CREATE TABLE`, `DROP TABLE` etc., use:

```python
athena.execute_statement("DROP TABLE users")
```
> For more information, see [API](#execute-statement)

## API Reference

---

### Client

The Athena class is a facade to all functionality offered by the library.

```python
class Athena:
    """
        Athena client.

        Provides methods to execute SQL queries in AWS Athena service.

        :param workgroup: name of the workgroup to execute queries
        :param s3_output_location: S3 path to store query results
        :param n_jobs: number of parallel jobs to read query results, default: 1.
               n_jobs=1, then use only main-threaded
               n_jobs=-1, then use all available CPUs
        :param debug: enable logging debug level
    """

    def __init__(
            self,
            workgroup: str,
            s3_output_location: str,
            n_jobs: int = 1,
            debug: bool = False
    ):
        ...
```
#### Query execution
```python
    def query(self, query: str | Path,
              *,
              params: dict = None,
              date_fields: Union[tuple, list] = ('date', 'event_date', 'report_date'),
              ask_status_sec: int = 5) -> pd.DataFrame:
        """
        Execute query and load results as pandas DataFrame.

        Parameters
        ----------
        :param query: str or pathlib.Path
            query string or sql file path to read
        :param params: dict
            parameters to infuse :param query, see :func: pyllas.sql.infuse
        :param date_fields: tuple or list
               field names to convert to pandas.datetime. Default: ('date', 'event_date', 'report_date')
        :param ask_status_sec: int
               interval in seconds to check query status. Default: 5
        """
```
#### Create table
```python
    def create_table(self, *, query: Path | str, params: dict = None,
                     prefix: str = 'tmp_', name: str = None,
                     overwrite: bool = False, ask_status_sec: int = 5) -> str:
        """
        Create a table with query results and return it name.

        Parameters
        ----------
        :param query: str or pathlib.Path
            query string or sql file path to read
        :param params: dict
            parameters to infuse :param query, see :func: pyllas.sql.infuse
        :param name: str
            name for the table. Default: auto-generated random name
        :param prefix: str
            prefix for the auto-generated table name, used if :param name is None. Default: `tmp_`
        :param overwrite: bool
            overwrite table if it exists. Default: False
        :param ask_status_sec: int
            interval in seconds to check query status. Default: 5
        """
```
#### Execute statement
```python
    def execute_statement(self, query: str | Path, *, database: str = None,
                          params: dict = None, batch_size: int = 1000, ask_status_sec: int = 5) -> PageIterator:
        """
        For all queries except SELECT. Such as `CREATE TABLE`, `DROP TABLE` etc.
        Returns PageIterator of dictionaries with query results.
        Example:
        >> athena.execute_statement("SHOW TABLES IN test_db", database='test_db')
        {'ResultSet': {'Rows': [{'Data': [{'VarCharValue': 'test_table'}]}], 'ResultSetMetadata': {'ColumnInfo': [{'CatalogName': 'hive', ...}]}}}

        Parameters
        ----------
        :param query: str or pathlib.Path
            query string or sql file path to read
        :param database: str
            database name
        :param params: dict
            parameters to infuse :param query, see :func: pyllas.sql.infuse
        :param batch_size: int
            batch size to read query results. Default: 1000
        :param ask_status_sec: int
            interval in seconds to check query status. Default: 5
        """
```
#### Cancel query
```python
    def cancel_query(self, query_id: str) -> None:
        """
        Cancel query.

        Parameters
        ----------
        :param query_id: str
            query id to cancel
        """
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/wristylotus/Pyllas",
    "name": "pyllas",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "pyllas aws athena sql jupyter",
    "author": "wristylotus",
    "author_email": "wristylotus@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/cf/ee/788c9df031f1c13d7a3505ec2c15c17971578368490f89c56dc76d002481/pyllas-0.4.0.tar.gz",
    "platform": null,
    "description": "# What is Pyllas?\n\nPyllas is a Python library for interacting with AWS Athena.\n\nIt is designed for data analysis in Jupyter notebooks, but can be used in any Python environment.\n\nFeatures:\n\n* Easy to use.\n* Good Performance even on large datasets.\n* Query result as Pandas DataFrame.\n* Create materialized tables from queries and use them in subsequent queries.\n* Get information about query execution progress, time and data scanned.\n* Automatically cancel queries when stop execution of Jupyter notebook cell or on KeyboardInterrupt.\n\n## Quick start\n\n---\n\nPyllas can be installed using pip:\n\n```bash\npip install pyllas\n```\n\n### Setup AWS environment\n\nFirst, you need to add your AWS credentials to the file `~/.aws/credentials`.\n\n> See [IAM credentials](https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials?section=IAM_credentials)\nfor more information.\n> \nSecond, create bucket for Athena query results.\nFor example: `s3://wristylotus.athena/default-query-result/`\n\nThird, create an Athena workgroup and schema for your tables.\nFor example, workgroup: `wristylotus`, schema: `create schema wristylotus`.\n> [!IMPORTANT]\n> Athena schema and workgroup must have the same name.\n\nFinally, you need to grant the following permissions:\n\n- `s3:*` on Athena queries result S3 path\n- `athena:StartQueryExecution`\n- `athena:CancelQueryExecution`\n- `athena:GetQueryExecution`\n- `glue:CreateTable`\n\nCongratulations! You are ready to use Pyllas.\n\n```python\nimport pyllas\n\nathena = pyllas.Athena(\n    workgroup='wristylotus',\n    s3_output_location='s3://wristylotus.athena/default-query-result/'\n)\n\nathena.query(\"SELECT 'Hello Athena!' AS greeting\")\n```\n\n## Documentation\n\n---\n\n### Initialization\n\nAn Athena client can be obtained using the `pyllas.Athena` initializer.\n\n```python\nimport pyllas\n\nathena = pyllas.Athena(\n    # Athena workgroup name\n    workgroup='wristylotus',\n    # S3 path for Athena query results\n    s3_output_location='s3://wristylotus.athena/default-query-result/',\n    # Number of threads to load query results\n    n_jobs=1,\n    # Prints additional information about query execution\n    debug=False\n)\n```\n\n### Executing queries\n\nTo execute a query, use the `query` method:\n\n```python\nusers_df = athena.query('SELECT id, name FROM users')\n```\n\nPyllas also supports parameterized queries:\n\n```python\nusers_df = athena.query('''\n  SELECT id, name \n  FROM users \n  WHERE name IN ${names} AND age > ${age}\n  ''', \n  params={'names': ['Bart', 'Lisa'], 'age': 14}\n)\n```\n\nFor the SQL templating use `Expr` and `Table`, `Database`, `Schema` type aliases:\n\n```python\nfrom pyllas import sql\n\nusers_df = athena.query('SELECT id, name FROM ${schema}.${table} ${filter}', \n  params={\n      'schema': sql.Schema('main'), \n      'table': sql.Table('users'), \n      'filter': sql.Expr(\"WHERE name = 'Bart'\")\n  }\n)\n```\n> For more information, see [API](#query-execution)\n\nInstead of getting the result as a Pandas DataFrame, you can get the result table name:\n\n```python\nusers_table = athena.create_table('SELECT id, name FROM users WHERE age > 14')\n# and then use it in subsequent queries\nathena.query(f'SELECT * FROM {users_table}')\n```\n> For more information, see [API](#create-table)\n\nFor a not SELECT queries, such as `CREATE TABLE`, `DROP TABLE` etc., use:\n\n```python\nathena.execute_statement(\"DROP TABLE users\")\n```\n> For more information, see [API](#execute-statement)\n\n## API Reference\n\n---\n\n### Client\n\nThe Athena class is a facade to all functionality offered by the library.\n\n```python\nclass Athena:\n    \"\"\"\n        Athena client.\n\n        Provides methods to execute SQL queries in AWS Athena service.\n\n        :param workgroup: name of the workgroup to execute queries\n        :param s3_output_location: S3 path to store query results\n        :param n_jobs: number of parallel jobs to read query results, default: 1.\n               n_jobs=1, then use only main-threaded\n               n_jobs=-1, then use all available CPUs\n        :param debug: enable logging debug level\n    \"\"\"\n\n    def __init__(\n            self,\n            workgroup: str,\n            s3_output_location: str,\n            n_jobs: int = 1,\n            debug: bool = False\n    ):\n        ...\n```\n#### Query execution\n```python\n    def query(self, query: str | Path,\n              *,\n              params: dict = None,\n              date_fields: Union[tuple, list] = ('date', 'event_date', 'report_date'),\n              ask_status_sec: int = 5) -> pd.DataFrame:\n        \"\"\"\n        Execute query and load results as pandas DataFrame.\n\n        Parameters\n        ----------\n        :param query: str or pathlib.Path\n            query string or sql file path to read\n        :param params: dict\n            parameters to infuse :param query, see :func: pyllas.sql.infuse\n        :param date_fields: tuple or list\n               field names to convert to pandas.datetime. Default: ('date', 'event_date', 'report_date')\n        :param ask_status_sec: int\n               interval in seconds to check query status. Default: 5\n        \"\"\"\n```\n#### Create table\n```python\n    def create_table(self, *, query: Path | str, params: dict = None,\n                     prefix: str = 'tmp_', name: str = None,\n                     overwrite: bool = False, ask_status_sec: int = 5) -> str:\n        \"\"\"\n        Create a table with query results and return it name.\n\n        Parameters\n        ----------\n        :param query: str or pathlib.Path\n            query string or sql file path to read\n        :param params: dict\n            parameters to infuse :param query, see :func: pyllas.sql.infuse\n        :param name: str\n            name for the table. Default: auto-generated random name\n        :param prefix: str\n            prefix for the auto-generated table name, used if :param name is None. Default: `tmp_`\n        :param overwrite: bool\n            overwrite table if it exists. Default: False\n        :param ask_status_sec: int\n            interval in seconds to check query status. Default: 5\n        \"\"\"\n```\n#### Execute statement\n```python\n    def execute_statement(self, query: str | Path, *, database: str = None,\n                          params: dict = None, batch_size: int = 1000, ask_status_sec: int = 5) -> PageIterator:\n        \"\"\"\n        For all queries except SELECT. Such as `CREATE TABLE`, `DROP TABLE` etc.\n        Returns PageIterator of dictionaries with query results.\n        Example:\n        >> athena.execute_statement(\"SHOW TABLES IN test_db\", database='test_db')\n        {'ResultSet': {'Rows': [{'Data': [{'VarCharValue': 'test_table'}]}], 'ResultSetMetadata': {'ColumnInfo': [{'CatalogName': 'hive', ...}]}}}\n\n        Parameters\n        ----------\n        :param query: str or pathlib.Path\n            query string or sql file path to read\n        :param database: str\n            database name\n        :param params: dict\n            parameters to infuse :param query, see :func: pyllas.sql.infuse\n        :param batch_size: int\n            batch size to read query results. Default: 1000\n        :param ask_status_sec: int\n            interval in seconds to check query status. Default: 5\n        \"\"\"\n```\n#### Cancel query\n```python\n    def cancel_query(self, query_id: str) -> None:\n        \"\"\"\n        Cancel query.\n\n        Parameters\n        ----------\n        :param query_id: str\n            query id to cancel\n        \"\"\"\n```\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Client for AWS Athena",
    "version": "0.4.0",
    "project_urls": {
        "Documentation": "https://github.com/wristylotus/Pyllas/blob/main/README.md",
        "Homepage": "https://github.com/wristylotus/Pyllas"
    },
    "split_keywords": [
        "pyllas",
        "aws",
        "athena",
        "sql",
        "jupyter"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dd9151a0884e0a13de58f1dabb863a66478a591181f3e33c8d8d954c1ccf1b8e",
                "md5": "af448deada158d1b7de094af99db3552",
                "sha256": "f1a3f73524f07cf128c3249cd755f704d0ff0bf53f39358a0d6e3a4cf41c0da2"
            },
            "downloads": -1,
            "filename": "pyllas-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "af448deada158d1b7de094af99db3552",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 15268,
            "upload_time": "2023-12-24T02:47:17",
            "upload_time_iso_8601": "2023-12-24T02:47:17.450627Z",
            "url": "https://files.pythonhosted.org/packages/dd/91/51a0884e0a13de58f1dabb863a66478a591181f3e33c8d8d954c1ccf1b8e/pyllas-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cfee788c9df031f1c13d7a3505ec2c15c17971578368490f89c56dc76d002481",
                "md5": "d2775df38c842d7273d6f1dfd0d5f744",
                "sha256": "4f466ad01629f9740e58253730d185edf1b776c537dc601f4aff659ec32aa073"
            },
            "downloads": -1,
            "filename": "pyllas-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d2775df38c842d7273d6f1dfd0d5f744",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 15232,
            "upload_time": "2023-12-24T02:47:20",
            "upload_time_iso_8601": "2023-12-24T02:47:20.637455Z",
            "url": "https://files.pythonhosted.org/packages/cf/ee/788c9df031f1c13d7a3505ec2c15c17971578368490f89c56dc76d002481/pyllas-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-24 02:47:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wristylotus",
    "github_project": "Pyllas",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "pyllas"
}
        
Elapsed time: 0.15531s