# athena-python-udf
<!-- markdownlint-disable -->
[![PyPI](https://img.shields.io/pypi/v/athena-python-udf.svg)](https://pypi.org/project/athena-python-udf/)
[![Changelog](https://img.shields.io/github/v/release/dbt-athena/athena-python-udf?include_prereleases&label=changelog)](https://github.com/dbt-athena/athena-python-udf/releases)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/dbt-athena/athena-python-udf/blob/main/LICENSE)
<!-- markdownlint-restore -->
Athena User Defined Functions (UDFs) in Python made easy!
This library implements the Athena UDF protocol in Python,
so you don't have to use Java, and you can use any Python library you wish, including numpy/pandas!
## Installation
Install this library using `pip`:
```bash
pip install athena-python-udf
```
## Usage
- Install the package
- Create a Lambda handler Python file containing a subclass of `BaseAthenaUDF`
- Implement the `handle_athena_record` static method with your required functionality like this:
```python
from typing import Any

from athena_udf import BaseAthenaUDF
from pyarrow import Schema


class SimpleVarcharUDF(BaseAthenaUDF):
    @staticmethod
    def handle_athena_record(input_schema: Schema, output_schema: Schema, arguments: list[Any]):
        # The first (and only) argument is the varchar passed from SQL.
        varchar = arguments[0]
        return varchar.lower()


lambda_handler = SimpleVarcharUDF(use_threads=False).lambda_handler
```
This very basic example takes a `varchar` input and returns the lowercase version.
- A `varchar` is converted to a Python string on the way in and back to a `varchar` on the way out.
- `input_schema` contains a `PyArrow` schema representing the schema of the data being passed in.
- `output_schema` contains a `PyArrow` schema representing the schema of what Athena expects to be returned.
- `arguments` contains a list of the arguments given to the function. There can be more than one, with different types.
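As a minimal sketch of a handler that takes more than one argument, the function below uses the same signature as `handle_athena_record` but is written as a plain function so it runs without the library or AWS. The name `repeat_varchar` and the `None` check are assumptions for illustration; the schemas are unused here and passed as `None`.

```python
from typing import Any, Optional


def repeat_varchar(input_schema: Any, output_schema: Any, arguments: list[Any]) -> Optional[str]:
    # Hypothetical UDF body: arguments arrive in the order declared in SQL,
    # e.g. my_udf(col1 varchar, col2 integer).
    text, times = arguments
    # Assumption: a SQL NULL may arrive as None, so guard before using it.
    if text is None or times is None:
        return None
    return text * times


print(repeat_varchar(None, None, ["ab", 3]))
```

In a real Lambda the same body would go inside the static method of a `BaseAthenaUDF` subclass, as in the example above.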
You can also tune multithreading (enabled by default; the example above disables it with `use_threads=False`) using the following parameters:
- `chunk_size` - forces the received record batch to be split into chunks of the given size,
which are then processed consecutively.
This may be useful if your Lambda calls rate-limited external APIs.
- `max_workers` - the standard `ThreadPoolExecutor` parameter. Leave it unset to keep the default behavior.
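The interaction between the two parameters can be pictured with the standalone sketch below. It is not the library's implementation, only an illustration of the idea under stated assumptions: the helper name `process_in_chunks` is hypothetical, chunks run one after another, and the rows inside each chunk are handed to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional


def process_in_chunks(
    rows: list[Any],
    handler: Callable[[Any], Any],
    chunk_size: Optional[int] = None,
    max_workers: Optional[int] = None,
) -> list[Any]:
    # Without a chunk size, treat the whole batch as a single chunk.
    size = chunk_size or len(rows) or 1
    results: list[Any] = []
    # Chunks are processed one after another; rows inside a chunk run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(rows), size):
            chunk = rows[start:start + size]
            # map() preserves input order, so results line up with the rows.
            results.extend(pool.map(handler, chunk))
    return results


print(process_in_chunks(["A", "B", "C"], str.lower, chunk_size=2, max_workers=2))
```

With a rate-limited API, a small `chunk_size` caps how many calls are in flight at once, while `max_workers` caps the number of threads used for each chunk.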
If you package the above into a zip with its dependencies and name your Lambda function `athena-test`,
you can then run it from the Athena console like so:
```sql
USING EXTERNAL FUNCTION my_udf(col1 varchar) RETURNS varchar LAMBDA 'athena-test'
SELECT my_udf('FooBar');
```
This yields the result `foobar`.
See other examples in the [examples](examples) folder of this repo.
## Important information before using
Each Lambda instance will handle multiple requests for the same query.
Each request can contain multiple rows; this library
handles that for you, and your implementation receives a single row at a time.
Athena groups your data into chunks of around 1 MB, one chunk per request.
The maximum your function can return is 6 MB per chunk.
This library uses `PyArrow`. This is a large library, so the Lambdas will be around 50MB zipped.
Timestamps seem to be truncated into Python `date` objects missing the time.
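One possible workaround, sketched here as an assumption rather than a documented feature, is to pass the timestamp as text (for example `CAST(ts AS varchar)` in the query) and parse it inside the function. The function name `hour_of_timestamp` and the input format `'2024-05-30 10:04:15.259'` are illustrative; check what your Athena engine version actually produces.

```python
from datetime import datetime
from typing import Any


def hour_of_timestamp(input_schema: Any, output_schema: Any, arguments: list[Any]) -> int:
    # Assumption: the query passes the timestamp as a varchar such as
    # '2024-05-30 10:04:15.259', so the time of day is preserved.
    raw = arguments[0]
    parsed = datetime.fromisoformat(raw)
    return parsed.hour


print(hour_of_timestamp(None, None, ["2024-05-30 10:04:15.259"]))
```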
Functions can return only one value.
To return more complex data structures, consider returning a JSON payload and parsing it in Athena.
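A minimal sketch of the JSON approach follows, again as a plain function with the `handle_athena_record` signature so it runs on its own. The function name `name_parts` and the `first`/`last` keys are made up for illustration.

```python
import json
from typing import Any


def name_parts(input_schema: Any, output_schema: Any, arguments: list[Any]) -> str:
    # Split "First Last" into two fields and return them as one JSON string,
    # since the UDF can only return a single value.
    first, _, last = arguments[0].partition(" ")
    return json.dumps({"first": first, "last": last})


print(name_parts(None, None, ["Ada Lovelace"]))
```

On the Athena side, the returned `varchar` could then be unpacked with a JSON function such as `json_extract_scalar(my_udf(col), '$.first')`.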
## Development
To contribute to this library, first check out the code.
Then create a new virtual environment with all required dependencies and activate it:
```bash
poetry install
source .venv/bin/activate
```
To run the tests:
```bash
pytest
```
Raw data
{
"_id": null,
"home_page": "https://github.com/dbt-athena/athena-python-udf",
"name": "athena-python-udf",
"maintainer": "Serhii Dimchenko",
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": "svdimchenko@gmail.com",
"keywords": "aws, athena, python, udf, lambda",
"author": "David Markey",
"author_email": "david@dmarkey.com",
"download_url": "https://files.pythonhosted.org/packages/71/a6/f194fa0775ee21251ec5dfb0f39104ce6db7d043d6b17e20ff462642afbb/athena_python_udf-0.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Athena User Defined Functions(UDFs) in Python made easy!",
"version": "0.2.2",
"project_urls": {
"Homepage": "https://github.com/dbt-athena/athena-python-udf",
"Repository": "https://github.com/dbt-athena/athena-python-udf"
},
"split_keywords": [
"aws",
" athena",
" python",
" udf",
" lambda"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c5335d5e6e04c59480321f2ec166e109e93a158c9485e7cc36c89e22f4d3c81a",
"md5": "ab11132aa55cc8a5a87163d86eba14cf",
"sha256": "7c24130cc55511d3739f2aa98bfd8401f2044efcf41fe850cecd69ffe01bcd8b"
},
"downloads": -1,
"filename": "athena_python_udf-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ab11132aa55cc8a5a87163d86eba14cf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 8839,
"upload_time": "2024-05-30T10:04:15",
"upload_time_iso_8601": "2024-05-30T10:04:15.259901Z",
"url": "https://files.pythonhosted.org/packages/c5/33/5d5e6e04c59480321f2ec166e109e93a158c9485e7cc36c89e22f4d3c81a/athena_python_udf-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "71a6f194fa0775ee21251ec5dfb0f39104ce6db7d043d6b17e20ff462642afbb",
"md5": "9ce89729d03ea6800e726c2b6926bcf0",
"sha256": "d7baedbcd18806e576032eac032bfeda77a6a60121e07de298744b10be90ebdf"
},
"downloads": -1,
"filename": "athena_python_udf-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "9ce89729d03ea6800e726c2b6926bcf0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 8128,
"upload_time": "2024-05-30T10:04:16",
"upload_time_iso_8601": "2024-05-30T10:04:16.881895Z",
"url": "https://files.pythonhosted.org/packages/71/a6/f194fa0775ee21251ec5dfb0f39104ce6db7d043d6b17e20ff462642afbb/athena_python_udf-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-30 10:04:16",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dbt-athena",
"github_project": "athena-python-udf",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "athena-python-udf"
}