arrow-udf

| Field           | Value                                              |
| --------------- | -------------------------------------------------- |
| Name            | arrow-udf                                          |
| Version         | 0.2.1                                              |
| Summary         | A user-defined function framework for Apache Arrow |
| Upload time     | 2024-05-07 10:36:18                                |
| Author          | RisingWave Labs                                    |
| Requires Python | >=3.8                                              |
| License         | Apache Software License                            |
| Requirements    | No requirements were recorded.                     |

# Arrow UDF Python Server

## Installation

```sh
pip install arrow-udf
```

## Usage

Define functions in a Python file:

```python
# udf.py
from arrow_udf import udf, udtf, UdfServer
import struct
import socket

# Define a scalar function
@udf(input_types=['INT', 'INT'], result_type='INT')
def gcd(x, y):
    while y != 0:
        (x, y) = (y, x % y)
    return x

# Define a scalar function that returns multiple values (within a struct)
@udf(input_types=['BINARY'], result_type='STRUCT<src_addr: STRING, dst_addr: STRING, src_port: INT16, dst_port: INT16>')
def extract_tcp_info(tcp_packet: bytes):
    src_addr, dst_addr = struct.unpack('!4s4s', tcp_packet[12:20])
    src_port, dst_port = struct.unpack('!HH', tcp_packet[20:24])
    src_addr = socket.inet_ntoa(src_addr)
    dst_addr = socket.inet_ntoa(dst_addr)
    return {
        'src_addr': src_addr,
        'dst_addr': dst_addr,
        'src_port': src_port,
        'dst_port': dst_port,
    }

# Define a table function
@udtf(input_types='INT', result_types='INT')
def series(n):
    for i in range(n):
        yield i

# Start a UDF server
if __name__ == '__main__':
    server = UdfServer(location="0.0.0.0:8815")
    server.add_function(gcd)
    server.add_function(extract_tcp_info)
    server.add_function(series)
    server.serve()
```
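The function bodies above can be sanity-checked locally before any server is involved. The sketch below redefines them without the decorators (it makes no assumption about whether decorated objects stay directly callable) and uses plain lists to mimic rows. The sample values and the synthetic packet are made up for illustration, and the packet layout assumes a 20-byte IPv4 header with no options, which is what the `[12:20]` and `[20:24]` slices imply.

```python
import socket
import struct

# Undecorated copies of the function bodies from udf.py.
def gcd(x, y):
    while y != 0:
        (x, y) = (y, x % y)
    return x

def extract_tcp_info(tcp_packet: bytes):
    # Bytes 12..19: IPv4 source and destination addresses.
    src_addr, dst_addr = struct.unpack('!4s4s', tcp_packet[12:20])
    # Bytes 20..23: TCP source and destination ports (unsigned, big-endian).
    src_port, dst_port = struct.unpack('!HH', tcp_packet[20:24])
    return {
        'src_addr': socket.inet_ntoa(src_addr),
        'dst_addr': socket.inet_ntoa(dst_addr),
        'src_port': src_port,
        'dst_port': dst_port,
    }

def series(n):
    for i in range(n):
        yield i

# Scalar function: exactly one output value per input row.
xs, ys = [12, 18, 7], [8, 27, 0]
gcd_rows = [gcd(x, y) for x, y in zip(xs, ys)]
print(gcd_rows)  # [4, 9, 7]

# Table function: zero or more output rows per input row.
# Each pair is (input row index, generated value).
series_rows = [(row, value)
               for row, n in enumerate([2, 0, 3])
               for value in series(n)]
print(series_rows)  # [(0, 0), (0, 1), (2, 0), (2, 1), (2, 2)]

# Synthetic packet: 12 zeroed header bytes, two addresses, two ports.
packet = (bytes(12)
          + socket.inet_aton('192.168.0.1')
          + socket.inet_aton('10.0.0.2')
          + struct.pack('!HH', 443, 8080))
tcp_info = extract_tcp_info(packet)
print(tcp_info)
```

One detail worth keeping in mind: `'!HH'` unpacks unsigned 16-bit values (up to 65535), while a signed 16-bit integer tops out at 32767. How the framework treats a port above that limit when the result field is declared as `INT16` is not stated here.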

Start the UDF server:

```sh
python3 udf.py
```
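When the server is launched from a script or a test, it can be useful to wait until the port accepts connections before sending requests. The following is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical helper, not part of `arrow_udf`, and it only checks that something is listening on the TCP port, not that the UDF service itself is healthy.

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds,
    or False if none succeeds before the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; retry shortly.
            time.sleep(0.2)
    return False

# Demo against a throwaway local listener, so the snippet runs
# without a UDF server being present.
listener = socket.socket()
listener.bind(('127.0.0.1', 0))   # port 0: let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
ready = wait_for_port('127.0.0.1', demo_port, timeout=2.0)
listener.close()
print(ready)
```

For the server in `udf.py`, the equivalent call would be `wait_for_port('127.0.0.1', 8815)`, since `0.0.0.0:8815` binds port 8815 on all local interfaces.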

## Data Types

| Arrow Type           | Python Type                    |
| -------------------- | ------------------------------ |
| `boolean`            | `bool`                         |
| `int8`               | `int`                          |
| `int16`              | `int`                          |
| `int32`              | `int`                          |
| `int64`              | `int`                          |
| `uint8`              | `int`                          |
| `uint16`             | `int`                          |
| `uint32`             | `int`                          |
| `uint64`             | `int`                          |
| `float32`            | `float`                        |
| `float64`            | `float`                        |
| `date32`             | `datetime.date`                |
| `time64`             | `datetime.time`                |
| `timestamp`          | `datetime.datetime`            |
| `interval`           | `MonthDayNano` / `(int, int, int)` (fields can be obtained by `months()`, `days()` and `nanoseconds()` from `MonthDayNano`) |
| `string`             | `str`                          |
| `binary`             | `bytes`                        |
| `large_string`       | `str`                          |
| `large_binary`       | `bytes`                        |

Extension types:

| Data type   | Metadata            | Python Type           |
| ----------- | ------------------- | --------------------- |
| `decimal`   | `arrowudf.decimal`  | `decimal.Decimal`     |
| `json`      | `arrowudf.json`     | `any`                 |

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "arrow-udf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "RisingWave Labs",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/05/f7/a4aa3ac575e229937dec67f28631fea455d1b467b7679ddc783e0aba34f0/arrow_udf-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache Software License",
    "summary": "A user-defined function framework for Apache Arrow",
    "version": "0.2.1",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1ea3e61a63bd3032b55ba80b2cbfdc179ee033ac11bf2cdc947ec491d6729cfc",
                "md5": "afed9fe69c42079eefa5eb851b54c8e2",
                "sha256": "a31a975ce97698152012ac2ee073cf77c55bbd513a7a41d5c4325a9565753703"
            },
            "downloads": -1,
            "filename": "arrow_udf-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "afed9fe69c42079eefa5eb851b54c8e2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 10620,
            "upload_time": "2024-05-07T10:36:17",
            "upload_time_iso_8601": "2024-05-07T10:36:17.037665Z",
            "url": "https://files.pythonhosted.org/packages/1e/a3/e61a63bd3032b55ba80b2cbfdc179ee033ac11bf2cdc947ec491d6729cfc/arrow_udf-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "05f7a4aa3ac575e229937dec67f28631fea455d1b467b7679ddc783e0aba34f0",
                "md5": "1ecb851be350c470fe7220840c68e8d2",
                "sha256": "3892aa478b5e81383511d1f70a57ae2eaccfb0dbb2a6a55cc86281c0bef4fcd6"
            },
            "downloads": -1,
            "filename": "arrow_udf-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "1ecb851be350c470fe7220840c68e8d2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 9973,
            "upload_time": "2024-05-07T10:36:18",
            "upload_time_iso_8601": "2024-05-07T10:36:18.326392Z",
            "url": "https://files.pythonhosted.org/packages/05/f7/a4aa3ac575e229937dec67f28631fea455d1b467b7679ddc783e0aba34f0/arrow_udf-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-07 10:36:18",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "arrow-udf"
}