# JSON Schema to AWS Glue schema converter
## Installation
```bash
pip install pydantic-glue
```
## What?
Converts `pydantic` models to `JSON Schema` and then to an `AWS Glue` schema,
so in theory anything that can be converted to JSON Schema *could* also work.
## Why?
When using `AWS Kinesis Firehose` in a configuration that receives JSON records and writes `parquet` files to S3,
you need to define an `AWS Glue` table so that Firehose knows which schema to use when creating the parquet files.
AWS Glue lets you define a schema using `Avro` or `JSON Schema` and then create a table from that schema,
but as of *May 2022*,
tables created that way cannot be used with Kinesis Firehose.
<https://stackoverflow.com/questions/68125501/invalid-schema-error-in-aws-glue-created-via-terraform>
This is also confirmed by AWS support.
What one could do is create a table and set the columns manually,
but this means you now have two sources of truth to maintain.
This tool allows you to define a table in `pydantic`
and generate a JSON with column types that can be used with `terraform` to create a Glue table.
## Example
Take the following pydantic classes:
```python title="example.py"
from pydantic import BaseModel
from typing import List


class Bar(BaseModel):
    name: str
    age: int


class Foo(BaseModel):
    nums: List[int]
    bars: List[Bar]
    other: str
```
Running `pydantic-glue`
```bash
pydantic-glue -f example.py -c Foo
```
you get this JSON in the terminal:
```json
{
    "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
    "columns": {
        "nums": "array<int>",
        "bars": "array<struct<name:string,age:int>>",
        "other": "string"
    }
}
```
which can be used in Terraform like this:
```terraform
locals {
  columns = jsondecode(file("${path.module}/glue_schema.json")).columns
}

resource "aws_glue_catalog_table" "table" {
  name          = "table_name"
  database_name = "db_name"

  storage_descriptor {
    dynamic "columns" {
      for_each = local.columns

      content {
        name = columns.key
        type = columns.value
      }
    }
  }
}
```
Alternatively, you can run the CLI with the `-o` flag to set the output file location:
```bash
pydantic-glue -f example.py -c Foo -o example.json -l
```
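If you want to sanity-check the generated file before handing it to Terraform, something like the following may help. It is only a sketch: it writes a stand-in file with the same content as the example output shown earlier (rather than calling the CLI), then reads the `columns` key the same way the Terraform snippet does. The temporary directory and variable names are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the file written by `pydantic-glue ... -o example.json`;
# the content mirrors the example output shown earlier in this README.
generated = {
    "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
    "columns": {
        "nums": "array<int>",
        "bars": "array<struct<name:string,age:int>>",
        "other": "string",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.json"
    path.write_text(json.dumps(generated))

    # Same lookup as jsondecode(file(...)).columns in the Terraform snippet;
    # the "//" comment key is simply ignored.
    columns = json.loads(path.read_text())["columns"]

for name, glue_type in columns.items():
    print(f"{name}: {glue_type}")
```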
## Override the type for the AWS Glue Schema
Wherever there is a `type` key in the input JSON Schema, an additional key `glue_type` may be
defined to override the type that is used in the AWS Glue Schema. This is, for example, useful for
a pydantic model that has a field of type `int` that is unix epoch time, while the column type you
would like in Glue is of type `timestamp`.
Additional JSON Schema keys can be added to a pydantic model by using the
[`Field` function](https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field)
with the argument `json_schema_extra`, like so:
```python
from pydantic import BaseModel, Field


class A(BaseModel):
    epoch_time: int = Field(
        ...,
        json_schema_extra={
            "glue_type": "timestamp",
        },
    )
```
The resulting JSON Schema will be:
```json
{
    "properties": {
        "epoch_time": {
            "glue_type": "timestamp",
            "title": "Epoch Time",
            "type": "integer"
        }
    },
    "required": [
        "epoch_time"
    ],
    "title": "A",
    "type": "object"
}
```
And the result after processing with pydantic-glue:
```json
{
    "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
    "columns": {
        "epoch_time": "timestamp"
    }
}
```
Recursing through object properties terminates when you supply a `glue_type` to use. If the type is
complex, you must supply the full complex type yourself.
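The override rule can be illustrated with a small stand-alone sketch. This is not the package's actual code: `glue_type_of` and the `PRIMITIVES` table are hypothetical names, and the schema is a hand-written dict rather than one produced by pydantic. It only shows the behaviour described above, namely that a `glue_type` key wins and nothing below that node is inspected.

```python
from typing import Any

# Assumed primitive mapping for this sketch; the real tool's table may differ.
PRIMITIVES = {"string": "string", "integer": "int", "boolean": "boolean"}


def glue_type_of(node: dict[str, Any]) -> str:
    """Hypothetical helper: map one JSON Schema node to a Glue type string."""
    # An explicit override wins, and recursion stops at this node.
    if "glue_type" in node:
        return node["glue_type"]
    if node["type"] == "array":
        return f"array<{glue_type_of(node['items'])}>"
    if node["type"] == "object":
        inner = ",".join(
            f"{key}:{glue_type_of(sub)}" for key, sub in node["properties"].items()
        )
        return f"struct<{inner}>"
    return PRIMITIVES[node["type"]]


schema = {
    "type": "object",
    "properties": {
        # Override on a leaf: an integer epoch becomes a timestamp column.
        "epoch_time": {"type": "integer", "glue_type": "timestamp"},
        # Override on a complex node: the full type is spelled out by hand,
        # and the "items" below it are never visited.
        "history": {
            "type": "array",
            "items": {"type": "integer"},
            "glue_type": "array<timestamp>",
        },
        # No override: the type is derived recursively as usual.
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

columns = {name: glue_type_of(sub) for name, sub in schema["properties"].items()}
print(columns)
```

Note that for `history` the override has to be the complete `array<timestamp>`; putting `glue_type` on the array node and expecting the element type to be filled in would not work under this rule.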
## How does it work?
* `pydantic` gets converted to JSON Schema
* the JSON Schema types get mapped to Glue types recursively
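The two steps above can be sketched roughly as follows. This is a simplified illustration, not the package's implementation: `to_glue_type`, `to_columns` and the primitive table are hypothetical, and the input is a hand-written approximation of the JSON Schema that pydantic v2 would emit for `Foo` (titles and `required` omitted), so pydantic itself is not needed to run it.

```python
from typing import Any

# Assumed primitive mapping; the real tool's table may differ.
PRIMITIVES = {
    "string": "string",
    "integer": "int",
    "number": "double",
    "boolean": "boolean",
}


def to_glue_type(node: dict[str, Any], defs: dict[str, Any]) -> str:
    """Hypothetical helper: map one JSON Schema node to a Glue type string."""
    if "$ref" in node:
        # Follow local references such as "#/$defs/Bar".
        return to_glue_type(defs[node["$ref"].rsplit("/", 1)[-1]], defs)
    kind = node.get("type")
    if kind == "array":
        return f"array<{to_glue_type(node['items'], defs)}>"
    if kind == "object":
        fields = ",".join(
            f"{name}:{to_glue_type(sub, defs)}"
            for name, sub in node["properties"].items()
        )
        return f"struct<{fields}>"
    if kind in PRIMITIVES:
        return PRIMITIVES[kind]
    raise ValueError(f"unsupported JSON Schema node: {node!r}")


def to_columns(schema: dict[str, Any]) -> dict[str, str]:
    """Map each top-level property of an object schema to a Glue column type."""
    defs = schema.get("$defs", {})
    return {
        name: to_glue_type(sub, defs) for name, sub in schema["properties"].items()
    }


# Approximation of the JSON Schema for the Foo model from the example.
foo_schema = {
    "$defs": {
        "Bar": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
        }
    },
    "type": "object",
    "properties": {
        "nums": {"type": "array", "items": {"type": "integer"}},
        "bars": {"type": "array", "items": {"$ref": "#/$defs/Bar"}},
        "other": {"type": "string"},
    },
}

columns = to_columns(foo_schema)
print(columns)
```

Raising on unknown node types, rather than guessing, is a deliberate choice in this sketch: an unsupported type surfaces immediately instead of producing a wrong column type.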
## Future work
* Not all types are supported. I add types as I need them, but adding a type is easy;
  feel free to open an issue or send a PR if you stumble upon an unsupported use case.
* The tool could easily be extended to work with JSON Schema directly,
  so anything that can be converted to JSON Schema should also work.
Raw data
{
"_id": null,
"home_page": "https://github.com/svdimchenko/pydantic-glue",
"name": "pydantic-glue",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": "pydantic, glue, athena, types, convert",
"author": "Serhii Dimchenko",
"author_email": "svdimchenko@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/77/23/b219e56aba861232b712c82fdd8def4a347daae59fa6cffd8e1c42acc9af/pydantic_glue-0.6.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Convert pydantic model to aws glue schema for terraform",
"version": "0.6.0",
"project_urls": {
"Bug Tracker": "https://github.com/svdimchenko/pydantic-glue/issues",
"Homepage": "https://github.com/svdimchenko/pydantic-glue",
"Releases": "https://github.com/svdimchenko/pydantic-glue/releases",
"Repository": "https://github.com/svdimchenko/pydantic-glue"
},
"split_keywords": [
"pydantic",
" glue",
" athena",
" types",
" convert"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "697d2b20b773c1c9432b832541ddb30210017743ba00bf6babd681a0c71aa9fe",
"md5": "37342327e9f06512bddc26228b6d2a64",
"sha256": "624e38043e50e0a417efb02de58258434d37fc7d86dae602b46fcad543645071"
},
"downloads": -1,
"filename": "pydantic_glue-0.6.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "37342327e9f06512bddc26228b6d2a64",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 6025,
"upload_time": "2024-11-15T12:21:56",
"upload_time_iso_8601": "2024-11-15T12:21:56.997155Z",
"url": "https://files.pythonhosted.org/packages/69/7d/2b20b773c1c9432b832541ddb30210017743ba00bf6babd681a0c71aa9fe/pydantic_glue-0.6.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7723b219e56aba861232b712c82fdd8def4a347daae59fa6cffd8e1c42acc9af",
"md5": "4e859eb9b0be34351699284c1f6580f7",
"sha256": "190d42a073f5666791df760a40f22df3a7e0daf6970fb6f30661825d89cdcf61"
},
"downloads": -1,
"filename": "pydantic_glue-0.6.0.tar.gz",
"has_sig": false,
"md5_digest": "4e859eb9b0be34351699284c1f6580f7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 5290,
"upload_time": "2024-11-15T12:21:58",
"upload_time_iso_8601": "2024-11-15T12:21:58.071618Z",
"url": "https://files.pythonhosted.org/packages/77/23/b219e56aba861232b712c82fdd8def4a347daae59fa6cffd8e1c42acc9af/pydantic_glue-0.6.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-15 12:21:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "svdimchenko",
"github_project": "pydantic-glue",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pydantic-glue"
}