# Relationalize
Relationalize is a Python library for transforming collections of JSON objects, into a relational-friendly format.
It draws inspiration from the [AWS Glue Relationalize transform](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Relationalize.html).
## JSON Object Collections
When working with JSON often there are collections of objects with the same or similar structure. For example, in a NoSQL database there may be a collection describing users with the following two documents/objects:
```javascript
//Document 1
{
"username": "jsmith123",
"created_at": "2022-12-15T20:24:26.637Z",
"contact": {
"email_address": "jsmith123@gmail.com",
"phone_number": 1234567890
},
"connections": [
"jdoe456",
"elowry789"
]
}
// Document 2
{
"username": "jdoe456",
"created_at": 1671135896468,
"contact": {
"email_address": "jdoe899@yahoo.com",
"address": {
"address_1": "77 Middlesex Avenue",
"address_2": "Suite A",
"city": "Somerville",
"state": "MA",
"zip_code": "02145"
}
},
"connections": [
"jsmith123",
"hjones99"
]
}
```
There are a number of challenges that must be overcome to move this data into a relational-database friendly format:
1. Nested Objects (ex: "contact" field)
2. Different data types in the same column (ex: "created_at" field)
3. Sparse columns (ex: "contact.phone_number" & "contact.address" field)
4. Sub-Arrays (ex: "connections" field)
This package provides a solutution to all of these challenges with more portability and flexibility, and less limitations than AWS Glue relationalize.
## How Relationalize works
The relationalize function recursively navigates the JSON object and splits out new ojects/collections whenever an array is encountered and provides a connection/relation between the objects. You provide the Relationalize class a function which will determine where to write the transformed content. This could be a local file object, a remote (s3) file object, or an in memory buffer. Additionally any nested objects are flattened. Each object that is output by relationalize is a flat JSON object.
This package also provides a `Schema` class which can generate a schema for a collection of flat JSON objects. This schema can be used to handle type ambigouity and generate SQL DDL.
For example, the schemas generated by relationalizing and schema generating the above collection's objects would be:
```javascript
// users
{
"username": "str",
"created_at": "c-int-str",
"contact_email_address": "str",
"contact_phone_number": "int",
"contact_address_address_1": "str",
"contact_address_address_2": "str",
"contact_address_city": "str",
"contact_address_state": "str",
"contact_address_zip_code": "str",
"connections": "str"
}
//users_connections
{
"connections__rid_": "str",
"connections__index_": "int",
"connection__val_": "str"
}
```
When processing a collection of JSON objects, the schema is not known, so we must provide a way for the relationalize class to store the new collections it will potentially create. This could be a local or remote file, an in memory buffer, etc...
The relationalize class constructor takes in a function with the signature `(identifier: str) -> TextIO` as an argument (`create_output`). This function is used to create the outputs.
The relationalize class constructor also takes in an optional function that will be called whenever an object is written to a file that was created via the `create_output` function. This method can be utilized to generate the schemas as the objects are encountered, reducing the number of iterations needed over the objects.
For example:
```python
schemas: Dict[str, Schema] = {}
def on_object_write(schema: str, object: dict):
if schema not in schemas:
schemas[schema] = Schema()
schemas[schema].read_object(object)
with Relationalize('object_name', on_object_write=on_object_write) as r:
r.relationalize([{...}, {...}])
```
Once the collection has been relationalized and the schemas have been generated, you can utilize the `convert_object` method to create the final json object, which could be loaded into a database. The `convert_object` method will break out any ambigously typed columns into seperate columns.
For example the first document in the users collection would output the following three documents after being processed by `relationalize` and `convert_object`:
```javascript
// users
{
"username": "jsmith123",
"created_at_str": "2022-12-15T20:24:26.637Z",
"contact_email_address": "jsmith123@gmail.com",
"contact_phone_number": 1234567890,
"connections": "R_969c799a3177437d98074d985861242b"
}
// users_connections
{
"connections__rid_": "R_969c799a3177437d98074d985861242b",
"connections__index_": 0,
"connection__val_": "jdoe456"
}
{
"connections__rid_": "R_969c799a3177437d98074d985861242b",
"connections__index_": 1,
"connection__val_": "elowry789"
}
```
## Installation
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install relationalize.
```bash
pip install relationalize
```
## Examples
Examples are placed in the `examples/` folder.
These examples are intended to be run from the working directory of `examples`.
We recommend starting with the `local_fs_example.py` and then moving to the `memory_example.py`.
For a complete API to database pipeline check out the `full_pokemon_s3_redshift_pipeline.py` example.
## Contributing
Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.
Please make sure to update tests as appropriate.
## License
[MIT](https://choosealicense.com/licenses/mit/)
Raw data
{
"_id": null,
"home_page": "https://github.com/tulip/relationalize",
"name": "relationalize",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "tulip, relationalize, json, relational, array",
"author": "Henry Jones",
"author_email": "Henry Jones <henry.jones@tulip.co>",
"download_url": "https://files.pythonhosted.org/packages/a4/e0/b63105ba558cf2750caf978c229e36332f17b16ae12ee22aafafaa9cbe4c/relationalize-0.1.9.tar.gz",
"platform": null,
"description": "# Relationalize\nRelationalize is a Python library for transforming collections of JSON objects, into a relational-friendly format.\nIt draws inspiration from the [AWS Glue Relationalize transform](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Relationalize.html).\n\n## JSON Object Collections\nWhen working with JSON often there are collections of objects with the same or similar structure. For example, in a NoSQL database there may be a collection describing users with the following two documents/objects:\n```javascript\n//Document 1\n{\n \"username\": \"jsmith123\",\n \"created_at\": \"2022-12-15T20:24:26.637Z\",\n \"contact\": {\n \"email_address\": \"jsmith123@gmail.com\",\n \"phone_number\": 1234567890\n },\n \"connections\": [\n \"jdoe456\",\n \"elowry789\"\n ]\n}\n// Document 2\n{\n \"username\": \"jdoe456\",\n \"created_at\": 1671135896468,\n \"contact\": {\n \"email_address\": \"jdoe899@yahoo.com\",\n \"address\": {\n \"address_1\": \"77 Middlesex Avenue\",\n \"address_2\": \"Suite A\",\n \"city\": \"Somerville\",\n \"state\": \"MA\",\n \"zip_code\": \"02145\"\n }\n },\n \"connections\": [\n \"jsmith123\",\n \"hjones99\"\n ]\n}\n```\nThere are a number of challenges that must be overcome to move this data into a relational-database friendly format:\n1. Nested Objects (ex: \"contact\" field)\n2. Different data types in the same column (ex: \"created_at\" field)\n3. Sparse columns (ex: \"contact.phone_number\" & \"contact.address\" field)\n4. Sub-Arrays (ex: \"connections\" field)\n\nThis package provides a solutution to all of these challenges with more portability and flexibility, and less limitations than AWS Glue relationalize.\n\n## How Relationalize works\nThe relationalize function recursively navigates the JSON object and splits out new ojects/collections whenever an array is encountered and provides a connection/relation between the objects. You provide the Relationalize class a function which will determine where to write the transformed content. This could be a local file object, a remote (s3) file object, or an in memory buffer. Additionally any nested objects are flattened. Each object that is output by relationalize is a flat JSON object.\n\nThis package also provides a `Schema` class which can generate a schema for a collection of flat JSON objects. This schema can be used to handle type ambigouity and generate SQL DDL.\n\nFor example, the schemas generated by relationalizing and schema generating the above collection's objects would be:\n```javascript\n// users\n{\n \"username\": \"str\",\n \"created_at\": \"c-int-str\",\n \"contact_email_address\": \"str\",\n \"contact_phone_number\": \"int\",\n \"contact_address_address_1\": \"str\",\n \"contact_address_address_2\": \"str\",\n \"contact_address_city\": \"str\",\n \"contact_address_state\": \"str\",\n \"contact_address_zip_code\": \"str\",\n \"connections\": \"str\"\n\n}\n//users_connections\n{\n \"connections__rid_\": \"str\",\n \"connections__index_\": \"int\",\n \"connection__val_\": \"str\"\n}\n```\n\nWhen processing a collection of JSON objects, the schema is not known, so we must provide a way for the relationalize class to store the new collections it will potentially create. This could be a local or remote file, an in memory buffer, etc...\n\nThe relationalize class constructor takes in a function with the signature `(identifier: str) -> TextIO` as an argument (`create_output`). This function is used to create the outputs.\n\nThe relationalize class constructor also takes in an optional function that will be called whenever an object is written to a file that was created via the `create_output` function. This method can be utilized to generate the schemas as the objects are encountered, reducing the number of iterations needed over the objects.\n\nFor example:\n```python\nschemas: Dict[str, Schema] = {}\n\ndef on_object_write(schema: str, object: dict):\n if schema not in schemas:\n schemas[schema] = Schema()\n schemas[schema].read_object(object)\n\nwith Relationalize('object_name', on_object_write=on_object_write) as r:\n r.relationalize([{...}, {...}])\n```\n\nOnce the collection has been relationalized and the schemas have been generated, you can utilize the `convert_object` method to create the final json object, which could be loaded into a database. The `convert_object` method will break out any ambigously typed columns into seperate columns.\n\nFor example the first document in the users collection would output the following three documents after being processed by `relationalize` and `convert_object`:\n```javascript\n// users\n{\n \"username\": \"jsmith123\",\n \"created_at_str\": \"2022-12-15T20:24:26.637Z\",\n \"contact_email_address\": \"jsmith123@gmail.com\",\n \"contact_phone_number\": 1234567890,\n \"connections\": \"R_969c799a3177437d98074d985861242b\"\n}\n// users_connections\n{\n \"connections__rid_\": \"R_969c799a3177437d98074d985861242b\",\n \"connections__index_\": 0,\n \"connection__val_\": \"jdoe456\"\n}\n{\n \"connections__rid_\": \"R_969c799a3177437d98074d985861242b\",\n \"connections__index_\": 1,\n \"connection__val_\": \"elowry789\"\n}\n```\n\n\n## Installation\n\nUse the package manager [pip](https://pip.pypa.io/en/stable/) to install relationalize.\n\n```bash\npip install relationalize\n```\n\n## Examples\nExamples are placed in the `examples/` folder.\nThese examples are intended to be run from the working directory of `examples`.\n\nWe recommend starting with the `local_fs_example.py` and then moving to the `memory_example.py`.\n\nFor a complete API to database pipeline check out the `full_pokemon_s3_redshift_pipeline.py` example.\n\n## Contributing\n\nPull requests are welcome. For major changes, please open an issue first\nto discuss what you would like to change.\n\nPlease make sure to update tests as appropriate.\n\n## License\n\n[MIT](https://choosealicense.com/licenses/mit/)\n",
"bugtrack_url": null,
"license": null,
"summary": "A utility for converting/transporting arbitrary JSON data collections into a relational database",
"version": "0.1.9",
"project_urls": {
"Homepage": "https://github.com/tulip/relationalize"
},
"split_keywords": [
"tulip",
" relationalize",
" json",
" relational",
" array"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a4e0b63105ba558cf2750caf978c229e36332f17b16ae12ee22aafafaa9cbe4c",
"md5": "4ea49916438b10be6c3e635216351303",
"sha256": "1df8643c2a9575271c6914a0ad52d00f5620f70f124fdd775b0dcc4e82cbd412"
},
"downloads": -1,
"filename": "relationalize-0.1.9.tar.gz",
"has_sig": false,
"md5_digest": "4ea49916438b10be6c3e635216351303",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 11909,
"upload_time": "2024-08-15T20:29:55",
"upload_time_iso_8601": "2024-08-15T20:29:55.190237Z",
"url": "https://files.pythonhosted.org/packages/a4/e0/b63105ba558cf2750caf978c229e36332f17b16ae12ee22aafafaa9cbe4c/relationalize-0.1.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-15 20:29:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tulip",
"github_project": "relationalize",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "relationalize"
}