pyscgen


Namepyscgen JSON
Version 0.2.1 PyPI version JSON
download
home_pagehttps://github.com/Salfiii/pyscgen
SummaryPython AVRO Schema generator which can use multiples example JSONs as input to infer the Schema.
upload_time2022-12-13 08:11:45
maintainer
docs_urlNone
authorFlorian Salfenmoser
requires_python>=3.10,<4.0
licenseBSD-3-Clause
keywords avro json merge analyze pydantic model schema generator
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # (pyscgen) python schema generator 
A Python Package to analyze JSON Documents, build a merged JSON-Document out of multiple provided JSON Documents and create an 
AVRO-Schemas based on multiple given JSON Documents.

## Installation

### pip
``pip install pyscgen``

### poetry
``poetry add pyscgen``

## JSON Analyzer
receives a list of json documents, analyzes the structure and outputs the following infos:

-**Output**:
- collection_data: 
  - dict which stores infos about the columns of each document with attributes like name, path, data type and if its null
- column_infos: 
  - condensed/merged column infos of all given documents with attributes like name, path, nullability, density, unique values, data types found, parent column config etc.
  - This is the "real" result of the analyzer and the building plan for the JSON Merger and AVRO Schema generator.
- df_flattened
  - A pandas DataFrame which stores the json documents flattened and contains every found column with data.
  - One document is represented by one row, index starts at 0 and matches to the order in the given list of documents.
- df_dtypes
  - A pandas DataFrame which stores the python data type for each column of all given documents.
  - One document is represented by one row, index starts at 0 and matches to the order in the given list of documents.
- df_unique
  - A pandas DataFrame which stores unique values found for each column of all given documents.
  - This data frame is pivoted
    - The column "0" stores all found column. One column is represented by one row.
    - The columns 1 - n contain one distinct value each. If you analyze 10 documents and one field has a distinct value in each document, you´ll produce 10 value columns.

## JSON Merger
receives a list of json documents, analyzes the structure with the JSON Analyzer and outputs one merged dictionary/json document with all found columns and dummy values according to the found data types:
 
- **Output**: merged_doc: 
  - dictionary merged together from the all given JSON documents.

## AVRO Schema generator
receives a list of json documents, analyzes the structure with the JSON Analyzer and outputs a "Schema"-object which can be converted to a dict and stored in an avsc-file
 
- **Flow:**
  - JSON Analyzer -> JSON Merger -> AVRO Schema generator
- **Output**: avro_schema:
  - AVRO Schema object can be converted to a dict and stored to an avsc file, see examples.
- **Limitations/Currently not supported**:
  - Resolution of duplicated names in the AVRO-Output
    - You can resolve this manually by renaming duplicated elements by hand afterwards and use [aliases](https://avro.apache.org/docs/current/spec.html#Aliases) to still match with the input data.
    - You can also use inheritance in AVRO-Schemas, here´s an example: https://www.nerd.vision/post/reusing-schema-definitions-in-avro, but that´s definitely more complicated
  - empty dicts/maps:
    - If one of your JSON input documents has an empty dictionary, e. g. ```{"field1": "value1", "field2": {}}``` an empty AVRO record will be generated, which is in valid in AVRO.
    - You can resolve this by deleting the empty record or filling it with live if you know that it´s still needed in the future.
  - Mixed types for one field in the the JSON Input documents currently results in the most generic AVRO type, which ist a String (Union types, as an option, is planned)
    - This leads to the case that not all input documents automaticall validate positively against the AVRO schema, because you need to convert types beforehand. 
    - You can use the pydantic model generator to create a model, which automatically converts other data types to string, to solve this issue. Just let all your JSON flow through the model bevore validation.

## Pydantic Model generator
receives a list of json documents analyzes the structure with the JSON Analyzer, creates an AVRO Schema which is then used to create a pydantic Model with [pydantic-avro](https://github.com/godatadriven/pydantic-avro)
 
- **Flow**
  - JSON Analyzer -> JSON Merger -> AVRO Schema generator -> Pydantic Model generator 
- **Output:** pydantic model:
  - Pydantic Model as a string which can be written to a .py-File.
- **Limitations/Currently not supported**:
  - Python/AVRO bytes is currently not supported in pydantic model generation because there is no support in pydantic for this datatype right now.

# Why should you use it?
Considering the fact that there are already other AVRO Schema generators, why should you use this one?

The simple answer is "null" or "nullability".

All other solutions I tried simply let you pass one JSON Document as a base for the AVRO-Schema generation.
This does obviously not allow to infer nullabilty, because only what's present in this one message can be observed 
and therefore used for schema generation.

This library lets you pass a list of JSON Documents and therefore can gather the infos which fields are always 
present - aka. mandatory - and which fields are only found in a couple documents - and therefore nullable.

Handling nullability in AVRO-Schemas for records and arrays is quite painful in my opinion, 
this library gets rid of this chore for you.

Of course, it works just fine with only one JSON document like any other AVRO generator, just without the nullability part then.

# Principles
- CI-CO (Crap in - Crap out): 
  - The JSON Documents are not validated, therefore, if you pass in garbage, crap in - crap out.
  - The goal is to creata an AVRO-Schema no matter what. In my opinion, an half ready AVRO-Schema which needs to be edited by hand is better than no AVRO-Schema at all!

# Examples:
- [JSON Analyzer](./example/pyscgen_json_analyze.py)
- [JSON Merger](./example/pyscgen_json_merge.py)
- [AVRO Schema creator](./example/pyscgen_avro_create_schema.py)
- [Pydantic model creator](./example/pyscgen_pydantic_create_schema.py)

# Other useful resources regarding AVRO:
- [Official Apache AVRO docs](https://avro.apache.org/docs/current/spec.html)
- [Python AVRO Schema valdiator "fastavro"](https://github.com/fastavro/fastavro)

# License:
[BSD-3-Clause](LICENSE)
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Salfiii/pyscgen",
    "name": "pyscgen",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10,<4.0",
    "maintainer_email": "",
    "keywords": "AVRO,JSON,merge,analyze,pydantic,model,schema,generator",
    "author": "Florian Salfenmoser",
    "author_email": "florian.salfenmoser.dev@outlook.de",
    "download_url": "https://files.pythonhosted.org/packages/a1/3d/ca9292c1b4321e0e69abaa94b08cdbb5efb029c4d5912c363042bb973531/pyscgen-0.2.1.tar.gz",
    "platform": null,
    "description": "# (pyscgen) python schema generator \nA Python Package to analyze JSON Documents, build a merged JSON-Document out of multiple provided JSON Documents and create an \nAVRO-Schemas based on multiple given JSON Documents.\n\n## Installation\n\n### pip\n``pip install pyscgen``\n\n### poetry\n``poetry add pyscgen``\n\n## JSON Analyzer\nreceives a list of json documents, analyzes the structure and outputs the following infos:\n\n-**Output**:\n- collection_data: \n  - dict which stores infos about the columns of each document with attributes like name, path, data type and if its null\n- column_infos: \n  - condensed/merged column infos of all given documents with attributes like name, path, nullability, density, unique values, data types found, parent column config etc.\n  - This is the \"real\" result of the analyzer and the building plan for the JSON Merger and AVRO Schema generator.\n- df_flattened\n  - A pandas DataFrame which stores the json documents flattened and contains every found column with data.\n  - One document is represented by one row, index starts at 0 and matches to the order in the given list of documents.\n- df_dtypes\n  - A pandas DataFrame which stores the python data type for each column of all given documents.\n  - One document is represented by one row, index starts at 0 and matches to the order in the given list of documents.\n- df_unique\n  - A pandas DataFrame which stores unique values found for each column of all given documents.\n  - This data frame is pivoted\n    - The column \"0\" stores all found column. One column is represented by one row.\n    - The columns 1 - n contain one distinct value each. If you analyze 10 documents and one field has a distinct value in each document, you\u00b4ll produce 10 value columns.\n\n## JSON Merger\nreceives a list of json documents, analyzes the structure with the JSON Analyzer and outputs one merged dictionary/json document with all found columns and dummy values according to the found data types:\n \n- **Output**: merged_doc: \n  - dictionary merged together from the all given JSON documents.\n\n## AVRO Schema generator\nreceives a list of json documents, analyzes the structure with the JSON Analyzer and outputs a \"Schema\"-object which can be converted to a dict and stored in an avsc-file\n \n- **Flow:**\n  - JSON Analyzer -> JSON Merger -> AVRO Schema generator\n- **Output**: avro_schema:\n  - AVRO Schema object can be converted to a dict and stored to an avsc file, see examples.\n- **Limitations/Currently not supported**:\n  - Resolution of duplicated names in the AVRO-Output\n    - You can resolve this manually by renaming duplicated elements by hand afterwards and use [aliases](https://avro.apache.org/docs/current/spec.html#Aliases) to still match with the input data.\n    - You can also use inheritance in AVRO-Schemas, here\u00b4s an example: https://www.nerd.vision/post/reusing-schema-definitions-in-avro, but that\u00b4s definitely more complicated\n  - empty dicts/maps:\n    - If one of your JSON input documents has an empty dictionary, e. g. ```{\"field1\": \"value1\", \"field2\": {}}``` an empty AVRO record will be generated, which is in valid in AVRO.\n    - You can resolve this by deleting the empty record or filling it with live if you know that it\u00b4s still needed in the future.\n  - Mixed types for one field in the the JSON Input documents currently results in the most generic AVRO type, which ist a String (Union types, as an option, is planned)\n    - This leads to the case that not all input documents automaticall validate positively against the AVRO schema, because you need to convert types beforehand. \n    - You can use the pydantic model generator to create a model, which automatically converts other data types to string, to solve this issue. Just let all your JSON flow through the model bevore validation.\n\n## Pydantic Model generator\nreceives a list of json documents analyzes the structure with the JSON Analyzer, creates an AVRO Schema which is then used to create a pydantic Model with [pydantic-avro](https://github.com/godatadriven/pydantic-avro)\n \n- **Flow**\n  - JSON Analyzer -> JSON Merger -> AVRO Schema generator -> Pydantic Model generator \n- **Output:** pydantic model:\n  - Pydantic Model as a string which can be written to a .py-File.\n- **Limitations/Currently not supported**:\n  - Python/AVRO bytes is currently not supported in pydantic model generation because there is no support in pydantic for this datatype right now.\n\n# Why should you use it?\nConsidering the fact that there are already other AVRO Schema generators, why should you use this one?\n\nThe simple answer is \"null\" or \"nullability\".\n\nAll other solutions I tried simply let you pass one JSON Document as a base for the AVRO-Schema generation.\nThis does obviously not allow to infer nullabilty, because only what's present in this one message can be observed \nand therefore used for schema generation.\n\nThis library lets you pass a list of JSON Documents and therefore can gather the infos which fields are always \npresent - aka. mandatory - and which fields are only found in a couple documents - and therefore nullable.\n\nHandling nullability in AVRO-Schemas for records and arrays is quite painful in my opinion, \nthis library gets rid of this chore for you.\n\nOf course, it works just fine with only one JSON document like any other AVRO generator, just without the nullability part then.\n\n# Principles\n- CI-CO (Crap in - Crap out): \n  - The JSON Documents are not validated, therefore, if you pass in garbage, crap in - crap out.\n  - The goal is to creata an AVRO-Schema no matter what. In my opinion, an half ready AVRO-Schema which needs to be edited by hand is better than no AVRO-Schema at all!\n\n# Examples:\n- [JSON Analyzer](./example/pyscgen_json_analyze.py)\n- [JSON Merger](./example/pyscgen_json_merge.py)\n- [AVRO Schema creator](./example/pyscgen_avro_create_schema.py)\n- [Pydantic model creator](./example/pyscgen_pydantic_create_schema.py)\n\n# Other useful resources regarding AVRO:\n- [Official Apache AVRO docs](https://avro.apache.org/docs/current/spec.html)\n- [Python AVRO Schema valdiator \"fastavro\"](https://github.com/fastavro/fastavro)\n\n# License:\n[BSD-3-Clause](LICENSE)",
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "Python AVRO Schema generator which can use multiples example JSONs as input to infer the Schema.",
    "version": "0.2.1",
    "split_keywords": [
        "avro",
        "json",
        "merge",
        "analyze",
        "pydantic",
        "model",
        "schema",
        "generator"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "a22a096f16f001634137d9da5c98b3b9",
                "sha256": "1aae692a5aefde99141f7d9b1dd00757316a9057a423701df2862531eae13751"
            },
            "downloads": -1,
            "filename": "pyscgen-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a22a096f16f001634137d9da5c98b3b9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<4.0",
            "size": 21332,
            "upload_time": "2022-12-13T08:11:42",
            "upload_time_iso_8601": "2022-12-13T08:11:42.817568Z",
            "url": "https://files.pythonhosted.org/packages/21/8b/ad2719604123931ca61009c7c906c4887e51b2e080b7aff7df91042c0d7a/pyscgen-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "89488463f6e7616241e8b983501e9b56",
                "sha256": "05d58b05eb2fc8f1bbc36c6c96e7c75facbee0732bd496c0e4d61606a812fc6d"
            },
            "downloads": -1,
            "filename": "pyscgen-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "89488463f6e7616241e8b983501e9b56",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<4.0",
            "size": 17248,
            "upload_time": "2022-12-13T08:11:45",
            "upload_time_iso_8601": "2022-12-13T08:11:45.355502Z",
            "url": "https://files.pythonhosted.org/packages/a1/3d/ca9292c1b4321e0e69abaa94b08cdbb5efb029c4d5912c363042bb973531/pyscgen-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-13 08:11:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "Salfiii",
    "github_project": "pyscgen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pyscgen"
}
        
Elapsed time: 0.02493s