schemaworks

Name: schemaworks
Version: 1.2.1
Home page: https://github.com/anatol-ju/schemaworks
Summary: A schema conversion toolkit for JSON, Spark, PyIceberg and SQL formats.
Upload time: 2025-07-16 13:53:00
Maintainer: None
Docs URL: None
Author: Anatol Jurenkow
Requires Python: >=3.10
License: GNU GPLv3
Keywords: schema, conversion, spark, json, sql, data-engineering
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.

# SchemaWorks

**SchemaWorks** is a Python library for converting between different schema definitions, such as JSON Schema, Spark DataTypes, SQL type strings, and more. It aims to simplify working with structured data across multiple data engineering and analytics platforms.

## 📣 New in 1.2.0
Added support for creating Iceberg schemas for use with PyIceberg.

## 🔧 Features

- Convert JSON Schema to:
  - Apache Spark StructType
  - SQL column type strings
  - Python dtypes dictionaries
  - Iceberg types (using PyIceberg)
- Convert Spark schemas and dtypes to JSON Schema
- Generate JSON Schemas from example data
- Flatten nested schemas for easier inspection or mapping
- Utilities for handling Decimal encoding and schema inference

## 🚀 Use Cases

- Building pipelines that consume or produce data in multiple formats
- Ensuring schema consistency across Spark, SQL, and data validation layers
- Automating schema generation from sample data for prototyping
- Simplifying developer tooling with schema introspection

## 🔍 Validation Support

SchemaWorks includes custom schema validation support through extended JSON Schema validators. It supports standard JSON Schema types such as `string`, `integer`, and `array`, and also recognises additional types common in data engineering workflows:

- Extended support for:
  - `float`, `bool`, `long`
  - `date`, `datetime`, `time`
  - `map`

Validation is performed using an enhanced version of `jsonschema.Draft202012Validator` that integrates these type checks.
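
For context, this kind of extension can be built on the `jsonschema` package's public API. The sketch below illustrates the general mechanism only; it is not SchemaWorks' actual implementation. In practice, use `PythonTypeValidator` from `schemaworks.validators`, as shown in the examples further down.

```python
# Minimal sketch of the general technique: extend Draft202012Validator with an
# extra "datetime" type. Illustrative only, not SchemaWorks' own code.
from datetime import datetime

from jsonschema import Draft202012Validator, validators

def is_datetime(checker, instance):
    # Accept datetime objects or ISO 8601 strings.
    if isinstance(instance, datetime):
        return True
    try:
        datetime.fromisoformat(instance)
        return True
    except (TypeError, ValueError):
        return False

# Redefine the type checker and build a new validator class from it.
type_checker = Draft202012Validator.TYPE_CHECKER.redefine("datetime", is_datetime)
ExtendedValidator = validators.extend(Draft202012Validator, type_checker=type_checker)

schema = {"type": "object", "properties": {"created_at": {"type": "datetime"}}}
ExtendedValidator(schema).validate({"created_at": "2023-01-01T00:00:00"})
```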

## 🚚 Installation

You can install SchemaWorks using `pip` or `poetry`, depending on your preference.

### Using pip

Make sure you’re using Python 3.10 or later.

```bash
pip install schemaworks
```

This will install the package along with its core dependencies.

### Using Poetry

If you use [Poetry](https://python-poetry.org/) for dependency management:

```bash
poetry add schemaworks
```

To install development dependencies as well (for testing and linting):

```bash
poetry install --with dev
```

### Cloning the Repository (For Development)

If you want to clone and develop the package locally:

```bash
git clone https://github.com/anatol-ju/schemaworks.git
cd schemaworks
poetry install --with dev
pre-commit install  # optional: enable linting and formatting checks
```

To run the test suite:

```bash
poetry run pytest
```

## 🧱 Quick Example

```python
from schemaworks.converter import JsonSchemaConverter

# Load a JSON schema
schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "purchase": {
            "type": "object",
            "properties": {
                "item": {"type": "string"},
                "price": {"type": "number"}
            }
        }
    }
}

converter = JsonSchemaConverter(schema=schema)

# Convert to Spark schema
spark_schema = converter.to_spark_schema()
print(spark_schema)

# Convert to SQL string
sql_schema = converter.to_sql_string()
print(sql_schema)
```

## 📖 Documentation

- JSON ↔ Spark conversions
  Map JSON schema types to Spark StructTypes and back.
- Schema flattening
  Flatten nested schemas into dot notation for easier access and mapping.
- Data-driven schema inference
  Automatically generate JSON schemas from raw data samples.
- Decimal compatibility
  Custom JSON encoder to handle decimal.Decimal values safely.
- Schema validation
  Validate schemas and make data conform if needed.

## 🧪 Testing

Run unit tests using pytest:
```bash
poetry run pytest
```

## ⭐ Examples

### ✅ Convert JSON schema to Spark StructType

When working with data pipelines, it’s common to receive schemas in JSON format — whether from APIs, data contracts, or auto-generated metadata. But tools like Apache Spark and PySpark require their own schema definitions in the form of StructType. Manually translating between these formats is error-prone, time-consuming, and doesn’t scale. This function bridges that gap by automatically converting standard JSON Schemas into Spark-compatible schemas, saving hours of manual effort and reducing the risk of type mismatches in production pipelines.

```python
from schemaworks import JsonSchemaConverter

json_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number"}
    }
}

converter = JsonSchemaConverter(schema=json_schema)
spark_schema = converter.to_spark_schema()
print(spark_schema)
```
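
For comparison, the equivalent schema written by hand with `pyspark.sql.types` looks roughly like this. The type mapping shown (JSON `integer` to `IntegerType`, `number` to `DoubleType`) is an assumption for illustration; the exact mapping is defined by the converter.

```python
# Hand-written equivalent for comparison (assumed mapping: integer -> IntegerType,
# number -> DoubleType; the converter's actual mapping may differ).
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType

spark_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True),
])
```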

### ✅ Infer schema from example JSON data

When working with dynamic or loosely structured data sources, manually writing a schema can be tedious and error-prone—especially when dealing with deeply nested or inconsistent inputs. This function allows you to infer a valid JSON Schema directly from real example data, making it much faster to prototype, validate, or document your datasets. It’s particularly useful when onboarding new datasets or integrating third-party APIs, where a formal schema may be missing or outdated.

```python
import json
from pprint import pprint
from schemaworks.utils import generate_schema

with open("example_data.json", "r") as f:
    example_data = json.load(f)

schema = generate_schema(example_data, add_required=True)
pprint(schema)
```

### ✅ Flatten a nested schema

Flattening a nested JSON schema makes it easier to map fields to flat tabular structures, such as SQL tables or Spark DataFrames. It simplifies downstream processing, column selection, and validation—especially when working with deeply nested APIs or hierarchical datasets.

```python
# Reuses `converter` and the `pprint` import from the previous examples
converter.json_schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "contact": {
            "type": "object",
            "properties": {
                "email": {"type": "string"},
                "phone": {"type": "string"}
            },
        },
        "active": {"type": "boolean"},
    },
    "required": ["user_id", "email"],
}
flattened = converter.to_flat()
pprint(flattened)
```
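
Assuming the dot-notation flattening described above, the result maps nested fields to dot-separated keys, roughly along these lines (illustrative only; the exact value format depends on the implementation):

```python
# Possible shape of the flattened schema (not verified output):
expected_shape = {
    "user_id": {"type": "integer"},
    "contact.email": {"type": "string"},
    "contact.phone": {"type": "string"},
    "active": {"type": "boolean"},
}
```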

### ✅ Convert inferred schema to SQL column types

After inferring or converting a schema, it's often necessary to express it in SQL-friendly syntax—for example, when creating tables or validating incoming data. This method translates a JSON schema into a SQL column type definition string, which is especially helpful for building integration scripts, automating ETL jobs, or generating documentation.

```python
pprint(converter.to_sql_string())
```

### ✅ Convert to Apache Iceberg Schema

You can now (as of version 1.2.0) convert a JSON Schema directly into an Iceberg-compatible schema using PyIceberg:

```python
from schemaworks.converter import JsonSchemaConverter

json_schema = {
    "type": "object",
    "properties": {
        "uid": {"type": "string"},
        "details": {
            "type": "object",
            "properties": {
                "score": {"type": "number"},
                "active": {"type": "boolean"}
            },
            "required": ["score"]
        }
    },
    "required": ["uid"]
}

converter = JsonSchemaConverter(json_schema)
iceberg_schema = converter.to_iceberg_schema()
```

### ✅ Handle decimals in JSON safely

`DecimalEncoder` is a custom JSON encoder that converts `Decimal` objects to `int` or `float` during serialization.

This avoids the serialization errors the standard `json` module raises for `Decimal` values. Note that full precision is not preserved, since values are converted to built-in `float` or `int` types.

```python
from schemaworks.utils import DecimalEncoder
from decimal import Decimal
import json

data = {"price": Decimal("19.99")}
print(json.dumps(data, cls=DecimalEncoder))  # Output: {"price": 19.99}
```
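
Under the hood, an encoder like this is typically a small `json.JSONEncoder` subclass. The sketch below shows the general pattern with a hypothetical `SimpleDecimalEncoder`; it is not necessarily the packaged `DecimalEncoder`'s exact implementation.

```python
# General technique: override JSONEncoder.default to handle Decimal values.
# Sketch of the pattern; SimpleDecimalEncoder is a hypothetical name.
import json
from decimal import Decimal

class SimpleDecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            # Integral decimals become int, everything else becomes float.
            return int(obj) if obj == obj.to_integral_value() else float(obj)
        return super().default(obj)

print(json.dumps({"price": Decimal("19.99"), "qty": Decimal("3")}, cls=SimpleDecimalEncoder))
# {"price": 19.99, "qty": 3}
```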

### ✅ Validate data

```python
from schemaworks.validators import PythonTypeValidator

schema = {
    "type": "object",
    "properties": {
        "created_at": {"type": "datetime"},
        "price": {"type": "float"},
        "active": {"type": "bool"}
    }
}

data = {
    "created_at": "2023-01-01T00:00:00",
    "price": 10.5,
    "active": True
}

validator = PythonTypeValidator()
validator.validate(data, schema)
```

### ✅ Make data conform to schema

You can also use `.conform()` to enforce schema types and fill in missing values with sensible defaults:

```python
conformed_data = validator.conform(data, schema, fill_missing=True)
```

## 📄 License

This project is licensed under the MIT License.

You are free to use, modify, and distribute this software, provided that you include the original copyright
notice and this permission notice in all copies or substantial portions of the software.

For full terms, see the [MIT license](https://opensource.org/license/mit).

## 🧑‍💻 Author

Anatol Jurenkow

Cloud Data Engineer | AWS Enthusiast | Iceberg Fan

[GitHub](https://github.com/anatol-ju) · [LinkedIn](https://de.linkedin.com/in/anatol-jurenkow)

            
