spark-ddl-parser

Name: spark-ddl-parser
Version: 0.1.0
Summary: Zero-dependency PySpark DDL schema parser
Author: Odos Matthews <odosmatthews@gmail.com>
Requires-Python: >=3.8
License: MIT
Keywords: spark, pyspark, ddl, schema, parser, data-engineering, schema-parser
Repository: https://github.com/eddiethedean/spark-ddl-parser
Upload time: 2025-10-14 14:46:01
# Spark DDL Parser

A zero-dependency Python library for parsing PySpark DDL schema strings into structured Python objects.

## Features

- **Zero Dependencies**: Only uses Python standard library
- **PySpark Compatible**: Parses standard PySpark DDL format
- **Type Safe**: Returns structured dataclasses
- **Comprehensive**: Supports all PySpark data types including nested structs, arrays, and maps
- **Well Tested**: 200+ test cases covering edge cases and performance

## Installation

```bash
pip install spark-ddl-parser
```

## Quick Start

```python
from spark_ddl_parser import parse_ddl_schema

# Parse a simple schema
schema = parse_ddl_schema("id long, name string")

print(schema.fields[0].name)  # 'id'
print(schema.fields[0].data_type.type_name)  # 'long'
print(schema.fields[1].name)  # 'name'
print(schema.fields[1].data_type.type_name)  # 'string'
```

## Supported Types

### Simple Types
- `string`, `int`, `integer`, `long`, `bigint`
- `double`, `float`, `short`, `smallint`, `byte`, `tinyint`
- `boolean`, `bool`, `date`, `timestamp`, `binary`

### Complex Types
- **Arrays**: `array<string>`, `array<long>`
- **Maps**: `map<string,int>`, `map<string,array<long>>`
- **Structs**: `struct<name:string,age:int>`
- **Decimal**: `decimal(10,2)` (with precision and scale)
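
These parsed objects expose the attributes documented in the API reference below. As a quick sketch, parsing one field of each complex type and inspecting the result:

```python
from spark_ddl_parser import parse_ddl_schema

schema = parse_ddl_schema("tags array<string>, prefs map<string,int>, price decimal(10,2)")

tags = schema.fields[0].data_type
print(tags.type_name)                # 'array'
print(tags.element_type.type_name)   # 'string'

prefs = schema.fields[1].data_type
print(prefs.key_type.type_name)      # 'string'
print(prefs.value_type.type_name)    # 'int'

price = schema.fields[2].data_type
print(price.precision, price.scale)  # 10 2
```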

### Nested Structures

```python
# Nested structs
schema = parse_ddl_schema("""
    id long,
    address struct<
        street:string,
        city:string,
        zip:string
    >,
    tags array<string>,
    metadata map<string,string>
""")

# Access nested fields
address_field = schema.fields[1]
print(address_field.name)  # 'address'
print(address_field.data_type.type_name)  # 'struct'
```
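
Because structs can nest arbitrarily, a recursive walk is a handy way to inspect a parsed schema. Below is a minimal sketch; the `walk` helper is illustrative, not part of the library, and it assumes a nested struct's `data_type` is itself a `StructType` with its own `fields` list, consistent with the API reference below:

```python
from spark_ddl_parser import parse_ddl_schema

schema = parse_ddl_schema("id long, address struct<street:string,city:string>, tags array<string>")

def walk(dtype, indent=0):
    """Print a parsed type tree, one node per line (illustrative helper)."""
    pad = "    " * indent
    if dtype.type_name == "struct":
        for field in dtype.fields:
            print(f"{pad}{field.name}:")
            walk(field.data_type, indent + 1)
    elif dtype.type_name == "array":
        print(f"{pad}array of:")
        walk(dtype.element_type, indent + 1)
    elif dtype.type_name == "map":
        print(f"{pad}map of:")
        walk(dtype.key_type, indent + 1)
        walk(dtype.value_type, indent + 1)
    else:
        print(f"{pad}{dtype.type_name}")

walk(schema)
# id:
#     long
# address:
#     street:
#         string
#     city:
#         string
# tags:
#     array of:
#         string
```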

## API Reference

### `parse_ddl_schema(ddl_string: str) -> StructType`

Parse a DDL schema string into a structured type.

**Parameters:**
- `ddl_string` (str): DDL schema string (e.g., "id long, name string")

**Returns:**
- `StructType`: Structured type with fields

**Raises:**
- `ValueError`: If the DDL string is invalid

**Example:**
```python
schema = parse_ddl_schema("id long, name string")
```

### Type Objects

#### `StructType`
Represents a struct containing fields.

**Attributes:**
- `type_name` (str): Always "struct"
- `fields` (List[StructField]): List of struct fields

#### `StructField`
Represents a field in a struct.

**Attributes:**
- `name` (str): Field name
- `data_type` (DataType): Field data type
- `nullable` (bool): Whether field is nullable (default: True)
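
For example, parsed fields carry the default nullability:

```python
from spark_ddl_parser import parse_ddl_schema

field = parse_ddl_schema("id long").fields[0]
print(field.name)      # 'id'
print(field.nullable)  # True (the default)
```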

#### `SimpleType`
Represents a simple data type.

**Attributes:**
- `type_name` (str): Type name (e.g., "string", "long", "int")

#### `ArrayType`
Represents an array type.

**Attributes:**
- `type_name` (str): Always "array"
- `element_type` (DataType): Type of array elements

#### `MapType`
Represents a map type.

**Attributes:**
- `type_name` (str): Always "map"
- `key_type` (DataType): Type of map keys
- `value_type` (DataType): Type of map values

#### `DecimalType`
Represents a decimal type.

**Attributes:**
- `type_name` (str): Always "decimal"
- `precision` (int): Decimal precision (default: 10)
- `scale` (int): Decimal scale (default: 0)
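
Together, these attributes are enough to reconstruct DDL text from a parsed schema. A round-trip sketch using only the attributes documented above (the `to_ddl` helper is illustrative, not part of the library):

```python
from spark_ddl_parser import parse_ddl_schema

def to_ddl(dtype):
    """Render a parsed type object back to DDL text (illustrative helper)."""
    name = dtype.type_name
    if name == "struct":
        inner = ",".join(f"{f.name}:{to_ddl(f.data_type)}" for f in dtype.fields)
        return f"struct<{inner}>"
    if name == "array":
        return f"array<{to_ddl(dtype.element_type)}>"
    if name == "map":
        return f"map<{to_ddl(dtype.key_type)},{to_ddl(dtype.value_type)}>"
    if name == "decimal":
        return f"decimal({dtype.precision},{dtype.scale})"
    return name

schema = parse_ddl_schema("id long, price decimal(10,2), tags array<string>")
print(to_ddl(schema))  # struct<id:long,price:decimal(10,2),tags:array<string>>
```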

## Examples

### Basic Schema
```python
from spark_ddl_parser import parse_ddl_schema

schema = parse_ddl_schema("id long, name string, age int")
print(len(schema.fields))  # 3
```

### Arrays and Maps
```python
schema = parse_ddl_schema("""
    tags array<string>,
    scores array<long>,
    metadata map<string,string>,
    counts map<string,int>
""")
```

### Nested Structs
```python
schema = parse_ddl_schema("""
    user struct<
        id:long,
        name:string,
        address:struct<
            street:string,
            city:string
        >
    >
""")
```

### Decimal Types
```python
schema = parse_ddl_schema("price decimal(10,2), rate decimal(5,4)")
```

## Format Support

The parser accepts either a space or a colon between a field name and its type:

```python
# Space separator
schema1 = parse_ddl_schema("id long, name string")

# Colon separator
schema2 = parse_ddl_schema("id:long, name:string")
```
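
Both forms should yield the same field names and types; continuing the snippet above, a quick self-check:

```python
assert [f.name for f in schema1.fields] == [f.name for f in schema2.fields]
assert [f.data_type.type_name for f in schema1.fields] == ["long", "string"]
assert [f.data_type.type_name for f in schema2.fields] == ["long", "string"]
```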

## Error Handling

The parser provides detailed error messages for invalid DDL:

```python
try:
    schema = parse_ddl_schema("id long, name")  # Missing type
except ValueError as e:
    print(e)  # "Invalid field definition: name"
```
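
When validating untrusted schema strings, one common pattern is a small wrapper that converts the exception into a return value. A sketch (the `try_parse` helper is not part of the library):

```python
from spark_ddl_parser import parse_ddl_schema

def try_parse(ddl):
    """Return (schema, None) on success, or (None, error message) on failure."""
    try:
        return parse_ddl_schema(ddl), None
    except ValueError as exc:
        return None, str(exc)

schema, err = try_parse("id long, name")
if err:
    print(f"rejected: {err}")
```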

## Development

```bash
# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=spark_ddl_parser
```

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Related Projects

- [mock-spark](https://github.com/eddiethedean/mock-spark) - Uses this parser for DDL schema support
