# Spark DDL Parser
A zero-dependency Python library for parsing PySpark DDL schema strings into structured Python objects.
## Features
- **Zero Dependencies**: Only uses Python standard library
- **PySpark Compatible**: Parses standard PySpark DDL format
- **Type Safe**: Returns structured dataclasses
- **Comprehensive**: Supports all standard PySpark data types, including nested structs, arrays, and maps
- **Well Tested**: 200+ test cases, including edge cases and performance tests
## Installation
```bash
pip install spark-ddl-parser
```
## Quick Start
```python
from spark_ddl_parser import parse_ddl_schema
# Parse a simple schema
schema = parse_ddl_schema("id long, name string")
print(schema.fields[0].name) # 'id'
print(schema.fields[0].data_type.type_name) # 'long'
print(schema.fields[1].name) # 'name'
print(schema.fields[1].data_type.type_name) # 'string'
```
## Supported Types
### Simple Types
- `string`, `int`, `integer`, `long`, `bigint`
- `double`, `float`, `short`, `smallint`, `byte`, `tinyint`
- `boolean`, `bool`, `date`, `timestamp`, `binary`
### Complex Types
- **Arrays**: `array<string>`, `array<long>`
- **Maps**: `map<string,int>`, `map<string,array<long>>`
- **Structs**: `struct<name:string,age:int>`
- **Decimal**: `decimal(10,2)` (with precision and scale)
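These compose freely, so nested DDL produces correspondingly nested type objects. A short sketch using the `key_type`, `value_type`, and `element_type` attributes documented under the API Reference below:
```python
from spark_ddl_parser import parse_ddl_schema

schema = parse_ddl_schema("lookup map<string,array<long>>")
map_type = schema.fields[0].data_type

print(map_type.type_name)                          # 'map'
print(map_type.key_type.type_name)                 # 'string'
print(map_type.value_type.type_name)               # 'array'
print(map_type.value_type.element_type.type_name)  # 'long'
```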
### Nested Structures
```python
# Nested structs
schema = parse_ddl_schema("""
    id long,
    address struct<
        street:string,
        city:string,
        zip:string
    >,
    tags array<string>,
    metadata map<string,string>
""")
# Access nested fields
address_field = schema.fields[1]
print(address_field.name) # 'address'
print(address_field.data_type.type_name) # 'struct'
```
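Since nested struct types expose the same `fields` list as the top level, you can keep drilling down (continuing from the schema parsed above):
```python
# Walk into the nested struct's own fields (schema from the example above)
street_field = address_field.data_type.fields[0]
print(street_field.name)                 # 'street'
print(street_field.data_type.type_name)  # 'string'
```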
## API Reference
### `parse_ddl_schema(ddl_string: str) -> StructType`
Parse a DDL schema string into a structured type.
**Parameters:**
- `ddl_string` (str): DDL schema string (e.g., "id long, name string")
**Returns:**
- `StructType`: Structured type with fields
**Raises:**
- `ValueError`: If DDL string is invalid
**Example:**
```python
schema = parse_ddl_schema("id long, name string")
```
### Type Objects
#### `StructType`
Represents a struct containing fields.
**Attributes:**
- `type_name` (str): Always "struct"
- `fields` (List[StructField]): List of struct fields
#### `StructField`
Represents a field in a struct.
**Attributes:**
- `name` (str): Field name
- `data_type` (DataType): Field data type
- `nullable` (bool): Whether field is nullable (default: True)
#### `SimpleType`
Represents a simple data type.
**Attributes:**
- `type_name` (str): Type name (e.g., "string", "long", "int")
#### `ArrayType`
Represents an array type.
**Attributes:**
- `type_name` (str): Always "array"
- `element_type` (DataType): Type of array elements
#### `MapType`
Represents a map type.
**Attributes:**
- `type_name` (str): Always "map"
- `key_type` (DataType): Type of map keys
- `value_type` (DataType): Type of map values
#### `DecimalType`
Represents a decimal type.
**Attributes:**
- `type_name` (str): Always "decimal"
- `precision` (int): Decimal precision (default: 10)
- `scale` (int): Decimal scale (default: 0)
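Every type object carries a `type_name` plus the child attributes listed above, so a small recursive function can render a parsed schema back into DDL-like form. A minimal sketch (illustrative only, not part of the library; it assumes nothing beyond the documented attributes):
```python
from spark_ddl_parser import parse_ddl_schema

def format_type(dt) -> str:
    """Render a parsed type object as a DDL-like string (illustrative only)."""
    if dt.type_name == "struct":
        inner = ",".join(f"{f.name}:{format_type(f.data_type)}" for f in dt.fields)
        return f"struct<{inner}>"
    if dt.type_name == "array":
        return f"array<{format_type(dt.element_type)}>"
    if dt.type_name == "map":
        return f"map<{format_type(dt.key_type)},{format_type(dt.value_type)}>"
    if dt.type_name == "decimal":
        return f"decimal({dt.precision},{dt.scale})"
    return dt.type_name  # simple types: 'string', 'long', 'int', ...

schema = parse_ddl_schema("id long, tags array<string>")
print(format_type(schema))  # struct<id:long,tags:array<string>>
```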
## Examples
### Basic Schema
```python
from spark_ddl_parser import parse_ddl_schema
schema = parse_ddl_schema("id long, name string, age int")
print(len(schema.fields)) # 3
```
### Arrays and Maps
```python
schema = parse_ddl_schema("""
    tags array<string>,
    scores array<long>,
    metadata map<string,string>,
    counts map<string,int>
""")
```
### Nested Structs
```python
schema = parse_ddl_schema("""
    user struct<
        id:long,
        name:string,
        address:struct<
            street:string,
            city:string
        >
    >
""")
```
### Decimal Types
```python
schema = parse_ddl_schema("price decimal(10,2), rate decimal(5,4)")
```
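Precision and scale are available on the parsed `DecimalType` (continuing from the schema above):
```python
price = schema.fields[0].data_type
print(price.type_name)  # 'decimal'
print(price.precision)  # 10
print(price.scale)      # 2
```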
## Format Support
The parser accepts either a space or a colon between a field name and its type:
```python
# Space separator
schema1 = parse_ddl_schema("id long, name string")
# Colon separator
schema2 = parse_ddl_schema("id:long, name:string")
```
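Both forms should produce structurally identical schemas; a quick check against the documented attributes:
```python
assert [(f.name, f.data_type.type_name) for f in schema1.fields] == \
       [(f.name, f.data_type.type_name) for f in schema2.fields]
```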
## Error Handling
The parser provides detailed error messages for invalid DDL:
```python
try:
    schema = parse_ddl_schema("id long, name")  # Missing type
except ValueError as e:
    print(e)  # "Invalid field definition: name"
```
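When parsing user-supplied schemas, it can be convenient to wrap this in a helper that returns `None` instead of raising. A hypothetical wrapper (not part of the library):
```python
from spark_ddl_parser import parse_ddl_schema

def try_parse_schema(ddl: str):
    """Return the parsed schema, or None if the DDL string is invalid."""
    try:
        return parse_ddl_schema(ddl)
    except ValueError:
        return None

assert try_parse_schema("id long, name") is None  # missing type
```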
## Development
```bash
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=spark_ddl_parser
```
## License
MIT License - see LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Related Projects
- [mock-spark](https://github.com/eddiethedean/mock-spark) - Uses this parser for DDL schema support