sparkorm

Name: sparkorm
Version: 1.2.18
Home page: https://github.com/asuiu/sparkorm
Summary: SparkORM: Python Spark SQL & DataFrame schema management and basic Object Relational Mapping.
Upload time: 2024-05-15 15:41:02
Author: Andrei Suiu
Requires Python: <4.0.0,>=3.8
License: MIT
# SparkORM ✨

[![PyPI version](https://badge.fury.io/py/SparkORM.svg)](https://badge.fury.io/py/SparkORM)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Python Spark SQL & DataFrame schema management and basic Object Relational Mapping.

## Why use SparkORM

`SparkORM` takes the pain out of working with DataFrame schemas in PySpark.
It makes schema definition more Pythonic. And it's
particularly useful when you're dealing with structured data.

In plain old PySpark, you might find that you write schemas
[like this](https://github.com/asuiu/SparkORM/tree/master/examples/conferences_comparison/plain_schema.py):

```python
from pyspark.sql.types import FloatType, StringType, StructField, StructType

CITY_SCHEMA = StructType()
CITY_NAME_FIELD = "name"
CITY_SCHEMA.add(StructField(CITY_NAME_FIELD, StringType(), False))
CITY_LAT_FIELD = "latitude"
CITY_SCHEMA.add(StructField(CITY_LAT_FIELD, FloatType()))
CITY_LONG_FIELD = "longitude"
CITY_SCHEMA.add(StructField(CITY_LONG_FIELD, FloatType()))

CONFERENCE_SCHEMA = StructType()
CONF_NAME_FIELD = "name"
CONFERENCE_SCHEMA.add(StructField(CONF_NAME_FIELD, StringType(), False))
CONF_CITY_FIELD = "city"
CONFERENCE_SCHEMA.add(StructField(CONF_CITY_FIELD, CITY_SCHEMA))
```

And then plain old PySpark makes you deal with nested fields like this:

```python
dframe.withColumn("city_name", dframe[CONF_CITY_FIELD][CITY_NAME_FIELD])
```

Instead, with `SparkORM`, schemas become a lot
[more literate](https://github.com/asuiu/SparkORM/tree/master/examples/conferences_comparison/sparkorm_schema.py):

```python
from datetime import date

# Field and model classes are imported from the top-level package,
# matching the `from SparkORM import ...` pattern used later in this README.
from SparkORM import Date, Float, String, Struct, TableModel, ViewModel

class City(Struct):
    name = String()
    latitude = Float()
    longitude = Float()
    date_created = Date()

class Conference(TableModel):
    class Meta:
        name = "conference_table"
    name = String(nullable=False)
    city = City()

class LocalConferenceView(ViewModel):
    class Meta:
        name = "city_table"

Conference(spark).create()

Conference(spark).ensure_exists()  # Creates the table; if it already exists, validates the schema and raises an exception if it doesn't match

LocalConferenceView(spark).create_or_replace(select_statement=f"SELECT * FROM {Conference.get_name()}")

Conference(spark).insert([("Bucharest", 44.4268, 26.1025, date(2020, 1, 1))])

Conference(spark).drop()
```

As does dealing with nested fields:

```python
dframe.withColumn("city_name", Conference.city.name.COL)
```

Here's a summary of `SparkORM`'s features.

- ORM-like class-based Spark schema definitions.
- Automated field naming: The attribute name of a field as it appears
  in its `Struct` is (by default) used as its field name. This name can
  be optionally overridden.
- Programmatically reference nested fields in your structs with the
  `PATH` and `COL` special properties. Avoid hand-constructing strings
  (or `Column`s) to reference your nested fields.
- Validate that a DataFrame matches a `SparkORM` schema.
- Reuse and build composite schemas with `inheritance`, `includes`, and
  `implements`.
- Get a human-readable Spark schema representation with `pretty_schema`.
- Create an instance of a schema as a dictionary, with validation of
  the input values.

Read on for documentation on these features.

## Defining a schema

Each Spark atomic type has a counterpart `SparkORM` field:

| PySpark type | `SparkORM` field |
|---|---|
| `ByteType` | `Byte` |
| `IntegerType` | `Integer` |
| `LongType` | `Long` |
| `ShortType` | `Short` |
| `DecimalType` | `Decimal` |
| `DoubleType` | `Double` |
| `FloatType` | `Float` |
| `StringType` | `String` |
| `BinaryType` | `Binary` |
| `BooleanType` | `Boolean` |
| `DateType` | `Date` |
| `TimestampType` | `Timestamp` |

`Array` (counterpart to `ArrayType` in PySpark) allows the definition
of arrays of objects. By creating a subclass of `Struct`, we can
define a custom class that will be converted to a `StructType`.

For
[example](https://github.com/asuiu/SparkORM/tree/master/examples/arrays/arrays.py),
given the `SparkORM` schema definition:

```python
from SparkORM import TableModel, String, Array

class Article(TableModel):
    title = String(nullable=False)
    tags = Array(String(), nullable=False)
    comments = Array(String(nullable=False))
```

Then we can build the equivalent PySpark schema (a `StructType`)
with:

```python
pyspark_struct = Article.get_schema()
```

Pretty printing the schema with the expression
`SparkORM.pretty_schema(pyspark_struct)` will give the following:

```text
StructType([
    StructField('title', StringType(), False),
    StructField('tags',
        ArrayType(StringType(), True),
        False),
    StructField('comments',
        ArrayType(StringType(), False),
        True)])
```

## Features

Many examples of how to use `SparkORM` can be found in
[`examples`](https://github.com/asuiu/SparkORM/tree/master/examples).

### ORM-like class-based schema definitions

The `SparkORM` table schema definition is based on classes: each column is declared as a field instance and accepts a number of arguments that are used to generate the schema.

The following arguments are supported:
- `nullable` - whether the column is nullable (default: `True`)
- `name` - the name of the column (default: the attribute name)
- `comment` - a comment attached to the column (default: `None`)
- `auto_increment` - whether the column is auto-incremented (default: `False`). Note: applicable only to `Long` columns
- `sql_modifiers` - SQL modifiers appended to the column definition (default: `None`)
- `partitioned_by` - whether the table is partitioned by this column (default: `False`)

Examples:
```python
from SparkORM import Date, Float, Long, String, TableModel

class City(TableModel):
    name = String(nullable=False)
    latitude = Long(auto_increment=True) # auto_increment is a special property that will generate a unique value for each row
    longitude = Float(comment="Some comment")
    date_created = Date(sql_modifiers="GENERATED ALWAYS AS (CAST(birthDate AS DATE))") # sql_modifiers will be added to the CREATE clause for the column
    birthDate = Date(nullable=False, partitioned_by=True) # partitioned_by is a special property that will generate a partitioned_by clause for the column
```
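
Given a model like `City` above, the table itself can then be managed with the instance methods shown in the introductory example (a minimal sketch; it assumes an active `spark` session):

```python
# Assumes an active SparkSession named `spark` and the `City` model above.
City(spark).ensure_exists()  # create the table, or validate the schema if it already exists
City(spark).drop()           # drop the table when it is no longer needed
```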

### Automated field naming

By default, field names are inferred from the attribute name in the
struct in which they are declared.

For example, given the struct

```python
class Geolocation(TableModel):
    latitude = Float()
    longitude = Float()
```

the concrete name of the `Geolocation.latitude` field is `latitude`.

Names can also be overridden by explicitly specifying the field name as an
argument to the field:

```python
class Geolocation(TableModel):
    latitude = Float(name="lat")
    longitude = Float(name="lon")
```

which would mean the concrete name of the `Geolocation.latitude` field
is `lat`.
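
The effect can be checked by generating the Spark schema (a minimal sketch, reusing `get_schema` and `pretty_schema` as introduced above; the printed output is approximate):

```python
from SparkORM import pretty_schema

print(pretty_schema(Geolocation.get_schema()))
# StructType([
#     StructField('lat', FloatType(), True),
#     StructField('lon', FloatType(), True)])
```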

### Field paths and nested objects

Referencing fields in nested data can be a chore. `SparkORM` simplifies this
with path referencing.

[For example](https://github.com/asuiu/SparkORM/tree/master/examples/nested_objects/SparkORM_example.py), if we have a
schema with nested objects:

```python
from SparkORM import Array, String, Struct, TableModel

class Address(Struct):
    post_code = String()
    city = String()


class User(Struct):
    username = String(nullable=False)
    address = Address()


class Comment(Struct):
    message = String()
    author = User(nullable=False)


class Article(TableModel):
    title = String(nullable=False)
    author = User(nullable=False)
    comments = Array(Comment())
```

We can use the special `PATH` property to turn a path into a
Spark-understandable string:

```python
author_city_str = Article.author.address.city.PATH
"author.address.city"
```

`COL` is a counterpart to `PATH` that returns a Spark `Column`
object for the path, allowing it to be used in all places where Spark
requires a column.
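
For instance, `COL` can be passed anywhere PySpark expects a `Column` (a minimal sketch; `dframe` is assumed to be a DataFrame conforming to the `Article` schema):

```python
# `dframe` is assumed to be a DataFrame matching the Article schema above.
city_names = dframe.select(Article.author.address.city.COL)
london_articles = dframe.filter(Article.author.address.city.COL == "London")
```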

Function equivalents `path_str`, `path_col`, and `name` are also available.
This table demonstrates the equivalence of the property styles and the function
styles:

| Property style | Function style | Result (both styles are equivalent) |
| --- | --- | --- |
| `Article.author.address.city.PATH` | `SparkORM.path_str(Article.author.address.city)` | `"author.address.city"` |
| `Article.author.address.city.COL` | `SparkORM.path_col(Article.author.address.city)` | `Column` pointing to `author.address.city` |
| `Article.author.address.city.NAME` | `SparkORM.name(Article.author.address.city)` | `"city"` |

For paths that include an array, two approaches are provided:

```python
comment_usernames_str = Article.comments.e.author.username.PATH
# "comments.author.username"

comment_usernames_str = Article.comments.author.username.PATH
# "comments.author.username"

Both give the same result. However, the former (`e`) is more
type-oriented. The `e` attribute corresponds to the array's element
field. Although this looks strange at first, it has the advantage of
being inspectable by IDEs and other tools, allowing goodness such as
IDE auto-completion, automated refactoring, and identifying errors
before runtime.

### Field metadata

Field [metadata](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructField.html) can be specified with the `metadata` argument to a field, which accepts a dictionary
of key-value pairs.

```python
class Article(TableModel):
    title = String(nullable=False,
                   metadata={"description": "The title of the article", "max_length": 100})
```

The metadata can be accessed with the `METADATA` property of the field:

```python
Article.title.METADATA
# {"description": "The title of the article", "max_length": 100}
```
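
Since the linked Spark documentation describes `StructField` metadata, the same values should also be retrievable from the generated PySpark schema (a minimal sketch, assuming the `metadata` argument is propagated to the underlying `StructField`):

```python
pyspark_schema = Article.get_schema()

# StructType supports lookup by field name; StructField exposes `.metadata`.
print(pyspark_schema["title"].metadata)
# {"description": "The title of the article", "max_length": 100}
```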

### DataFrame validation

The struct method `validate_data_frame` verifies whether a given DataFrame's
schema matches the struct.
[For example](https://github.com/asuiu/SparkORM/tree/master/examples/validation/test_validation.py),
if we have our `Article`
struct and a DataFrame we want to ensure adheres to the `Article`
schema:

```python
from SparkORM import String, TableModel

# `spark_session` is assumed to be an active SparkSession
dframe = spark_session.createDataFrame([{"title": "abc"}])

class Article(TableModel):
    title = String()
    body = String()
```

Then we can validate with:

```python
validation_result = Article.validate_data_frame(dframe)
```

`validation_result.is_valid` indicates whether the DataFrame is valid
(`False` in this case), and `validation_result.report` is a
human-readable string describing the differences:

```text
Struct schema...

StructType([
    StructField('title', StringType(), True),
    StructField('body', StringType(), True)])

DataFrame schema...

StructType([
    StructField('title', StringType(), True)])

Diff of struct -> data frame...

  StructType([
-     StructField('title', StringType(), True)])
+     StructField('title', StringType(), True),
+     StructField('body', StringType(), True)])
```

For convenience,

```python
Article.validate_data_frame(dframe).raise_on_invalid()
```

will raise an `InvalidDataFrameError` (see `SparkORM.exceptions`) if the
DataFrame is not valid.
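
A typical guard before writing data might therefore look like this (a minimal sketch; the import path follows the `SparkORM.exceptions` module mentioned above):

```python
from SparkORM.exceptions import InvalidDataFrameError

try:
    Article.validate_data_frame(dframe).raise_on_invalid()
except InvalidDataFrameError as exc:
    # Surface the human-readable schema diff, then fail the job.
    print(exc)
    raise
```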

### Creating an instance of a schema

`SparkORM` simplifies the process of creating an instance of a struct.
You might need to do this, for example, when creating test data, or
when creating an object (a dict or a row) to return from a UDF.

Use `Struct.make_dict(...)` to instantiate a struct as a dictionary.
This has the advantage that the input values will be correctly
validated, and it will convert schema property names into their
underlying field names.

For
[example](https://github.com/asuiu/SparkORM/tree/master/examples/struct_instantiation/instantiate_as_dict.py),
given some simple Structs:

```python
from SparkORM import Integer, String, TableModel

class User(TableModel):
    id = Integer(name="user_id", nullable=False)
    username = String()

class Article(TableModel):
    id = Integer(name="article_id", nullable=False)
    title = String()
    author = User()
    text = String(name="body")
```

Here are a few examples of creating dicts from `Article`:

```python
Article.make_dict(
    id=1001,
    title="The article title",
    author=User.make_dict(
        id=440,
        username="user"
    ),
    text="Lorem ipsum article text lorem ipsum."
)

# generates...
{
    "article_id": 1001,
    "author": {
        "user_id": 440,
        "username": "user"},
    "body": "Lorem ipsum article text lorem ipsum.",
    "title": "The article title"
}
```

```python
Article.make_dict(
    id=1002
)

# generates...
{
    "article_id": 1002,
    "author": None,
    "body": None,
    "title": None
}
```

See
[this example](https://github.com/asuiu/SparkORM/tree/master/examples/conferences_extended/conferences.py)
for an extended example of using `make_dict`.

### Composite schemas

It is sometimes useful to be able to re-use the fields of one struct
in another struct. `SparkORM` provides a few features to enable this:

- _inheritance_: A subclass inherits the fields of a base struct class.
- _includes_: Incorporate fields from another struct.
- _implements_: Enforce that a struct must implement the fields of
  another struct.

See the following examples for a better explanation.

#### Using inheritance

For [example](https://github.com/asuiu/SparkORM/tree/master/examples/composite_schemas/inheritance.py), the following:

```python
from SparkORM import String, TableModel, Timestamp

class BaseEvent(TableModel):
    correlation_id = String(nullable=False)
    event_time = Timestamp(nullable=False)

class RegistrationEvent(BaseEvent):
    user_id = String(nullable=False)
```

will produce the following `RegistrationEvent` schema:

```text
StructType([
    StructField('correlation_id', StringType(), False),
    StructField('event_time', TimestampType(), False),
    StructField('user_id', StringType(), False)])
```

#### Using an `includes` declaration

For [example](https://github.com/asuiu/SparkORM/tree/master/examples/composite_schemas/includes.py), the following:

```python
from SparkORM import String, Struct, TableModel, Timestamp

class EventMetadata(Struct):
    correlation_id = String(nullable=False)
    event_time = Timestamp(nullable=False)

class RegistrationEvent(TableModel):
    class Meta:
        includes = [EventMetadata]
    user_id = String(nullable=False)
```

will produce the `RegistrationEvent` schema:

```text
StructType([
    StructField('user_id', StringType(), False),
    StructField('correlation_id', StringType(), False),
    StructField('event_time', TimestampType(), False)])

#### Using an `implements` declaration

`implements` is similar to `includes`, but does not automatically
incorporate the fields of specified structs. Instead, it is up to
the implementor to ensure that the required fields are declared in
the struct.

Failing to implement a field from an `implements` struct will result in
a `StructImplementationError` error.

[For example](https://github.com/asuiu/SparkORM/tree/master/examples/composite_schemas/implements.py):

```python
from SparkORM import String, TableModel, Timestamp

class LogEntryMetadata(TableModel):
    logged_at = Timestamp(nullable=False)

class PageViewLogEntry(TableModel):
    class Meta:
        implements = [LogEntryMetadata]
    page_id = String(nullable=False)

# the above class declaration will fail with the following StructImplementationError error:
#   Struct 'PageViewLogEntry' does not implement field 'logged_at' required by struct 'LogEntryMetadata'
```
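
Declaring the required field explicitly satisfies the contract (a minimal sketch of the corrected model):

```python
class PageViewLogEntry(TableModel):
    class Meta:
        implements = [LogEntryMetadata]
    logged_at = Timestamp(nullable=False)  # declared explicitly, as required by LogEntryMetadata
    page_id = String(nullable=False)
```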


### Prettified Spark schema strings

Spark's stringified schema representation isn't very user-friendly, particularly for large schemas:


```text
StructType([StructField('name', StringType(), False), StructField('city', StructType([StructField('name', StringType(), False), StructField('latitude', FloatType(), True), StructField('longitude', FloatType(), True)]), True)])
```

The function `pretty_schema` will return something more useful:

```text
StructType([
    StructField('name', StringType(), False),
    StructField('city',
        StructType([
            StructField('name', StringType(), False),
            StructField('latitude', FloatType(), True),
            StructField('longitude', FloatType(), True)]),
        True)])
```
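
Producing output like the above is a one-liner (a minimal sketch; `pretty_schema` is assumed to be importable from the top-level package, matching the `SparkORM.pretty_schema(...)` usage earlier):

```python
from pyspark.sql.types import FloatType, StringType, StructField, StructType

from SparkORM import pretty_schema

schema = StructType([
    StructField("name", StringType(), False),
    StructField("latitude", FloatType(), True),
])
print(pretty_schema(schema))
```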

### Merge two Spark `StructType` types

It can be useful to build a composite schema from two `StructType`s. SparkORM provides a
`merge_schemas` function to do this.

[For example](https://github.com/asuiu/SparkORM/tree/master/examples/merge_struct_types/merge_struct_types.py):

```python
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# merge_schemas is provided by SparkORM (top-level import path assumed)
from SparkORM import merge_schemas

schema_a = StructType([
    StructField("message", StringType()),
    StructField("author", ArrayType(
        StructType([
            StructField("name", StringType())
        ])
    ))
])

schema_b = StructType([
    StructField("author", ArrayType(
        StructType([
            StructField("address", StringType())
        ])
    ))
])

merged_schema = merge_schemas(schema_a, schema_b)
```

results in a `merged_schema` that looks like:

```text
StructType([
    StructField('message', StringType(), True),
    StructField('author',
        ArrayType(StructType([
            StructField('name', StringType(), True),
            StructField('address', StringType(), True)]), True),
        True)])
```

## Contributing

Contributions are very welcome. Developers who'd like to contribute to
this project should refer to [CONTRIBUTING.md](./CONTRIBUTING.md).

## References

Note: this library is a fork of https://github.com/mattjw/sparkql


            
