pyspark-types


Name: pyspark-types
Version: 0.0.3
Summary: `pyspark_types` is a Python library that provides a simple way to map Python dataclasses to PySpark StructTypes
Upload time: 2024-02-11 18:25:03
Author: Dan
Requires Python: >=3.10,<4.0
Requirements: none recorded

# PySpark Types

`pyspark_types` is a Python library that provides a simple way to map Python dataclasses and Pydantic models to PySpark StructTypes.

## Usage

### Pydantic
`PySparkBaseModel` is a base class for Pydantic models that provides methods for converting between PySpark Rows and Pydantic models.

Here's an example of a Pydantic model that will be used to create a PySpark DataFrame:

```python
from pyspark_types.auxiliary import BoundDecimal
from pyspark_types.pydantic import PySparkBaseModel


class Person(PySparkBaseModel):
    name: str
    age: int
    addresses: dict[str, str]
    salary: BoundDecimal

```

To create a PySpark DataFrame from a list of `Person` Pydantic models, use the `PySparkBaseModel.create_spark_dataframe()` method:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a list of Pydantic models
data = [
    Person(
        name="Alice",
        age=25,
        addresses={"home": "123 Main St", "work": "456 Pine St"},
        salary=BoundDecimal("5000.00", precision=10, scale=2),
    ),
    Person(
        name="Bob",
        age=30,
        addresses={"home": "789 Elm St", "work": "321 Oak St"},
        salary=BoundDecimal("6000.50", precision=10, scale=2),
    ),
]

# create a PySpark DataFrame from the list of Pydantic models
df = Person.create_spark_dataframe(data, spark)

# show the contents of the DataFrame
df.show()

```

Output:
```text
+---+-----+--------------------+-------+
|age| name|           addresses| salary|
+---+-----+--------------------+-------+
| 25|Alice|[home -> 123 Main...|5000.00|
| 30|  Bob|[home -> 789 Elm ...|6000.50|
+---+-----+--------------------+-------+
```

The `PySparkBaseModel.create_spark_dataframe()` method converts the list of Pydantic models to a list of dictionaries, then creates a PySpark DataFrame from those dictionaries and the schema generated from the Pydantic model.

You can also generate a schema from a Pydantic model by calling the `PySparkBaseModel.schema()` method:
```python
schema = PySparkBaseModel.schema(Person)

```

This creates a PySpark schema for the Person Pydantic model.

Note that if you have custom types, such as `BoundDecimal`, you will need to add support for them in `PySparkBaseModel`. For example, you can modify the `PySparkBaseModel.dict()` method to extract `BoundDecimal` values when mapping to `DecimalType`.
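
To illustrate what "extracting" a custom decimal value could involve, here is a minimal standard-library sketch. It is not the library's code: `FixedDecimal`, `to_plain()`, and `spark_type_name()` are hypothetical names used only for this example, and the rounding mode is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


class FixedDecimal:
    """Hypothetical wrapper: a decimal value that carries precision and scale."""

    def __init__(self, value: str, precision: int = 10, scale: int = 2):
        self.precision = precision
        self.scale = scale
        # Round to `scale` fractional digits (scale=2 quantizes to 0.01).
        quantum = Decimal(1).scaleb(-scale)
        self.value = Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)
        # Reject values with more total digits than `precision` allows.
        if len(self.value.as_tuple().digits) > precision:
            raise ValueError(f"{value} does not fit DECIMAL({precision},{scale})")


def spark_type_name(d: FixedDecimal) -> str:
    """Spark SQL type name matching the wrapper's precision and scale."""
    return f"DECIMAL({d.precision},{d.scale})"


def to_plain(record: dict) -> dict:
    """Replace wrapper objects with plain Decimal values before building rows."""
    return {
        key: (val.value if isinstance(val, FixedDecimal) else val)
        for key, val in record.items()
    }


row = to_plain({"name": "Bob", "salary": FixedDecimal("6000.5")})
print(row["salary"])  # the plain Decimal, quantized to two fractional digits
```

The point of the sketch is the separation of concerns: the wrapper decides the Spark type (precision and scale), while the row itself only ever holds plain `Decimal` values.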
### Dataclasses

To use `pyspark_types` with dataclasses, first define a Python dataclass with the fields you want to map to PySpark. For example:

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    is_student: bool

```

To map this dataclass to a PySpark StructType, you can use the `map_dataclass_to_struct()` function:

```python
from pyspark_types import map_dataclass_to_struct

person_struct = map_dataclass_to_struct(Person)
```

This returns a PySpark `StructType` that corresponds to the `Person` dataclass.
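
For intuition about what such a mapping involves, the following sketch walks a dataclass's fields and emits a Spark DDL schema string using only the standard library. It is an illustration under stated assumptions, not how `map_dataclass_to_struct()` is implemented: `ddl_type()` and `dataclass_to_ddl()` are hypothetical helpers, and the scalar mapping (for example `int` to `BIGINT`) is an assumed choice.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import get_args, get_origin, get_type_hints

# Assumed mapping from Python scalar types to Spark SQL type names.
_SCALARS = {str: "STRING", int: "BIGINT", float: "DOUBLE", bool: "BOOLEAN"}


def ddl_type(py_type) -> str:
    """Translate a Python type annotation into a Spark SQL type name."""
    if py_type in _SCALARS:
        return _SCALARS[py_type]
    origin, args = get_origin(py_type), get_args(py_type)
    if origin is list:
        return f"ARRAY<{ddl_type(args[0])}>"
    if origin is dict:
        return f"MAP<{ddl_type(args[0])}, {ddl_type(args[1])}>"
    if is_dataclass(py_type):
        # Nested dataclasses become nested structs.
        inner = ", ".join(f"{n}: {t}" for n, t in _columns(py_type))
        return f"STRUCT<{inner}>"
    raise TypeError(f"unsupported annotation: {py_type!r}")


def _columns(cls):
    """Return (field name, Spark type name) pairs in declaration order."""
    hints = get_type_hints(cls)
    return [(f.name, ddl_type(hints[f.name])) for f in fields(cls)]


def dataclass_to_ddl(cls) -> str:
    """Build a DDL-formatted schema string for a dataclass."""
    return ", ".join(f"{name} {type_name}" for name, type_name in _columns(cls))


@dataclass
class Person:
    name: str
    age: int
    is_student: bool


print(dataclass_to_ddl(Person))  # name STRING, age BIGINT, is_student BOOLEAN
```

A DDL-formatted string like this can be passed as the `schema` argument of `spark.createDataFrame()`, which makes it a convenient way to check the mapping logic without constructing `StructType` objects by hand.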

You can also use the `apply_nullability()` function to set the nullable flag for a given PySpark `DataType`:

```python
from pyspark.sql.types import StringType
from pyspark_types import apply_nullability

nullable_string_type = apply_nullability(StringType(), True)
```

This returns a new PySpark `StringType` with the nullable flag set to `True`.


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "pyspark-types",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "Dan",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/6f/04/6288547fd30d1931f79001f6bbf971ad54134581924f97b4050c1f929eb9/pyspark_types-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "`pyspark_types` is a Python library that provides a simple way to map Python dataclasses to PySpark StructTypes",
    "version": "0.0.3",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2dfa7c46646d61732420b9faeeeaa5eb509c49a47e0cf4aba7760d00ecf1473f",
                "md5": "72a8444206f8511600aa944d45f722ea",
                "sha256": "a122c3c614b042749afc07671325c6f24e48943f5a02f1592bce46e909f1ddca"
            },
            "downloads": -1,
            "filename": "pyspark_types-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "72a8444206f8511600aa944d45f722ea",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<4.0",
            "size": 6388,
            "upload_time": "2024-02-11T18:25:01",
            "upload_time_iso_8601": "2024-02-11T18:25:01.440882Z",
            "url": "https://files.pythonhosted.org/packages/2d/fa/7c46646d61732420b9faeeeaa5eb509c49a47e0cf4aba7760d00ecf1473f/pyspark_types-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6f046288547fd30d1931f79001f6bbf971ad54134581924f97b4050c1f929eb9",
                "md5": "ff3a50b1296ae532176f093453347149",
                "sha256": "dbb4b68e30e5850b8a4dfa8c0350d7162080636645552c1ca2102da4772ee6fe"
            },
            "downloads": -1,
            "filename": "pyspark_types-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "ff3a50b1296ae532176f093453347149",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<4.0",
            "size": 4846,
            "upload_time": "2024-02-11T18:25:03",
            "upload_time_iso_8601": "2024-02-11T18:25:03.209743Z",
            "url": "https://files.pythonhosted.org/packages/6f/04/6288547fd30d1931f79001f6bbf971ad54134581924f97b4050c1f929eb9/pyspark_types-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-11 18:25:03",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pyspark-types"
}
        