sparkpolars

Name: sparkpolars
Version: 0.0.6
Summary: Conversion between PySpark and Polars DataFrames
Author: Skander Boudawara
Requires-Python: >=3.10
License: MIT
Keywords: pyspark, polars, conversion, spark-to-polars, polars-to-spark
Uploaded: 2025-02-12 00:23:10
# sparkpolars

**sparkpolars** is a lightweight library for seamless conversion between Apache Spark and Polars, with no unnecessary dependencies: `pandas` and `pyarrow` are needed only when you explicitly request the conversion modes that use them.

## Installation

```sh
pip install sparkpolars
# or
conda install skandev::sparkpolars
```

## Requirements

- **Python** ≥ 3.10
- **Apache Spark** ≥ 3.3.0 with **PySpark** (must be pre-installed; not pulled in automatically)
- **Polars** ≥ 1.0 (must be pre-installed; not pulled in automatically)

## Why Does This Library Exist?

### The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

```python
# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark
```

### The Solution

**sparkpolars** skips the intermediate step entirely: it relies on Spark's native `.collect()` and its own schema interpretation, so neither `pandas` nor `pyarrow` is required.
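
For intuition, here is a minimal sketch of what a dependency-free, native conversion can look like. This is an illustration of the approach only, not the library's actual implementation:

```python
import polars as pl

def spark_to_polars_native(spark_df):
    """Sketch: Spark -> Polars with no pandas/pyarrow in between."""
    rows = spark_df.collect()      # list of pyspark.sql.Row on the driver
    columns = spark_df.columns     # preserve column order
    data = {c: [row[c] for row in rows] for c in columns}
    # A real implementation would also translate the Spark schema into
    # Polars dtypes instead of letting Polars infer them.
    return pl.DataFrame(data)
```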

### Key Benefits

- 🚀 **No extra dependencies** – No need for Pandas or PyArrow  
- ✅ **Reliable handling of complex types** – Provides better consistency for `MapType`, `StructType`, and nested `ArrayType`, where existing conversion methods can be unreliable  

## Features

- Convert a Spark DataFrame to a Polars DataFrame or LazyFrame, and back
- Ensures schema consistency: preserves `LongType` as `Int64` instead of mistakenly downcasting to `Int32` (see the sketch after this list)
- Three conversion modes: `NATIVE`, `ARROW`, `PANDAS`
- `NATIVE` mode properly converts `MapType`, `StructType`, and nested `ArrayType`
- `ARROW` and `PANDAS` modes may have limitations with complex types
- Configurable conversion settings for Polars `list(struct)` to Spark `MapType`
- Timezone and time unit customization for Polars `Datetime`
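
As a quick check of the `Int64` point above, a value outside the 32-bit range should survive the round trip unchanged (a sketch; it assumes importing `sparkpolars` is what registers `toPolars()`):

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("types").getOrCreate()

df = spark.createDataFrame([(2**40,)], ["big"])  # inferred as LongType
polars_df = df.toPolars()
assert polars_df.schema["big"] == pl.Int64  # not downcast to Int32
```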

## Usage

### 1. From Spark to Polars DataFrame

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()
```

### 2. From Spark to Polars LazyFrame

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)
```

### 3. From Polars DataFrame to Spark

```python
import sparkpolars  # noqa: F401  assumed to register df.to_spark()
from pyspark.sql import SparkSession
from polars import DataFrame

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # A LazyFrame works as well

spark_df = df.to_spark(spark=spark)
# or 
spark_df = df.to_spark()  # falls back to the active SparkSession
```
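
When no session is passed, the lookup presumably goes through `SparkSession.getActiveSession()`; you can check for one yourself before converting:

```python
from pyspark.sql import SparkSession

# Hypothetical guard: fail early with a clear message instead of
# relying on the library's own error.
if SparkSession.getActiveSession() is None:
    raise RuntimeError("No active Spark session; pass one via to_spark(spark=...)")
```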

### 4. Using a Specific Mode

```python
from sparkpolars import ModeMethod

spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)
```
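
Because `PANDAS` and `ARROW` modes rely on optional dependencies, one defensive pattern is to fall back to `NATIVE` when they are missing. This assumes a missing dependency surfaces as an `ImportError`:

```python
from sparkpolars import ModeMethod

try:
    polars_df = df.toPolars(mode=ModeMethod.ARROW)
except ImportError:  # assumption: raised when pyarrow is not installed
    polars_df = df.toPolars(mode=ModeMethod.NATIVE)  # needs no extra deps
```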

### 5. Using a Config

```python
from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)

polars_df = df.toPolars(config=conf)
```
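
For example, with `time_unit="ms"` a Spark `TimestampType` column should arrive in Polars as a millisecond-resolution `Datetime` instead of the default `"us"` (a sketch under that assumption):

```python
import datetime
from pyspark.sql import SparkSession
from sparkpolars import Config

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(datetime.datetime(2025, 1, 1),)], ["ts"])

polars_df = df.toPolars(config=Config(time_unit="ms"))
# Expected dtype for "ts": Datetime(time_unit="ms") rather than "us"
```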

## Known Limitations

### JVM Timezone Discrepancy

Timestamps are collected through the JVM, whose default timezone can differ from Spark’s `spark.sql.session.timeZone` setting. If timestamps look shifted after conversion, check the JVM timezone first.
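
One way to avoid surprises is to pin both timezones explicitly when building the session (a sketch; adapt to your deployment):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example")
    # Align Spark's session timezone with the JVM default timezone
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
    .getOrCreate()
)
```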

### Memory Constraints

Collecting a large dataset onto the driver can exceed available memory and fail; this applies equally to the pandas- and Arrow-based paths.
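
As with any collect-based conversion, shrink the data on the Spark side before converting. For example (hypothetical column names):

```python
# Filter, project, and cap rows in Spark before collecting into Polars
small = df.filter(df.year == 2024).select("a", "b").limit(1_000_000)
polars_df = small.toPolars()
```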

### Handling `MapType`

#### From Spark to Polars
If you have in Spark:

Type: `StructField("example", MapType(StringType(), IntegerType()))`

Data:  `{"a": 1, "b": 2}`

Then it will become in Polars:

Type: `{"example": List(Struct("key": String, "value": Int32))}`

Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`
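
A concrete sketch of this conversion (it assumes importing `sparkpolars` registers `toPolars()`):

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    IntegerType, MapType, StringType, StructField, StructType,
)

spark = SparkSession.builder.appName("maps").getOrCreate()
schema = StructType([StructField("example", MapType(StringType(), IntegerType()))])
df = spark.createDataFrame([({"a": 1, "b": 2},)], schema)

polars_df = df.toPolars()
# "example" arrives as a List(Struct) column with "key"/"value" fields
```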

#### From Polars to Spark
If you have in Polars:

Type: `{"example": List(Struct("key": String, "value": Int32))}`

Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`

Then, without any config (default behavior), it will become in Spark:

Type: `StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))`

Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`

If you want this column converted to `MapType` instead:

```python
from sparkpolars import Config

conf = Config(map_elements=["example"])
spark_df = df.to_spark(config=conf)
```

Then it becomes in Spark:

Type: `StructField("example", MapType(StringType(), IntegerType()))`

Data: `{"a": 1, "b": 2}`
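
Putting it together, an end-to-end sketch (it assumes an active Spark session and that importing `sparkpolars` registers `to_spark()`):

```python
import sparkpolars  # noqa: F401  assumed to register df.to_spark()
import polars as pl
from sparkpolars import Config

df = pl.DataFrame(
    {"example": [[{"key": "a", "value": 1}, {"key": "b", "value": 2}]]},
    schema={"example": pl.List(pl.Struct({"key": pl.String, "value": pl.Int32}))},
)
spark_df = df.to_spark(config=Config(map_elements=["example"]))
# Expected Spark type: MapType(StringType(), IntegerType())
```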

## License
MIT License (full text in the package metadata).

## Contribution
- pending

            
