sparkpolars

Name: sparkpolars
Version: 0.0.6
Summary: Conversion between PySpark and Polars DataFrames
Author: Skander Boudawara
Requires-Python: >=3.10
License: MIT
Keywords: pyspark, polars, conversion, spark-to-polars, polars-to-spark
Uploaded: 2025-02-12 00:23:10
# sparkpolars

**sparkpolars** is a lightweight library for seamless conversion between Apache Spark and Polars, with no unnecessary dependencies: `pandas` and `pyarrow` are needed only when you explicitly request the conversion modes that use them.

## Installation

```sh
pip install sparkpolars
# or
conda install skandev::sparkpolars
```

## Requirements

- **Python** ≥ 3.10
- **Apache Spark** ≥ 3.3.0 with **PySpark** (must be pre-installed; not pulled in automatically)
- **Polars** ≥ 1.0 (must be pre-installed; not pulled in automatically)

## Why Does This Library Exist?

### The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

```python
# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark
```

### The Solution

**sparkpolars** skips the intermediate step entirely: it relies on Spark's native `.collect()` and its own schema interpretation, so neither `pandas` nor `pyarrow` is required.
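
For intuition, here is a minimal sketch of what a dependency-free, native conversion can look like. This is an illustration of the approach only, not the library's actual implementation:

```python
import polars as pl

def spark_to_polars_native(spark_df):
    """Sketch: Spark -> Polars with no pandas/pyarrow in between."""
    rows = spark_df.collect()      # list of pyspark.sql.Row on the driver
    columns = spark_df.columns     # preserve column order
    data = {c: [row[c] for row in rows] for c in columns}
    # A real implementation would also translate the Spark schema into
    # Polars dtypes instead of letting Polars infer them.
    return pl.DataFrame(data)
```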

### Key Benefits

- 🚀 **No extra dependencies** – No need for Pandas or PyArrow  
- ✅ **Reliable handling of complex types** – Provides better consistency for `MapType`, `StructType`, and nested `ArrayType`, where existing conversion methods can be unreliable  

## Features

- Convert a Spark DataFrame to a Polars DataFrame or LazyFrame, and back
- Ensures schema consistency: preserves `LongType` as `Int64` instead of mistakenly downcasting to `Int32` (see the sketch after this list)
- Three conversion modes: `NATIVE`, `ARROW`, `PANDAS`
- `NATIVE` mode properly converts `MapType`, `StructType`, and nested `ArrayType`
- `ARROW` and `PANDAS` modes may have limitations with complex types
- Configurable conversion settings for Polars `list(struct)` to Spark `MapType`
- Timezone and time unit customization for Polars `Datetime`
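
As a quick check of the `Int64` point above, a value outside the 32-bit range should survive the round trip unchanged (a sketch; it assumes importing `sparkpolars` is what registers `toPolars()`):

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("types").getOrCreate()

df = spark.createDataFrame([(2**40,)], ["big"])  # inferred as LongType
polars_df = df.toPolars()
assert polars_df.schema["big"] == pl.Int64  # not downcast to Int32
```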

## Usage

### 1. From Spark to Polars DataFrame

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()
```

### 2. From Spark to Polars LazyFrame

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)
```

### 3. From Polars DataFrame to Spark

```python
import sparkpolars  # noqa: F401  assumed to register df.to_spark()
from pyspark.sql import SparkSession
from polars import DataFrame

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # A LazyFrame works as well

spark_df = df.to_spark(spark=spark)
# or 
spark_df = df.to_spark()  # falls back to the active SparkSession
```
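
When no session is passed, the lookup presumably goes through `SparkSession.getActiveSession()`; you can check for one yourself before converting:

```python
from pyspark.sql import SparkSession

# Hypothetical guard: fail early with a clear message instead of
# relying on the library's own error.
if SparkSession.getActiveSession() is None:
    raise RuntimeError("No active Spark session; pass one via to_spark(spark=...)")
```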

### 4. Using a Specific Mode

```python
from sparkpolars import ModeMethod

spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)
```
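
Because `PANDAS` and `ARROW` modes rely on optional dependencies, one defensive pattern is to fall back to `NATIVE` when they are missing. This assumes a missing dependency surfaces as an `ImportError`:

```python
from sparkpolars import ModeMethod

try:
    polars_df = df.toPolars(mode=ModeMethod.ARROW)
except ImportError:  # assumption: raised when pyarrow is not installed
    polars_df = df.toPolars(mode=ModeMethod.NATIVE)  # needs no extra deps
```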

### 5. Using a Config

```python
from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)

polars_df = df.toPolars(config=conf)
```
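
For example, with `time_unit="ms"` a Spark `TimestampType` column should arrive in Polars as a millisecond-resolution `Datetime` instead of the default `"us"` (a sketch under that assumption):

```python
import datetime
from pyspark.sql import SparkSession
from sparkpolars import Config

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(datetime.datetime(2025, 1, 1),)], ["ts"])

polars_df = df.toPolars(config=Config(time_unit="ms"))
# Expected dtype for "ts": Datetime(time_unit="ms") rather than "us"
```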

## Known Limitations

### JVM Timezone Discrepancy

Timestamps are collected through the JVM, whose default timezone can differ from Spark’s `spark.sql.session.timeZone` setting. If timestamps look shifted after conversion, check the JVM timezone first.
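
One way to avoid surprises is to pin both timezones explicitly when building the session (a sketch; adapt to your deployment):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example")
    # Align Spark's session timezone with the JVM default timezone
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
    .getOrCreate()
)
```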

### Memory Constraints

Collecting a large dataset onto the driver can exceed available memory and fail; this applies equally to the pandas- and Arrow-based paths.
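
As with any collect-based conversion, shrink the data on the Spark side before converting. For example (hypothetical column names):

```python
# Filter, project, and cap rows in Spark before collecting into Polars
small = df.filter(df.year == 2024).select("a", "b").limit(1_000_000)
polars_df = small.toPolars()
```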

### Handling `MapType`

#### From Spark to Polars
If you have in Spark:

Type: `StructField("example", MapType(StringType(), IntegerType()))`

Data:  `{"a": 1, "b": 2}`

Then it will become in Polars:

Type: `{"example": List(Struct("key": String, "value": Int32))}`

Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`
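
A concrete sketch of this conversion (it assumes importing `sparkpolars` registers `toPolars()`):

```python
import sparkpolars  # noqa: F401  assumed to register df.toPolars()
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    IntegerType, MapType, StringType, StructField, StructType,
)

spark = SparkSession.builder.appName("maps").getOrCreate()
schema = StructType([StructField("example", MapType(StringType(), IntegerType()))])
df = spark.createDataFrame([({"a": 1, "b": 2},)], schema)

polars_df = df.toPolars()
# "example" arrives as a List(Struct) column with "key"/"value" fields
```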

#### From Polars to Spark
If you have in Polars:

Type: `{"example": List(Struct("key": String, "value": Int32))}`

Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`

Then, without any config (default behavior), it will become in Spark:

Type: `StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))`

Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`

If you want this column converted to `MapType` instead:

```python
from sparkpolars import Config

conf = Config(map_elements=["example"])
spark_df = df.to_spark(config=conf)
```

Then it becomes in Spark:

Type: `StructField("example", MapType(StringType(), IntegerType()))`

Data: `{"a": 1, "b": 2}`
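
Putting it together, an end-to-end sketch (it assumes an active Spark session and that importing `sparkpolars` registers `to_spark()`):

```python
import sparkpolars  # noqa: F401  assumed to register df.to_spark()
import polars as pl
from sparkpolars import Config

df = pl.DataFrame(
    {"example": [[{"key": "a", "value": 1}, {"key": "b", "value": 2}]]},
    schema={"example": pl.List(pl.Struct({"key": pl.String, "value": pl.Int32}))},
)
spark_df = df.to_spark(config=Config(map_elements=["example"]))
# Expected Spark type: MapType(StringType(), IntegerType())
```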

## License
MIT License (full text in the package metadata).

## Contribution
- pending

            
