Name | sparkpolars |
Version | 0.0.6 |
home_page | None |
Summary | Conversion between PySpark and Polars DataFrames |
upload_time | 2025-02-12 00:23:10 |
maintainer | None |
docs_url | None |
author | Skander Boudawara |
requires_python | >=3.10 |
license | MIT License
Copyright (c) 2025 Skander Boudawara
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
keywords |
pyspark
polars
conversion
spark-to-polars
polars-to-spark
|
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# sparkpolars
**sparkpolars** is a lightweight library for seamless conversions between Apache Spark and Polars without unnecessary dependencies: extra packages are required only when you explicitly request a conversion mode that needs them.
## Installation
```sh
pip install sparkpolars
# or
conda install skandev::sparkpolars
```
## Requirements
- **Python** ≥ 3.10
- **Apache Spark** ≥ 3.3.0 (must be pre-installed)
- **Polars** ≥ 1.0 (must be pre-installed)
- **PySpark** must also be installed to use this library
## Why Does This Library Exist?
### The Problem
Conversions between Spark and Polars typically route through an intermediate Pandas step:
```python
# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark
```
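For reference, the conventional route looks something like this sketch (plain PySpark and Polars APIs, assuming an existing `spark` session and Spark DataFrame `spark_df`; note that it drags in `pandas`, and usually `pyarrow`, as dependencies):
```python
import polars as pl

# Spark -> Pandas -> Polars: materializes an intermediate Pandas DataFrame
pandas_df = spark_df.toPandas()
polars_df = pl.from_pandas(pandas_df)

# Polars -> Pandas -> Spark: the same detour in the other direction
spark_df_again = spark.createDataFrame(polars_df.to_pandas())
```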
### The Solution
**sparkpolars** removes the need for intermediary dependencies like `pandas` and `pyarrow` by relying on Spark's native `.collect()` and direct schema interpretation.
### Key Benefits
- 🚀 **No extra dependencies** – No need for Pandas or PyArrow
- ✅ **Reliable handling of complex types** – Provides better consistency for `MapType`, `StructType`, and nested `ArrayType`, where existing conversion methods can be unreliable
## Features
- Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
- Ensures schema consistency: preserves `LongType` as `Int64` instead of mistakenly converting to `Int32`
- Three conversion modes: `NATIVE`, `ARROW`, `PANDAS`
- `NATIVE` mode properly converts `MapType`, `StructType`, and nested `ArrayType`
- `ARROW` and `PANDAS` modes may have limitations with complex types
- Configurable conversion settings for Polars `list(struct)` to Spark `MapType`
- Timezone and time unit customization for Polars `Datetime`
## Usage
The examples below assume `sparkpolars` has been imported (e.g. `import sparkpolars`), which is presumably what attaches the `toPolars` and `to_spark` methods to Spark and Polars frames.
### 1. From Spark to Polars DataFrame
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
polars_df = df.toPolars()
```
### 2. From Spark to Polars LazyFrame
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
polars_df = df.toPolars(lazy=True)
```
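The result is a regular Polars LazyFrame, so nothing is materialized until you collect it; standard Polars usage applies:
```python
import polars as pl

# polars_df from the example above is a LazyFrame; the plan executes on .collect()
result = polars_df.filter(pl.col("a") > 0).collect()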
### 3. From Polars DataFrame to Spark
```python
from pyspark.sql import SparkSession
from polars import DataFrame
spark = SparkSession.builder.appName("example").getOrCreate()
df = DataFrame({"a": [1], "b": [2]})  # a LazyFrame works as well
spark_df = df.to_spark(spark=spark)
# or
spark_df = df.to_spark()  # falls back to the active SparkSession
```
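A defensive variant of the no-argument call, assuming `to_spark()` resolves the session via PySpark's `SparkSession.getActiveSession()`:
```python
from pyspark.sql import SparkSession

# getActiveSession() returns None when no session is active,
# so create one first if needed
if SparkSession.getActiveSession() is None:
    spark = SparkSession.builder.appName("example").getOrCreate()

spark_df = df.to_spark()
```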
### 4. Using Specific Mode
```python
from sparkpolars import ModeMethod
spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)
polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)
```
### 5. Using Config
```python
from sparkpolars import Config
conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)
polars_df = df.toPolars(config=conf)
```
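For example, `time_unit` controls the precision of the resulting Polars `Datetime` columns. A minimal sketch (the expected dtype follows the config documented above):
```python
from datetime import datetime

from pyspark.sql import SparkSession
from sparkpolars import Config

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(datetime(2025, 1, 1, 12, 0),)], ["ts"])

polars_df = df.toPolars(config=Config(time_unit="ms"))
print(polars_df.schema)  # "ts" should be Datetime with time_unit="ms"
```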
## Known Limitations
### JVM Timezone Discrepancy
Timestamps are collected through the JVM, whose default timezone may differ from Spark's session timezone setting (`spark.sql.session.timeZone`). If converted timestamps look shifted, check the JVM timezone.
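One way to compare the two settings (standard PySpark configuration plus a py4j call into the JVM; `_jvm` is internal API):
```python
# Spark's session timezone, used for SQL timestamp semantics
print(spark.conf.get("spark.sql.session.timeZone"))

# The JVM default timezone, which collected timestamps pass through
print(spark.sparkContext._jvm.java.util.TimeZone.getDefault().getID())
```
If the two disagree, the JVM default can be pinned at launch, e.g. with `--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC`.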
### Memory Constraints
Collecting large datasets into driver memory can exceed what is available and cause failures, just as with the Pandas and Arrow conversion paths.
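A standard mitigation is to cut the data down on the Spark side before converting (plain PySpark; the column names and row limit are illustrative):
```python
# Only collect the columns and rows you actually need
small_df = df.select("a", "b").limit(100_000)
polars_df = small_df.toPolars()
```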
### Handling `MapType`
#### From Spark to Polars
If you have in Spark:
Type: `StructField("example", MapType(StringType(), IntegerType()))`
Data: `{"a": 1, "b": 2}`
Then it will become in Polars:
Type: `{"example": List(Struct("key": String, "value": Int32))}`
Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`
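A sketch reproducing this mapping end to end (standard PySpark types; the expected dtype is the one described above):
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

spark = SparkSession.builder.appName("example").getOrCreate()

schema = StructType([StructField("example", MapType(StringType(), IntegerType()))])
df = spark.createDataFrame([({"a": 1, "b": 2},)], schema)

polars_df = df.toPolars()
print(polars_df.schema)  # example: List(Struct({"key": String, "value": Int32}))
```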
#### From Polars to Spark
If you have in Polars:
Type: `{"example": List(Struct("key": String, "value": Int32))}`
Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`
Then it will become in Spark without specifying any config (default behavior):
Type: `StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))`
Data: `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`
If you want this column converted to a `MapType` instead, list it in the config:
```python
from sparkpolars import Config

conf = Config(
    map_elements=["example"],
)
```
Then it will become in Spark:
Type: `StructField("example", MapType(StringType(), IntegerType()))`
Data: `{"a": 1, "b": 2}`
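Putting it together, a round-trip sketch (assuming an existing `spark` session; the `example` column matches the snippet above, though Polars infers `Int64` for the values here, so the map value type may come back as `LongType` rather than `IntegerType`):
```python
from polars import DataFrame
from sparkpolars import Config

df = DataFrame({"example": [[{"key": "a", "value": 1}, {"key": "b", "value": 2}]]})

spark_df = df.to_spark(spark=spark, config=Config(map_elements=["example"]))
spark_df.printSchema()  # "example" should come back as a MapType, not ArrayType(StructType)
```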
## License
- MIT License (see the full license text above)
## Contribution
- pending
Raw data
{
"_id": null,
"home_page": null,
"name": "sparkpolars",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "pyspark, polars, conversion, spark-to-polars, polars-to-spark",
"author": "Skander Boudawara",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/f4/36/1cf0131ce921b825e5c320ef71d18ac43ed1d6349b1cbbf1a85baf8f6ab8/sparkpolars-0.0.6.tar.gz",
"platform": null,
"description": "# sparkpolars\n\n**sparkpolars** is a lightweight library designed for seamless conversions between Apache Spark and Polars without unnecessary dependencies. (Dependencies are only required when explicitly requested.)\n\n## Installation\n\n```sh\npip install sparkpolars\n# or\nconda install skandev::sparkpolars\n```\n\n## Requirements\n\n- **Python** \u2265 3.10 \n- **Apache Spark** \u2265 3.3.0 (must be pre-installed) \n- **Polars** \u2265 1.0 (must be pre-installed) \n- **Pyspark** must also be installed if you plan to use this library \n\n## Why Does This Library Exist?\n\n### The Problem\n\nTypical conversions between Spark and Polars often involve an intermediate Pandas step:\n\n```python\n# Traditional approach:\n# Spark -> Pandas -> Polars\n# or\n# Polars -> Pandas -> Spark\n```\n\n### The Solution\n\n**sparkpolars** eliminates unnecessary dependencies like `pandas` and `pyarrow` by leveraging native functions such as `.collect()` and schema interpretation.\n\n### Key Benefits\n\n- \ud83d\ude80 **No extra dependencies** \u2013 No need for Pandas or PyArrow \n- \u2705 **Reliable handling of complex types** \u2013 Provides better consistency for `MapType`, `StructType`, and nested `ArrayType`, where existing conversion methods can be unreliable \n\n## Features\n\n- Convert a Spark DataFrame to a Polars DataFrame or LazyFrame\n- Ensures schema consistency: preserves `LongType` as `Int64` instead of mistakenly converting to `Int32`\n- Three conversion modes: `NATIVE`, `ARROW`, `PANDAS`\n- `NATIVE` mode properly converts `MapType`, `StructType`, and nested `ArrayType`\n- `ARROW` and `PANDAS` modes may have limitations with complex types\n- Configurable conversion settings for Polars `list(struct)` to Spark `MapType`\n- Timezone and time unit customization for Polars `Datetime`\n\n## Usage\n\n### 1. From Spark to Polars DataFrame\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"example\").getOrCreate()\n\ndf = spark.createDataFrame([(1, 2)], [\"a\", \"b\"])\n\npolars_df = df.toPolars()\n```\n\n### 2. From Spark to Polars LazyFrame\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"example\").getOrCreate()\n\ndf = spark.createDataFrame([(1, 2)], [\"a\", \"b\"])\n\npolars_df = df.toPolars(lazy=True)\n```\n\n### 3. From Polars DataFrame to Spark\n\n```python\nfrom pyspark.sql import SparkSession\nfrom polars import DataFrame\n\nspark = SparkSession.builder.appName(\"example\").getOrCreate()\n\ndf = DataFrame({\"a\": [1], \"b\": [2]}) # It can also be a LazyDataFrame\n\nspark_df = df.to_spark(spark=spark)\n# or \nspark_df = df.to_spark() # It will try to get the Spark ActiveSession\n```\n\n### 4. Using Specific Mode\n\n```python\nfrom sparkpolars import ModeMethod\n\nspark_df = df.to_spark(mode=ModeMethod.NATIVE)\nspark_df = df.to_spark(mode=ModeMethod.PANDAS)\nspark_df = df.to_spark(mode=ModeMethod.ARROW)\n\npolars_df = df.toPolars(mode=ModeMethod.NATIVE)\npolars_df = df.toPolars(mode=ModeMethod.PANDAS)\npolars_df = df.toPolars(mode=ModeMethod.ARROW)\n```\n\n### 5. 
Using Config\n\n```python\nfrom sparkpolars import Config\n\nconf = Config(\n map_elements=[\"column_should_be_converted_to_map_type\", ...], # Specify columns to convert to MapType\n time_unit=\"ms\", # Literal[\"ns\", \"us\", \"ms\"], defaults to \"us\"\n)\nspark_df = df.to_spark(config=conf)\n\npolars_df = df.toPolars(config=conf)\n```\n\n## Known Limitations\n\n### JVM Timezone Discrepancy\n\nSpark timestamps are collected via the JVM, which may differ from Spark\u2019s timezone settings. If issues arise, verify the JVM timezone.\n\n### Memory Constraints\n\nCollecting large datasets into memory can exceed available driver memory, leading to failures. (as for pandas/arrow)\n\n### Handling `MapType`:\n\n\n#### From Spark to Polars\nIf you have in Spark:\n\nType: `StructField(\"example\", MapType(StringType(), IntegerType()))`\n\nData: `{\"a\": 1, \"b\": 2}`\n\nThen it will become in Polars:\n\nType: `{\"example\": List(Struct(\"key\": String, \"value\": Int32))}`\n\nData: `[{\"key\": \"a\", \"value\": 1}, {\"key\": \"b\", \"value\": 2}]`\n\n#### From Polars to Spark\nIf you have in Polars:\n\nType: `{\"example\": List(Struct(\"key\": String, \"value\": Int32))}`\n\nData: `[{\"key\": \"a\", \"value\": 1}, {\"key\": \"b\", \"value\": 2}]`\n\nThen it will become in Spark without specifying any config (Default Behavior):\n\nType: `StructField(\"example\", ArrayType(StructType(StructField(\"key\", StringType())), StructField(\"value\", IntegerType())))`\n\nData: `[{\"key\": \"a\", \"value\": 1}, {\"key\": \"b\", \"value\": 2}]`\n\nIf you want this data to be converted to MapType:\n\n```python\nfrom sparkpolars import Config\nconf = Config(\n map_elements=[\"example\"]\n)\n```\n\nType: `StructField(\"example\", MapType(StringType(), IntegerType()))`\n\nData: `{\"a\": 1, \"b\": 2}`\n\n## License\n- pending\n\n## Contribution\n- pending\n",
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2025 Skander Boudawara\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.\n ",
"summary": "Conversion between PySpark and Polars DataFrames",
"version": "0.0.6",
"project_urls": {
"Bug Reports": "https://github.com/skanderboudawara/sparkpolars/issues",
"Homepage": "https://pypi.org/project/sparkpolars/",
"Source": "https://github.com/skanderboudawara/sparkpolars"
},
"split_keywords": [
"pyspark",
" polars",
" conversion",
" spark-to-polars",
" polars-to-spark"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "84ef52ee5d39aed1332f8d95d6c8c27b60a4d7b17efa8c0ed1a313402653b931",
"md5": "e4e5a7c4fe87ed99621e9b9724969833",
"sha256": "41d3b350be0f8fb381d3131db1122de365e18c61f546ec9baaa72cf5d858ada7"
},
"downloads": -1,
"filename": "sparkpolars-0.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e4e5a7c4fe87ed99621e9b9724969833",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 11520,
"upload_time": "2025-02-12T00:23:07",
"upload_time_iso_8601": "2025-02-12T00:23:07.598564Z",
"url": "https://files.pythonhosted.org/packages/84/ef/52ee5d39aed1332f8d95d6c8c27b60a4d7b17efa8c0ed1a313402653b931/sparkpolars-0.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f4361cf0131ce921b825e5c320ef71d18ac43ed1d6349b1cbbf1a85baf8f6ab8",
"md5": "6e36bcf2c3645dc07413c4e9f31eef32",
"sha256": "d7c6c56f60784080fb4794aba08673389a7c937785218cdcf2e7afc4ca9c0e14"
},
"downloads": -1,
"filename": "sparkpolars-0.0.6.tar.gz",
"has_sig": false,
"md5_digest": "6e36bcf2c3645dc07413c4e9f31eef32",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 16519,
"upload_time": "2025-02-12T00:23:10",
"upload_time_iso_8601": "2025-02-12T00:23:10.750199Z",
"url": "https://files.pythonhosted.org/packages/f4/36/1cf0131ce921b825e5c320ef71d18ac43ed1d6349b1cbbf1a85baf8f6ab8/sparkpolars-0.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-12 00:23:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "skanderboudawara",
"github_project": "sparkpolars",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "sparkpolars"
}