# AI Helpers - PySpark utils
`pyspark-utils` is a Python module that provides a collection of utilities to simplify and enhance the use of PySpark. These utilities are designed to make working with PySpark more efficient and to reduce boilerplate code.
## Table of Contents
- [AI Helpers - PySpark utils](#ai-helpers---pyspark-utils)
- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Utilities \& Examples](#utilities--examples)
- [Contributing](#contributing)
## Installation
You can install the `pyspark-utils` module via pip:
```bash
pip install ai-helpers-pyspark-utils
```
## Getting Started
First, import the module in your Python script:
```python
import pyspark_utils as psu
```
Now you can use the utilities provided by `pyspark-utils`.
## Utilities & Examples
- `get_spark_session`: Retrieve the appropriate `SparkSession`.

Create a Spark DataFrame:
```python
>>> import pyspark_utils as psu
>>> spark = psu.get_spark_session("example")
>>> sdf = spark.createDataFrame(
[
[None, "a", 1, 1.0],
["b", "b", 1, 2.0],
["b", "b", None, 3.0],
["c", "c", None, 2.0],
["c", "c", 3, 4.0],
["d", None, 4, 2.0],
["d", None, 5, 6.0],
],
["col0", "col1", "col2", "col3"],
)
>>> sdf.show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|NULL| a| 1| 1.0|
| b| b| 1| 2.0|
| b| b|NULL| 3.0|
| c| c|NULL| 2.0|
| c| c| 3| 4.0|
| d|NULL| 4| 2.0|
| d|NULL| 5| 6.0|
+----+----+----+----+
```
- `with_columns`: Apply multiple `withColumn` transformations to a DataFrame in a single call.
```python
>>> import pyspark_utils as psu
>>> import pyspark.sql.functions as F
>>> col4 = F.col("col3") + 2
>>> col5 = F.lit(True)
>>> transformed_sdf = psu.with_columns(
sdf,
col_func_mapping={"col4": col4, "col5": col5}
)
>>> transformed_sdf.show()
+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|
+----+----+----+----+----+----+
|NULL| a| 1| 1.0| 3.0|true|
| b| b| 1| 2.0| 4.0|true|
| b| b|NULL| 3.0| 5.0|true|
| c| c|NULL| 2.0| 4.0|true|
| c| c| 3| 4.0| 6.0|true|
| d|NULL| 4| 2.0| 4.0|true|
| d|NULL| 5| 6.0| 8.0|true|
+----+----+----+----+----+----+
```
- `keep_first_rows`: Keep only the first row of each group defined by `partition_cols`, with rows ordered within each group by `order_cols`.
```python
>>> transformed_sdf = psu.utils.keep_first_rows(sdf, [F.col("col0")], [F.col("col3")])
>>> transformed_sdf.show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|NULL| a| 1| 1.0|
| b| b| 1| 2.0|
| c| c|NULL| 2.0|
| d|NULL| 4| 2.0|
+----+----+----+----+
```
- `assert_cols_in_df`: Asserts that all specified columns are present in the given DataFrame. A usage sketch is shown below.
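The sketch below is illustrative only: it assumes `assert_cols_in_df` takes the DataFrame followed by the column names to check; the exact signature may differ from the published API.

```python
>>> import pyspark_utils as psu

>>> # Assumed behaviour: passes silently when every listed column exists in sdf
>>> psu.assert_cols_in_df(sdf, "col0", "col1")

>>> # Assumed behaviour: raises an assertion error when a column is missing
>>> psu.assert_cols_in_df(sdf, "col0", "missing_col")
```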
- `assert_df_close`: Asserts that two DataFrames are (almost) equal, even if their columns are in a different order. A usage sketch is shown below.
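The sketch below assumes `assert_df_close` compares the two DataFrames by column name with some numeric tolerance; the call shape is an assumption, not taken from the API.

```python
>>> import pyspark_utils as psu

>>> # Same data as sdf, columns merely reordered
>>> reordered_sdf = sdf.select("col3", "col2", "col1", "col0")

>>> # Assumed behaviour: passes because the data match despite the column order
>>> psu.assert_df_close(sdf, reordered_sdf)
```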
## Contributing
We welcome contributions to `pyspark-utils`. To contribute, please follow these steps:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes.
4. Commit your changes (`git commit -am 'Add some feature'`).
5. Push to the branch (`git push origin feature-branch`).
6. Create a new Pull Request.
Please ensure your code follows the project's coding standards and includes appropriate tests.