ai-helpers-pyspark-utils

Name: ai-helpers-pyspark-utils
Version: 0.1.0a3
Home page: https://github.com/ai-helpers/pyspark-utils
Summary: Common pyspark utils
Upload time: 2024-05-30 13:12:50
Author: Corentin Vasseur
Requires Python: >=3.9, <3.11
License: None
Keywords: machine-learning, pyspark, utils
# AI Helpers - PySpark utils

`pyspark-utils` is a Python module that provides a collection of utilities to simplify and enhance the use of PySpark. These utilities are designed to make working with PySpark more efficient and to reduce boilerplate code.

## Table of Contents

- [AI Helpers - PySpark utils](#ai-helpers---pyspark-utils)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
  - [Getting Started](#getting-started)
  - [Utilities \& Examples](#utilities--examples)
  - [Contributing](#contributing)

## Installation

You can install the `pyspark-utils` module via pip (the package requires Python >=3.9 and <3.11):

```bash
pip install ai-helpers-pyspark-utils
```

## Getting Started

First, import the module in your Python script:

```python
import pyspark_utils as psu
```

Now you can use the utilities provided by `pyspark-utils`.

## Utilities & Examples

- `get_spark_session`: Recover the appropriate `SparkSession`.

  Create a Spark DataFrame:
  
  ```python
  >>> import pyspark_utils as psu

  >>> spark = psu.get_spark_session("example")
  >>> sdf = spark.createDataFrame(
        [
            [None, "a", 1, 1.0],
            ["b", "b", 1, 2.0],
            ["b", "b", None, 3.0],
            ["c", "c", None, 2.0],
            ["c", "c", 3, 4.0],
            ["d", None, 4, 2.0],
            ["d", None, 5, 6.0],
        ],
        ["col0", "col1", "col2", "col3"],
    )
  >>> sdf.show()
  +----+----+----+----+
  |col0|col1|col2|col3|
  +----+----+----+----+
  |NULL|   a|   1| 1.0|
  |   b|   b|   1| 2.0|
  |   b|   b|NULL| 3.0|
  |   c|   c|NULL| 2.0|
  |   c|   c|   3| 4.0|
  |   d|NULL|   4| 2.0|
  |   d|NULL|   5| 6.0|
  +----+----+----+----+ 
  ```

- `with_columns`: Apply multiple `withColumn` transformations to a DataFrame in a single call.

  ```python
  >>> import pyspark_utils as psu
  >>> import pyspark.sql.functions as F

  >>> col4 = F.col("col3") + 2
  >>> col5 = F.lit(True)

  >>> transformed_sdf = psu.with_columns(
    sdf, 
    col_func_mapping={"col4": col4, "col5": col5}
    )
  >>> transformed_sdf.show()
  +----+----+----+----+----+----+
  |col0|col1|col2|col3|col4|col5|
  +----+----+----+----+----+----+
  |NULL|   a|   1| 1.0| 3.0|true|
  |   b|   b|   1| 2.0| 4.0|true|
  |   b|   b|NULL| 3.0| 5.0|true|
  |   c|   c|NULL| 2.0| 4.0|true|
  |   c|   c|   3| 4.0| 6.0|true|
  |   d|NULL|   4| 2.0| 4.0|true|
  |   d|NULL|   5| 6.0| 8.0|true|
  +----+----+----+----+----+----+
  ```

- `keep_first_rows`: Keep the first row of each group defined by `partition_cols`, ordered within each group by `order_cols`.

  ```python
  >>> transformed_sdf = psu.utils.keep_first_rows(sdf, [F.col("col0")], [F.col("col3")])
  >>> transformed_sdf.show()
  +----+----+----+----+
  |col0|col1|col2|col3|
  +----+----+----+----+
  |NULL|   a|   1| 1.0|
  |   b|   b|   1| 2.0|
  |   c|   c|NULL| 2.0|
  |   d|NULL|   4| 2.0|
  +----+----+----+----+
  ```

- `assert_cols_in_df`: Asserts that all specified columns are present in the given DataFrame (see the usage sketch after this list).

- `assert_df_close`: Asserts that two DataFrames are (almost) equal, even if their columns are in a different order.
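
A minimal usage sketch for these two assertion helpers, in the same doctest style as the examples above. Their exact signatures are not documented on this page, so the argument layout and the behavior shown below (raising `AssertionError` on failure, passing silently otherwise) are assumptions:

```python
>>> import pyspark_utils as psu

>>> # Assumed signature: raises if any of the listed columns is missing from sdf.
>>> psu.assert_cols_in_df(sdf, ["col0", "col3"])

>>> # Assumed behavior: compares values within a numeric tolerance and ignores
>>> # column order, raising AssertionError on any mismatch.
>>> reordered_sdf = sdf.select("col3", "col2", "col1", "col0")
>>> psu.assert_df_close(sdf, reordered_sdf)
```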

## Contributing

We welcome contributions to `pyspark-utils`. To contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes.
4. Commit your changes (`git commit -am 'Add some feature'`).
5. Push to the branch (`git push origin feature-branch`).
6. Create a new Pull Request.

Please ensure your code follows the project's coding standards and includes appropriate tests.

            
