tidypyspark

Name: tidypyspark
Version: 0.0.1
Summary: dplyr for pyspark
Author: Srikanth Komala sheshachala
Requires Python: >=3.8,<4.0
License: GNU General Public License v3.0
Upload time: 2023-03-21 22:31:09
[![PyPI version](https://badge.fury.io/py/tidypyspark.svg)](https://badge.fury.io/py/tidypyspark)

# `tidypyspark`

> Make [pyspark](https://pypi.org/project/pyspark/) sing
> [dplyr](https://dplyr.tidyverse.org/)

> Inspired by [sparklyr](https://spark.rstudio.com/),
> [tidyverse](https://tidyverse.tidyverse.org/)

The `tidypyspark` Python package provides a *minimal, pythonic* wrapper
around the pyspark SQL DataFrame API in
[tidyverse](https://tidyverse.tidyverse.org/) flavor.

-   With the accessor `ts`, apply `tidypyspark` methods where both input
    and output are (mostly) pyspark dataframes.
-   Consistent 'verbs' (`select`, `arrange`, `distinct`, ...)

Also see [`tidypandas`](https://pypi.org/project/tidypandas/): A
**grammar of data manipulation** for
[pandas](https://pandas.pydata.org/docs/index.html) inspired by
[tidyverse](https://tidyverse.tidyverse.org/)

## Usage

    # assumes an active spark session bound to the name `spark`
    from tidypyspark import ts 
    import pyspark.sql.functions as F
    from tidypyspark.datasets import get_penguins_path

    pen = spark.read.csv(get_penguins_path(), header = True, inferSchema = True)

    (pen.ts.add_row_number(order_by = 'bill_depth_mm')
        .ts.mutate({'cumsum_bl': F.sum('bill_length_mm')},
                   by = 'species',
                   order_by = ['bill_depth_mm', 'row_number'],
                   range_between = (-float('inf'), 0)
                   )
        .ts.select(['bill_length_mm', 'species', 'bill_depth_mm', 'cumsum_bl'])
        ).show(5)
        
    +--------------+-------+-------------+------------------+
    |bill_length_mm|species|bill_depth_mm|         cumsum_bl|
    +--------------+-------+-------------+------------------+
    |          32.1| Adelie|         15.5|              32.1|
    |          35.2| Adelie|         15.9| 67.30000000000001|
    |          37.7| Adelie|           16|105.00000000000001|
    |          36.2| Adelie|         16.1|141.20000000000002|
    |          33.1| Adelie|         16.1|             174.3|
    +--------------+-------+-------------+------------------+
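
What the pipeline computes can be illustrated without a Spark session. The sketch below is a plain-Python illustration (not part of the `tidypyspark` API): within each `by` group, rows are ordered and a running total of the value column is accumulated, mirroring `range_between = (-float('inf'), 0)`. The helper name `grouped_cumsum` and the data are made up for illustration.

```python
# Pure-Python sketch of the grouped running sum computed by the pipeline above.
# Hypothetical helper for illustration only; not a tidypyspark function.
from itertools import groupby
from operator import itemgetter

def grouped_cumsum(rows, by, order_by, col, out_col):
    """Sort rows by (by, order_by), then accumulate `col` within each `by` group."""
    ordered = sorted(rows, key=itemgetter(by, order_by))
    result = []
    for _, grp in groupby(ordered, key=itemgetter(by)):
        total = 0.0
        for r in grp:
            total += r[col]                      # running sum over rows so far
            result.append({**r, out_col: total})
    return result

rows = [
    {"species": "Adelie", "bill_depth_mm": 15.9, "bill_length_mm": 35.2},
    {"species": "Adelie", "bill_depth_mm": 15.5, "bill_length_mm": 32.1},
    {"species": "Gentoo", "bill_depth_mm": 13.2, "bill_length_mm": 46.1},
]
out = grouped_cumsum(rows, "species", "bill_depth_mm", "bill_length_mm", "cumsum_bl")
```

In Spark the same semantics come from a window partitioned by the `by` columns, ordered by the `order_by` columns, with an unbounded-preceding-to-current-row frame.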

## Example

-   `tidypyspark` code:

<!-- -->

    (pen.ts.select(['species','bill_length_mm','bill_depth_mm', 'flipper_length_mm'])
     .ts.pivot_longer('species', include = False)
     ).show(5)
     
    +-------+-----------------+-----+
    |species|             name|value|
    +-------+-----------------+-----+
    | Adelie|   bill_length_mm| 39.1|
    | Adelie|    bill_depth_mm| 18.7|
    | Adelie|flipper_length_mm|  181|
    | Adelie|   bill_length_mm| 39.5|
    | Adelie|    bill_depth_mm| 17.4|
    +-------+-----------------+-----+

-   equivalent pyspark code:

<!-- -->

    stack_expr = '''
                 stack(3, 'bill_length_mm', `bill_length_mm`,
                          'bill_depth_mm', `bill_depth_mm`,
                          'flipper_length_mm', `flipper_length_mm`)
                          as (`name`, `value`)
                 '''
    pen.select('species', F.expr(stack_expr)).show(5)
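
The reshape both snippets perform can likewise be sketched in plain Python (a hedged illustration; `melt` is a hypothetical helper, not library code): each melted column becomes one `(name, value)` row per input record, with the id column repeated.

```python
# Plain-Python sketch of the long-format reshape done by pivot_longer / stack.
# `melt` is a hypothetical helper for illustration, not part of tidypyspark.
def melt(rows, id_col, value_cols):
    """Turn each value column into a (name, value) row, repeating `id_col`."""
    return [
        {id_col: r[id_col], "name": c, "value": r[c]}
        for r in rows
        for c in value_cols
    ]

row = {"species": "Adelie", "bill_length_mm": 39.1,
       "bill_depth_mm": 18.7, "flipper_length_mm": 181}
long_rows = melt([row], "species",
                 ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"])
```

Each input record thus expands into as many rows as there are melted columns, which is exactly what `stack(3, ...)` expresses in Spark SQL.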

> `tidypyspark` relies on the amazing `pyspark` library and the Spark
> ecosystem.

## Installation

`pip install tidypyspark`

-   On GitHub: <https://github.com/talegari/tidypyspark>
-   On PyPI: <https://pypi.org/project/tidypyspark>


            
