[](https://badge.fury.io/py/tidypyspark)
# `tidypyspark`
> Make [pyspark](https://pypi.org/project/pyspark/) sing
> [dplyr](https://dplyr.tidyverse.org/)
> Inspired by [sparklyr](https://spark.rstudio.com/),
> [tidyverse](https://tidyverse.tidyverse.org/)
`tidypyspark` python package provides *minimal, pythonic* wrapper around
pyspark sql dataframe API in
[tidyverse](https://tidyverse.tidyverse.org/) flavor.
- With accessor `ts`, apply `tidypyspark` methods where both input and
output are mostly pyspark dataframes.
- Consistent 'verbs' (`select`, `arrange`, `distinct`, ...)
Also see [`tidypandas`](https://pypi.org/project/tidypandas/): A
**grammar of data manipulation** for
[pandas](https://pandas.pydata.org/docs/index.html) inspired by
[tidyverse](https://tidyverse.tidyverse.org/)
## Usage
# assumed that pyspark session is active
from tidypyspark import ts
import pyspark.sql.functions as F
from tidypyspark.datasets import get_penguins_path
pen = spark.read.csv(get_penguins_path(), header = True, inferSchema = True)
(pen.ts.add_row_number(order_by = 'bill_depth_mm')
.ts.mutate({'cumsum_bl': F.sum('bill_length_mm')},
by = 'species',
order_by = ['bill_depth_mm', 'row_number'],
range_between = (-float('inf'), 0)
)
.ts.select(['species', 'bill_length_mm', 'cumsum_bl'])
).show(5)
+--------------+-------+-------------+------------------+
|bill_length_mm|species|bill_depth_mm| cumsum_bl|
+--------------+-------+-------------+------------------+
| 32.1| Adelie| 15.5| 32.1|
| 35.2| Adelie| 15.9| 67.30000000000001|
| 37.7| Adelie| 16|105.00000000000001|
| 36.2| Adelie| 16.1|141.20000000000002|
| 33.1| Adelie| 16.1| 174.3|
+--------------+-------+-------------+------------------+
## Example
- `tidypyspark` code:
<!-- -->
(pen.ts.select(['species','bill_length_mm','bill_depth_mm', 'flipper_length_mm'])
.ts.pivot_longer('species', include = False)
).show(5)
+-------+-----------------+-----+
|species| name|value|
+-------+-----------------+-----+
| Adelie| bill_length_mm| 39.1|
| Adelie| bill_depth_mm| 18.7|
| Adelie|flipper_length_mm| 181|
| Adelie| bill_length_mm| 39.5|
| Adelie| bill_depth_mm| 17.4|
+-------+-----------------+-----+
- equivalent pyspark code:
<!-- -->
stack_expr = '''
stack(3, 'bill_length_mm', `bill_length_mm`,
'bill_depth_mm', `bill_depth_mm`,
'flipper_length_mm', `flipper_length_mm`)
as (`name`, `value`)
'''
pen.select('species', F.expr(stack_expr)).show(5)
> `tidypyspark` relies on the amazing `pyspark` library and spark
> ecosystem.
## Installation
`pip install tidypyspark`
- On github: <https://github.com/talegari/tidypyspark>
- On pypi: <https://pypi.org/project/tidypyspark>
Raw data
{
"_id": null,
"home_page": "",
"name": "tidypyspark",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "Srikanth Komala sheshachala",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/ca/b9/d7e1926033cfe0aac09700756cc09f16b858bdffe930128bfee9b91172d2/tidypyspark-0.0.1.tar.gz",
"platform": null,
"description": "[](https://badge.fury.io/py/tidypyspark)\n\n# `tidypyspark`\n\n> Make [pyspark](https://pypi.org/project/pyspark/) sing\n> [dplyr](https://dplyr.tidyverse.org/)\n\n> Inspired by [sparklyr](https://spark.rstudio.com/),\n> [tidyverse](https://tidyverse.tidyverse.org/)\n\n`tidypyspark` python package provides *minimal, pythonic* wrapper around\npyspark sql dataframe API in\n[tidyverse](https://tidyverse.tidyverse.org/) flavor.\n\n- With accessor `ts`, apply `tidypyspark` methods where both input and\n output are mostly pyspark dataframes.\n- Consistent 'verbs' (`select`, `arrange`, `distinct`, ...)\n\nAlso see [`tidypandas`](https://pypi.org/project/tidypandas/): A\n**grammar of data manipulation** for\n[pandas](https://pandas.pydata.org/docs/index.html) inspired by\n[tidyverse](https://tidyverse.tidyverse.org/)\n\n## Usage\n\n # assumed that pyspark session is active\n from tidypyspark import ts \n import pyspark.sql.functions as F\n from tidypyspark.datasets import get_penguins_path\n\n pen = spark.read.csv(get_penguins_path(), header = True, inferSchema = True)\n\n (pen.ts.add_row_number(order_by = 'bill_depth_mm')\n .ts.mutate({'cumsum_bl': F.sum('bill_length_mm')},\n by = 'species',\n order_by = ['bill_depth_mm', 'row_number'],\n range_between = (-float('inf'), 0)\n )\n .ts.select(['species', 'bill_length_mm', 'cumsum_bl'])\n ).show(5)\n \n +--------------+-------+-------------+------------------+\n |bill_length_mm|species|bill_depth_mm| cumsum_bl|\n +--------------+-------+-------------+------------------+\n | 32.1| Adelie| 15.5| 32.1|\n | 35.2| Adelie| 15.9| 67.30000000000001|\n | 37.7| Adelie| 16|105.00000000000001|\n | 36.2| Adelie| 16.1|141.20000000000002|\n | 33.1| Adelie| 16.1| 174.3|\n +--------------+-------+-------------+------------------+\n\n## Example\n\n- `tidypyspark` code:\n\n<!-- -->\n\n (pen.ts.select(['species','bill_length_mm','bill_depth_mm', 'flipper_length_mm'])\n .ts.pivot_longer('species', include = False)\n ).show(5)\n \n +-------+-----------------+-----+\n |species| name|value|\n +-------+-----------------+-----+\n | Adelie| bill_length_mm| 39.1|\n | Adelie| bill_depth_mm| 18.7|\n | Adelie|flipper_length_mm| 181|\n | Adelie| bill_length_mm| 39.5|\n | Adelie| bill_depth_mm| 17.4|\n +-------+-----------------+-----+\n\n- equivalent pyspark code:\n\n<!-- -->\n\n stack_expr = '''\n stack(3, 'bill_length_mm', `bill_length_mm`,\n 'bill_depth_mm', `bill_depth_mm`,\n 'flipper_length_mm', `flipper_length_mm`)\n as (`name`, `value`)\n '''\n pen.select('species', F.expr(stack_expr)).show(5)\n\n> `tidypyspark` relies on the amazing `pyspark` library and spark\n> ecosystem.\n\n## Installation\n\n`pip install tidypyspark`\n\n- On github: <https://github.com/talegari/tidypyspark>\n- On pypi: <https://pypi.org/project/tidypyspark>\n\n",
"bugtrack_url": null,
"license": "GNU General Public License v3.0",
"summary": "dplyr for pyspark",
"version": "0.0.1",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "95ea9dceb4d12670256f0ffaa887bdaf315242ac237f8ac926ffb85c243f28cc",
"md5": "661e2da59162b73e6da0e44c907b59a2",
"sha256": "0ed22f74ef26ad291586f8d8b86b46c3d84d9c2762fe954b54743c31bd2b1297"
},
"downloads": -1,
"filename": "tidypyspark-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "661e2da59162b73e6da0e44c907b59a2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 41144,
"upload_time": "2023-03-21T22:31:06",
"upload_time_iso_8601": "2023-03-21T22:31:06.360558Z",
"url": "https://files.pythonhosted.org/packages/95/ea/9dceb4d12670256f0ffaa887bdaf315242ac237f8ac926ffb85c243f28cc/tidypyspark-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cab9d7e1926033cfe0aac09700756cc09f16b858bdffe930128bfee9b91172d2",
"md5": "38fb7e4bcbb64ef792df7a385e935855",
"sha256": "9d74f920383a04a98fc21a0591fe2ad038e04556f414f3b2172dad98cb378f30"
},
"downloads": -1,
"filename": "tidypyspark-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "38fb7e4bcbb64ef792df7a385e935855",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 40614,
"upload_time": "2023-03-21T22:31:09",
"upload_time_iso_8601": "2023-03-21T22:31:09.071467Z",
"url": "https://files.pythonhosted.org/packages/ca/b9/d7e1926033cfe0aac09700756cc09f16b858bdffe930128bfee9b91172d2/tidypyspark-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-21 22:31:09",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "tidypyspark"
}