siuba
=====
*scrappy data analysis, with seamless support for pandas and SQL*
[![CI](https://github.com/machow/siuba/workflows/CI/badge.svg)](https://github.com/machow/siuba/actions?query=workflow%3ACI+branch%3Amain)
[![Documentation Status](https://img.shields.io/badge/docs-siuba.org-blue.svg)](https://siuba.org)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/machow/siuba/master)
<img width="30%" align="right" src="./docs/siuba_small.svg">
siuba ([小巴](http://www.cantonese.sheik.co.uk/dictionary/words/9139/)) is a port of [dplyr](https://github.com/tidyverse/dplyr) and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:
* `select()` - keep certain columns of data.
* `filter()` - keep certain rows of data.
* `mutate()` - create or modify an existing column of data.
* `summarize()` - reduce one or more columns down to a single number.
* `arrange()` - reorder the rows of data.
These actions can be preceded by a `group_by()`, which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as `distinct()`, `count()`, and joins are implemented.
Inputs to these functions can be a pandas `DataFrame` or SQL connection (currently postgres, redshift, or sqlite).
For more on the rationale behind tools like dplyr, see this [tidyverse paper](https://tidyverse.tidyverse.org/articles/paper.html).
For examples of siuba in action, see the [siuba guide](https://siuba.org/guide).
Installation
------------
```
pip install siuba
```
Examples
--------
See the [siuba guide](https://siuba.org/guide) or this [live analysis](https://www.youtube.com/watch?v=eKuboGOoP08) for a full introduction.
### Basic use
The code below uses the example DataFrame `mtcars`, to get the average horsepower (hp) per cylinder.
```python
from siuba import group_by, summarize, _
from siuba.data import mtcars
(mtcars
>> group_by(_.cyl)
>> summarize(avg_hp = _.hp.mean())
)
```
```
Out[1]:
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286
```
There are three key concepts in this example:
| concept | example | meaning |
| ------- | ------- | ------- |
| verb | `group_by(...)` | a function that operates on a table, like a DataFrame or SQL table |
| siu expression | `_.hp.mean()` | an expression created with `siuba._`, that represents actions you want to perform |
| pipe | `mtcars >> group_by(...)` | a syntax that allows you to chain verbs with the `>>` operator |
See the [siuba guide overview](https://siuba.org/guide) for a full introduction.
### What is a siu expression (e.g. `_.cyl == 4`)?
A siu expression is a way of specifying **what** action you want to perform.
This allows siuba verbs to decide **how** to execute the action, depending on whether your data is a local DataFrame or remote table.
```python
from siuba import _
_.cyl == 4
```
```
Out[2]:
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4
```
You can also think of siu expressions as a shorthand for a lambda function.
```python
from siuba import _
# lambda approach
mtcars[lambda _: _.cyl == 4]
# siu expression approach
mtcars[_.cyl == 4]
```
```
Out[3]:
mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
.. ... ... ... ... ... ... ... .. .. ... ...
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
[11 rows x 11 columns]
```
See the [siuba guide](https://siuba.org/guide) or read more about [lazy expressions](https://siuba.org/guide/basics-lazy-expressions.html).
### Using with a SQL database
A killer feature of siuba is that the same analysis code can be run on a local DataFrame, or a SQL source.
In the code below, we set up an example database.
```python
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars
# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
```
Next, we use the code from the first example, except now executed a SQL table.
```python
# Demo SQL analysis with siuba ----
from siuba import _, tbl, group_by, summarize, filter
# connect with siuba
tbl_mtcars = tbl(engine, "mtcars")
(tbl_mtcars
>> group_by(_.cyl)
>> summarize(avg_hp = _.hp.mean())
)
```
```
Out[4]:
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286
# .. may have more rows
```
See the [querying SQL introduction here](https://siuba.org/guide/basics-sql.html).
### Example notebooks
Below are some examples I've kept as I've worked on siuba.
For the most up to date explanations, see the [siuba guide](https://siuba.org/guide)
* [siu expressions](examples/examples-siu.ipynb)
* [dplyr style pandas](examples/examples-dplyr-funcs.ipynb)
- [select verb case study](examples/case-iris-select.ipynb)
* sql using dplyr style
- [simple sql statements](examples/examples-sql.ipynb)
- [the kitchen sink with postgres](examples/examples-postgres.ipynb)
* [tidytuesday examples](https://github.com/machow/tidytuesday-py)
- tidytuesday is a weekly R data analysis project. In order to kick the tires
on siuba, I've been using it to complete the assignments. More specifically,
I've been porting Dave Robinson's [tidytuesday analyses](https://github.com/dgrtwo/data-screencasts)
to use siuba.
Testing
-------
Tests are done using pytest.
They can be run using the following.
```bash
# start postgres db
docker-compose up
pytest siuba
```
Raw data
{
"_id": null,
"home_page": "https://github.com/machow/siuba",
"name": "siuba",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "package",
"author": "Michael Chow",
"author_email": "mc_al_gh_siuba@fastmail.com",
"download_url": "https://files.pythonhosted.org/packages/4c/c3/5203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786/siuba-0.4.4.tar.gz",
"platform": null,
"description": "siuba\n=====\n\n*scrappy data analysis, with seamless support for pandas and SQL*\n\n[![CI](https://github.com/machow/siuba/workflows/CI/badge.svg)](https://github.com/machow/siuba/actions?query=workflow%3ACI+branch%3Amain)\n[![Documentation Status](https://img.shields.io/badge/docs-siuba.org-blue.svg)](https://siuba.org)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/machow/siuba/master)\n\n<img width=\"30%\" align=\"right\" src=\"./docs/siuba_small.svg\">\n\nsiuba ([\u5c0f\u5df4](http://www.cantonese.sheik.co.uk/dictionary/words/9139/)) is a port of [dplyr](https://github.com/tidyverse/dplyr) and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:\n\n* `select()` - keep certain columns of data.\n* `filter()` - keep certain rows of data.\n* `mutate()` - create or modify an existing column of data.\n* `summarize()` - reduce one or more columns down to a single number.\n* `arrange()` - reorder the rows of data.\n\nThese actions can be preceded by a `group_by()`, which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as `distinct()`, `count()`, and joins are implemented.\nInputs to these functions can be a\u00a0pandas `DataFrame` or SQL connection (currently postgres, redshift, or sqlite).\n\nFor more on the rationale behind tools like dplyr, see this [tidyverse paper](https://tidyverse.tidyverse.org/articles/paper.html). \nFor examples of siuba in action, see the [siuba guide](https://siuba.org/guide).\n\nInstallation\n------------\n\n```\npip install siuba\n```\n\nExamples\n--------\n\nSee the [siuba guide](https://siuba.org/guide) or this [live analysis](https://www.youtube.com/watch?v=eKuboGOoP08) for a full introduction.\n\n### Basic use\n\nThe code below uses the example DataFrame `mtcars`, to get the average horsepower (hp) per cylinder.\n\n```python\nfrom siuba import group_by, summarize, _\nfrom siuba.data import mtcars\n\n(mtcars\n >> group_by(_.cyl)\n >> summarize(avg_hp = _.hp.mean())\n )\n```\n\n```\nOut[1]: \n cyl avg_hp\n0 4 82.636364\n1 6 122.285714\n2 8 209.214286\n```\n\nThere are three key concepts in this example:\n\n| concept | example | meaning |\n| ------- | ------- | ------- |\n| verb | `group_by(...)` | a function that operates on a table, like a DataFrame or SQL table |\n| siu expression | `_.hp.mean()` | an expression created with `siuba._`, that represents actions you want to perform |\n| pipe | `mtcars >> group_by(...)` | a syntax that allows you to chain verbs with the `>>` operator |\n\n\nSee the [siuba guide overview](https://siuba.org/guide) for a full introduction.\n\n### What is a siu expression (e.g. `_.cyl == 4`)?\n\nA siu expression is a way of specifying **what** action you want to perform.\nThis allows siuba verbs to decide **how** to execute the action, depending on whether your data is a local DataFrame or remote table.\n\n```python\nfrom siuba import _\n\n_.cyl == 4\n```\n\n```\nOut[2]:\n\u2588\u2500==\n\u251c\u2500\u2588\u2500.\n\u2502 \u251c\u2500_\n\u2502 \u2514\u2500'cyl'\n\u2514\u25004\n```\n\nYou can also think of siu expressions as a shorthand for a lambda function.\n\n```python\nfrom siuba import _\n\n# lambda approach\nmtcars[lambda _: _.cyl == 4]\n\n# siu expression approach\nmtcars[_.cyl == 4]\n```\n\n```\nOut[3]: \n mpg cyl disp hp drat wt qsec vs am gear carb\n2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1\n7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2\n.. ... ... ... ... ... ... ... .. .. ... ...\n27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2\n31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2\n\n[11 rows x 11 columns]\n```\n\nSee the [siuba guide](https://siuba.org/guide) or read more about [lazy expressions](https://siuba.org/guide/basics-lazy-expressions.html).\n\n### Using with a SQL database\n\nA killer feature of siuba is that the same analysis code can be run on a local DataFrame, or a SQL source.\n\nIn the code below, we set up an example database.\n\n```python\n# Setup example data ----\nfrom sqlalchemy import create_engine\nfrom siuba.data import mtcars\n\n# copy pandas DataFrame to sqlite\nengine = create_engine(\"sqlite:///:memory:\")\nmtcars.to_sql(\"mtcars\", engine, if_exists = \"replace\")\n```\n\nNext, we use the code from the first example, except now executed a SQL table.\n\n```python\n# Demo SQL analysis with siuba ----\nfrom siuba import _, tbl, group_by, summarize, filter\n\n# connect with siuba\ntbl_mtcars = tbl(engine, \"mtcars\")\n\n(tbl_mtcars\n >> group_by(_.cyl)\n >> summarize(avg_hp = _.hp.mean())\n )\n```\n\n```\nOut[4]: \n# Source: lazy query\n# DB Conn: Engine(sqlite:///:memory:)\n# Preview:\n cyl avg_hp\n0 4 82.636364\n1 6 122.285714\n2 8 209.214286\n# .. may have more rows\n```\n\nSee the [querying SQL introduction here](https://siuba.org/guide/basics-sql.html).\n\n### Example notebooks\n\nBelow are some examples I've kept as I've worked on siuba.\nFor the most up to date explanations, see the [siuba guide](https://siuba.org/guide)\n\n* [siu expressions](examples/examples-siu.ipynb)\n* [dplyr style pandas](examples/examples-dplyr-funcs.ipynb)\n - [select verb case study](examples/case-iris-select.ipynb)\n* sql using dplyr style\n - [simple sql statements](examples/examples-sql.ipynb)\n - [the kitchen sink with postgres](examples/examples-postgres.ipynb)\n* [tidytuesday examples](https://github.com/machow/tidytuesday-py)\n - tidytuesday is a weekly R data analysis project. In order to kick the tires\n on siuba, I've been using it to complete the assignments. More specifically,\n I've been porting Dave Robinson's [tidytuesday analyses](https://github.com/dgrtwo/data-screencasts)\n to use siuba.\n\nTesting\n-------\n\nTests are done using pytest.\nThey can be run using the following.\n\n```bash\n# start postgres db\ndocker-compose up\npytest siuba\n```\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A package for quick, scrappy analyses with pandas and SQL",
"version": "0.4.4",
"project_urls": {
"Homepage": "https://github.com/machow/siuba"
},
"split_keywords": [
"package"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bea959a1d3ae43ce39b3b1addf670c304312b732fd5e29c6f96132c66019c801",
"md5": "0a05413635f481da7e39a5e1be482fbb",
"sha256": "07a6b2a02f39e53a8fdb1f1a3a7c49bce182346901e99df1d073e5427cf9e9dc"
},
"downloads": -1,
"filename": "siuba-0.4.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0a05413635f481da7e39a5e1be482fbb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 208633,
"upload_time": "2023-09-19T21:09:06",
"upload_time_iso_8601": "2023-09-19T21:09:06.604059Z",
"url": "https://files.pythonhosted.org/packages/be/a9/59a1d3ae43ce39b3b1addf670c304312b732fd5e29c6f96132c66019c801/siuba-0.4.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4cc35203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786",
"md5": "6d2b4d9aa9922c0ef7decc8594d89a6b",
"sha256": "73fb5e3934e45f2083cf0cc362c761fac2f1ae6f918746442edd8f493f009387"
},
"downloads": -1,
"filename": "siuba-0.4.4.tar.gz",
"has_sig": false,
"md5_digest": "6d2b4d9aa9922c0ef7decc8594d89a6b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 170529,
"upload_time": "2023-09-19T21:09:08",
"upload_time_iso_8601": "2023-09-19T21:09:08.669495Z",
"url": "https://files.pythonhosted.org/packages/4c/c3/5203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786/siuba-0.4.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-09-19 21:09:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "machow",
"github_project": "siuba",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "siuba"
}