siuba


Namesiuba JSON
Version 0.4.4 PyPI version JSON
download
home_pagehttps://github.com/machow/siuba
SummaryA package for quick, scrappy analyses with pandas and SQL
upload_time2023-09-19 21:09:08
maintainer
docs_urlNone
authorMichael Chow
requires_python>=3.7
licenseMIT
keywords package
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            siuba
=====

*scrappy data analysis, with seamless support for pandas and SQL*

[![CI](https://github.com/machow/siuba/workflows/CI/badge.svg)](https://github.com/machow/siuba/actions?query=workflow%3ACI+branch%3Amain)
[![Documentation Status](https://img.shields.io/badge/docs-siuba.org-blue.svg)](https://siuba.org)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/machow/siuba/master)

<img width="30%" align="right" src="./docs/siuba_small.svg">

siuba ([小巴](http://www.cantonese.sheik.co.uk/dictionary/words/9139/)) is a port of [dplyr](https://github.com/tidyverse/dplyr) and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:

* `select()` - keep certain columns of data.
* `filter()` - keep certain rows of data.
* `mutate()` - create or modify an existing column of data.
* `summarize()` - reduce one or more columns down to a single number.
* `arrange()` - reorder the rows of data.

These actions can be preceded by a `group_by()`, which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as `distinct()`, `count()`, and joins are implemented.
Inputs to these functions can be a pandas `DataFrame` or SQL connection (currently postgres, redshift, or sqlite).

For more on the rationale behind tools like dplyr, see this [tidyverse paper](https://tidyverse.tidyverse.org/articles/paper.html). 
For examples of siuba in action, see the [siuba guide](https://siuba.org/guide).

Installation
------------

```
pip install siuba
```

Examples
--------

See the [siuba guide](https://siuba.org/guide) or this [live analysis](https://www.youtube.com/watch?v=eKuboGOoP08) for a full introduction.

### Basic use

The code below uses the example DataFrame `mtcars`, to get the average horsepower (hp) per cylinder.

```python
from siuba import group_by, summarize, _
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
```

```
Out[1]: 
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286
```

There are three key concepts in this example:

| concept | example | meaning |
| ------- | ------- | ------- |
| verb    | `group_by(...)` | a function that operates on a table, like a DataFrame or SQL table |
| siu expression | `_.hp.mean()` | an expression created with `siuba._`, that represents actions you want to perform |
| pipe | `mtcars >> group_by(...)` | a syntax that allows you to chain verbs with the `>>` operator |


See the [siuba guide overview](https://siuba.org/guide) for a full introduction.

### What is a siu expression (e.g. `_.cyl == 4`)?

A siu expression is a way of specifying **what** action you want to perform.
This allows siuba verbs to decide **how** to execute the action, depending on whether your data is a local DataFrame or remote table.

```python
from siuba import _

_.cyl == 4
```

```
Out[2]:
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4
```

You can also think of siu expressions as a shorthand for a lambda function.

```python
from siuba import _

# lambda approach
mtcars[lambda _: _.cyl == 4]

# siu expression approach
mtcars[_.cyl == 4]
```

```
Out[3]: 
     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...
27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2

[11 rows x 11 columns]
```

See the [siuba guide](https://siuba.org/guide) or read more about [lazy expressions](https://siuba.org/guide/basics-lazy-expressions.html).

### Using with a SQL database

A killer feature of siuba is that the same analysis code can be run on a local DataFrame, or a SQL source.

In the code below, we set up an example database.

```python
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
```

Next, we use the code from the first example, except now executed a SQL table.

```python
# Demo SQL analysis with siuba ----
from siuba import _, tbl, group_by, summarize, filter

# connect with siuba
tbl_mtcars = tbl(engine, "mtcars")

(tbl_mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
```

```
Out[4]: 
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286
# .. may have more rows
```

See the [querying SQL introduction here](https://siuba.org/guide/basics-sql.html).

### Example notebooks

Below are some examples I've kept as I've worked on siuba.
For the most up to date explanations, see the [siuba guide](https://siuba.org/guide)

* [siu expressions](examples/examples-siu.ipynb)
* [dplyr style pandas](examples/examples-dplyr-funcs.ipynb)
  - [select verb case study](examples/case-iris-select.ipynb)
* sql using dplyr style
  - [simple sql statements](examples/examples-sql.ipynb)
  - [the kitchen sink with postgres](examples/examples-postgres.ipynb)
* [tidytuesday examples](https://github.com/machow/tidytuesday-py)
  - tidytuesday is a weekly R data analysis project. In order to kick the tires
    on siuba, I've been using it to complete the assignments. More specifically,
    I've been porting Dave Robinson's [tidytuesday analyses](https://github.com/dgrtwo/data-screencasts)
    to use siuba.

Testing
-------

Tests are done using pytest.
They can be run using the following.

```bash
# start postgres db
docker-compose up
pytest siuba
```



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/machow/siuba",
    "name": "siuba",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "package",
    "author": "Michael Chow",
    "author_email": "mc_al_gh_siuba@fastmail.com",
    "download_url": "https://files.pythonhosted.org/packages/4c/c3/5203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786/siuba-0.4.4.tar.gz",
    "platform": null,
    "description": "siuba\n=====\n\n*scrappy data analysis, with seamless support for pandas and SQL*\n\n[![CI](https://github.com/machow/siuba/workflows/CI/badge.svg)](https://github.com/machow/siuba/actions?query=workflow%3ACI+branch%3Amain)\n[![Documentation Status](https://img.shields.io/badge/docs-siuba.org-blue.svg)](https://siuba.org)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/machow/siuba/master)\n\n<img width=\"30%\" align=\"right\" src=\"./docs/siuba_small.svg\">\n\nsiuba ([\u5c0f\u5df4](http://www.cantonese.sheik.co.uk/dictionary/words/9139/)) is a port of [dplyr](https://github.com/tidyverse/dplyr) and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:\n\n* `select()` - keep certain columns of data.\n* `filter()` - keep certain rows of data.\n* `mutate()` - create or modify an existing column of data.\n* `summarize()` - reduce one or more columns down to a single number.\n* `arrange()` - reorder the rows of data.\n\nThese actions can be preceded by a `group_by()`, which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as `distinct()`, `count()`, and joins are implemented.\nInputs to these functions can be a\u00a0pandas `DataFrame` or SQL connection (currently postgres, redshift, or sqlite).\n\nFor more on the rationale behind tools like dplyr, see this [tidyverse paper](https://tidyverse.tidyverse.org/articles/paper.html). \nFor examples of siuba in action, see the [siuba guide](https://siuba.org/guide).\n\nInstallation\n------------\n\n```\npip install siuba\n```\n\nExamples\n--------\n\nSee the [siuba guide](https://siuba.org/guide) or this [live analysis](https://www.youtube.com/watch?v=eKuboGOoP08) for a full introduction.\n\n### Basic use\n\nThe code below uses the example DataFrame `mtcars`, to get the average horsepower (hp) per cylinder.\n\n```python\nfrom siuba import group_by, summarize, _\nfrom siuba.data import mtcars\n\n(mtcars\n  >> group_by(_.cyl)\n  >> summarize(avg_hp = _.hp.mean())\n  )\n```\n\n```\nOut[1]: \n   cyl      avg_hp\n0    4   82.636364\n1    6  122.285714\n2    8  209.214286\n```\n\nThere are three key concepts in this example:\n\n| concept | example | meaning |\n| ------- | ------- | ------- |\n| verb    | `group_by(...)` | a function that operates on a table, like a DataFrame or SQL table |\n| siu expression | `_.hp.mean()` | an expression created with `siuba._`, that represents actions you want to perform |\n| pipe | `mtcars >> group_by(...)` | a syntax that allows you to chain verbs with the `>>` operator |\n\n\nSee the [siuba guide overview](https://siuba.org/guide) for a full introduction.\n\n### What is a siu expression (e.g. `_.cyl == 4`)?\n\nA siu expression is a way of specifying **what** action you want to perform.\nThis allows siuba verbs to decide **how** to execute the action, depending on whether your data is a local DataFrame or remote table.\n\n```python\nfrom siuba import _\n\n_.cyl == 4\n```\n\n```\nOut[2]:\n\u2588\u2500==\n\u251c\u2500\u2588\u2500.\n\u2502 \u251c\u2500_\n\u2502 \u2514\u2500'cyl'\n\u2514\u25004\n```\n\nYou can also think of siu expressions as a shorthand for a lambda function.\n\n```python\nfrom siuba import _\n\n# lambda approach\nmtcars[lambda _: _.cyl == 4]\n\n# siu expression approach\nmtcars[_.cyl == 4]\n```\n\n```\nOut[3]: \n     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb\n2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1\n7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2\n..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...\n27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2\n31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2\n\n[11 rows x 11 columns]\n```\n\nSee the [siuba guide](https://siuba.org/guide) or read more about [lazy expressions](https://siuba.org/guide/basics-lazy-expressions.html).\n\n### Using with a SQL database\n\nA killer feature of siuba is that the same analysis code can be run on a local DataFrame, or a SQL source.\n\nIn the code below, we set up an example database.\n\n```python\n# Setup example data ----\nfrom sqlalchemy import create_engine\nfrom siuba.data import mtcars\n\n# copy pandas DataFrame to sqlite\nengine = create_engine(\"sqlite:///:memory:\")\nmtcars.to_sql(\"mtcars\", engine, if_exists = \"replace\")\n```\n\nNext, we use the code from the first example, except now executed a SQL table.\n\n```python\n# Demo SQL analysis with siuba ----\nfrom siuba import _, tbl, group_by, summarize, filter\n\n# connect with siuba\ntbl_mtcars = tbl(engine, \"mtcars\")\n\n(tbl_mtcars\n  >> group_by(_.cyl)\n  >> summarize(avg_hp = _.hp.mean())\n  )\n```\n\n```\nOut[4]: \n# Source: lazy query\n# DB Conn: Engine(sqlite:///:memory:)\n# Preview:\n   cyl      avg_hp\n0    4   82.636364\n1    6  122.285714\n2    8  209.214286\n# .. may have more rows\n```\n\nSee the [querying SQL introduction here](https://siuba.org/guide/basics-sql.html).\n\n### Example notebooks\n\nBelow are some examples I've kept as I've worked on siuba.\nFor the most up to date explanations, see the [siuba guide](https://siuba.org/guide)\n\n* [siu expressions](examples/examples-siu.ipynb)\n* [dplyr style pandas](examples/examples-dplyr-funcs.ipynb)\n  - [select verb case study](examples/case-iris-select.ipynb)\n* sql using dplyr style\n  - [simple sql statements](examples/examples-sql.ipynb)\n  - [the kitchen sink with postgres](examples/examples-postgres.ipynb)\n* [tidytuesday examples](https://github.com/machow/tidytuesday-py)\n  - tidytuesday is a weekly R data analysis project. In order to kick the tires\n    on siuba, I've been using it to complete the assignments. More specifically,\n    I've been porting Dave Robinson's [tidytuesday analyses](https://github.com/dgrtwo/data-screencasts)\n    to use siuba.\n\nTesting\n-------\n\nTests are done using pytest.\nThey can be run using the following.\n\n```bash\n# start postgres db\ndocker-compose up\npytest siuba\n```\n\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A package for quick, scrappy analyses with pandas and SQL",
    "version": "0.4.4",
    "project_urls": {
        "Homepage": "https://github.com/machow/siuba"
    },
    "split_keywords": [
        "package"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bea959a1d3ae43ce39b3b1addf670c304312b732fd5e29c6f96132c66019c801",
                "md5": "0a05413635f481da7e39a5e1be482fbb",
                "sha256": "07a6b2a02f39e53a8fdb1f1a3a7c49bce182346901e99df1d073e5427cf9e9dc"
            },
            "downloads": -1,
            "filename": "siuba-0.4.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0a05413635f481da7e39a5e1be482fbb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 208633,
            "upload_time": "2023-09-19T21:09:06",
            "upload_time_iso_8601": "2023-09-19T21:09:06.604059Z",
            "url": "https://files.pythonhosted.org/packages/be/a9/59a1d3ae43ce39b3b1addf670c304312b732fd5e29c6f96132c66019c801/siuba-0.4.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4cc35203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786",
                "md5": "6d2b4d9aa9922c0ef7decc8594d89a6b",
                "sha256": "73fb5e3934e45f2083cf0cc362c761fac2f1ae6f918746442edd8f493f009387"
            },
            "downloads": -1,
            "filename": "siuba-0.4.4.tar.gz",
            "has_sig": false,
            "md5_digest": "6d2b4d9aa9922c0ef7decc8594d89a6b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 170529,
            "upload_time": "2023-09-19T21:09:08",
            "upload_time_iso_8601": "2023-09-19T21:09:08.669495Z",
            "url": "https://files.pythonhosted.org/packages/4c/c3/5203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786/siuba-0.4.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-19 21:09:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "machow",
    "github_project": "siuba",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "siuba"
}
        
Elapsed time: 0.12481s