duckboat


Nameduckboat JSON
Version 0.15.0 PyPI version JSON
download
home_pageNone
SummaryA SQL-based Python dataframe library for ergonomic interactive data analysis and exploration.
upload_time2024-12-29 02:40:13
maintainerNone
docs_urlNone
authorNone
requires_python>=3.9
licenseBSD 3-Clause License Copyright (c) 2024, AJ Friend Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
keywords etl sql data-wrangling duckdb pipelines
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Duckboat

*Ugly to some, but gets the job done.*

[GitHub](https://github.com/ajfriend/duckboat) | [Docs](https://ajfriend.github.io/duckboat/) | [PyPI](https://pypi.org/project/duckboat/)

Duckboat is a SQL-based Python dataframe library for ergonomic interactive
data analysis and exploration.


```python
pip install duckboat
```

Duckboat allows you to chain SQL snippets (meaning you can usually omit `select *` and `from ...`)
to incrementally and lazily build up complex queries.

Duckboat is a light wrapper around the
[DuckDB relational API](https://duckdb.org/docs/api/python/relational_api),
so
expressions are evaluated lazily and optimized by DuckDB prior to execution.
The resulting queries are fast, avoiding the need to materialize intermediate tables or
perform data transfers.
You can leverage all the SQL syntax improvements provided by DuckDB:
[1](https://duckdb.org/2022/05/04/friendlier-sql.html)
[2](https://duckdb.org/2023/08/23/even-friendlier-sql.html)
[3](https://duckdb.org/docs/sql/dialect/friendly_sql.html)

## Examples

```python
import duckboat as uck

csv = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'

uck.Table(csv).do(
    "where sex = 'female' ",
    'where year > 2008',
    'select *, cast(body_mass_g as double) as grams',
    'select species, island, avg(grams) as avg_grams group by 1,2',
    'select * replace (round(avg_grams, 1) as avg_grams)',
    'order by avg_grams',
)
```

```
┌───────────┬───────────┬───────────┐
│  species  │  island   │ avg_grams │
│  varchar  │  varchar  │  double   │
├───────────┼───────────┼───────────┤
│ Adelie    │ Torgersen │    3193.8 │
│ Adelie    │ Dream     │    3357.5 │
│ Adelie    │ Biscoe    │    3446.9 │
│ Chinstrap │ Dream     │    3522.9 │
│ Gentoo    │ Biscoe    │    4786.3 │
└───────────┴───────────┴───────────┘
```

### To and from other data formats

We can translate to and from other data formats like Pandas DataFrames, Polars, or Arrow Tables.

```python
import pandas as pd

df = pd.DataFrame({'a': [0]})
t = uck.Table(df)
t
```

```
┌───────┐
│   a   │
│ int64 │
├───────┤
│     0 │
└───────┘
```

Translate back to a pandas dataframe with any of the following:

```python
t.df()
t.hold('pandas')
t.do('pandas')
```


### Chaining expressions

You can chain calls to `Table.do()`:


```python
f = 'select a + 1 as a'
t.do(f).do(f).do(f)
```

```
┌───────┐
│   a   │
│ int64 │
├───────┤
│     3 │
└───────┘
```

Alternatively, `Table.do()` accepts a sequence of arguments:

```
t.do(f, f, f)
```

It also accepts lists of expressions, and will apply them recursively:

```python
fs = [f, f, f]
t.do(fs)
```

Note, you could also still call this as:

```python
t.do(*fs)
```

Use lists to group expressions, which Duckboat will apply recursively:

```python
t.do(f, [f], [f, [[f, f], f]])
```

```
┌───────┐
│   a   │
│ int64 │
├───────┤
│     6 │
└───────┘
```

Duckboat will also apply functions:

```python
def foo(x):
    return x.do('select a + 2 as a')

# the following are equivalent
foo(t)
t.do(foo)
```

Of course, you can mix functions, SQL strings, and lists:

```python
t.do([foo, f])
```


### Databases and joins

TODO

### Extravagant affordances

TODO


### Objects

(probably in the docstrings, rather than the readme)

#### Table

The core functionality comes from `.sql`, where we allow snippets.
"Shell" functionality comes from the `duckboat.do()` method, allowing for things like...

#### Database

The core functionality comes from the `.sql`, which loads *only* the Tables listed. No other Python objects are loaded.
A full duckdb sql expression is expected here, with table names provided explicitly.

"Shell" functionality comes from the `duckboat.do()` function/method, allowing for things like...


#### eager

The duckboat library makes some efforts to protect against unintentionally evaluating expressions eagerly, rather
than letting them rest lazily. For example, calling the `__repr__` method on a `Table` or `Database` will
trigger an evaluation of that object, which could include pulling a large table, or showing a large intermediate result.

in `ipython` or `jupyterlab`, we can typically avoid this by not ending a cell with an object. doing so
triggers the objects `__repr__`. However, these tactics don't work when using an IDE like Positron,
which eagerly inspects objects in the namespace to provide insight into what you're working with. This is often useful,
but not always what you want when working with large datasets or expensive computations.

## Philosophy

This approach results in a mixture of Python and SQL that, I think, is semantically very similar to
[Google's Pipe Syntax for SQL](https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/):
We can leverage our existing knowledge of SQL, while making a few small changes to make it more ergonomic and composable.

When doing interactive data analysis, I find this approach easier to read and write than
fluent APIs (like in [Polars](https://pola.rs/) or [Ibis](https://ibis-project.org/)) or typical [Pandas](https://pandas.pydata.org/) code.
If some operation is easier in other libraries, Duckboat makes it straightforward translate between them, either directly or through Apache Arrow.

## Feedback

I'd love to hear any feedback on the approach here, so feel free to reach out through
[Issues](https://github.com/ajfriend/duckboat/issues)
or
[Discussions](https://github.com/ajfriend/duckboat/discussions).

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "duckboat",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "ETL, SQL, data-wrangling, duckdb, pipelines",
    "author": null,
    "author_email": "AJ Friend <ajfriend@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/48/e5/ed44ea142f2f448e591fc5976aff4e92e3e8c0f461836336f7c6251c1d4c/duckboat-0.15.0.tar.gz",
    "platform": null,
    "description": "# Duckboat\n\n*Ugly to some, but gets the job done.*\n\n[GitHub](https://github.com/ajfriend/duckboat) | [Docs](https://ajfriend.github.io/duckboat/) | [PyPI](https://pypi.org/project/duckboat/)\n\nDuckboat is a SQL-based Python dataframe library for ergonomic interactive\ndata analysis and exploration.\n\n\n```python\npip install duckboat\n```\n\nDuckboat allows you to chain SQL snippets (meaning you can usually omit `select *` and `from ...`)\nto incrementally and lazily build up complex queries.\n\nDuckboat is a light wrapper around the\n[DuckDB relational API](https://duckdb.org/docs/api/python/relational_api),\nso\nexpressions are evaluated lazily and optimized by DuckDB prior to execution.\nThe resulting queries are fast, avoiding the need to materialize intermediate tables or\nperform data transfers.\nYou can leverage all the SQL syntax improvements provided by DuckDB:\n[1](https://duckdb.org/2022/05/04/friendlier-sql.html)\n[2](https://duckdb.org/2023/08/23/even-friendlier-sql.html)\n[3](https://duckdb.org/docs/sql/dialect/friendly_sql.html)\n\n## Examples\n\n```python\nimport duckboat as uck\n\ncsv = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'\n\nuck.Table(csv).do(\n    \"where sex = 'female' \",\n    'where year > 2008',\n    'select *, cast(body_mass_g as double) as grams',\n    'select species, island, avg(grams) as avg_grams group by 1,2',\n    'select * replace (round(avg_grams, 1) as avg_grams)',\n    'order by avg_grams',\n)\n```\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502  species  \u2502  island   \u2502 avg_grams \u2502\n\u2502  varchar  \u2502  varchar  \u2502  double   \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 Adelie    \u2502 Torgersen \u2502    3193.8 \u2502\n\u2502 Adelie    \u2502 Dream     \u2502    3357.5 \u2502\n\u2502 Adelie    \u2502 Biscoe    \u2502    3446.9 \u2502\n\u2502 Chinstrap \u2502 Dream     \u2502    3522.9 \u2502\n\u2502 Gentoo    \u2502 Biscoe    \u2502    4786.3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n### To and from other data formats\n\nWe can translate to and from other data formats like Pandas DataFrames, Polars, or Arrow Tables.\n\n```python\nimport pandas as pd\n\ndf = pd.DataFrame({'a': [0]})\nt = uck.Table(df)\nt\n```\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   a   \u2502\n\u2502 int64 \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502     0 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\nTranslate back to a pandas dataframe with any of the following:\n\n```python\nt.df()\nt.hold('pandas')\nt.do('pandas')\n```\n\n\n### Chaining expressions\n\nYou can chain calls to `Table.do()`:\n\n\n```python\nf = 'select a + 1 as a'\nt.do(f).do(f).do(f)\n```\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   a   \u2502\n\u2502 int64 \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502     3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\nAlternatively, `Table.do()` accepts a sequence of arguments:\n\n```\nt.do(f, f, f)\n```\n\nIt also accepts lists of expressions, and will apply them recursively:\n\n```python\nfs = [f, f, f]\nt.do(fs)\n```\n\nNote, you could also still call this as:\n\n```python\nt.do(*fs)\n```\n\nUse lists to group expressions, which Duckboat will apply recursively:\n\n```python\nt.do(f, [f], [f, [[f, f], f]])\n```\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   a   \u2502\n\u2502 int64 \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502     6 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\nDuckboat will also apply functions:\n\n```python\ndef foo(x):\n    return x.do('select a + 2 as a')\n\n# the following are equivalent\nfoo(t)\nt.do(foo)\n```\n\nOf course, you can mix functions, SQL strings, and lists:\n\n```python\nt.do([foo, f])\n```\n\n\n### Databases and joins\n\nTODO\n\n### Extravagant affordances\n\nTODO\n\n\n### Objects\n\n(probably in the docstrings, rather than the readme)\n\n#### Table\n\nThe core functionality comes from `.sql`, where we allow snippets.\n\"Shell\" functionality comes from the `duckboat.do()` method, allowing for things like...\n\n#### Database\n\nThe core functionality comes from the `.sql`, which loads *only* the Tables listed. No other Python objects are loaded.\nA full duckdb sql expression is expected here, with table names provided explicitly.\n\n\"Shell\" functionality comes from the `duckboat.do()` function/method, allowing for things like...\n\n\n#### eager\n\nThe duckboat library makes some efforts to protect against unintentionally evaluating expressions eagerly, rather\nthan letting them rest lazily. For example, calling the `__repr__` method on a `Table` or `Database` will\ntrigger an evaluation of that object, which could include pulling a large table, or showing a large intermediate result.\n\nin `ipython` or `jupyterlab`, we can typically avoid this by not ending a cell with an object. doing so\ntriggers the objects `__repr__`. However, these tactics don't work when using an IDE like Positron,\nwhich eagerly inspects objects in the namespace to provide insight into what you're working with. This is often useful,\nbut not always what you want when working with large datasets or expensive computations.\n\n## Philosophy\n\nThis approach results in a mixture of Python and SQL that, I think, is semantically very similar to\n[Google's Pipe Syntax for SQL](https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/):\nWe can leverage our existing knowledge of SQL, while making a few small changes to make it more ergonomic and composable.\n\nWhen doing interactive data analysis, I find this approach easier to read and write than\nfluent APIs (like in [Polars](https://pola.rs/) or [Ibis](https://ibis-project.org/)) or typical [Pandas](https://pandas.pydata.org/) code.\nIf some operation is easier in other libraries, Duckboat makes it straightforward translate between them, either directly or through Apache Arrow.\n\n## Feedback\n\nI'd love to hear any feedback on the approach here, so feel free to reach out through\n[Issues](https://github.com/ajfriend/duckboat/issues)\nor\n[Discussions](https://github.com/ajfriend/duckboat/discussions).\n",
    "bugtrack_url": null,
    "license": "BSD 3-Clause License  Copyright (c) 2024, AJ Friend  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.",
    "summary": "A SQL-based Python dataframe library for ergonomic interactive data analysis and exploration.",
    "version": "0.15.0",
    "project_urls": {
        "documentation": "https://ajfriend.github.io/duckboat",
        "homepage": "https://github.com/ajfriend/duckboat",
        "pypi": "https://pypi.org/project/duckboat",
        "repository": "https://github.com/ajfriend/duckboat"
    },
    "split_keywords": [
        "etl",
        " sql",
        " data-wrangling",
        " duckdb",
        " pipelines"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c97fd4255fccb4ae03f370c2cf8d788b30df7e2ede737b86cc92fa21a45bc945",
                "md5": "d14cc47d5852b3c02fd2124b9e7ffd06",
                "sha256": "fb11f4da52d4ec090240d7dfed87326ef09b164ea100a549092c8f320ddeea75"
            },
            "downloads": -1,
            "filename": "duckboat-0.15.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d14cc47d5852b3c02fd2124b9e7ffd06",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 10726,
            "upload_time": "2024-12-29T02:40:09",
            "upload_time_iso_8601": "2024-12-29T02:40:09.550970Z",
            "url": "https://files.pythonhosted.org/packages/c9/7f/d4255fccb4ae03f370c2cf8d788b30df7e2ede737b86cc92fa21a45bc945/duckboat-0.15.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "48e5ed44ea142f2f448e591fc5976aff4e92e3e8c0f461836336f7c6251c1d4c",
                "md5": "f597e517ba83214f301866498d4e6d91",
                "sha256": "8f52f6b71a2ef75391b1a707deccbd15ab39951cfda622e650b9ff893bb7365f"
            },
            "downloads": -1,
            "filename": "duckboat-0.15.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f597e517ba83214f301866498d4e6d91",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 8041,
            "upload_time": "2024-12-29T02:40:13",
            "upload_time_iso_8601": "2024-12-29T02:40:13.359764Z",
            "url": "https://files.pythonhosted.org/packages/48/e5/ed44ea142f2f448e591fc5976aff4e92e3e8c0f461836336f7c6251c1d4c/duckboat-0.15.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-29 02:40:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ajfriend",
    "github_project": "duckboat",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "duckboat"
}
        
Elapsed time: 0.36489s