magic-duckdb


Namemagic-duckdb JSON
Version 0.1.27 PyPI version JSON
download
home_pagehttps://github.com/iqmo-org/magic_duckdb
SummaryJupyter Cell and Line Magics for DuckDB
upload_time2024-03-30 15:46:18
maintainerNone
docs_urlNone
authoriqmo
requires_pythonNone
licenseNone
keywords jupyter cell magic duckdb line magic magic
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # magic_duckdb

DuckDB Cell (%%dql) and Line (%dql) Magics for Jupyter and VSCode

## Motivation

magic_duckdb was created to:

- Provide simple cell/line magics with minimal code and zero dependencies
- Match performance of DuckDB python API
- Be a simple starting point to add other useful features with minimal complexity
- Bundle useful features, like using OpenAI to improve SQL, sql formatting (beautifying) and explain analyze graphics

### Why not the %sql magics (jupysql or ipython-sql)?

The goal of this project is to expose the native features of duckdb, with minimal dependencies, such as exporting to arrow tables or using DuckDB relation objects. 

## Simplicity

The goal of this project was to provide minimal line & cell magics for DuckDB in Jupyter notebooks with minimal dependencies and as simply as possible.

The core code is concentrated in two places:

- magic.py: Barebones cell and line magic that parses arguments, and executes statements
- duckdb_mode.py: execute() calls the appropriate method based on the user selected output type.

Features that require additional dependencies, such as Jinja2 for the --jinja2 feature, are imported dynamically.

## Quick Start

```
%pip install magic_duckdb --upgrade --quiet
%load_ext magic_duckdb
%dql select * from range(100)
```

## Usage

```
Connection:
-cn <connection_string>: Create a new connection to a DuckDB. If passed without a query, this changes the global connection, otherwise used just for the current query.
    %dql -cn myfile.db
-co <connection_object>: Use an existing DuckDB Connection. If passed without a query, this changes the global connection, otherwise used just for the current query.
    con = duckdb.connect("somefile.db")
    %dql -co con
-d: Use the duckdb.default_connection. If passed without a query, this changes the global connection, otherwise used just for the current query.
-g | --getcon: Get the current connection
    con = %dql --getcon
--close: Close current connection

Modes:
-t <type> [default: df]: Selects the type for this request. If passed without a query, this changes the global default, otherwise used just for the current query.
-e <explain_mode>: Display the explain plan, or explain analyze, or AST

Options:
-l | --listtypes: Returns a list of available output types, for -t

-j | --jinja2: Process the SQL as a jinja template, using the user_ns environment.
    `%dql -j select * from {{var1}}`
-p | --params: Pass the specified parameter(s) as a SQL parameters
    `%dql -p obj1 select
-o <var>: Stores the resulting output in a variable named <var>
-tp <name> <value>: Pass a kwarg to the type function. Intended to be used to pass parameters to .show(). Must be passed on each call, not saved. 
    `%dql -t show -tp max_rows 10 <query>`

Extras:
--tables: Returns tables used by the query
-f: Format the string using npx sql-formatter
Other:
-ai / -aichat: Route request to OpenAI
    %dql -ai Fix selct star frm mytable

```

See [notebooks](https://github.com/iqmo-org/magic_duckdb/tree/main/notebooks) for usage examples.

## Usage Details

- `%dql -t [df | arrow | pl | relation | show | df_markdown] <query>`: Equivalent to - `connection.sql(query).<type>()`
- `%dql -e [explain | explain_analyze_tree | explain_analyze_json | explain_analyze_draw | analyze | ast_json | ast_tree | ast_draw] <query>`:
  - `explain` is equivalent to `connection.sql(query).explain()`
  - `explain_analyze_*` options enable profiling (`PRAGMA enable_profiling`), and use `connection.sql(query).explain(type='analyze')`
  - `explain_analyze_draw` requires [graphviz](https://graphviz.org/) to be installed on your system, and "dot" or "dot.exe" must be in your PATH or added via `magic.explain_analyze_graphviz.dot_path = 'path_to_dot'`. The graphviz python module must also be installed: `pip install graphviz`
  - The AST options use [json_serialize_sql](https://github.com/duckdb/duckdb/discussions/6922) to describe the SQL. `ast_json` displays the raw json result, `ast_tree` displays an indented tree, and `ast_draw` uses graphviz to draw a graphical version of the tree
- `%dql --format <query>` uses [sql-formatter](https://github.com/sql-formatter-org/sql-formatter). This is a javascript library, so it needs to be installed separately, although it's executed via npx so should be fine as long as you have npx / node in your path.
- `%dql --tables <query>` returns the list of tables used by the query, equivalent to: `duckdb.get_table_names("SELECT * FROM xyz")`
- `%dql [-ai | -aichat] fix <query>` passes the current schema to OpenAI and askes OpenAI to "fix" the query. An OpenAI developer key is required.

```
            # to set openai key
            from magic_duckdb.extras import sql_ai
            sql_ai.OPENAI_KEY = openai_key
```

## Autocompletion: Work in Progress

There are two different Autocompletion implementations, one for MatcherAPIv2 and the other (pre-ipython 8.6.0) for MatcherAPIv1. The MatcherAPIv2 version is tried first, and if it fails, the MatcherAPIv1 version is loaded. MatcherAPIv1 will not match the entire results of a cell: it's limited to a line by line match.

- Phrase completion: `%dql create <tab>` will show common phrases, such as `CREATE OR REPLACE`
- Pragma completion: `%dql PRAGMA <tab>` will list available pragma's.
- Table completion: `%dql select * from <tab>` will list available tables and Pandas DataFrames. This is triggered by a list of keywords (ie: from) that are expected to be followed. See magic_duckdb.autocomplete.common for the keywords.
- Column completion: `%dql select tablename.<tab> from tablename` will list available tables and Pandas DataFrames for tablename.

Notes:

- VScode will autocomplete differently than Jupyter: For instance, VScode autocompletes on spaces, Jupyter on tabs.
- Column completion only matches table or dataframe names. It doesn't look up aliases or analyze subqueries.
- This is meant as a proof of concept. The next step is to leverage duckdb's existing work: [4921](https://github.com/duckdb/duckdb/pull/4921) and test cases here: [shell-test.py](https://github.com/Mytherin/duckdb/blob/5f75cb90b478434f0a1811af0695ea3a186a67a8/tools/shell/shell-test.py)
- https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/hinterland/README.html: Enable code autocompletion menu for every keypress in a code cell, instead of only calling it with tab.

Autocompletion can be disabled with:

```
from magic_duckdb import magic
magic.ENABLE_AUTOCOMPLETE=False
%load_ext magic_duckdb
```

# Capturing output

Line Magics can be captured with assignment or -o:

```
# These are equivalent:
%dql -o varname <query>
varname = %dql <query>
```

Cell magics can only be captured with -o (var = %%dql doesn't work)

```
%%dql -o varname
<query>
```

To silence a cell, you can stack %%capture:

```
%%capture
%%iql -o bqldf
<query>
```

# Performance Comparison

The jupysql/sql-alchemy/duckdb-engine %sql magic was surprisingly slow when compared to magic_duckdb or duckdb. I didn't spend a lot of time evaluating this, so please do your own evaluation: my priority was keeping magic_duckdb simple by using duckdb directly.

### Versions

> Python: 3.9.16 (main, Mar 8 2023, 10:39:24) [MSC v.1916 64 bit (AMD64)]
> DuckDB: library_version source_id
> 0 0.8.1-dev51 e84cc1acb8
> Pandas : 2.0.1
> jupysql 0.7.5

### Test Setup

See [benchmarking.ipynb](https://github.com/iqmo-org/magic_duckdb/blob/main/notebooks/benchmarking.ipynb) for the test code.

### Results

|     | name                      |      1 |   1000 | 1000000 | description                                         |
| --- | ------------------------- | -----: | -----: | ------: | --------------------------------------------------- |
| 0   | test_magicddb_pandas      |   3.47 |   4.11 |   291.0 | %dql -t df query                                    |
| 1   | test_duckdb_execute_df    |   3.80 |   3.39 |   281.0 | con.execute(query).df()                             |
| 2   | test_duckdb_execute_arrow |   3.91 |   4.04 |   128.0 | con.execute(query).arrow()                          |
| 3   | test_magicddb_arrow       |   4.10 |   5.38 |   127.0 | %dql -t arrow query                                 |
| 4   | test_duckdb_sql_df        |   6.93 |   7.94 |   318.0 | con.sql(query).df()                                 |
| 5   | test_jupysql              | 321.00 | 256.00 |   547.0 | %config SqlMagic.autopandas = True <br/> %sql query |

Copyright &copy; 2023 Iqmo Corporation

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/iqmo-org/magic_duckdb",
    "name": "magic-duckdb",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "Jupyter, Cell Magic, DuckDB, Line Magic, Magic",
    "author": "iqmo",
    "author_email": "info@iqmo.com",
    "download_url": null,
    "platform": null,
    "description": "# magic_duckdb\n\nDuckDB Cell (%%dql) and Line (%dql) Magics for Jupyter and VSCode\n\n## Motivation\n\nmagic_duckdb was created to:\n\n- Provide simple cell/line magics with minimal code and zero dependencies\n- Match performance of DuckDB python API\n- Be a simple starting point to add other useful features with minimal complexity\n- Bundle useful features, like using OpenAI to improve SQL, sql formatting (beautifying) and explain analyze graphics\n\n### Why not the %sql magics (jupysql or ipython-sql)?\n\nThe goal of this project is to expose the native features of duckdb, with minimal dependencies, such as exporting to arrow tables or using DuckDB relation objects. \n\n## Simplicity\n\nThe goal of this project was to provide minimal line & cell magics for DuckDB in Jupyter notebooks with minimal dependencies and as simply as possible.\n\nThe core code is concentrated in two places:\n\n- magic.py: Barebones cell and line magic that parses arguments, and executes statements\n- duckdb_mode.py: execute() calls the appropriate method based on the user selected output type.\n\nFeatures that require additional dependencies, such as Jinja2 for the --jinja2 feature, are imported dynamically.\n\n## Quick Start\n\n```\n%pip install magic_duckdb --upgrade --quiet\n%load_ext magic_duckdb\n%dql select * from range(100)\n```\n\n## Usage\n\n```\nConnection:\n-cn <connection_string>: Create a new connection to a DuckDB. If passed without a query, this changes the global connection, otherwise used just for the current query.\n    %dql -cn myfile.db\n-co <connection_object>: Use an existing DuckDB Connection. If passed without a query, this changes the global connection, otherwise used just for the current query.\n    con = duckdb.connect(\"somefile.db\")\n    %dql -co con\n-d: Use the duckdb.default_connection. If passed without a query, this changes the global connection, otherwise used just for the current query.\n-g | --getcon: Get the current connection\n    con = %dql --getcon\n--close: Close current connection\n\nModes:\n-t <type> [default: df]: Selects the type for this request. If passed without a query, this changes the global default, otherwise used just for the current query.\n-e <explain_mode>: Display the explain plan, or explain analyze, or AST\n\nOptions:\n-l | --listtypes: Returns a list of available output types, for -t\n\n-j | --jinja2: Process the SQL as a jinja template, using the user_ns environment.\n    `%dql -j select * from {{var1}}`\n-p | --params: Pass the specified parameter(s) as a SQL parameters\n    `%dql -p obj1 select\n-o <var>: Stores the resulting output in a variable named <var>\n-tp <name> <value>: Pass a kwarg to the type function. Intended to be used to pass parameters to .show(). Must be passed on each call, not saved. \n    `%dql -t show -tp max_rows 10 <query>`\n\nExtras:\n--tables: Returns tables used by the query\n-f: Format the string using npx sql-formatter\nOther:\n-ai / -aichat: Route request to OpenAI\n    %dql -ai Fix selct star frm mytable\n\n```\n\nSee [notebooks](https://github.com/iqmo-org/magic_duckdb/tree/main/notebooks) for usage examples.\n\n## Usage Details\n\n- `%dql -t [df | arrow | pl | relation | show | df_markdown] <query>`: Equivalent to - `connection.sql(query).<type>()`\n- `%dql -e [explain | explain_analyze_tree | explain_analyze_json | explain_analyze_draw | analyze | ast_json | ast_tree | ast_draw] <query>`:\n  - `explain` is equivalent to `connection.sql(query).explain()`\n  - `explain_analyze_*` options enable profiling (`PRAGMA enable_profiling`), and use `connection.sql(query).explain(type='analyze')`\n  - `explain_analyze_draw` requires [graphviz](https://graphviz.org/) to be installed on your system, and \"dot\" or \"dot.exe\" must be in your PATH or added via `magic.explain_analyze_graphviz.dot_path = 'path_to_dot'`. The graphviz python module must also be installed: `pip install graphviz`\n  - The AST options use [json_serialize_sql](https://github.com/duckdb/duckdb/discussions/6922) to describe the SQL. `ast_json` displays the raw json result, `ast_tree` displays an indented tree, and `ast_draw` uses graphviz to draw a graphical version of the tree\n- `%dql --format <query>` uses [sql-formatter](https://github.com/sql-formatter-org/sql-formatter). This is a javascript library, so it needs to be installed separately, although it's executed via npx so should be fine as long as you have npx / node in your path.\n- `%dql --tables <query>` returns the list of tables used by the query, equivalent to: `duckdb.get_table_names(\"SELECT * FROM xyz\")`\n- `%dql [-ai | -aichat] fix <query>` passes the current schema to OpenAI and askes OpenAI to \"fix\" the query. An OpenAI developer key is required.\n\n```\n            # to set openai key\n            from magic_duckdb.extras import sql_ai\n            sql_ai.OPENAI_KEY = openai_key\n```\n\n## Autocompletion: Work in Progress\n\nThere are two different Autocompletion implementations, one for MatcherAPIv2 and the other (pre-ipython 8.6.0) for MatcherAPIv1. The MatcherAPIv2 version is tried first, and if it fails, the MatcherAPIv1 version is loaded. MatcherAPIv1 will not match the entire results of a cell: it's limited to a line by line match.\n\n- Phrase completion: `%dql create <tab>` will show common phrases, such as `CREATE OR REPLACE`\n- Pragma completion: `%dql PRAGMA <tab>` will list available pragma's.\n- Table completion: `%dql select * from <tab>` will list available tables and Pandas DataFrames. This is triggered by a list of keywords (ie: from) that are expected to be followed. See magic_duckdb.autocomplete.common for the keywords.\n- Column completion: `%dql select tablename.<tab> from tablename` will list available tables and Pandas DataFrames for tablename.\n\nNotes:\n\n- VScode will autocomplete differently than Jupyter: For instance, VScode autocompletes on spaces, Jupyter on tabs.\n- Column completion only matches table or dataframe names. It doesn't look up aliases or analyze subqueries.\n- This is meant as a proof of concept. The next step is to leverage duckdb's existing work: [4921](https://github.com/duckdb/duckdb/pull/4921) and test cases here: [shell-test.py](https://github.com/Mytherin/duckdb/blob/5f75cb90b478434f0a1811af0695ea3a186a67a8/tools/shell/shell-test.py)\n- https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/hinterland/README.html: Enable code autocompletion menu for every keypress in a code cell, instead of only calling it with tab.\n\nAutocompletion can be disabled with:\n\n```\nfrom magic_duckdb import magic\nmagic.ENABLE_AUTOCOMPLETE=False\n%load_ext magic_duckdb\n```\n\n# Capturing output\n\nLine Magics can be captured with assignment or -o:\n\n```\n# These are equivalent:\n%dql -o varname <query>\nvarname = %dql <query>\n```\n\nCell magics can only be captured with -o (var = %%dql doesn't work)\n\n```\n%%dql -o varname\n<query>\n```\n\nTo silence a cell, you can stack %%capture:\n\n```\n%%capture\n%%iql -o bqldf\n<query>\n```\n\n# Performance Comparison\n\nThe jupysql/sql-alchemy/duckdb-engine %sql magic was surprisingly slow when compared to magic_duckdb or duckdb. I didn't spend a lot of time evaluating this, so please do your own evaluation: my priority was keeping magic_duckdb simple by using duckdb directly.\n\n### Versions\n\n> Python: 3.9.16 (main, Mar 8 2023, 10:39:24) [MSC v.1916 64 bit (AMD64)]\n> DuckDB: library_version source_id\n> 0 0.8.1-dev51 e84cc1acb8\n> Pandas : 2.0.1\n> jupysql 0.7.5\n\n### Test Setup\n\nSee [benchmarking.ipynb](https://github.com/iqmo-org/magic_duckdb/blob/main/notebooks/benchmarking.ipynb) for the test code.\n\n### Results\n\n|     | name                      |      1 |   1000 | 1000000 | description                                         |\n| --- | ------------------------- | -----: | -----: | ------: | --------------------------------------------------- |\n| 0   | test_magicddb_pandas      |   3.47 |   4.11 |   291.0 | %dql -t df query                                    |\n| 1   | test_duckdb_execute_df    |   3.80 |   3.39 |   281.0 | con.execute(query).df()                             |\n| 2   | test_duckdb_execute_arrow |   3.91 |   4.04 |   128.0 | con.execute(query).arrow()                          |\n| 3   | test_magicddb_arrow       |   4.10 |   5.38 |   127.0 | %dql -t arrow query                                 |\n| 4   | test_duckdb_sql_df        |   6.93 |   7.94 |   318.0 | con.sql(query).df()                                 |\n| 5   | test_jupysql              | 321.00 | 256.00 |   547.0 | %config SqlMagic.autopandas = True <br/> %sql query |\n\nCopyright &copy; 2023 Iqmo Corporation\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Jupyter Cell and Line Magics for DuckDB",
    "version": "0.1.27",
    "project_urls": {
        "Homepage": "https://github.com/iqmo-org/magic_duckdb"
    },
    "split_keywords": [
        "jupyter",
        " cell magic",
        " duckdb",
        " line magic",
        " magic"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "936280d7f6d235b8aece35f9ceea0133340b5844984e8849a8224dc022bcdecb",
                "md5": "2b3fcd27d5c31506abf7ffc30ea8f822",
                "sha256": "a9f74bb068051862d0f4524679786cf30932b306150e1f3819152448b02f3f97"
            },
            "downloads": -1,
            "filename": "magic_duckdb-0.1.27-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2b3fcd27d5c31506abf7ffc30ea8f822",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 21990,
            "upload_time": "2024-03-30T15:46:18",
            "upload_time_iso_8601": "2024-03-30T15:46:18.754222Z",
            "url": "https://files.pythonhosted.org/packages/93/62/80d7f6d235b8aece35f9ceea0133340b5844984e8849a8224dc022bcdecb/magic_duckdb-0.1.27-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-30 15:46:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "iqmo-org",
    "github_project": "magic_duckdb",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "magic-duckdb"
}
        
Elapsed time: 0.62546s