[![Python Version](https://img.shields.io/badge/python-3.8-blue.svg)](https://github.com/getindata/streaming_jupyter_integrations)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![SemVer](https://img.shields.io/badge/semver-2.0.0-green)](https://semver.org/)
[![PyPI version](https://badge.fury.io/py/streaming-jupyter-integrations.svg)](https://pypi.org/project/streaming-jupyter-integrations/)
[![Downloads](https://pepy.tech/badge/streaming_jupyter_integrations)](https://pepy.tech/badge/streaming_jupyter_integrations)
# Streaming Jupyter Integrations
The Streaming Jupyter Integrations project provides a set of magics for interactively running _Flink SQL_ jobs in [Jupyter](https://jupyter.org/) Notebooks.
## Installation
To use these magics, install our pip package along with `jupyterlab-lsp`:
```shell
python3 -m pip install jupyterlab-lsp streaming-jupyter-integrations
```
## Usage
Register the magics by running the following in the first cell of a notebook with a running IPython kernel:
```python
%load_ext streaming_jupyter_integrations.magics
```
Then choose an _execution mode_ and an _execution target_:
```python
%flink_connect --execution-mode [mode] --execution-target [target]
```
By default, the `streaming` execution mode and `local` execution target are used.
```python
%flink_connect
```
### Execution mode
Currently, Flink supports two execution modes: _batch_ and _streaming_. Please see the
[Flink documentation](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/)
for more details.
To specify the execution mode, add the `--execution-mode` parameter, for instance:
```python
%flink_connect --execution-mode batch
```
### Execution target
Streaming Jupyter Integrations supports three execution targets:
* Local
* Remote
* YARN Session
#### Local execution target
Running Flink in `local` mode will start a MiniCluster in a local JVM with parallelism 1.
To run Flink locally, use:
```python
%flink_connect --execution-target local
```
Alternatively, since the execution target is `local` by default, use:
```python
%flink_connect
```
You can specify the port of the local JobManager (8099 by default). This is especially useful if you run multiple
notebooks in a single JupyterLab.
```python
%flink_connect --execution-target local --local-port 8123
```
#### Remote execution target
Running Flink in `remote` mode will connect to an existing Flink session cluster. Besides setting `--execution-target`
to `remote`, you also need to specify `--remote-hostname` and `--remote-port`, pointing to the Flink JobManager's
REST API address.
```python
%flink_connect \
    --execution-target remote \
    --remote-hostname example.com \
    --remote-port 8888
```
#### YARN session execution target
Running Flink in `yarn-session` mode will connect to an existing Flink session cluster running on YARN. You may specify
the hostname and port of the YARN Resource Manager (`--resource-manager-hostname` and `--resource-manager-port`).
If the Resource Manager address is not provided, it is assumed that the notebook runs on the same node as the Resource Manager.
You can also specify the YARN application ID (`--yarn-application-id`) to which the notebook will connect.
If `--yarn-application-id` is not specified and exactly one YARN application is running on the cluster, the notebook will
try to connect to it. Otherwise, it will fail.
Connecting to a remote Flink session cluster running on a remote YARN cluster:
```python
%flink_connect \
    --execution-target yarn-session \
    --resource-manager-hostname example.com \
    --resource-manager-port 8888 \
    --yarn-application-id application_1666172784500_0001
```
Connecting to a Flink session cluster running on a YARN cluster:
```python
%flink_connect \
    --execution-target yarn-session \
    --yarn-application-id application_1666172784500_0001
```
Connecting to a Flink session cluster running on a dedicated YARN cluster:
```python
%flink_connect --execution-target yarn-session
```
## Variables
The magics allow for dynamic variable substitution in _Flink SQL_ cells.
```python
my_variable = 1
```
```sql
SELECT * FROM some_table WHERE product_id = {my_variable}
```
Moreover, you can mark sensitive variables, such as passwords, so that they are read from environment variables or user input every time the cell runs:
```sql
CREATE TABLE MyUserTable (
  id BIGINT,
  name STRING,
  age INT,
  status BOOLEAN,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydatabase',
  'table-name' = 'users',
  'username' = '${my_username}',
  'password' = '${my_password}'
);
```
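For illustration, here is a minimal sketch of the substitution behavior described above. It is not the package's actual implementation, and the `resolve_sensitive` helper is hypothetical; it assumes `${...}` placeholders are looked up in the environment first, falling back to an interactive prompt:
```python
# Hypothetical sketch of the described behavior, not the package's code:
# resolve ${...} placeholders from the environment, prompting otherwise.
import os
import re
from getpass import getpass

def resolve_sensitive(sql: str) -> str:
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        # Prefer an environment variable; otherwise ask on every run.
        return os.environ.get(name) or getpass(f"Enter value for {name}: ")
    return re.sub(r"\$\{(\w+)\}", lookup, sql)
```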
### `%%flink_execute` command
The command allows you to use the Python DataStream API and Table API. Two handles are exposed, one for each API:
`stream_env` and `table_env`, respectively.
Table API example:
```python
%%flink_execute
query = """
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
"""
execution_output = table_env.execute_sql(query)
```
When the Table API is used, the final result has to be assigned to the `execution_output` variable.
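Note that `table_env.execute_sql(...)` returns a PyFlink `TableResult`. As a point of reference, a standalone PyFlink script (outside the magics, which handle the result for you) could consume such a result directly; a minimal sketch, assuming plain PyFlink:
```python
# Standalone PyFlink sketch (no magics): consume a TableResult directly.
from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
result = table_env.execute_sql("SELECT 'hello' AS greeting")
with result.collect() as rows:  # closeable iterator over Row objects
    for row in rows:
        print(row)
```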
DataStream API example:
```python
%%flink_execute
from pyflink.common.typeinfo import Types

execution_output = stream_env.from_collection(
    collection=[(1, 'aaa'), (2, 'bb'), (3, 'cccc')],
    type_info=Types.ROW([Types.INT(), Types.STRING()])
)
```
When the DataStream API is used, the final result has to be assigned to the `execution_output` variable. Please note that
the pipeline does not end with `.execute()`; the execution is triggered by the Jupyter magics under the hood.
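For comparison, a standalone PyFlink script outside the magics would have to trigger execution explicitly; a minimal sketch of the same pipeline, assuming plain PyFlink:
```python
# Standalone PyFlink sketch (no magics): execution must be triggered
# explicitly with stream_env.execute(), which the magics do for you.
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

stream_env = StreamExecutionEnvironment.get_execution_environment()
stream = stream_env.from_collection(
    collection=[(1, 'aaa'), (2, 'bb'), (3, 'cccc')],
    type_info=Types.ROW([Types.INT(), Types.STRING()])
)
stream.print()
stream_env.execute("standalone example")
```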
---
## Local development
Note: You will need Node.js to build the extension package.
The `jlpm` command is JupyterLab's pinned version of
[yarn](https://yarnpkg.com/) that is installed with JupyterLab. You may use
`yarn` or `npm` in lieu of `jlpm` below. In order to use `jlpm`, you have to
have `jupyterlab` installed (e.g., by `brew install jupyterlab`, if you use
Homebrew as your package manager).
```bash
# Clone the repo to your local environment
# Change directory to the flink_sql_lsp_extension directory
# Install package in development mode
pip install -e .
# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite
# Rebuild the extension's TypeScript source after making changes
jlpm build
```
The project uses [pre-commit](https://pre-commit.com/) hooks to ensure code quality, mostly by linting.
To use it, [install pre-commit](https://pre-commit.com/#install) and then run
```shell
pre-commit install --install-hooks
```
From then on, the files you modify will be linted on every commit attempt.