streaming-jupyter-integrations

* **Name**: streaming-jupyter-integrations
* **Version**: 0.14.4
* **Summary**: Jupyter Notebook Flink magics
* **Home page**: https://github.com/getindata/streaming-jupyter-integrations
* **Author**: GetInData
* **Requires Python**: >=3.7
* **License**: Apache Software License (Apache 2.0)
* **Keywords**: jupyter, flink, sql, ipython
* **Upload time**: 2023-09-22 20:15:15

            [![Python Version](https://img.shields.io/badge/python-3.8-blue.svg)](https://github.com/getindata/streaming_jupyter_integrations)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![SemVer](https://img.shields.io/badge/semver-2.0.0-green)](https://semver.org/)
[![PyPI version](https://badge.fury.io/py/streaming-jupyter-integrations.svg)](https://pypi.org/project/streaming-jupyter-integrations/)
[![Downloads](https://pepy.tech/badge/streaming_jupyter_integrations)](https://pepy.tech/badge/streaming_jupyter_integrations)

# Streaming Jupyter Integrations

The Streaming Jupyter Integrations project provides a set of magics for interactively running _Flink SQL_ jobs in [Jupyter](https://jupyter.org/) Notebooks.

## Installation

In order to use these magics, install the PyPI package along with `jupyterlab-lsp`:

```shell
python3 -m pip install jupyterlab-lsp streaming-jupyter-integrations
```

## Usage

Register the magics in the first cell of a notebook running an IPython kernel:

```python
%load_ext streaming_jupyter_integrations.magics
```

Then decide which _execution mode_ and _execution target_ to use.

```python
%flink_connect --execution-mode [mode] --execution-target [target]
```

By default, the `streaming` execution mode and `local` execution target are used.

```python
%flink_connect
```
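
Thus, the call above is equivalent to passing both defaults explicitly:
```python
%flink_connect --execution-mode streaming --execution-target local
```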

### Execution mode

Currently, Flink supports two execution modes: _batch_ and _streaming_. Please see
[Flink documentation](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/)
for more details.

In order to specify the execution mode, add the `--execution-mode` parameter, for instance:
```python
%flink_connect --execution-mode batch
```

### Execution target

Streaming Jupyter Integrations supports three execution targets:
* Local
* Remote
* YARN Session

#### Local execution target

Running Flink in `local` mode will start a MiniCluster in a local JVM with parallelism 1.

In order to run Flink locally, use:
```python
%flink_connect --execution-target local
```

Alternatively, since the execution target is `local` by default, use:
```python
%flink_connect
```

You can specify the port of the local JobManager (8099 by default). This is especially useful if you run multiple
notebooks in a single JupyterLab.

```python
%flink_connect --execution-target local --local-port 8123
```


#### Remote execution target

Running Flink in `remote` mode will connect to an existing Flink session cluster. Besides setting `--execution-target`
to `remote`, you also need to specify `--remote-hostname` and `--remote-port`, pointing to the Flink JobManager's
REST API address.

```python
%flink_connect \
    --execution-target remote \
    --remote-hostname example.com \
    --remote-port 8888
```

#### YARN session execution target

Running Flink in `yarn-session` mode will connect to an existing Flink session cluster running on YARN. You may specify
the hostname and port of the YARN Resource Manager (`--resource-manager-hostname` and `--resource-manager-port`).
If the Resource Manager address is not provided, the notebook is assumed to run on the same node as the Resource Manager.
You can also specify the YARN application ID (`--yarn-application-id`) to which the notebook will connect.
If `--yarn-application-id` is not specified and exactly one YARN application is running on the cluster, the notebook will
try to connect to it; otherwise, it will fail.

Connecting to a remote Flink session cluster running on a remote YARN cluster:
```python
%flink_connect \
    --execution-target yarn-session \
    --resource-manager-hostname example.com \
    --resource-manager-port 8888 \
    --yarn-application-id application_1666172784500_0001
```

Connecting to a Flink session cluster running on a YARN cluster:
```python
%flink_connect \
    --execution-target yarn-session \
    --yarn-application-id application_1666172784500_0001
```

Connecting to a Flink session cluster running on a dedicated YARN cluster:
```python
%flink_connect --execution-target yarn-session
```

## Variables
The magics support dynamic variable substitution in _Flink SQL_ cells.
```python
my_variable = 1
```
```sql
SELECT * FROM some_table WHERE product_id = {my_variable}
```
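
Presumably, any Python variable in the notebook's namespace can be substituted this way, including values computed in earlier cells. A minimal sketch (the `orders` table, `order_date` column, and `start_date` variable are illustrative):
```python
from datetime import date, timedelta

# Computed in a regular Python cell...
start_date = (date.today() - timedelta(days=7)).isoformat()

# ...then referenced in a subsequent SQL cell via the same {} syntax:
#   SELECT * FROM orders WHERE order_date >= '{start_date}'
```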

Moreover, you can mark sensitive variables, such as passwords, so that they are read from environment variables or user input every time the cell is run:
```sql
CREATE TABLE MyUserTable (
  id BIGINT,
  name STRING,
  age INT,
  status BOOLEAN,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
   'connector' = 'jdbc',
   'url' = 'jdbc:mysql://localhost:3306/mydatabase',
   'table-name' = 'users',
   'username' = '${my_username}',
   'password' = '${my_password}'
);
```
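
A minimal sketch of supplying such values non-interactively, assuming the `${...}` lookup falls back to the kernel process's environment variables (the variable names follow the example above; the values are placeholders):
```python
import os

# Assumption: ${my_username} and ${my_password} are resolved from the
# kernel's environment when they are not entered interactively.
os.environ["my_username"] = "app_user"
os.environ["my_password"] = "change-me"
```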

### `%%flink_execute` command

The command allows using the Python DataStream API and Table API. Two handles are exposed, one for each API:
`stream_env` and `table_env`, respectively.

Table API example:
```python
%%flink_execute
query = """
    SELECT   user_id, COUNT(*)
    FROM     orders
    GROUP BY user_id
"""
execution_output = table_env.execute_sql(query)
```

When the Table API is used, the final result has to be assigned to the `execution_output` variable.
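
Statements other than queries can be submitted through the same handle; a sketch of a DDL-plus-INSERT pipeline (the `user_counts` table and the `print` connector are illustrative, and `orders` is assumed to be registered as in the query above):
```python
%%flink_execute
# Illustrative sketch: register a sink table, then insert into it.
table_env.execute_sql("""
    CREATE TEMPORARY TABLE user_counts (
        user_id BIGINT,
        cnt     BIGINT
    ) WITH ('connector' = 'print')
""")
execution_output = table_env.execute_sql("""
    INSERT INTO user_counts
    SELECT user_id, COUNT(*) FROM orders GROUP BY user_id
""")
```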

DataStream API example:
```python
%%flink_execute
from pyflink.common.typeinfo import Types

execution_output = stream_env.from_collection(
    collection=[(1, 'aaa'), (2, 'bb'), (3, 'cccc')],
    type_info=Types.ROW([Types.INT(), Types.STRING()])
)
```

When the DataStream API is used, the final result also has to be assigned to the `execution_output` variable. Please note that
the pipeline does not end with `.execute()`; the execution is triggered by the Jupyter magic under the hood.
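
The same pattern holds for longer transformation chains; a minimal sketch (the doubling logic is purely illustrative):
```python
%%flink_execute
from pyflink.common.typeinfo import Types

numbers = stream_env.from_collection(
    collection=[1, 2, 3],
    type_info=Types.INT()
)
# The chain ends with the assignment; no .execute() call is needed.
execution_output = numbers.map(lambda x: x * 2, output_type=Types.INT())
```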

---

## Local development

Note: You will need Node.js to build the extension package.

The `jlpm` command is JupyterLab's pinned version of
[yarn](https://yarnpkg.com/) that is installed with JupyterLab. You may use
`yarn` or `npm` in lieu of `jlpm` below. In order to use `jlpm`, you have to
have `jupyterlab` installed (e.g., by `brew install jupyterlab`, if you use
Homebrew as your package manager).

```bash
# Clone the repo to your local environment
# Change directory to the flink_sql_lsp_extension directory
# Install package in development mode
pip install -e .
# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite
# Rebuild extension TypeScript source after making changes
jlpm build
```

The project uses [pre-commit](https://pre-commit.com/) hooks to ensure code quality, mostly by linting.
To use it, [install pre-commit](https://pre-commit.com/#install) and then run
```shell
pre-commit install --install-hooks
```
From then on, it will lint the files you have modified on every commit attempt.

            
