![](https://github.com/Wh1isper/sparglim/actions/workflows/python-package.yml/badge.svg)
![](https://img.shields.io/pypi/dm/sparglim)
![](https://img.shields.io/github/last-commit/wh1isper/sparglim)
![](https://img.shields.io/pypi/pyversions/sparglim)
![](https://img.shields.io/github/license/wh1isper/sparglim)
![](https://img.shields.io/github/v/release/wh1isper/sparglim?logo=github)
![](https://img.shields.io/github/v/release/wh1isper/sparglim?include_prereleases&label=pre-release&logo=github)
# Sparglim ✨
Sparglim aims to provide a clean solution for PySpark applications in cloud-native scenarios (on K8S, with Spark Connect Server, etc.).
**This is a fledgling project, looking forward to any PRs, Feature Requests and Discussions!**
🌟✨⭐ Star to support!
## Quick Start
Run JupyterLab with the `sparglim` docker image:
```bash
docker run \
  -it \
  -p 8888:8888 \
  wh1isper/jupyterlab-sparglim
```
Access `http://localhost:8888` in your browser to use JupyterLab with `sparglim`. Then you can try [SQL Magic](#sql-magic).
Run and daemonize a Spark Connect Server:
```bash
docker run \
  -it \
  -p 15002:15002 \
  -p 4040:4040 \
  wh1isper/sparglim-server
```
Access `http://localhost:4040` for the Spark UI and `sc://localhost:15002` for the Spark Connect Server. [Use sparglim to set up a SparkSession that connects to the Spark Connect Server](#connect-to-spark-connect-server).
## Install: `pip install sparglim[all]`
- Install only for configuring and daemonizing a Spark Connect Server: `pip install sparglim`
- Install for a PySpark app: `pip install sparglim[pyspark]`
- Install for using magic within IPython/Jupyter (also installs PySpark): `pip install sparglim[magic]`
- **Install for all of the above** (e.g. using magic in JupyterLab on K8S): `pip install sparglim[all]`
## Features
- [Config Spark via environment variables](./config.md)
- `%SQL` and `%%SQL` magic for executing Spark SQL in IPython/Jupyter
  - SQL statements can span multiple lines; use `;` to separate statements
  - Supports configuring a `connect client`, see [Spark Connect Overview](https://spark.apache.org/docs/latest/spark-connect-overview.html#spark-connect-overview)
  - *TODO: visualize the result of a SQL statement (Spark DataFrame)*
- `sparglim-server` for daemonizing a Spark Connect Server
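The `;`-separated multi-statement format can be pictured with a naive splitter. This is purely an illustration of the format, not sparglim's actual parser (which may handle edge cases differently):

```python
def split_statements(cell):
    # Naive illustration: split on ';' and drop empty fragments.
    # A real SQL parser must also handle ';' inside string literals.
    return [s.strip() for s in cell.split(";") if s.strip()]

statements = split_statements("SELECT\n  *\nFROM\n  tb;\nSHOW TABLES;")
print(statements)  # ['SELECT\n  *\nFROM\n  tb', 'SHOW TABLES']
```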
## Use cases
### Basic
```python
from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row

# Create a local[*] spark session with s3 & kerberos config
spark = ConfigBuilder().get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
```
### Building a PySpark App
To configure Spark on K8S for data exploration, see [examples/jupyter-sparglim-on-k8s](./examples/jupyter-sparglim-on-k8s)
To configure Spark for an ELT application/service, see the project [pyspark-sampling](https://github.com/Wh1isper/pyspark-sampling/)
### Deploy Spark Connect Server on K8S (And Connect to it)
To daemonize a Spark Connect Server on K8S, see [examples/sparglim-server](./examples/sparglim-server)
To daemonize a Spark Connect Server on K8S and connect to it in JupyterLab, see [examples/jupyter-sparglim-sc](./examples/jupyter-sparglim-sc)
### Connect to Spark Connect Server
The only thing you need to do is set the `SPARGLIM_REMOTE` environment variable; the format is `sc://host:port`.
Example Code:
```python
import os

# Or: export SPARGLIM_REMOTE=sc://localhost:15002 before running Python
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002"

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row

c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
```
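If the connection fails, a quick sanity check on the `sc://host:port` format can help narrow things down. This helper is illustrative only and not part of sparglim:

```python
from urllib.parse import urlparse

def parse_remote(remote):
    # Illustrative check of the sc://host:port format (not part of sparglim).
    parsed = urlparse(remote)
    if parsed.scheme != "sc" or not parsed.hostname or not parsed.port:
        raise ValueError(f"Expected sc://host:port, got {remote!r}")
    return parsed.hostname, parsed.port

print(parse_remote("sc://localhost:15002"))  # ('localhost', 15002)
```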
### SQL Magic
Install Sparglim with
```bash
pip install sparglim[magic]
```
Load magic in IPython/Jupyter
```ipython
%load_ext sparglim.sql
spark # show SparkSession brief info
```
Create a view:
```python
from datetime import datetime, date
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.createOrReplaceTempView("tb")
```
Query the view by `%SQL`:
```ipython
%sql SELECT * FROM tb
```
The `%SQL` result DataFrame can be assigned to a variable:
```ipython
df = %sql SELECT * FROM tb
df
```
Or use `%%SQL` to execute multiple statements:
```ipython
%%sql SELECT
*
FROM
tb;
```
You can also use Spark SQL to load data from an external data source, for example:
```ipython
%%sql CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;
```
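Spark's `json` data source expects JSON Lines (one object per line). As a sketch of what such a file might look like (the path and field names here are made up for illustration, and the temp-file path stands in for `/path/to/file.json`):

```python
import json
import tempfile

# Write a tiny JSON Lines file that Spark's `json` source could read.
people = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for record in people:
        f.write(json.dumps(record) + "\n")
    path = f.name

# `path` can then be used in OPTIONS (path "...") above.
print(path)
```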
## Develop
Install pre-commit before committing:
```
pip install pre-commit
pre-commit install
```
Install the package locally:
```
pip install -e .[test]
```
Run the unit tests before opening a PR, and **ensure that new features are covered by unit tests**:
```
pytest -v
```
(Optional, Python <= 3.10) Use [pytype](https://github.com/google/pytype) to check types:
```
pytype ./sparglim
```