seedspark

- Name: seedspark
- Version: 0.5.1
- Summary: Spark ETL Utility Framework
- Author: ChethanUK
- Requires Python: <4.0,>=3.9
- Upload time: 2024-05-14 17:24:40
- Home page, license, keywords, maintainer, docs URL: not specified
- Requirements: none recorded; no Travis-CI or coveralls coverage
# SeedSpark

## Why Spark

Apache Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data, using a Directed Acyclic Graph (DAG) scheduler, a query optimizer, and a physical execution engine.

Spark’s design philosophy centers around four key characteristics:
- Speed: Leveraging in-memory data processing, Spark executes tasks up to 100 times faster in memory and 10 times faster on disk than traditional big data processing systems (e.g., Hadoop MapReduce).
- Ease of Use: Through high-level APIs and built-in modules, Spark simplifies the process of complex data transformations and analyses, making it accessible to both developers and data analysts.
- Modularity and Extensibility: Spark's modular nature allows it to be used for a range of data processing tasks from batch processing to real-time streams and machine learning. Extensibility with numerous data sources and libraries further enhances its utility.
- Unified Analytics: Spark's unified framework reduces the complexity involved in processing data that might otherwise require multiple engines or different technologies.

Spark’s architecture is designed to optimize efficiency. The use of RDDs (Resilient Distributed Datasets) and subsequent abstractions like DataFrames and Datasets simplifies data manipulation while providing fault tolerance. By retaining intermediate results in memory rather than on disk, Spark minimizes costly I/O operations that are a common bottleneck in big data processing.

The DAG execution engine enhances this by allowing for more complex operational pipelines and optimizing workflows dynamically. This approach minimizes redundant data shuffling across the cluster, leading to significant performance improvements.
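As a quick illustration of this execution model (a minimal PySpark sketch, independent of SeedSpark itself): transformations such as `filter` and `groupBy` are lazy and only build up the DAG, while an action like `show` triggers the optimizer and the physical execution engine.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").master("local[*]").getOrCreate()

# A tiny in-memory DataFrame; in practice this would come from a real data source.
events = spark.createDataFrame(
    [("alice", "click"), ("bob", "view"), ("alice", "view")],
    ["user", "event"],
)

# Transformations are lazy: Spark only records them in the DAG at this point.
clicks_per_user = events.filter(F.col("event") == "click").groupBy("user").count()

# An action triggers the query optimizer and the physical execution engine.
clicks_per_user.show()
clicks_per_user.explain()  # Prints the optimized physical plan.
```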

## Run on Gitpod

Start Dev Env in Gitpod:
[![StartDevEnvInGitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/chethanuk/seedspark)


Force build: [![ForcePrebuild](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#prebuild/https://github.com/chethanuk/seedspark)

## Installation

Install Python 3.10 or above

```bash
pyenv install 3.11 \
    && pyenv global 3.11
```
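
Optionally, confirm that pyenv activated the expected interpreter (this assumes the pyenv shims are on your `PATH`):

```bash
pyenv version && python --version
```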

Install Scala and Spark
```bash
make install-scala &&\
    make install-spark
```

Optional: if you want, you can inspect the installation commands without running them:
```bash
$ make --just-print install-spark

# It will output the following:
echo "Installing Hadoop..."
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk install hadoop 3.3.5
# Set Global version
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk default hadoop 3.3.5
echo "Installing Spark..."
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk install spark 3.5.0
# Set Global version
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk default spark 3.5.0
```

## Verify Installation

```bash
poetry env info
```

```bash
poetry version
```

```bash
sdk version
```

Run `sdk current` to verify the currently active versions:

```bash
sdk current
```

It should show:
```bash
Using:

java: 11.0.22-zulu
scala: 2.13.12
spark: 3.5.0
```

![Verify](https://i.imgur.com/P847qaX.png)

Then verify the spark-shell version:
```bash
spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.22
```

Verify the top-level packages:

```bash
poetry show -T
```

The PySpark version should match the Spark version above:
```bash
pyspark                  3.5.0    Apache Spark Python API
```
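
You can also confirm the version from inside the Poetry environment; this simply prints `pyspark.__version__`, which should match the `spark-shell` output above:

```bash
poetry run python -c "import pyspark; print(pyspark.__version__)"
```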

## Run Pytest

```bash
# Install packages
poetry install --with=testing --no-interaction
# Run Pytest
poetry run coverage run -m pytest -vv tests --reruns 5 --reruns-delay 20
```

![PyTest](https://i.imgur.com/Xta8950.png)
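
For reference, a typical PySpark test follows the pattern sketched below. This is illustrative only and is not a copy of the project's actual tests; a session-scoped local SparkSession avoids paying JVM startup cost on every test:

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared across the test session to avoid repeated JVM startup.
    session = (
        SparkSession.builder.master("local[2]")
        .appName("seedspark-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_word_count(spark):
    df = spark.createDataFrame([("a",), ("b",), ("a",)], ["word"])
    counts = {row["word"]: row["count"] for row in df.groupBy("word").count().collect()}
    assert counts == {"a": 2, "b": 1}
```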

Then check the example file `seedspark/examples/music_sessions_top_n.py` and update or replace the dataset path with the actual path of `music_sessions_data.tsv`.

Then run the following:

```bash
# OPTIONAL: download the dataset (skip this step if you already have it)
cd datasets/
pip install pandas requests tqdm; python lastfm_dataset_1k.py
# Update or replace the path in the example with the actual path of the new music_sessions_data.tsv
cd ..
# Execute the Spark app
poetry run python seedspark/examples/music_sessions_top_n.py
```
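
For context on what the example computes: a "top N sessions" job typically sessionizes plays with a window function and then aggregates. The sketch below is illustrative only; the column names (`user_id`, `timestamp`), the 20-minute session gap, and the read options are assumptions, not necessarily what `music_sessions_top_n.py` actually uses.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sessions-sketch").master("local[*]").getOrCreate()

# Assumed schema: one play per row with at least user_id and timestamp columns.
plays = spark.read.csv("music_sessions_data.tsv", sep="\t", header=True, inferSchema=True)

# Start a new session whenever the gap to the user's previous play exceeds 20 minutes.
w = Window.partitionBy("user_id").orderBy("timestamp")
gap = F.unix_timestamp("timestamp") - F.unix_timestamp(F.lag("timestamp").over(w))
plays = plays.withColumn("new_session", (gap.isNull() | (gap > 20 * 60)).cast("int"))
plays = plays.withColumn("session_id", F.sum("new_session").over(w))

# Rank sessions by number of tracks played and keep the top 10.
top_sessions = (
    plays.groupBy("user_id", "session_id")
    .agg(F.count("*").alias("track_count"))
    .orderBy(F.desc("track_count"))
    .limit(10)
)
top_sessions.show()
```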

            
