seedspark

- Name: seedspark
- Version: 0.4.3
- Summary: SeedSpark is an extensible PySpark utility package to create production Spark pipelines and dev-test them in dev environments
- Homepage: https://github.com/ChethanUK/
- Author / Maintainer: ChethanUK
- Requires Python: >=3.8,<4.0
- License: Apache-2.0
- Keywords: pyspark, data-ops, data-engineering, data-quality, data-profiling, dataquality, dataunittest, data-unit-tests, data-profilers, data-engineer, best-practices, big-data
- Upload time: 2023-08-09 14:18:33
# SeedSpark

**SeedSpark** is an open-source, extensible PySpark utility package for creating production Spark pipelines and dev-testing them in development environments or running end-to-end tests. The goal is to enable rapid development of Spark pipelines with PySpark on Spark clusters and to test those pipelines locally using various utilities.

## TODO

1. Move the logwrap extension [built on top of loguru] out into a separate package.
1. Add test containers for [amundsen](https://www.amundsen.io/amundsen/), etc.

## Getting Started

1. Setup [SDKMAN](#setup-sdkman)
1. Setup [Java](#setup-java)
1. Setup [Apache Spark](#setup-apache-spark)
1. Install [Poetry](#poetry)
1. Install pre-commit and [follow the instructions here](PreCommit.MD)
1. Run [tests locally](#running-tests-locally)

### Setup SDKMAN

SDKMAN! is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based system. It provides a convenient command-line interface for installing, switching, removing, and listing candidates.
SDKMAN! installs smoothly on macOS, Linux, WSL, Cygwin, and similar environments, and supports the Bash and Zsh shells.
See documentation on the [SDKMAN! website](https://sdkman.io).

Open your favourite terminal and enter the following:

```shell
$ curl -s https://get.sdkman.io | bash

# If the environment needs tweaking for SDKMAN! to be installed,
# the installer will prompt you accordingly and ask you to restart.

# Next, open a new terminal or enter:
$ source "$HOME/.sdkman/bin/sdkman-init.sh"

# Lastly, run the following to verify that the installation succeeded:
$ sdk version
```


### Setup Java

To install Java, open your favourite terminal and enter the following:

```shell
# List the available OpenJDK versions
$ sdk list java

# To install Java 11
$ sdk install java 11.0.10.hs-adpt

# To install Java 8
$ sdk install java 8.0.292.hs-adpt
```

### Setup Apache Spark

To install Apache Spark, open your favourite terminal and enter the following:

```bash
# List the Apache Spark versions
$ sdk list spark

# To install Spark 3.0
$ sdk install spark 3.0.2

# To install Spark 3.1
$ sdk install spark 3.1.1
```
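Optionally, the Java and Spark versions can be pinned per project with an `.sdkmanrc` file and activated with `sdk env` (a sketch using the versions installed above; adjust to your setup):

```
# .sdkmanrc — run `sdk env` in this directory to activate these versions
java=11.0.10.hs-adpt
spark=3.0.2
```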

### Poetry

Poetry [Commands](https://python-poetry.org/docs/cli/#search)

```bash
# Install the project dependencies
poetry install

# Update dependencies to their latest compatible versions
poetry update

# --tree: list the dependencies as a tree
# --latest (-l): show the latest version
# --outdated (-o): show the latest version, but only for packages that are outdated
poetry show -o
```
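To consume SeedSpark from another Poetry project, a dependency entry might look like this (a hypothetical consumer `pyproject.toml` fragment; the constraints mirror this release's metadata):

```toml
[tool.poetry.dependencies]
python = ">=3.8,<4.0"
seedspark = "^0.4.3"
```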

## Running Tests Locally

Take a look at the tests in `tests/dataquality` and `tests/jobs`:

```bash
$ poetry run pytest
Ran 95 tests in 96.95s
```

NOTE: This repo is just curated material for personal use.


            
