Name | pyspark-explorer |
Version | 0.2.2 |
home_page | None |
Summary | Explore data files with pyspark |
upload_time | 2025-01-06 12:37:24 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.11 |
license | MIT License Copyright (c) 2024 Krzysztof Ruta Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | data, explorer, pyspark, spark |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Spark File Explorer
When developing Spark applications I came across a growing number of data files that I create.
![pe03](https://github.com/user-attachments/assets/e7d51949-2868-4b1c-ac4a-3807d0f4a41a)
![pe04](https://github.com/user-attachments/assets/442d70e5-8098-4bbf-87db-a9cddbeaf223)
## CSVs are fine but what about JSON and complex PARQUET files?
To open and explore a file I used Excel to view CSV files and text editors with plugins to view JSON files,
but there was nothing handy to view PARQUETs. Even formatted JSONs were not always readable. And what about viewing schemas?
Each time I had to use Spark and write simple apps, which was not a problem in itself but was tedious and boring.
## Why not a database?
Well, for tabular data the problem is already solved - just use your preferred database.
Quite often we can load text files or even parquets directly into the database.
So what's the big deal?
## Hierarchical data sets
Unfortunately, the files I often deal with have a hierarchical structure. They cannot simply be visualized as tables -
rather, some fields contain tables of other structures. Each of these structures is a table in itself, but how do you load
and explore such embedded tables in a database?
## For Spark files use... Spark!
Hold on - since I generate files using Apache Spark, why can't I use it to explore them?
It can easily handle complex structures and file types with its built-in features. So all I need to do is build a user interface
to display directories, files and their contents.
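As a minimal sketch of the idea (this is not the tool's actual code - the file name and session settings below are placeholders), plain PySpark is already enough to peek at a nested file:

    # Minimal sketch: read a (hypothetical) nested Parquet file and inspect it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("explore").getOrCreate()

    df = spark.read.parquet("file:///home/myuser/datafiles/base_path/events.parquet")
    df.printSchema()              # nested structs and arrays show up in the schema tree
    df.show(5, truncate=False)    # embedded "tables" are rendered inline as rows

pyspark-explorer essentially wraps this kind of exploration in a navigable UI.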
## Why console?
I use Kubernetes in the production environment and develop Spark applications locally or in a VM.
In all these environments I would like to have _one tool to rule them all_.
I like console tools a lot; they enforce a certain simplicity. They can run locally or over an SSH connection on
a remote cluster. Sounds perfect. All I needed was a console UI library, so I wouldn't have to reinvent the wheel.
## Textual
What a great project [_textual_](https://textual.textualize.io/) is!
Years ago I used [_curses_](https://docs.python.org/3/library/curses.html), but
[_textual_](https://textual.textualize.io/) is far superior to what I used back then. It has so many features packed
into a friendly set of simple-to-use components. Highly recommended.
# Usage
Install the package with pip:

    pip install pyspark-explorer

Run:

    pyspark-explorer
You may wish to provide a base path upfront. It can be changed at any time (press _o_ for _Options_).
For local files that could be, for example:

    # Linux
    pyspark-explorer file:///home/myuser/datafiles/base_path
    # Windows
    pyspark-explorer file:/c:/datafiles/base_path

For a remote location:

    # Remote hdfs cluster
    pyspark-explorer hdfs://somecluster/datafiles/base_path
The default path is set to /, which represents the local root filesystem and works fine even on Windows thanks to Spark's path logic.
Configuration files are saved to your home directory (_.pyspark-explorer_ subdirectory).
These are JSON files, so you are free to edit them.
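For example, a quick way to see what is stored there (just a generic sketch - the exact file names and keys are whatever the tool writes):

    import json
    from pathlib import Path

    # Pretty-print every JSON config file found in ~/.pyspark-explorer
    config_dir = Path.home() / ".pyspark-explorer"
    for cfg in sorted(config_dir.glob("*.json")):
        print(f"--- {cfg.name} ---")
        print(json.dumps(json.loads(cfg.read_text()), indent=2))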
# Spark limitations
Note that you will not be able to open just any JSON file - only those with a _correct_ structure can be viewed. If you try to open a file with an unacceptable structure, Spark will throw an error, e.g.:
    Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
    referenced columns only include the internal corrupt record column
    (named _corrupt_record by default). For example:
    spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
    and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
    Instead, you can cache or save the parsed results and then send the same query.
    For example, val df = spark.read.schema(schema).csv(file).cache() and then
    df.filter($"_corrupt_record".isNotNull).count().
or e.g.

    [COLUMN_ALREADY_EXISTS] The column `event` already exists. Consider to choose another name or rename the existing column.

or e.g.

    'NoneType' object has no attribute '__fields__'
etc.
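As a rough illustration of where the first error comes from (this is not pyspark-explorer's own code, and the file path is made up): Spark's JSON reader expects one object per line and puts anything it cannot parse into the internal _corrupt_record column.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # A pretty-printed, multi-line JSON document read without multiLine=True
    # typically ends up with _corrupt_record as its only column.
    df = spark.read.json("file:///tmp/not_line_delimited.json")
    df.printSchema()   # may show just: _corrupt_record: string
    # Referencing only _corrupt_record on the raw file triggers the Spark 2.3+ error
    # quoted above; caching the parsed result first is Spark's suggested workaround.
    df.cache().show(truncate=False)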
You can find the log file in your home directory (_.pyspark-explorer_ subdirectory).
Raw data
{
"_id": null,
"home_page": null,
"name": "pyspark-explorer",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": "Krzysztof Ruta <krzys9876@gmail.com>",
"keywords": "data, explorer, pyspark, spark",
"author": null,
"author_email": "Krzysztof Ruta <krzys9876@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/9a/0a/ffd697fcf1a2abb62bdbead8e11d9abd02753ade7120f59f332ebf1fd650/pyspark_explorer-0.2.2.tar.gz",
"platform": null,
"description": "# Spark File Explorer\nWhen developing spark applications I came across the growing number of data files that I create. \n\n![pe03](https://github.com/user-attachments/assets/e7d51949-2868-4b1c-ac4a-3807d0f4a41a)\n\n![pe04](https://github.com/user-attachments/assets/442d70e5-8098-4bbf-87db-a9cddbeaf223)\n\n## CSVs are fine but what about JSON and complex PARQUET files?\n\nTo open and explore a file I used Excel to view CSV files, text editors with plugins to view JSON files, \nbut there was nothing handy to view PARQUETs. Event formatted JSONs were not always readable. What about viewing schemas? \n\nEach time I had to use spark and write simple apps which was not a problem itself but was tedious and boring.\n\n## Why not a database?\n\nWell, for tabular data there problems is already solved - just use your preferred database.\nQuite often we can load text files or even parquets directly to the database. \n\nSo what's the big deal?\n\n## Hierarchical data sets\n\nUnfortunately the files I often deal with have hierarchical structure. They cannot be simply visualized as tables\nor rather some fields contain tables of other structures. Each of these structures is a table itself but how to load \nand explore such embedded tables in a database?\n\n## For Spark files use... Spark! \n\nHold on - since I generate files using Apache Spark, why can't I use it to explore them?\nI can easily handle complex structures and file types using built-in features. So all I need is to build a use interface \nto display directories, files and their contents.\n\n## Why console?\n\nI use Kubernetes in production environment, I develop Spark applications locally or in VM. \nIn all environments I would like to have _one tool to rule them all_. \n\nI like console tools a lot, they require some sort of simplicity. They can run locally or over SSH connection on \nthe remote cluster. Sounds perfect. All I needed was a console UI library, so I wouldn't have to reinvent the wheel.\n\n## Textual\n\nWhat a great project [_textual_](https://textual.textualize.io/) is! \n\nYears ago I used [_curses_](https://docs.python.org/3/library/curses.html) but \n[_textual_](https://textual.textualize.io/) is so superior to what I used back then. It has so many features packed in\na friendly form of simple to use components. Highly recommended.\n\n# Usage\n\nInstall package with pip:\n \n pip install pyspark-explorer\n\nRun:\n\n pyspark-explorer\n\nYou may wish to provide a base path upfront. It can be changed at any time (press _o_ for _Options_).\n\nFor local files that could be for example:\n\n # Linux\n pyspark-explorer file:///home/myuser/datafiles/base_path\n # Windows\n pyspark-explorer file:/c:/datafiles/base_path\n\nFor remote location:\n\n # Remote hdfs cluster\n pyspark-explorer hdfs://somecluster/datafiles/base_path\n\nDefault path is set to /, which represents local root filesystem and works fine even in Windows thanks to Spark logics.\n\nConfiguration files are saved to your home directory (_.pyspark-explorer_ subdirectory). \nThese are json files so you are free to edit them.\n\n# Spark limitations\n\nNote that you will not be able to open any JSON file - only those with _correct_ structure can be viewed. If you try to open a file which has an unacceptable structure, Spark will throw an error, e.g.:\n\n Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the\n referenced columns only include the internal corrupt record column\n (named _corrupt_record by default). 
For example:\n spark.read.schema(schema).csv(file).filter($\"_corrupt_record\".isNotNull).count()\n and spark.read.schema(schema).csv(file).select(\"_corrupt_record\").show().\n Instead, you can cache or save the parsed results and then send the same query.\n For example, val df = spark.read.schema(schema).csv(file).cache() and then\n df.filter($\"_corrupt_record\".isNotNull).count().\n\nor e.g.\n\n [COLUMN_ALREADY_EXISTS] The column `event` already exists. Consider to choose another name or rename the existing column.\n\nor e.g.\n\n 'NoneType' object has no attribute '__fields__'\n\netc.\n\nYou can find the log file in your home directory (_.pyspark-explorer_ subdirectory).",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 Krzysztof Ruta Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"summary": "Explore data files with pyspark",
"version": "0.2.2",
"project_urls": {
"Homepage": "https://github.com/krzys9876/pyspark_explorer",
"Repository": "https://github.com/krzys9876/pyspark_explorer"
},
"split_keywords": [
"data",
" explorer",
" pyspark",
" spark"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f08472149921c461eed38aeab45706b6bcb1080aae113b86d67738ce3144e7ef",
"md5": "4f9ac95d5dec23afe85613327390c58c",
"sha256": "cb9ac457e3a6a5484a13ec3ff85583dfaa9827702a281d884aa2446ca7cffca8"
},
"downloads": -1,
"filename": "pyspark_explorer-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4f9ac95d5dec23afe85613327390c58c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 17212,
"upload_time": "2025-01-06T12:37:23",
"upload_time_iso_8601": "2025-01-06T12:37:23.436703Z",
"url": "https://files.pythonhosted.org/packages/f0/84/72149921c461eed38aeab45706b6bcb1080aae113b86d67738ce3144e7ef/pyspark_explorer-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9a0affd697fcf1a2abb62bdbead8e11d9abd02753ade7120f59f332ebf1fd650",
"md5": "e2d35fae24fce5d60884ac551adbca38",
"sha256": "5be652f4c999159631d0ca60edc61d95d545575132c531d3f7df0952a19569fe"
},
"downloads": -1,
"filename": "pyspark_explorer-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "e2d35fae24fce5d60884ac551adbca38",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 18973,
"upload_time": "2025-01-06T12:37:24",
"upload_time_iso_8601": "2025-01-06T12:37:24.577169Z",
"url": "https://files.pythonhosted.org/packages/9a/0a/ffd697fcf1a2abb62bdbead8e11d9abd02753ade7120f59f332ebf1fd650/pyspark_explorer-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-06 12:37:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "krzys9876",
"github_project": "pyspark_explorer",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pyspark-explorer"
}