| Field | Value |
|---|---|
| Name | schemadiffed |
| Version | 0.1.0.1 |
| Summary | Compare Parquet file schemas across different filesystems |
| Author | Elsayed91 |
| License | MIT |
| Requires Python | >=3.10,<4.0 |
| Upload time | 2023-07-22 13:03:34 |
| Requirements | None recorded |
# schemadiff
schemadiff is a niche package for situations where a large number of files on a filesystem are expected to share an identical schema, but do not. This becomes a problem when working with distributed computing systems like `Apache Spark` or `Google BigQuery`, where unexpected schema differences can disrupt data loading and processing.
Consider a scenario where you are processing thousands of files, and a subset of them have schemas that are almost identical but not completely matching. This can lead to errors such as:
- BigQuery: `Error while reading data, error message: Parquet column '<COLUMN_NAME>' has type INT32 which does not match the target cpp_type DOUBLE File: gs://bucket/file.parquet`
- Spark: `Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary`
schemadiff addresses this by reading only file metadata (the Parquet footer), efficiently pinpointing the files whose schemas diverge without loading any row data.
## Installation
Install the package with pip:
```bash
pip install schemadiffed # schemadiff taken :p
```
## Usage
The package can be used as a Python library or as a command-line tool.
### Python Library
Here's an example of using schemadiff to group files by their schema:
```python
import os
from schemadiff import compare_schemas
os.environ['GOOGLE_CLOUD_CREDENTIALS'] = 'key.json'
grouped_files = compare_schemas('path/to/parquet_files', report_path='/desired/path/to/report.json')
```
In this example, `compare_schemas` groups the Parquet files in the directory `path/to/parquet_files` by their schema. It saves the results to `report.json` and also returns the grouped files as a list for potential downstream use.
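The exact shape of the return value is not documented here; assuming it is a list of file groups, one sub-list per distinct schema (an assumption — verify against the package docs), a downstream outlier check might look like:

```python
# hypothetical downstream use: flag files outside the majority schema
# assumes grouped_files is a list of lists, one per distinct schema
grouped_files = [
    ["gs://bucket/a.parquet", "gs://bucket/b.parquet"],
    ["gs://bucket/c.parquet"],  # the odd one out
]

majority = max(grouped_files, key=len)
outliers = [f for group in grouped_files if group is not majority for f in group]
print(outliers)
```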
### Command-Line Interface
schemadiff can also be used as a command-line tool. After installation, the command `compare-schemas` is available in your shell:
```bash
python schemadiff --dir_path 'gs://<bucket>/yellow/*_2020*.parquet' --fs_type 'gcs' --report_path 'report.json' --return_type 'as_list'
```
## Features
- Efficient processing by reading the metadata of Parquet files.
- Supports local, GCS, and S3 filesystems (authenticate with your cloud provider first).
- Supports wildcard characters for flexible file selection.
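Wildcard selection across filesystems is the kind of thing typically handled by `fsspec`; a sketch of the glob semantics (assuming fsspec, not schemadiff's exact internals — the directory and filenames are made up):

```python
import os
import tempfile

import fsspec

# set up a throwaway directory with two files to glob over
demo_dir = tempfile.mkdtemp()
for name in ("trips_2020-01.parquet", "trips_2021-01.parquet"):
    open(os.path.join(demo_dir, name), "a").close()

# "file" is the local backend; "gcs" or "s3" work the same way
# once the matching fsspec backend is installed and authenticated
fs = fsspec.filesystem("file")
matches = fs.glob(os.path.join(demo_dir, "*_2020*.parquet"))
print(matches)  # only the 2020 file matches
```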