| Field | Value |
|---|---|
| Name | schemadiffed |
| Version | 0.1.0.1 |
| Summary | Compare Parquet file schemas across different filesystems |
| Author | Elsayed91 |
| License | MIT |
| Requires Python | >=3.10,<4.0 |
| Upload time | 2023-07-22 13:03:34 |
| Requirements | None recorded |
# schemadiff
schemadiff is a niche package for situations where a large number of files on a filesystem are expected to share an identical schema, but do not. This becomes a problem when working with distributed computing systems like `Apache Spark` or `Google BigQuery`, where unexpected schema differences can disrupt data loading and processing.
Consider a scenario where you are processing thousands of files, and a subset of them have schemas that are almost identical but not completely matching. This can lead to errors such as:
- BigQuery: `Error while reading data, error message: Parquet column '<COLUMN_NAME>' has type INT32 which does not match the target cpp_type DOUBLE File: gs://bucket/file.parquet`
- Spark: `Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary`
schemadiff addresses this by reading only file metadata (the Parquet footer), efficiently pinpointing the files whose schemas diverge without loading any row data.
## Installation
Install the package with pip:
```bash
pip install schemadiffed # schemadiff taken :p
```
## Usage
The package can be used as a Python library or as a command-line tool.
### Python Library
Here's an example of using schemadiff to group files by their schema:
```python
import os
from schemadiff import compare_schemas
os.environ['GOOGLE_CLOUD_CREDENTIALS'] = 'key.json'
grouped_files = compare_schemas('path/to/parquet_files', report_path='/desired/path/to/report.json')
```
In this example, `compare_schemas` groups the Parquet files in the directory `path/to/parquet_files` by their schema. It saves the results to `report.json` and also returns the grouped files as a list for potential downstream use.
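The exact shape of the return value is not documented here; assuming it is a list of file groups, one sub-list per distinct schema (an assumption — verify against the package docs), a downstream outlier check might look like:

```python
# hypothetical downstream use: flag files outside the majority schema
# assumes grouped_files is a list of lists, one per distinct schema
grouped_files = [
    ["gs://bucket/a.parquet", "gs://bucket/b.parquet"],
    ["gs://bucket/c.parquet"],  # the odd one out
]

majority = max(grouped_files, key=len)
outliers = [f for group in grouped_files if group is not majority for f in group]
print(outliers)
```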
### Command-Line Interface
schemadiff can also be used as a command-line tool. After installation, the command `compare-schemas` is available in your shell:
```bash
python schemadiff --dir_path 'gs://<bucket>/yellow/*_2020*.parquet' --fs_type 'gcs' --report_path 'report.json' --return_type 'as_list'
```
## Features
- Efficient processing by reading the metadata of Parquet files.
- Supports local, GCS, and S3 filesystems (authenticate with your cloud provider first).
- Supports wildcard characters for flexible file selection.
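Wildcard selection across filesystems is the kind of thing typically handled by `fsspec`; a sketch of the glob semantics (assuming fsspec, not schemadiff's exact internals — the directory and filenames are made up):

```python
import os
import tempfile

import fsspec

# set up a throwaway directory with two files to glob over
demo_dir = tempfile.mkdtemp()
for name in ("trips_2020-01.parquet", "trips_2021-01.parquet"):
    open(os.path.join(demo_dir, name), "a").close()

# "file" is the local backend; "gcs" or "s3" work the same way
# once the matching fsspec backend is installed and authenticated
fs = fsspec.filesystem("file")
matches = fs.glob(os.path.join(demo_dir, "*_2020*.parquet"))
print(matches)  # only the 2020 file matches
```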