# pyspark-test
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-black.svg)](https://github.com/ambv/black)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Unit Test](https://github.com/debugger24/pyspark-test/workflows/Unit%20Test/badge.svg?branch=main)](https://github.com/debugger24/pyspark-test/actions?query=workflow%3A%22Unit+Test%22)
[![PyPI version](https://badge.fury.io/py/pyspark-test.svg)](https://badge.fury.io/py/pyspark-test)
[![Downloads](https://pepy.tech/badge/pyspark-test)](https://pepy.tech/project/pyspark-test)
Check that the left and right Spark DataFrames are equal.
This function compares two Spark DataFrames and reports any differences. It is inspired by the pandas testing module, adapted for PySpark, and intended for use in unit tests. Additional parameters control how strict the equality checks are.
# Installation
```
pip install pyspark-test
```
# Usage
```py
from pyspark_test import assert_pyspark_df_equal
assert_pyspark_df_equal(left_df, right_df)
```
## Additional Arguments
* `check_dtype` : Compare the data types of the two DataFrames. Defaults to `True`.
* `check_column_names` : Compare column names. Defaults to `False`. Not required if data types are already being checked.
* `check_columns_in_order` : Whether the columns must appear in the same order. Defaults to `False`.
* `order_by` : Column names by which both DataFrames are sorted before comparing. Defaults to `None`.
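To illustrate why sorting before comparing can matter, here is a toy model in plain Python that needs no Spark. It is a sketch under stated assumptions and not the library's implementation: `rows_equal` is a hypothetical helper, and rows are modeled as dictionaries. It shows that two datasets holding the same rows in a different order only compare equal row by row once both are sorted on the same columns, which is the role `order_by` is described as playing above.

```python
from operator import itemgetter


def rows_equal(left_rows, right_rows, order_by=None):
    """Toy row-by-row comparison (hypothetical helper, not part of pyspark-test)."""
    if len(left_rows) != len(right_rows):
        return False
    if order_by:
        # Sort both sides on the same columns so that row order no longer matters.
        key = itemgetter(*order_by)
        left_rows = sorted(left_rows, key=key)
        right_rows = sorted(right_rows, key=key)
    return all(l == r for l, r in zip(left_rows, right_rows))


# Same rows, different order.
left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]

unsorted_result = rows_equal(left, right)                # positional comparison
sorted_result = rows_equal(left, right, order_by=["id"])  # sort on "id" first
print(unsorted_result, sorted_result)
```

In a real test, the analogous step would be passing `order_by` to `assert_pyspark_df_equal` whenever the row order of the DataFrames being compared is not guaranteed.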
# Example
```py
import datetime

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    DateType,
    DoubleType,
    LongType,
    StringType,
    StructField,
    StructType,
)

from pyspark_test import assert_pyspark_df_equal

# Reuse an existing SparkContext or create one with the default configuration.
sc = SparkContext.getOrCreate()
spark_session = SparkSession(sc)

df_1 = spark_session.createDataFrame(
    data=[
        [datetime.date(2020, 1, 1), 'demo', 1.123, 10],
        [None, None, None, None],
    ],
    schema=StructType(
        [
            StructField('col_a', DateType(), True),
            StructField('col_b', StringType(), True),
            StructField('col_c', DoubleType(), True),
            StructField('col_d', LongType(), True),
        ]
    ),
)

df_2 = spark_session.createDataFrame(
    data=[
        [datetime.date(2020, 1, 1), 'demo', 1.123, 10],
        [None, None, None, None],
    ],
    schema=StructType(
        [
            StructField('col_a', DateType(), True),
            StructField('col_b', StringType(), True),
            StructField('col_c', DoubleType(), True),
            StructField('col_d', LongType(), True),
        ]
    ),
)

# Both DataFrames have the same schema and rows, so this assertion passes.
assert_pyspark_df_equal(df_1, df_2)
```
Raw data
{
"_id": null,
"home_page": "https://github.com/debugger24/pyspark-test",
"name": "pyspark-test",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "assert pyspark unit test testing compare",
"author": "Rahul Kumar",
"author_email": "rahulcomp24@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/f8/a9/3ca6c0f3289da348d25693adb4f80e3d8b2389dea603f222feae4dd78e76/pyspark_test-0.2.0.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "Apache Software License (Apache 2.0)",
"summary": "Check that left and right spark DataFrame are equal.",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/debugger24/pyspark-test"
},
"split_keywords": [
"assert",
"pyspark",
"unit",
"test",
"testing",
"compare"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ec326d75e7d5171393ead86c2a6c0aba5b5cdff495286537732c3a1ad05575c1",
"md5": "e35c397e281ab7f8908b4ecfdfe2e73d",
"sha256": "dd4fb03c4f438f718a870a9268a459f2f8924829c767302f5515202707c97709"
},
"downloads": -1,
"filename": "pyspark_test-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e35c397e281ab7f8908b4ecfdfe2e73d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 7416,
"upload_time": "2021-10-31T18:24:59",
"upload_time_iso_8601": "2021-10-31T18:24:59.250622Z",
"url": "https://files.pythonhosted.org/packages/ec/32/6d75e7d5171393ead86c2a6c0aba5b5cdff495286537732c3a1ad05575c1/pyspark_test-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f8a93ca6c0f3289da348d25693adb4f80e3d8b2389dea603f222feae4dd78e76",
"md5": "1e1975b8d80865b0396e5fb71db0a639",
"sha256": "0d9d8d3a352a9b1c30761b0553a5771cb9dbb9a278955b3e7b0aed0ae13892d8"
},
"downloads": -1,
"filename": "pyspark_test-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "1e1975b8d80865b0396e5fb71db0a639",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 7038,
"upload_time": "2021-10-31T18:25:00",
"upload_time_iso_8601": "2021-10-31T18:25:00.872506Z",
"url": "https://files.pythonhosted.org/packages/f8/a9/3ca6c0f3289da348d25693adb4f80e3d8b2389dea603f222feae4dd78e76/pyspark_test-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2021-10-31 18:25:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "debugger24",
"github_project": "pyspark-test",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"tox": true,
"lcname": "pyspark-test"
}