# pyspark_eda
`pyspark_eda` is a Python library for performing exploratory data analysis (EDA) using PySpark. It offers functionality for univariate, bivariate, and multivariate analysis, handles missing values and outliers, and visualizes data distributions.
## Features
- **Univariate analysis:** Analyze numerical and categorical columns individually. Displays histograms and frequency distribution tables when requested.
- **Bivariate analysis:** Includes correlation, Cramer's V, and ANOVA. Displays scatter plots when requested.
- **Multivariate analysis:** Includes Variance Inflation Factor (VIF).
- **Automatic handling:** Deals with missing values and outliers seamlessly.
- **Visualization:** Provides graphical representation of data distributions and relationships.
## Installation
You can install `pyspark_eda` via pip:
```bash
pip install pyspark_eda
```
## Function
### Univariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name under which the results are saved.
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **categorical_columns** (*list*): The categorical columns of the table on which you want the analysis to be performed.
- **id_list** (*list*, optional): List of columns to exclude from analysis.
- **print_graphs** (*int*, optional): Whether to print graphs (1 for yes, 0 for no). Default is 0.
### Description
Performs univariate analysis on the DataFrame and prints summary statistics and visualizations.
It returns a table with the following columns: column, total_count, min, max, mean, mode, null_percentage, skewness, kurtosis, stddev (standard deviation), q1, q2, q3 (quartiles), mean_plus_3std, mean_minus_3std, outlier_percentage, and frequency_distribution.
You can display the table to view the results.
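The mean_plus_3std, mean_minus_3std, and outlier_percentage columns correspond to the classic three-sigma rule. As a point of reference, here is a minimal plain-Python sketch of that computation (independent of `pyspark_eda`; the function name is illustrative, not part of the library):

```python
import statistics

def three_sigma_outlier_stats(values):
    """Return 3-sigma bounds and the outlier percentage for a numeric sample."""
    mean = statistics.mean(values)
    stddev = statistics.stdev(values)  # sample standard deviation
    upper = mean + 3 * stddev          # analogous to mean_plus_3std
    lower = mean - 3 * stddev          # analogous to mean_minus_3std
    outliers = [v for v in values if v < lower or v > upper]
    pct = 100.0 * len(outliers) / len(values)
    return lower, upper, pct

# One extreme value among many typical ones is flagged as an outlier
data = [10] * 99 + [500]
lower, upper, pct = three_sigma_outlier_stats(data)
print(lower, upper, pct)
```

Note that the rule is computed on the full sample, so a single extreme value inflates the standard deviation and can mask milder outliers in small samples.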
### Example Usage
### get_univariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_univariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']
# Perform univariate analysis
get_univariate_analysis(
    df,
    table_name="your_table_name",
    numerical_columns=numerical_columns,
    categorical_columns=categorical_columns,
    id_list=['id_column'],
    print_graphs=1
)
```
## Function
### Bivariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name under which the results are saved.
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **categorical_columns** (*list*): The categorical columns of the table on which you want the analysis to be performed.
- **id_columns** (*list, optional*): List of columns to exclude from analysis.
- **p_correlation_analysis** (*int, optional*): Whether to perform Pearson's correlation analysis (1 for yes, 0 for no). Default is 0.
- **s_correlation_analysis** (*int, optional*): Whether to perform Spearman's correlation analysis (1 for yes, 0 for no). Default is 0.
- **cramer_analysis** (*int, optional*): Whether to perform Cramer's V analysis (1 for yes, 0 for no). Default is 0.
- **anova_analysis** (*int, optional*): Whether to perform ANOVA analysis (1 for yes, 0 for no). Default is 0.
- **print_graphs** (*int, optional*): Whether to print graphs (1 for yes, 0 for no). Default is 0.
### Description
Performs bivariate analysis on the DataFrame, including Pearson's correlation, Spearman's correlation, Cramer's V, and ANOVA.
It returns a table with the following columns: Column_1, Column_2, Pearson_Correlation, Spearman_Correlation, Cramers_V, Anova_F_Value, Anova_P_Value.
You can display the table to view the results.
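For context, Cramer's V measures association between two categorical variables on a 0-to-1 scale, derived from the chi-squared statistic of their contingency table. A minimal plain-Python sketch of the formula (illustrative only, not the library's implementation):

```python
import math

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-squared statistic against independence
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    r, k = len(table), len(table[0])
    return math.sqrt(chi2 / (n * (min(r, k) - 1)))

# Two perfectly associated binary variables give V = 1
print(cramers_v([[20, 0], [0, 30]]))  # -> 1.0
```

A value of 0 indicates independence and 1 indicates perfect association, which makes V comparable across tables of different sizes.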
### Example Usage
### get_bivariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']
# Perform bivariate analysis
get_bivariate_analysis(
    df,
    table_name="bivariate_analysis_results",
    numerical_columns=numerical_columns,
    categorical_columns=categorical_columns,
    id_columns=['id_column'],
    p_correlation_analysis=1,
    s_correlation_analysis=1,
    cramer_analysis=1,
    anova_analysis=1,
    print_graphs=0
)
```
## Function
### Multivariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name under which the results are saved.
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **id_columns** (*list, optional*): List of columns to exclude from analysis.
### Description
Performs multivariate analysis on the DataFrame, which gives the Variance Inflation Factor (VIF) for each numerical column.
It returns a table with the following columns: Feature, VIF. You can display the table to view the results.
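For background, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. In the special two-predictor case, R² is simply the squared Pearson correlation between the two columns, which gives this minimal plain-Python sketch (illustrative only, not the library's implementation):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF of x1 when x2 is the only other predictor: 1 / (1 - R^2),
    where R^2 is the squared correlation between the two columns."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]   # nearly collinear with x1
print(vif_two_predictors(x1, x2))
```

A VIF near 1 indicates no collinearity; values above roughly 5 to 10 are commonly read as problematic multicollinearity.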
### Example Usage
### get_multivariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_multivariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Identify numerical columns
numerical_columns = ['col1', 'col2', 'col3']
# Perform multivariate analysis
get_multivariate_analysis(
    df,
    table_name="multivariate_analysis_results",
    numerical_columns=numerical_columns,
    id_columns=['id_column']
)
```
## Contact
- **Author:** Tanya Irani
- **Email:** tanyairani22@gmail.com