pyspark-eda


- Name: pyspark-eda
- Version: 1.6.0
- Home page: None
- Summary: A Python package for univariate, bivariate and multivariate data analysis using PySpark
- Upload time: 2024-07-05 06:47:29
- Maintainer: None
- Docs URL: None
- Author: Tanya Irani
- Requires Python: >=3.6
- License: None
- Keywords: data analysis pyspark univariate bivariate mutlivariate statistics
- Requirements: No requirements were recorded.

# pyspark_eda

`pyspark_eda` is a Python library for performing exploratory data analysis (EDA) using PySpark. It supports univariate, bivariate, and multivariate analysis, handles missing values and outliers, and visualizes data distributions.

## Features

- **Univariate analysis:** Analyze numerical and categorical columns individually. Displays histogram and frequency distribution table if required.
- **Bivariate analysis:** Includes correlation, Cramer's V, and ANOVA. Displays scatter plot if required. 
- **Multivariate analysis:** Includes Variance Inflation Factor (VIF).
- **Automatic handling:** Deals with missing values and outliers seamlessly.
- **Visualization:** Provides graphical representation of data distributions and relationships.

## Installation
You can install `pyspark_eda` via pip:

```bash
pip install pyspark_eda
```
## Function
### Univariate Analysis 
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name under which the results are saved.
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **categorical_columns** (*list*): The categorical columns of the table on which you want the analysis to be performed.
- **id_list** (*list, optional*): List of columns to exclude from analysis.
- **print_graphs** (*int, optional*): Whether to print graphs (1 for yes, 0 for no); default value is 0.

### Description
Performs univariate analysis on the DataFrame and prints summary statistics and visualizations.
It returns a table with the following columns: column, total_count, min, max, mean, mode, null_percentage, skewness, kurtosis, stddev (the standard deviation), q1, q2, q3 (quartiles), mean_plus_3std, mean_minus_3std, outlier_percentage, and frequency_distribution.
You can display the table to view the results.
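
The column names `mean_plus_3std`, `mean_minus_3std`, and `outlier_percentage` suggest a three-sigma rule for outliers. The sketch below illustrates that rule in plain Python. It is not the library's implementation: the helper `three_sigma_summary`, the use of the sample standard deviation, and the way nulls are counted are all assumptions made for illustration.

```python
import statistics


def three_sigma_summary(values):
    """Summarise one numerical column (hypothetical helper).

    Assumptions: outliers are values outside mean +/- 3 * stddev, stddev is
    the sample standard deviation, and nulls are excluded before computing
    statistics. pyspark_eda may define any of these differently.
    """
    non_null = [v for v in values if v is not None]
    mean = statistics.mean(non_null)
    stddev = statistics.stdev(non_null)  # sample standard deviation
    upper = mean + 3 * stddev
    lower = mean - 3 * stddev
    outliers = [v for v in non_null if v < lower or v > upper]
    return {
        "null_percentage": 100 * (len(values) - len(non_null)) / len(values),
        "mean": mean,
        "stddev": stddev,
        "mean_plus_3std": upper,
        "mean_minus_3std": lower,
        "outlier_percentage": 100 * len(outliers) / len(non_null),
    }


# Made-up column: nineteen typical readings, one extreme value, one null.
column = [10.0] * 19 + [1000.0, None]
summary = three_sigma_summary(column)
print(summary)
```

With this made-up column, only the reading of 1000.0 falls outside the three-sigma band, so one of the twenty non-null values is flagged as an outlier.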

### Example Usage
### get_univariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_univariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']

# Perform univariate analysis
get_univariate_analysis(
    df,
    table_name="your_table_name",
    numerical_columns=numerical_columns,
    categorical_columns=categorical_columns,
    id_list=['id_column'],
    print_graphs=1,
)
```

## Function
### Bivariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name under which the results are saved.
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **categorical_columns** (*list*): The categorical columns of the table on which you want the analysis to be performed.
- **id_columns** (*list, optional*): List of columns to exclude from analysis.
- **p_correlation_analysis** (*int, optional*): Whether to perform Pearson's correlation analysis (1 for yes, 0 for no); default value is 0.
- **s_correlation_analysis** (*int, optional*): Whether to perform Spearman's correlation analysis (1 for yes, 0 for no); default value is 0.
- **cramer_analysis** (*int, optional*): Whether to perform Cramer's V analysis (1 for yes, 0 for no); default value is 0.
- **anova_analysis** (*int, optional*): Whether to perform ANOVA (1 for yes, 0 for no); default value is 0.
- **print_graphs** (*int, optional*): Whether to print graphs (1 for yes, 0 for no); default value is 0.

### Description
Performs bivariate analysis on the DataFrame, including Pearson's correlation, Spearman's correlation, Cramer's V, and ANOVA.
It returns a table with the following columns: Column_1, Column_2, Pearson_Correlation, Spearman_Correlation, Cramers_V, Anova_F_Value, Anova_P_Value.
You can display the table to view the results.
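
For readers unfamiliar with Cramer's V, the sketch below computes it for two categorical columns in plain Python using the textbook formula `sqrt(chi2 / (n * (min(rows, cols) - 1)))`. The function `cramers_v` and the sample data are hypothetical, and no bias correction is applied; the library may compute the statistic differently.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramer's V for two equal-length categorical sequences (no bias correction)."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for x, x_total in x_counts.items():
        for y, y_total in y_counts.items():
            expected = x_total * y_total / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    min_dim = min(len(x_counts), len(y_counts)) - 1
    if min_dim == 0:
        return 0.0  # one variable is constant; report no association
    return math.sqrt(chi2 / (n * min_dim))


# Made-up data: 'colour' fully determines 'label', but not 'shape'.
colour = ["red", "red", "blue", "blue"]
label = ["warm", "warm", "cold", "cold"]
shape = ["round", "square", "round", "square"]

v_dependent = cramers_v(colour, label)
v_independent = cramers_v(colour, shape)
print(v_dependent, v_independent)
```

Values near 1 indicate a strong association between the two categorical columns, and values near 0 indicate little or none.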

### Example Usage 
### get_bivariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']

# Perform bivariate analysis
get_bivariate_analysis(
    df,
    table_name="bivariate_analysis_results",
    numerical_columns=numerical_columns,
    categorical_columns=categorical_columns,
    id_columns=['id_column'],
    p_correlation_analysis=1,
    s_correlation_analysis=1,
    cramer_analysis=1,
    anova_analysis=1,
    print_graphs=0,
)
```

## Function
### Multivariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name under which the results are saved.
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **id_columns** (*list, optional*): List of columns to exclude from analysis.

### Description
Performs multivariate analysis on the DataFrame, which gives the Variance Inflation Factor (VIF) for each numerical column.
It returns a table with the following columns: Feature, VIF. You can display the table to view the results. 
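
As background, the VIF of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on all the others; equivalently, it is the matching diagonal entry of the inverse correlation matrix. The sketch below uses the second form in plain Python. The helpers `pearson`, `invert`, and `vif` and the sample data are hypothetical; this is an illustration of the statistic, not the library's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] -> [I | A^-1].
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("columns are perfectly collinear; VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {feature: VIF} from the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Made-up data: 'w' is nearly a copy of 'x', so both get a large VIF.
columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "w": [1.0, 2.0, 3.0, 5.0],
}
vif_scores = vif(columns)
print(vif_scores)
```

A VIF close to 1 means a feature is nearly uncorrelated with the others; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.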

### Example Usage 
### get_multivariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_multivariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Identify numerical columns
numerical_columns = ['col1', 'col2', 'col3']

# Perform multivariate analysis
get_multivariate_analysis(
    df,
    table_name="multivariate_analysis_results",
    numerical_columns=numerical_columns,
    id_columns=['id_column'],
)
```

## Contact
- **Author:** Tanya Irani
- **Email:** tanyairani22@gmail.com

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pyspark-eda",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "data analysis pyspark univariate bivariate mutlivariate statistics",
    "author": "Tanya Irani",
    "author_email": "tanyairani22@gmail.com.com",
    "download_url": "https://files.pythonhosted.org/packages/80/bc/33d86999e3e8ffe986ad935813532582072ccf06c9d7484519a2d03cdbe7/pyspark_eda-1.6.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A Python package for univariate ,bivariate and multivariate data analysis using PySpark",
    "version": "1.6.0",
    "project_urls": null,
    "split_keywords": [
        "data",
        "analysis",
        "pyspark",
        "univariate",
        "bivariate",
        "mutlivariate",
        "statistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fd208ba02e486fd0917adf660988dfc76bf629b09e96d2c89132a9abc0d20390",
                "md5": "394739563a4f17a3f35584bc6a3d65a6",
                "sha256": "7fc721a661080c441e6c304063dea4e5c852a6c61a1e79ce8c520f690c089bc9"
            },
            "downloads": -1,
            "filename": "pyspark_eda-1.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "394739563a4f17a3f35584bc6a3d65a6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 8947,
            "upload_time": "2024-07-05T06:47:26",
            "upload_time_iso_8601": "2024-07-05T06:47:26.591151Z",
            "url": "https://files.pythonhosted.org/packages/fd/20/8ba02e486fd0917adf660988dfc76bf629b09e96d2c89132a9abc0d20390/pyspark_eda-1.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "80bc33d86999e3e8ffe986ad935813532582072ccf06c9d7484519a2d03cdbe7",
                "md5": "fb8a0a9b4b78db833266c33ab8caeba6",
                "sha256": "51e108282d2360f0bb44adbc1b06df763cfacaa9f58347465421ee1ef21d9534"
            },
            "downloads": -1,
            "filename": "pyspark_eda-1.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "fb8a0a9b4b78db833266c33ab8caeba6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 8464,
            "upload_time": "2024-07-05T06:47:29",
            "upload_time_iso_8601": "2024-07-05T06:47:29.152302Z",
            "url": "https://files.pythonhosted.org/packages/80/bc/33d86999e3e8ffe986ad935813532582072ccf06c9d7484519a2d03cdbe7/pyspark_eda-1.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-05 06:47:29",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pyspark-eda"
}
        