resampling-techniques


Nameresampling-techniques JSON
Version 1.5 PyPI version JSON
download
home_pagehttps://github.com/vishanth10/statistics_library.git
SummaryA custom statistics library/ reshuffing technqiues for chi-abs, boostrapting analysis and other statistical functions.
upload_time2024-03-01 02:13:39
maintainer
docs_urlNone
authorVishanth Hari Raj Balasubramanian
requires_python
license
keywords statistics resampling chi-abs analysis
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Statistics_library

Library for calculating Chi abs, resampling, Bootstrapping -- LS 40 @UCLA Life science course


Statistics Library Documentation
Author: Vishanth Hari Raj
GitHub Repository: https://github.com/vishanth10/statistics_library.git
 
Function Documentation
1. read_data
The read_data function reads data from either a DataFrame or a file path. If the input is a DataFrame, it makes a copy. If it's a file path, it checks the file extension (.xls, .xlsx, .csv) and reads the data accordingly. It then applies pandas' to_numeric function to convert the data to numeric types, with non-numeric values coerced to NaN.
Usage Example:
data = read_data('path_to_file.csv') # For CSV files
data = read_data('path_to_file.xlsx') # For Excel files
data = read_data(dataframe) # For existing DataFrame
 
2. compare_dimensions
This function compares the dimensions of the observed and expected datasets. If the dimensions don't match, it raises a ValueError. This ensures that further statistical calculations are valid.
Usage Example:
compare_dimensions(observed_data, expected_data) # Checks if dimensions of observed and expected data match
 
3. calculate_chi_squared
Calculates the chi-squared statistic from observed and expected datasets. This is done by summing the squared differences between observed and expected values, divided by the expected values.
Usage Example:
chi_squared = calculate_chi_squared(observed_data, expected_data) # Returns the chi-squared statistic
 
4. calculate_expected
Computes expected frequencies for a contingency table based on the observed data. It uses row and column sums to calculate these frequencies, assuming independence between rows and columns.
Usage Example:
expected_data = calculate_expected(observed_data) # Returns the expected frequencies
 
5. calculate_chi_abs
This function calculates the chi absolute statistic for observed and expected data. If expected data is not provided, it calculates the expected data based on the observed data.
Usage Example:
chi_abs = calculate_chi_abs(observed_data, expected_data) # Calculates the chi absolute statistic
 
6. chi_abs_stat
A wrapper function for calculate_chi_abs. It allows the user to provide only observed data, with an option to provide expected data. It handles dimension comparison and error logging.
Usage Example:
chi_abs = chi_abs_stat(observed_data, expected_data) # Calculates chi absolute statistic with optional expected data
 
7. calculate_p_value
Calculates the p-value from a chi-squared statistic and degrees of freedom using the chi-squared survival function.
Usage Example:
p_value = calculate_p_value(chi_squared, dof) # Returns the p-value
 
8. chi_squared_stat
Reads observed and expected data, checks their dimensions, calculates the chi-squared statistic, and then computes the p-value.
Usage Example:
chi_squared_value = chi_squared_stat(observed_data, expected_data) # Calculates chi-squared statistic
 
9. p_value_stat
Reads observed and expected data, checks dimensions, calculates the chi-squared statistic, and then computes the p-value.
Usage Example:
p_value = p_value_stat(observed_data, expected_data) # Calculates the p-value for chi-squared statistic
 
10. convert_df_to_numpy
Converts DataFrame data to numpy arrays for use in functions, particularly for bootstrapping. It returns a tuple of numpy arrays for observed and expected data.
Usage Example:
observed_array, expected_array = convert_df_to_numpy(df_observed, df_expected) # Converts DataFrame to numpy arrays
 
11. bootstrap_chi_abs_distribution
Generates a bootstrap distribution of the chi absolute statistic for an n*n contingency table. Simulates new datasets and calculates the chi absolute for each, returning an array of these statistics.
Usage Example:
simulated_chi_abs = bootstrap_chi_abs_distribution(observed_data) # Returns array of simulated chi absolute statistics
 
12. calculate_p_value_bootstrap
Calculates the p-value for the chi absolute statistic using bootstrap methods. Compares the observed chi absolute statistic against the distribution of simulated chi absolute values to compute the p-value.
Usage Example:
p_value = calculate_p_value_bootstrap(observed_chi_abs, simulated_chi_abs) # Calculates p-value using bootstrap method
 
13. plot_chi_abs_distribution
Plots the distribution of simulated chi absolute values along with the observed chi absolute value. Shows the calculated p-value, providing a visual representation of the statistical analysis.
Usage Example:
plot_chi_abs_distribution(simulated_data, observed_data, p_value) # Plots distribution of chi absolute values
 


Extended:- Relative Risk Analysis Documentation
This documentation provides an overview and usage guide for a set of Python functions designed to calculate the relative risk between two treatments, resample data for statistical analysis, calculate confidence intervals, and plot the distribution of relative risks. These functions are intended for statistical analysis in lifesciences research or any field requiring comparative risk assessment.
 
1. calculate_relative_risk_two_treatments
Calculates the relative risk (RR) of an event occurring between two treatments. RR is a measure of the strength of association between an exposure and an outcome.
Parameters:
- observed_data: 2D array of observed data.
- event_row_index: Index of the row corresponding to the event.
- treatment1_index: Column index for the first treatment.
- treatment2_index: Column index for the second treatment.
Logic:
The function sums the total occurrences for each treatment and calculates the probability of the event for each treatment. The relative risk is then the ratio of these probabilities.
Usage Example:
relative_risk = calculate_relative_risk_two_treatments(observed_data, 0, 1, 2)
 
2. resample_and_calculate_rr
Performs resampling to calculate a distribution of relative risks through bootstrapping.
Parameters:
- observed_data: 2D array of observed data.
- event_row_index: Index of the event row.
- reference_treatment_index: Index for the reference treatment column (default is 0).
- num_simulations: Number of bootstrap simulations to perform.
Logic:
Generates simulated datasets by resampling from the observed data, maintaining the original proportions. Calculates the relative risk for each simulated dataset to create a distribution of relative risks.
Usage Example:
simulated_rr = resample_and_calculate_rr(observed_data, 0, num_simulations=10000)
 
3. calculate_confidence_interval
Calculates the 95% confidence interval for the relative risk from a distribution of simulated relative risks.
Parameters:
- simulated_rr (np.array): Array of simulated relative risks.
- percentile (float, optional): The percentile for the confidence interval, defaulting to 95%.
Logic:
The function calculates the lower and upper bounds of the confidence interval based on the specified percentile using numpy's percentile function. This provides an estimate of the interval within which the true relative risk is expected to lie with 95% certainty.
Usage Example:
lower_bound, upper_bound = calculate_confidence_interval(simulated_rr, 95)
 
4. calculate_probabilities_for_each_treatment
Calculates the probability of an event occurring for each treatment group.
Parameters:
- observed_data (np.array or pd.DataFrame): The observed data.
- event_row_index (int): The row index for the event.
Logic:
For each treatment group, the function calculates the total counts and then determines the probability of the event. These probabilities are stored in a dictionary, providing a quick reference for each treatment's event probability.
Usage Example:
probabilities = calculate_probabilities_for_each_treatment(observed_data, 0)
 
5. plot_relative_risk_distribution
Plots the distribution of simulated relative risks against the observed relative risk, including confidence intervals.
Parameters:
- simulated_rr (np.array): Array of simulated relative risks.
- observed_rr (float): The observed relative risk.
Logic:
This function visualizes the distribution of simulated relative risks, highlighting the observed relative risk and marking the confidence intervals. It uses seaborn and matplotlib for plotting.
Usage Example:
plot_relative_risk_distribution(simulated_rr, observed_rr)


APPENDIX - I

bootstrap_chi_abs_distribution :
The bootstrap_chi_abs_distribution function is designed for generating a bootstrap distribution of the chi absolute statistic for an n*n contingency table. This is crucial in statistical analysis, especially when assessing the significance of observed frequencies against expected frequencies in categorical data.
Parameters:
- observed_data: Observed frequencies in an n*n contingency table, either a numpy array or pandas DataFrame.
- num_simulations: Number of bootstrap samples to generate.
- with_replacement: Boolean indicating whether sampling should be with replacement.
Process:
1. Convert DataFrame to Numpy Array: If observed_data is a DataFrame, it's converted to a numpy array.
2. Determine Dimensions and Calculate Expected Frequencies: Calculates total rows, total columns, and expected frequencies.
3. Initialize Results Array: An array to store chi absolute statistics for each simulation.
4. Calculate Total Counts Per Column: Computes the total count of observations per column.
5. Create Pooled Data Array: Concatenates all categories for resampling.
6. Bootstrap Simulation Loop: Iterates num_simulations times, creating and analyzing simulated datasets.
7. Calculate Chi Absolute Statistic: Computes chi absolute statistic for each simulated dataset.
8. Return Results: Returns the array containing chi absolute statistics from all simulations.
Usage:
This function is used in statistical hypothesis testing to assess the significance of the observed data. By comparing the observed chi absolute statistic to a distribution generated through bootstrapping, one can infer the likelihood of observing such a statistic under the null hypothesis.



APPENDIX II
 


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/vishanth10/statistics_library.git",
    "name": "resampling-techniques",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "statistics,resampling chi-abs analysis",
    "author": "Vishanth Hari Raj Balasubramanian",
    "author_email": "rbvish1007@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d0/71/a5948135f02825fcdf9dcbf647d37f99a39c19b10ee6360ff21f240a628b/resampling_techniques-1.5.tar.gz",
    "platform": null,
    "description": "# Statistics_library\n\nLibrary for calculating Chi abs, resampling, Bootstrapping -- LS 40 @UCLA Life science course\n\n\nStatistics Library Documentation\nAuthor: Vishanth Hari Raj\nGitHub Repository: https://github.com/vishanth10/statistics_library.git\n \nFunction Documentation\n1. read_data\nThe read_data function reads data from either a DataFrame or a file path. If the input is a DataFrame, it makes a copy. If it's a file path, it checks the file extension (.xls, .xlsx, .csv) and reads the data accordingly. It then applies pandas' to_numeric function to convert the data to numeric types, with non-numeric values coerced to NaN.\nUsage Example:\ndata = read_data('path_to_file.csv') # For CSV files\ndata = read_data('path_to_file.xlsx') # For Excel files\ndata = read_data(dataframe) # For existing DataFrame\n \n2. compare_dimensions\nThis function compares the dimensions of the observed and expected datasets. If the dimensions don't match, it raises a ValueError. This ensures that further statistical calculations are valid.\nUsage Example:\ncompare_dimensions(observed_data, expected_data) # Checks if dimensions of observed and expected data match\n \n3. calculate_chi_squared\nCalculates the chi-squared statistic from observed and expected datasets. This is done by summing the squared differences between observed and expected values, divided by the expected values.\nUsage Example:\nchi_squared = calculate_chi_squared(observed_data, expected_data) # Returns the chi-squared statistic\n \n4. calculate_expected\nComputes expected frequencies for a contingency table based on the observed data. It uses row and column sums to calculate these frequencies, assuming independence between rows and columns.\nUsage Example:\nexpected_data = calculate_expected(observed_data) # Returns the expected frequencies\n \n5. calculate_chi_abs\nThis function calculates the chi absolute statistic for observed and expected data. If expected data is not provided, it calculates the expected data based on the observed data.\nUsage Example:\nchi_abs = calculate_chi_abs(observed_data, expected_data) # Calculates the chi absolute statistic\n \n6. chi_abs_stat\nA wrapper function for calculate_chi_abs. It allows the user to provide only observed data, with an option to provide expected data. It handles dimension comparison and error logging.\nUsage Example:\nchi_abs = chi_abs_stat(observed_data, expected_data) # Calculates chi absolute statistic with optional expected data\n \n7. calculate_p_value\nCalculates the p-value from a chi-squared statistic and degrees of freedom using the chi-squared survival function.\nUsage Example:\np_value = calculate_p_value(chi_squared, dof) # Returns the p-value\n \n8. chi_squared_stat\nReads observed and expected data, checks their dimensions, calculates the chi-squared statistic, and then computes the p-value.\nUsage Example:\nchi_squared_value = chi_squared_stat(observed_data, expected_data) # Calculates chi-squared statistic\n \n9. p_value_stat\nReads observed and expected data, checks dimensions, calculates the chi-squared statistic, and then computes the p-value.\nUsage Example:\np_value = p_value_stat(observed_data, expected_data) # Calculates the p-value for chi-squared statistic\n \n10. convert_df_to_numpy\nConverts DataFrame data to numpy arrays for use in functions, particularly for bootstrapping. It returns a tuple of numpy arrays for observed and expected data.\nUsage Example:\nobserved_array, expected_array = convert_df_to_numpy(df_observed, df_expected) # Converts DataFrame to numpy arrays\n \n11. bootstrap_chi_abs_distribution\nGenerates a bootstrap distribution of the chi absolute statistic for an n*n contingency table. Simulates new datasets and calculates the chi absolute for each, returning an array of these statistics.\nUsage Example:\nsimulated_chi_abs = bootstrap_chi_abs_distribution(observed_data) # Returns array of simulated chi absolute statistics\n \n12. calculate_p_value_bootstrap\nCalculates the p-value for the chi absolute statistic using bootstrap methods. Compares the observed chi absolute statistic against the distribution of simulated chi absolute values to compute the p-value.\nUsage Example:\np_value = calculate_p_value_bootstrap(observed_chi_abs, simulated_chi_abs) # Calculates p-value using bootstrap method\n \n13. plot_chi_abs_distribution\nPlots the distribution of simulated chi absolute values along with the observed chi absolute value. Shows the calculated p-value, providing a visual representation of the statistical analysis.\nUsage Example:\nplot_chi_abs_distribution(simulated_data, observed_data, p_value) # Plots distribution of chi absolute values\n \n\n\nExtended:- Relative Risk Analysis Documentation\nThis documentation provides an overview and usage guide for a set of Python functions designed to calculate the relative risk between two treatments, resample data for statistical analysis, calculate confidence intervals, and plot the distribution of relative risks. These functions are intended for statistical analysis in lifesciences research or any field requiring comparative risk assessment.\n \n1. calculate_relative_risk_two_treatments\nCalculates the relative risk (RR) of an event occurring between two treatments. RR is a measure of the strength of association between an exposure and an outcome.\nParameters:\n- observed_data: 2D array of observed data.\n- event_row_index: Index of the row corresponding to the event.\n- treatment1_index: Column index for the first treatment.\n- treatment2_index: Column index for the second treatment.\nLogic:\nThe function sums the total occurrences for each treatment and calculates the probability of the event for each treatment. The relative risk is then the ratio of these probabilities.\nUsage Example:\nrelative_risk = calculate_relative_risk_two_treatments(observed_data, 0, 1, 2)\n \n2. resample_and_calculate_rr\nPerforms resampling to calculate a distribution of relative risks through bootstrapping.\nParameters:\n- observed_data: 2D array of observed data.\n- event_row_index: Index of the event row.\n- reference_treatment_index: Index for the reference treatment column (default is 0).\n- num_simulations: Number of bootstrap simulations to perform.\nLogic:\nGenerates simulated datasets by resampling from the observed data, maintaining the original proportions. Calculates the relative risk for each simulated dataset to create a distribution of relative risks.\nUsage Example:\nsimulated_rr = resample_and_calculate_rr(observed_data, 0, num_simulations=10000)\n \n3. calculate_confidence_interval\nCalculates the 95% confidence interval for the relative risk from a distribution of simulated relative risks.\nParameters:\n- simulated_rr (np.array): Array of simulated relative risks.\n- percentile (float, optional): The percentile for the confidence interval, defaulting to 95%.\nLogic:\nThe function calculates the lower and upper bounds of the confidence interval based on the specified percentile using numpy's percentile function. This provides an estimate of the interval within which the true relative risk is expected to lie with 95% certainty.\nUsage Example:\nlower_bound, upper_bound = calculate_confidence_interval(simulated_rr, 95)\n \n4. calculate_probabilities_for_each_treatment\nCalculates the probability of an event occurring for each treatment group.\nParameters:\n- observed_data (np.array or pd.DataFrame): The observed data.\n- event_row_index (int): The row index for the event.\nLogic:\nFor each treatment group, the function calculates the total counts and then determines the probability of the event. These probabilities are stored in a dictionary, providing a quick reference for each treatment's event probability.\nUsage Example:\nprobabilities = calculate_probabilities_for_each_treatment(observed_data, 0)\n \n5. plot_relative_risk_distribution\nPlots the distribution of simulated relative risks against the observed relative risk, including confidence intervals.\nParameters:\n- simulated_rr (np.array): Array of simulated relative risks.\n- observed_rr (float): The observed relative risk.\nLogic:\nThis function visualizes the distribution of simulated relative risks, highlighting the observed relative risk and marking the confidence intervals. It uses seaborn and matplotlib for plotting.\nUsage Example:\nplot_relative_risk_distribution(simulated_rr, observed_rr)\n\n\nAPPENDIX - I\n\nbootstrap_chi_abs_distribution :\nThe bootstrap_chi_abs_distribution function is designed for generating a bootstrap distribution of the chi absolute statistic for an n*n contingency table. This is crucial in statistical analysis, especially when assessing the significance of observed frequencies against expected frequencies in categorical data.\nParameters:\n- observed_data: Observed frequencies in an n*n contingency table, either a numpy array or pandas DataFrame.\n- num_simulations: Number of bootstrap samples to generate.\n- with_replacement: Boolean indicating whether sampling should be with replacement.\nProcess:\n1. Convert DataFrame to Numpy Array: If observed_data is a DataFrame, it's converted to a numpy array.\n2. Determine Dimensions and Calculate Expected Frequencies: Calculates total rows, total columns, and expected frequencies.\n3. Initialize Results Array: An array to store chi absolute statistics for each simulation.\n4. Calculate Total Counts Per Column: Computes the total count of observations per column.\n5. Create Pooled Data Array: Concatenates all categories for resampling.\n6. Bootstrap Simulation Loop: Iterates num_simulations times, creating and analyzing simulated datasets.\n7. Calculate Chi Absolute Statistic: Computes chi absolute statistic for each simulated dataset.\n8. Return Results: Returns the array containing chi absolute statistics from all simulations.\nUsage:\nThis function is used in statistical hypothesis testing to assess the significance of the observed data. By comparing the observed chi absolute statistic to a distribution generated through bootstrapping, one can infer the likelihood of observing such a statistic under the null hypothesis.\n\n\n\nAPPENDIX II\n \n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "A custom statistics library/ reshuffing technqiues for chi-abs, boostrapting analysis and other statistical functions.",
    "version": "1.5",
    "project_urls": {
        "Homepage": "https://github.com/vishanth10/statistics_library.git"
    },
    "split_keywords": [
        "statistics",
        "resampling chi-abs analysis"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1431a545c5b3324d4650312b2a076c0004237f14aa5d09a9a531123956e6ac20",
                "md5": "fabb944aeb23b098494c5d6b50921256",
                "sha256": "0c1fcb908b416ca39fb5ece88add610530458a2663f9e31316ed7dfde1dfc5d3"
            },
            "downloads": -1,
            "filename": "resampling_techniques-1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fabb944aeb23b098494c5d6b50921256",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5202,
            "upload_time": "2024-03-01T02:13:37",
            "upload_time_iso_8601": "2024-03-01T02:13:37.440620Z",
            "url": "https://files.pythonhosted.org/packages/14/31/a545c5b3324d4650312b2a076c0004237f14aa5d09a9a531123956e6ac20/resampling_techniques-1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d071a5948135f02825fcdf9dcbf647d37f99a39c19b10ee6360ff21f240a628b",
                "md5": "7489353624ae6fb3635c4bccf6c73fd7",
                "sha256": "cc9e4288573441c2668c9075b4bd154b853c38c16e1888b8d69b908917c8cdfe"
            },
            "downloads": -1,
            "filename": "resampling_techniques-1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "7489353624ae6fb3635c4bccf6c73fd7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5425,
            "upload_time": "2024-03-01T02:13:39",
            "upload_time_iso_8601": "2024-03-01T02:13:39.421083Z",
            "url": "https://files.pythonhosted.org/packages/d0/71/a5948135f02825fcdf9dcbf647d37f99a39c19b10ee6360ff21f240a628b/resampling_techniques-1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-01 02:13:39",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "vishanth10",
    "github_project": "statistics_library",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "resampling-techniques"
}
        
Elapsed time: 1.37223s