datasafari


- Name: datasafari
- Version: 1.0.0
- Home page: https://www.datasafari.dev
- Summary: DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
- Upload time: 2024-07-12 20:11:59
- Maintainer: None
- Docs URL: None
- Author: George Dreemer
- Requires Python: <4.0,>=3.9
- License: GPL-3.0-only
- Keywords: data science, data analysis, machine learning, data preprocessing, statistical testing, data transformation, predictive modeling, data visualization, exploratory data analysis, hypothesis testing, feature engineering, model evaluation, model tuning, data cleaning, data insights, numerical analysis, categorical data, statistics, ML automation, data workflow, data discovery, sklearn integration, statistical inference, automated machine learning, data exploration

![DataSafari Banner](https://www.datasafari.dev/docs/_static/thumbs/ds-branding-thumb-main-web.png)
# Welcome to DataSafari!

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners. Whether you're exploring data, evaluating statistical assumptions, transforming datasets, or building predictive models, DataSafari provides all the tools you need in one package.

> In this README you can find a brief overview of how to start using DataSafari and what features you can utilize. For a more complete presentation you can visit [DataSafari's docs](https://www.datasafari.dev/docs).

## Quick Start

### Installation

To get started with DataSafari, install it using pip:

```console
pip install datasafari
```

Or, if you prefer using Poetry:

```console
poetry add datasafari
```

### Importing

Import DataSafari in your Python script to begin:

```python
import datasafari as ds
```

For detailed installation options, including installing from source, check our [Installation Guide in the docs](https://www.datasafari.dev/docs/other/installation).

## DataSafari at a Glance

DataSafari is organized into several subpackages, each tailored to specific data science tasks.

> *The logic behind the naming of each subpackage is inspired by the typical data workflow: exploring and understanding your data, transforming and cleaning it, evaluating assumptions and finally making predictions.* - George

### Explorers

**Explore and understand your data in depth and quicker than ever before.**

| Module         | Description                                                                                                                                                                       |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `explore_df()` | Explore a DataFrame and gain a bird's-eye view of summary statistics, NAs, data types and more.                                                                                   |
| `explore_num()`| Explore numerical variables in a DataFrame and gain insights on distribution characteristics, outlier detection using multiple methods (Z-score, IQR, Mahalanobis), normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection. |
| `explore_cat()`| Explore categorical variables within a DataFrame and gain insights on unique values, counts and percentages, and the entropy of variables to quantify data diversity.              |
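
To make the entropy measure mentioned for `explore_cat()` concrete, here is a minimal sketch of Shannon entropy for a categorical column in plain Python. The helper name `shannon_entropy` is hypothetical and not part of DataSafari's API, and the library's own calculation may differ (for example in the logarithm base).

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values.

    Hypothetical helper for illustration; not DataSafari's implementation.
    """
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # Sum p * log2(1 / p) over the observed categories.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


balanced = ["a", "b", "c", "d"] * 25   # four equally frequent categories
constant = ["a"] * 100                 # a single category, no diversity

print(shannon_entropy(balanced))
print(shannon_entropy(constant))
```

Higher values indicate that observations are spread more evenly across categories; a column with a single category has zero entropy.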

### Transformers

**Clean, encode and enhance your data to prepare it for further analysis.**

| Module          | Description                                                                                                                                                               |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `transform_num()`| Transform numerical variables in a DataFrame through operations like standardization, log-transformation, various scalings, winsorization, and interaction term creation. |
| `transform_cat()`| Transform categorical variables in a DataFrame through a range of encoding options and basic to advanced machine learning-based methods for uniform data cleaning.        |
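
As a rough illustration of two of the numerical operations listed above, the sketch below standardizes and winsorizes a list of values using only the standard library. The function names, the percentile defaults and the quantile method are assumptions made for this example; they do not describe how `transform_num()` is implemented or called.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles (1 to 99)."""
    # With n=100, quantiles() returns the 1st through 99th percentiles.
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    low = percentiles[lower_pct - 1]
    high = percentiles[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


data = list(range(1, 101)) + [1000]   # 1..100 plus one extreme outlier

clipped = winsorize(data)
scaled = standardize(clipped)

print(max(data), max(clipped))
print(round(statistics.fmean(scaled), 6), round(statistics.stdev(scaled), 6))
```

Winsorizing before standardizing keeps the single extreme value from dominating the mean and standard deviation.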

### Evaluators

**Ensure your data meets the required assumptions for analyses.**

| Module                         | Description                                                                                                                                                                              |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `evaluate_normality()`         | Evaluate normality of numerical data within groups defined by a categorical variable, employing multiple statistical tests, dynamically chosen based on data suitability.                |
| `evaluate_variance()`          | Evaluate variance homogeneity across groups defined by a categorical variable within a dataset, using several statistical tests, dynamically chosen based on data suitability.           |
| `evaluate_dtype()`             | Evaluate and automatically categorize the data types of DataFrame columns, effectively distinguishing between ambiguous cases based on detailed logical assessments.                    |
| `evaluate_contingency_table()` | Evaluate the suitability of statistical tests for a given contingency table by analyzing its characteristics and guiding the selection of appropriate tests.                             |
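
To show the kind of check that can guide test selection for a contingency table, here is a minimal sketch that computes expected cell counts under independence and applies the common rule of thumb that every expected count should be at least 5 before relying on a chi-square test. The function names and the threshold are assumptions for illustration; the criteria used by `evaluate_contingency_table()` may differ.

```python
def expected_frequencies(table):
    """Expected cell counts under independence for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_is_reasonable(table, min_expected=5):
    """Rule of thumb: every expected cell count is at least `min_expected`."""
    return all(
        cell >= min_expected
        for row in expected_frequencies(table)
        for cell in row
    )


large = [[30, 20], [25, 25]]   # expected counts are all well above 5
small = [[3, 1], [2, 4]]       # expected counts fall below 5

print(chi_square_is_reasonable(large))
print(chi_square_is_reasonable(small))
```

When the rule fails, as for the small table, an exact test such as Fisher's exact test is the usual alternative.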

### Predictors

**Streamline model building and hypothesis testing.**

| Module                | Description                                                                                                                                                                           |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `predict_hypothesis()`| Conduct the optimal hypothesis test on a DataFrame, tailoring the approach based on the variable types and automating the testing prerequisites and analyses, outputting test results and interpretation. |
| `predict_ml()`        | Streamline the entire process of data preprocessing, model selection, and tuning, delivering optimal model recommendations based on the data provided.                               |


## DataSafari in Action

### Hypothesis Testing? One line.

```python
from datasafari.predictor import predict_hypothesis
import pandas as pd
import numpy as np

# Sample DataFrame
df_hypothesis = pd.DataFrame({
    'Group': np.random.choice(['Control', 'Treatment'], size=100),
    'Score': np.random.normal(0, 1, 100)
})

# Perform hypothesis testing
results = predict_hypothesis(df_hypothesis, 'Group', 'Score')
```

**How DataSafari Streamlines Hypothesis Testing:**

- **Automatic Test Selection**: Depending on the data types, ``predict_hypothesis()`` automatically selects the appropriate test. It uses the chi-square test, Fisher's exact test or other exact tests for categorical pairs, and t-tests, ANOVA and others for categorical and numerical combinations, adapting based on group counts, sample size and data distribution.

- **Assumption Verification**: Essential assumptions for the chosen tests are automatically checked.
    - **Normality**: Normality is verified using tests like Shapiro-Wilk or Anderson-Darling, essential for parametric tests.
    - **Variance Homogeneity**: Tests such as Levene’s or Bartlett’s are used to confirm equal variances, informing the choice between ANOVA types.

- **Comprehensive Output**:
    - **Justifications**: Provides comprehensive reasoning on all test choices.
    - **Test Statistics**: Key quantitative results from the hypothesis test.
    - **P-values**: Indicators of the statistical significance of the findings.
    - **Conclusions**: Clear textual interpretations of whether the results lead to rejecting the null hypothesis.
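
The selection logic described above can be pictured as a small decision tree. The sketch below follows the conventional textbook choice for comparing a numerical variable across groups; the function `choose_test`, its parameters and the specific tests named are assumptions for illustration, not DataSafari's internal logic.

```python
def choose_test(n_groups, is_normal, equal_variance):
    """Pick a test for a numerical variable compared across groups.

    Hypothetical helper: `is_normal` and `equal_variance` stand for the
    outcomes of assumption checks such as Shapiro-Wilk and Levene's test.
    """
    if n_groups < 2:
        raise ValueError("At least two groups are needed for a comparison.")
    if n_groups == 2:
        if not is_normal:
            return "Mann-Whitney U test"
        return "Student's t-test" if equal_variance else "Welch's t-test"
    # Three or more groups.
    if not is_normal:
        return "Kruskal-Wallis test"
    return "One-way ANOVA" if equal_variance else "Welch's ANOVA"


print(choose_test(n_groups=2, is_normal=True, equal_variance=True))
print(choose_test(n_groups=2, is_normal=False, equal_variance=True))
print(choose_test(n_groups=3, is_normal=True, equal_variance=False))
```

The point of the sketch is the ordering: group count narrows the family of tests, normality decides between parametric and non-parametric options, and variance homogeneity picks the variant.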

### Machine Learning? You guessed it.

```python
from datasafari.predictor import predict_ml
import pandas as pd
import numpy as np

# Another sample DataFrame for ML
df_ml = pd.DataFrame({
    'Age': np.random.randint(20, 60, size=100),
    'Salary': np.random.normal(50000, 15000, size=100),
    'Experience': np.random.randint(1, 20, size=100)
})

x_cols = ['Age', 'Experience']
y_col = 'Salary'

# Discover the best models for your data
best_models = predict_ml(df_ml, x_cols, y_col)
```

**How DataSafari Simplifies Machine Learning Model Selection:**

- **Tailored Data Preprocessing**: The function automatically processes various types of data (numerical, categorical, text, datetime), preparing them optimally for machine learning.
    - Numerical data might be scaled or normalized.
    - Categorical data can be encoded.
    - Text data might be vectorized using techniques suitable for the analysis.

- **Intelligent Model Evaluation:** The function evaluates a variety of models using a composite score that synthesizes performance across multiple metrics, taking into account the multidimensional aspects of model performance.
    - **Composite Score Calculation**: Each metric's score is weighted according to user-specified priorities, with lower weights assigned to non-priority metrics (e.g. prioritizing RMSE over MAE). The composite score serves as a holistic measure of model performance, so that recommended models are not just good on one metric but robust across several criteria.

- **Automated Hyperparameter Tuning:** Once the top models are identified based on the composite score, the pipeline employs techniques like grid search, random search, or Bayesian optimization to fine-tune the models.
    - **Output of Tuned Models**: The best configurations for the models are output, along with their performance metrics, allowing users to make informed decisions about which models to deploy based on robust, empirically derived data.

- **Customization Options & Sensible Defaults:** Users can define custom hyperparameter grids, select specific tuning algorithms, prioritize models, tailor data preprocessing, and prioritize metrics.
    - **Accessibility**: Every part of the process is in the hands of the user, but sensible defaults are provided for ultimate simplicity of use, which is the approach for ``datasafari`` in general.
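
To illustrate the idea of a composite score, here is a minimal sketch of a weighted average in which prioritized metrics count more than the rest. It assumes every metric has already been rescaled so that higher is better and values are comparable; the function name, the metric names and the weights of 5 and 1 are hypothetical and do not reflect the weighting `predict_ml()` actually uses.

```python
def composite_score(metric_scores, priority_metrics,
                    priority_weight=5.0, default_weight=1.0):
    """Weighted average of metric scores; prioritized metrics weigh more."""
    weights = {
        name: priority_weight if name in priority_metrics else default_weight
        for name in metric_scores
    }
    total_weight = sum(weights.values())
    weighted_sum = sum(metric_scores[name] * weights[name] for name in metric_scores)
    return weighted_sum / total_weight


# Hypothetical, already-normalized scores (higher is better).
model_a = {"rmse_score": 0.9, "mae_score": 0.5}
model_b = {"rmse_score": 0.6, "mae_score": 0.9}

# Prioritizing RMSE favors model_a; equal weights favor model_b.
print(composite_score(model_a, {"rmse_score"}), composite_score(model_b, {"rmse_score"}))
print(composite_score(model_a, set()), composite_score(model_b, set()))
```

The example shows why the choice of priority metrics matters: the same two models swap ranks depending on which metric is emphasized.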

----
## License

DataSafari is licensed under the GNU General Public License v3.0 (GPL-3.0-only). Under this license, anyone who distributes modified or derivative versions of the project must release them under the same terms, so they remain open source.


## Contact

Connect with me on [LinkedIn](https://www.linkedin.com/in/georgedreemer) or visit my [website](https://www.georgedreemer.com).

> Thank you very much for taking an interest in DataSafari! 💚 - George

            

Raw data

            {
    "_id": null,
    "home_page": "https://www.datasafari.dev",
    "name": "datasafari",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "data science, data analysis, machine learning, data preprocessing, statistical testing, data transformation, predictive modeling, data visualization, exploratory data analysis, hypothesis testing, feature engineering, model evaluation, model tuning, data cleaning, data insights, numerical analysis, categorical data, statistics, ML automation, data workflow, data discovery, sklearn integration, statistical inference, automated machine learning, data exploration",
    "author": "George Dreemer",
    "author_email": "georgedreemer@proton.me",
    "download_url": "https://files.pythonhosted.org/packages/24/18/7c05aa7c133acae0c51bfec4a3a3fda5e86595fb674e557b8f79bb42c197/datasafari-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0-only",
    "summary": "DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.",
    "version": "1.0.0",
    "project_urls": {
        "Documentation": "https://www.datasafari.dev/docs",
        "Homepage": "https://www.datasafari.dev",
        "Repository": "https://github.com/ETA444/datasafari"
    },
    "split_keywords": [
        "data science",
        " data analysis",
        " machine learning",
        " data preprocessing",
        " statistical testing",
        " data transformation",
        " predictive modeling",
        " data visualization",
        " exploratory data analysis",
        " hypothesis testing",
        " feature engineering",
        " model evaluation",
        " model tuning",
        " data cleaning",
        " data insights",
        " numerical analysis",
        " categorical data",
        " statistics",
        " ml automation",
        " data workflow",
        " data discovery",
        " sklearn integration",
        " statistical inference",
        " automated machine learning",
        " data exploration"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "89aba9d5a3d651d32423b85a35645ee949a4a136692dc1de8b8fdee2303b12c0",
                "md5": "5f81e104bab693e8b3578199fae63906",
                "sha256": "8145f8a29ad31f9494c2cdaf1f14863d64fed0ed2bfb4fe0c004e6d6e951a4a9"
            },
            "downloads": -1,
            "filename": "datasafari-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5f81e104bab693e8b3578199fae63906",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 2841268,
            "upload_time": "2024-07-12T20:11:57",
            "upload_time_iso_8601": "2024-07-12T20:11:57.539283Z",
            "url": "https://files.pythonhosted.org/packages/89/ab/a9d5a3d651d32423b85a35645ee949a4a136692dc1de8b8fdee2303b12c0/datasafari-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "24187c05aa7c133acae0c51bfec4a3a3fda5e86595fb674e557b8f79bb42c197",
                "md5": "648c0126e6a71761d8f6366f39077b96",
                "sha256": "d64038f51fd6968a9aeea5eda0ee4199c7a5c4f482c535286b69238ba98628f6"
            },
            "downloads": -1,
            "filename": "datasafari-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "648c0126e6a71761d8f6366f39077b96",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 2798030,
            "upload_time": "2024-07-12T20:11:59",
            "upload_time_iso_8601": "2024-07-12T20:11:59.797085Z",
            "url": "https://files.pythonhosted.org/packages/24/18/7c05aa7c133acae0c51bfec4a3a3fda5e86595fb674e557b8f79bb42c197/datasafari-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-12 20:11:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ETA444",
    "github_project": "datasafari",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "datasafari"
}
        