datashadric


- Name: datashadric
- Version: 0.1.2
- Summary: An exploratory data science toolkit for analysis, machine learning, and visualization
- Upload time: 2025-10-06 07:14:52
- Requires Python: >=3.8
- License: MIT License. Copyright (c) 2025 datashadric. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Keywords: data science, machine learning, statistics, visualization, pandas, analysis

# datashadric - Python Toolkit for Machine Learning and Advanced Data Analytics

An exploratory Python toolkit for data science, machine learning, statistical analysis, and visualization.

## Author

### **Paul Namalomba - University of Cape Town**
  - SESKA Computational Engineer
  - PhD Candidate (Civil Engineering Spec. Computational and Applied Mechanics)

## Overview

`datashadric` provides a collection of modules for common data science tasks: data cleaning and exploration, machine learning model building, unsupervised and supervised classification, and statistical analysis and testing. The package is designed for readability and ease of use, so that analysts can write complex data science workflows more easily.

## Features

- **Machine Learning**: Model training, data ensembling (sampling), model evaluation, and prediction tools.
- **Regression Analysis**: Linear and logistic regression modeling with diagnostic checks.
- **Data Manipulation**: Pandas-based utilities for cleaning and transforming data and for summarizing its descriptive characteristics.
- **Statistical Analysis**: Hypothesis testing, confidence intervals, normality (Gaussian) and Bayesian distribution checks, and sampling utilities.
- **Visualization**: Plotting functions for data exploration, visualization and presentation.

## Installation

### From PyPI (recommended)
```bash
pip install datashadric
```

### From Source
```bash
git clone https://github.com/diversecellar/datashadric.git
cd datashadric
pip install .
```

### Development Installation
```bash
git clone https://github.com/diversecellar/datashadric.git
cd datashadric
pip install -e ".[dev]"
```

## Quick Start

```python
import pandas as pd
from datashadric.mlearning import ml_naive_bayes_model
from datashadric.regression import lr_ols_model
from datashadric.dataframing import df_check_na_values
from datashadric.stochastics import df_gaussian_checks
from datashadric.plotters import df_boxplotter

# load your data
df = pd.read_csv('your_data.csv')

# check for missing values
na_summary = df_check_na_values(df)

# test for normality
normality_results = df_gaussian_checks(df, 'your_column')

# create visualizations
df_boxplotter(df, 'category_col', 'numeric_col', type_plot=0)

# build machine learning models
model, metrics = ml_naive_bayes_model(df, 'target_column', test_size=0.2)

# perform regression analysis
ols_results = lr_ols_model(df, 'dependent_var', ['independent_var1', 'independent_var2'])
```

## Module Overview

### `mlearning` - Machine Learning
- `ml_naive_bayes_model()`: Train and evaluate Naive Bayes classifiers
- `ml_naive_bayes_metrics()`: Calculate detailed model performance metrics
- `logr_predictor()`: Logistic regression modeling and prediction
- `confusion_matrix_from_predictions()`: Generate confusion matrices
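
To illustrate the kind of output a confusion matrix gives, here is a minimal standard-library sketch. It is not the `datashadric` implementation; the helper name `confusion_counts` and the sample labels are hypothetical and only show the idea of tabulating actual against predicted labels.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs. Hypothetical helper, not part of datashadric."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


# made-up labels for illustration only
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

counts = confusion_counts(y_true, y_pred)
labels = sorted(set(y_true) | set(y_pred))

# rows are actual labels, columns are predicted labels (both in sorted order)
matrix = [[counts[(actual, predicted)] for predicted in labels] for actual in labels]

# accuracy is the share of pairs that fall on the diagonal
accuracy = sum(counts[(label, label)] for label in labels) / len(y_true)

print(labels)
print(matrix)
print(accuracy)
```

Off-diagonal cells show which classes the model confuses with each other, which a single accuracy figure hides.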

### `regression` - Regression Analysis
- `lr_ols_model()`: Ordinary Least Squares regression modeling
- `lr_check_homoscedasticity()`: Test regression assumptions
- `lr_check_normality()`: Check residual normality
- `lr_post_hoc_test()`: Post-hoc regression diagnostics
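
As background for the functions above, the following sketch fits an Ordinary Least Squares line with one predictor using the closed-form solution and then computes the residuals that diagnostic checks operate on. It assumes a single predictor and uses only the standard library; `fit_simple_ols` and the sample data are hypothetical and do not reflect how `lr_ols_model()` is implemented.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals.

    Hypothetical helper for one predictor; not part of datashadric.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_simple_ols(x, y)

# residuals are the observed values minus the fitted values
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# R^2 compares residual variation with total variation around the mean
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(intercept, slope, r_squared)
```

Homoscedasticity and normality checks are both statements about these residuals: their spread should not depend on the fitted value, and their distribution should be roughly Gaussian.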

### `dataframing` - Data Manipulation
- `df_check_na_values()`: Comprehensive missing value analysis
- `df_drop_dupes()`: Remove duplicate rows with reporting
- `df_one_hot_encoding()`: Convert categorical variables to dummy variables
- `df_check_correlation()`: Correlation analysis and visualization
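
For orientation, the operations listed above correspond to standard pandas calls. The sketch below uses plain pandas only; it is an assumption that the `datashadric` helpers wrap similar calls, and the sample DataFrame is made up.

```python
import pandas as pd

# made-up data: one duplicated row and one missing city
df = pd.DataFrame({
    "city": ["Cape Town", "Durban", "Durban", None],
    "temp_c": [21.5, 27.0, 27.0, 19.0],
})

# missing values per column
na_counts = df.isna().sum()

# drop exact duplicate rows and report how many were removed
deduped = df.drop_duplicates()
n_removed = len(df) - len(deduped)

# one-hot encode the categorical column; rows with a missing city
# get zeros in every dummy column because dummy_na defaults to False
encoded = pd.get_dummies(deduped, columns=["city"])

print(na_counts)
print(f"removed {n_removed} duplicate row(s)")
print(encoded)
```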

### `stochastics` - Statistical Analysis
- `df_gaussian_checks()`: Test data normality with Shapiro-Wilk and Q-Q plots
- `df_calc_conf_interval()`: Calculate confidence intervals
- `df_calc_moe()`: Compute margin of error
- `df_calc_zscore()`: Z-score calculations
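
The quantities named above are related by simple formulas: the margin of error is the critical value times the standard error, and the confidence interval is the sample mean plus or minus that margin. The sketch below works through them with the standard library under a normal approximation. That approximation is an assumption made here (a t-distribution is usually preferred for small samples), and `normal_conf_interval` is a hypothetical helper, not the `datashadric` API.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_conf_interval(sample, confidence=0.95):
    """Return (lower, upper, margin_of_error) using a normal approximation.

    Hypothetical helper; not part of datashadric.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # two-sided critical value, roughly 1.96 for 95% confidence
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    moe = z_crit * stdev(sample) / sqrt(n)
    centre = mean(sample)
    return centre - moe, centre + moe, moe


# made-up measurements for illustration only
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
lower, upper, moe = normal_conf_interval(sample)

# z-score of each observation relative to the sample mean and standard deviation
m, s = mean(sample), stdev(sample)
z_scores = [(x - m) / s for x in sample]

print(lower, upper, moe)
print(z_scores)
```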

### `plotters` - Visualization
- `df_boxplotter()`: Box plots for outlier detection
- `df_histplotter()`: Histogram creation with customization
- `df_scatterplotter()`: Scatter plot generation
- `df_pairplot()`: Comprehensive pairwise plotting
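
Box plots conventionally flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The following standard-library sketch applies that rule numerically. It illustrates the convention only: `iqr_outliers` is a hypothetical helper, and the quartile method that `df_boxplotter()` uses is not stated in this README.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles.

    Hypothetical helper; not part of datashadric.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# made-up data with one obvious outlier
data = [12, 13, 13, 14, 15, 15, 16, 42]
outliers = iqr_outliers(data)

print(outliers)
```

Different quartile definitions can move the fences slightly, so borderline points may be flagged by one tool and not by another.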

## Dependencies

### Core Dependencies
- pandas >= 1.3.0
- numpy >= 1.20.0
- scikit-learn >= 1.0.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- scipy >= 1.7.0
- statsmodels >= 0.12.0
- plotly >= 5.0.0

To install all core dependencies at once from a source checkout, run:
```bash
pip install -r requirements/requirements-core.txt
```

### Testing Dependencies
For running tests, you'll need to install additional packages:
```bash
pip install pytest pytest-cov
```

## Testing

To run the test suite:

```bash
# Install testing dependencies first
pip install pytest pytest-cov

# Run all tests
python -m pytest tests/ -v

# Run tests with coverage report
python -m pytest tests/ --cov=datashadric --cov-report=html --cov-report=term-missing
```

## Examples

### Data Cleaning and Exploration
```python
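import pandas as pd
from datashadric.dataframing import df_check_na_values, df_drop_dupes
from datashadric.plotters import df_histplotter

# load your data (placeholder path, as in Quick Start)
df = pd.read_csv('your_data.csv')

# check data quality
na_report = df_check_na_values(df)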
df_clean = df_drop_dupes(df)

# visualize distributions
df_histplotter(df_clean, 'numeric_column', type_plot=0, bins=30)
```

### Statistical Testing
```python
from datashadric.stochastics import df_gaussian_checks, df_calc_conf_interval

# test normality (df is a pandas DataFrame, loaded as in Quick Start)
normality_test = df_gaussian_checks(df, 'measurement_column')

# calculate confidence intervals
ci = df_calc_conf_interval(df['measurement_column'], confidence=0.95)
```

### Machine Learning Workflow
```python
from datashadric.mlearning import ml_naive_bayes_model, ml_naive_bayes_metrics

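# train model (df is a pandas DataFrame, loaded as in Quick Start)
model, initial_metrics = ml_naive_bayes_model(df, 'target', test_size=0.3)

# detailed evaluation
# NOTE: X_test and y_test are not defined in this snippet; supply your own
# held-out features and labels before running this line
detailed_metrics = ml_naive_bayes_metrics(model, X_test, y_test)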
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support

If you encounter any problems or have questions, please file an issue on the [GitHub repository](https://github.com/diversecellar/datashadric/issues).

## Changelog

### Version: 0.1.0 
### Release Date: 2 October 2025
- Initial release
- Core modules: mlearning, regression, dataframing, stochastics, plotters
- Comprehensive documentation and examples
- Minimal test coverage

### Version: 0.1.1
### Release Date: 3 October 2025
- Supplemental release 
- Additional functions for outlier detection
- Additional functions for plotting (LOWESS meanline plotter)
- Additional functions for data clustering based on k-means

### Version: 0.1.2
### Release Date: 6 October 2025
- Enhanced dataframe utilities
- New functions for index and column name retrieval
- Improved documentation and examples

            
