Name | jan883-codebase |
Version | 0.2.0 |
home_page | None |
Summary | Personal codebase for data science and machine learning projects. Includes data preprocessing, feature engineering, model selection, and model evaluation. |
upload_time | 2025-01-26 21:58:56 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.8 |
license | MIT License
Copyright (c) 2023 Jan du Plessis
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. |
keywords | eda, data science, machine learning, modeling, utilities |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# **jan883-codebase** - Data Science Collection

This repository contains a collection of Python functions designed to streamline Exploratory Data Analysis (EDA) and model selection processes. The toolkit is divided into three main sections: **EDA Level 1**, **EDA Level 2**, and **Model Selection**, plus a selection of other tools like **NotionHelper**, a helper class for the official Notion API via `notion-client`. Each section provides a set of utility functions to assist in data transformation, analysis, and model evaluation.
This toolkit is ideal for data scientists and analysts looking to accelerate their EDA and model selection workflows. Whether you're working on classification, regression, or clustering tasks, this repository provides the tools to make your process more efficient and insightful.
---
# Data Pre-processing
```python
from jan883_codebase.data_preprocessing.eda import *
# Run these functions for a printout of the included functions in a Jupyter Notebook.
eda0()
eda1()
eda2()
```
### **EDA Level 0 — Pure Understanding of Original Data**
- `inspect_df(df)` Run df.head(), df.describe(), df.isna().sum() & df.duplicated().sum() on your dataframe.
- `column_summary(df)` Create a dataframe with column info, dtype, value_counts, etc.
- `column_summary_plus(df)` Create a dataframe with column info, dtype, value_counts, plus df.describe() info.
- `univariate_analysis(df)` Perform Univariate Analysis of numeric columns.
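A minimal sketch of how these Level 0 helpers might be chained, assuming the signatures listed above; the toy dataframe is purely illustrative:

```python
import pandas as pd
from jan883_codebase.data_preprocessing.eda import *

# Illustrative toy data; any pandas DataFrame works here.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "income": [40000, 52000, 61000, 58000, 75000],
})

inspect_df(df)                           # head, describe, NaN and duplicate counts
summary = column_summary(df)             # one metadata row per column
summary_plus = column_summary_plus(df)   # adds df.describe() info
univariate_analysis(df)                  # distribution plots for numeric columns
```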
### **EDA Level 1 — Transformation of Original Data**
- `update_column_names(df)` Update column names, replacing " " with "_".
- `label_encode_column(df, col_name)` Label encode a df column, returning a df with the new column (original col dropped).
- `one_hot_encode_column(df, col_name)` One-hot encode a df column, returning a df with the new column (original col dropped).
- `train_no_outliers = remove_outliers_zscore(train, threshold=3)` Remove outliers using Z-score.
- `df_imputed = impute_missing_values(df, strategy='median')` Impute missing values in a df.
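A hedged sketch of a typical Level 1 pass, assuming the signatures above; the column names are hypothetical:

```python
import pandas as pd
from jan883_codebase.data_preprocessing.eda import *

# Hypothetical raw data with a text category, a missing value, and an outlier.
df = pd.DataFrame({
    "Annual Income": [40000, 52000, 61000, None, 990000],
    "gender": ["F", "M", "M", "F", "M"],
    "city": ["London", "Leeds", "London", "York", "Leeds"],
})

df = update_column_names(df)            # "Annual Income" -> "Annual_Income"
df = label_encode_column(df, "gender")  # original column is dropped
df = one_hot_encode_column(df, "city")  # original column is dropped
df = impute_missing_values(df, strategy="median")
train_no_outliers = remove_outliers_zscore(df, threshold=3)
```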
### **EDA Level 2 — Understanding of Transformed Data**
- `correlation_analysis(df, width=16, height=12)` Correlation Heatmap & Maximum pairwise correlation.
- `newDF, woeDF = iv_woe(df, target, bins=10, show_woe=False)` Returns newDF, woeDF. Information Value (IV) quantifies the predictive power of a feature; we are looking for an IV of 0.1 to 0.5. A feature with an IV of 0 is often that way due to an imbalance in the data, resulting in a lack of binning. Keep this in mind during further analysis.
- `individual_t_test_classification(df, y_column, y_value_1, y_value_2, list_of_features, alpha_val=0.05, sample_frac=1.0, random_state=None)` Statistical test of individual features - Classification problem.
- `individual_t_test_regression(df, y_column, list_of_features, alpha_val=0.05, sample_frac=1.0, random_state=None)` Statistical test of individual features - Regression problem.
- `create_qq_plots(df, reference_col)` Create QQ plots of the features in a dataframe.
- `volcano_plot(df, reference_col)` Create Volcano Plot with P-values.
- `X, y = define_X_y(df, target)` Define X and y.
- `X_train, X_test, y_train, y_test = train_test_split_custom(X, y, test_size=0.2, random_state=42)` Split train, test.
- `X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=42)` Split train, val, test.
- `X_train_res, y_train_res = oversample_SMOTE(X_train, y_train, sampling_strategy="auto", k_neighbors=5, random_state=42)` Oversample the minority class.
- `scaled_X = scale_df(X, scaler='standard')` Scales X only; does not scale X_test or X_val.
- `scaled_X_train, scaled_X_test = scale_X_train_X_test(X_train, X_test, scaler="standard", save_scaler=False)` Standard, MinMax and Robust Scaler. X_train uses fit_transform, X_test uses transform.
- `sample_df(df, n_samples)` Take a sample of the full df.
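Taken together, a typical Level 2 flow might look like the sketch below, assuming the signatures above and continuing from a cleaned dataframe `df` with an illustrative binary `churn` target:

```python
from jan883_codebase.data_preprocessing.eda import *

# df: a cleaned dataframe with a binary "churn" column (see the Level 1 sketch).
correlation_analysis(df, width=16, height=12)
newDF, woeDF = iv_woe(df, target="churn", bins=10, show_woe=False)

X, y = define_X_y(df, target="churn")
X_train, X_test, y_train, y_test = train_test_split_custom(X, y, test_size=0.2, random_state=42)

# Oversample and fit the scaler on the training data only; the test set is only transformed.
X_train_res, y_train_res = oversample_SMOTE(X_train, y_train, random_state=42)
scaled_X_train, scaled_X_test = scale_X_train_X_test(X_train_res, X_test, scaler="standard")
```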
---
# Model Selection
```python
from jan883_codebase.data_preprocessing.eda import *
ml0() # Run this function for a printout of included functions in Jupyter Notebook.
```
- `feature_importance_plot(model, X, y)` Plot Feature Importance using a single model.
- `evaluate_classification_model(model, X, y, cv=5)` Plot performance metrics of a single classification model.
- `evaluate_regression_model(model, X, y)` Plot performance metrics of a single regression model.
- `test_regression_models(X, y, test_size=0.2, random_state=None, scale_data=False)` Test Regression models.
- `test_classification_models(X, y, test_size=0.2, random_state=None, scale_data=False)` Test Classification models.
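A short sketch of how these evaluators might be combined, assuming the signatures above and the `X`, `y` produced by `define_X_y`; the scikit-learn model is an illustrative choice:

```python
from sklearn.ensemble import RandomForestClassifier
from jan883_codebase.data_preprocessing.eda import *

model = RandomForestClassifier(random_state=42)

evaluate_classification_model(model, X, y, cv=5)    # cross-validated metrics for one model
feature_importance_plot(model, X, y)                # which features matter most
test_classification_models(X, y, test_size=0.2, scale_data=True)  # compare many models at once
```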
---
# NotionHelper class
```python
import os
# Set the environment variable
os.environ["NOTION_TOKEN"] = "<your-notion-token>"
from jan883_codebase.notion_api.notionhelper import NotionHelper
nh = NotionHelper() # Instantiate the class
database_id = "<your-database-id>"  # placeholder
nh.get_all_pages_as_dataframe(database_id)
```
A helper class to interact with the **Notion API**.
### Methods
- `get_database(database_id)`: Fetches the schema of a Notion database given its database_id.
- `notion_search_db(database_id, query="")`: Searches for pages in a Notion database that contain the specified query in their title.
- `notion_get_page(page_id)`: Returns the JSON of the page properties and an array of blocks on a Notion page given its page_id.
- `create_database(parent_page_id, database_title, properties)`: Creates a new database in Notion under the specified parent page with the given title and properties.
- `new_page_to_db(database_id, page_properties)`: Adds a new page to a Notion database with the specified properties.
- `append_page_body(page_id, blocks)`: Appends blocks of text to the body of a Notion page.
- `get_all_page_ids(database_id)`: Returns the IDs of all pages in a given Notion database.
- `get_all_pages_as_json(database_id, limit=None)`: Returns a list of JSON objects representing all pages in the given database, with all properties.
- `get_all_pages_as_dataframe(database_id, limit=None)`: Returns a Pandas DataFrame representing all pages in the given database, with selected properties.
---
# More functions
NER, RAG, Sentiment Analysis, Telegram API, Web Scraping Tools
### [Example Deepnote Notebook](https://deepnote.com/workspace/Jans-Team-dc9449a1-8aab-44a0-a4c2-577280c6908e/project/jan883codebase-Walk-through-2da6b2b0-32cb-4db2-8e49-1f3514edd142/notebook/Notebook-1-ae733a1529594694a9bd21702502ba10?utm_source=share-modal&utm_medium=product-shared-content&utm_campaign=notebook&utm_content=2da6b2b0-32cb-4db2-8e49-1f3514edd142)
Raw data
```json
{
"_id": null,
"home_page": null,
"name": "jan883-codebase",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "EDA, data science, machine learning, modeling, utilities",
"author": null,
"author_email": "Jan du Plessis <drjanduplessis@icloud.com>",
"download_url": "https://files.pythonhosted.org/packages/c1/69/c6af7290e506f3800635fb0b2ff05a370239f3dc4e6c9e25b2459a43abe8/jan883_codebase-0.2.0.tar.gz",
"platform": null,
"description": "# **jan883-codebase** - Data Science Collection\n\n\nThis repository contains a collection of Python functions designed to streamline Exploratory Data Analysis (EDA) and model selection processes. The toolkit is divided into three main sections: **EDA Level 1**, **EDA Level 2**, **Model Selection** and a selection of other tools lick **NotionHelper**, helper class for the Official Notio API via `notion-client` each providing a set of utility functions to assist in data transformation, analysis, and model evaluation.\n\nThis toolkit is ideal for data scientists and analysts looking to accelerate their EDA and model selection workflows. Whether you're working on classification, regression, or clustering tasks, this repository provides the tools to make your process more efficient and insightful.\n\n---\n# Data Pre-processing\n```python\nfrom jan883_codebase.data_preprocessing.eda import *\n\n# Run this function for a printout of included functions in Jupyter Notebook.\neda0()\neda1()\neda2()\n```\n### **EDA Level 0 - Pure Understanding of Original Data**\n\n- `inspect_df(df) Run df.head()`, df.describe(), df.isna().sum() & df.duplicated().sum() on your dataframe.\n- `column_summary(df)` Create a dataframe with column info, dtype, value_counts, etc.\n- `column_summary_plus(df)` Create a dataframe with column info, dtype, value_counts, plus df.decsribe() info.\n- `univariate_analysis(df)` Perform Univariate Analysis of numeric columns.\n\n### **EDA Level 1 \u2014 Transformation of Original Data**\n\n- `update_column_names(df)` Update Column names, replace \" \" with \"_\".\n- `label_encode_column(df, col_name)` Label encode a df column returing a df with the new column (original col dropped).\n- `one_hot_encode_column(df, col_name)` One Hot Encode a df column returing a df with the new column (original col dropped).\n- `train_no_outliers = remove_outliers_zscore(train, threshold=3)` Remove outliers using Z score.\n- `df_imputed = impute_missing_values(df, strategy='median')` Impute missing values in DF\n\n### **EDA Level 2 \u2014 Understanding of Transformed Data**\n- `correlation_analysis(df, width=16, height=12)` Correlation Heatmap & Maximum pairwise correlation.\n- `newDF, woeDF = iv_woe(df, target, bins=10, show_woe=False)` Returns newDF, woeDF. IV / WOE Values - Information Value (IV) quantifies the prediction power of a feature. We are looking for IV of 0.1 to 0.5. For those with IV of 0, there is a high chance it is the way it is due to imbalance of data, resulting in lack of binning. 
Keep this in mind during further analysis.\n- `individual_t_test_classification(df, y_column, y_value_1, y_value_2, list_of_features, alpha_val=0.05, sample_frac=1.0, random_state=None)` Statistical test of individual features - Classification problem.\n- `individual_t_test_regression(df, y_column, list_of_features, alpha_val=0.05, sample_frac=1.0, random_state=None)` Statistical test of individual features - Regressions problem.\n- `create_qq_plots(df, reference_col)` Create QQ plots of the features in a dataframe.\n- `volcano_plot(df, reference_col)` Create Volcano Plot with P-values.\n- `X, y = define_X_y(df, target)` Define X and y..\n- `X_train, X_test, y_train, y_test = train_test_split_custom(X, y, test_size=0.2, random_state=42)` Split train, test.\n- `X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=42)` Split train, val, test.\n- `X_train_res, y_train_res = oversample_SMOTE(X_train, y_train, sampling_strategy=\"auto\", k_neighbors=5, random_state=42) `Oversample minority class.\n- `scaled_X = scale_df(X, scaler='standard')` only scales X, does not scale X_test or X_val.\n- `scaled_X_train, scaled_X_test = scale_X_train_X_test(X_train, X_test, scaler=\"standard\", save_scaler=False)` Standard, MinMax and Robust Scaler. X_train uses fit_transform, X_test uses transform.\n- `sample_df(df, n_samples)` Take a sample of the full df.\n---\n# Model Selection\n```python\nfrom jan883_codebase.data_preprocessing.eda import *\n\nml0() # Run this function for a printout of included functions in Jupyter Notebook.\n```\n\n- `feature_importance_plot(model, X, y)` Plot Feature Importance using a single model.\n- `evaluate_classification_model(model, X, y, cv=5)` Plot peformance metrics of single classification model.\n- `evaluate_regression_model(model, X, y)` Plot peformance metrics of single regression model.\n- `test_regression_models(X, y, test_size=0.2, random_state=None, scale_data=False)` Test Regression models.\n- `test_classification_models(X, y, test_size=0.2, random_state=None, scale_data=False)` Test Classification models.\n---\n# NotionHelper class\n\n```python\nimport os\n# Set the environment variable\nos.environ[\"NOTION_TOKEN\"] = \"<your-notion-token>\"\n\nfrom jan883_codebase.notion_api.notionhelper import NotionHelper\nnh = NotionHelper() # Instantiate the class\n\nnh.get_all_pages_as_dataframe(database_id)\n```\nA helper class to interact with the **Notion API.**\n\n### Methods\n\n- `get_database(database_id)`: Fetches the schema of a Notion database given its database_id.\n- `notion_search_db(database_id, query=\"\")`: Searches for pages in a Notion database that contain the specified query in their title.\n- `notion_get_page(page_id)`: Returns the JSON of the page properties and an array of blocks on a Notion page given its page_id.\n- `create_database(parent_page_id, database_title, properties)`: Creates a new database in Notion under the specified parent page with the given title and properties.\n- `new_page_to_db(database_id, page_properties)`: Adds a new page to a Notion database with the specified properties.\n- `append_page_body(page_id, blocks)`: Appends blocks of text to the body of a Notion page.\n- `get_all_page_ids(database_id)`: Returns the IDs of all pages in a given Notion database.\n- `get_all_pages_as_json(database_id, limit=None)`: Returns a list of JSON objects representing all pages in the given database, with all properties.\n- `get_all_pages_as_dataframe(database_id, limit=None)`: 
Returns a Pandas DataFrame representing all pages in the given database, with selected properties.\n---\n# More function:\nNER, RAG, Semitment Analysis, Telegram API, Web Scrapping Tools\n\n\n### [Example Deepnote Notebook](https://deepnote.com/workspace/Jans-Team-dc9449a1-8aab-44a0-a4c2-577280c6908e/project/jan883codebase-Walk-through-2da6b2b0-32cb-4db2-8e49-1f3514edd142/notebook/Notebook-1-ae733a1529594694a9bd21702502ba10?utm_source=share-modal&utm_medium=product-shared-content&utm_campaign=notebook&utm_content=2da6b2b0-32cb-4db2-8e49-1f3514edd142)\n",
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2023 Jan du Plessis\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.",
"summary": "Personal codebase for data science and machine learning projects. Includes data preprocessing, feature engineering, model selection, and model evaluation.",
"version": "0.2.0",
"project_urls": {
"Changelog": "https://github.com/janduplessis883/jan883-codebase/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/janduplessis883/jan883-codebase#readme",
"Homepage": "https://github.com/janduplessis883/jan883-codebase",
"Issues": "https://github.com/janduplessis883/jan883-codebase/issues",
"Repository": "https://github.com/janduplessis883/jan883-codebase"
},
"split_keywords": [
"eda",
" data science",
" machine learning",
" modeling",
" utilities"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "7dbe10baa8f19b75f449b9fdafd8e2e9bc397b17401d48cfc8e3e4c203943c53",
"md5": "e19d2bce39991f5e5bd850624b0588d3",
"sha256": "0e564e20831cec9048e8861e2e9cfeae8bb9dbf6f34dd8ed4036fb39a6bab52e"
},
"downloads": -1,
"filename": "jan883_codebase-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e19d2bce39991f5e5bd850624b0588d3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 1388509,
"upload_time": "2025-01-26T21:58:53",
"upload_time_iso_8601": "2025-01-26T21:58:53.519678Z",
"url": "https://files.pythonhosted.org/packages/7d/be/10baa8f19b75f449b9fdafd8e2e9bc397b17401d48cfc8e3e4c203943c53/jan883_codebase-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "c169c6af7290e506f3800635fb0b2ff05a370239f3dc4e6c9e25b2459a43abe8",
"md5": "d9cc1b2dfebf63e391902bf53e572334",
"sha256": "b38c77e596ebfadd95cf085a258beb4eb9cb839f4f973b283165e3727c62454e"
},
"downloads": -1,
"filename": "jan883_codebase-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "d9cc1b2dfebf63e391902bf53e572334",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 2699950,
"upload_time": "2025-01-26T21:58:56",
"upload_time_iso_8601": "2025-01-26T21:58:56.364237Z",
"url": "https://files.pythonhosted.org/packages/c1/69/c6af7290e506f3800635fb0b2ff05a370239f3dc4e6c9e25b2459a43abe8/jan883_codebase-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-26 21:58:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "janduplessis883",
"github_project": "jan883-codebase",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "jan883-codebase"
}
```