# Future Sales Prediction 2024

Future Sales Prediction 2024 is a Python package for building robust time-series sales prediction models. It integrates preprocessing, feature engineering, hyperparameter optimization, and model training workflows, leveraging DVC for data versioning and Google Cloud Storage for seamless data access.

#### Project Status: Completed
## Features
* Data Handling: Tools to preprocess raw datasets and optimize memory usage.
* Feature Engineering: Generate and refine features for predictive modeling.
* Hyperparameter Tuning: Automate parameter optimization with Hyperopt.
* Model Training: Time-series cross-validation and training for regression models.
* Validation: Validate data integrity to ensure quality and consistency.
* Data Versioning: DVC integration for easy data retrieval from Google Cloud.
## Installation

Install the package with pip:

`pip install future_sales_prediction_2024`
## Usage Guide

* Step 1: Authenticate with Google Cloud

Before fetching data, authenticate with Google Cloud in one of two ways:

  * Option A: Use the Google Cloud SDK: `gcloud auth application-default login`
  * Option B: Use a service account key file: `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json`
* Step 2: Pull the Data

Use the pull_data script to clone the repository, fetch the DVC-tracked data, and save it to the current directory:

  * Option A (locally): `pull_data --repo https://github.com/YPolina/Trainee.git --branch DS-4.1`
  * Option B (hosted services such as Google Colab or Kaggle): `!pull_data --repo https://github.com/YPolina/Trainee.git --branch DS-4.1`

This will:

  * Clone the repository.
  * Pull the datasets tracked via DVC from Google Cloud Storage.
  * Save the datasets in a folder called data_pulled in the current working directory.
* Step 3: Explore the Codebase and Build Models

After fetching the data, you can explore and use the following modules.

---

## Modules and Functions
### Data Handling

File: `future_sales_prediction_2024/data_handling.py`

`prepare_full_data(items, categories, train, shops, test) -> pd.DataFrame`
Merges the raw datasets into a single comprehensive dataset (full_data.csv, available after `dvc pull`).

`reduce_mem_usage(df) -> pd.DataFrame`
Optimizes memory usage by converting data types where applicable.
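A minimal usage sketch. The file names below follow the Kaggle "Predict Future Sales" layout and are assumptions; adjust the paths to whatever lands in data_pulled/:

```python
import pandas as pd

from future_sales_prediction_2024.data_handling import (
    prepare_full_data,
    reduce_mem_usage,
)

# File names are assumptions based on the Kaggle competition layout.
items = pd.read_csv("data_pulled/items.csv")
categories = pd.read_csv("data_pulled/item_categories.csv")
train = pd.read_csv("data_pulled/sales_train.csv")
shops = pd.read_csv("data_pulled/shops.csv")
test = pd.read_csv("data_pulled/test.csv")

# Merge the raw tables into one frame, then downcast dtypes to cut memory.
full_data = prepare_full_data(items, categories, train, shops, test)
full_data = reduce_mem_usage(full_data)
```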
### Feature Engineering

File: `future_sales_prediction_2024/feature_extraction.py`

Class: `FeatureExtractor`
Extracts features for predictive modeling.

Initialization parameters:
* full_data: Full dataset containing all columns.
* train: Training data for aggregating revenue-based features.

Output: a processed dataset (full_featured_data.csv), stored in preprocessed_data after `dvc pull`.
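A sketch of how the extractor might be wired up. The constructor arguments match the docs above, but the name of the method that runs the pipeline is not documented here, so `transform()` below is an assumption:

```python
from future_sales_prediction_2024.feature_extraction import FeatureExtractor

# full_data and train come from the data-handling sketch above.
extractor = FeatureExtractor(full_data=full_data, train=train)
full_featured_data = extractor.transform()  # assumed entry-point name
```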
Class: `FeatureImportanceLayer`
Analyzes feature importance using baseline and tuned models.

Initialization parameters:
* X: Feature matrix.
* y: Target vector.
* output_dir: Directory for saving feature importance plots.

Key methods:
* fit_baseline_model(): Trains a RandomForestRegressor as the baseline model for feature importance.
* plot_baseline_importance(): Visualizes the baseline model's feature importance.
* fit_final_model(): Trains a final model with optimized hyperparameters; model-agnostic. Parameters:
  - Model (XGBRegressor by default)
  - params: Model hyperparameters (optional)
  - use_shap (bool): Use SHAP values if the model doesn't provide native feature importances
* plot_final_model_importance(): Visualizes feature importance for the final model.

plot_baseline_importance and plot_final_model_importance save their plots to feature_importance_results/baseline_importance.png and feature_importance_results/final_model_importance.png, respectively.
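An illustrative sketch of the importance workflow. The keyword names passed to fit_final_model and the target column name are assumptions:

```python
from xgboost import XGBRegressor

from future_sales_prediction_2024.feature_extraction import FeatureImportanceLayer

# "item_cnt_month" as the target column is an assumption (Kaggle convention).
X = full_featured_data.drop(columns=["item_cnt_month"])
y = full_featured_data["item_cnt_month"]

layer = FeatureImportanceLayer(X=X, y=y, output_dir="feature_importance_results")
layer.fit_baseline_model()           # RandomForestRegressor baseline
layer.plot_baseline_importance()     # -> baseline_importance.png

# Keyword names below are assumptions; the docs list Model, params, use_shap.
layer.fit_final_model(model=XGBRegressor, params={"max_depth": 8}, use_shap=False)
layer.plot_final_model_importance()  # -> final_model_importance.png
```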
### Hyperparameter Tuning

File: `future_sales_prediction_2024/hyperparameters.py`

`hyperparameter_tuning(X, y, model_class, param_space, eval_fn, max_evals=50) -> dict`
Performs hyperparameter optimization using Hyperopt for models like XGBRegressor or RandomForestRegressor.

Parameters:
* X: Feature matrix.
* y: Target vector.
* model_class: Model class (e.g., XGBRegressor).
* param_space: Search space for hyperparameters.
* eval_fn: Evaluation function for the loss metric.
* max_evals: Number of evaluations.

Returns: the best hyperparameters as a dictionary.
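A hedged example of a call to hyperparameter_tuning. The Hyperopt search space is standard, but the exact contract of eval_fn (assumed here to be a (y_true, y_pred) -> loss function) is an assumption:

```python
import numpy as np
from hyperopt import hp
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

from future_sales_prediction_2024.hyperparameters import hyperparameter_tuning

# A typical Hyperopt search space for XGBRegressor.
param_space = {
    "max_depth": hp.choice("max_depth", [4, 6, 8, 10]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "n_estimators": hp.choice("n_estimators", [100, 300, 500]),
}

def eval_fn(y_true, y_pred):
    # RMSE as the loss Hyperopt minimizes (assumed contract).
    return np.sqrt(mean_squared_error(y_true, y_pred))

best_params = hyperparameter_tuning(
    X, y,  # from the feature-importance sketch above
    model_class=XGBRegressor,
    param_space=param_space,
    eval_fn=eval_fn,
    max_evals=50,
)
print(best_params)
```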
### Model Training

File: `future_sales_prediction_2024/model_training.py`

`tss_cv(df, n_splits, model, true_pred_plot=True)`
Performs time-series cross-validation, calculates the RMSE, and returns the mean RMSE across all splits.
* df: DataFrame with features and the target variable.
* n_splits: Number of cross-validation splits.
* model: Regression model (e.g., XGBRegressor).

`data_split(df) -> Tuple[np.ndarray, ...]`
Splits the data into training, validation, and test sets.

`train_predict(X, y, X_test, model_, model_params=None) -> np.ndarray`
Trains the model on the provided features and predicts outcomes for X_test.
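A sketch of the training helpers, reusing names from the earlier sketches. Whether model_ expects a class or an instance, and the exact arrays data_split returns, are assumptions:

```python
from xgboost import XGBRegressor

from future_sales_prediction_2024.model_training import tss_cv, data_split, train_predict

# Time-series CV over the featured dataset; reports mean RMSE across splits.
tss_cv(full_featured_data, n_splits=5, model=XGBRegressor())

# Train/validation/test split; the count and order of the returned arrays
# follow the package's internal convention (not documented above).
splits = data_split(full_featured_data)

# X, y from earlier; X_test is assumed to be the feature matrix for the
# test period, prepared upstream; best_params comes from hyperparameter_tuning.
preds = train_predict(X, y, X_test, model_=XGBRegressor, model_params=best_params)
```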
### Validation

File: `future_sales_prediction_2024/validation.py`

Class: `Validator`
Ensures data quality by checking types, ranges, duplicates, and missing values.

Initialization parameters:
* column_types: Expected column data types (e.g., {'shop_id': 'int64'}).
* value_ranges: Allowed numeric range for each column (e.g., {'month': (1, 12)}).
* check_duplicates: Whether to check for duplicate rows.
* check_missing: Whether to check for missing values.

Method: `transform(X)`
Validates a DataFrame and returns a confirmation message if all checks pass.
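A minimal sketch of the validator; the column names and ranges simply mirror the examples above:

```python
from future_sales_prediction_2024.validation import Validator

validator = Validator(
    column_types={"shop_id": "int64", "item_id": "int64"},
    value_ranges={"month": (1, 12)},
    check_duplicates=True,
    check_missing=True,
)

# Returns a confirmation message when every check passes.
validator.transform(full_featured_data)
```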
## Conclusion

This package is a modular and flexible solution for streamlining data science workflows. It provides data scientists and ML engineers with reusable tools so they can focus on solving domain-specific problems.