# Future Sales Prediction 2024

Future Sales Prediction 2024 is a Python package for building robust time-series sales prediction models. It integrates preprocessing, feature engineering, hyperparameter optimization, and model training workflows, leveraging DVC for data versioning and Google Cloud Storage for seamless data access.

#### Project Status: Completed
## Features
* Data Handling: Tools to preprocess raw datasets and optimize memory usage.
* Feature Engineering: Generate and refine features for predictive modeling.
* Hyperparameter Tuning: Automate parameter optimization with Hyperopt.
* Model Training: Time-series cross-validation and training for regression models.
* Validation: Validate data integrity to ensure quality and consistency.
* Data Versioning: DVC integration for easy data retrieval from Google Cloud.
## Installation

Install the package with pip:

`pip install future_sales_prediction_2024`
## Usage Guide

* Step 1: Authenticate with Google Cloud

Before fetching data, authenticate with Google Cloud in one of two ways:

  * Option A: Use the Google Cloud SDK: `gcloud auth application-default login`
  * Option B: Use a service account key file: `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json`
* Step 2: Pull the Data

Use the pull_data script to clone the repository, fetch the DVC-tracked data, and save it to the current directory:

  * Option A (locally): `pull_data --repo https://github.com/YPolina/Trainee.git --branch DS-4.1`
  * Option B (hosted services such as Google Colab or Kaggle): `!pull_data --repo https://github.com/YPolina/Trainee.git --branch DS-4.1`

This will:

  * Clone the repository.
  * Pull the datasets tracked via DVC from Google Cloud Storage.
  * Save the datasets in a folder called data_pulled in the current working directory.
* Step 3: Explore the Codebase and Build Models

After fetching the data, you can explore and use the following modules.

---

## Modules and Functions
### Data Handling

File: `future_sales_prediction_2024/data_handling.py`

`prepare_full_data(items, categories, train, shops, test) -> pd.DataFrame`
Merges the raw datasets into a single comprehensive dataset (full_data.csv, available after `dvc pull`).

`reduce_mem_usage(df) -> pd.DataFrame`
Optimizes memory usage by converting data types where applicable.
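A minimal usage sketch. The file names below follow the Kaggle "Predict Future Sales" layout and are assumptions; adjust the paths to whatever lands in data_pulled/:

```python
import pandas as pd

from future_sales_prediction_2024.data_handling import (
    prepare_full_data,
    reduce_mem_usage,
)

# File names are assumptions based on the Kaggle competition layout.
items = pd.read_csv("data_pulled/items.csv")
categories = pd.read_csv("data_pulled/item_categories.csv")
train = pd.read_csv("data_pulled/sales_train.csv")
shops = pd.read_csv("data_pulled/shops.csv")
test = pd.read_csv("data_pulled/test.csv")

# Merge the raw tables into one frame, then downcast dtypes to cut memory.
full_data = prepare_full_data(items, categories, train, shops, test)
full_data = reduce_mem_usage(full_data)
```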
### Feature Engineering

File: `future_sales_prediction_2024/feature_extraction.py`

Class: `FeatureExtractor`
Extracts features for predictive modeling.

Initialization parameters:
* full_data: Full dataset containing all columns.
* train: Training data for aggregating revenue-based features.

Output: a processed dataset (full_featured_data.csv), stored in preprocessed_data after `dvc pull`.
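A sketch of how the extractor might be wired up. The constructor arguments match the docs above, but the name of the method that runs the pipeline is not documented here, so `transform()` below is an assumption:

```python
from future_sales_prediction_2024.feature_extraction import FeatureExtractor

# full_data and train come from the data-handling sketch above.
extractor = FeatureExtractor(full_data=full_data, train=train)
full_featured_data = extractor.transform()  # assumed entry-point name
```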
Class: `FeatureImportanceLayer`
Analyzes feature importance using baseline and tuned models.

Initialization parameters:
* X: Feature matrix.
* y: Target vector.
* output_dir: Directory for saving feature importance plots.

Key methods:
* fit_baseline_model(): Trains a RandomForestRegressor as the baseline model for feature importance.
* plot_baseline_importance(): Visualizes the baseline model's feature importance.
* fit_final_model(): Trains a final model with optimized hyperparameters; model-agnostic. Parameters:
  - Model (XGBRegressor by default)
  - params: Model hyperparameters (optional)
  - use_shap (bool): Use SHAP values if the model doesn't provide native feature importances
* plot_final_model_importance(): Visualizes feature importance for the final model.

plot_baseline_importance and plot_final_model_importance save their plots to feature_importance_results/baseline_importance.png and feature_importance_results/final_model_importance.png, respectively.
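An illustrative sketch of the importance workflow. The keyword names passed to fit_final_model and the target column name are assumptions:

```python
from xgboost import XGBRegressor

from future_sales_prediction_2024.feature_extraction import FeatureImportanceLayer

# "item_cnt_month" as the target column is an assumption (Kaggle convention).
X = full_featured_data.drop(columns=["item_cnt_month"])
y = full_featured_data["item_cnt_month"]

layer = FeatureImportanceLayer(X=X, y=y, output_dir="feature_importance_results")
layer.fit_baseline_model()           # RandomForestRegressor baseline
layer.plot_baseline_importance()     # -> baseline_importance.png

# Keyword names below are assumptions; the docs list Model, params, use_shap.
layer.fit_final_model(model=XGBRegressor, params={"max_depth": 8}, use_shap=False)
layer.plot_final_model_importance()  # -> final_model_importance.png
```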
### Hyperparameter Tuning

File: `future_sales_prediction_2024/hyperparameters.py`

`hyperparameter_tuning(X, y, model_class, param_space, eval_fn, max_evals=50) -> dict`
Performs hyperparameter optimization using Hyperopt for models like XGBRegressor or RandomForestRegressor.

Parameters:
* X: Feature matrix.
* y: Target vector.
* model_class: Model class (e.g., XGBRegressor).
* param_space: Search space for hyperparameters.
* eval_fn: Evaluation function for the loss metric.
* max_evals: Number of evaluations.

Returns: the best hyperparameters as a dictionary.
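A hedged example of a call to hyperparameter_tuning. The Hyperopt search space is standard, but the exact contract of eval_fn (assumed here to be a (y_true, y_pred) -> loss function) is an assumption:

```python
import numpy as np
from hyperopt import hp
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

from future_sales_prediction_2024.hyperparameters import hyperparameter_tuning

# A typical Hyperopt search space for XGBRegressor.
param_space = {
    "max_depth": hp.choice("max_depth", [4, 6, 8, 10]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "n_estimators": hp.choice("n_estimators", [100, 300, 500]),
}

def eval_fn(y_true, y_pred):
    # RMSE as the loss Hyperopt minimizes (assumed contract).
    return np.sqrt(mean_squared_error(y_true, y_pred))

best_params = hyperparameter_tuning(
    X, y,  # from the feature-importance sketch above
    model_class=XGBRegressor,
    param_space=param_space,
    eval_fn=eval_fn,
    max_evals=50,
)
print(best_params)
```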
### Model Training

File: `future_sales_prediction_2024/model_training.py`

`tss_cv(df, n_splits, model, true_pred_plot=True)`
Performs time-series cross-validation, calculates the RMSE, and returns the mean RMSE across all splits.
* df: DataFrame with features and the target variable.
* n_splits: Number of cross-validation splits.
* model: Regression model (e.g., XGBRegressor).

`data_split(df) -> Tuple[np.ndarray, ...]`
Splits the data into training, validation, and test sets.

`train_predict(X, y, X_test, model_, model_params=None) -> np.ndarray`
Trains the model on the provided features and predicts outcomes for X_test.
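A sketch of the training helpers, reusing names from the earlier sketches. Whether model_ expects a class or an instance, and the exact arrays data_split returns, are assumptions:

```python
from xgboost import XGBRegressor

from future_sales_prediction_2024.model_training import tss_cv, data_split, train_predict

# Time-series CV over the featured dataset; reports mean RMSE across splits.
tss_cv(full_featured_data, n_splits=5, model=XGBRegressor())

# Train/validation/test split; the count and order of the returned arrays
# follow the package's internal convention (not documented above).
splits = data_split(full_featured_data)

# X, y from earlier; X_test is assumed to be the feature matrix for the
# test period, prepared upstream; best_params comes from hyperparameter_tuning.
preds = train_predict(X, y, X_test, model_=XGBRegressor, model_params=best_params)
```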
### Validation

File: `future_sales_prediction_2024/validation.py`

Class: `Validator`
Ensures data quality by checking types, ranges, duplicates, and missing values.

Initialization parameters:
* column_types: Expected column data types (e.g., {'shop_id': 'int64'}).
* value_ranges: Allowed numeric range for each column (e.g., {'month': (1, 12)}).
* check_duplicates: Whether to check for duplicate rows.
* check_missing: Whether to check for missing values.

Method: `transform(X)`
Validates a DataFrame and returns a confirmation message if all checks pass.
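A minimal sketch of the validator; the column names and ranges simply mirror the examples above:

```python
from future_sales_prediction_2024.validation import Validator

validator = Validator(
    column_types={"shop_id": "int64", "item_id": "int64"},
    value_ranges={"month": (1, 12)},
    check_duplicates=True,
    check_missing=True,
)

# Returns a confirmation message when every check passes.
validator.transform(full_featured_data)
```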
## Conclusion

This package is a modular and flexible solution for streamlining data science workflows. It provides data scientists and ML engineers with reusable tools so they can focus on solving domain-specific problems.