# 🧪 VJModels
A collection of my experimental machine learning models. These models are part of my personal exploration in the field, so they might not be fully refined, but they contain some interesting ideas. Feel free to check them out! You can also install the package via [pip](https://pypi.org/project/VJModels/) and incorporate the models into your own projects.
```bash
pip install VJModels
```
**Example usage:**
```python
from VJModels.Forests import IncrementalForestClassifier
# X_train, y_train, X_test should be your datasets
inc_forest = IncrementalForestClassifier()
inc_forest.fit(X_train, y_train)
y_test_pred = inc_forest.predict(X_test)
```
# Summary
1. Forests
   - [1.1 WSagging](#wsagging)
   - [1.2 Incremental](#incremental)
2. Linear Models
   - [2.1 Advanced Linear Regression](#advanced-linear-regression)
# Forests
## WSagging
**WSagging** is a term I coined, standing for **Weighted Score Averaging**. The idea is quite simple. Suppose you have features `X` and targets `y`. You choose how many times the algorithm will run (the `n_models` parameter in the class constructor). At each iteration, the data is randomly split into a training part (`X_train`, `y_train`) and a validation part (`X_validation`, `y_validation`). A model is trained on the training part and scored on the validation part. Both the trained model and its score are saved.
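The training loop described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: `fit_wsagging`, `validation_fraction`, `seed`, and the stand-in `MajorityClassifier` are hypothetical names chosen for the example, and only `n_models` comes from the description above.

```python
import random


class MajorityClassifier:
    """Stand-in base model (hypothetical): always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

    def score(self, X, y):
        # Accuracy on the given data.
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


def fit_wsagging(X, y, n_models=5, validation_fraction=0.3, seed=0,
                 make_model=MajorityClassifier):
    """Train n_models models, each on a fresh random train/validation split."""
    rng = random.Random(seed)
    n_val = max(1, int(len(X) * validation_fraction))
    models, scores = [], []
    for _ in range(n_models):
        # Draw a new random split at every iteration.
        indices = list(range(len(X)))
        rng.shuffle(indices)
        val_idx, train_idx = indices[:n_val], indices[n_val:]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_val = [X[i] for i in val_idx]
        y_val = [y[i] for i in val_idx]
        # Save both the trained model and its validation score.
        model = make_model().fit(X_train, y_train)
        models.append(model)
        scores.append(model.score(X_val, y_val))
    return models, scores


X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
models, scores = fit_wsagging(X, y, n_models=4)
print(len(models), scores)
```

Any object with `fit`, `predict`, and `score` methods could be passed as `make_model` in this sketch, for example a decision tree.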
Later, during the prediction phase, the predictions of the saved models are combined, with each model weighted according to its validation score, to obtain a prediction for new data. Specifically, the classifier and regressor work as follows:
### Classifier
```python
def predict(self, X):
    n_samples = len(X)
    results = [0] * n_samples

    # One prediction array and one importance weight per tree.
    predictions_list = [tree.predict(X) for tree in self.trees]
    importance_list = [self.get_importance(i) for i in range(len(self.trees))]

    # Signed weighted vote: +importance for class 1, -importance otherwise.
    for i in range(n_samples):
        results[i] = sum(importance if prediction[i] == 1 else -importance
                         for prediction, importance in zip(predictions_list, importance_list))

    return [1 if result > 0 else 0 for result in results]
```
- **`predictions_list`**: This contains the predictions made by all trees in the forest for each sample.
- **`importance_list`**: This holds the importance score for each tree, reflecting how much weight each tree's prediction carries.
- **Weighted Sum Calculation**:
  - For each sample, calculate the weighted sum of predictions from all trees.
  - Add the importance score for positive predictions (`1`) and subtract the importance score for negative predictions (`0`).
- **Final Classification**:
  - If the resulting weighted sum is positive, classify the sample as `1`.
  - If the weighted sum is zero or negative, classify the sample as `0`.
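A small worked example may make the signed vote easier to follow. The function name `weighted_vote` and the importance values are made up for illustration. The logic mirrors the `predict` method shown above, but takes the predictions and importances directly instead of reading them from a fitted forest.

```python
def weighted_vote(predictions_list, importance_list):
    """Combine binary predictions with a signed, importance-weighted vote."""
    n_samples = len(predictions_list[0])
    labels = []
    for i in range(n_samples):
        total = sum(importance if prediction[i] == 1 else -importance
                    for prediction, importance in zip(predictions_list, importance_list))
        labels.append(1 if total > 0 else 0)
    return labels


# Three hypothetical trees predicting three samples.
predictions_list = [
    [1, 0, 1],  # tree 0
    [0, 0, 1],  # tree 1
    [0, 1, 0],  # tree 2
]
importance_list = [0.9, 0.3, 0.4]  # made-up importance scores

print(weighted_vote(predictions_list, importance_list))  # [1, 0, 1]
```

Note that the first sample is classified as `1` even though two of the three trees vote `0`, because tree 0 carries more weight than the other two combined. This is where the weighted vote differs from a plain majority vote.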
### Regressor
```python
def predict(self, X):
    predictions_list = [tree.predict(X) for tree in self.trees]
    importance_list = [self.get_importance(i) for i in range(len(self.trees))]
    total_importance = sum(importance_list)

    # zip(*predictions_list) regroups the per-tree arrays into per-sample tuples.
    results = [
        sum(importance * prediction for importance, prediction in zip(importance_list, preds)) / total_importance
        for preds in zip(*predictions_list)
    ]

    return results
```
- **`predictions_list`**: This contains the predictions made by all trees in the forest for each sample.
- **`importance_list`**: This holds the importance score for each tree, indicating the weight of each tree's predictions.
- **Weighted Average Calculation**:
  - For each sample, compute the weighted average of predictions from all trees.
  - Multiply each prediction by its corresponding tree's importance score.
  - Sum these weighted predictions and divide by the total sum of importance scores to obtain the final result.
- **Final Prediction**:
  - The result is a weighted average of the predictions, where the importance scores determine the contribution of each tree’s prediction to the final outcome.
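The same weighted average can be written in vectorized form. The sketch below assumes NumPy is available and uses made-up predictions and importance scores. It is an equivalent formulation for illustration, not the code the package runs.

```python
import numpy as np

# Rows are trees, columns are samples (hypothetical values).
predictions = np.array([
    [10.0, 20.0],  # tree 0
    [14.0, 20.0],  # tree 1
])
importances = np.array([3.0, 1.0])  # made-up importance scores

# Weighted average over the tree axis: sum(w * p) / sum(w) for each sample.
weighted = np.average(predictions, axis=0, weights=importances)
print(weighted)  # [11. 20.]
```

For the first sample the result is `(3 * 10 + 1 * 14) / 4 = 11`, which sits closer to the prediction of the tree with the higher importance.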
## Incremental
The IncrementalForests algorithm builds upon WSagging by incorporating an incremental approach. In the first iteration, the dataset `X` and targets `y` are split into `train_0` and `validation_0`. A model is trained on `train_0`, scored on `validation_0`, and both the model and score are saved.
In the next iteration, `validation_0` is further split into `train_1` and `validation_1`. The previous training set `train_0` is combined with `train_1`, and a new model is trained on this merged dataset. This model is evaluated on `validation_1`, and the model and score are saved.
This process continues until one of the stopping criteria is met: the validation set becomes too small, the score drops by a predefined margin, the maximum score is achieved, or the number of trees reaches the maximum limit.
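To illustrate how the training set grows while the validation set shrinks, here is a minimal sketch of the splitting schedule only. The function name `incremental_splits` and the parameters `train_fraction`, `min_validation`, `max_models`, and `seed` are assumptions made for this example. It implements just two of the stopping criteria (validation set too small, maximum number of models) and leaves out the score-based ones.

```python
import random


def incremental_splits(n_samples, train_fraction=0.5, min_validation=2,
                       max_models=10, seed=0):
    """Yield (train_indices, validation_indices) for each incremental round."""
    rng = random.Random(seed)
    train = []
    validation = list(range(n_samples))
    for _ in range(max_models):
        rng.shuffle(validation)
        n_new = int(len(validation) * train_fraction)
        new_train, remaining = validation[:n_new], validation[n_new:]
        # Stop once the validation set would become too small.
        if n_new == 0 or len(remaining) < min_validation:
            break
        # Move part of the old validation set into the training set.
        train = train + new_train
        validation = remaining
        yield list(train), list(validation)


sizes = [(len(tr), len(va)) for tr, va in incremental_splits(16)]
print(sizes)  # [(8, 8), (12, 4), (14, 2)]
```

In each round a model would be trained on the returned training indices and scored on the validation indices, which never overlap.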
During the prediction phase, as in WSagging, predictions are weighted by validation scores to obtain the final predictions. The prediction algorithm is similar to WSagging but uses a different importance formula: `(n - i) * scores[i] ** exponent`, where `n` is the total number of models trained and `scores[i]` is the score of the `i`-th model. Because the factor `(n - i)` shrinks as `i` grows, this formula gives more weight to models trained earlier in the process.
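The effect of the importance formula is easiest to see with equal scores. In the sketch below, the function name `incremental_importance`, the zero-based index `i`, and the value `exponent=2` are assumptions for illustration. The formula itself is the one quoted above.

```python
def incremental_importance(i, scores, exponent):
    """Importance of the i-th model (zero-based): (n - i) * scores[i] ** exponent."""
    n = len(scores)
    return (n - i) * scores[i] ** exponent


scores = [0.5, 0.5, 0.5]  # made-up validation scores
weights = [incremental_importance(i, scores, exponent=2) for i in range(len(scores))]
print(weights)  # [0.75, 0.5, 0.25]
```

With identical scores the weights fall linearly with `i`, so the first model counts three times as much as the last one here.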
# Linear Models
## Advanced Linear Regression
This class is designed for building a linear regression model with additional diagnostic checks to ensure model validity. The main steps include transforming categorical variables, fitting the model, checking the normality of residuals, applying transformations if necessary, and testing for heteroscedasticity. The class also provides a detailed summary of the model, including parameters, R² values, and p-values.
### **Key Steps and Functionality**
1. **Step 1: Categorical Variable Transformation**
   - The class begins by transforming categorical variables into dummy variables, which are suitable for regression modeling. If there is only one categorical variable, it is referred to as a "category." Otherwise, multiple variables are called "categories."
2. **Step 2: Model Fitting**
   - After transforming the variables, the class fits a stepwise regression model. This process involves removing predictors that are non-significant or cause multicollinearity issues.
3. **Step 3: Residual Normality Test**
   - The class checks the normality of residuals to validate model assumptions. The type of normality test used depends on the sample size:
     - **Shapiro-Francia test:** Used when the sample size is 30 or more.
     - **Shapiro-Wilk test:** Used when the sample size is less than 30.
   - The p-value from the test is reported, and a conclusion is drawn regarding the normality of the residuals.
4. **Step 4: Box-Cox Transformation (when needed)**
   - If the residuals fail the normality test, a Box-Cox transformation can be applied to the target variable to stabilize variance and bring it closer to a normal distribution. The transformed target variable is then used to refit the model via the stepwise method.
5. **Step 5: Heteroscedasticity Test**
   - The Breusch-Pagan test is performed to check for heteroscedasticity (non-constant variance of residuals). The p-value from the test is reported, and a conclusion is drawn about the presence or absence of heteroscedasticity.
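The diagnostic part of these steps (normality test, Box-Cox, Breusch-Pagan) can be sketched with NumPy and SciPy as shown below. This is an illustration under several assumptions, not the class's implementation. It skips the dummy-variable and stepwise stages, and it uses Shapiro-Wilk for every sample size because SciPy does not ship a Shapiro-Francia test. The 0.05 significance level is assumed, the Breusch-Pagan statistic is computed by hand in its studentized (Koenker) form, and the helper names `ols_residuals` and `breusch_pagan_pvalue` are hypothetical.

```python
import numpy as np
from scipy import stats


def ols_residuals(X, y):
    """Fit ordinary least squares with an intercept and return the residuals."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def breusch_pagan_pvalue(X, residuals):
    """Studentized Breusch-Pagan: LM = n * R^2 of squared residuals regressed on X."""
    squared = residuals ** 2
    aux_resid = ols_residuals(X, squared)
    r_squared = 1.0 - aux_resid.var() / squared.var()
    lm_stat = len(squared) * r_squared
    # Under the null of constant variance, LM follows chi-square with k degrees of freedom.
    return stats.chi2.sf(lm_stat, df=X.shape[1])


rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 2))
# Synthetic, strictly positive target (Box-Cox requires positive values).
y = np.exp(0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.2, size=200))

residuals = ols_residuals(X, y)
_, normality_p = stats.shapiro(residuals)

if normality_p < 0.05:
    # Residuals look non-normal: transform the target and refit.
    y_transformed, lam = stats.boxcox(y)
    residuals = ols_residuals(X, y_transformed)

bp_p = breusch_pagan_pvalue(X, residuals)
print(f"normality p-value: {normality_p:.4f}, Breusch-Pagan p-value: {bp_p:.4f}")
```

A small Breusch-Pagan p-value would suggest heteroscedasticity, while a large one gives no evidence against constant residual variance.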
### **Usage**
To use this class:
1. Initialize the class with your dataset and the name of the target variable.
2. Call the `fit` method to perform all the steps above.
3. Use the `summary` method to get a detailed report of the model.
4. Use the `predict` method to predict the target value for new observations.
```python
from VJModels.LinearModels import AdvancedLinearRegression
model = AdvancedLinearRegression(data, 'target')
model.fit()
print(model.summary())
predictions = model.predict(new_data)
print(predictions)
```
Raw data
{
"_id": null,
"home_page": "https://github.com/Vanderval31bs/VJModels",
"name": "VJModels",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "MachineLearning, Models, Forests",
"author": "Vanderval Borges de Souza Junior",
"author_email": "vander31bs@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/50/24/b0c2be69d88102ffa280c16522aa06d5387a9ba9d0d139e6decc686c75ca/VJModels-5.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "My personal machine learning models",
"version": "5.0.1",
"project_urls": {
"Download": "https://github.com/Vanderval31bs/VJModels/archive/refs/tags/v5.0.1-alpha.tar.gz",
"Homepage": "https://github.com/Vanderval31bs/VJModels"
},
"split_keywords": [
"machinelearning",
" models",
" forests"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5024b0c2be69d88102ffa280c16522aa06d5387a9ba9d0d139e6decc686c75ca",
"md5": "4abeba278d3f7e013f33c63638713196",
"sha256": "89a44e184ae6a56cb802f34ef337f6120c69b174fdebcccbeb0c9b9f89bf0f93"
},
"downloads": -1,
"filename": "VJModels-5.0.1.tar.gz",
"has_sig": false,
"md5_digest": "4abeba278d3f7e013f33c63638713196",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 14131,
"upload_time": "2024-08-22T23:02:39",
"upload_time_iso_8601": "2024-08-22T23:02:39.849686Z",
"url": "https://files.pythonhosted.org/packages/50/24/b0c2be69d88102ffa280c16522aa06d5387a9ba9d0d139e6decc686c75ca/VJModels-5.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-22 23:02:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Vanderval31bs",
"github_project": "VJModels",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "vjmodels"
}