vtacML

- Name: vtacML
- Version: 0.1.20
- Summary: A machine learning pipeline to classify objects in the VTAC dataset as GRB or not.
- Home page: https://github.com/jerbeario/VTAC_ML
- Author: Jeremy Palmerio
- License: MIT
- Requires Python: >=3.10
- Uploaded: 2024-09-02 14:44:38
# vtacML

vtacML is a machine learning package designed for the analysis of data from the Visible Telescope (VT) on the SVOM mission. This package uses machine learning models to analyze a dataframe of features from VT observations and identify potential gamma-ray burst (GRB) candidates. The primary goal of vtacML is to integrate into the SVOM data analysis pipeline and add a feature to each observation indicating the probability that it is a GRB candidate.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [Quick Start](#quick-start)
  - [Grid Search and Model Training](#grid-search-and-model-training)
  - [Loading and Using the Best Model](#loading-and-using-the-best-model)
  - [Using Pre-trained Model for Immediate Prediction](#using-pre-trained-model-for-immediate-prediction)
  - [Config File](#config-file)
- [Documentation](#documentation)
- [License](#license)
- [Contact](#contact)

## Overview

The SVOM mission, a collaboration between the China National Space Administration (CNSA) and the French space agency CNES, aims to study gamma-ray bursts (GRBs), the most energetic explosions in the universe. The Visible Telescope (VT) on SVOM plays a critical role in observing these events in the optical wavelength range.

vtacML leverages machine learning to analyze VT data, providing a probability score for each observation to indicate its likelihood of being a GRB candidate. The package includes tools for data preprocessing, model training, evaluation, and visualization.

## Installation

To install vtacML, you can use `pip`:

```sh
pip install vtacML
```

Alternatively, you can clone the repository and install the package locally:

```sh
git clone https://github.com/jerbeario/vtacML.git
cd vtacML
pip install .
```

## Usage

### Quick Start

Here’s a quick example to get you started with vtacML:

```python
from vtacML.pipeline import VTACMLPipe

# Initialize the pipeline
pipeline = VTACMLPipe()

# Load configuration
pipeline.load_config('path/to/config.yaml')

# Train the model
pipeline.train()

# Evaluate the model
pipeline.evaluate('evaluation_name', plot=True)

# Predict GRB candidates
predictions = pipeline.predict(observation_dataframe, prob=True)
print(predictions)
```
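In the example above, `observation_dataframe` is assumed to be a pandas DataFrame whose columns match the feature list in the config (the default set is shown in the Config File section below). A minimal placeholder, with hypothetical values rather than real VT measurements, might look like:

```python
import pandas as pd

# Feature columns from the default config (Inputs.columns).
FEATURES = [
    "MAGCAL_R0", "MAGCAL_B0", "MAGERR_R0", "MAGERR_B0",
    "MAGCAL_R1", "MAGCAL_B1", "MAGERR_R1", "MAGERR_B1",
    "MAGVAR_R1", "MAGVAR_B1", "EFLAG_R0", "EFLAG_R1",
    "EFLAG_B0", "EFLAG_B1", "NEW_SRC", "DMAG_CAT",
]

# One placeholder observation; real values come from the VT pipeline.
observation_dataframe = pd.DataFrame([{name: 0.0 for name in FEATURES}])
print(observation_dataframe.shape)  # (1, 16)
```

In practice each row would be one VT source, and the resulting frame is what gets passed to `pipeline.predict`.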

### Grid Search and Model Training

vtacML can perform a grid search over the set of models and parameter grids specified in the configuration file. Initialize the `VTACMLPipe` class with a specified config file (or use the default) and train it. You can then save the best model for future use.

```python
from vtacML.pipeline import VTACMLPipe

# Initialize the pipeline with a configuration file
pipeline = VTACMLPipe(config_file='path/to/config.yaml')

# Train the model with grid search
pipeline.train()

# Save the best model
pipeline.save_best_model('path/to/save/best_model.pkl')
```

### Loading and Using the Best Model

After training and saving the best model, you can create a new instance of the `VTACMLPipe` class and load the best model for further use.

```python
from vtacML.pipeline import VTACMLPipe

# Initialize a new pipeline instance
pipeline = VTACMLPipe()

# Load the best model
pipeline.load_best_model('path/to/save/best_model.pkl')

# Predict GRB candidates
predictions = pipeline.predict(observation_dataframe, prob=True)
print(predictions)
```

### Using Pre-trained Model for Immediate Prediction

If you already have a trained model, you can use the wrapper function `predict_from_best_pipeline` to make predictions immediately. A pre-trained model is available by default.

```python
from vtacML.pipeline import predict_from_best_pipeline

# Predict GRB candidates using the pre-trained model
predictions = predict_from_best_pipeline(observation_dataframe, model_path='path/to/pretrained_model.pkl')
print(predictions)
```

### Config File

The config file controls the model search: the input data file and feature columns, the set of models and parameter grids to search over, and the output locations.

```yaml
# Default config file, used to search for best model using only first two sequences (X0, X1) from the VT pipeline
Inputs:
  file: 'combined_qpo_vt_all_cases_with_GRB_with_flags.parquet' # Data file used for training. Located in /data/
#  path: 'combined_qpo_vt_with_GRB.parquet'
#  path: 'combined_qpo_vt_faint_case_with_GRB_with_flags.parquet'
  columns: [
    "MAGCAL_R0",
    "MAGCAL_B0",
    "MAGERR_R0",
    "MAGERR_B0",
    "MAGCAL_R1",
    "MAGCAL_B1",
    "MAGERR_R1",
    "MAGERR_B1",
    "MAGVAR_R1",
    "MAGVAR_B1",
    'EFLAG_R0',
    'EFLAG_R1',
    'EFLAG_B0',
    'EFLAG_B1',
    "NEW_SRC",
    "DMAG_CAT"
    ] # features used for training
  target_column: 'IS_GRB' # feature column that holds the class information to be predicted

# Set of models and parameters to perform GridSearchCV over
Models:
  rfc:
    class: RandomForestClassifier()
    param_grid:
      'rfc__n_estimators': [100, 200, 300]  # Number of trees in the forest
      'rfc__max_depth': [4, 6, 8]  # Maximum depth of the tree
      'rfc__min_samples_split': [2, 5, 10]  # Minimum number of samples required to split an internal node
      'rfc__min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node
      'rfc__bootstrap': [True, False]  # Whether bootstrap samples are used when building trees
  ada:
    class: AdaBoostClassifier()
    param_grid:
      'ada__n_estimators': [50, 100, 200]  # Number of weak learners
      'ada__learning_rate': [0.01, 0.1, 1]  # Learning rate
      'ada__algorithm': ['SAMME']  # Algorithm for boosting
  svc:
    class: SVC()
    param_grid:
      'svc__C': [0.1, 1, 10, 100]  # Regularization parameter
      'svc__kernel': ['poly', 'rbf', 'sigmoid']  # Kernel type to be used in the algorithm
      'svc__gamma': ['scale', 'auto']  # Kernel coefficient
      'svc__degree': [3, 4, 5]  # Degree of the polynomial kernel function (if `kernel` is 'poly')
  knn:
    class: KNeighborsClassifier()
    param_grid:
      'knn__n_neighbors': [3, 5, 7, 9]  # Number of neighbors to use
      'knn__weights': ['uniform', 'distance']  # Weight function used in prediction
      'knn__algorithm': ['ball_tree', 'kd_tree', 'brute']  # Algorithm used to compute the nearest neighbors
      'knn__p': [1, 2]  # Power parameter for the Minkowski metric
  lr:
    class: LogisticRegression()
    param_grid:
      'lr__penalty': ['l1', 'l2', 'elasticnet']  # Specify the norm of the penalty
      'lr__C': [0.01, 0.1, 1, 10]  # Inverse of regularization strength
      'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']  # Algorithm to use in the optimization problem
      'lr__max_iter': [100, 200, 300]  # Maximum number of iterations taken for the solvers to converge
  dt:
    class: DecisionTreeClassifier()
    param_grid:
      'dt__criterion': ['gini', 'entropy']  # The function to measure the quality of a split
      'dt__splitter': ['best', 'random']  # The strategy used to choose the split at each node
      'dt__max_depth': [4, 6, 8, 10]  # Maximum depth of the tree
      'dt__min_samples_split': [2, 5, 10]  # Minimum number of samples required to split an internal node
      'dt__min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node

# Output directories
Outputs:
  model_path: '/output/models'
  viz_path: '/output/visualizations/'
  plot_correlation:
    flag: True
    path: 'output/corr_plots/'


```
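The grids above define a fairly large search space. As a rough sanity check (my own arithmetic, not a figure the package reports), the number of candidate configurations per model is the product of its parameter-list lengths, and each candidate is fit once per cross-validation fold:

```python
from itertools import product

# Parameter value lists copied from the rfc grid in the config above.
rfc_grid = {
    "rfc__n_estimators": [100, 200, 300],
    "rfc__max_depth": [4, 6, 8],
    "rfc__min_samples_split": [2, 5, 10],
    "rfc__min_samples_leaf": [1, 2, 4],
    "rfc__bootstrap": [True, False],
}

# Every combination of one value per parameter: 3 * 3 * 3 * 3 * 2 = 162.
n_combinations = len(list(product(*rfc_grid.values())))
print(n_combinations)  # 162
```

With 5-fold cross-validation, for instance, that would already be 810 fits for the random-forest grid alone, which is worth keeping in mind when extending the config.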

## Documentation

See documentation at 


### Setting Up Development Environment

To set up a development environment, use the provided `requirements-dev.txt`:

```sh
conda create --name vtacML-dev python=3.10
conda activate vtacML-dev
pip install -r requirements-dev.txt
```


### Running Tests

To run tests, use the following command:

```sh
pytest
```


## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.

## Contact

For questions or support, please contact:

- Jeremy Palmerio - [palmerio.jeremy@gmail.com](mailto:palmerio.jeremy@gmail.com)
- Project Link: [https://github.com/jerbeario/VTAC_ML](https://github.com/jerbeario/VTAC_ML)

            
