deepBreaks

Name	deepBreaks
Version	1.1.2
home_page	http://github.com/omicsEye/deepBreaks
Summary	deepBreaks: a machine learning tool for identifying and prioritizing genotype-phenotype associations
upload_time	2023-08-18 16:05:20
author	Mahdi Baghbanzadeh
license	MIT
keywords	machine learning genomics sequencing data

# deepBreaks #
![](https://github.com/omicsEye/deepbreaks/blob/master/img/fig1_overview.png?raw=True)
---

***deepBreaks*** is a computational method that aims to identify important
sequence changes associated with a phenotype of interest,
using multiple sequence alignment data from a population.

**Key features:**

* **Generality:** *deepBreaks* is a new computational tool for identifying genomic regions and genetic variants
significantly associated with phenotypes of interest.
* **Validation:** A comprehensive evaluation of deepBreaks performance using synthetic 
data generation with known ground truth for genotype-phenotype association testing.
* **Interpretation:** Rather than checking all possible mutations (breaks), _deepBreaks_ prioritizes only statistically
  promising candidate mutations.
* **Elegance:** User-friendly, open-source software allowing for high-quality visualization
and statistical tests. 
* **Optimization:** Since sequence data are often very high volume (next-generation DNA sequencing reads typically 
in the millions), all modules have been written and benchmarked for computing time.
* **Documentation:** Open-source GitHub repository of code complete with tutorials and a wide range of
real-world applications.

---
**Citation:**

Mahdi Baghbanzadeh, Tyson Dawson, Bahar Sayoldin, Todd H. Oakley, Keith A. Crandall, Ali Rahnavard (2023).
**_deepBreaks_: a machine learning tool for identifying and prioritizing genotype-phenotype associations**
, https://github.com/omicsEye/deepBreaks/.

---

# deepBreaks user manual #

## Contents ##

* [Features](#features)
* [deepBreaks](#deepbreaks)
    * [Installation](#installation)
      * [Windows Linux Mac](#windows-linux-mac)
      * [Apple M1/M2 MAC](#apple-m1m2-mac)
* [Getting Started with deepBreaks](#getting-started-with-deepbreaks)
    * [Test deepBreaks](#test-deepbreaks)
    * [Options](#options) 
    * [Input](#input)
    * [Output](#output)
    * [Demo](#demo)
    * [Tutorial](#tutorial)
* [Applications](#applications)
  * [*deepBreaks* identifies amino acids associated with color sensitivity](#opsin)
  * [Novel insights of niche associations in the oral microbiome](#hmp)
  * [*deepBreaks* reveals important SARS-CoV-2 regions associated with Alpha and Delta variants](#covid)
  * [*deepBreaks* identifies HIV regions with potentially important functions](#hiv)
* [Support](#support)
------------------------------------------------------------------------------------------------------------------------------
# Features #
1. Generic software that can handle any kind of sequencing data and phenotypes
2. One place to run all analyses and produce high-quality visualizations
3. Optimized computation
4. User-friendly software
5. Reports the predictive power of the most discriminative positions in sequencing data
# deepBreaks #

## Installation ##
* First, install *conda*.  
Go to the [Anaconda website](https://www.anaconda.com/) and download the latest version for your operating system.  
* For Windows users: do not forget to add `conda` to your system `PATH`.
* Second, check that conda is available.  
Open a terminal (or command line for Windows users) and run:
```
conda --version
```
It should output something like:
```
conda 4.9.2
```
If not, you must make *conda* available to your system before continuing.
If you have problems adding conda to `PATH`, you can find instructions
[here](https://docs.anaconda.com/anaconda/user-guide/faq/).  

### Windows Linux Mac ###
If you are using an **Apple M1/M2 Mac**, please go to the [Apple M1/M2 MAC](#apple-m1m2-mac) section for installation
instructions.  
If you already have a working conda environment, you can safely skip to step three.  
If you are using Windows, please make sure you have both git and Microsoft Visual C++ 14.0 or greater installed:
* [git](https://gitforwindows.org/)
* [Microsoft C++ build tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)

In case you face issues with this step, [this link](https://github.com/pycaret/pycaret/issues/1254) may help you.

1) Create a new conda environment (let's call it deepBreaks_env) with the following command:
```
conda create --name deepBreaks_env python=3.9
```
2) Activate your conda environment:
```commandline
conda activate deepBreaks_env 
```
3) Install *deepBreaks* with pip:
```commandline
pip install deepBreaks
```
Alternatively, you can install it directly from GitHub:
```commandline
python -m pip install git+https://github.com/omicsEye/deepbreaks
```
### Apple M1/M2 MAC ###
1) Update/install Xcode Command Line Tools
  ```commandline
  xcode-select --install
  ```
2) Install [Homebrew](https://brew.sh/)
  ```commandline
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  ```
3) Install the required libraries with Homebrew
  ```commandline
  brew install cmake libomp
  ```
4) Install miniforge
  ```commandline
  brew install miniforge
  ```
5) Close the current terminal and open a new terminal
6) Create a new conda environment (let's call it deepBreaks_env) with the following command:
  ```commandline
  conda create --name deepBreaks_env python=3.9
  ```
7) Activate the conda environment
  ```commandline
  conda activate deepBreaks_env
  ```
8) Install LightGBM from conda-forge and XGBoost with pip
  ```commandline
  conda install -c conda-forge lightgbm
  pip install xgboost
  ```
9) Finally, install *deepBreaks* with pip:
```commandline
pip install deepBreaks
```
Alternatively, you can install it directly from GitHub:
```commandline
python -m pip install git+https://github.com/omicsEye/deepbreaks
```
-----------------------------------------------------------------------------------------------------------------------

# Getting Started with deepBreaks #

## Test deepBreaks ##

To test if deepBreaks is installed correctly, you may run the following command in the terminal:

```commandline
deepBreaks -h
```
This prints the deepBreaks command-line options.
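
The same check can be scripted. The sketch below is not part of deepBreaks; it only assumes that installation places an executable named `deepBreaks` on the `PATH`, and the helper name `cli_available` is hypothetical.

```python
import shutil


def cli_available(name: str = "deepBreaks") -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if cli_available():
        print("deepBreaks CLI found; run `deepBreaks -h` to list its options.")
    else:
        print("deepBreaks CLI not found; check that the conda environment is activated.")
```

If the command is not found, the most likely cause is that `deepBreaks_env` has not been activated in the current terminal.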



## Options ##

```
$ deepBreaks -h
```
## Input ##
```commandline
usage: deepBreaks [-h] --seqfile SEQFILE --seqtype SEQTYPE --meta_data META_DATA --metavar METAVAR [--gap GAP] [--miss_gap MISS_GAP]
                  [--ult_rare ULT_RARE] --anatype {reg,cl}
                  [--distance_metric {correlation,hamming,jaccard,normalized_mutual_info_score,adjusted_mutual_info_score,adjusted_rand_score}]
                  [--fraction FRACTION] [--redundant_threshold REDUNDANT_THRESHOLD] [--distance_threshold DISTANCE_THRESHOLD]
                  [--top_models TOP_MODELS] [--cv CV] [--separate_cv] [--tune] [--plot] [--write]

options:
  -h, --help            show this help message and exit
  --seqfile SEQFILE, -sf SEQFILE
                        files contains the sequences
  --seqtype SEQTYPE, -st SEQTYPE
                        type of sequence: 'nu' for nucleotides or 'aa' for amino-acid
  --meta_data META_DATA, -md META_DATA
                        files contains the meta data
  --metavar METAVAR, -mv METAVAR
                        name of the meta var (response variable)
  --gap GAP, -gp GAP    Threshold to drop positions that have GAPs above this proportion. Default value is 0.7 and it means that the positions that
                        have 70% or more GAPs will be dropped from the analysis.
  --miss_gap MISS_GAP, -mgp MISS_GAP
                        Threshold to impute missing values with GAP. Gaps in positions that have missing values (gaps) above this proportion are
                        replaced with the term 'GAP'. The rest of the missing values are replaced by the mode of each position.
  --ult_rare ULT_RARE, -u ULT_RARE
                        Threshold to modify the ultra rare cases in each position.
  --anatype {reg,cl}, -a {reg,cl}
                        type of analysis
  --distance_metric {correlation,hamming,jaccard,normalized_mutual_info_score,adjusted_mutual_info_score,adjusted_rand_score}, -dm {correlation,hamming,jaccard,normalized_mutual_info_score,adjusted_mutual_info_score,adjusted_rand_score}
                        distance metric. Default is correlation.
  --fraction FRACTION, -fr FRACTION
                        fraction of main data to run
  --redundant_threshold REDUNDANT_THRESHOLD, -rt REDUNDANT_THRESHOLD
                        threshold for the p-value of the statistical tests to drop redundant features. Default value is 0.25
  --distance_threshold DISTANCE_THRESHOLD, -dth DISTANCE_THRESHOLD
                        threshold for the distance between positions to put them in clusters. features with distances <= the threshold will be
                        grouped together. Default value is 0.3
  --top_models TOP_MODELS, -tm TOP_MODELS
                        number of top models to consider for merging the results. Default value is 5
  --cv CV, -cv CV       number of folds for cross validation. Default is 10. If the given number is less than 1,
                        then instead of CV, a train/test split approach will be used with --cv being the test size. 
  
  --tune                After running the 10-fold cross validations, should the top selected models be tuned and finalized, or finalized only?
  --plot                plot all the individual positions that are statistically significant. Depending on your data, this process may produce many
                        plots.
  --write               During reading the fasta file we delete the positions that have GAPs over a certain threshold that can be changed in the
                        `gap_threshold` argument in the `read_data` function. As this may change the whole FASTA file, you may want to save the FASTA
                        file after this cleaning step.

```
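
To make the `--gap` threshold concrete, here is a minimal sketch in pandas. It is an illustration, not deepBreaks' own code: the function name `drop_gappy_positions`, the `-` gap character, and the choice to drop positions at or above the threshold (following the "70% or more" wording in the help text) are all assumptions.

```python
import pandas as pd


def drop_gappy_positions(alignment: pd.DataFrame, gap_threshold: float = 0.7,
                         gap_char: str = "-") -> pd.DataFrame:
    """Drop alignment columns whose share of gap characters is >= gap_threshold."""
    # Fraction of sequences that carry a gap at each position (column).
    gap_fraction = (alignment == gap_char).mean(axis=0)
    return alignment.loc[:, gap_fraction < gap_threshold]


# Toy alignment: rows are sequences, columns are positions (names are made up).
toy = pd.DataFrame(
    [list("A-C-"), list("A-G-"), list("T-CA"), list("AGC-")],
    columns=["p1", "p2", "p3", "p4"],
)
filtered = drop_gappy_positions(toy, gap_threshold=0.7)
print(list(filtered.columns))  # ['p1', 'p3']
```

Positions `p2` and `p4` each contain gaps in three of four sequences (75%), so both are removed at the default threshold of 0.7.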

## Output ##  
1. Correlated positions: all collinear positions are grouped together.
2. Models summary: a list of models and their performance metrics.
3. Plots of the feature importance of the top models in *modelName_dpi.png* format.
4. CSV files of feature importance based on the top models, containing the feature, importance, relative importance, and
group of the position (all collinear positions are grouped together).
5. Plots and a CSV file of the average feature importance of the top models.
6. Box plots (regression) or stacked bar plots (classification) for the top positions of each model.
7. Pickle files of the plots and final models.
8. p-values of all the variables used in training the final model.
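
The grouping of collinear positions in item 1 can be pictured with a small sketch. This is not the deepBreaks implementation: it assumes one-hot encoded features, correlation distance, and single-linkage clustering cut at the distance threshold, and the feature names are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy one-hot features: rows are samples, columns are (hypothetical) encoded positions.
features = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)
names = ["p10_A", "p11_G", "p10_T", "p42_C"]

# Correlation distance (1 - Pearson r) between every pair of columns.
distances = np.clip(pdist(features.T, metric="correlation"), 0.0, None)

# Columns whose linkage distance is <= the threshold end up in the same group.
tree = linkage(distances, method="single")
labels = fcluster(tree, t=0.3, criterion="distance")

groups = {}
for name, label in zip(names, labels):
    groups.setdefault(int(label), []).append(name)
print(groups)
```

Here `p10_A` and `p11_G` are identical columns (distance 0) and fall into one group, while the other two features stay on their own.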

## Demo ##
```commandline
deepBreaks -sf PATH_TO_SEQUENCE.FASTA -st aa -md PATH_TO_META_DATA.tsv -mv META_VARIABLE_NAME -a reg -dth 0.15 --plot --write
```
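
When the command is launched from a script, it can help to assemble it as an argument list. The helper below is a hypothetical convenience wrapper, not something deepBreaks provides; it uses only flags listed in the help text above and does not run anything by itself.

```python
import shlex


def build_deepbreaks_command(seqfile, seqtype, meta_data, metavar, anatype,
                             distance_threshold=None, plot=False, write=False):
    """Assemble a deepBreaks command line as a list of arguments."""
    if seqtype not in {"nu", "aa"}:
        raise ValueError("seqtype must be 'nu' or 'aa'")
    if anatype not in {"reg", "cl"}:
        raise ValueError("anatype must be 'reg' or 'cl'")
    cmd = ["deepBreaks", "--seqfile", str(seqfile), "--seqtype", seqtype,
           "--meta_data", str(meta_data), "--metavar", metavar,
           "--anatype", anatype]
    if distance_threshold is not None:
        cmd += ["--distance_threshold", str(distance_threshold)]
    if plot:
        cmd.append("--plot")
    if write:
        cmd.append("--write")
    return cmd


cmd = build_deepbreaks_command("PATH_TO_SEQUENCE.FASTA", "aa", "PATH_TO_META_DATA.tsv",
                               "META_VARIABLE_NAME", "reg",
                               distance_threshold=0.15, plot=True, write=True)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.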

## Tutorial ##
Several detailed Jupyter notebooks demonstrating _deepBreaks_ are available in the
[examples](https://github.com/omicsEye/deepbreaks/tree/master/examples) directory, and the
data required for the examples are available in the
[data](https://github.com/omicsEye/deepbreaks/tree/master/data) directory.  

For the `deepBreaks.models.model_compare_cv` function, these are the available models by default:
* Regression:
```python
models = {
            'rf': RandomForestRegressor(n_jobs=-1, random_state=123),
            'Adaboost': AdaBoostRegressor(random_state=123),
            'et': ExtraTreesRegressor(n_jobs=-1, random_state=123),
            'gbc': GradientBoostingRegressor(random_state=123),
            'dt': DecisionTreeRegressor(random_state=123),
            'lr': LinearRegression(n_jobs=-1),
            'Lasso': Lasso(random_state=123),
            'LassoLars': LassoLars(random_state=123),
            'BayesianRidge': BayesianRidge(),
            'HubR': HuberRegressor(),
            'xgb': XGBRegressor(n_jobs=-1, random_state=123),
            'lgbm': LGBMRegressor(n_jobs=-1, random_state=123)
        }
```
 * Classification:
```python
models = {
            'rf': RandomForestClassifier(n_jobs=-1, random_state=123),
            'Adaboost': AdaBoostClassifier(random_state=123),
            'et': ExtraTreesClassifier(n_jobs=-1, random_state=123),
            'lg': LogisticRegression(n_jobs=-1, random_state=123),
            'gbc': GradientBoostingClassifier(random_state=123),
            'dt': DecisionTreeClassifier(random_state=123),
            'xgb': XGBClassifier(n_jobs=-1, random_state=123),
            'lgbm': LGBMClassifier(n_jobs=-1, random_state=123)
        }
```
The default metrics for evaluation are:
* Regression:
```python
scores = {'R2': 'r2',
          'MAE': 'neg_mean_absolute_error',
          'MSE': 'neg_mean_squared_error',
          'RMSE': 'neg_root_mean_squared_error',
          'MAPE': 'neg_mean_absolute_percentage_error'
          }
```
 * Classification:
```python
scores = {'Accuracy': 'accuracy',
          'AUC': 'roc_auc_ovr',
          'F1': 'f1_macro',
          'Recall': 'recall_macro',
          'Precision': 'precision_macro'
          }
```
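
The values in these dictionaries are scikit-learn scorer names. As a rough sketch of how such a dictionary behaves (whether deepBreaks passes it to `cross_validate` internally is an assumption), it can be tried on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

scores = {"R2": "r2",
          "MAE": "neg_mean_absolute_error",
          "MSE": "neg_mean_squared_error"}

# Synthetic regression data stands in for encoded sequence features.
X, y = make_regression(n_samples=60, n_features=8, noise=0.1, random_state=123)
model = RandomForestRegressor(n_estimators=20, random_state=123)

# One entry per metric is returned under the key "test_<name>".
results = cross_validate(model, X, y, scoring=scores, cv=3)
for name in scores:
    print(name, results[f"test_{name}"].mean())
```

Scorers prefixed with `neg_` return negated errors, so a higher value is better for every metric in the dictionary.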
To get the full list of available metrics, you can use:
```python
from sklearn import metrics
# scikit-learn >= 1.0; older releases exposed the same names via metrics.SCORERS.keys()
print(metrics.get_scorer_names())
```
The default search parameters for the models are:
```python
import numpy as np
params = {
        'rf': {'rf__max_features': ["sqrt", "log2"]},
        'Adaboost': {'Adaboost__learning_rate': np.linspace(0.001, 0.1, num=2),
                     'Adaboost__n_estimators': [100, 200]},
        'gbc': {'gbc__max_depth': range(3, 6),
                'gbc__max_features': ['sqrt', 'log2'],
                'gbc__n_estimators': [200, 500, 800],
                'gbc__learning_rate': np.linspace(0.001, 0.1, num=2)},
        'et': {'et__max_depth': [4, 6, 8],
               'et__n_estimators': [500, 1000]},
        'dt': {'dt__max_depth': [4, 6, 8]},
        'Lasso': {'Lasso__alpha': np.linspace(0.01, 100, num=5)},
        'LassoLars': {'LassoLars__alpha': np.linspace(0.01, 100, num=5)}
    }
```
**Attention:** The model names in the models `dict` must match the names used in the `params` `dict`.
If a model name has no matching entry in `params`, that model is fitted with its default hyperparameters.
For example, `xgb` has no entry in the default `params`, so `model_compare_cv` uses `xgboost` with default hyperparameters.  
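
The `name__parameter` keys follow the scikit-learn convention for addressing a parameter of a named pipeline step. The sketch below shows that convention with `GridSearchCV` on synthetic data; that deepBreaks tunes its models in exactly this way is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=60, n_features=8, noise=0.1, random_state=123)

# The step name 'rf' is what the 'rf__' prefix in the grid refers to.
pipe = Pipeline(steps=[("rf", RandomForestRegressor(n_estimators=20, random_state=123))])
param_grid = {"rf__max_features": ["sqrt", "log2"]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_)
```

If the prefix does not match any step name, scikit-learn raises an error for the invalid parameter, which is why the names in the two dictionaries need to agree.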

To use the `deepBreaks.models.model_compare_cv` function with default parameters:
```python
from deepBreaks.models import model_compare_cv
from deepBreaks.preprocessing import MisCare, ConstantCare, URareCare, CustomOneHotEncoder
from deepBreaks.preprocessing import FeatureSelection, CollinearCare
from deepBreaks.utils import get_models, get_scores, get_params, make_pipeline

ana_type = 'reg'  # assume that we are running a regression analysis
report_dir = 'PATH/TO/A/DIRECTORY' # to save the reports
# `tr` (the sequence data) and `y` (the response variable) are assumed to be loaded already
prep_pipeline = make_pipeline(cache_dir=None,
    steps=[
        ('mc', MisCare(missing_threshold=0.25)),
        ('cc', ConstantCare()),
        ('ur', URareCare(threshold=0.05)),
        ('cc2', ConstantCare()),
        ('one_hot', CustomOneHotEncoder()),
        ('feature_selection', FeatureSelection(model_type=ana_type, alpha=0.25)),
        ('collinear_care', CollinearCare(dist_method='correlation', threshold=0.25))
    ])
report, top = model_compare_cv(X=tr, y=y, preprocess_pipe=prep_pipeline,
                               models_dict=get_models(ana_type=ana_type),
                               scoring=get_scores(ana_type=ana_type),
                               report_dir=report_dir,
                               cv=10, ana_type=ana_type, cache_dir=None)

```
To use a new set of `models`, `params`, or `metrics` you can define them in a `dict`:
```python
import deepBreaks.models as ml
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from deepBreaks.models import model_compare_cv
from deepBreaks.preprocessing import MisCare, ConstantCare, URareCare, CustomOneHotEncoder
from deepBreaks.preprocessing import FeatureSelection, CollinearCare
from deepBreaks.utils import get_models, get_scores, get_params, make_pipeline

ana_type = 'reg'  # assume that we are running a regression analysis
report_dir = 'PATH/TO/A/DIRECTORY' # to save the reports
# `tr` (the sequence data) and `y` (the response variable) are assumed to be loaded already
# define a new set of models
models = {'rf': RandomForestRegressor(n_jobs=-1, random_state=123),
          'Adaboost': AdaBoostRegressor(random_state=123),
          'et': ExtraTreesRegressor(n_jobs=-1, random_state=123)
          }


prep_pipeline = make_pipeline(cache_dir=None,
    steps=[
        ('mc', MisCare(missing_threshold=0.25)),
        ('cc', ConstantCare()),
        ('ur', URareCare(threshold=0.05)),
        ('cc2', ConstantCare()),
        ('one_hot', CustomOneHotEncoder()),
        ('feature_selection', FeatureSelection(model_type=ana_type, alpha=0.25)),
        ('collinear_care', CollinearCare(dist_method='correlation', threshold=0.25))
    ])
report, top = model_compare_cv(X=tr, y=y, preprocess_pipe=prep_pipeline,
                               models_dict=models,
                               scoring=get_scores(ana_type=ana_type),
                               report_dir=report_dir,
                               cv=10, ana_type=ana_type, cache_dir=None)

'''
Since we do not define a set of parameters for the model "et", it will fit with
default parameters
'''
# change the set of metrics
scores = {'R2': 'r2',
          'MAE': 'neg_mean_absolute_error',
          'MSE': 'neg_mean_squared_error'
          }

report, top = model_compare_cv(X=tr, y=y, preprocess_pipe=prep_pipeline,
                               models_dict=models,
                               scoring=scores,
                               report_dir=report_dir,
                               cv=10, ana_type=ana_type, cache_dir=None)
```

# Applications #
Here we apply **_deepBreaks_** to different datasets and discuss the results.

<h2 id="opsin">
<i>deepBreaks</i> identifies amino acids associated with color sensitivity
</h2>

![Opsins](https://github.com/omicsEye/deepbreaks/blob/master/img/lite_mar/figure.png?raw=True)  

Opsins are genes involved in light sensitivity and vision, and when coupled with a light-reactive chromophore, the
absorbance of the resulting photopigment dictates physiological phenotypes like color sensitivity. We analyzed the 
amino acid sequence of rod opsins because previously published mutagenesis work established mechanistic connections
between 12 specific amino acid sites and phenotypes [Yokoyama et al. (2008)](https://doi.org/10.1073/pnas.0802426105). 
Therefore, we hypothesized that machine learning approaches could predict known associations between amino acid sites 
and absorbance phenotypes. We identified opsins expressed in
rod cells of vertebrates (mainly marine fishes) with absorption spectra measurements (λmax, the wavelength with the
highest absorption). The dataset contains 175 samples of opsin sequences. We next applied deepBreaks on this
dataset to find the most important sites contributing to the variations of λmax. 
This [Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/continuous_phenotype_light_sensitivity.ipynb) 
illustrates the steps.


<h2 id="hmp">
Novel insights of niche associations in the oral microbiome
</h2>

![hmp](https://github.com/omicsEye/deepbreaks/blob/master/img/hmp/hmp.png?raw=True)  
Microbial species tend to adapt at the genome level to the niche in which they live. We hypothesize 
that genes with essential functions change based on where microbial species live. Here we use microbial strain 
representatives from stool metagenomics data of healthy adults from the
[Human Microbiome Project](https://doi.org/10.1038/nature11234). The input for deepBreaks consists of 1) an MSA file
with 1006 rows, each a representative strain of a specific microbial species (here *Haemophilus parainfluenzae*) and
each of length 49839; and 2) the labels to predict, which are the body sites from which the samples were collected. 
This [Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/discrete_phenotype_HMP.ipynb)
illustrates the steps.


<h2 id="covid">
<i>deepBreaks</i> reveals important SARS-CoV-2 regions associated with Alpha and Delta variants
</h2>

![sarscov2](https://github.com/omicsEye/deepbreaks/blob/master/img/sars_cov2/sarscov2.png?raw=True)
Variants occur with new mutations in the virus genome. Most mutations in the SARS-CoV-2 genome do not affect the
functioning of the virus. However, mutations in the spike protein of SARS-CoV-2, which binds to receptors on cells 
lining the inside of the human nose, may make the virus easier to spread or affect how well vaccines protect people. 
We study the mutations in the spike protein of sequences from two variants: Alpha (B.1.1.7), the first variant of 
concern, described in the United Kingdom (UK) in late December 2020, and Delta (B.1.617.2), first reported in India in
December 2020. We used the publicly available data from [GISAID](https://gisaid.org/) and obtained 900 sequences
of the spike protein region of the Alpha (450 samples) and Delta (450 samples) variants. Then, we used deepBreaks to analyze 
the data and find the most important (predictive) positions in these sequences in terms of classifying the variants. 
This
[Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/discrete_phenotype_SARS_Cov2_variants.ipynb) 
illustrates the steps.


<h2 id="hiv">
<i>deepBreaks</i> identifies HIV regions with potentially important functions
</h2>

![hiv](https://github.com/omicsEye/deepbreaks/blob/master/img/HIV/HIV3.png?raw=True)
Subtypes of the human immunodeficiency virus type 1 (HIV-1) group M are different in the envelope (Env) glycoproteins 
of the virus. These parts of the virus are displayed on the surface of the virion and are targets for both neutralizing
antibody and cell-mediated immune responses. The third hypervariable domain (V3) of HIV-1 gp120 is a cysteine-bounded
loop structure usually composed of 105 nucleotides and labeled as the base (nu 1:26 and 75:105), stem
(nu 27:44 and 54:74), and turn (nu 45:53) regions [Lynch et al. (2009)](https://doi.org/10.1089%2Faid.2008.0219).
Among all of the hypervariable regions in gp120 (V1-V5), V3 plays the main role in virus infectivity
[Felsövályi et al. (2006)](https://doi.org/10.1089%2Faid.2006.22.703). 
Here we use deepBreaks to identify regions in the V3 loop that are important for associating
V3 sequences with subtypes B and C. We used the [Los Alamos HIV Database](https://www.hiv.lanl.gov) to gather the 
nucleotide sequences of the V3 loop of subtypes B and C. 
This [Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/discrete_phenotype_HIV.ipynb) 
illustrates the steps.

# Support #

* Please submit your questions or issues with the software at
[Issues tracker](https://github.com/omicsEye/deepBreaks/issues).  
* For community discussions, questions, and issue reporting, please visit our forum [here](https://forum.omicseye.org/c/omics-downstream-analysis/deepbreaks/12)

Raw data

            {
    "_id": null,
    "home_page": "http://github.com/omicsEye/deepBreaks",
    "name": "deepBreaks",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "machine learning,genomics,sequencing data",
    "author": "Mahdi Baghbanzadeh",
    "author_email": "mbagh@gwu.edu",
    "download_url": "https://files.pythonhosted.org/packages/26/af/d10105f783ee7d848d54fb0d59025647bdec134ee24e25999b4de537a56c/deepBreaks-1.1.2.tar.gz",
    "platform": "Linux",
    "description": "# deepBreaks #\n![](https://github.com/omicsEye/deepbreaks/blob/master/img/fig1_overview.png?raw=True)\n---\n\n***deepBreaks*** , a computational method, aims to identify important\nchanges in association with the phenotype of interest\nusing multi-alignment sequencing data from a population.\n\n**Key features:**\n\n* **Generality:** *deepBreaks* is a new computational tool for identifying genomic regions and genetic variants\nsignificantly associated with phenotypes of interest.\n* **Validation:** A comprehensive evaluation of deepBreaks performance using synthetic \ndata generation with known ground truth for genotype-phenotype association testing.\n* **Interpretation:** Rather than checking all possible mutations (breaks), _deepBreaks_ prioritizes only statistically\n  promising candidate mutations.\n* **Elegance:** User-friendly, open-source software allowing for high-quality visualization\nand statistical tests. \n* **Optimization:** Since sequence data are often very high volume (next-generation DNA sequencing reads typically \nin the millions), all modules have been written and benchmarked for computing time.\n* **Documentation:** Open-source GitHub repository of code complete with tutorials and a wide range of\nreal-world applications.\n\n---\n**Citation:**\n\nMahdi Baghbanzadeh, Tyson Dawson, Bahar Sayoldin, Todd H. Oakley, Keith A. 
Crandall, Ali Rahnavard (2023).\n**_deepBreaks_: a machine learning tool for identifying and prioritizing genotype-phenotype associations**\n, https://github.com/omicsEye/deepBreaks/.\n\n---\n\n# deepBreaks user manual #\n\n## Contents ##\n\n* [Features](#features)\n* [deepBreaks](#deepBreaks)\n    * [Installation](#installation)\n      * [Windows Linux Mac](#Windows-Linux-Mac)\n      * [Apple M1/M2 MAC](#apple-m1m2-mac)\n* [Getting Started with deepBreaks](#getting-started-with-deepBreaks)\n    * [Test deepBreaks](#test-omeClust)\n    * [Options](#options) \n    * [Input](#input)\n    * [Output](#output)\n    * [Demo](#demo)\n    * [Tutorial](#tutorial)\n* [Applications](#applications)\n  * [*deepBreaks* identifies amino acids associated with color sensitivity](#opsin)\n  * [Novel insights of niche associations in the oral microbiome](#hmp)\n  * [*deepBreaks* reveals important SARS-CoV-2 regions associated with Alpha and Delta variants](#covid)\n  * [*deepBreaks* identifies HIV regions with potentially important functions](#hiv)\n* [Support](#support)\n------------------------------------------------------------------------------------------------------------------------------\n# Features #\n1. Generic software that can handle any kind of sequencing data and phenotypes\n2. One place to do all analysis and producing high-quality visualizations\n3. Optimized computation\n4. User-friendly software\n5. Provides a predictive power of most discriminative positions in a sequencing data\n# DeepBreaks #\n\n## Installation ##\n* First install *conda*  \nGo to the [Anaconda website](https://www.anaconda.com/) and download the latest version for your operating system.  
\n* For Windows users: do not forget to add `conda` to your system `path`\n* Second is to check for conda availability  \nopen a terminal (or command line for Windows users) and run:\n```\nconda --version\n```\nit should out put something like:\n```\nconda 4.9.2\n```\nif not, you must make *conda* available to your system for further steps.\nif you have problems adding conda to PATH, you can find instructions\n[here](https://docs.anaconda.com/anaconda/user-guide/faq/).  \n\n### Windows Linux Mac ###\nIf you are using an **Apple M1/M2 MAC** please go to the [Apple M1/M2 MAC](#apple-m1m2-mac) for installation\ninstructions.  \nIf you have a working conda on your system, you can safely skip to step three.  \nIf you are using windows, please make sure you have both git and Microsoft Visual C++ 14.0 or greater installed.\ninstall [git](https://gitforwindows.org/)\n[Microsoft C++ build tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)\nIn case you face issues with this step, [this link](https://github.com/pycaret/pycaret/issues/1254) may help you.\n1) Create a new conda environment (let's call it deepBreaks_env) with the following command:\n```\nconda create --name deepBreaks_env python=3.9\n```\n2) Activate your conda environment:\n```commandline\nconda activate deepBreaks_env \n```\n3) Install *deepBreaks*:\ninstall with pip:\n```commandline\npip install deepBreaks\n```\nor you can directly install if from GitHub:\n```commandline\npython -m pip install git+https://github.com/omicsEye/deepbreaks\n```\n### Apple M1/M2 MAC ###\n1) Update/install Xcode Command Line Tools\n  ```commandline\n  xcode-select --install\n  ```\n2) Install [Brew](https://brew.sh/index_fr)\n  ```commandline\n  /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"\n  ```\n3) Install libraries for brew\n  ```commandline\n  brew install cmake libomp\n  ```\n4) Install miniforge\n  ```commandline\n  brew install miniforge\n  ```\n5) Close 
the current terminal and open a new terminal\n6) Create a new conda environment (let's call it deepBreaks_env) with the following command:\n  ```commandline\n  conda create --name deepBreaks_env python=3.9\n  ```\n7) Activate the conda environment\n  ```commandline\n  conda activate deepBreaks_env\n  ```\n8) Install packages from Conda\n  ```commandline\n  conda install -c conda-forge lightgbm\n  pip install xgboost\n  ```\n9) Finally, install *deepBreaks*:\ninstall with pip:\n```commandline\npip install deepBreaks\n```\nor you can directly install if from GitHub:\n```commandline\npython -m pip install git+https://github.com/omicsEye/deepbreaks\n```\n-----------------------------------------------------------------------------------------------------------------------\n\n# Getting Started with deepBreaks #\n\n## Test deepBreaks ##\n\nTo test if deepBreaks is installed correctly, you may run the following command in the terminal:\n\n```#!cmd\ndeepBreaks -h\n```\nWhich yields deepBreaks command line options.\n\n\n\n## Options ##\n\n```\n$ deepBreaks -h\n```\n## Input ##\n```commandline\nusage: deepBreaks [-h] --seqfile SEQFILE --seqtype SEQTYPE --meta_data META_DATA --metavar METAVAR [--gap GAP] [--miss_gap MISS_GAP]\n                  [--ult_rare ULT_RARE] --anatype {reg,cl}\n                  [--distance_metric {correlation,hamming,jaccard,normalized_mutual_info_score,adjusted_mutual_info_score,adjusted_rand_score}]\n                  [--fraction FRACTION] [--redundant_threshold REDUNDANT_THRESHOLD] [--distance_threshold DISTANCE_THRESHOLD]\n                  [--top_models TOP_MODELS] [--cv CV] [--separate_cv] [--tune] [--plot] [--write]\n\noptions:\n  -h, --help            show this help message and exit\n  --seqfile SEQFILE, -sf SEQFILE\n                        files contains the sequences\n  --seqtype SEQTYPE, -st SEQTYPE\n                        type of sequence: 'nu' for nucleotides or 'aa' for amino-acid\n  --meta_data META_DATA, -md META_DATA\n               
         files contains the meta data\n  --metavar METAVAR, -mv METAVAR\n                        name of the meta var (response variable)\n  --gap GAP, -gp GAP    Threshold to drop positions that have GAPs above this proportion. Default value is 0.7 and it means that the positions that\n                        70% or more GAPs will be dropped from the analysis.\n  --miss_gap MISS_GAP, -mgp MISS_GAP\n                        Threshold to impute missing values with GAP. Gapsin positions that have missing values (gaps) above this proportionare\n                        replaced with the term 'GAP'. the rest of the missing valuesare replaced by the mode of each position.\n  --ult_rare ULT_RARE, -u ULT_RARE\n                        Threshold to modify the ultra rare cases in each position.\n  --anatype {reg,cl}, -a {reg,cl}\n                        type of analysis\n  --distance_metric {correlation,hamming,jaccard,normalized_mutual_info_score,adjusted_mutual_info_score,adjusted_rand_score}, -dm {correlation,hamming,jaccard,normalized_mutual_info_score,adjusted_mutual_info_score,adjusted_rand_score}\n                        distance metric. Default is correlation.\n  --fraction FRACTION, -fr FRACTION\n                        fraction of main data to run\n  --redundant_threshold REDUNDANT_THRESHOLD, -rt REDUNDANT_THRESHOLD\n                        threshold for the p-value of the statistical tests to drop redundant features. Defaultvalue is 0.25\n  --distance_threshold DISTANCE_THRESHOLD, -dth DISTANCE_THRESHOLD\n                        threshold for the distance between positions to put them in clusters. features with distances <= than the threshold will be\n                        grouped together. Default values is 0.3\n  --top_models TOP_MODELS, -tm TOP_MODELS\n                        number of top models to consider for merging the results. Default value is 5\n  --cv CV, -cv CV       number of folds for cross validation. Default is 10. 
If the given number is less than 1,\n                        then instead of CV, a train/test split approach will be used with --cv being the test size. \n  \n  --tune                After running the 10-fold cross validations, should the top selected models be tuned and finalize, or finalized only?\n  --plot                plot all the individual positions that are statistically significant.Depending on your data, this process may produce many\n                        plots.\n  --write               During reading the fasta file we delete the positions that have GAPs over a certain threshold that can be changed in the\n                        `gap_threshold` argumentin the `read_data` function. As this may change the whole FASTA file, you maywant to save the FASTA\n                        file after this cleaning step.\n\n```\n\n## Output ##  \n1. correlated positions. We group all the collinear positions together.\n2. models summary. list of models and their performance metrics.\n3. plot of the feature importance of the top models in *modelName_dpi.png* format.\n4. csv files of feature importance based on top models, containing feature, importance, relative importance, and \ngroup of the position (we group all the collinear positions together)\n5. plots and csv file of the average feature importance of the top models.\n6. box plot (regression) or stacked bar plot (classification) for top positions of each model.\n7. pickle files of the plots and final models\n8. 
p-values of all the variables used in training of the final model\n\n## Demo ##\n```commandline\ndeepBreaks -sf PATH_TO_SEQUENCE.FASTA -st aa -md PATH_TO_META_DATA.tsv -mv META_VARIABLE_NAME -a reg -dth 0.15 --plot --write\n```\n\n## Tutorial ##\nSeveral detailed Jupyter notebooks demonstrating _deepBreaks_ are available in the\n[examples](https://github.com/omicsEye/deepbreaks/tree/master/examples) directory, and the\ndata required for the examples are available in the\n[data](https://github.com/omicsEye/deepbreaks/tree/master/data) directory.  \n\nFor the `deepBreaks.models.model_compare` function, these are the available models by default:\n* Regression:\n```python\nmodels = {\n            'rf': RandomForestRegressor(n_jobs=-1, random_state=123),\n            'Adaboost': AdaBoostRegressor(random_state=123),\n            'et': ExtraTreesRegressor(n_jobs=-1, random_state=123),\n            'gbc': GradientBoostingRegressor(random_state=123),\n            'dt': DecisionTreeRegressor(random_state=123),\n            'lr': LinearRegression(n_jobs=-1),\n            'Lasso': Lasso(random_state=123),\n            'LassoLars': LassoLars(random_state=123),\n            'BayesianRidge': BayesianRidge(),\n            'HubR': HuberRegressor(),\n            'xgb': XGBRegressor(n_jobs=-1, random_state=123),\n            'lgbm': LGBMRegressor(n_jobs=-1, random_state=123)\n        }\n```\n* Classification:\n```python\nmodels = {\n            'rf': RandomForestClassifier(n_jobs=-1, random_state=123),\n            'Adaboost': AdaBoostClassifier(random_state=123),\n            'et': ExtraTreesClassifier(n_jobs=-1, random_state=123),\n            'lg': LogisticRegression(n_jobs=-1, random_state=123),\n            'gbc': GradientBoostingClassifier(random_state=123),\n            'dt': DecisionTreeClassifier(random_state=123),\n            'xgb': XGBClassifier(n_jobs=-1, random_state=123),\n            'lgbm': LGBMClassifier(n_jobs=-1, random_state=123)\n        }\n```\nThe default 
metrics for evaluation are:\n* Regression:\n```python\nscores = {'R2': 'r2',\n          'MAE': 'neg_mean_absolute_error',\n          'MSE': 'neg_mean_squared_error',\n          'RMSE': 'neg_root_mean_squared_error',\n          'MAPE': 'neg_mean_absolute_percentage_error'\n          }\n```\n* Classification:\n```python\nscores = {'Accuracy': 'accuracy',\n          'AUC': 'roc_auc_ovr',\n          'F1': 'f1_macro',\n          'Recall': 'recall_macro',\n          'Precision': 'precision_macro'\n          }\n```\nTo get the full list of available metrics, you can use:\n```python\nfrom sklearn import metrics\n# metrics.SCORERS was removed in scikit-learn 1.3; get_scorer_names() replaces it\nprint(metrics.get_scorer_names())\n```\nThe default search parameters for the models are:\n```python\nimport numpy as np\nparams = {\n        'rf': {'rf__max_features': [\"sqrt\", \"log2\"]},\n        'Adaboost': {'Adaboost__learning_rate': np.linspace(0.001, 0.1, num=2),\n                     'Adaboost__n_estimators': [100, 200]},\n        'gbc': {'gbc__max_depth': range(3, 6),\n                'gbc__max_features': ['sqrt', 'log2'],\n                'gbc__n_estimators': [200, 500, 800],\n                'gbc__learning_rate': np.linspace(0.001, 0.1, num=2)},\n        'et': {'et__max_depth': [4, 6, 8],\n               'et__n_estimators': [500, 1000]},\n        'dt': {'dt__max_depth': [4, 6, 8]},\n        'Lasso': {'Lasso__alpha': np.linspace(0.01, 100, num=5)},\n        'LassoLars': {'LassoLars__alpha': np.linspace(0.01, 100, num=5)}\n    }\n```\n**Attention:** The model names in the models `dict` must match the keys of the `dict` provided \nfor the `params`. If a model name has no matching entry in `params`, that model is fit with\nits default parameters.  For example, `model_compare_cv` uses `xgboost` with default hyperparameters.  
\n\nTo use the `deepBreaks.models.model_compare_cv` function with default parameters:\n```python\nfrom deepBreaks.models import model_compare_cv\nfrom deepBreaks.preprocessing import MisCare, ConstantCare, URareCare, CustomOneHotEncoder\nfrom deepBreaks.preprocessing import FeatureSelection, CollinearCare\nfrom deepBreaks.utils import get_models, get_scores, get_params, make_pipeline\n\nana_type = 'reg'  # assume that we are running a regression analysis\nreport_dir = 'PATH/TO/A/DIRECTORY' # to save the reports\nprep_pipeline = make_pipeline(cache_dir=None,\n    steps=[\n        ('mc', MisCare(missing_threshold=0.25)),\n        ('cc', ConstantCare()),\n        ('ur', URareCare(threshold=0.05)),\n        ('cc2', ConstantCare()),\n        ('one_hot', CustomOneHotEncoder()),\n        ('feature_selection', FeatureSelection(model_type=ana_type, alpha=0.25)),\n        ('collinear_care', CollinearCare(dist_method='correlation', threshold=0.25))\n    ])\nreport, top = model_compare_cv(X=tr, y=y, preprocess_pipe=prep_pipeline,\n                               models_dict=get_models(ana_type=ana_type),\n                               scoring=get_scores(ana_type=ana_type),\n                               report_dir=report_dir,\n                               cv=10, ana_type=ana_type, cache_dir=None)\n\n```\nTo use a new set of `models`, `params`, or `metrics` you can define them in a `dict`:\n```python\nimport deepBreaks.models as ml\nfrom sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor\nfrom sklearn.ensemble import ExtraTreesRegressor\nfrom deepBreaks.models import model_compare_cv\nfrom deepBreaks.preprocessing import MisCare, ConstantCare, URareCare, CustomOneHotEncoder\nfrom deepBreaks.preprocessing import FeatureSelection, CollinearCare\nfrom deepBreaks.utils import get_models, get_scores, get_params, make_pipeline\n\nana_type = 'reg'  # assume that we are running a regression analysis\nreport_dir = 'PATH/TO/A/DIRECTORY' # to save the reports\n# define a 
new set of models\nmodels = {'rf': RandomForestRegressor(n_jobs=-1, random_state=123),\n          'Adaboost': AdaBoostRegressor(random_state=123),\n          'et': ExtraTreesRegressor(n_jobs=-1, random_state=123)\n          }\n\n\nprep_pipeline = make_pipeline(cache_dir=None,\n    steps=[\n        ('mc', MisCare(missing_threshold=0.25)),\n        ('cc', ConstantCare()),\n        ('ur', URareCare(threshold=0.05)),\n        ('cc2', ConstantCare()),\n        ('one_hot', CustomOneHotEncoder()),\n        ('feature_selection', FeatureSelection(model_type=ana_type, alpha=0.25)),\n        ('collinear_care', CollinearCare(dist_method='correlation', threshold=0.25))\n    ])\nreport, top = model_compare_cv(X=tr, y=y, preprocess_pipe=prep_pipeline,\n                               models_dict=models,\n                               scoring=get_scores(ana_type=ana_type),\n                               report_dir=report_dir,\n                               cv=10, ana_type=ana_type, cache_dir=None)\n\n# Since we do not define a set of parameters for the model \"et\", it will fit with\n# default parameters\n\n# change the set of metrics\nscores = {'R2': 'r2',\n          'MAE': 'neg_mean_absolute_error',\n          'MSE': 'neg_mean_squared_error'\n          }\n\nreport, top = model_compare_cv(X=tr, y=y, preprocess_pipe=prep_pipeline,\n                               models_dict=models,\n                               scoring=scores,\n                               report_dir=report_dir,\n                               cv=10, ana_type=ana_type, cache_dir=None)\n```\n\n# Applications #\nHere we apply **_deepBreaks_** to several datasets and discuss the results.\n\n<h2 id=\"opsin\">\n<i>deepBreaks</i> identifies amino acids associated with color sensitivity\n</h2>\n\n![Opsins](https://github.com/omicsEye/deepbreaks/blob/master/img/lite_mar/figure.png?raw=True)  \n\nOpsins are genes involved in light sensitivity and vision, and when coupled with a light-reactive 
chromophore, the\nabsorbance of the resulting photopigment dictates physiological phenotypes like color sensitivity. We analyzed the \namino acid sequences of rod opsins because previously published mutagenesis work established mechanistic connections\nbetween 12 specific amino acid sites and phenotypes [Yokoyama et al. (2008)](https://doi.org/10.1073/pnas.0802426105). \nTherefore, we hypothesized that machine learning approaches could predict known associations between amino acid sites \nand absorbance phenotypes. We identified opsins expressed in\nrod cells of vertebrates (mainly marine fishes) with absorption spectra measurements (\u03bbmax, the wavelength with the\nhighest absorption). The dataset contains 175 samples of opsin sequences. We next applied deepBreaks to this\ndataset to find the most important sites contributing to the variation in \u03bbmax. \nThis [Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/continuous_phenotype_light_sensitivity.ipynb) \nillustrates the steps.\n\n\n<h2 id=\"hmp\">\nNovel insights of niche associations in the oral microbiome\n</h2>\n\n![hmp](https://github.com/omicsEye/deepbreaks/blob/master/img/hmp/hmp.png?raw=True)  \nMicrobial species tend to adapt at the genome level to the niche in which they live. We hypothesize \nthat genes with essential functions change based on where microbial species live. Here we use microbial strain \nrepresentatives from stool metagenomics data of healthy adults from the\n[Human Microbiome Project](https://doi.org/10.1038/nature11234). The input for deepBreaks consists of 1) an MSA file\nwith 1006 rows, each a representative strain of a specific microbial species, here Haemophilus parainfluenzae, each of\nlength 49839; and 2) labels for deepBreaks prediction, which are the body sites from which samples were collected. 
 \nThis [Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/discrete_phenotype_HMP.ipynb)\nillustrates the steps.\n\n\n<h2 id=\"covid\">\n<i>deepBreaks</i> reveals important SARS-CoV-2 regions associated with Alpha and Delta variants\n</h2>\n\n![sarscov2](https://github.com/omicsEye/deepbreaks/blob/master/img/sars_cov2/sarscov2.png?raw=True)\nVariants arise from new mutations in the virus genome. Most mutations in the SARS-CoV-2 genome do not affect the\nfunctioning of the virus. However, mutations in the spike protein of SARS-CoV-2, which binds to receptors on cells \nlining the inside of the human nose, may make the virus easier to spread or affect how well vaccines protect people. \nWe study mutations in the spike protein sequences of two variants: Alpha (B.1.1.7), the first variant of \nconcern, described in the United Kingdom (UK) in late December 2020, and Delta (B.1.617.2), first reported in India in\nDecember 2020. We used the publicly available data from [GISAID](https://gisaid.org/) and obtained 900 sequences\nof the spike protein region of Alpha (450 samples) and Delta (450 samples) variants. Then, we used deepBreaks to analyze \nthe data and find the most important (predictive) positions in these sequences in terms of classifying the variants. \nThis\n[Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/discrete_phenotype_SARS_Cov2_variants.ipynb) \nillustrates the steps.\n\n\n<h2 id=\"hiv\">\n<i>deepBreaks</i> identifies HIV regions with potentially important functions\n</h2>\n\n![hiv](https://github.com/omicsEye/deepbreaks/blob/master/img/HIV/HIV3.png?raw=True)\nSubtypes of the human immunodeficiency virus type 1 (HIV-1) group M differ in the envelope (Env) glycoproteins \nof the virus. These parts of the virus are displayed on the surface of the virion and are targets for both neutralizing\nantibody and cell-mediated immune responses. 
The third hypervariable domain (V3) of HIV-1 gp120 is a cysteine-bounded\nloop structure usually composed of 105 nucleotides and labeled as the base (nu 1:26 and 75:105), stem\n(nu 27:44 and 54:74), and turn (nu 45:53) regions [Lynch et al. (2009)](https://doi.org/10.1089%2Faid.2008.0219).\nAmong all of the hypervariable regions in gp120 (V1-V5), V3 plays the main role in virus infectivity\n[Fels\u00f6v\u00e1lyi et al. (2006)](https://doi.org/10.1089%2Faid.2006.22.703). \nHere we use deepBreaks to identify regions in the V3 loop that are important for associating\nV3 sequences with subtypes B and C. We used the [Los Alamos HIV Database](https://www.hiv.lanl.gov) to gather the \nnucleotide sequences of the V3 loop of subtypes B and C. \nThis [Jupyter Notebook](https://github.com/omicsEye/deepbreaks/blob/master/examples/discrete_phenotype_HIV.ipynb) \nillustrates the steps.\n\n# Support #\n\n* Please submit your questions or issues with the software at the\n[Issues tracker](https://github.com/omicsEye/deepBreaks/issues).  \n* For community discussions, questions, and issue reporting, please visit our forum [here](https://forum.omicseye.org/c/omics-downstream-analysis/deepbreaks/12)\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "deepBreaks: a machine learning tool for identifying and prioritizing genotype-phenotype associations",
    "version": "1.1.2",
    "project_urls": {
        "Homepage": "http://github.com/omicsEye/deepBreaks"
    },
    "split_keywords": [
        "machine learning",
        "genomics",
        "sequencing data"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "26afd10105f783ee7d848d54fb0d59025647bdec134ee24e25999b4de537a56c",
                "md5": "b9f3b10e24e6d3a6cff250414cbf7828",
                "sha256": "3f8c0273e82d2c0eddbd8a17358f362de41257cdeca52372c91a81e1cfea04c2"
            },
            "downloads": -1,
            "filename": "deepBreaks-1.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "b9f3b10e24e6d3a6cff250414cbf7828",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 37878,
            "upload_time": "2023-08-18T16:05:20",
            "upload_time_iso_8601": "2023-08-18T16:05:20.995669Z",
            "url": "https://files.pythonhosted.org/packages/26/af/d10105f783ee7d848d54fb0d59025647bdec134ee24e25999b4de537a56c/deepBreaks-1.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-18 16:05:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "omicsEye",
    "github_project": "deepBreaks",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "deepbreaks"
}

Mahdi Baghbanzadeh