biodem


Field | Value
--- | ---
Name | biodem
Version | 0.6.0
Home page | None
Summary | Dual-extraction method for phenotypic prediction and functional gene mining of complex traits.
Upload time | 2024-05-06 03:38:49
Maintainer | None
Docs URL | None
Author | None
Requires Python | `<3.12,>=3.10`
License | MIT License Copyright (c) 2024 Ma Lab Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Keywords | bioinformatics, deep-learning

<div align="center">

# DEM

### Dual-extraction method for phenotypic prediction and functional gene mining

[![pypi-badge](https://img.shields.io/pypi/v/biodem)](https://pypi.org/project/biodem)
[![pypi-badge](https://img.shields.io/pypi/dm/biodem.svg?label=Pypi%20downloads)](https://pypi.org/project/biodem)
[![license-badge](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

</div>

+ The **DEM** software comprises 4 functional modules: data preprocessing, dual-extraction modeling, phenotypic prediction, and functional gene mining.  


![modules of biodem](fig_dem.png)

<br></br>

## Installation

### System requirements
+ Python 3.10 or 3.11 (the installation steps below use 3.11).
+ Graphics: GPU with [PyTorch](https://pytorch.org) support.
> Recommended: NVIDIA graphics card with 12 GB of memory or more.

### Install `biodem` package
> [Conda](https://conda.io/projects/conda/en/latest/index.html) / [Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) or virtualenv is recommended for installation.

1. Create a conda environment:
    ```sh
    conda create -n dem python=3.11
    conda activate dem

    # Install PyTorch with CUDA support
    conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
    ```

2. Install the `biodem` package:
    + Simple installation from [PyPI](https://pypi.org/project/biodem)
        ```sh
        pip install biodem
        ```

#### Test installation
A pipeline for testing is provided in [**`./test/test_biodem.py`**](./test/test_biodem.py). *It's also a simple usage example of `biodem`.*
```sh
cd test
python test_biodem.py
```

<br></br>

# DEM ([biodem](https://pypi.org/project/biodem)) includes 4 functional modules:

### 1. Data preprocessing

> **Nested cross-validation is recommended: split the data into folds before applying the other preprocessing steps.**

+ _Steps:_
    1. Split the data into nested cross-validation folds.
    2. Imputation & standardization.
    3. Feature selection using the variance threshold method, PCA, and RF for multi-omics data.
    4. SNP2Gene transformation.
    
+ _Related functions:_

    Function | Command line tool
    --- | ---
    `filter_na_pheno` | `filter-pheno`
    `KFoldSplitter` | -
    `data_prep_ncv_regre` | [`ncv-prep-r`](#ncv-prep-r)
    `data_prep_ncv_class` | [`ncv-prep-c`](#ncv-prep-c)
    `impute_omics` | [`dem-impute`](#dem-impute)
    `select_varpca` | [`dem-select-varpca`](#dem-select-varpca)
    `select_rf` | [`dem-select-rf`](#dem-select-rf)
    `build_snp_dict` | - [Instruction (in Julia)](#transform-snps-to-genomic-embedding)
    `encode_vcf2matrix` | - [Instruction (in Julia)](#transform-snps-to-genomic-embedding)
    `process_avail_snp` | -
    `SNPDataModule` | -
    `SNP2GeneTrain` | -
    `SNP2GenePipeline` | [`dem-s2g-pipe`](#dem-s2g-pipe)
    `snp_to_gene` | -
    `train_snp2gene` | -

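To make the fold structure in step 1 concrete, here is a minimal pure-Python sketch of a nested cross-validation split. It is not the implementation behind `KFoldSplitter` or `ncv-prep-r`; the helper names `kfold_indices` and `nested_cv_splits` are hypothetical and only illustrate how outer test folds and inner validation folds relate to each other.

```python
import random


def kfold_indices(indices, n_splits, seed):
    """Shuffle `indices` and split them into `n_splits` nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_splits] for i in range(n_splits)]


def nested_cv_splits(n_samples, n_outer=5, n_inner=5, seed=0):
    """Yield (outer_id, inner_id, train, validation, test) index lists."""
    outer_folds = kfold_indices(range(n_samples), n_outer, seed)
    for i, test in enumerate(outer_folds):
        # Everything outside the outer test fold is available for model selection.
        remaining = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = kfold_indices(remaining, n_inner, seed + i + 1)
        for j, validation in enumerate(inner_folds):
            train = [idx for m, fold in enumerate(inner_folds) if m != j for idx in fold]
            yield i, j, train, validation, test


# Hypothetical toy setting: 20 samples, 5 outer folds, 4 inner folds.
splits = list(nested_cv_splits(n_samples=20, n_outer=5, n_inner=4))
print(len(splits))  # 20 (5 outer folds x 4 inner folds)
```

In this sketch the outer test fold never appears in any inner training or validation set, which is what allows imputation, scaling, and feature selection to be fitted on training samples only.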
### 2. Dual-extraction modeling

+ It takes preprocessed multi-omics data and phenotypic data as inputs for nested cross-validation, based on which the DEM model is constructed. It is capable of performing both classification and regression tasks.
    
+ _Related functions:_

    Function | Command line tool
    --- | ---
    `DEMLTNDataModule` | -
    `DEMTrain` | -
    `DEMTrainPipeline` | [`dem-train-pipe`](#dem-train-pipe)

### 3. Phenotypic prediction

+ It takes a pretrained model and omics data of new samples as inputs and returns the predicted phenotypes.

+ _Related functions:_

    Function | Command line tool
    --- | ---
    `DEMPredict` | -
    `predict_pheno` | [`dem-predict`](#dem-predict)

### 4. Functional gene mining

+ It performs functional gene mining based on the trained DEM model through _feature permutation_.

+ _Related functions:_

    Function | Command line tool
    --- | ---
    `DEMFeatureRanking` | -
    `rank_feat` | [`dem-rank`](#dem-rank)


---

<br></br>

## How to use `biodem`
We provide both [**command line tools**](#use-command-line-tools) and [**importable functions**](#import-biodem-in-your-python-project) for `biodem`.
> DEM is mainly implemented in Python. We also provide two Julia scripts for VCF and GFF file processing and SNP encoding.

#### Input and output file formats
Please refer to the directory [**`./data`**](./data/) for **the file formats** that are used in the following methods.

<br></br>

### Import `biodem` in your Python project
The main functions' purposes and parameters are described in related sections of [***Use command line tools***](#use-command-line-tools).
**We provide a simple example analysis pipeline in [`./test/test_biodem.py`](./test/test_biodem.py) to demonstrate how to use `biodem` in Python.**

```python
# An example:

# Use a pretrained DEM model to predict phenotypes of given omics data files
from biodem import predict_pheno

# Define the paths to your model and omics data files
model_path = 'dem_model.ckpt'
omics_file_paths = ['omics_type_1.csv', 'omics_type_2.csv']
output_dir = 'dir_predictions'

# Run the DEM model to predict phenotypes
predict_pheno(
    model_path = model_path,
    omics_paths = omics_file_paths,
    result_dir = output_dir,
    batch_size = 16,
    map_location = None,
)
```

<br></br>

### Use command line tools

> In a terminal, run a command such as `dem-impute --help` or `dem-impute -h` to see its help message.

<br></br>

+ #### `ncv-prep-r`
    <br>The processing steps are the same as "[1. Data preprocessing](#1-data-preprocessing)".</br>

    ```sh
    # Usage:
    ncv-prep-r -K 5 -k 5 -i <input.csv> -o <output_dir> -t 0.01 -n 1000 -x <trait> --raw-label <labels.csv> --n-trees 2000 --na 0.1
    ```

    Parameters | Type | Required | Description
    --- | --- | --- | ---
    `-K` `--loop-outer` | int | * | Number of outer loops for nested cross-validation.
    `-k` `--loop-inner` | int | * | Number of inner loops for nested cross-validation.
    `-i` `--input` | str | * | Path to the input omics/phenotypes data.
    `-o` `--dir-out` | str | * | Path to the output directory.
    `-t` `--threshold-var` | float | * | Threshold for variance selection.
    `-n` `--n-rf-selected` | int | * | Number of selected features by random forest.
    `-x` `--trait` | str | * | Name of the phenotype column in the input data.
    `--raw-label` | str | * | Path to the raw labels data.
    `--n-trees` | int | * | Number of trees in random forest.
    `--na` | float | * | Threshold for missing value.

<br></br>

+ #### `ncv-prep-c`
    <br>The processing steps are the same as "[1. Data preprocessing](#1-data-preprocessing)".</br>
    ```sh
    # Usage:
    ncv-prep-c -K 5 -k 5 -i <input.csv> -o <output_dir> -t 0.01 -r 0.9 -x <trait> --raw-label <labels.csv> --na 0.1
    ```

    Parameters | Type | Required | Description
    --- | --- | --- | ---
    `-K` `--loop-outer` | int | * | Number of outer loops for nested cross-validation.
    `-k` `--loop-inner` | int | * | Number of inner loops for nested cross-validation.
    `-i` `--input` | str | * | Path to the input omics/phenotypes data.
    `-o` `--dir-out` | str | * | Path to the output directory.
    `-t` `--threshold-var` | float | * | Threshold for variance selection.
    `-r` `--target-var-ratio` | float | * | Target variance ratio for PCA.
    `-x` `--trait` | str | * | Name of the phenotype column in the input data.
    `--raw-label` | str | * | Path to the raw labels data.
    `--na` | float | * | Threshold for missing value.


<br></br>

+ #### `dem-impute`
    <br>The tool contains 3 steps:</br>
    1. Delete omics features whose proportion of missing values exceeds the threshold (default: 25%).
    2. Impute missing values with the mean value of each feature.
    3. Min-max scaling.
    ```sh
    # Usage:
    dem-impute -I <an_omics_file_in> -O <an_omics_file_out> -i <a_phenotypes_file_in> -o <phenotypes_out> -p <NA_threshold> -m <is_minmax_omics> -z <is_zscore_phenotypes>
    ```

    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-I` `--inom` | string | optional | Input a path to an omics file
    `-O` `--outom` | string | optional | Define your output omics file path
    `-i` `--inph` | string | optional | Input a path to a trait's phenotypes
    `-o` `--outph` | string | optional | Define your output phenotypes path
    `-p` `--propna` | float | optional | The allowed max proportion of missing values in a feature (DEFAULT: 0.25)
    `-m` `--minmax` | int, 0 or 1 | optional | Whether min-max scaling for omics is required (0 denotes False, 1 denotes True, DEFAULT: 1)
    `-z` `--zscore` | int, 0 or 1 | optional | Whether z-score transformation for phenotypes is required (0 denotes False, 1 denotes True)

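The three `dem-impute` steps can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the code behind `impute_omics`: the function name `impute_and_scale`, the dict-of-lists input, and the use of `None` for missing values are all hypothetical.

```python
def impute_and_scale(columns, max_na_ratio=0.25):
    """columns: dict mapping feature name -> list of floats (None = missing)."""
    result = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        na_ratio = 1 - len(observed) / len(values)
        # Step 1: drop features with too many missing values.
        if not observed or na_ratio > max_na_ratio:
            continue
        # Step 2: replace missing values with the feature mean.
        mean = sum(observed) / len(observed)
        filled = [mean if v is None else v for v in values]
        # Step 3: min-max scale to [0, 1]; constant features map to 0.0.
        low, high = min(filled), max(filled)
        span = high - low
        result[name] = [(v - low) / span if span else 0.0 for v in filled]
    return result


# Hypothetical toy input: gene_b has 50% missing values and is removed.
omics = {
    "gene_a": [1.0, None, 3.0, 5.0],
    "gene_b": [None, None, 2.0, 4.0],
}
result = impute_and_scale(omics)
print(result)  # {'gene_a': [0.0, 0.5, 0.5, 1.0]}
```

In a nested cross-validation setting, the mean and the min/max would normally be estimated on the training samples of a fold and then reused for its validation and test samples; the sketch omits that for brevity.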

<br></br>

+ #### `dem-select-varpca`
    
    <br>
    
    To assess the significance of each feature across phenotype classes and keep the important features, the tool performs the following steps:
    
    </br>
    
    1. Remove features with variance less than the given threshold.
    2. Calculate the ANOVA F value of each feature.
    3. Sequentially remove the features with the lowest F values.
    4. Continue until the first principal component of the remaining features explains less than the given proportion of the variance.
    ```sh
    # Usage:
    dem-select-varpca -I <an_omics_file_in> -i <a_phenotypes_file_in> -O <omics_file_output> -V <var_threshold> -P <target_variance_ratio>
    ```

    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-I` `--inom` | string | * | Input a path to an omics file
    `-i` `--inph` | string | * | Input a path to a trait's phenotypes
    `-O` `--outom` | string | * | Define your output omics file path
    `-V` `--minvar` | float |  | The allowed minimum variance of a feature (DEFAULT: 0.0)
    `-P` `--varpc` | float |  | Target variance of PC1 (DEFAULT: 0.5)

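As a rough illustration of steps 1 and 2 of `dem-select-varpca`, the sketch below filters features by variance and ranks the survivors by their one-way ANOVA F value. It deliberately leaves out the PCA-based stopping rule, and the function names `anova_f` and `rank_by_f` are hypothetical rather than part of `biodem`.

```python
from statistics import mean, pvariance


def anova_f(values, labels):
    """One-way ANOVA F value of a single feature across phenotype classes."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    if len(groups) < 2:
        raise ValueError("at least two classes are required")
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # Perfect separation (or a constant feature): avoid dividing by zero.
        return float("inf") if ss_between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def rank_by_f(features, labels, min_var=0.0):
    """Drop features with variance below `min_var`, then sort by decreasing F."""
    kept = {name: vals for name, vals in features.items() if pvariance(vals) >= min_var}
    return sorted(kept, key=lambda name: anova_f(kept[name], labels), reverse=True)


# Hypothetical toy data: two classes, three features.
labels = [0, 0, 1, 1]
features = {
    "f_const": [1.0, 1.0, 1.0, 1.0],   # zero variance, filtered out
    "f_noise": [1.0, 2.0, 1.0, 2.0],   # same mean in both classes, F = 0
    "f_signal": [1.0, 1.2, 3.0, 3.2],  # class means differ strongly
}
ranked = rank_by_f(features, labels, min_var=0.01)
print(ranked)  # ['f_signal', 'f_noise']
```

Features at the end of this ranking are the ones that would be removed first in step 3.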

<br></br>

+ #### `dem-select-rf`
    <br>The random forest (RF) algorithm, an ensemble of the given number of decision trees, is used to select representative omics features for subsequent DEM construction.</br>
    ```sh
    # Usage:
    dem-select-rf -I <an_omics_file_in> -i <a_phenotypes_file_in> -O <omics_file_output> -n <num_features_to_save> -p <validation_data_proportion> -N <num_trees> -S <random_seed_rf> -s <random_seed_sp>
    ```
    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-I` `--inom` | string | * | Input a path to an omics file
    `-i` `--inph` | string | * | Input a path to a trait's phenotypes
    `-O` `--outom` | string | * | Define your output omics file path
    `-n` `--nfeat` | int | * | Number of features to save
    `-N` `--ntree` | int |  | Number of trees in the random forest (DEFAULT: 2500). A larger number of trees gives more accurate results but takes longer to run.
    `-S` `--seedrf` | list[int] |  | Random seeds for RF (DEFAULT: 1000, 1001, ..., 1009)


<br></br>

+ #### `dem-s2g-pipe`
    <br>The pipeline for SNP2Gene modeling and transformation based on nested cross-validation.</br>
    ```sh
    # Usage example:
    dem-s2g-pipe -t <trait_name> -l <n_label_class> --inner-lbl <dir_label_inner> --outer-lbl <dir_label_outer> --h5 <path_h5_processed> --json <path_json_genes_snps> --log-dir <log_dir> --o-s2g-dir <dir_to_save_converted>
    ```
    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-t` `--trait` | string | * | Trait name
    `-l` `--lbl-class` | int | * | Number of classes for trait
    `--inner-lbl` | string | * | Directory to inner label files
    `--outer-lbl` | string | * | Directory to outer label files
    `--h5` | string | * | Path to h5 file containing genotype data
    `--json` | string | * | Path to json file containing SNP-gene mapping information
    `--log-dir` | string | * | Directory to save log files and models
    `--o-s2g-dir` | string | * | Directory to save SNP2Gene results


<br></br>

+ #### `dem-train-pipe`
    <br>The pipeline for DEM modeling and prediction on nested cross-validation datasets with hyperparameter optimization. DEM aims to achieve high phenotypic prediction accuracy and identify the most informative omics features for the given trait.
    This tool is used to construct a dual-extraction model based on the given omics data and phenotype, through cross validation or random sampling.</br>
    ```sh
    # Usage example:
    dem-train-pipe -o <log_dir> -t <trait_name> -l <n_label_class> --inner-lbl <dir_label_inner> --outer-lbl <dir_label_outer> --inner-om <dir_omics_inner> --outer-om <dir_omics_outer>
    ```
    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-o` `--log-dir` | string | * | Directory to save log files and models
    `-t` `--trait` | string | * | Trait name
    `-l` `--lbl-class` | integer | * | Number of classes for trait
    `--inner-lbl` | string | * | Directory to inner label files
    `--outer-lbl` | string | * | Directory to outer label files
    `--inner-om` | string | * | Directory to inner omics files
    `--outer-om` | string | * | Directory to outer omics files


<br></br>

+ #### `dem-predict`
    <br>Predicts the phenotype of given omics data files using a trained DEM model.</br>
    ```sh
    # Usage example:
    dem-predict -m <model_file_path> -o <phenotype_output> -I <omics_files>
    ```
    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-I` `--inom` | list[string] | * | Input path(s) to omics file(s)
    `-m` `--inmd` | string | * | The path to a pretrained DEM model
    `-o` `--outdir` | string | * | The directory to save predicted phenotypes


<br></br>

+ #### `dem-rank`
    <br>Assess the importance of features by contrasting the DEM model’s test performance on the actual feature values against those where the feature values have been randomly shuffled. Features that are ranked highly are then identified. </br>
    ```sh
    # Usage example:
    dem-rank -I <omics_file_path> -i <pheno_file_path> -t <trait_name> -l <n_label_class> -m <model_path> -o <output_dir> -b <batch_size> -s <seed1> <seed2>...
    ```
    Parameters | Type | Required | Descriptions
    --- | --- | --- | ---
    `-I` `--inom` | string | * | Input path(s) to omics file(s)
    `-i` `--inph` | string | * | Input a path to a trait's phenotypes
    `-t` `--trait` | string | * | The name of the trait to be predicted
    `-l` `--lbl-class` | int | * | Number of classes for the trait (1 for regression)
    `-m` `--inmd` | string | * | The path to a pretrained DEM model
    `-o` `--outdir` | string | * | The directory to save ranked features
    `-b` `--batch-size` | int |  | Batch size for feature ranking (default: 16)
    `-s` `--seeds` | list[int] |  | Random seeds for ranking repeats (default: 0-9)

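The idea behind feature permutation can be sketched without any deep learning code. The example below is a generic illustration, not the `rank_feat` implementation: the toy model, the negative mean squared error metric, and the function name `permutation_importance` are assumptions made for the sake of a runnable example.

```python
import random


def permutation_importance(predict, rows, targets, metric, seeds=range(10)):
    """Mean drop in `metric` after shuffling each feature column, over all seeds."""
    baseline = metric(targets, [predict(row) for row in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for seed in seeds:
            # Shuffle one column while leaving all other columns untouched.
            column = [row[col] for row in rows]
            random.Random(seed).shuffle(column)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, column)]
            score = metric(targets, [predict(row) for row in permuted])
            drops.append(baseline - score)
        importances.append(sum(drops) / len(drops))
    return importances


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher values mean better predictions."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_model(row):
    """Stand-in for a trained model: depends on feature 0 only."""
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(8)]
targets = [toy_model(row) for row in rows]
scores = permutation_importance(toy_model, rows, targets, neg_mse)
ranking = sorted(range(len(scores)), key=lambda col: scores[col], reverse=True)
print(ranking)  # [0, 1]
```

Shuffling feature 1 leaves the toy model's predictions unchanged, so its importance is zero, while shuffling feature 0 degrades the score. Averaging over several seeds, as the `-s` option suggests, reduces the influence of any single shuffle.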

<br></br>

## Transform SNPs to Genomic Embedding

1. **Build a dict of SNPs and their corresponding genomic regions**
    + It accepts a VCF file and a GFF file (reference genome) as input, and returns a dictionary of SNPs and their corresponding genomic regions.
2. **One-hot encode SNPs**
    + It accepts a dictionary of SNPs and their corresponding genomic regions, and returns one-hot encoded SNPs based on the actual di-nucleotide composition of each SNP.
3. **Transform encoded SNPs to genomic embedding**
    + The one-hot encoded SNPs are transformed into dense and continuous features that represent genomic variation (each feature corresponds to a gene).

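The first two steps above can be pictured with a small Python sketch. The actual processing is done by the Julia functions `build_snp_dict` and `encode_vcf2matrix`, whose internal encoding is not described here, so the scheme below is an assumption: each SNP genotype is treated as an unordered pair of nucleotides and mapped to one of 10 slots. The names `map_snps_to_genes` and `one_hot_genotype` are hypothetical.

```python
from itertools import combinations_with_replacement

# Assumed encoding: one slot per unordered nucleotide pair (AA, AC, ..., TT).
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement("ACGT", 2)]
SLOT = {genotype: i for i, genotype in enumerate(GENOTYPES)}


def map_snps_to_genes(snps, genes):
    """Map each gene to the SNPs located inside its genomic interval.

    snps:  list of (snp_id, chromosome, position)
    genes: list of (gene_id, chromosome, start, end), with inclusive coordinates
    """
    mapping = {}
    for gene_id, gene_chrom, start, end in genes:
        hits = [snp_id for snp_id, chrom, pos in snps
                if chrom == gene_chrom and start <= pos <= end]
        if hits:
            mapping[gene_id] = hits
    return mapping


def one_hot_genotype(allele_1, allele_2):
    """One-hot vector for a di-nucleotide genotype; all zeros if it is unknown."""
    vector = [0] * len(GENOTYPES)
    key = "".join(sorted((allele_1, allele_2)))
    if key in SLOT:
        vector[SLOT[key]] = 1
    return vector


# Hypothetical toy data.
snps = [("snp1", "chr1", 150), ("snp2", "chr1", 900), ("snp3", "chr2", 150)]
genes = [("geneA", "chr1", 100, 200), ("geneB", "chr2", 100, 200),
         ("geneC", "chr2", 500, 600)]
gene_to_snps = map_snps_to_genes(snps, genes)
encoded = one_hot_genotype("T", "A")
print(gene_to_snps)  # {'geneA': ['snp1'], 'geneB': ['snp3']}
print(encoded)       # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

In step 3, the vectors of all SNPs that belong to one gene would then be fed to a SNP2Gene model, which compresses them into a single dense feature per gene.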
## Requirements
1. Please [install `biodem`](#installation) first.
2. Install [Julia](https://julialang.org) 1.10.
3. Install the following Julia packages:
    ```sh
    # Open Julia REPL
    julia
    ```
    ```julia
    # In Julia REPL
    using Pkg
    Pkg.add("JLD2")
    Pkg.add("HDF5")
    Pkg.add("JSON")
    Pkg.add("GFF3")
    Pkg.add("GeneticVariation")
    Pkg.add("CSV")
    Pkg.add("DataFrames")
    ```

## Input and output file formats
Please refer to **the example files** in [`./data/`](./data/) for **the file formats** that are used in the following functions.

<br></br>

### Example usage

- Build a dictionary of SNPs and their corresponding genomic regions from a VCF file and a GFF file. The one-hot encoded SNPs are saved to an H5 file.
```sh
# In shell, enter Julia REPL
julia
```
```julia
# In Julia REPL

# Import functions from git repository
include("./src/biodem/utils_vcf_gff.jl")
include("./src/biodem/utils_encode_snp_vcf.jl")

# Build a dict of SNPs and their corresponding genomic regions
path_gff = "example.gff"
path_vcf = "example.vcf"
build_snp_dict(path_vcf, path_gff, true, true)

# One-hot encode SNPs
path_snp_dict = "snp_dict.jld2"  # The output of the build_snp_dict function
encode_vcf2matrix(path_snp_dict, true)
```

- Select the available SNPs and their corresponding genomic regions from the H5 file produced by the `encode_vcf2matrix` function.
```python
# In Python

# Import function for processing available SNPs
from biodem import process_avail_snp

path_json_snp_gene_relation = "gene4snp.json"
path_h5_snp_matrix = "snp_and_gene4snp.h5"

process_avail_snp(path_h5_snp_matrix, path_json_snp_gene_relation)
# The output will be a new H5 file containing only available SNPs and their corresponding genomic regions.
```

- Train SNP2Gene models. Please refer to the [`dem-s2g-pipe`](#dem-s2g-pipe) command line tool and the [`snp_to_gene`](./src/biodem/module_snp.py) function for more details.

---

## Directory structure
```sh
.
├── src                        # Python and Julia implementation for DEM
├── data                       # Data files and example files
├── test                       # Easy tests for packages and pipelines
├── pyproject.toml             # Python package metadata
├── LICENSE                    # License file
└── README.md                  # This file

./src/biodem
├── __init__.py                # Initialize the Python package
├── cli_dem.py                 # Command line interface for DEM
├── utils.py                   # Utilities for modeling and prediction
├── module_data_prep.py        # Data preprocessing
├── module_data_prep_ncv.py    # Data preprocessing for nested cross-validation
├── module_snp.py              # SNP2Gene modeling pipeline for SNP preprocessing
├── module_dem.py              # DEM modeling pipeline, Phenotypic prediction, and Functional gene mining
├── model_dem.py               # DEM model definition
├── model_snp2gene.py          # SNP2Gene model definition
├── utils_vcf_gff.jl           # Utilities for VCF and GFF file processing
└── utils_encode_snp_vcf.jl    # Utilities for one-hot encoding SNPs
```

<br></br>

# Asking for help
If you have any questions, please:
+ Contact us via [GitHub](https://github.com/cma2015/dem/issues).
+ [Email](mailto:ryl1999@126.com) us.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "biodem",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.12,>=3.10",
    "maintainer_email": null,
    "keywords": "bioinformatics, deep-learning",
    "author": null,
    "author_email": "Chenhua Wu <chanhuawu@outlook.com>, Yanlin Ren <ryl1999@126.com>",
    "download_url": "https://files.pythonhosted.org/packages/bb/d9/049507a2c3c02ba5ab29b264248ebd51a76dd1eae56f065da9181bb0d360/biodem-0.6.0.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n\n# DEM\n\n### Dual-extraction method for phenotypic prediction and functional gene mining\n\n[![pypi-badge](https://img.shields.io/pypi/v/biodem)](https://pypi.org/project/biodem)\n[![pypi-badge](https://img.shields.io/pypi/dm/biodem.svg?label=Pypi%20downloads)](https://pypi.org/project/biodem)\n[![license-badge](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n</div>\n\n+ The **DEM** software comprises 4 functional modules: data preprocessing, dual-extraction modeling, phenotypic prediction, and functional gene mining.  \n\n\n![modules of biodem](fig_dem.png)\n\n<br></br>\n\n## Installation\n\n### System requirements\n+ Python 3.11.\n+ Graphics: GPU with [PyTorch](https://pytorch.org) support.\n> Recommended: NVIDIA graphics card with 12GB memory or larger.\n\n### Install `biodem` package\n> [Conda](https://conda.io/projects/conda/en/latest/index.html) / [Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) or virtualenv is recommended for installation.\n\n1. Create a conda environment:\n    ```sh\n    conda create -n dem python=3.11\n    conda activate dem\n\n    # Install PyTorch with CUDA support\n    conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia\n    ```\n\n2. Install biodem package:\n    + Simple installation from [PyPI](https://pypi.org/project/biodem)\n        ```sh\n        pip install biodem\n        ```\n\n#### Test installation\nA pipeline for testing is provided in [**`./test/test_biodem.py`**](./test/test_biodem.py). *It's also a simple usage example of `biodem`.*\n```sh\ncd test\npython test_biodem.py\n```\n\n<br></br>\n\n# DEM ([biodem](https://pypi.org/project/biodem)) includes 4 functional modules:\n\n### 1. Data preprocessing\n\n> **Nested cross-validation is recommended for data preprocessing, before these steps.**\n\n+ _Steps:_\n    1. 
Split the data into nested cross-validation folds.\n    2. Imputation & standardization.\n    2. Feature selection using the variance threshold method, PCA, and RF for multi-omics data.\n    3. SNP2Gene transformation.\n    \n+ _Related functions:_\n\n    Function | Command line tool\n    --- | ---\n    `filter_na_pheno` | `filter-pheno`\n    `KFoldSplitter` | -\n    `data_prep_ncv_regre` | [`ncv-prep-r`](#ncv-prep-r)\n    `data_prep_ncv_class` | [`ncv-prep-c`](#ncv-prep-c)\n    `impute_omics` | [`dem-impute`](#dem-impute)\n    `select_varpca` | [`dem-select-varpca`](#dem-select-varpca)\n    `select_rf` | [`dem-select-rf`](#dem-select-rf)\n    `build_snp_dict` | - [Instruction (in Julia)](#transform-snps-to-genomic-embedding)\n    `encode_vcf2matrix` | - [Instruction (in Julia)](#transform-snps-to-genomic-embedding)\n    `process_avail_snp` | -\n    `SNPDataModule` | -\n    `SNP2GeneTrain` | -\n    `SNP2GenePipeline` | [`dem-s2g-pipe`](#dem-s2g-pipe)\n    `snp_to_gene` | -\n    `train_snp2gene` | -\n\n### 2. Dual-extraction modeling\n\n+ It takes preprocessed multi-omics data and phenotypic data as inputs for nested cross-validation, based on which the DEM model is constructed. It is capable of performing both classification and regression tasks.\n    \n+ _Related function:_\n\n    Function | Command line tool\n    --- | ---\n    `DEMLTNDataModule` | -\n    `DEMTrain` | -\n    `DEMTrainPipeline` | [`dem-train-pipe`](#dem-train-pipe)\n\n### 3. Phenotypic prediction\n\n+ It takes a pretrained model and omics data of new samples as inputs and returns the predicted phenotypes.\n\n+ _Related function:_\n\n    Function | Command line tool\n    --- | ---\n    `DEMPredict` | -\n    `predict_pheno` | [`dem-predict`](#dem-predict)\n\n### 4. 
Functional gene mining\n\n+ It performs functional gene mining based on the trained DEM model through _feature permutation_.\n\n+ _Related function:_\n\n    Function | Command line tool\n    --- | ---\n    `DEMFeatureRanking` | -\n    `rank_feat` | [`dem-rank`](#dem-rank)\n\n\n---\n\n<br></br>\n\n## How to use `biodem`\nWe provide both [**command line tools**](#use-command-line-tools) and [**importable functions**](#import-biodem-in-your-python-project) for `biodem`.\n> DEM is mainly implemented in Python. We also provide two Julia scripts for VCF&GFF file processing and SNP encoding.\n\n#### Input and output file formats\nPlease refer to the directory [**`./data`**](./data/) for **the file formats** that will be used in following methods.\n\n<br></br>\n\n### Import `biodem` in your Python project\nThe main functions' purposes and parameters are described in related sections of [***Use command line tools***](#use-command-line-tools).\n**We provide a simple example analysis pipeline in [`./test/test_biodem.py`](./test/test_biodem.py) to demonstrate how to use `biodem` in Python.**\n\n```python\n# An example:\n\n# Use a pretrained DEM model to predict phenotypes of given omics data files\nfrom biodem import predict_pheno\n\n# Define the paths to your model and omics data files\nmodel_path = 'dem_model.ckpt'\nomics_file_paths = ['omics_type_1.csv', 'omics_type_2.csv']\noutput_dir = 'dir_predictions'\n\n# Run the DEM model to predict phenotypes\npredict_pheno(\n    model_path = model_path,\n    omics_paths = omics_file_paths,\n    result_dir = output_dir,\n    batch_size = 16,\n    map_location = None,\n)\n```\n\n<br></br>\n\n### Use command line tools\n\n> In terminal, run the command like `dem-impute --help` or `dem-impute -h` for help.\n\n<br></br>\n\n+ #### `ncv-prep-r`\n    <br>The processing steps are the same as \"[1. 
Data preprocessing](#1-data-preprocessing)\".</br>\n\n    ```sh\n    # Usage:\n    ncv-prep-r -K 5 -k 5 -i <input.csv> -o <output_dir> -t 0.01 -n 1000 -x <trait> --raw-label <labels.csv> --n-trees 2000 --na 0.1\n    ```\n\n    Parameters | Type | Required | Description\n    --- | --- | --- | ---\n    `-K` `--loop-outer` | int | * | Number of outer loops for nested cross-validation.\n    `-k` `--loop-inner` | int | * | Number of inner loops for nested cross-validation.\n    `-i` `--input` | str | * | Path to the input omics/phenotypes data.\n    `-o` `--dir-out` | str | * | Path to the output directory.\n    `-t` `--threshold-var` | float | * | Threshold for variance selection.\n    `-n` `--n-rf-selected` | int | * | Number of selected features by random forest.\n    `-x` `--trait` | str | * | Name of the phenotype column in the input data.\n    `--raw-label` | str | * | Path to the raw labels data.\n    `--n-trees` | int | * | Number of trees in random forest.\n    `--na` | float | * | Threshold for missing value.\n\n<br></br>\n\n+ #### `ncv-prep-c`\n    <br>The processing steps are the same as \"[1. 
Data preprocessing](#1-data-preprocessing)\"</br>\n    ```sh\n    # Usage:\n    ncv-prep-c -K 5 -k 5 -i <input.csv> -o <output_dir> -t 0.01 -r 0.9 -x <trait> --raw-label <labels.csv> --na 0.1\n    ```\n\n    Parameters | Type | Required | Description\n    --- | --- | --- | ---\n    `-K` `--loop-outer` | int | * | Number of outer loops for nested cross-validation.\n    `-k` `--loop-inner` | int | * | Number of inner loops for nested cross-validation.\n    `-i` `--input` | str | * | Path to the input omics/phenotypes data.\n    `-o` `--dir-out` | str | * | Path to the output directory.\n    `-t` `--threshold-var` | float | * | Threshold for variance selection.\n    `-r` `--target-var-ratio` | float | * | Target variance ratio for PCA.\n    `-x` `--trait` | str | * | Name of the phenotype column in the input data.\n    `--raw-label` | str | * | Path to the raw labels data.\n    `--na` | float | * | Threshold for missing value.\n\n\n<br></br>\n\n+ #### `dem-impute`\n    <br>The tool contains 3 steps:</br>\n    1. Delete omics features with missing values exceeding 25%.\n    2. Impute missing values with mean value of each feature.\n    3. 
Min-max scaling.\n    ```sh\n    # Usage:\n    dem-impute -I <an_omics_file_in> -O <an_omics_file_out> -i <a_phenotypes_file_in> -o <phenotypes_out> -p <NA_threshold> -m <is_minmax_omics> -z <is_zscore_phenotypes>\n    ```\n\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-I` `--inom` | string | optional | Input a path to an omics file\n    `-O` `--outom` | string | optional | Define your output omics file path\n    `-i` `--inph` | string | optional | Input a path to a trait's phenotypes\n    `-o` `--outph` | string | optional | Define your output phenotypes path\n    `-p` `--propna` | float | optional | The allowed max proportion of missing values in a feature (DEFAULT: 0.25)\n    `-m` `--minmax` | int, 0 or 1 | optional | Whether min-max scaling for omics is required (0 denotes False, 1 denotes True, DEFAULT: 1)\n    `-z` `--zscore` | int, 0 or 1 | optional | Whether z-score transformation for phenotypes is required (0 denotes False, 1 denotes True)\n\n\n<br></br>\n\n+ #### `dem-select-varpca`\n    \n    <br>\n    \n    To analyze the significance of each feature across different phenotype classes and save important features, the following steps were taken:\n    \n    </br>\n    \n    1. Remove features with variance less than the given threshold.\n    2. The ANOVA F value for each feature was calculated.\n    3. Features with the lowest F values were sequentially removed.\n    4. This process continued until the first principal component of the remaining features explained less than the given percent of the variance. 
\n    ```sh\n    # Usage:\n    dem-select-varpca -I <an_omics_file_in> -i <a_phenotypes_file_in> -O <omics_file_output> -V <var_threshold> -P <target_variance_ratio>\n    ```\n\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-I` `--inom` | string | * | Input a path to an omics file\n    `-i` `--inph` | string | * | Input a path to a trait's phenotypes\n    `-O` `--outom` | string | * | Define your output omics file path\n    `-V` `--minvar` | float |  | The allowed minimum variance of a feature (DEFAULT: 0.0)\n    `-P` `--varpc` | float |  | Target variance of PC1 (DEFAULT: 0.5)\n\n\n<br></br>\n\n+ #### `dem-select-rf`\n    <br>The random forest (RF) algorithm, based on an ensemble of the given number of decision trees, was employed to screen out representative omics features for subsequent DEM construction.</br>\n    ```sh\n    # Usage:\n    dem-select-rf -I <an_omics_file_in> -i <a_phenotypes_file_in> -O <omics_file_output> -n <num_features_to_save> -p <validation_data_proportion> -N <num_trees> -S <random_seed_rf> -s <random_seed_sp>\n    ```\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-I` `--inom` | string | * | Input a path to an omics file\n    `-i` `--inph` | string | * | Input a path to a trait's phenotypes\n    `-O` `--outom` | string | * | Define your output omics file path\n    `-n` `--nfeat` | int | * | Number of features to save\n    `-N` `--ntree` | int |  | Number of trees in the random forest (DEFAULT: 2500) (larger number of trees will result in more accurate results, but will take longer to run)\n    `-S` `--seedrf` | list[int] |  | Random seeds for RF (DEFAULT: 1000, 1001, ..., 1009)\n\n\n<br></br>\n\n+ #### `dem-s2g-pipe`\n    <br>The pipeline for SNP2Gene modeling and transformation based on nested cross-validation.</br>\n    ```sh\n    # Usaage example:\n    dem-s2g-pipe -t <trait_name> -l <n_label_class> --inner-lbl <dir_label_inner> --outer-lbl <dir_label_outer> --h5 
<path_h5_processed> --json <path_json_genes_snps> --log-dir <log_dir> --o-s2g-dir <dir_to_save_converted>\n    ```\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-t` `--trait` | string | * | Trait name\n    `-l` `--lbl-class` | int | * | Number of classes for trait\n    `--inner-lbl` | string | * | Directory to inner label files\n    `--outer-lbl` | string | * | Directory to outer label files\n    `--h5` | string | * | Path to h5 file containing genotype data\n    `--json` | string | * | Path to json file containing SNP-gene mapping information\n    `--log-dir` | string | * | Directory to save log files and models\n    `--o-s2g-dir` | string | * | Directory to save SNP2Gene results\n\n\n<br></br>\n\n+ #### `dem-train-pipe`\n    <br>The pipeline for DEM modeling and prediction on nested cross-validation datasets with hyperparameter optimization. DEM aims to achieve high phenotypic prediction accuracy and identify the most informative omics features for the given trait.\n    This tool is used to construct a dual-extraction model based on the given omics data and phenotype, through cross validation or random sampling.</br>\n    ```sh\n    # Usage example:\n    dem-train-pipe -o <log_dir> -t <trait_name> -l <n_label_class> --inner-lbl <dir_label_inner> --outer-lbl <dir_label_outer> --inner-om <dir_omics_inner> --outer-om <dir_omics_outer>\n    ```\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-o` `--log-dir` | string | * | Directory to save log files and models\n    `-t` `--trait` | string | * | Trait name\n    `-l` `--lbl-class` | integer | * | Number of classes for trait\n    `--inner-lbl` | string | * | Directory to inner label files\n    `--outer-lbl` | string | * | Directory to outer label files\n    `--inner-om` | string | * | Directory to inner omics files\n    `--outer-om` | string | * | Directory to outer omics files\n\n\n<br></br>\n\n+ #### `dem-predict`\n    <br>Predicts the phenotype of 
given omics data files using a trained DEM model.</br>\n    ```sh\n    # Usage example:\n    dem-predict -m <model_file_path> -o <phenotype_output> -I <omics_files>\n    ```\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-I` `--inom` | list[string] | * | Input path(s) to omics file(s)\n    `-m` `--inmd` | string | * | The path to a pretrained DEM model\n    `-o` `--outdir` | string | * | The directory to save predicted phenotypes\n\n\n<br></br>\n\n+ #### `dem-rank`\n    <br>Assess the importance of features by contrasting the DEM model\u2019s test performance on the actual feature values against those where the feature values have been randomly shuffled. Features that are ranked highly are then identified. </br>\n    ```sh\n    # Usage example:\n    dem-rank -I <omics_file_path> -i <pheno_file_path> -t <trait_name> -l <n_label_class> -m <model_path> -o <output_dir> -b <batch_size> -s <seed1> <seed2>...\n    ```\n    Parameters | Type | Required | Descriptions\n    --- | --- | --- | ---\n    `-I` `--inom` | string | * | Input path(s) to omics file(s)\n    `-i` `--inph` | string | * | Input a path to a trait's phenotypes\n    `-t` `--trait` | string | * | The name of the trait to be predicted\n    `-l` `--lbl-class` | int | * | Number of classes for the trait (1 for regression)\n    `-m` `--inmd` | string | * | The path to a pretrained DEM model\n    `-o` `--outdir` | string | * | The directory to save ranked features\n    `-b` `--batch-size` | int |  | Batch size for feature ranking (default: 16)\n    `-s` `--seeds` | integer |  | Random seeds for ranking repeats (default: 0-9)\n\n\n<br></br>\n\n## Transform SNPs to Genomic Embedding\n\n1. **Build a dict of SNPs and their corresponding genomic regions**\n    + It accepts a VCF file and a GFF file (reference genome) as input, and returns a dictionary of SNPs and their corresponding genomic regions.\n2. 
**One-hot encode SNPs**\n    + It accepts a dictionary of SNPs and their corresponding genomic regions, and returns one-hot encoded SNPs based on the actual di-nucleotide composition of each SNP.\n3. **Transform encoded SNPs to genomic embedding**\n    + The one-hot encoded SNPs are transformed into dense and continuous features that represent genomic variation (each feature corresponds to a gene).\n\n## Requirements\n1. Please [install `biodem`](#installation) first.\n2. Install [Julia](https://julialang.org) 1.10.\n3. Install the following Julia packages:\n    ```sh\n    # Open the Julia REPL\n    julia\n    ```\n    ```julia\n    # In the Julia REPL\n    using Pkg\n    Pkg.add(\"JLD2\")\n    Pkg.add(\"HDF5\")\n    Pkg.add(\"JSON\")\n    Pkg.add(\"GFF3\")\n    Pkg.add(\"GeneticVariation\")\n    Pkg.add(\"CSV\")\n    Pkg.add(\"DataFrames\")\n    ```\n\n## Input and output file formats\nPlease refer to **the example files** in [`./data/`](./data/) for **the file formats** used by the following functions.\n\n<br></br>\n\n### Example usage\n\n- Build a dictionary of SNPs and their corresponding genomic regions from a VCF file and a GFF file. 
The one-hot encoded SNPs are saved to an H5 file.\n```sh\n# In the shell, enter the Julia REPL\njulia\n```\n```julia\n# In the Julia REPL\n\n# Import functions from the git repository\ninclude(\"./src/biodem/utils_vcf_gff.jl\")\ninclude(\"./src/biodem/utils_encode_snp_vcf.jl\")\n\n# Build a dict of SNPs and their corresponding genomic regions\npath_gff = \"example.gff\"\npath_vcf = \"example.vcf\"\nbuild_snp_dict(path_vcf, path_gff, true, true)\n\n# One-hot encode SNPs\npath_snp_dict = \"snp_dict.jld2\"  # The output of the build_snp_dict function\nencode_vcf2matrix(path_snp_dict, true)\n```\n\n- Select the available SNPs and their corresponding genomic regions from the H5 file produced by the `encode_vcf2matrix` function.\n```python\n# In Python\n\n# Import function for processing available SNPs\nfrom biodem import process_avail_snp\n\npath_json_snp_gene_relation = \"gene4snp.json\"\npath_h5_snp_matrix = \"snp_and_gene4snp.h5\"\n\nprocess_avail_snp(path_h5_snp_matrix, path_json_snp_gene_relation)\n# The output will be a new H5 file containing only available SNPs and their corresponding genomic regions.\n```\n\n- Train SNP2Gene models\nPlease refer to the [`dem-s2g-pipe`](#dem-s2g-pipe) command and the [`snp_to_gene`](./src/biodem/module_snp.py) function for more details.\n\n---\n\n## Directory structure\n```sh\n.\n\u251c\u2500\u2500 src                        # Python and Julia implementation for DEM\n\u251c\u2500\u2500 data                       # Data files and example files\n\u251c\u2500\u2500 test                       # Easy tests for packages and pipelines\n\u251c\u2500\u2500 pyproject.toml             # Python package metadata\n\u251c\u2500\u2500 LICENSE                    # License file\n\u2514\u2500\u2500 README.md                  # This file\n\n./src/biodem\n\u251c\u2500\u2500 __init__.py                # Initialize the Python package\n\u251c\u2500\u2500 cli_dem.py                 # Command line interface for DEM\n\u251c\u2500\u2500 utils.py                   # Utilities for modeling and 
prediction\n\u251c\u2500\u2500 module_data_prep.py        # Data preprocessing\n\u251c\u2500\u2500 module_data_prep_ncv.py    # Data preprocessing for nested cross-validation\n\u251c\u2500\u2500 module_snp.py              # SNP2Gene modeling pipeline for SNP preprocessing\n\u251c\u2500\u2500 module_dem.py              # DEM modeling pipeline, Phenotypic prediction, and Functional gene mining\n\u251c\u2500\u2500 model_dem.py               # DEM model definition\n\u251c\u2500\u2500 model_snp2gene.py          # SNP2Gene model definition\n\u251c\u2500\u2500 utils_vcf_gff.jl           # Utilities for VCF and GFF file processing\n\u2514\u2500\u2500 utils_encode_snp_vcf.jl    # Utilities for one-hot encoding SNPs\n```\n\n<br></br>\n\n# Asking for help\nIf you have any questions please:\n+ Contact us via [GitHub](https://github.com/cma2015/dem/issues).\n+ [Email](mailto:ryl1999@126.com) us.\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Ma Lab  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
    "summary": "Dual-extraction method for phenotypic prediction and functional gene mining of complex traits.",
    "version": "0.6.0",
    "project_urls": {
        "Homepage": "https://github.com/cma2015/DEM"
    },
    "split_keywords": [
        "bioinformatics",
        " deep-learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f72a6b65336de9c357a12e23b99f6b73ff697ec5b82292f954d011905676e1cd",
                "md5": "c3b5aa969d19043d2b4a389d29938fda",
                "sha256": "a9301034504f208f5fb309cf6779c1bdf3e1e45dcde16792214f080e5e508a5a"
            },
            "downloads": -1,
            "filename": "biodem-0.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c3b5aa969d19043d2b4a389d29938fda",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12,>=3.10",
            "size": 41283,
            "upload_time": "2024-05-06T03:38:38",
            "upload_time_iso_8601": "2024-05-06T03:38:38.565358Z",
            "url": "https://files.pythonhosted.org/packages/f7/2a/6b65336de9c357a12e23b99f6b73ff697ec5b82292f954d011905676e1cd/biodem-0.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bbd9049507a2c3c02ba5ab29b264248ebd51a76dd1eae56f065da9181bb0d360",
                "md5": "3934c1c2042367d43d9d6b0c7e64f893",
                "sha256": "a1993651d87f08e8f05ff33b3bad66d3e5e02f956f7e83941423f20fc821031d"
            },
            "downloads": -1,
            "filename": "biodem-0.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3934c1c2042367d43d9d6b0c7e64f893",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12,>=3.10",
            "size": 90911776,
            "upload_time": "2024-05-06T03:38:49",
            "upload_time_iso_8601": "2024-05-06T03:38:49.251290Z",
            "url": "https://files.pythonhosted.org/packages/bb/d9/049507a2c3c02ba5ab29b264248ebd51a76dd1eae56f065da9181bb0d360/biodem-0.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-06 03:38:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cma2015",
    "github_project": "DEM",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "biodem"
}
        