mirdeepsquared


Namemirdeepsquared JSON
Version 0.6.0 PyPI version JSON
download
home_pagehttps://github.com/jontejj/mirdeepsquared
SummaryFilters out false positives from the novel section of the output from miRDeep2
upload_time2023-12-20 18:39:11
maintainer
docs_urlNone
authorJonatan Jönsson
requires_python>=3.9, <4
license
keywords mirdeep2 mirna deeplearning
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Mirdeepsquared
Mirdeepsquared uses a deep learning model that predicts if novel miRNA sequences in the output of [miRDeep2](https://github.com/rajewsky-lab/mirdeep2) are false positives or not. This greatly reduces the amount of manual work that is currently needed to filter out the false positives.

## Usage (with pip install)
```
virtualenv mirdeepsquared-env -p python3.9
source mirdeepsquared-env/bin/activate
pip install mirdeepsquared
mirdeepsquared path/to/your_result.csv path/to/your_output.mrd
```

The output are your true positives, i.e highly likely to actually be novel miRNA:s.

Another way to use mirdeepsquared is to create a filtered result.html file:
```
mirdeepsquared path/to/your_result.csv path/to/your_output.mrd > tp.txt
mod-html path/to/your_result.html path/to/filtered_results.html -- tp.txt
```

## Installing (from source)
Use python 3.9 as tensorflow requires it

```
virtualenv mirdeepsquared-env -p python3.9
source mirdeepsquared-env/bin/activate
pip install -r requirements.txt
python3 -m pip install -e .
./prepare_default_dataset.sh
python mirdeepsquared/train.py resources/dataset/split/train -o models/
python mirdeepsquared/predict_cmd.py path/to/your_result.csv path/to/your/output.mrd -m models/
```

### Installing on Uppmax
```
git clone https://github.com/jontejj/mirdeepsquared.git
cd mirdeepsquared
module load python3/3.9.5
virtualenv mirdeepsquared-env -p python3.9
source mirdeepsquared-env/bin/activate
pip install -r requirements.txt
./prepare_default_dataset.sh
python mirdeepsquared/train.py resources/dataset/split/train -o models/
```

Then you can use ```python mirdeepsquared/predict_cmd.py your_result.csv your_output.mrd -m models/``` to get a list of the true positives

# How Mirdeepsquared was developed

https://www.ncbi.nlm.nih.gov/sra was used to select SRR datafiles. The accession list was then used to download the datafiles with fasterq-dump:

```
fasterq-dump -e 16 -t temp SRR_ID
```
```cutadapt``` was then used to trim adapters. The resulting files where then listed in a ```config.txt``` file like this:
```
SRR2496781.fastq hh1
SRR2496782.fastq hh2
SRR2496783.fastq hh3
SRR2496784.fastq hh4
```

A bowtie index (GRCh38 (from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.39/)) for the human genome was built with https://bowtie-bio.sourceforge.net/manual.shtml:
```
bowtie-build GRCh38.p13.genome.fa GRCh38 --threads 8
```

Then mapper.pl (from miRDeep2) was used to create ```healthy_reads_collapsed.fa``` and ```healthy_reads_collapsed_vs_genome.arf``` from the reads:
```
mapper.pl config.txt -d -e -m -p GRCh38 -s healthy_reads_collapsed.fa -t healthy_reads_collapsed_vs_genome.arf -v -h
```

```precursor-sequences-h_sapiens.fas```, ```rhesus-monkey-mature-seq.fas``` and ```known-mature-sequences-h_sapiens.fas``` was downloaded from https://mirgenedb.org/. Then miRDeep2.pl was used to create the output.mrd and the result*.csv file:
```
miRDeep2.pl healthy_reads_collapsed.fa GRCh38.p13.genome.fa healthy_reads_collapsed_vs_genome.arf known-mature-sequences-h_sapiens.fas rhesus-monkey-mature-seq.fas precursor-sequences-h_sapiens.fas -t Human -b -5 2>report2.log
```

This process was done for one healthy tissue (SRR2496781-SRR2496784), where all the novel classifications were marked as false positives and then it was also done for data from the TCGA dataset (https://www.cancer.gov/ccg/research/genome-sequencing/tcga) (specifially TCGA-LUSC), where the mature classifications were assumed to be true positives. For the false positives, -b -5 was added to the miRDeep2.pl command to include even more false positives than miRDeep2.pl normally gives.

The resulting output.mrd and result.csv was then passed to ```extract_features.py``` in order to create the dataset (pickle files with Pandas dataframe in them):
True positives:
```
python mirdeepsquared/extract_features.py TCGA-LUSC/result_19_01_2023_t_23_35_49.csv TCGA-LUSC/output.mrd resources/dataset/true_positives/true_positives_TCGA_LUSC_all.pkl -tp --section known
python mirdeepsquared/mirgene_db_filter.py resources/dataset/true_positives/true_positives_TCGA_LUSC_all.pkl resources/known-mature-sequences-h_sapiens.fas resources/dataset/true_positives_TCGA_LUSC_only_in_mirgene_db.pkl"
```
False positives:
```
python extract_features.py false_positives/result_08_11_2023_t_19_35_00.csv false_positives/08_11_2023_t_19_35_00_output.mrd resources/dataset/false_positives_SRR2496781-84_bigger.pkl -fp --section novel
```

As the resulting dataset files, ```true_positives_TCGA_LUSC.pkl``` and ```false_positives_SRR2496781-84_bigger.pkl```, were small, they were checked into git in resources/dataset/ in order to make future training easier for people who want to improve the model. On top of this, the output.mrd and the result.csv for the false positives were also checked in to git.

The data was then split into a holdout and a train dataset with ```split_dataset.py```:
```
python split_dataset.py resources/dataset/ resources/dataset/split
```

This created ```resources/dataset/split/holdout/holdout.pkl``` and ```resources/dataset/split/train/train.pkl```.

The different train*.py files contain different models with varying performance. The best one so far is ```train-simple-density-map.py```, initially it got 100% accuracy on the test set. It later turned out that true positives not in mirgene db were not filtered away correctly ([#1][i1]), this lead to a less imbalanced dataset. The imbalance was later corrected by incorporating samples from TCGA_BRCA instead. With this correction, the accuracy was 98% instead.

To avoid overfitting to the training data the complexity of the model ```train-simple-density-map.py``` was reduced by lowering the units on the Dense layer (10000 -> 1000). The false negative and false positive samples in the confusion matrix (listed by ```predictor.py```) was then manually inspected. It turned out that some samples actually had the wrong label as they either were not really false positives or not really true positives. These labels were corrected with:
```
python mirdeepsquared/correct_invalid_labels.py
```

To test how well the model generalizes, a dataset was also created for Zebrafish ([Danio rerio][genome1]) and House mouse ([Mus musculus][genome2]). For these datafiles, the accuracy was first 88%. Then the mouse datafiles were included in the training data and the Zebrafish in the holdout dataset. The accuracy then improved to 90%.

By generating ```false_positives_with_empty_read_density_maps.pkl``` with ```generate_data.py``` the accuracy increased to 94.4% (Big cross-validated model (```trainer.py```), with best-hyperparameters.yaml, trained on: true_positives_TCGA_LUSC_only_in_mirgene_db.pkl + true_positives_TCGA_BRCA.pkl + mouse.mature.pkl + false_positives_with_empty_read_density_maps.pkl + false_positives_SRR2496781-84_bigger.pkl and evaluated on holdout + zebrafish.mature.2nd.run.pkl.

By combining ```density_map_model.py```, ```motifs_bayes_model.py```, ```numerical_model.py```, ```structure_model.py``` and the big model trained (in ```train.py```) with ```mirdeepsquared/best-hyperparameters.yaml``` into an ensemble model, the accuracy on the holdout data increased to 96%.

[release.sh](release.sh) was then used to release a mirdeepsquared package to https://pypi.org/.

The most problematic samples were then collected into pdf:s by ```predictor.py``` for analysis by experts used to filter out false positives from miRdeep2.

After analyzing the most problematic samples, some of the true positives were deemed not to actually be in mirgene db. It turned out that some of the known sequences matched a mature sequence in mirgene db but when the whole precursor was compared, there were differences. One possible explanation is that these could be pseudo genes that evolved from the functioning miRNA.

After making ```mirgene_db_filter.py``` more stringent, i.e instead of ```true_positives_TCGA_LUSC_only_in_mirgene_db.pkl```, ```true_positives_TCGA_LUSC_only_precursors_in_mirgene_db``` were generated and trained on instead. The accuracy, together with a new best-hyperparameters.yaml (based on the result of a half-finished grid search on uppmax), then increased to 99.4%.

*TODO*: analyze the most problematic samples in the subset created with ```resources/dataset/hard/only_matching_mature_in_mirgene_db/generate-tricky.sh```, and incorporate the tricky cases into the training data once it's been correctly labeled.

[i1]: https://github.com/jontejj/mirdeepsquared/issues/1

[genome1] : https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000002035.6/

[genome2]: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/

[acc1] : https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR8305633%2CSRR6411465%2CSRR8305629%2CSRR8305619%2CSRR10358540%2CSRR6411467%2CSRR6411468%2CSRR6411466%2CSRR15498151%2CSRR15498149%2CSRR11974581%2CSRR11974578

[acc2] : https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR25511174%2CSRR25338387%2CSRR25205118%2CSRR22739462%2CSRR24949848%2CSRR17652208%2CSRR10240201%2CSRR8494799%2CSRR6793409%2CSRR6787032%2CSRR1551219%2CSRR1551218%2CSRR1551220%2CSRR25205161%2CSRR25205080%2CSRR25338389%2CSRR25338388%2CSRR25511175%2CSRR25511182%2CSRR24949847%2CSRR24949826%2CSRR22739477%2CSRR22739469%2CSRR17652211%2CSRR17652201%2CSRR10240206%2CSRR10240195%2CSRR8494806%2CSRR8494810%2CSRR6793389%2CSRR6793401%2CSRR6787013%2CSRR6787012%2CSRR6787016&o=acc_s%3Aa

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/jontejj/mirdeepsquared",
    "name": "mirdeepsquared",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9, <4",
    "maintainer_email": "",
    "keywords": "miRDeep2,miRNA,deeplearning",
    "author": "Jonatan J\u00f6nsson",
    "author_email": "jontejj@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/a6/b3/7b9957a6d2da801fd9537880d933eadaefda060b6e2dc3c80c8bb9d67edc/mirdeepsquared-0.6.0.tar.gz",
    "platform": null,
    "description": "# Mirdeepsquared\nMirdeepsquared uses a deep learning model that predicts if novel miRNA sequences in the output of [miRDeep2](https://github.com/rajewsky-lab/mirdeep2) are false positives or not. This greatly reduces the amount of manual work that is currently needed to filter out the false positives.\n\n## Usage (with pip install)\n```\nvirtualenv mirdeepsquared-env -p python3.9\nsource mirdeepsquared-env/bin/activate\npip install mirdeepsquared\nmirdeepsquared path/to/your_result.csv path/to/your_output.mrd\n```\n\nThe output are your true positives, i.e highly likely to actually be novel miRNA:s.\n\nAnother way to use mirdeepsquared is to create a filtered result.html file:\n```\nmirdeepsquared path/to/your_result.csv path/to/your_output.mrd > tp.txt\nmod-html path/to/your_result.html path/to/filtered_results.html -- tp.txt\n```\n\n## Installing (from source)\nUse python 3.9 as tensorflow requires it\n\n```\nvirtualenv mirdeepsquared-env -p python3.9\nsource mirdeepsquared-env/bin/activate\npip install -r requirements.txt\npython3 -m pip install -e .\n./prepare_default_dataset.sh\npython mirdeepsquared/train.py resources/dataset/split/train -o models/\npython mirdeepsquared/predict_cmd.py path/to/your_result.csv path/to/your/output.mrd -m models/\n```\n\n### Installing on Uppmax\n```\ngit clone https://github.com/jontejj/mirdeepsquared.git\ncd mirdeepsquared\nmodule load python3/3.9.5\nvirtualenv mirdeepsquared-env -p python3.9\nsource mirdeepsquared-env/bin/activate\npip install -r requirements.txt\n./prepare_default_dataset.sh\npython mirdeepsquared/train.py resources/dataset/split/train -o models/\n```\n\nThen you can use ```python mirdeepsquared/predict_cmd.py your_result.csv your_output.mrd -m models/``` to get a list of the true positives\n\n# How Mirdeepsquared was developed\n\nhttps://www.ncbi.nlm.nih.gov/sra was used to select SRR datafiles. The accession list was then used to download the datafiles with fasterq-dump:\n\n```\nfasterq-dump -e 16 -t temp SRR_ID\n```\n```cutadapt``` was then used to trim adapters. The resulting files where then listed in a ```config.txt``` file like this:\n```\nSRR2496781.fastq hh1\nSRR2496782.fastq hh2\nSRR2496783.fastq hh3\nSRR2496784.fastq hh4\n```\n\nA bowtie index (GRCh38 (from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.39/)) for the human genome was built with https://bowtie-bio.sourceforge.net/manual.shtml:\n```\nbowtie-build GRCh38.p13.genome.fa GRCh38 --threads 8\n```\n\nThen mapper.pl (from miRDeep2) was used to create ```healthy_reads_collapsed.fa``` and ```healthy_reads_collapsed_vs_genome.arf``` from the reads:\n```\nmapper.pl config.txt -d -e -m -p GRCh38 -s healthy_reads_collapsed.fa -t healthy_reads_collapsed_vs_genome.arf -v -h\n```\n\n```precursor-sequences-h_sapiens.fas```, ```rhesus-monkey-mature-seq.fas``` and ```known-mature-sequences-h_sapiens.fas``` was downloaded from https://mirgenedb.org/. Then miRDeep2.pl was used to create the output.mrd and the result*.csv file:\n```\nmiRDeep2.pl healthy_reads_collapsed.fa GRCh38.p13.genome.fa healthy_reads_collapsed_vs_genome.arf known-mature-sequences-h_sapiens.fas rhesus-monkey-mature-seq.fas precursor-sequences-h_sapiens.fas -t Human -b -5 2>report2.log\n```\n\nThis process was done for one healthy tissue (SRR2496781-SRR2496784), where all the novel classifications were marked as false positives and then it was also done for data from the TCGA dataset (https://www.cancer.gov/ccg/research/genome-sequencing/tcga) (specifially TCGA-LUSC), where the mature classifications were assumed to be true positives. For the false positives, -b -5 was added to the miRDeep2.pl command to include even more false positives than miRDeep2.pl normally gives.\n\nThe resulting output.mrd and result.csv was then passed to ```extract_features.py``` in order to create the dataset (pickle files with Pandas dataframe in them):\nTrue positives:\n```\npython mirdeepsquared/extract_features.py TCGA-LUSC/result_19_01_2023_t_23_35_49.csv TCGA-LUSC/output.mrd resources/dataset/true_positives/true_positives_TCGA_LUSC_all.pkl -tp --section known\npython mirdeepsquared/mirgene_db_filter.py resources/dataset/true_positives/true_positives_TCGA_LUSC_all.pkl resources/known-mature-sequences-h_sapiens.fas resources/dataset/true_positives_TCGA_LUSC_only_in_mirgene_db.pkl\"\n```\nFalse positives:\n```\npython extract_features.py false_positives/result_08_11_2023_t_19_35_00.csv false_positives/08_11_2023_t_19_35_00_output.mrd resources/dataset/false_positives_SRR2496781-84_bigger.pkl -fp --section novel\n```\n\nAs the resulting dataset files, ```true_positives_TCGA_LUSC.pkl``` and ```false_positives_SRR2496781-84_bigger.pkl```, were small, they were checked into git in resources/dataset/ in order to make future training easier for people who want to improve the model. On top of this, the output.mrd and the result.csv for the false positives were also checked in to git.\n\nThe data was then split into a holdout and a train dataset with ```split_dataset.py```:\n```\npython split_dataset.py resources/dataset/ resources/dataset/split\n```\n\nThis created ```resources/dataset/split/holdout/holdout.pkl``` and ```resources/dataset/split/train/train.pkl```.\n\nThe different train*.py files contain different models with varying performance. The best one so far is ```train-simple-density-map.py```, initially it got 100% accuracy on the test set. It later turned out that true positives not in mirgene db were not filtered away correctly ([#1][i1]), this lead to a less imbalanced dataset. The imbalance was later corrected by incorporating samples from TCGA_BRCA instead. With this correction, the accuracy was 98% instead.\n\nTo avoid overfitting to the training data the complexity of the model ```train-simple-density-map.py``` was reduced by lowering the units on the Dense layer (10000 -> 1000). The false negative and false positive samples in the confusion matrix (listed by ```predictor.py```) was then manually inspected. It turned out that some samples actually had the wrong label as they either were not really false positives or not really true positives. These labels were corrected with:\n```\npython mirdeepsquared/correct_invalid_labels.py\n```\n\nTo test how well the model generalizes, a dataset was also created for Zebrafish ([Danio rerio][genome1]) and House mouse ([Mus musculus][genome2]). For these datafiles, the accuracy was first 88%. Then the mouse datafiles were included in the training data and the Zebrafish in the holdout dataset. The accuracy then improved to 90%.\n\nBy generating ```false_positives_with_empty_read_density_maps.pkl``` with ```generate_data.py``` the accuracy increased to 94.4% (Big cross-validated model (```trainer.py```), with best-hyperparameters.yaml, trained on: true_positives_TCGA_LUSC_only_in_mirgene_db.pkl + true_positives_TCGA_BRCA.pkl + mouse.mature.pkl + false_positives_with_empty_read_density_maps.pkl + false_positives_SRR2496781-84_bigger.pkl and evaluated on holdout + zebrafish.mature.2nd.run.pkl.\n\nBy combining ```density_map_model.py```, ```motifs_bayes_model.py```, ```numerical_model.py```, ```structure_model.py``` and the big model trained (in ```train.py```) with ```mirdeepsquared/best-hyperparameters.yaml``` into an ensemble model, the accuracy on the holdout data increased to 96%.\n\n[release.sh](release.sh) was then used to release a mirdeepsquared package to https://pypi.org/.\n\nThe most problematic samples were then collected into pdf:s by ```predictor.py``` for analysis by experts used to filter out false positives from miRdeep2.\n\nAfter analyzing the most problematic samples, some of the true positives were deemed not to actually be in mirgene db. It turned out that some of the known sequences matched a mature sequence in mirgene db but when the whole precursor was compared, there were differences. One possible explanation is that these could be pseudo genes that evolved from the functioning miRNA.\n\nAfter making ```mirgene_db_filter.py``` more stringent, i.e instead of ```true_positives_TCGA_LUSC_only_in_mirgene_db.pkl```, ```true_positives_TCGA_LUSC_only_precursors_in_mirgene_db``` were generated and trained on instead. The accuracy, together with a new best-hyperparameters.yaml (based on the result of a half-finished grid search on uppmax), then increased to 99.4%.\n\n*TODO*: analyze the most problematic samples in the subset created with ```resources/dataset/hard/only_matching_mature_in_mirgene_db/generate-tricky.sh```, and incorporate the tricky cases into the training data once it's been correctly labeled.\n\n[i1]: https://github.com/jontejj/mirdeepsquared/issues/1\n\n[genome1] : https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000002035.6/\n\n[genome2]: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/\n\n[acc1] : https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR8305633%2CSRR6411465%2CSRR8305629%2CSRR8305619%2CSRR10358540%2CSRR6411467%2CSRR6411468%2CSRR6411466%2CSRR15498151%2CSRR15498149%2CSRR11974581%2CSRR11974578\n\n[acc2] : https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR25511174%2CSRR25338387%2CSRR25205118%2CSRR22739462%2CSRR24949848%2CSRR17652208%2CSRR10240201%2CSRR8494799%2CSRR6793409%2CSRR6787032%2CSRR1551219%2CSRR1551218%2CSRR1551220%2CSRR25205161%2CSRR25205080%2CSRR25338389%2CSRR25338388%2CSRR25511175%2CSRR25511182%2CSRR24949847%2CSRR24949826%2CSRR22739477%2CSRR22739469%2CSRR17652211%2CSRR17652201%2CSRR10240206%2CSRR10240195%2CSRR8494806%2CSRR8494810%2CSRR6793389%2CSRR6793401%2CSRR6787013%2CSRR6787012%2CSRR6787016&o=acc_s%3Aa\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Filters out false positives from the novel section of the output from miRDeep2",
    "version": "0.6.0",
    "project_urls": {
        "Bug Reports": "https://github.com/jontejj/mirdeepsquared/issues",
        "Homepage": "https://github.com/jontejj/mirdeepsquared",
        "Say Thanks!": "https://saythanks.io/to/jontejj",
        "Source": "https://github.com/jontejj/mirdeepsquared/"
    },
    "split_keywords": [
        "mirdeep2",
        "mirna",
        "deeplearning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ce7ccd57aee954115b335ba8995969ca1420c243d8487dd858a1445a4c759de2",
                "md5": "6c53f0b824bb2014ebbf2e132215d9ff",
                "sha256": "157156063edb09ca70afb4ae504b369ec3dcb49f4fc1c295e0bd9bbd4e838a2c"
            },
            "downloads": -1,
            "filename": "mirdeepsquared-0.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6c53f0b824bb2014ebbf2e132215d9ff",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9, <4",
            "size": 47540556,
            "upload_time": "2023-12-20T18:39:03",
            "upload_time_iso_8601": "2023-12-20T18:39:03.184244Z",
            "url": "https://files.pythonhosted.org/packages/ce/7c/cd57aee954115b335ba8995969ca1420c243d8487dd858a1445a4c759de2/mirdeepsquared-0.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a6b37b9957a6d2da801fd9537880d933eadaefda060b6e2dc3c80c8bb9d67edc",
                "md5": "5afa2ce1c54d86006f67ae44058f08cc",
                "sha256": "47385ea660603241ef356bdccbf1f1c9cef60bc722c601db583e2e47ff5958b3"
            },
            "downloads": -1,
            "filename": "mirdeepsquared-0.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5afa2ce1c54d86006f67ae44058f08cc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9, <4",
            "size": 47513014,
            "upload_time": "2023-12-20T18:39:11",
            "upload_time_iso_8601": "2023-12-20T18:39:11.186503Z",
            "url": "https://files.pythonhosted.org/packages/a6/b3/7b9957a6d2da801fd9537880d933eadaefda060b6e2dc3c80c8bb9d67edc/mirdeepsquared-0.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-20 18:39:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jontejj",
    "github_project": "mirdeepsquared",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "mirdeepsquared"
}
        
Elapsed time: 2.69236s