virusrecom

| Field | Value |
| --- | --- |
| Name | virusrecom |
| Version | 1.1.5 |
| Home page | https://github.com/ZhijianZhou01/virusrecom |
| Summary | An information-theory-based method for recombination detection of viral lineages. |
| Upload time | 2024-04-18 13:24:41 |
| Author | Zhi-Jian Zhou |
| Keywords | recombination, virus, evolution, information entropy |

# VirusRecom: Detecting recombination of viral lineages using information theory

## 1. Download and install

VirusRecom is written in ```Python 3``` and can be installed in several ways.

### 1.1. pip method

virusrecom is published on PyPI (the Python Package Index) and can be installed with ```pip```.

```
pip install virusrecom
virusrecom -h
```

### 1.2. Or local installation

In addition to the ```pip``` method, you can install virusrecom manually using the file ```setup.py```.

First, download this repository, then run:
```
python setup.py install
virusrecom -h
```

### 1.3. Or run the source code directly

virusrecom can also be run from the source code without installation. First, download this repository, then install the Python packages that virusrecom requires (the ```-i``` option selects the Tsinghua PyPI mirror and can be omitted to use the default index):

```
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

Finally, run virusrecom through the file ```main.py```. To view the help documentation, run ```python main.py -h```.

### 1.4. Or use the binary files

For the two earlier release packages (v1.0 and v1.1), you can also run the binary files of virusrecom directly, without installation. The binary files are provided at https://github.com/ZhijianZhou01/virusrecom/releases.

In general, the executable file of virusrecom is located in the ```main``` folder. Run ```virusrecom.exe``` (Windows) or ```virusrecom``` (Linux or macOS) to start. If you lack permission to run virusrecom on Linux or macOS, change the permissions with ```chmod -R 775 Directory``` or ```chmod -R 777 Directory```.


## 2. Getting help
virusrecom is a command-line program. To display its help documentation, enter ```virusrecom -h``` or ```virusrecom --help```.

<b>For detailed documentation, please refer to</b> [Manual of VirusRecom v1.1](https://github.com/ZhijianZhou01/virusrecom/blob/main/Manual%20of%20VirusRecom%20v1.1_2023.12.23.pdf)

<b>Tip: since version 1.1, the input-file parameters of virusrecom differ slightly from those of virusrecom v1.0.</b>

<b>A summary of the help documentation of virusrecom v1.1.3 follows.</b>

| Parameter | Description |
| --- | --- |
|-h, --help | Show this help message and exit.|
|-a ALIGNMENT | Aligned sequence file (*.fasta). Note: each sequence name must contain a lineage mark.|
|-ua UNALIGNMENT | Unaligned sequence file (*.fasta). Note: each sequence name must contain a lineage mark.|
|-at ALIGN_TOOL | Program used for multiple sequence alignment (MSA).|
|-iwic INPUT_WIC | Use previously obtained WIC values of the reference lineages directly, supplied as a *.csv input file.|
|-q QUERY | Name of the query lineage (usually the potential recombinant), such as ‘-q xxxx’. ‘-q auto’ scans every lineage in turn as a potential recombinant.|
|-l LINEAGES | Path of a text file containing the lineage marks.|
|-g GAP | Whether to keep sites containing gaps in subsequent analyses: ‘-g y’ keeps them, ‘-g n’ deletes them.|
|-m METHOD | Scanning method: ‘-m p’ uses polymorphic sites only, ‘-m a’ uses all sites.|
|-w WINDOW | Number of nucleotide sites per sliding window. Note: if ‘-m p’ is used, -w is the number of polymorphic sites per window.|
|-s STEP | Step size of the sliding window. Note: if ‘-m p’ is used, -s is the number of polymorphic sites per step.|
|-mr MAX_REGION | Maximum allowed size of a recombination region. Note: if ‘-m p’ is used, it is the maximum number of polymorphic sites contained in a recombination region.|
|-cp PERCENTAGE | Cutoff proportion (cp, default: 0.9) used when searching for recombination regions, applied as mWIC/EIC >= cp. The maximum value of cp is 1.|
|-cu CUMULATIVE | Identify the major parent simply by the maximum cumulative WIC over all sites. Off by default. To enable, specify ‘-cu y’.|
|-b BREAKPOINT | Scan for possible recombination breakpoints. ‘-b y’ means yes, ‘-b n’ means no. Note: this option only takes effect when ‘-m p’ is specified.|
|-bw BREAKWIN | Window size (default: 200) used for the breakpoint scan. The step size is fixed at 1. Note: this option only takes effect when ‘-m p -b y’ is specified.|
|-t THREAD | Number of threads (default: 1) used for MSA.|
|-y Y_START | Starting value (default: 0) of the Y-axis in the plot.|
|-le LEGEND | Location of the legend; adaptive by default. ‘-le r’ places it on the right.|
|-owic ONLY_WIC | Only calculate the per-site WIC values. Off by default. To enable, specify ‘-owic y’.|
|-e ENGRAVE | Engrave the file name into sequence names in batches, given a directory containing one or more sequence files (*.fasta).|
|-en EXPORT_NAME | Export all sequence names of a *.fasta file.|
|-o | Output directory to store all results.|
|--no_wic_fig | Do not draw the image of WICs.|
|--no_mwic_fig | Do not draw the image of mWICs.|


For more information about the algorithm of virusrecom, please refer to [the publication of virusrecom](https://academic.oup.com/bib/article-abstract/24/1/bbac513/6886420).

## 3. Example of usage
The test sequence data used in this documentation is stored at https://github.com/ZhijianZhou01/virusrecom/tree/main/example.

<b>Note: the ```recombination_test_data.zip``` in directory ```example``` is for virusrecom v1.0, not virusrecom v1.1</b>.

In this demonstration, the test data comes from the ```recombination_test_data_v1.1.zip``` provided in the directory ```example```.

### 3.1. Aligned input-sequences
If the input sequence data is already aligned, load it with the ```-a``` parameter. Multiple sequence alignment (MSA) can be performed beforehand with many programs and is not covered here. Now, let's look at the directory ```aligned_input_sequences``` in the file ```recombination_test_data_v1.1.zip```, which contains:

(1) An aligned sequence file named ```alignment_lineages_data.fasta```, which includes multiple sequences from the query lineage and the reference lineages.

(2) A text file named ```reference_lineages_name.txt```, which lists the names (marks) of the reference lineages.
    
```
     reference_lineage_1
     reference_lineage_2
     reference_lineage_3
     reference_lineage_4
     reference_lineage_5
     reference_lineage_6
     reference_lineage_7
     reference_lineage_8
     reference_lineage_9
```
     
Note that these reference lineage marks must also appear in the sequence names in the file ```alignment_lineages_data.fasta```. <b>The mark of each reference lineage should be unique</b>; otherwise, there will be duplicate matches in subsequent analysis.
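The uniqueness requirement can be checked before a run. The following is a minimal sketch, not part of VirusRecom: it assumes that a mark is matched as a plain substring of the sequence name, and the function name ```check_lineage_marks``` and the sample names are hypothetical.

```python
def check_lineage_marks(marks, sequence_names):
    """Return a list of problems found in a set of lineage marks."""
    problems = []

    # A mark that is listed more than once in the lineage file.
    seen = set()
    for mark in marks:
        if mark in seen:
            problems.append(f"duplicate mark: {mark}")
        seen.add(mark)
    unique = sorted(seen)

    # A mark contained in another mark would match both lineages
    # (assumption: marks are matched as substrings of sequence names).
    for a in unique:
        for b in unique:
            if a != b and a in b:
                problems.append(f"'{a}' is a substring of '{b}'")

    # A mark that matches no sequence name at all.
    for mark in unique:
        if not any(mark in name for name in sequence_names):
            problems.append(f"no sequence name contains '{mark}'")
    return problems


# Hypothetical marks and sequence names, for illustration only.
marks = ["reference_lineage_1", "reference_lineage_10"]
names = ["reference_lineage_1_seqA", "reference_lineage_10_seqB"]
for problem in check_lineage_marks(marks, names):
    print(problem)
```

Under the substring assumption, a mark such as ```reference_lineage_1``` would also match sequences of ```reference_lineage_10```, which is the kind of duplicate match the note warns about.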

Before running VirusRecom, let's consider the search strategy for recombination events. First, because sequences from these lineages are highly similar, we use only polymorphic sites, which means specifying ```-m p```. Second, we exclude gap-containing sites in this test with ```-g n```; to keep these gap sites, use ```-g y``` instead. Next, for the first run, we try a window size of 100 and a step size of 20. Note that both sizes count polymorphic sites here, because ```-m p``` has been specified. For the two parameters ```-cp``` and ```-mr```, we use the default values of 0.9 and 1000, respectively. Finally, we specify a folder to save the results with ```-o```.
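To make the effect of ```-g n``` and ```-m p``` concrete, the sketch below filters the columns of a toy alignment. It is an illustration only, not VirusRecom's implementation: it assumes equal-length uppercase sequences with ```-``` as the gap character, and the function name ```filter_columns``` is hypothetical.

```python
def filter_columns(sequences, drop_gaps=True, polymorphic_only=True):
    """Return (filtered_sequences, kept_positions); positions are 1-based."""
    kept = []
    for i, column in enumerate(zip(*sequences)):
        if drop_gaps and "-" in column:
            continue  # analogous to -g n: discard gap-containing sites
        if polymorphic_only and len(set(column)) == 1:
            continue  # analogous to -m p: discard sites identical in all sequences
        kept.append(i)
    filtered = ["".join(seq[i] for i in kept) for seq in sequences]
    return filtered, [i + 1 for i in kept]


# Toy alignment, for illustration only.
alignment = ["ACGT-A", "ACCTTA", "ATGTTA"]
filtered, positions = filter_columns(alignment)
print(filtered, positions)
```

After such filtering, a window of 100 sites spans 100 retained polymorphic sites, which usually corresponds to a longer stretch of the original alignment.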

Then, switch the current directory to ```aligned_input_sequences```, and run the following command (an example) to detect recombination events in query lineage:

```
virusrecom -a alignment_lineages_data.fasta -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -o outdir
```

Note: (1) if the current directory is not ```aligned_input_sequences```, use absolute paths for the files and directories in the command instead of relative paths.
(2) the string “query_recombinant” in the command is the mark of the query lineage in the file ```alignment_lineages_data.fasta```.


<b>After the run is complete</b>, in the directory ```outdir```, there are three subdirectories and two aggregated reports:

![outdir.png](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/outdir.png)

(1) In the directory ```run_record```, if ```-g n``` is specified, the file ```Record_of_deleted_gap_sites_*.txt```, which lists all the gap sites, is created. In addition, if ```-m p``` is specified, the file ```Record_of_same_sites_in_aligned_sequence*.txt```, which lists all the identical sites, is created.

(2) In the directory ```WICs_of_sites```, the files ```*_site_WIC_from_lineages.pdf```, ```*_site_WIC_from_lineages.xlsx``` and ```*_site_WIC.csv``` record the WIC value of each site.

(3) In the directory ```WICs_of_slide_window```, the files ```*_mWIC_from_lineages.xlsx``` and ```*_mWIC_from_lineages.pdf``` record the mean WIC of each sliding window.

![recombination_step4.png](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/recombination_step4.png)

The user can fine-tune the window size and step size according to the density of points in the generated graph. In general, very dense points mean that the noise is too high, and the window size can be increased in the next scan.
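How window size and step size determine the plotted points can be sketched as follows. This is a simplified illustration that averages a list of per-site values; it is not VirusRecom's code, it ignores any trailing partial window (how the tool treats that case is not stated here), and the function name ```sliding_window_means``` is hypothetical.

```python
def sliding_window_means(values, window, step):
    """Return (start_index, mean) for each full window, advancing by `step`."""
    means = []
    for start in range(0, len(values) - window + 1, step):
        chunk = values[start:start + window]
        means.append((start, sum(chunk) / window))
    return means


# Made-up per-site values, for illustration only.
site_values = [float(i % 7) for i in range(1000)]
for window in (100, 200):
    points = sliding_window_means(site_values, window, step=20)
    print(f"window={window}: {len(points)} points")
```

Each point averages over ```window``` sites, so a larger window smooths site-to-site fluctuation, while the step size controls how many points are drawn.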

In addition to the three sub-directories above, VirusRecom provides two summary files. The file ```Possible_recombination_event_conciseness.txt``` only retains results of recombination events with p-values less than 0.05.
```
Possible major parent: reference_lineage_1(global mWIC: 1.8976186779157704)

Other possible parents and significant recombination regions (p<0.05):
reference_lineage_2	7237 to 11539(mWIC: 1.9553354371515168), p_value: 7.831109305531908e-06	

Significance test of recombinant regions using Mann-Whitney-U test with two-tailed probabilities, p-value less than 0.05 indicates a significant difference.
```

In this output report, the major parent of the query lineage is ```reference_lineage_1``` and the minor parent is ```reference_lineage_2```; the recombination region spans sites 7237 to 11539, with a p-value of 7.83e-06. The identified region is close to the actual one (sites 7333 to 11473 in the genome), so the error at the recombination boundaries is acceptable.
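For downstream scripting, the region lines of this report can be parsed with a regular expression. The sketch below is based only on the sample output shown above (tab after the parent name, then ```start to end(mWIC: ...), p_value: ...```); the exact layout in other versions may differ, and the function name ```parse_regions``` is hypothetical.

```python
import re

# Pattern inferred from the sample report above; an assumption, not a spec.
REGION = re.compile(
    r"^(?P<parent>\S+)\t(?P<start>\d+) to (?P<end>\d+)"
    r"\(mWIC: (?P<mwic>[\d.]+)\), p_value: (?P<p>\S+)"
)


def parse_regions(report_text):
    """Extract recombination regions from the text of a summary report."""
    regions = []
    for line in report_text.splitlines():
        m = REGION.match(line)
        if m:
            regions.append({
                "parent": m["parent"],
                "start": int(m["start"]),
                "end": int(m["end"]),
                "mwic": float(m["mwic"]),
                "p_value": float(m["p"]),
            })
    return regions


sample = (
    "Possible major parent: reference_lineage_1(global mWIC: 1.8976186779157704)\n"
    "\n"
    "Other possible parents and significant recombination regions (p<0.05):\n"
    "reference_lineage_2\t7237 to 11539(mWIC: 1.9553354371515168), "
    "p_value: 7.831109305531908e-06\t\n"
)
for region in parse_regions(sample):
    # Offsets from the actual boundaries quoted in the text (7333 and 11473).
    print(region["parent"], 7333 - region["start"], region["end"] - 11473)
```

For the sample report, the detected boundaries lie 96 sites before the actual start and 66 sites after the actual end.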

In fact, ```Possible_recombination_event_conciseness.txt``` is an interpretation of the recombination information contained in ```*_mWIC_from_lineages.pdf```. Although VirusRecom shows a good balance between precision and recall on simulated data, false positives and false negatives sometimes occur, so users should judge the reported results for themselves.

In addition, the output file ```Possible_recombination_event_detailed.txt``` also reports the results with p-values greater than 0.05. <b>Tip: recombination events with p-values over 0.001 are less reliable</b>.

If ```-b y``` is specified, VirusRecom also searches for recombination breakpoints and plots the result. For example:
```
virusrecom -a alignment_lineages_data.fasta -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -b y -bw 200 -o outdir
```
<b>Tip:</b> (1) ```-b y``` only takes effect when ```-m p``` has been specified. 
(2) the step size of breakpoint search is fixed to 1. 

The negative logarithm of the p-value at each site is stored in the files ```*_-lg(p-value)_for_potential_breakpoint.pdf``` and ```*_-lg(p-value)_for_potential_breakpoint.xlsx```.

![breakpoint.jpg](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/breakpoint.jpg)

The highest peak (the highest −lgP value) indicates the possible recombination breakpoint.
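Locating that peak from a table of per-site p-values can be sketched as below. This is an illustration only: it assumes that "lg" denotes the base-10 logarithm, the positions and p-values are made up, and the function name ```breakpoint_peak``` is hypothetical.

```python
import math


def breakpoint_peak(positions, p_values):
    """Return (position, -log10(p)) for the site with the highest -log10(p)."""
    # p-values must be strictly positive; log10(0) is undefined.
    scores = [-math.log10(p) for p in p_values]
    best = max(range(len(scores)), key=scores.__getitem__)
    return positions[best], scores[best]


# Made-up positions and p-values, for illustration only.
positions = [100, 101, 102]
p_values = [1e-2, 1e-8, 1e-3]
print(breakpoint_peak(positions, p_values))
```

A recombination region has two boundaries, so in practice more than one local peak may be worth inspecting in the plot.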

### 3.2. Unaligned input-sequences
VirusRecom can also handle unaligned input sequences. In this case, multiple sequence alignment is performed by calling an external program. In <b>virusrecom v1.1</b>, mafft, muscle, and clustal-omega are supported. Note that VirusRecom calls them from the system path, so they need to be installed on the machine beforehand.

For the example data in directory ```unaligned_input_sequences```, run the following command:
```
virusrecom -ua unalignment_lineages_data.fas -at mafft -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -o outdir
```
<b>Note:</b> (1) ```-at mafft``` calls mafft from the system path, with the alignment strategy set to auto. Likewise, use ```-at muscle``` to call muscle and ```-at clustalo``` to call clustal-omega.
(2) the string ```query_recombinant``` in the command is the mark of the query lineage in the file ```unalignment_lineages_data.fas```.
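Whether an aligner is reachable from the system path can be checked beforehand. The sketch below uses the standard-library function ```shutil.which```; the executable names in ```ALIGNERS``` are assumptions (the usual command names of these tools) and may differ on your machine.

```python
import shutil

# Keys are the values accepted by -at; the executable names are assumed.
ALIGNERS = {"mafft": "mafft", "muscle": "muscle", "clustalo": "clustalo"}


def find_aligners(executables=ALIGNERS):
    """Map each tool to the path of its executable, or None if not on PATH."""
    return {tool: shutil.which(exe) for tool, exe in executables.items()}


for tool, path in find_aligners().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```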

The output is interpreted in the same way as in section 3.1.


### 3.3. Non-lineage data
In VirusRecom, a reference lineage may contain just a single sequence. In that case, the mWIC value of a fragment is essentially a multiple of the shared identity: if ```-g n``` is used in the calculation, the mWIC is twice the shared identity; if ```-g y``` is used, the mWIC is $\log_2{5}$ times the shared identity.

Note that, for recombination analysis without lineage data, this additional feature is only recommended for sequences that are not highly similar, and the user can use it to draw an identity point map.

The test data is in directory ```non_lineage_data``` of the file ```recombination_test_data_v1.1.zip```.

Delta-CoV HNU1-1 is a known recombinant of SpCoV HKU17-USA and ThCoV HKU12. The breakpoints were located at genome positions nt 21017 and 25056, as jointly identified and confirmed with RDP3 and Simplot by [Wang et al., 2022](https://onlinelibrary.wiley.com/doi/10.1111/tbed.14029).

Because these sequences are not highly similar, we use all sites (```-m a```) in the alignment. We also use a larger window size, and run the following command:
```
virusrecom -a alns.fasta -q HNU1-1 -l alns_seq_taxon.txt -g n -m a -w 800 -s 100 -cp 0.7 -mr 6000 -le r -o output
```

The mWIC from reference lineages is as follows:

![hnu1-1.jpg](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/hnu1-1.jpg)

<b>Note,</b> because each “lineage” contains only one sequence and ```-g n``` is used in the example, the mWIC in the picture is actually twice the size of “sequence identity”. 

The possible recombination event identified by VirusRecom is as follows:
```
Possible major parent: HKU17-USA(global mWIC: 1.5914816042426252)  

Other possible parents and significant recombination regions (p<0.05): 
HKU12	20720 to 25297(mWIC: 1.8039433490697028), p_value: 2.783880536189705e-204
```

The possible major parent of HNU1-1 is HKU17-USA and the minor parent is HKU12, and the recombination region is approximately nt 20720 to 25297 in the alignment.
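Because each "lineage" here is a single sequence, the reported mWIC values can be converted back to shared identity using the factors stated at the start of this section (2 with ```-g n```, $\log_2{5}$ with ```-g y```). A minimal sketch, with the hypothetical function name ```identity_from_mwic```; the conversion only applies to single-sequence lineages:

```python
import math


def identity_from_mwic(mwic, gaps_kept=False):
    """Convert mWIC to shared identity for single-sequence lineages."""
    # Factor 2 for -g n, log2(5) for -g y, as stated in this section.
    factor = math.log2(5) if gaps_kept else 2.0
    return mwic / factor


# mWIC values copied from the report above (-g n was used).
print(round(identity_from_mwic(1.5914816042426252), 3))  # HKU17-USA, global
print(round(identity_from_mwic(1.8039433490697028), 3))  # HKU12, region 20720 to 25297
```

Under this conversion, HNU1-1 shares roughly 80% identity with HKU17-USA over the whole alignment and roughly 90% identity with HKU12 within the recombination region.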


## 4. Common questions
### 4.1. Default values of parameter 
If a parameter is not specified, the software uses its default value.
However, the default values do not suit all data. In addition to the window size (```-w```) and step size (```-s```) of the sliding window, the values of ```-cp``` and ```-mr``` should also be adjusted to the data.

When VirusRecom runs, the value of each parameter is printed on the screen so that you can check it. Users should also try different values across multiple runs, which helps reduce false positives and false negatives.
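Trying several values across runs can be scripted. The sketch below only builds the command lines, using options documented in section 2, and prints them; the file names are those of the example in section 3.1, and the output-directory naming scheme and the function name ```build_commands``` are hypothetical choices.

```python
import itertools
import shlex


def build_commands(alignment, query, lineages, windows, steps, out_prefix="outdir"):
    """Build one virusrecom argument list per (window, step) combination."""
    commands = []
    for w, s in itertools.product(windows, steps):
        commands.append([
            "virusrecom", "-a", alignment, "-q", query, "-l", lineages,
            "-g", "n", "-m", "p", "-w", str(w), "-s", str(s),
            # Hypothetical naming scheme: one output directory per run.
            "-o", f"{out_prefix}_w{w}_s{s}",
        ])
    return commands


commands = build_commands(
    "alignment_lineages_data.fasta", "query_recombinant",
    "reference_lineages_name.txt", windows=[100, 200], steps=[20, 50],
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to ```subprocess.run(cmd, check=True)``` to execute the run; keeping a separate output directory per combination makes the resulting plots easy to compare.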

### 4.2. How to mark lineage in sequence name?
Typically, this is part of data preparation. In virusrecom v1.1, it can be done with the ```-e``` parameter, which engraves the file name into sequence names in batches. For example:
```
virusrecom -e input_directory -o outdir
```
<b>Tip:</b> The directory ```input_directory``` can contain multiple fasta files, and each fasta file can contain multiple sequences. After the run, each sequence name contains the name of its file.

Therefore, if the file name of each fasta file is a lineage name, the lineage names can be written into the sequence names in batches.
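The idea behind ```-e``` can be sketched in a few lines. This is not VirusRecom's implementation, and the exact name format that VirusRecom writes is not specified here: the sketch assumes that the file name (without extension) is prepended to each sequence name with an underscore, and the function name ```engrave``` is hypothetical.

```python
from pathlib import Path


def engrave(fasta_text, label, sep="_"):
    """Prefix every FASTA header with `label` (assumed format: >label_name)."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(f">{label}{sep}{line[1:]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file name and content, for illustration only.
label = Path("lineageA.fasta").stem
print(engrave(">seq1\nACGT\n>seq2\nAC\n", label), end="")
```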

### 4.3. How to change the color scheme in an image?
If you have programming skills, you can directly modify the order of the colors in the ```plt_corlor_list.py``` file. If not, you can redraw the figures from the output matrices provided by VirusRecom, which are usually suffixed with ```.xlsx```.


## 5. Citation
Zhou ZJ, Yang CH, Ye SB, Yu XW, Qiu Y, Ge XY. VirusRecom: an information-theory-based method for recombination detection of viral lineages and its application on SARS-CoV-2. <i>Brief Bioinform</i>. 2023 Jan 19;24(1):bbac513. [doi: 10.1093/bib/bbac513](https://academic.oup.com/bib/article-abstract/24/1/bbac513/6886420). PMID: 36567622.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ZhijianZhou01/virusrecom",
    "name": "virusrecom",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "recombination, virus, evolution, information entropy",
    "author": "Zhi-Jian Zhou",
    "author_email": "zjzhou@hnu.edu.cn",
    "download_url": "https://files.pythonhosted.org/packages/51/70/b4bbba2f00b577a78410627aa49f07644263b8ee028a8a8f32d3980d384c/virusrecom-1.1.5.tar.gz",
    "platform": null,
    "description": "# VirusRecom: Detecting recombination of viral lineages using information theory\r\n\r\n## 1. Download and install\r\n\r\nVirusRecom is developed based on ```Python 3```, and you can get and install VirusRecom in a variety of ways.\r\n\r\n### 1.1. pip method\r\n\r\nvirusrecom has been distributed to the standard library of PyPI, and can be easily installed by the tool ```pip```.\r\n\r\n```\r\npip install virusrecom\r\nvirusrecom -h\r\n```\r\n\r\n### 1.2. Or local installation\r\n\r\nIn addition to the  ```pip``` method, you can also install virusrecom manually using the file ```setup.py```. \r\n\r\nFirstly, download this repository, then, run:\r\n```\r\npython setup.py install\r\nvirusrecom -h\r\n```\r\n\r\n### 1.3. Or run the source code directly\r\n\r\nvirusrecom can also be run using the source code without installation. First, download this repository, then, install the required python environment of virusrecom:\r\n\r\n```\r\npip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple\r\n```\r\n\r\nfinally, run virusrecom by the file ```main.py```. Please view the help documentation by ```python main.py -h```.\r\n\r\n### 1.4. Or use the binary files\r\n\r\nFor the two earlier release packages (versions v1.0 and v1.1), you can also directly run the binary files of virusrecom without installation. The  binary files are provided at https://github.com/ZhijianZhou01/virusrecom/releases. \r\n\r\nIn general, the executable file of virusrecom is located at the  ```main``` folder. Then, running the ```virusrecom.exe``` (windows system) or ```virusrecom``` (Linux or MacOS system) to start. If you could not get permission to run virusrecom on Linux system or MacOS system, you could change permissions by ```chmod -R 775 Directory``` or ```chmod -R 777 Directory```. \r\n\r\n\r\n## 2. 
Getting help\r\nvirusrecom is a command-line-interface program, users can get help documentation of the software by entering  ```virusrecom -h ``` or  ```virusrecom --help ```. \r\n\r\n<b>For detailed documentation, please refer to</b> [Manual of VirusRecom v1.1](https://github.com/ZhijianZhou01/virusrecom/blob/main/Manual%20of%20VirusRecom%20v1.1_2023.12.23.pdf)\r\n\r\n<b>Tip: since version 1.1, virusrecom optimizes the parameters of input-file, which is slightly different from virusrecom v1.0.</b>\r\n\r\n<b>The simple help documentation of virusrecom v1.1.3 is as follows.</b>\r\n\r\n| Parameter | Description |\r\n| --- | --- |\r\n|-h, --help | Show this help message and exit.|\r\n|-a ALIGNMENT | Aligned sequence file (*.fasta). Note, each sequence name requires containing lineage mark.|\r\n|-ua UNALIGNMENT | Unaligned (non-alignment) sequence file (*.fasta). Note, each sequence name requires containing lineage mark.|\r\n|-at ALIGN_TOOL | Program used for multiple sequence alignments (MSA).|\r\n|-iwic INPUT_WIC | Using the already obtained WIC values of reference lineages directly by a *.csv input-file.|\r\n|-q QUERY | Name of query lineage (usually potential recombinant), such as \u2018-q xxxx\u2019. Besides, \u2018-q auto\u2019 can scan all lineages as potential recombinant in turn.|\r\n|-l LINEAGES | Path of a text-file containing multiple lineage marks.|\r\n|-g GAP | Reserve sites containing gap in subsequent analyses? \u2018-g y\u2019means to reserve, and \u2018-g n\u2019 means to delete.|\r\n|-m METHOD | Method for scanning. \u2018-m p\u2019 means use polymorphic sites only, \u2018-m a\u2019 means use all the sites.|\r\n|-w WINDOW | Number of nucleotides sites per sliding window. Note: if the \u2018-m p\u2019 has been used, -w refers to the number of polymorphic sites per windows.|\r\n|-s STEP | Step size of the sliding window. 
Note: if the \u2018-m p\u2019 has been used, -s refers to the number of polymorphic sites per jump.|\r\n|-mr MAX_REGION | The maximum allowed recombination region. Note: if the \u2018-m p\u2019 method has been used, it refers the maximum number of polymorphic sites contained in a recombinant region.|\r\n|-cp PERCENTAGE | The cutoff threshold of proportion (cp, default: 0.9) used for searching recombination regions when mWIC/EIC >= cp, the maximum value of cp is 1.|\r\n|-cu CUMULATIVE | Simply using the max cumulative WIC of all sites to identify the major parent. Off by default. If required, specify \u2018-cu y.|\r\n|-b BREAKPOINT | Possible breakpoint scan of recombination. \u2018-b y\u2019 means yes, \u2018-b n\u2019 means no. Note: this option only takes effect when \u2018-m p\u2019 has been specified.|\r\n| -bw BREAKWIN | The window size (default: 200) used for breakpoint scan. The step size is fixed at 1. Note: this option only takes effect when \u2018-m p -b y\u2019 has been specified.|\r\n|-t THREAD | Number of threads (default: 1) used for MAS.|\r\n|-y Y_START | Starting value (default: 0) of the Y-axis in plot diagram.|\r\n|-le LEGEND | The location of the legend, the default is adaptive. '-le r' indicates placed on the right.|\r\n|-owic ONLY_WIC | Only calculate site WIC value. Off by default. If required, please specify \u2018-owic y\u2019.|\r\n|-e ENGRAVE | Engraves file name to sequence names in batches. By specifying a directory containing one or multiple sequence files (*.fasta).|\r\n|-en EXPORT_NAME | Export all sequence name of a *.fasta file.|\r\n|-o | Output directory to store all results.|\r\n|--no_wic_fig | Do not draw the image of WICs.|\r\n|--no_mwic_fig | Do not draw the image of mWICs.|\r\n\r\n\r\nFor more information about the algorithm of virusrecom, please refer to [the publication of virusrecom](https://academic.oup.com/bib/article-abstract/24/1/bbac513/6886420).\r\n\r\n## 3. 
Example of usage\r\nThe sequences data for test in the documentation was stored at https://github.com/ZhijianZhou01/virusrecom/tree/main/example. \r\n\r\n<b>Note, the ```recombination_test_data.zip``` in directory ```example``` is against virusrecom v1.0, not virusrecom v1.1</b>.\r\n\r\nIn this demonstration, the test data is from the the ```recombination_test_data_v1.1.zip``` provided in the directory ```example```. \r\n\r\n### 3.1. Aligned input-sequences\r\nIf the input sequence-data has been aligned, and it should be loaded via the ```-a``` parameter. Multiple sequence alignments (MSA) can be pre-completed by many programs, this is not introduced. Now, let's focus on the directory ```aligned_input_sequences``` in the file ```recombination_test_data_v1.1.zip```. \r\n\r\n(1) An aligned sequence-file named ```alignment_lineages_data.fasta```, which including multiple sequences from the query lineage and other reference lineages. \r\n    \r\n(2) A text-file named ```reference_lineages_name.txt```, which including the names (marks) of these reference lineages. \r\n    \r\n```\r\n     reference_lineage_1\r\n     reference_lineage_2\r\n     reference_lineage_3\r\n     reference_lineage_4\r\n     reference_lineage_5\r\n     reference_lineage_6\r\n     reference_lineage_7\r\n     reference_lineage_8\r\n     reference_lineage_9\r\n```\r\n     \r\nNote, these marks of reference lineages should also appear in sequence names of the file ```alignment_lineages_data.fasta```. <b>The mark of each reference lineage should be unique</b>, otherwise, there will be duplicate matches in subsequent analysis.\r\n\r\nBefore running the command of VirusRecom, let's think about the search strategy for recombination events. Firstly, we use only polymorphic sites considering that sequences from these lineages are highly similar, which means that the parameter ```-m p``` needs to be specified. Secondly, we do not consider gap-containing sites in this test and use the parameter ```-g n```. 
Instead, if you consider these gap sites, you need to use the parameter ```-g y```. Next, in the first run, let's try first with a window size of 100 and a step size of 20. Of note the value of \u201csize\u201d at this time represents the number of polymorphic sites because the ```-m p``` parameter has been specified. For the two parameters ```-cp``` and ```-mr```, we use the default value of 0.9 and 1000 in this test. Finally, we specify a folder to save the results by parameter ```-o```. \r\n\r\nThen, switch the current directory to ```aligned_input_sequences```, and run the following command (an example) to detect recombination events in query lineage:\r\n\r\n```\r\nvirusrecom -a alignment_lineages_data.fasta -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -o outdir\r\n```\r\n\r\nNote: (1) if the current directory is not switched to ```aligned_input_sequences```, the file and directory path in command need the absolute paths instead of relative paths.\r\n(2) the string \u201cquery_recombinant\u201d in command is the corresponding mark of query lineage in the file ```alignment_lineages_data.fasta```.\r\n\r\n\r\n<b>After the run is complete</b>, in the directory ```outdir```, there are three subdirectories and two aggregated reports:\r\n\r\n![outdir.png](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/outdir.png)\r\n\r\n(1) In the directory ```run_record```, if ```-g n``` is specified, and the file ```Record_of_deleted_gap_sites_*.txt``` containing all the gap sites will be created. Besides, If ```-m p``` is specified, and the file ```Record_of_same_sites_in_aligned_sequence*.txt``` containing all the same sites will be created.\r\n\r\n(2) In the directory ```WICs_of_sites```, the file ```*_site_WIC_from_lineages.pdf```, ```*_site_WIC_from_lineages.xlsx``` and the file ```*_site_WIC.csv``` are used to record the WIC value of each site. 
\r\n\r\n(3) In the directory ```WICs_of_slide_window```, the file ```*_mWIC_from_lineages.xlsx``` and the file ```*_mWIC_from_lineages.pdf``` are used to record the mean WIC of each sliding window. \r\n\r\n![recombination_step4.png](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/recombination_step4.png)\r\n\r\nThe user can fine-tune the window size and step size according to the density of points in the generated graph. In general, very dense points means that the noise is too high and the window size can be increased appropriately in next scan. \r\n\r\nIn addition to the three sub-directories above, VirusRecom provides two summary files. The file ```Possible_recombination_event_conciseness.txt``` only retains results of recombination events with p-values less than 0.05.\r\n```\r\nPossible major parent: reference_lineage_1(global mWIC: 1.8976186779157704)\r\n\r\nOther possible parents and significant recombination regions (p<0.05):\r\nreference_lineage_2\t7237 to 11539(mWIC: 1.9553354371515168), p_value: 7.831109305531908e-06\t\r\n\r\nSignificance test of recombinant regions using Mann-Whitney-U test with two-tailed probabilities, p-value less than 0.05 indicates a significant difference.\r\n```\r\n\r\nIn this output report, the major parent of query lineage was ```reference_lineage_1``` and the minor parent was ```reference_lineage_2```, and the recombination region was site 7237 to 11539 and the p-value was 7.83e-06. The identified recombination event was relatively close to the actual (from site 7333 to 11473 in the genome), and the error of the recombination boundary is also acceptable.\r\n\r\nIn fact, ```Possible_recombination_event_conciseness.txt``` is interpretations of the recombination information contained in ```*_mWIC_from_lineages.pdf```. Although VirusRecom shows a good balance between precision and recall in simulated data, false positive or false negatives sometimes occur. 
Therefore, for the identification results from VirusRecom, users can make own judgment. \r\n\r\nBesides, the output file ```Possible_recombination_event_detailed.txt``` shows those results with p-values greater than 0.05. <b>Tip: recombination events with p-values over 0.001 are less reliable</b>. \r\n\r\nIf ```-b y``` is specified, then VirusRecom will perform the search of recombination breakpoint and plot. For example:\r\n```\r\nvirusrecom -a alignment_lineages_data.fasta -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -b y -bw 200 -o outdir\r\n```\r\n<b>Tip:</b> (1) ```-b y``` only takes effect when ```-m p``` has been specified. \r\n(2) the step size of breakpoint search is fixed to 1. \r\n\r\nThe negative logarithm of p-value in each site is in the file ```*_-lg(p-value)_for_potential_breakpoint.pdf``` and the file ```*_-lg(p-value)_for_potential_breakpoint.xlsx```. \r\n\r\n![breakpoint.jpg](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/breakpoint.jpg)\r\n\r\nThe highest peak (the highest \u2212lgP value) indicated the possible recombination breakpoint.\r\n\r\n### 3.2. Unaligned input-sequences\r\nVirusRecom can also handle unaligned input-sequences. In this case, multiple sequence alignment is performed by calling external program. In <b>virusrecom v1.1</b>, mafft, muscle, and clustal-omega is supported. It is worth mentioning that VirusRecom call them from the system path, so they need to be installed on the machine beforehand.\r\n\r\nFor the example data in directory ```unaligned_input_sequences```, run the following command:\r\n```\r\nvirusrecom -ua unalignment_lineages_data.fas -at mafft -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -o outdir\r\n```\r\n<b>Note:</b> (1) ```-at mafft``` means to call mafft in the system path, and the alignment strategy is auto. 
Besides, using ```-at muscle``` to call muscle and using ```-at clustalo``` to call clustal-omega.\r\n(2) the string ```query_recombinant``` in command is the corresponding mark of query lineage in the file ```unalignment_lineages_data.fas```.\r\n\r\nThe interpretation of the output result is consistent with section 3.1. \r\n\r\n\r\n### 3.3. Non-lineage data\r\nIn VirusRecom, the reference lineage is allowed to contain only one single sequence. Under this condition, mWIC value of the fragment is essentially a multiple of shared identity. If -g n is used in the calculation, the mWIC is twice as large as shared identity. If ```-g y``` is used in the calculation, the mWIC is $\\log_2{5}$ as large as shared identity. \r\n\r\nOf noted, for recombination analysis without lineage data, the additional feature is only recommended for non-highly similar sequences and the user can use it to draw an identity point map.\r\n\r\nThe test data is in directory ```non_lineage_data``` of the file ```recombination_test_data_v1.1.zip```. \r\n\r\nThe Delta-CoV HNU1-1 is a known recombinant from SpCoV HKU17-USA and ThCoV HKU12, and the break points were identified at genome positions nt 21017 and 25056, which is jointly identified and confirmed by RDP3 and Simplot by [Wang et al., 2022](https://onlinelibrary.wiley.com/doi/10.1111/tbed.14029). \r\n\r\nConsidering that they are not highly similar sequences, we use all sites (```-m a```) in the alignment. 
Then, we use a larger window value and run the following command:\r\n```\r\nvirusrecom -a alns.fasta -q HNU1-1 -l alns_seq_taxon.txt -g n -m a -w 800 -s 100 -cp 0.7 -mr 6000 -le r -o output\r\n```\r\n\r\nThe mWIC from reference lineages is as follows:\r\n\r\n![hnu1-1.jpg](https://github.com/ZhijianZhou01/virusrecom/blob/main/figture/hnu1-1.jpg)\r\n\r\n<b>Note,</b> because each \u201clineage\u201d contains only one sequence and ```-g n``` is used in the example, the mWIC in the picture is actually twice the \u201csequence identity\u201d. \r\n\r\nThe possible recombination event identified by VirusRecom is as follows:\r\n```\r\nPossible major parent: HKU17-USA(global mWIC: 1.5914816042426252)  \r\n\r\nOther possible parents and significant recombination regions (p<0.05): \r\nHKU12\t20720 to 25297(mWIC: 1.8039433490697028), p_value: 2.783880536189705e-204\r\n```\r\n\r\nThe possible major parent of HNU1-1 is HKU17-USA and the minor parent is HKU12, and the recombination region is approximately 20720-25297 nt in the alignment.\r\n\r\n\r\n## 4. Common questions\r\n### 4.1. Default values of parameters \r\nIf a parameter value is not specified, the software uses the default value. \r\nHowever, the default values are not suitable for all data. In addition to the window size (```-w```) and step size (```-s```) of the sliding window, the values of ```-cp``` and ```-mr``` also need to be adjusted based on the data. \r\n\r\nWhen VirusRecom runs, the value of each parameter is printed on the screen so that you can check it. Moreover, users should try different values in multiple runs, which helps reduce false positives and false negatives.\r\n\r\n### 4.2. How to mark lineage in sequence name?\r\nTypically, this is part of the data preparation. In virusrecom v1.1, users can easily do this with the ```-e``` parameter. The ```-e``` parameter can engrave the file name into sequence names in batches. 
The example is as follows:\r\n```\r\nvirusrecom -e input_directory -o outdir\r\n```\r\n<b>Tip:</b> The directory ```input_directory``` can contain multiple fasta files, and each fasta file can contain multiple sequences. After the run, each sequence name will contain its file name. \r\n\r\nTherefore, if the file name of a fasta file is a lineage name, the lineage name can be written into the sequence names in batches.\r\n\r\n### 4.3. How to change the color scheme in an image?\r\nIf you have programming skills, you can directly modify the order of the colors in the ```plt_corlor_list.py``` file. If not, you can use the output matrices provided by VirusRecom, which are usually suffixed with ```.xlsx```, to redraw the figure yourself. \r\n\r\n\r\n## 5. Citation\r\nZhou ZJ, Yang CH, Ye SB, Yu XW, Qiu Y, Ge XY. VirusRecom: an information-theory-based method for recombination detection of viral lineages and its application on SARS-CoV-2. <i>Brief Bioinform</i>. 2023 Jan 19;24(1):bbac513. [doi: 10.1093/bib/bbac513](https://academic.oup.com/bib/article-abstract/24/1/bbac513/6886420). PMID: 36567622.\r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "An information-theory-based method for recombination detection of viral lineages.",
    "version": "1.1.5",
    "project_urls": {
        "Homepage": "https://github.com/ZhijianZhou01/virusrecom"
    },
    "split_keywords": [
        "recombination",
        " virus",
        " evolution",
        " information entropy"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8992bfa23e2be0e06e97b74726e49fad050877110e77350068ae462d71af4728",
                "md5": "984aadcf2a0393ed57a4a48475122d83",
                "sha256": "6f1dbf2525212628e355545d64cb81f3ca12d1c0a62968f9c2362210a6857f7b"
            },
            "downloads": -1,
            "filename": "virusrecom-1.1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "984aadcf2a0393ed57a4a48475122d83",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 32463,
            "upload_time": "2024-04-18T13:24:39",
            "upload_time_iso_8601": "2024-04-18T13:24:39.918093Z",
            "url": "https://files.pythonhosted.org/packages/89/92/bfa23e2be0e06e97b74726e49fad050877110e77350068ae462d71af4728/virusrecom-1.1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5170b4bbba2f00b577a78410627aa49f07644263b8ee028a8a8f32d3980d384c",
                "md5": "811d884ff7d7d2fac37951328cda65f2",
                "sha256": "88747b2590163526baa57115d649db6e0310ace792fab9e6b50275d71265f2ee"
            },
            "downloads": -1,
            "filename": "virusrecom-1.1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "811d884ff7d7d2fac37951328cda65f2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 36052,
            "upload_time": "2024-04-18T13:24:41",
            "upload_time_iso_8601": "2024-04-18T13:24:41.993240Z",
            "url": "https://files.pythonhosted.org/packages/51/70/b4bbba2f00b577a78410627aa49f07644263b8ee028a8a8f32d3980d384c/virusrecom-1.1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-18 13:24:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ZhijianZhou01",
    "github_project": "virusrecom",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "virusrecom"
}
        