hiscan


Name: hiscan
Version: 1.2.1
Home page: https://github.com/xiaosheep01/hiscan
Summary: Scanning histone mimics in sequences.
Upload time: 2024-07-18 10:50:27
Author: Yang Xiao
Keywords: histone, mimic, viral protein
Requirements: No requirements were recorded.

# HiScan: Histone motif scanning
Histone Motif Scan (HiScan) is a Python 3 command-line program that runs on Windows, Linux and macOS. It is mainly used to quickly identify histone mimicry (HM) motifs in viral protein sequences.
## 1. Download and install
HiScan is written in `Python 3` and can be installed in either of the following ways.
### 1.1 Pip method
HiScan is published on PyPI and can be installed with `pip`:
```
pip install hiscan
hiscan -h
```
### 1.2 Local installation
In addition to the pip method, you can install HiScan from source using the file `setup.py`.
Download the repository, and then run:
```
python setup.py install
hiscan -h
```
## 2. Getting help
You can use "-h" or "--help" to show the help message. The following is a brief introduction to each HiScan parameter:
Parameter | Description
--------- | ----------
-h,--help | Show the help message and exit.
-i | Path to the viral protein sequences to be analyzed, in "fasta" format. Accepts a file or a folder.
-mr | Path to a raw result file, that is, the output of an HM motif query run with "-m" and without any additional annotation information. The file format is "txt", "csv" or "xlsx"; the default is "txt".
-nr | Path to a result file of an HM motif query that already has NCBI species annotation (produced with "-na") but no host annotation. The file format is "txt", "csv" or "xlsx"; the default is "txt".
-m | File of HM motifs to search for in the viral protein sequences. Supported file formats are "txt" and "xlsx".
-na | Path to the NCBI protein annotation file ("gpff" format) used to add NCBI species annotation to the query results. Accepts a file or a folder.
-ia | Path to the ICTV virus classification table used to add ICTV host annotation to the query results. Only a single file is supported. This step relies on the NCBI species annotation, so it should be used after the "-na" parameter.
-mc | Count the type and number of HM motifs in the raw result file and calculate their frequency of occurrence.
-ms | Calculate the amino acid preference from the HM motif input file and save the result in matrix form. The HM motif combination with the highest probability is also reported.
-mp | Calculate all possible HM motif combinations from the HM motif input file, together with their probabilities. It is recommended to use no more than 15 motifs.
-cc | Based on the raw result file, calculate the number and frequency of HM motifs in each group at the chosen classification level. By default, "family" is used.
-hc | Based on the raw result file, calculate the number and frequency of the different host sources of the mimic motifs.
-ct | Specify the classification level for the '-cc' parameter. Choose from 'Realm', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family'.
-dedup | Remove duplicates from the result file. Currently, only tab-separated files are supported. No output directory needs to be specified; a file ending with "_de" is generated automatically in the directory where the result file is located. Usage example: *hiscan -d raw_result.txt*
-oln | The number of overlapping amino acids at which two results are treated as duplicates; the default is 5. Valid only when the '-dedup' parameter is used.
-o | The result file path.
-on | Specify the name of the output file; the default is "Result".
-ot | Specify the format of the output file: "txt", "csv" or "xlsx". The default is "txt"; the "xlsx" table format is recommended.
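
The overlap-based duplicate removal described for '-dedup' and '-oln' can be pictured with the following minimal Python sketch. It is not HiScan's implementation. It assumes that two hits count as duplicates when they lie on the same sequence and share at least a given number of residues, and that locations are 1-based and inclusive. The names `overlap` and `deduplicate` are hypothetical.

```python
def overlap(a, b):
    """Number of residues shared by two 1-based inclusive (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def deduplicate(hits, min_overlap=5):
    """Keep the first hit among those that overlap on the same sequence.

    Assumption: a hit is a duplicate if it shares at least `min_overlap`
    residues with an already kept hit on the same sequence.
    """
    kept = []
    for seq_id, motif, span in hits:
        duplicate = any(
            seq_id == kept_id and overlap(span, kept_span) >= min_overlap
            for kept_id, _, kept_span in kept
        )
        if not duplicate:
            kept.append((seq_id, motif, span))
    return kept


# Made-up hits: (sequence ID, motif, (start, end))
hits = [
    ("P1", "LPKKT", (3, 7)),
    ("P1", "LPKKT", (3, 7)),   # exact repeat of the first hit
    ("P1", "PKKTS", (4, 8)),   # shares 4 residues with the first hit
    ("P2", "LPKKT", (3, 7)),   # different sequence, never a duplicate
]
for hit in deduplicate(hits):
    print(hit)
```

Under this reading, the default of 5 removes only hits that cover the same five residues, while a smaller value also merges partially overlapping hits.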
## 3. Example of usage
### 3.1 Raw material preparation
You will need to prepare a viral protein sequence file (required), an HM motif file (required), and protein annotation files (optional). The HM motif file should be formatted as described below.
>[!NOTE]
>You can download viral protein sequences and their annotation files from the NCBI database (https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). The host annotation file comes from the official ICTV website (https://ictv.global/vmr), and the HM motifs come from a literature review. The following examples were run on Windows.

#### (1) No motif annotation information
List only the HM motif names, without any additional information. Note that the motif names must be in the first column.
Col-A | Col-B
------|------
LPKKT |
RGKQG |
GGKAR |
RAKAK |
... | 

#### (2) Motif with annotation information
A motif file with annotation information should be formatted as follows: the first column is the motif name, the second column is the histone subunit type, and the third column is the modification type of the motif. If the subunit type or modification type of an HM motif is not known, fill it in with "None" or another value. Other annotation columns are not currently supported.
Col-A | Col-B | Col-C
-----|------|-----
LPKKT | H2A | Methylation
RGKQG | H2A | Acetylation
GGKAR | H2A | Acetylation
KKTES | H2A | Phosphorylation
PAKSA | H2B | Methylation
... | ... | ...
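
As an illustration of the two layouts above, here is a minimal Python sketch that reads a motif file with one to three columns. It is not HiScan's own parser. It assumes a tab-separated "txt" file without a header row, and the function name `read_motifs` is hypothetical.

```python
import csv
import io


def read_motifs(handle):
    """Return a list of (motif, subunit, modification) tuples.

    Assumes tab-separated text with no header row. Missing or empty
    annotation columns are filled with the string "None".
    """
    motifs = []
    for row in csv.reader(handle, delimiter="\t"):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank lines
        cells = (cells + ["", ""])[:3]  # pad to three columns, ignore extras
        motif, subunit, modification = cells
        motifs.append((motif.upper(), subunit or "None", modification or "None"))
    return motifs


# In-memory stand-in for a motif file that mixes both layouts
example = io.StringIO("LPKKT\tH2A\tMethylation\nRGKQG\tH2A\nPAKSA\n")
for record in read_motifs(example):
    print(record)
```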

### 3.2 Obtaining raw result
#### Case 1: NCBI and ICTV annotation files are complete
Pass in the viral protein sequence file, the HM motif file, and the NCBI and ICTV annotation files, as in the following command:
```
hiscan -i your/viral_protein_file_path/ -m your/motif_path/ -na your/NCBI_annotation_path/ -ia your/ICTV_annotation_path -o your/result_path -ot result_format
```
The output information is as follows:

![Prompt information-1](images/4.png)

And your results should look like this:
NCBI_ID|Seq_Name|Mimic|Location|Histone_Subunit|Modification|Realm|Kingdom|Phylum|Class|Order|Family|Species|Host_Source
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
YP_009175074.1|polyprotein [Bean rugose mosaic virus]|LPKKT|(1179:1183)|H2A|Methylation|Riboviria|Orthornavirae|Pisuviricota|Pisoniviricetes|Picornavirales|Secoviridae|Bean rugose mosaic virus|plants
YP_009272812.1|polyprotein [Washington bat picornavirus]|LPKKT|(969:973)|H2A|Methylation|Riboviria|Orthornavirae|Pisuviricota|Pisoniviricetes|Picornavirales|Picornaviridae|Washington bat picornavirus|None
YP_009333551.1|hypothetical protein 1 [Beihai picorna-like virus 85]|LPKKT|(438:442)|H2A|Methylation|Riboviria|None|None|None|None|None|Beihai picorna-like virus 85|None
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

#### Case 2: There is no annotation information
When no annotation files are available, and the HM motif file itself has no annotation columns, only the viral protein sequence file and the HM motif file need to be passed in:
```
hiscan -i your/viral_protein_file_path/ -m your/motif_path/ -o your/result_path
```
Your results should look like the following:
NCBI_ID|Seq_Name|Mimic|Location
--- | --- | --- | ---
YP_009175074.1|polyprotein [Bean rugose mosaic virus]|LPKKT|(1179:1183)
YP_009272812.1|polyprotein [Washington bat picornavirus]|LPKKT|(969:973)
YP_009333551.1|hypothetical protein 1 [Beihai picorna-like virus 85]|LPKKT|(438:442)
... | ... | ... | ...
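
The scan behind such a table can be pictured with the following minimal Python sketch. It is not HiScan's implementation. It assumes plain substring matching and that the Location column holds 1-based inclusive coordinates, which is consistent with the five-residue spans shown above. The function names and the accession `YP_000000.1` are made up for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def scan(sequence, motifs):
    """Return (motif, start, end) hits with 1-based inclusive coordinates."""
    hits = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.append((motif, start + 1, start + len(motif)))
            start = sequence.find(motif, start + 1)  # allow overlapping hits
    return hits


# Placeholder record, not a real NCBI entry
fasta = ">YP_000000.1 hypothetical protein [Example virus]\nMALPKKTSSRGKQGLPKKT\n"
for header, seq in parse_fasta(fasta):
    seq_id, _, seq_name = header.partition(" ")
    for motif, start, end in scan(seq, ["LPKKT", "RGKQG"]):
        print(seq_id, seq_name, motif, f"({start}:{end})", sep="\t")
```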

#### Case 3: Supplement annotation information to raw result
If you have already generated the result file from Case 2 and an NCBI or ICTV annotation file, or a self-written annotation file (following the gpff format of NCBI), is available, you can continue the analysis from that result file.
The following example supplements the raw result with NCBI annotation information:
```
hiscan -mr your/raw_result_path/ -na your/NCBI_annotation_file -o your/result_path
```
### 3.3 Analysis of motif residue skewness
To calculate the residue skewness of the HM motifs, you only need to input the HM motif file:
```
hiscan -i your/motif_path/ -ms -o your/result_path
```
And your result should look like the following:
Col_0 | Col_1 | Col_2 | Col_3 | Col_4
--- | --- | --- | --- | ---
('R', 0.15094)|('G', 0.15094)|('K', 0.75472)|('Q', 0.09434)|('G', 0.13208)
('G', 0.13208)|('G', 0.15094)|('K', 0.75472)|('A', 0.13208)|('R', 0.03774)
('R', 0.15094)|('A', 0.16981)|('K', 0.75472)|('A', 0.13208)|('K', 0.09434)
('K', 0.11321)|('A', 0.16981)|('K', 0.75472)|('T', 0.09434)|('R', 0.03774)
... | ... | ... | ... | ...
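
One way to read the matrix above is that each row is an input motif and each cell pairs a residue with its frequency at that position across all input motifs. The following minimal Python sketch reproduces that layout under this assumption. It is not HiScan's implementation, and the function names are hypothetical.

```python
from collections import Counter


def position_frequencies(motifs):
    """Return one {residue: frequency} dict per motif position."""
    length = len(motifs[0])
    if any(len(motif) != length for motif in motifs):
        raise ValueError("all motifs must have the same length")
    freqs = []
    for column in zip(*motifs):
        counts = Counter(column)
        freqs.append({aa: n / len(motifs) for aa, n in counts.items()})
    return freqs


def annotate(motifs):
    """Pair every residue of every motif with its positional frequency."""
    freqs = position_frequencies(motifs)
    return [
        [(aa, round(freqs[i][aa], 5)) for i, aa in enumerate(motif)]
        for motif in motifs
    ]


# Toy background of four motifs; real input would be the full motif file
motifs = ["RGKQG", "GGKAR", "RAKAK", "KAKTR"]
for row in annotate(motifs):
    print(row)
```

With only four toy motifs the frequencies differ from the table above, which was computed from a larger motif set.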

### 3.4 Predicting HM motifs
To predict possible HM motifs, input a file of HM motifs as the background. HiScan will calculate all possible combinations of HM motifs and sort them by probability.
```
hiscan -i your/motif_path/ -mp -o your/result_path
```
And your results should look like this:
Mimic | Probability
--- | ---
AGKVT | 0.0020117850867
AGKVL | 0.0020117850867
VAKVT | 0.0013411900578
... | ...
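
A plausible way to obtain such a ranking is to combine the residues observed at each position and multiply their positional frequencies. The following minimal Python sketch does this under the assumption that positions are treated as independent. It is not HiScan's implementation, and `predict_motifs` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def predict_motifs(motifs, top=None):
    """Enumerate residue combinations and rank them by probability.

    Assumption: the probability of a combination is the product of the
    per-position residue frequencies in the background motifs.
    """
    columns = []
    for column in zip(*motifs):
        counts = Counter(column)
        columns.append([(aa, n / len(motifs)) for aa, n in counts.items()])
    ranked = []
    for combo in product(*columns):
        probability = 1.0
        for _, freq in combo:
            probability *= freq
        ranked.append(("".join(aa for aa, _ in combo), probability))
    # Highest probability first, ties broken alphabetically
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top]


background = ["RGKQG", "GGKAR", "RAKAK", "KAKTR"]
for motif, probability in predict_motifs(background, top=3):
    print(motif, round(probability, 5))
```

The number of combinations is the product of the number of distinct residues at each position, so it grows quickly as the background becomes more diverse.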

### 3.5 Other Analyses
To perform additional analyses on the raw result, such as host origin statistics or motif statistics, pass the raw result file together with the corresponding option. The following example runs the motif statistics ("-mc"):
```
hiscan -i your/raw_result_path/ -mc -o your/result_path
```
And your results should look like this:
Mimic|Count|Frequency
--- | --- | ---
ARKSA|584|0.09211
ATKAA|467|0.07366
ARKST|300|0.04732
SGRGK|281|0.04432
GTKAV|270|0.04259
VYKVL|224|0.03533
... | ... | ...
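
The motif statistics above amount to counting the values of the Mimic column and dividing by the total number of hits. A minimal Python sketch of that calculation follows. It is not HiScan's implementation. It assumes a tab-separated raw result file with a header row, and the function name and sample rows are made up.

```python
import csv
import io
from collections import Counter


def motif_statistics(handle, column="Mimic"):
    """Return (motif, count, frequency) rows sorted by descending count."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    return [(motif, n, round(n / total, 5)) for motif, n in counts.most_common()]


# In-memory stand-in for a raw result file; the rows are invented
raw = io.StringIO(
    "NCBI_ID\tSeq_Name\tMimic\tLocation\n"
    "A.1\tprotein a\tARKSA\t(10:14)\n"
    "B.1\tprotein b\tARKSA\t(22:26)\n"
    "C.1\tprotein c\tATKAA\t(5:9)\n"
    "D.1\tprotein d\tARKSA\t(1:5)\n"
)
for row in motif_statistics(raw):
    print(*row, sep="\t")
```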

            
