# SAAPS
Single Amino Acids Polymorphism Statistics (SAAPS) is a Python 3 command-line program that runs on Windows, Linux, and macOS. It quickly calculates single amino acid polymorphisms (SAPs) and the Shannon information entropy at each site of a sequence alignment, and it analyzes and clusters sequence features using one-hot encoding and dimensionality reduction algorithms.
The only input SAAPS requires is an aligned sequence file (nucleotide or amino acid). The main workflow is as follows. First, SAAPS analyzes the aligned sequences, calculates the SAPs and the Shannon information entropy at each site, and can draw the corresponding figures. Next, it screens key sites according to a specified information entropy range and converts them into quantifiable data using one-hot encoding. Finally, a dimensionality reduction algorithm (PCA or t-SNE) is used to assess the similarity between sequences and draw a clustering scatter plot.
# 1. Download and install
SAAPS is developed in `Python 3` and can be installed in either of the following ways.
## 1.1 Pip method
SAAPS is published on PyPI and can be installed with `pip`.
```
pip install saaps
saaps -h
```
## 1.2 Local installation
Alternatively, download the repository and install SAAPS with its `setup.py` file:
```
python setup.py install
saaps -h
```
# 2. Getting help
Use `-h` or `--help` to display the help document. The parameters of SAAPS are briefly described below:
**Input Options**
Parameter | Description
--------- | ----------
-i, --input | the path of input file
-sr, --shannon_result | the path of shannon result
**Basic Options**
Parameter | Description
--------- | ----------
-dg, --delete_gap | delete the columns containing gaps in sequences
-t, --translate | translate nucleotides to amino acids
-cs, --cut_start | the start fragment/site of target sequence
-ce, --cut_end | the end fragment/site of target sequence
-osn, --output_seq_name | output sequence names
-csn, --change_seq_name | change the sequence names
**Polymorphism Options**
Parameter | Description
--------- | ----------
-cp, --compute_ppm | compute polymorphism
-pp, --plot_ppm | plot polymorphism figure
**Plotting Options**
Parameter | Description
--------- | ----------
-fw, --figure_width | the width of figure
-fh, --figure_height | the height of figure
-fname, --figure_name | the name of figure
-c, --color | the palette of figure
-pac, --print_all_colors | print all supported palettes
-dpi, --DPI | the dpi of the figure
-fm, --figure_format | the format of figure
-tp, --transparent | make the background of the figure transparent
-sl, --scatter_label | display scatter plot labels
**Shannon Entropy Options**
Parameter | Description
--------- | ----------
-sha, --shannon | compute shannon entropy
-smax, --shannon_max | the maximum of the shannon value (IC)
-smin, --shannon_min | the minimum of the shannon value (IC)
-pic, --plot_ic | plot IC scatter figure
**Dimension Reduction Options**
Parameter | Description
--------- | ----------
-oh, --onehot | one-hot encoding
-pca, --PCA | dimensionality reduction by PCA
-tsne, --TSNE | dimensionality reduction by t-SNE
-tpp, --tSNE_perplexity | the perplexity in t-SNE, default is 3
-tlr, --tSNE_learn_rate | the learning rate in t-SNE, default is 200
-tni, --tSNE_n_iter | the n_iter in t-SNE, default is 5000
-trs, --tSNE_random_state | the random state in t-SNE, default is 1
-pc, --plot_cluster | plot clustering scatter plot
**Output Options**
Parameter | Description
--------- | ----------
-o, --output_dir | the directory for output files; the default is the current user's documents folder
-pre, --prefix | the prefix of the output file name
# 3. Examples
This chapter uses the test data provided with the software (`testSeq.txt` in the `test` folder) to describe the main functions of SAAPS. All sample files in this manual were generated by the software from this test data, so users can follow the tutorial before exploring other features on their own.
## 3.1 Preparation for raw materials
Users need to prepare an aligned sequence file (nucleotide or amino acid). The alignment can be produced with MAFFT or similar software. In the aligned file, each record consists of a sequence name line followed by the sequence body on one or more lines. The aligned sequence file should look like this:
```
>TestSeqNo.1 Sequence1
GPQSGAIYVG……LWLDEEAMEQ
>TestSeqNo.2 Sequence2
GPLGQQSGAV……LWLDDEVMEQ
...
```
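Before running SAAPS it can be useful to confirm that a file really has this layout and that all sequences share one length. The following is a minimal sketch, not part of SAAPS; the helper names `read_aligned_fasta` and `check_alignment` are hypothetical, and the example records are shortened for illustration.

```python
from typing import Dict, List, Optional


def read_aligned_fasta(lines: List[str]) -> Dict[str, str]:
    """Parse FASTA-style lines into {name: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def check_alignment(records: Dict[str, str]) -> int:
    """Return the shared alignment length, or raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have different lengths: {sorted(lengths)}")
    return lengths.pop()


# Shortened, made-up records; the first one is wrapped over two lines.
example = [
    ">TestSeqNo.1 Sequence1",
    "GPQSG",
    "AIYVG",
    ">TestSeqNo.2 Sequence2",
    "GPLGQQSGAV",
]
records = read_aligned_fasta(example)
print(check_alignment(records))
```

If the lengths differ, the file has probably not been aligned yet and should be passed through an aligner first.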
## 3.2 Sequence preprocessing
Before the analysis, it is usually necessary to preprocess the sequences, for example by translating nucleotide sequences into amino acids or changing the sequence names. This helps produce the best results in the subsequent analysis and plots. Here we use renaming the sequences as an example:
### 3.2.1 Output sequence names
```
saaps -i your/seq/path -osn -o your/output/path
```
**Code explanation**: `-i your/seq/path` passes the aligned sequence file. `-osn` outputs the sequence names found in the sequence file; by default they are written to a `.csv` table. `-o your/output/path` sets the output path. If this parameter is not specified, the result is written to the default path. After the program runs, a file named `SeqName.csv` is generated, in which users can edit the sequence names individually. It is recommended to save the new names in the second column, so that the next step can rename the sequences in batches. The format of the file is as follows:
|Old_Name||
|---|---|
|>TestSeqNo.1 Sequence1||
|>TestSeqNo.2 Sequence2||
|...||
### 3.2.2 Change sequence names
```
saaps -i your/seq/path -csn your/names/file/path -o your/output/path
```
**Code explanation**: In this step, `-csn your/names/file/path` passes the table that pairs the old and new sequence names, i.e. the table output in the previous step with the new names filled in. The format of the pairing table is as follows:
|Old_Name|New_Name|
|---|---|
|>TestSeqNo.1 Sequence1|>Seq1|
|>TestSeqNo.2 Sequence2|>Seq2|
|...|...|
The modified sequence file is named "NewNameSeq.txt" by default. The comparison between the original sequence file and the modified sequence file is as follows:
|Old Seq File|New Seq File|
|---|---|
|>TestSeqNo.1 Sequence1|>Seq1|
|GPQSGAIYVG……LWLDEEAMEQ|GPQSGAIYVG……LWLDEEAMEQ|
|...|...|
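The renaming step can be pictured as a lookup from old header lines to new ones. The sketch below is only an illustration of that idea and is not the SAAPS implementation; it assumes the pairing table is a comma-separated file with the headers `Old_Name` and `New_Name`, and the function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_name_map(csv_text: str) -> Dict[str, str]:
    """Read Old_Name -> New_Name pairs, skipping rows without a new name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Old_Name"]: row["New_Name"] for row in reader if row["New_Name"]}


def rename_headers(lines: Iterable[str], name_map: Dict[str, str]) -> List[str]:
    """Replace header lines found in name_map; leave everything else untouched."""
    renamed = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(">"):
            renamed.append(name_map.get(stripped, stripped))
        else:
            renamed.append(stripped)
    return renamed


# Assumed CSV layout; the second row has no new name and is therefore kept as-is.
table = "Old_Name,New_Name\n>TestSeqNo.1 Sequence1,>Seq1\n>TestSeqNo.2 Sequence2,\n"
sequence_lines = [">TestSeqNo.1 Sequence1", "GPQSG", ">TestSeqNo.2 Sequence2", "GPLGQ"]
print(rename_headers(sequence_lines, load_name_map(table)))
```

Headers without an entry in the table are left unchanged, so a partially filled table does not lose any records.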
## 3.3 Polymorphism analysis
Although SAAPS was developed for the analysis of SAPs, polymorphism analysis is fundamentally a statistical analysis of the symbols and their frequencies at each site of the sequence, so SAAPS can also be applied to other sequences such as nucleotides.
```
saaps -i your/seq/path -cp -pp -o your/output/path
```
**Code explanation**: `-i your/seq/path` passes the aligned sequence file. `-cp` activates the polymorphism calculation. `-pp` plots the polymorphism result.
The polymorphism analysis generates three files named `Concise_Result.txt`, `Detail_Result.txt`, and `Seq_Matrix.csv`. If the plotting parameter (`-pp`) is set, a figure named `SAPs.pdf` (by default) is also generated. The main result files are explained as follows:
**Concise_Result.txt**: a concise result of the polymorphism analysis. It mainly records the frequency of each amino acid at each site and is also the input file for the subsequent information entropy calculation. The format is as follows (excerpt):
```
Site 1 (G, 15, 100.00%)
Site 2 (K, 7, 46.67%) (R, 1, 6.67%) (C, 1, 6.67%) …
… … … … …
```
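The per-site statistics in this file amount to counting the symbols in each alignment column. The following sketch shows that counting on a toy alignment; it is not the SAAPS source code, the function names are hypothetical, and the tab-separated output layout is an assumption modeled on the excerpt above.

```python
from collections import Counter
from typing import List, Tuple

SiteEntry = Tuple[str, int, float]  # (symbol, count, percent)


def site_frequencies(seqs: List[str]) -> List[List[SiteEntry]]:
    """For each alignment column, list (symbol, count, percent), most common first."""
    total = len(seqs)
    result = []
    for column in zip(*seqs):  # zip(*seqs) walks the alignment column by column
        counts = Counter(column)
        result.append([(sym, n, 100.0 * n / total) for sym, n in counts.most_common()])
    return result


def format_site(index: int, entries: List[SiteEntry]) -> str:
    """Render one site as 'Site N<TAB>(symbol, count, percent%)...'."""
    cells = [f"({sym}, {n}, {pct:.2f}%)" for sym, n, pct in entries]
    return "\t".join([f"Site {index}"] + cells)


# Toy alignment of three sequences with three sites each.
seqs = ["GPQ", "GPL", "GKL"]
for i, entries in enumerate(site_frequencies(seqs), start=1):
    print(format_site(i, entries))
```

Because the function only counts symbols, the same code works for nucleotide alignments, which matches the remark that the analysis is not limited to amino acids.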
**Detail_Result.txt**: the detailed result of the polymorphism analysis, which records the names of the sequences carrying each amino acid at each site. Its content is shown below (excerpt):
|col1|col2|col3|coln|
|---|---|---|---|
|Site1|(G: All Sequences)|||
|Site2|(P: >Seq2,>Seq1)|(K: >Seq10,>Seq15,>Seq7,>Seq13,>Seq9,>Seq12,>Seq14)|...|
|Site3|(F: Most Sequences)|(L: >Seq2)|...|
|...|...|...|...|
**Seq_Matrix.csv**: the amino acid matrix of the sequences. The main contents of the table are as follows (excerpt):
|Seq1|Seq2|Seq3|Seq4|Seq5|Seq6|Seq7|...|
|---|---|---|---|---|---|---|---|
|G|G|G|G|G|G|G|...|
|P|P|A|A|A|A|K|...|
|...|...|...|...|...|...|...|...|
**SAPs.pdf**: the polymorphism distribution map. An example figure is shown below; users can customize the figure with the plotting parameters provided by the software:
![SAPs.png](test/SAPs.png)
## 3.4 Shannon entropy calculation
In the SAAPS results, the information content (IC value) is reported in place of the information entropy; the higher the IC value, the more conserved the amino acid at the site. The information entropy analysis generates two result files by default, `Shannon_log.txt` and `Shannon_IC_Result.csv`.
```
saaps -i your/Concise/result/path -sha -pic -o your/output/path
```
**Code explanation:** The information entropy calculation takes the `Concise_Result.txt` file from the polymorphism analysis in the previous step as input. `-sha` activates the information entropy calculation, and `-pic` plots the result; by default a figure named `IC-Plot.pdf` is generated in the output path.
**Shannon_log.txt**: simple summary statistics of the IC analysis. The contents of the file are as follows (excerpt):
```
Basic Statistic Information of the Shannon Entropy (IC)
IC
Count 150
mean 3.8647165333333335
... ...
```
**Shannon_IC_Result.csv**: the IC values of all sites in the sequence, which are convenient for subsequently plotting the IC figure. The contents of the table are as follows (excerpt):
|SITE|IC|
|---|---|
|Site1|4.45943|
|Site2|2.52919|
|Site3|3.75947|
|Site4|3.75947|
|...|...|
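One common way to turn Shannon entropy into an information content is IC = log2(N) - H, where N is the alphabet size and H is the entropy of the site. The sketch below uses that definition with N = 22. Both the formula and the alphabet size are assumptions: they are inferred from the example values (a fully conserved site would give log2(22) ≈ 4.45943, which matches the Site1 value in the table), not taken from the SAAPS source code.

```python
import math
from collections import Counter

# Assumption: 22 symbols, inferred from the example values, not from SAAPS itself.
ALPHABET_SIZE = 22


def shannon_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_content(column: str, alphabet_size: int = ALPHABET_SIZE) -> float:
    """IC = log2(alphabet size) - entropy; higher values mean a more conserved site."""
    return math.log2(alphabet_size) - shannon_entropy(column)


conserved = "G" * 15           # every sequence carries G at this site
mixed = "GGGGGGGGKKKKKKKR"      # three different symbols at this site
print(round(information_content(conserved), 5))
print(information_content(mixed) < information_content(conserved))
```

Under this definition a site where every sequence agrees gets the maximum IC, and each additional variant lowers the value, which is consistent with the statement that a higher IC indicates a more conserved site.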
**IC-Plot.pdf**: the IC scatter plot of the sequence. The figure is as follows:
![IC-plot](test/IC-Plot.png)
## 3.5 Polymorphism site screening and one-hot encoding
```
saaps -i your/Seq/Matrix/path -sr your/IC/Result/path -smin ICmin -smax ICmax -o your/output/path
```
**Code explanation**: One-hot encoding takes two inputs: the `Seq_Matrix.csv` file from the polymorphism analysis and the `Shannon_IC_Result.csv` file from the information entropy calculation. If only the sites within a threshold range should be converted, use `-smin` and `-smax` to set the IC threshold range. In the one-hot encoding step, the software automatically generates four files: `OneHot-Original.csv`, `One-Hot-ConciseMatrix.csv`, `OneHot-Transform.csv`, and `One-Hot-ForIntersect.csv`.
**OneHot-Original.csv**: a table of the polymorphic sites selected according to the IC threshold range and used for one-hot encoding. The file content is as follows (excerpt):
||3|4|5|6|7|8|...|
|---|---|---|---|---|---|---|---|
|>Seq1|Q|S|G|A|I|Y|...|
|>Seq2|L|G|Q|Q|S|G|...|
|...|...|...|...|...|...|...|...|
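The screening step can be thought of as keeping the sites whose IC falls inside the requested range and then cutting those columns out of each sequence. The sketch below illustrates this; it is not the SAAPS implementation, the function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from typing import Dict, List


def select_sites(ic_by_site: Dict[str, float], ic_min: float, ic_max: float) -> List[str]:
    """Keep sites whose IC lies in [ic_min, ic_max] (inclusive bounds are an assumption)."""
    return [site for site, ic in ic_by_site.items() if ic_min <= ic <= ic_max]


def extract_columns(seqs: Dict[str, str], sites: List[str]) -> Dict[str, Dict[int, str]]:
    """Map each sequence name to {1-based position: residue} for the selected sites."""
    positions = [int(site[len("Site"):]) for site in sites]
    return {name: {pos: seq[pos - 1] for pos in positions} for name, seq in seqs.items()}


# IC values copied from the Shannon_IC_Result.csv excerpt; the sequences are made up.
ic_values = {"Site1": 4.45943, "Site2": 2.52919, "Site3": 3.75947, "Site4": 3.75947}
seqs = {">Seq1": "GPQS", ">Seq2": "GPLG"}

selected = select_sites(ic_values, 2.0, 4.0)
print(selected)
print(extract_columns(seqs, selected))
```

With these thresholds the fully conserved Site1 is dropped, since a site without variation carries no information for clustering.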
**One-Hot-ConciseMatrix.csv**: the intermediate file produced by one-hot encoding `OneHot-Original.csv`, in which each amino acid is represented by the corresponding binary vector. The file content is as follows (excerpt):
||3|...|
|---|---|---|
|Seq1|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]|...|
|Seq2|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|...|
|...|...|...|
**OneHot-Transform.csv**: the result of expanding the dimensions of the `One-Hot-ConciseMatrix.csv` file. The binary list at each site is expanded into one column per element, and each column contains only the corresponding value from the binary list. The main contents of the file are as follows (excerpt):
||3_1|3_2|3_3|...|
|---|---|---|---|---|
|Seq1|0|0|0|...|
|Seq2|0|0|0|...|
|...|...|...|...|...|
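The relationship between the binary lists and the expanded columns can be sketched as follows. This is an illustration only: the 22-symbol alphabet and its order are hypothetical (the excerpt only shows that each vector has 22 elements), and the column naming `<site>_<k>` with `k` starting at 1 is modeled on the table above.

```python
from typing import Dict, List

# Hypothetical alphabet: 20 amino acids plus gap and unknown, 22 symbols in total.
# The real symbol set and order used by SAAPS may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-X"


def one_hot(symbol: str) -> List[int]:
    """Return a 0/1 vector with a single 1 at the symbol's alphabet index."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(symbol)] = 1  # raises ValueError for unknown symbols
    return vector


def expand(site_symbols: Dict[int, str]) -> Dict[str, int]:
    """Flatten {site: symbol} into columns named '<site>_<k>', k starting at 1."""
    row: Dict[str, int] = {}
    for site, symbol in site_symbols.items():
        for k, bit in enumerate(one_hot(symbol), start=1):
            row[f"{site}_{k}"] = bit
    return row


# Two selected sites of one sequence: Q at site 3 and S at site 4.
row = expand({3: "Q", 4: "S"})
print(len(row), sum(row.values()))
```

Each selected site contributes 22 columns, exactly one of which is 1, so two sites yield 44 columns with two ones.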
**One-Hot-ForIntersect.csv**: a record of the polymorphic sites of each sequence. The table records the amino acid of each sequence at each polymorphic site and can be used for subsequent set operations such as difference and complement analysis. The contents of the table are as follows (excerpt):
||3|4|5|6|7|...|
|---|---|---|---|---|---|---|
|>Seq1|3Q|4S|5G|6A|7I|...|
|>Seq2|3L|4G|5Q|6Q|7S|...|
|>Seq3|3F|4G|5Q|6Q|7S|...|
|...|...|...|...|...|...|...|
## 3.6 Dimensionality reduction and clustering
One-hot encoding converts categorical data into numerical data that can be analyzed quantitatively. After one-hot encoding, the information at each site of each sequence becomes high-dimensional binary data. To better analyze the sequence features, a dimensionality reduction algorithm is usually applied to observe the relationships between the sequences. Here we take the commonly used PCA algorithm as an example.
```
saaps -i your/Transform/result/path -pca -pc -o your/output/path
```
**Code explanation**: The dimensionality reduction analysis takes the `OneHot-Transform.csv` file from the previous step as input. `-pca` activates PCA dimensionality reduction, and `-pc` draws a scatter plot from the dimensionality reduction results. This process produces two result files named `PCA.csv` and `PCA-Plot.pdf`.
**PCA.csv**: the PCA result for the sequence file. The contents of the file are as follows (excerpt):
||0|1|2|3|...|
|---|---|---|---|---|---|
|Seq1|6.254129|-2.01068|-0.27758|-0.33611|...|
|Seq2|-0.74108|-0.47749|-0.1452|-1.75072|...|
|...|...|...|...|...|...|
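To show what the PCA step does to a one-hot matrix, here is a minimal sketch based on a singular value decomposition with NumPy. It is not necessarily how SAAPS computes its result, the function name `pca` is hypothetical, and the small input matrix is made up. The signs of the components may also differ between implementations.

```python
import numpy as np


def pca(matrix: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of matrix onto the first n_components principal axes."""
    centered = matrix - matrix.mean(axis=0)  # PCA works on mean-centered columns
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Made-up one-hot rows for four sequences; the first two are identical.
X = np.array(
    [
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [0, 1, 0, 1],
    ],
    dtype=float,
)
scores = pca(X)
print(scores.shape)
```

Sequences with identical one-hot rows receive identical scores, so they overlap in the scatter plot, while sequences that differ at many sites end up farther apart.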
**PCA-Plot.pdf**: the scatter plot generated from the PCA results with the default parameters. An example figure is as follows:
![PCA-plot.png](test/PCA-Plot.png)
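The four analysis steps of this chapter can be chained from a script. The sketch below only assembles the commands shown in sections 3.3 to 3.6 and can run them with `subprocess`; it assumes that `saaps` is on the `PATH` and that each result file is written directly into the output directory under its default name, which should be checked against an actual run. The function names and the `results` directory are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(seq_file: str, out_dir: str, ic_min: float, ic_max: float) -> List[List[str]]:
    """Assemble the documented SAAPS calls; assumes default output file names."""
    out = Path(out_dir)
    return [
        # 3.3 polymorphism analysis
        ["saaps", "-i", seq_file, "-cp", "-pp", "-o", out_dir],
        # 3.4 Shannon entropy (IC) calculation
        ["saaps", "-i", str(out / "Concise_Result.txt"), "-sha", "-pic", "-o", out_dir],
        # 3.5 site screening and one-hot encoding
        ["saaps", "-i", str(out / "Seq_Matrix.csv"), "-sr", str(out / "Shannon_IC_Result.csv"),
         "-smin", str(ic_min), "-smax", str(ic_max), "-o", out_dir],
        # 3.6 PCA and clustering plot
        ["saaps", "-i", str(out / "OneHot-Transform.csv"), "-pca", "-pc", "-o", out_dir],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run the steps in order and stop at the first one that fails."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = build_commands("testSeq.txt", "results", 2.0, 4.0)
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline(commands)` would execute the steps; printing them first makes it easy to verify the paths before anything is run.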
Raw data
{
"_id": null,
"home_page": "https://github.com/xiaosheep01/SAAPS",
"name": "saaps",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "SAP, Information Entropy, OneHot Encoding, Dimension Reduction",
"author": "Yang Xiao",
"author_email": "fredrik1999@163.com",
"download_url": "https://files.pythonhosted.org/packages/95/4f/d07a86cf364687ba06263039e734f61478a54ed75bc47f1d151eca5a3c81/saaps-1.4.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Single Amino Acids Polymorphism Statistics",
"version": "1.4.1",
"project_urls": {
"Homepage": "https://github.com/xiaosheep01/SAAPS"
},
"split_keywords": [
"sap",
" information entropy",
" onehot encoding",
" dimension reduction"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0228d2e2e1050a5c3cba6067536d5973d5210ba0596693ad12c7e3e1dfca6d84",
"md5": "af62646e46eac5afc4eab3e06f8e5028",
"sha256": "18f538600b1454fc957b3df9cbe1d235ccab05fa7185cddbda4f0f52738c2be9"
},
"downloads": -1,
"filename": "saaps-1.4.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "af62646e46eac5afc4eab3e06f8e5028",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 888101,
"upload_time": "2024-03-26T07:49:23",
"upload_time_iso_8601": "2024-03-26T07:49:23.825315Z",
"url": "https://files.pythonhosted.org/packages/02/28/d2e2e1050a5c3cba6067536d5973d5210ba0596693ad12c7e3e1dfca6d84/saaps-1.4.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "954fd07a86cf364687ba06263039e734f61478a54ed75bc47f1d151eca5a3c81",
"md5": "ffe2e4e20ceac73bce4c202f9071c929",
"sha256": "6e3f480644331a01d5c5a50480077a43e9cf5ca81fc8e571c1112b2bc48f1f67"
},
"downloads": -1,
"filename": "saaps-1.4.1.tar.gz",
"has_sig": false,
"md5_digest": "ffe2e4e20ceac73bce4c202f9071c929",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 892498,
"upload_time": "2024-03-26T07:49:28",
"upload_time_iso_8601": "2024-03-26T07:49:28.519151Z",
"url": "https://files.pythonhosted.org/packages/95/4f/d07a86cf364687ba06263039e734f61478a54ed75bc47f1d151eca5a3c81/saaps-1.4.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-26 07:49:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "xiaosheep01",
"github_project": "SAAPS",
"github_not_found": true,
"lcname": "saaps"
}