# panCNSgene pipeline
[PyPI package: panCG](https://badge.fury.io/py/panCG)
<img src="/figures/panCNSGene.png">
## Dependencies
1. [cactus](https://github.com/ComparativeGenomicsToolkit/cactus/blob/v2.9.3/BIN-INSTALL.md)
2. [phast](https://github.com/CshlSiepelLab/phast)
3. [JCVI](https://github.com/tanghaibao/jcvi)
4. [UCSC](https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/): `mafFilter`, `mafSplit`, `wigToBigWig`
   ```shell
   wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafFilter
   wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafSplit
   wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
   ```
5. [orthofinder](https://github.com/davidemms/OrthoFinder)
6. [blast](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/)
## Install
```shell
# Step 1. Download the dependencies listed above
# Step 2. Create the environment and install the Python libraries
conda create -n PanCnsGene python=3.11
conda activate PanCnsGene
pip install biopython
pip install pandas
pip install pyyaml
pip install pyBigWig
pip install pyranges
pip install jcvi
# Step 3. Write the absolute paths of the dependencies into CMD_CONFIG.py
vim CMD_CONFIG.py
```
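Step 3 needs the absolute path of each tool. The sketch below is one way to look them up; it is not part of panCG. The executable names in `TOOLS` are assumptions inferred from the dependency list and may differ on your system, and the layout of `CMD_CONFIG.py` itself is not described here, so the script only prints what it finds.

```python
import shutil

# Executable names are assumptions based on the dependency list above;
# adjust them to match your installation.
TOOLS = [
    "cactus", "halLiftover", "phastCons",
    "mafFilter", "mafSplit", "wigToBigWig",
    "orthofinder", "blastn",
]


def locate_tools(names):
    """Return {name: path found on PATH, or None if the tool is missing}."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools(TOOLS).items():
        print(f"{name}\t{path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` is either not installed or not on `PATH`; in the second case its path has to be entered by hand.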
## Input file format requirements
1. Chromosome IDs in the genome may contain only letters, digits, and `_`. Special characters such as `:`, `-`, and `,` are not allowed.
2. The GFF annotation file should ideally contain only gene, mRNA, exon, CDS, and UTR features. Each gene must have an `ID` field, and every other feature must have a `Parent` field.
3. The gene bed file must be a standard 6-column bed file: `<chrID> <start> <end> <geneID> <score/0> <strand>`.
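The requirements above can be checked before running the pipeline. The following is a minimal sketch, not part of panCG: the function names are hypothetical, the bed file is assumed to be tab-separated, and accepting `.` as a strand value is an assumption.

```python
import re

# Only letters, digits and "_" are allowed in chromosome IDs.
CHROM_ID = re.compile(r"[A-Za-z0-9_]+")


def valid_chrom_id(chrom_id):
    """True if the chromosome ID contains only letters, digits and '_'."""
    return CHROM_ID.fullmatch(chrom_id) is not None


def check_bed6_line(line):
    """Return a list of problems found in one bed line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return [f"expected 6 tab-separated columns, found {len(fields)}"]
    chrom, start, end, _gene_id, _score, strand = fields
    problems = []
    if not valid_chrom_id(chrom):
        problems.append(f"invalid chromosome ID: {chrom!r}")
    if not (start.isdecimal() and end.isdecimal() and int(start) < int(end)):
        problems.append("start and end must be integers with start < end")
    if strand not in {"+", "-", "."}:  # accepting "." is an assumption
        problems.append(f"invalid strand: {strand!r}")
    return problems
```

Calling `check_bed6_line` on every line of a gene bed file and reporting the non-empty results gives a quick pre-flight check.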
## Output
### CNS calling
| Directory                          | File suffix        | Description                      |
| ---------------------------------- | ------------------ | -------------------------------- |
| {Workdir}/03-phastCons/Wig/        | {species}.all.bw   | PhastCons conservation score file |
| {Workdir}/03-phastCons/Wig/CEsDir/ | {species}.CNSs.bed | CNS file of {species}            |
### panCNS
| Directory | File suffix | Description |
| ---------------------------------- | ------------------------------------------ | ------------------------------------------------------------ |
| halLiftoverDir | {que}.{ref}.bed | Positions of the que CNSs mapped onto ref. Columns: `<ref_map_chrID> <ref_map_start> <ref_map_end> <que_cnsID>` |
| halLiftoverDir | {que}.{ref}.merge.bed | Positions of the que CNSs mapped onto ref, merged where the distance is below the threshold. Columns: `<ref_map_chrID> <ref_map_start> <ref_map_end> <que_cnsID>` |
| halLiftoverDir | {que}.{ref}.halLiftover.anchors | Correspondence between que CNSs and ref CNSs in the HAL multiple sequence alignment file. Columns: `<que_cnsID> <ref_cnsID>` |
| halLiftoverDir | {que}.{ref}.merge.bw.bed | `{que}.{ref}.merge.bed` with `averageBwScore` and `effecve_len` added |
| blastnDir | .blastn.fmt6.txt | Original blastn alignment file |
| blastnDir | .blastn.halLiftoverFilter.anchors | blastn anchors after halLiftover filtering |
| blastnDir | .blastn.halLiftoverFilter.anchors.fmt6.txt | fmt6 format of the `.blastn.halLiftoverFilter.anchors` file |
| JCVIDir | .lifted.anchors | Additional anchors recruited by JCVI |
| JCVIDir | .anchors | High-quality anchors output by JCVI |
| JCVIDir | .halLiftoverFilter.lifted.anchors | Additional anchors recruited by JCVI, after halLiftover filtering |
| JCVIDir | .halLiftoverFilter.anchors | High-quality JCVI anchors after halLiftover filtering |
| JCVIDir | .halLiftoverFilter.lifted.anchors.fmt6.txt | fmt6 format of the `.halLiftoverFilter.lifted.anchors` file |
| JCVIDir | .halLiftoverFilter.anchors.fmt6.txt | fmt6 format of the `.halLiftoverFilter.anchors` file |
| {Workdir}/Index/ | cnsCluster.csv | CNS clustering information for all species |
| {Workdir}/Index/{species}/ | {species}.csv | Clustering information of all {species} CNS before |
| {Workdir}/CEsDir/ | {species}.recall_cds.bed | recall_cds coordinates for each species |
| {Workdir}/Ref\_{species}_IndexDir/ | .cnsIndexAssign.csv | Results of cnsIndexAssign |
| {Workdir}/Ref\_{species}_IndexDir/ | .cnsIndexMerge.csv | Results of cnsIndexMerge |
| {Workdir}/Ref\_{species}_IndexDir/ | .recallCEs.csv | Results of recallCEs: the CEs obtained by recall |
| {Workdir}/Ref\_{species}_IndexDir/ | .ReCnsIndexMerge.csv | Result after merging `.recallCEs.csv` |
| {Workdir}/Ref\_{species}_IndexDir/ | .TripleCnsIndexMerge.csv | Result of classifying the CEs in `.ReCnsIndexMerge.csv` (cns, cds) and then merging |
| {Workdir}/Ref\_{species}_IndexDir/ | .recall.csv | The result of merging `.TripleCnsIndexMerge.csv` with cell |
| {Workdir}/Ref\_{species}_IndexDir/ | .sort.csv | Result of sorting `.recall.csv` |
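The `{que}.{ref}.merge.bed` file merges lifted fragments that lie closer together than a threshold. The sketch below illustrates that kind of merge under stated assumptions: fragments are merged only when they share the same query CNS ID and reference chromosome, and `max_gap` stands in for the pipeline's threshold, whose actual name and value are not given here. It is not panCG's own implementation.

```python
def merge_lifted_fragments(rows, max_gap):
    """Merge nearby fragments of the same query CNS on the same chromosome.

    rows: iterable of (ref_chrom, ref_start, ref_end, que_cns_id) tuples.
    Two fragments are merged when the gap between them is below max_gap.
    """
    merged = []
    # Sort so that fragments of one CNS on one chromosome are adjacent.
    ordered = sorted(rows, key=lambda r: (r[3], r[0], r[1], r[2]))
    for chrom, start, end, cns_id in ordered:
        last = merged[-1] if merged else None
        same_target = last is not None and last[3] == cns_id and last[0] == chrom
        if same_target and start - last[2] < max_gap:
            # Extend the previous fragment instead of starting a new one.
            merged[-1] = (chrom, last[1], max(last[2], end), cns_id)
        else:
            merged.append((chrom, start, end, cns_id))
    return merged


if __name__ == "__main__":
    fragments = [
        ("Chr1", 100, 120, "cnsA"),
        ("Chr1", 125, 150, "cnsA"),
        ("Chr1", 400, 420, "cnsA"),
    ]
    for row in merge_lifted_fragments(fragments, max_gap=10):
        print(*row, sep="\t")
```

With `max_gap=10`, the first two fragments are joined into one interval and the third stays separate.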
### panGene
| Directory                     | File suffix     | Description          |
| ----------------------------- | --------------- | -------------------- |
| {Workdir}/Cluster/            | All.Cluster.csv | Gene clustering file |
| {Workdir}/Ref\_{ref}_IndexDir/ | .panGene.csv    | Final panGene result |
The `Group` column is the homology group identified by OrthoFinder.

The `Index` column uses the following naming scheme:

- `OGXXXXXXX.1`: a gene index subdivided within the homology group.
- `OGXXXXXXX.1.Un`: the `.Un` suffix marks a set of genes that still exist independently in a single species after CPM.
- `OGXXXXXXX.1.tree_1`: a gene index further subdivided by gene evolutionary relationships, starting from the gene index.
- `OGXXXXXXX.1.tree_Un`: the `.tree_Un` suffix marks a gene set that was not classified using evolutionary relationships.
- `UnMapOGXXXXXXX.1`: the `UnMap` prefix marks genes that OrthoFinder did not cluster.
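When post-processing `.panGene.csv`, it can help to split these labels into their parts. The sketch below is a hypothetical helper, not part of panCG; the regular expression is inferred only from the five label forms listed above and may not cover every label the pipeline emits.

```python
import re

# Pattern inferred from the documented label forms (an assumption).
INDEX_RE = re.compile(
    r"(?P<unmap>UnMap)?(?P<group>OG\d+)\.(?P<sub>\d+)"
    r"(?:\.(?P<suffix>Un|tree_(?:\d+|Un)))?"
)


def describe_index(label):
    """Split an Index label into its parts, or return None if it does not match."""
    match = INDEX_RE.fullmatch(label)
    if match is None:
        return None
    suffix = match.group("suffix")
    if suffix is None:
        kind = "index"          # e.g. OG0000012.1
    elif suffix in ("Un", "tree_Un"):
        kind = suffix           # unplaced gene sets
    else:
        kind = "tree"           # e.g. OG0000012.1.tree_1
    return {
        "group": match.group("group"),
        "unmapped": match.group("unmap") is not None,
        "sub_index": int(match.group("sub")),
        "kind": kind,
    }
```

For example, `describe_index("OG0000012.1.tree_Un")` reports the group `OG0000012`, sub-index `1`, and kind `tree_Un`.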
## Quick start
### Call CNS
Approximate run time: 5 min.
```shell
# Launch one background callCns job per species
for i in C_sinensis C_limon ponkan C_australasica C_glauca F_hindsii A_buxifolia; do
    nohup /usr/bin/time -v python /home/ltan/Tmp/01-PanCNSGene_test_data/PanCNSgene-main/panCNSgene.py callCns \
        -c /home/ltan/Tmp/01-PanCNSGene_test_data/PanCNSgene-main/CNScalling.config.yaml \
        -w /home/ltan/Tmp/01-PanCNSGene_test_data/03-callCNS/${i} \
        -r ${i} > ${i}.log 2>&1 &
done
```
### panGene
Approximate run time: 2 min 15 s.
```shell
nohup /usr/bin/time -v python /home/ltan/Tmp/01-PanCNSGene_test_data/panCG-main/panCG.py GeneIndex \
--config /home/ltan/Tmp/01-PanCNSGene_test_data/panCG-main/geneIndex.config.yaml \
--workDir /home/ltan/Tmp/01-PanCNSGene_test_data/04-GeneIndex \
--reference C_sinensis > C_sinensis.geneIndex.log 2>&1 &
```
### panCNS
Approximate run time: 4 min 30 s.
```shell
nohup /usr/bin/time -v python /home/ltan/Tmp/01-PanCNSGene_test_data/panCG-main/panCG.py CnsIndex \
--config /home/ltan/Tmp/01-PanCNSGene_test_data/panCG-main/cnsIndex.config.yaml \
--workDir /home/ltan/Tmp/01-PanCNSGene_test_data/05-cnsIndex \
--reference C_sinensis \
--geneConfig /home/ltan/Tmp/01-PanCNSGene_test_data/panCG-main/geneIndex.config.yaml \
--geneWorkDir /home/ltan/Tmp/01-PanCNSGene_test_data/04-GeneIndex \
--blastn_threads 7 > CnsIndex.log 2>&1 &
```
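The `CnsIndex` command takes the `GeneIndex` config and work directory as inputs, so it seems reasonable to run `GeneIndex` to completion first. The sketch below builds both commands from the flags shown above and runs them in order. The helper names are hypothetical, every path in the example is a placeholder, and the assumption that `CnsIndex` needs finished `GeneIndex` output is an inference, not something stated here.

```python
import subprocess


def gene_index_cmd(pancg, config, work_dir, reference):
    """Argument list for the GeneIndex subcommand."""
    return ["python", pancg, "GeneIndex",
            "--config", config,
            "--workDir", work_dir,
            "--reference", reference]


def cns_index_cmd(pancg, config, work_dir, reference,
                  gene_config, gene_work_dir, blastn_threads):
    """Argument list for the CnsIndex subcommand."""
    return ["python", pancg, "CnsIndex",
            "--config", config,
            "--workDir", work_dir,
            "--reference", reference,
            "--geneConfig", gene_config,
            "--geneWorkDir", gene_work_dir,
            "--blastn_threads", str(blastn_threads)]


def run_in_order(commands):
    """Run each command and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths; replace them with your own.
    run_in_order([
        gene_index_cmd("panCG.py", "geneIndex.config.yaml",
                       "04-GeneIndex", "C_sinensis"),
        cns_index_cmd("panCG.py", "cnsIndex.config.yaml", "05-cnsIndex",
                      "C_sinensis", "geneIndex.config.yaml",
                      "04-GeneIndex", 7),
    ])
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops the second step from running on incomplete `GeneIndex` output.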
## Strategy
### panCNS
1. **cnsMapMerge**: Run halLiftover, blastn, and JCVI on the CNS sets of each pair of species. Discard low-similarity blastn hits, then use the halLiftover results to filter both the blastn hits and the JCVI `lifted.anchors`. A pair is kept only if it is supported by halLiftover and blastn, or by halLiftover and JCVI (`lifted.anchors`).
2. Merge the filtered blastn and filtered JCVI map relationships to obtain the CNS map relationship between the two species.
3. **cnsClustering**: Use the pairwise map relationships obtained above to cluster the CNSs into groups.
4. **cnsIndexAssign**: Taking the species as references in the given order, use the JCVI `lifted.anchors` to subdivide each group into indices.
5. **cnsIndexMerge**: Merge indices within the same group that are too similar, giving the final indices.
6. **cnsRecall**: For each index, find the species in which the CNS is missing, map the reference CNS to those species, and recall the CNS if the threshold is met.
7. **cnsIndexSort**: Sort the indices by their position in the genome.
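Steps 1 to 3 can be illustrated with plain sets of CNS ID pairs. This is a sketch under stated assumptions, not panCG's implementation: the function names are hypothetical, and clustering by connected components (via union-find) is an assumed method, since the clustering algorithm is not specified here.

```python
def merge_filtered(blastn_pairs, jcvi_pairs, hal_pairs):
    """Keep pairs supported by halLiftover and blastn, or halLiftover and JCVI."""
    hal = set(hal_pairs)
    return (set(blastn_pairs) & hal) | (set(jcvi_pairs) & hal)


def cluster_pairs(pairs):
    """Group IDs into connected components (assumed clustering method)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    hal = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
    blastn = {("a1", "b1"), ("a2", "b9")}   # ("a2", "b9") lacks halLiftover support
    jcvi = {("a3", "b3"), ("a4", "b4")}     # ("a4", "b4") lacks halLiftover support
    ab = merge_filtered(blastn, jcvi, hal)
    print(cluster_pairs(ab | {("b1", "c1")}))
```

In the example, the pairs without halLiftover support are dropped, and the extra pair `("b1", "c1")` from another species comparison pulls `c1` into the same group as `a1` and `b1`.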
### panGene
1. **geneMapMerge**: Run DIAMOND between species, then run JCVI on the DIAMOND results.
2. **GeneClustering**: Run OrthoFinder on all species to obtain groups. The genes assigned to each group are written to separate files.
3. **geneIndexAssign**: Use the JCVI results to assign preliminary indices within each group.
4. **geneIndexMerge**: For an index containing only one species, find the best match of each gene and place the gene into the index of that match. If the best match is not in the group, place the gene into `Un`. For an index containing more than one species, traverse the other indices in the group looking for a similar index. If one exists, merge the two. If not, keep the index separate. A `Group` value of the form `OG0019373.Un` denotes the genes of the original `OG0019373` group that have no collinearity and whose best hit is no longer in the group.
`Synteny.gene.overlap.graph.pkl`: the new set of syntenic gene pairs obtained by integrating the JCVI `anchors` and `lifted.anchors`.
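As an illustration of combining the two JCVI files, the sketch below reads gene pairs from anchors-style lines and builds an undirected adjacency map. It assumes the usual JCVI layout, with block separator lines starting with `#` and the two gene IDs in the first two columns. The function names are hypothetical, and the actual structure stored in the `.pkl` file is not described here.

```python
def read_anchor_pairs(lines):
    """Collect (gene_a, gene_b) pairs, skipping blank and '#' separator lines."""
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        pairs.add((fields[0], fields[1]))
    return pairs


def build_adjacency(*pair_sets):
    """Union several pair sets into an undirected adjacency map."""
    graph = {}
    for pairs in pair_sets:
        for a, b in pairs:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


if __name__ == "__main__":
    anchors = ["###", "g1\th1\t500", "g2\th2\t300"]
    lifted = ["###", "g1\th1\t500", "g3\th3\t120L"]
    graph = build_adjacency(read_anchor_pairs(anchors),
                            read_anchor_pairs(lifted))
    for gene in sorted(graph):
        print(gene, sorted(graph[gene]))
```

Pairs present in both files are stored once, and pairs found only in `lifted.anchors` extend the graph.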
Raw data
{
"_id": null,
"home_page": "https://github.com/rejo27",
"name": "panCG",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "python, panCG, windows, mac, linux",
"author": "ltan",
"author_email": "leitan1127@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/65/1d/eab825ad54372ffee87f0d09edc56e8943bf074c0905345c89c8015fa87e/pancg-0.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "xxx",
"version": "0.0.2",
"project_urls": {
"Homepage": "https://github.com/rejo27"
},
"split_keywords": [
"python",
" pancg",
" windows",
" mac",
" linux"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "093db4ed10d1d0a6ad788488a4433fcf59dd558399c8a6183b3ae8ee89cebc87",
"md5": "953036465cfbad4ffbff03d6edb15cca",
"sha256": "1edfb9c880e0ef700e48dc56261ac7d92318a31ce176b7a2a92af4fdc99760c1"
},
"downloads": -1,
"filename": "panCG-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "953036465cfbad4ffbff03d6edb15cca",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 84313,
"upload_time": "2025-07-28T08:36:38",
"upload_time_iso_8601": "2025-07-28T08:36:38.700162Z",
"url": "https://files.pythonhosted.org/packages/09/3d/b4ed10d1d0a6ad788488a4433fcf59dd558399c8a6183b3ae8ee89cebc87/panCG-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "651deab825ad54372ffee87f0d09edc56e8943bf074c0905345c89c8015fa87e",
"md5": "652097b79c2a2ef9fe8bdadb5793c2f9",
"sha256": "4f0f40319b709df4dfa228e967cdab379e1575f50f820eb5bddf1b61952589ef"
},
"downloads": -1,
"filename": "pancg-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "652097b79c2a2ef9fe8bdadb5793c2f9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 70855,
"upload_time": "2025-07-28T08:36:39",
"upload_time_iso_8601": "2025-07-28T08:36:39.975308Z",
"url": "https://files.pythonhosted.org/packages/65/1d/eab825ad54372ffee87f0d09edc56e8943bf074c0905345c89c8015fa87e/pancg-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-28 08:36:39",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "pancg"
}