[DOI: 10.5281/zenodo.16748708](https://doi.org/10.5281/zenodo.16748708)
# PhenGO
## Overview
This project provides a unified Python tool that generates ready-to-use WEKA ARFF files for machine learning applications involving gene essentiality prediction.
The tool integrates phenotype data and Gene Ontology (GO) annotations for genes from selected model organisms, streamlining the data preparation process.
## Purpose
The main goal of this project is to simplify and standardise the creation of ARFF files that combine phenotype information with GO-mapped gene data.
This enables researchers to efficiently apply machine learning techniques (using WEKA or similar platforms) to analyse gene essentiality and related biological questions across various model organisms.
## Features
- **Unified Workflow:** Handles data collection, integration, and formatting in a single pipeline.
- **Model Organism Support:** Designed for commonly studied organisms (e.g., *Saccharomyces cerevisiae*, *Mus musculus*).
- **GO Annotation Integration:** Maps genes to their GO terms for comprehensive feature representation, and traverses the GO OBO file to add each term's parent terms.
- **Phenotype Data Inclusion:** Incorporates phenotype labels for supervised learning tasks.
- **WEKA ARFF Output:** Produces files in the ARFF format, ready for immediate use in WEKA.
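
The parent-term step in the feature list can be pictured with a short sketch. This is only an illustration under stated assumptions, not PhenGO's own code: the function names `parse_is_a` and `ancestors` are hypothetical, the OBO text is a toy snippet, and only `is_a` relations are followed (whether PhenGO also follows other relations such as `part_of` is not stated here).

```python
from collections import defaultdict

# Toy OBO text for illustration only; a real go.obo file is far larger.
TOY_OBO = """\
[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0007049
name: cell cycle
is_a: GO:0009987 ! cellular process

[Typedef]
id: part_of
name: part of
"""

def parse_is_a(obo_text):
    """Return {term_id: set(parent_ids)} built from is_a lines in [Term] stanzas."""
    parents = defaultdict(set)
    in_term = False
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:].strip()
            parents[current]  # register the term even if it has no parents
        elif in_term and current and line.startswith("is_a: "):
            # Drop the trailing "! name" comment from the is_a line.
            parents[current].add(line[6:].split("!")[0].strip())
    return dict(parents)

def ancestors(term, parents):
    """Collect every term reachable from `term` by following is_a links upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

parents = parse_is_a(TOY_OBO)
print(sorted(ancestors("GO:0007049", parents)))  # ['GO:0008150', 'GO:0009987']
```

A gene annotated with `GO:0007049` would therefore also receive the two ancestor terms as features in this sketch.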
## Installation
To install the PhenGO package, you can use pip:
```bash
pip install phengo
```
## Usage
### PhenGO Package:
#### PhenGO Example:
```commandline
PhenGO -species fly \
  -phenotype_file data/fly/phenotype_data/2017/allele_phenotypic_data_fb_2017_05.tsv.gz \
  -gene_association_file data/fly/gene_association/2017/gene_association_2017_05.fb.gz \
  -go_obo_file data/go/2017/go_2017-05-01.obo.gz \
  -output_dir Documents/PhenGO/fly_2017
```
The output will be saved in the specified output directory, which will contain the ARFF file and other relevant data files.
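
For readers unfamiliar with ARFF, the following sketch shows roughly what a gene-by-GO-term file could look like. It is an assumption-laden illustration: the exact layout PhenGO writes (attribute naming, whether gene identifiers are included, the class attribute name) is not documented in this README, and the function `to_arff` and the relation name are hypothetical. The `lethal`/`viable` labels are taken from the Compare-ARFF sample output.

```python
def to_arff(relation, gene_terms, labels):
    """Render genes as rows of binary GO-term attributes plus a class label.

    Assumed layout for illustration; PhenGO's real output may differ.
    """
    terms = sorted({t for ts in gene_terms.values() for t in ts})
    lines = [f"@relation {relation}", ""]
    # Quote attribute names because GO identifiers contain a colon.
    lines += [f"@attribute '{t}' {{0,1}}" for t in terms]
    lines += ["@attribute class {lethal,viable}", "", "@data"]
    for gene in sorted(gene_terms):
        row = ["1" if t in gene_terms[gene] else "0" for t in terms]
        lines.append(",".join(row + [labels[gene]]))
    return "\n".join(lines) + "\n"

# Hypothetical genes and annotations.
gene_terms = {
    "GeneA": {"GO:0008150"},
    "GeneB": {"GO:0008150", "GO:0003674"},
}
labels = {"GeneA": "lethal", "GeneB": "viable"}

arff_text = to_arff("toy_essentiality", gene_terms, labels)
print(arff_text)
```

Each GO term becomes one nominal `{0,1}` attribute, and the final attribute carries the phenotype label used for supervised learning.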
#### Menu:
```bash
usage: PhenGO.py [-h] -species SPECIES -phenotype_file PHENOTYPE_FILE
-gene_association_file GENE_ASSOCIATION_FILE -go_obo_file
GO_OBO_FILE -output_dir OUTPUT_DIR [-filter_unused_gos]
[-filter_mixed_terms] [-gene_go_pheno]
[-fly_assignments FLY_ASSIGNMENTS]
[-driver_lines DRIVER_LINES] [-filt_with]
[-worm_phenotypes WORM_PHENOTYPES]
[-mouse_phenotypes MOUSE_PHENOTYPES] [-v]
PhenGO v0.1.2 - Convert phenotype and GO data to ARFF format
Required Options:
-species SPECIES Species tag (e.g., fly, yeast)
-phenotype_file PHENOTYPE_FILE
Path to the phenotype data file (.gz)
-gene_association_file GENE_ASSOCIATION_FILE
Path to the gene association file (.gz)
-go_obo_file GO_OBO_FILE
Path to the go.obo file
-output_dir OUTPUT_DIR
Output directory
Optional parameters:
-filter_unused_gos Filter out unused GO terms from the FUNC and ARFF
output (default: True)
-filter_mixed_terms Filter out genes which have both lethal and viable
phenotypes - Terms not specifically lethal/viable are
not counted in this (default: False)
-gene_go_pheno Output "Gene-GO-Phenotype" (Rbbp5 GO:0003674 0) file
for overrepresentation analysis with tools such as
FUNC (default: False)
Fly specific parameters:
-fly_assignments FLY_ASSIGNMENTS
Provide TSV file of fly assignments (file confirming
genes are assignment to drosophila melanogaster
(default: "data/fly/FlyBase_Fields_2017.txt.gz")
-driver_lines DRIVER_LINES
Provide TSV file of fly driver lines (file containing
the name of driver lines (RNAi) to ignore when present
with the "with" tag (default: "data/fly/FlyBase_Driver
Line_Fields_2025_08_05.txt.gz")
-filt_with Filter out phenotype with "with" tag (default: DO NOT
FILTER)
Worm specific parameters:
-worm_phenotypes WORM_PHENOTYPES
Provide TSV file of worm phenotypes (default:
"data/worm/WS297_lethal_terms.tsv.gz")
Mouse specific parameters:
-mouse_phenotypes MOUSE_PHENOTYPES
Provide TSV file of mouse phenotypes (default:
"data/mouse/mouse_lethal_terms.txt.gz")
Misc:
-v, --version show program's version number and exit
```
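
The effect of `-filter_mixed_terms` described in the help text can be sketched as follows. This is a minimal illustration, assuming phenotypes have already been reduced to plain labels such as `lethal` and `viable`; the function `filter_mixed` and the gene names are hypothetical, and how PhenGO itself classifies raw phenotype terms is not shown in this README.

```python
def filter_mixed(gene_phenotypes):
    """Drop genes that carry both a lethal and a viable phenotype.

    Labels other than 'lethal' and 'viable' are ignored when deciding,
    mirroring the note in the help text.
    """
    kept = {}
    for gene, phenos in gene_phenotypes.items():
        if "lethal" in phenos and "viable" in phenos:
            continue  # mixed evidence: exclude this gene
        kept[gene] = phenos
    return kept

# Hypothetical input for illustration.
gene_phenotypes = {
    "GeneA": {"lethal"},
    "GeneB": {"lethal", "viable"},       # mixed, removed
    "GeneC": {"viable", "sterile"},      # 'sterile' does not count as mixed
}

print(sorted(filter_mixed(gene_phenotypes)))  # ['GeneA', 'GeneC']
```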
### Compare-ARFF:
```commandline
usage: compare_arff_genes.py [-h] -arff_a ARFF_A -arff_b ARFF_B -o OUTPUT
PhenoGO v0.1.2 - Compare-ARFF: Compare two ARFF files.
options:
-h, --help show this help message and exit
-arff_a ARFF_A Master ARFF file (reference)
-arff_b ARFF_B Comparison ARFF file
-o OUTPUT Output CSV file
```
**Output:**
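The output of `compare-arff` is a CSV file that summarises the comparison between the two ARFF files. Each row reports one gene, its label in each file, any GO terms that differ, and a status such as `MISSING_IN_B`, `LABEL_MISMATCH`, `GO_TERM_MISMATCH`, or `EXACT_MATCH`.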
```commandline
Gene,Label A,Label B,GO Terms Differ,Status
GeneA,lethal,,,"MISSING_IN_B"
GeneB,lethal,viable,,"LABEL_MISMATCH"
GeneC,viable,viable,GO:0008150;GO:0003674,"GO_TERM_MISMATCH"
GeneD,viable,viable,,"EXACT_MATCH"
```
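
A comparison CSV of this shape can be summarised with Python's standard `csv` module. The sketch below assumes the column names shown in the sample above and a `;` separator inside the `GO Terms Differ` column; the function name `summarise` is hypothetical and not part of PhenGO.

```python
import csv
import io
from collections import Counter

# The sample output shown above, inlined so the sketch is self-contained.
SAMPLE_CSV = '''Gene,Label A,Label B,GO Terms Differ,Status
GeneA,lethal,,,"MISSING_IN_B"
GeneB,lethal,viable,,"LABEL_MISMATCH"
GeneC,viable,viable,GO:0008150;GO:0003674,"GO_TERM_MISMATCH"
GeneD,viable,viable,,"EXACT_MATCH"
'''

def summarise(csv_text):
    """Count rows per Status and collect the differing GO terms per gene."""
    counts = Counter()
    go_diffs = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["Status"]] += 1
        if row["GO Terms Differ"]:
            go_diffs[row["Gene"]] = row["GO Terms Differ"].split(";")
    return counts, go_diffs

counts, go_diffs = summarise(SAMPLE_CSV)
print(dict(counts))
print(go_diffs)  # {'GeneC': ['GO:0008150', 'GO:0003674']}
```

For a real run, the same function could be fed the contents of the file passed to `-o`.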
Raw data
{
"_id": null,
"home_page": "https://github.com/NickJD/PhenoGO",
"name": "PhenGO",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "phenotype, gene ontology, GO, WEKA, ARFF, machine learning, model organism",
"author": "Nicholas Dimonaco",
"author_email": "nicholas@dimonaco.co.uk",
"download_url": "https://files.pythonhosted.org/packages/9f/68/d34a934dbda06eff5fd42b2a8d4f6dc6e8e4ea3613f826d54e1999c289c1/phengo-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "PhenoGO: A tool to build WEKA-ready ARFF files from model organism phenotype and Gene Ontology (GO) annotations.",
"version": "0.1.2",
"project_urls": {
"Bug Tracker": "https://github.com/NickJD/PhenoGO/issues",
"Homepage": "https://github.com/NickJD/PhenoGO"
},
"split_keywords": [
"phenotype",
" gene ontology",
" go",
" weka",
" arff",
" machine learning",
" model organism"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "41c825d3e20e013e3650233b8c62272c71a3ee5dedca9bb186705890ab8879e4",
"md5": "f7a353da865a8a4a3f5ac05a8f8b5080",
"sha256": "11a51280b227e40bb222e56e5a7b2df8700062ddcd8f273e2b161351882ec42b"
},
"downloads": -1,
"filename": "phengo-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f7a353da865a8a4a3f5ac05a8f8b5080",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 28421,
"upload_time": "2025-08-06T18:39:08",
"upload_time_iso_8601": "2025-08-06T18:39:08.805255Z",
"url": "https://files.pythonhosted.org/packages/41/c8/25d3e20e013e3650233b8c62272c71a3ee5dedca9bb186705890ab8879e4/phengo-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "9f68d34a934dbda06eff5fd42b2a8d4f6dc6e8e4ea3613f826d54e1999c289c1",
"md5": "ca2f1f4d9d4a7759851eb8129b38d2bb",
"sha256": "b42ca678b84bd86129e79262572aee268729d71106a3dc2ac661e40fa230aced"
},
"downloads": -1,
"filename": "phengo-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "ca2f1f4d9d4a7759851eb8129b38d2bb",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 27867,
"upload_time": "2025-08-06T18:39:10",
"upload_time_iso_8601": "2025-08-06T18:39:10.277779Z",
"url": "https://files.pythonhosted.org/packages/9f/68/d34a934dbda06eff5fd42b2a8d4f6dc6e8e4ea3613f826d54e1999c289c1/phengo-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-06 18:39:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NickJD",
"github_project": "PhenoGO",
"github_not_found": true,
"lcname": "phengo"
}