genesummary

Name	genesummary
Version	0.3.2
home_page	None
Summary	A comprehensive Python package for retrieving detailed gene information from multiple public databases
upload_time	2025-08-08 18:17:51
maintainer	None
docs_url	None
author	None
requires_python	>=3.11
license	MIT License Copyright (c) 2025 Chun-Jie Liu Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	bioinformatics biology gene-annotation genetics genomics gwas pathways protein

# GeneInfo

A comprehensive Python package for retrieving detailed gene information from multiple public databases with robust error handling, batch processing capabilities, and modular architecture.

## Features

GeneInfo provides access to comprehensive gene annotation data through a unified interface:

### Core Gene Information
- **Basic gene data** - Gene symbols, Ensembl IDs, descriptions, genomic coordinates, biotypes
- **Transcripts** - All transcript variants with protein coding information and alternative splicing
- **Genomic location** - Chromosome coordinates, strand information, gene boundaries

### Functional Annotation
- **Protein domains** - Domain architecture from UniProt with evidence codes
- **Gene Ontology** - GO terms and annotations (Biological Process, Molecular Function, Cellular Component)
- **Pathways** - Reactome pathway associations and pathway hierarchies
- **Protein interactions** - Dual-source protein-protein interaction networks:
  - **BioGRID** - Experimental evidence with PubMed references (requires API key)
  - **STRING-db** - Computational predictions + experimental evidence (no API key required)

### Evolutionary Information
- **Homologs** - Paralogs and orthologs across species with similarity metrics
- **Cross-species mapping** - Gene orthology relationships and conservation scores

### Clinical & Disease Data
- **Clinical variants** - ClinVar pathogenic and benign variants with clinical significance
- **GWAS associations** - Genome-wide association study data from EBI GWAS Catalog
- **Disease phenotypes** - OMIM disease associations and phenotypic descriptions

### Advanced Features
- **Batch processing** - Concurrent processing of large gene lists (1000+ genes)
- **API key management** - Secure handling of NCBI Entrez, OMIM, and BioGRID API keys via environment variables or CLI
- **Graceful degradation** - Works without API keys with limited functionality (no ClinVar, OMIM, or BioGRID data)
- **Rate limiting** - Built-in API courtesy delays and error handling
- **Rich CLI** - Beautiful command-line interface with progress bars and tables
- **Export formats** - JSON, CSV output with detailed and summary views
- **Real data only** - No mock data fallbacks, returns null when data is inaccessible

## Installation

### Using uv (Recommended)
```bash
# Install from source
uv add git+https://github.com/chunjie-sam-liu/geneinfo.git

# Or clone and install locally
git clone https://github.com/chunjie-sam-liu/geneinfo.git
cd geneinfo
uv add -e .
```

### Using pip
```bash
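# Install the released package from PyPI (distribution name: genesummary)
pip install genesummary

# Or install from source
pip install git+https://github.com/chunjie-sam-liu/geneinfo.git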

# Or clone and install locally
git clone https://github.com/chunjie-sam-liu/geneinfo.git
cd geneinfo
pip install -e .
```

### Requirements
- Python 3.11+
- Internet connection for API access (offline mode available)

## Quick Start


### API Key Configuration

For accessing ClinVar (clinical variants), OMIM (phenotype data), and BioGRID (protein interactions), you'll need API keys:

1. **Create a `.env` file** in your project directory:
```bash
# API Keys for external services
OMIM_API_KEY="your_omim_api_key_here"
ENTREZ_API_KEY="your_entrez_api_key_here"
ENTREZ_EMAIL="your.email@example.com"
BIOGRID_API_KEY="your_biogrid_api_key_here"
```

2. **Get API keys**:
   - **OMIM API Key**: Register at [OMIM API](https://omim.org/api)
   - **Entrez API Key**: Register at [NCBI API](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
   - **BioGRID API Key**: Register at [BioGRID API](https://wiki.thebiogrid.org/doku.php/biogridrest)

3. **API key priority**:
   - CLI arguments (highest priority)
   - Environment variables from `.env` file
   - None (graceful degradation - returns null data)
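The priority order above can be sketched in a few lines. This is a minimal illustration and not the package's actual implementation: `resolve_api_key` is a hypothetical helper, and only the environment variable names are taken from the `.env` example.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str],
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper: CLI value first, then environment, else None."""
    if cli_value:                      # 1. an explicit CLI argument wins
        return cli_value
    env_value = environ.get(env_name)  # 2. fall back to the environment
    if env_value:
        return env_value
    return None                        # 3. no key: the caller degrades gracefully


# A fake environment keeps the example free of side effects
fake_env = {"OMIM_API_KEY": "from-env"}
print(resolve_api_key("from-cli", "OMIM_API_KEY", fake_env))  # from-cli
print(resolve_api_key(None, "OMIM_API_KEY", fake_env))        # from-env
print(resolve_api_key(None, "BIOGRID_API_KEY", fake_env))     # None
```

Passing the environment as a parameter makes the lookup easy to test without touching real credentials.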

### Python API

```python
from geneinfo import GeneInfo

# Option 1: Use environment variables (recommended)
# Create .env file with API keys (see above)
gene_info = GeneInfo()

# Option 2: Provide API keys explicitly
gene_info = GeneInfo(
    email="your.email@example.com",
    entrez_api_key="your_entrez_key",
    omim_api_key="your_omim_key",
    biogrid_api_key="your_biogrid_key"
)

# Option 3: Work without API keys (limited functionality)
gene_info = GeneInfo(
    email=None,
    entrez_api_key=None,
    omim_api_key=None,
    biogrid_api_key=None
)

# Get comprehensive information for a single gene
result = gene_info.get_gene_info("TP53")
print(f"Gene: {result['basic_info']['display_name']}")
print(f"Description: {result['basic_info']['description']}")
print(f"Chromosome: {result['basic_info']['seq_region_name']}")
print(f"Transcripts: {len(result['transcripts'])}")
print(f"GO terms: {len(result['gene_ontology'])}")
print(f"Pathways: {len(result['pathways'])}")
print(f"Protein interactions: {len(result['protein_interactions'])} (BioGRID + STRING-db)")
print(f"Clinical variants: {len(result['clinvar'])} (requires API key)")

# Batch process multiple genes with concurrent workers
genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
df = gene_info.get_batch_info(genes, max_workers=5)
print(df[['gene_symbol', 'chromosome', 'transcript_count', 'go_term_count']].head())

# Export detailed information to JSON
gene_info.export_detailed_info(genes, "detailed_results.json")

# Export to organized directory structure
gene_info.export_batch_to_directory(genes, "gene_data/", max_workers=5)
```

### Advanced Usage

```python
from geneinfo import GeneInfo

# Process large gene lists efficiently
with open("large_gene_list.txt") as f:
    gene_list = [line.strip() for line in f if line.strip()]

# Initialize with API keys for full functionality
gene_info = GeneInfo(
    email="researcher@university.edu",
    entrez_api_key="your_entrez_key",
    omim_api_key="your_omim_key",
    biogrid_api_key="your_biogrid_key"
)

# Batch processing with progress tracking
df = gene_info.get_batch_info(gene_list, max_workers=10)

# Filter successful results
successful = df[df['error'].isna()]
print(f"Successfully processed {len(successful)}/{len(gene_list)} genes")

# Access specific data types
for _, gene in successful.iterrows():
    detailed = gene_info.get_gene_info(gene['query'])

    # Protein domains
    if detailed['protein_domains']:
        print(f"\n{gene['gene_symbol']} protein domains:")
        for domain in detailed['protein_domains'][:3]:
            print(f"  - {domain['name']}: {domain['start']}-{domain['end']}")

    # Protein interactions (dual sources)
    if detailed['protein_interactions']:
        biogrid_interactions = [i for i in detailed['protein_interactions']
                              if i.get('source_database') == 'BioGRID']
        stringdb_interactions = [i for i in detailed['protein_interactions']
                               if i.get('source_database') == 'STRING-db']
        print(f"  - {len(biogrid_interactions)} BioGRID interactions (experimental)")
        print(f"  - {len(stringdb_interactions)} STRING-db interactions (computational)")

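    # Clinical variants (requires Entrez API key)
    if detailed['clinvar']:
        # Substring match also counts "Likely pathogenic"; conflicting calls are skipped
        pathogenic = [
            v for v in detailed['clinvar']
            if 'pathogenic' in (sig := (v.get('clinical_significance') or '').lower())
            and 'conflicting' not in sig
        ]
        print(f"  - {len(pathogenic)} pathogenic variants found")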

# Working without API keys (limited functionality)
gene_info_limited = GeneInfo(
    entrez_api_key=None,
    omim_api_key=None,
    biogrid_api_key=None
)

# This will still work but return empty for clinical/phenotype data
result = gene_info_limited.get_gene_info("TP53")
print(f"Basic info available: {bool(result['basic_info'])}")
print(f"Protein interactions: {len(result['protein_interactions'])} (STRING-db only)")
print(f"Clinical variants: {len(result['clinvar'])} (empty without API key)")
print(f"OMIM phenotypes: {bool(result['phenotypes'])} (empty without API key)")
```

### Command Line Interface

```bash
# Single gene information with rich output
geneinfo --gene TP53 --output tp53_info.json

# Using API keys via CLI arguments
geneinfo --gene TP53 --entrez-api-key YOUR_ENTREZ_KEY --omim-api-key YOUR_OMIM_KEY --biogrid-api-key YOUR_BIOGRID_KEY --output tp53_info.json

# Using environment variables (recommended - create .env file)
geneinfo --gene TP53 --output tp53_info.json

# Process multiple genes from file
geneinfo --file genes.txt --output results.csv

# Detailed information in JSON format
geneinfo --gene BRCA1 --detailed --output brca1_detailed.json

# Batch processing with custom workers and API keys
geneinfo --file large_gene_list.txt --workers 10 \
  --entrez-api-key YOUR_ENTREZ_KEY \
  --omim-api-key YOUR_OMIM_KEY \
  --biogrid-api-key YOUR_BIOGRID_KEY \
  --email your.email@example.com \
  --output batch_results.csv

# Export to organized directory structure
geneinfo --file genes.txt --output-dir gene_analysis/ --workers 8

# Verbose output for debugging
geneinfo --gene TP53 --verbose --detailed --output tp53_debug.json

# Process Ensembl IDs
geneinfo --gene ENSG00000141510 --output tp53_ensembl.json

# Species-specific queries (when supported)
geneinfo --gene TP53 --species human --output tp53_human.json

# Check CLI help for all options
geneinfo --help
```

### CLI Output Examples

The CLI provides beautiful, formatted output with:
- 📊 Progress bars for batch processing
- 🎨 Colored tables for gene information display
- ⚡ Real-time processing statistics
- 📝 Summary reports with success/failure counts
- 🔍 Verbose logging for troubleshooting

## Input Formats & Output

### Supported Input Formats
The package accepts multiple gene identifier formats:
- **Gene symbols**: `TP53`, `BRCA1`, `EGFR` (case-insensitive)
- **Ensembl Gene IDs**: `ENSG00000141510`, `ENSG00000012048`
- **Mixed lists**: Can process files containing both symbols and IDs
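To make the distinction between the two identifier types concrete, here is a small sketch of how a mixed list could be classified before querying. It is illustrative only: `classify_identifier` is a hypothetical helper, and the pattern assumes human Ensembl gene IDs (`ENSG` plus 11 digits, optionally followed by a version suffix).

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optional ".version" suffix
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}(\.\d+)?$", re.IGNORECASE)


def classify_identifier(raw: str) -> tuple[str, str]:
    """Return (kind, cleaned_id); kind is 'ensembl_id' or 'symbol'."""
    token = raw.strip()
    if ENSEMBL_GENE_RE.match(token):
        return "ensembl_id", token.upper()
    return "symbol", token  # symbols are left as typed


mixed = ["TP53", " ENSG00000141510 ", "brca1", ""]
parsed = [classify_identifier(item) for item in mixed if item.strip()]
print(parsed)
# [('symbol', 'TP53'), ('ensembl_id', 'ENSG00000141510'), ('symbol', 'brca1')]
```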

### Output Formats

#### Summary CSV Output
```csv
query,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error
TP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,71,1043,
BRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,45,892,
```
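A summary file in this layout can be post-processed with the standard library alone. The sketch below reuses the two sample rows shown above and assumes that an empty `error` column marks a successful lookup.

```python
import csv
import io

# The sample rows from the summary CSV above
summary_csv = """\
query,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error
TP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,71,1043,
BRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,45,892,
"""

rows = list(csv.DictReader(io.StringIO(summary_csv)))

# Keep only rows without an error message
ok_rows = [row for row in rows if not row["error"]]

# Gene length in base pairs (coordinates are inclusive)
gene_lengths = {
    row["gene_symbol"]: int(row["end_pos"]) - int(row["start_pos"]) + 1
    for row in ok_rows
}
print(gene_lengths)  # {'TP53': 19070, 'BRCA1': 125951}
```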

#### Detailed JSON Output
```json
{
  "query": "TP53",
  "basic_info": {
    "id": "ENSG00000141510",
    "display_name": "TP53",
    "description": "tumor protein p53",
    "seq_region_name": "17",
    "start": 7668421,
    "end": 7687490,
    "strand": -1,
    "biotype": "protein_coding"
  },
  "transcripts": [...],
  "protein_domains": [...],
  "gene_ontology": [...],
  "pathways": [...],
  "protein_interactions": [...],
  "paralogs": [...],
  "orthologs": [...],
  "clinvar": [...],
  "gwas": {...}
}
```
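Because sections may be null when a source is inaccessible, code that consumes a detailed record should not assume every key holds a list. A minimal sketch of null-safe counting follows; `count_sections` is a hypothetical helper and the record contents are made up for illustration.

```python
from typing import Any

# List-valued sections from the detailed JSON layout above
LIST_SECTIONS = (
    "transcripts", "protein_domains", "gene_ontology",
    "pathways", "protein_interactions", "clinvar",
)


def count_sections(record: dict[str, Any]) -> dict[str, int]:
    """Count entries per section, treating missing or null sections as 0."""
    return {name: len(record.get(name) or []) for name in LIST_SECTIONS}


record = {
    "query": "TP53",
    "basic_info": {"id": "ENSG00000141510", "display_name": "TP53"},
    "transcripts": [{"id": "t1"}, {"id": "t2"}],  # made-up entries
    "gene_ontology": [],
    "clinvar": None,  # e.g. no Entrez API key configured
}
counts = count_sections(record)
print(counts["transcripts"], counts["clinvar"], counts["pathways"])  # 2 0 0
```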

#### Directory Export Structure
```
gene_data/
├── summary.csv              # Overview of all processed genes
├── TP53_ENSG00000141510.json
├── BRCA1_ENSG00000012048.json
└── EGFR_ENSG00000146648.json
```
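An exported directory can be read back with `pathlib` and `json`. The sketch below assumes the `SYMBOL_ENSEMBLID.json` naming shown in the tree above; `load_gene_records` is a hypothetical helper, and the demo writes a throwaway directory so it runs without any real export.

```python
import json
import tempfile
from pathlib import Path


def load_gene_records(export_dir: str) -> dict[str, dict]:
    """Map gene symbol -> parsed JSON for files named SYMBOL_ENSEMBLID.json."""
    records = {}
    for path in sorted(Path(export_dir).glob("*_ENSG*.json")):
        symbol, _, _ensembl_id = path.stem.rpartition("_")
        with path.open(encoding="utf-8") as handle:
            records[symbol] = json.load(handle)
    return records


# Demo on a temporary directory that mimics the export layout
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "TP53_ENSG00000141510.json").write_text(
        '{"query": "TP53"}', encoding="utf-8"
    )
    (Path(tmp) / "summary.csv").write_text("query\nTP53\n", encoding="utf-8")
    demo = load_gene_records(tmp)

print(sorted(demo))  # ['TP53']
```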

## Data Sources & Architecture

### Primary Data Sources
- **🧬 Ensembl** - Gene annotation, transcripts, genomic coordinates, homologs
- **🔬 UniProt** - Protein domains, functional annotations, protein features
- **🎯 Gene Ontology** - GO term annotations and functional classifications
- **🛤️ Reactome** - Biological pathways and pathway hierarchies
- **🏥 ClinVar** - Clinical variant classifications and disease associations
- **🧪 EBI GWAS Catalog** - Genome-wide association study results
- **💊 OMIM** - Mendelian disorders and phenotype-genotype relationships
- **📚 MyGene.info** - Enhanced gene annotation aggregation
- **🔗 BioGRID** - Experimental protein-protein interactions with evidence
- **🌐 STRING-db** - Computational + experimental protein interaction networks

### Modular Fetcher Architecture

The package uses a modular design with specialized fetchers:

```python
# Genomic data fetchers
from geneinfo.fetchers.genomic import EnsemblFetcher, MyGeneFetcher

# Protein data fetchers
from geneinfo.fetchers.protein import UniProtFetcher, StringDBFetcher, BioGRIDFetcher

# Functional annotation fetchers
from geneinfo.fetchers.functional import GOFetcher, ReactomeFetcher

# Clinical data fetchers
from geneinfo.fetchers.clinical import ClinVarFetcher, GwasFetcher, OMIMFetcher
```

### Robust Error Handling
- 🔄 **Graceful degradation** - Returns null data when APIs are unavailable or API keys missing
- ⏱️ **Rate limiting** with respectful API usage
- 🛡️ **SSL/TLS handling** for various certificate configurations
- 📝 **Comprehensive logging** with different verbosity levels
- 🔍 **Input validation** for gene symbols and Ensembl IDs
- 🔑 **API key management** - Secure environment variable handling
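The idea behind graceful degradation can be illustrated with a small retry wrapper that returns `None` instead of raising once all attempts fail. This is a sketch under stated assumptions, not the package's code: `fetch_or_none`, the retry count, and the linear back-off are all hypothetical choices.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("geneinfo.sketch")  # hypothetical logger name


def fetch_or_none(
    fetch: Callable[[], T], retries: int = 3, delay: float = 0.5
) -> Optional[T]:
    """Call `fetch`; retry on failure and return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: any failure degrades to None
            logger.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt < retries:
                time.sleep(delay * attempt)  # simple linear back-off
    return None


calls = {"n": 0}


def flaky() -> dict:
    """Fails twice, then succeeds (simulates a transient network error)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")
    return {"gene": "TP53"}


def always_down() -> dict:
    raise ConnectionError("service unavailable")


result = fetch_or_none(flaky, retries=3, delay=0)
missing = fetch_or_none(always_down, retries=2, delay=0)
print(result, missing)  # {'gene': 'TP53'} None
```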

## Performance & Usage Examples

### Performance Characteristics
- **Throughput**: ~100-500 genes/minute (network dependent)
- **Concurrency**: Configurable worker threads (default: 5, max recommended: 10)
- **Memory**: Efficient streaming processing for large gene lists
- **Rate limiting**: Built-in delays to respect API usage policies
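The combination of worker threads, courtesy delays, and per-gene error capture can be sketched with `concurrent.futures`. This is an illustration of the general pattern only: `fake_fetch` stands in for a real network call, and `run_batch` is a hypothetical function rather than the package's `get_batch_info`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(gene: str) -> dict:
    """Stand-in for a network call; fails for one made-up symbol."""
    time.sleep(0.01)  # illustrative courtesy delay per request
    if gene == "NOT_A_GENE":
        raise ValueError("unknown gene")
    return {"query": gene, "error": None}


def run_batch(genes: list[str], max_workers: int = 5) -> list[dict]:
    """Fetch all genes concurrently; record failures instead of raising."""
    results: list[dict] = [{} for _ in genes]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fake_fetch, g): i for i, g in enumerate(genes)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # one failure must not abort the batch
                results[index] = {"query": genes[index], "error": str(exc)}
    return results  # same order as the input list


batch = run_batch(["TP53", "NOT_A_GENE", "BRCA1"], max_workers=2)
failed = [row["query"] for row in batch if row["error"]]
print(failed)  # ['NOT_A_GENE']
```

Storing results by input index keeps the output aligned with the input list even though futures complete out of order.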

### Real-world Usage Examples

#### Cancer Gene Panel Analysis
```python
from geneinfo import GeneInfo

# Process a cancer gene panel with API keys for clinical data
cancer_genes = ["TP53", "BRCA1", "BRCA2", "EGFR", "KRAS", "PIK3CA", "AKT1"]
gene_info = GeneInfo(
    email="researcher@university.edu",
    entrez_api_key="your_entrez_key",
    omim_api_key="your_omim_key",
    biogrid_api_key="your_biogrid_key"
)

results = gene_info.get_batch_info(cancer_genes)
# Filter for genes with clinical variants (requires Entrez API key)
cancer_variants = results[results['clinvar_count'] > 0]
print(f"Found clinical variants in {len(cancer_variants)} cancer genes")

# Analyze protein interaction networks
for gene in cancer_genes:
    detailed = gene_info.get_gene_info(gene)
    interactions = detailed['protein_interactions']
    if interactions:
        biogrid_count = len([i for i in interactions if i['source_database'] == 'BioGRID'])
        stringdb_count = len([i for i in interactions if i['source_database'] == 'STRING-db'])
        print(f"{gene}: {biogrid_count} experimental + {stringdb_count} predicted interactions")
```

#### Pathway Enrichment Preprocessing
```python
# Prepare data for pathway analysis (reuses the `gene_info` instance created earlier)
gene_list = ["TP53", "MDM2", "CDKN1A", "BAX", "BBC3"]  # p53 pathway genes
detailed_results = [gene_info.get_gene_info(gene) for gene in gene_list]

# Extract GO terms for enrichment analysis
all_go_terms = []
for result in detailed_results:
    for go_term in result['gene_ontology']:
        all_go_terms.append({
            'gene': result['query'],
            'go_id': go_term['go_id'],
            'go_name': go_term['go_name'],
            'namespace': go_term['namespace']
        })
```
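Once rows in this shape have been collected, a quick tally helps before handing them to an enrichment tool. The sketch below uses a few made-up rows with the same keys as `all_go_terms`; the `namespace` values shown are an assumption about how the package labels them.

```python
from collections import Counter, defaultdict

# Illustrative rows with the same keys as `all_go_terms`
all_go_terms = [
    {"gene": "TP53", "go_id": "GO:0006915", "go_name": "apoptotic process",
     "namespace": "biological_process"},
    {"gene": "BAX", "go_id": "GO:0006915", "go_name": "apoptotic process",
     "namespace": "biological_process"},
    {"gene": "TP53", "go_id": "GO:0003677", "go_name": "DNA binding",
     "namespace": "molecular_function"},
]

# How many annotations fall into each GO namespace
namespace_counts = Counter(term["namespace"] for term in all_go_terms)

# Which genes share each GO term
genes_by_term: dict[str, set[str]] = defaultdict(set)
for term in all_go_terms:
    genes_by_term[term["go_id"]].add(term["gene"])

print(namespace_counts["biological_process"])  # 2
print(sorted(genes_by_term["GO:0006915"]))     # ['BAX', 'TP53']
```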

#### Large-scale Genomics Project
```python
# Process GWAS significant genes (thousands of genes)
with open("gwas_significant_genes.txt") as f:
    gwas_genes = [line.strip() for line in f if line.strip()]  # 5000+ genes, blank lines skipped

# Process in batches with progress tracking
gene_info.export_batch_to_directory(
    gwas_genes,
    "gwas_gene_annotation/",
    max_workers=8
)
# Creates organized directory with individual files + summary
```

## Development & Testing

### Running Tests
```bash
# Install development dependencies
uv add --dev pytest pytest-cov pytest-asyncio

# Run test suite
uv run pytest

# Run with coverage
uv run pytest --cov=geneinfo --cov-report=html
```

### Project Structure
```
geneinfo/
├── geneinfo/
│   ├── __init__.py          # Main package exports
│   ├── core.py              # GeneInfo main class
│   ├── cli.py               # Command-line interface
│   ├── mock_data.py         # Fallback data for offline mode
│   └── fetchers/            # Modular data fetchers
│       ├── base.py          # Base fetcher with common functionality
│       ├── genomic.py       # Ensembl, MyGene fetchers
│       ├── protein.py       # UniProt, STRING-db, BioGRID fetchers
│       ├── functional.py    # GO, Reactome fetchers
│       └── clinical.py      # ClinVar, GWAS, OMIM fetchers
├── tests/                   # Comprehensive test suite
├── examples/                # Usage examples and demos
├── docs/                    # Documentation (you are here!)
└── pyproject.toml          # Modern Python packaging
```

### Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Follow the coding standards in `.github/copilot-instructions.md`
4. Add tests for new functionality
5. Run the test suite: `uv run pytest`
6. Submit a pull request

## Dependencies & Requirements

### Core Dependencies
- **Python 3.11+** - Modern Python features and type hints
- **requests** - HTTP client for API calls
- **pandas** - Data manipulation and analysis
- **numpy** - Numerical computing
- **typer** - CLI framework with rich features
- **rich** - Beautiful terminal output and progress bars
- **biopython** - Bioinformatics tools (for Entrez/ClinVar)
- **mygene** - Enhanced gene annotation client
- **python-dotenv** - Environment variable management for API keys

### System Requirements
- Internet connection for API access
- API keys for full functionality (NCBI Entrez, OMIM, BioGRID)
- Sufficient memory for large gene lists (typically <1GB for 10,000 genes)
- Email address for ClinVar/NCBI Entrez access (required when using API keys)

## Troubleshooting

### Common Issues

#### API Access Problems
```bash
# Test API connectivity
geneinfo --gene TP53 --verbose

# Working without API keys (limited functionality): omit the key flags
# and make sure the variables are not set in .env
geneinfo --gene TP53 --output results.json
```

#### API Key Configuration
```bash
# Check if API keys are being loaded correctly
geneinfo --gene TP53 --verbose

# Set API keys via environment variables (recommended)
echo 'ENTREZ_API_KEY="your_key_here"' > .env
echo 'OMIM_API_KEY="your_key_here"' >> .env
echo 'BIOGRID_API_KEY="your_key_here"' >> .env
echo 'ENTREZ_EMAIL="your.email@example.com"' >> .env

# Or pass via CLI
geneinfo --gene TP53 --entrez-api-key YOUR_ENTREZ_KEY --omim-api-key YOUR_OMIM_KEY --biogrid-api-key YOUR_BIOGRID_KEY --email your@email.com
```

#### Large Gene List Processing
```bash
# For very large lists, reduce concurrent workers
geneinfo --file huge_gene_list.txt --workers 3 --output results.csv

# Process in smaller batches if memory is limited
split -l 1000 huge_gene_list.txt batch_
```
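The same batching can be done from Python instead of `split`. This is a minimal sketch: `chunked` is a hypothetical helper, and the batch size of 1000 simply mirrors the shell example above.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Placeholder symbols standing in for a large gene list
genes = [f"GENE{i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(genes, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch can then be passed to a separate processing call, which bounds how many results are held in memory at once.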

### Getting Help
- 📖 Check the `examples/` directory for usage patterns
- 🐛 Report issues on GitHub with verbose output logs
- 💬 Include gene lists and error messages in bug reports
- 📧 Use `--verbose` flag for detailed debugging information

## License & Citation

### License
MIT License - see LICENSE file for details.

### Citation
If you use GeneInfo in your research, please cite:

```bibtex
@software{geneinfo2025,
  author = {Liu, Chunjie},
  title = {GeneInfo: Comprehensive Gene Information Retrieval},
  url = {https://github.com/chunjie-sam-liu/geneinfo},
  version = {0.3.2},
  year = {2025}
}
```

### Acknowledgments
This package aggregates data from multiple public biological databases. Please also cite the original data sources in your publications:

- **Ensembl**: Cunningham et al. (2022) Nucleic Acids Research
- **UniProt**: The UniProt Consortium (2023) Nucleic Acids Research
- **Gene Ontology**: Aleksander et al. (2023) Genetics
- **Reactome**: Gillespie et al. (2022) Nucleic Acids Research
- **ClinVar**: Landrum et al. (2020) Nucleic Acids Research
- **BioGRID**: Oughtred et al. (2021) Nucleic Acids Research
- **STRING**: Szklarczyk et al. (2023) Nucleic Acids Research
- **GWAS Catalog**: Sollis et al. (2023) Nucleic Acids Research

---

**Author**: Chunjie Liu
**Contact**: chunjie.sam.liu.at.gmail.com
**Version**: 0.3.2
**Date**: 2025-08-06

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "genesummary",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": "Chunjie Liu <chunjie.sam.liu@gmail.com>",
    "keywords": "bioinformatics, biology, gene-annotation, genetics, genomics, gwas, pathways, protein",
    "author": null,
    "author_email": "Chunjie Liu <chunjie.sam.liu@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/50/7f/e50e556d650da5ffc65fc44a7b3a65fddb205307609163b6d98e3ffca92b/genesummary-0.3.2.tar.gz",
    "platform": null,
    "description": "# GeneInfo\n\nA comprehensive Python package for retrieving detailed gene information from multiple public databases with robust error handling, batch processing capabilities, and modular architecture.\n\n## Features\n\nGeneInfo provides access to comprehensive gene annotation data through a unified interface:\n\n### Core Gene Information\n- **Basic gene data** - Gene symbols, Ensembl IDs, descriptions, genomic coordinates, biotypes\n- **Transcripts** - All transcript variants with protein coding information and alternative splicing\n- **Genomic location** - Chromosome coordinates, strand information, gene boundaries\n\n### Functional Annotation\n- **Protein domains** - Domain architecture from UniProt with evidence codes\n- **Gene Ontology** - GO terms and annotations (Biological Process, Molecular Function, Cellular Component)\n- **Pathways** - Reactome pathway associations and pathway hierarchies\n- **Protein interactions** - Dual-source protein-protein interaction networks:\n  - **BioGRID** - Experimental evidence with PubMed references (requires API key)\n  - **STRING-db** - Computational predictions + experimental evidence (no API key required)\n\n### Evolutionary Information\n- **Homologs** - Paralogs and orthologs across species with similarity metrics\n- **Cross-species mapping** - Gene orthology relationships and conservation scores\n\n### Clinical & Disease Data\n- **Clinical variants** - ClinVar pathogenic and benign variants with clinical significance\n- **GWAS associations** - Genome-wide association study data from EBI GWAS Catalog\n- **Disease phenotypes** - OMIM disease associations and phenotypic descriptions\n\n### Advanced Features\n- **Batch processing** - Concurrent processing of large gene lists (1000+ genes)\n- **API key management** - Secure handling of NCBI Entrez and OMIM API keys via environment variables or CLI\n- **Graceful degradation** - Works without API keys with limited functionality (no clinical/phenotype 
data)\n- **Rate limiting** - Built-in API courtesy delays and error handling\n- **Rich CLI** - Beautiful command-line interface with progress bars and tables\n- **Export formats** - JSON, CSV output with detailed and summary views\n- **Real data only** - No mock data fallbacks, returns null when data is inaccessible\n\n## Installation\n\n### Using uv (Recommended)\n```bash\n# Install from source\nuv add git+https://github.com/chunjie-sam-liu/geneinfo.git\n\n# Or clone and install locally\ngit clone https://github.com/chunjie-sam-liu/geneinfo.git\ncd geneinfo\nuv add -e .\n```\n\n### Using pip\n```bash\n# Install from source\npip install git+https://github.com/chunjie-sam-liu/geneinfo.git\n\n# Or clone and install locally\ngit clone https://github.com/chunjie-sam-liu/geneinfo.git\ncd geneinfo\npip install -e .\n```\n\n### Requirements\n- Python 3.11+\n- Internet connection for API access (offline mode available)\n\n## Quick Start\n\n\n### API Key Configuration\n\nFor accessing ClinVar (clinical variants), OMIM (phenotype data), and BioGRID (protein interactions), you'll need API keys:\n\n1. **Create a `.env` file** in your project directory:\n```bash\n# API Keys for external services\nOMIM_API_KEY=\"your_omim_api_key_here\"\nENTREZ_API_KEY=\"your_entrez_api_key_here\"\nENTREZ_EMAIL=\"your.email@example.com\"\nBIOGRID_API_KEY=\"your_biogrid_api_key_here\"\n```\n\n2. **Get API keys**:\n   - **OMIM API Key**: Register at [OMIM API](https://omim.org/api)\n   - **Entrez API Key**: Register at [NCBI API](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)\n   - **BioGRID API Key**: Register at [BioGRID API](https://wiki.thebiogrid.org/doku.php/biogridrest)\n\n3. 
**API key priority**:\n   - CLI arguments (highest priority)\n   - Environment variables from `.env` file\n   - None (graceful degradation - returns null data)\n\n### Python API\n\n```python\nfrom geneinfo import GeneInfo\n\n# Option 1: Use environment variables (recommended)\n# Create .env file with API keys (see above)\ngene_info = GeneInfo()\n\n# Option 2: Provide API keys explicitly\ngene_info = GeneInfo(\n    email=\"your.email@example.com\",\n    entrez_api_key=\"your_entrez_key\",\n    omim_api_key=\"your_omim_key\",\n    biogrid_api_key=\"your_biogrid_key\"\n)\n\n# Option 3: Work without API keys (limited functionality)\ngene_info = GeneInfo(\n    email=None,\n    entrez_api_key=None,\n    omim_api_key=None,\n    biogrid_api_key=None\n)\n\n# Get comprehensive information for a single gene\nresult = gene_info.get_gene_info(\"TP53\")\nprint(f\"Gene: {result['basic_info']['display_name']}\")\nprint(f\"Description: {result['basic_info']['description']}\")\nprint(f\"Chromosome: {result['basic_info']['seq_region_name']}\")\nprint(f\"Transcripts: {len(result['transcripts'])}\")\nprint(f\"GO terms: {len(result['gene_ontology'])}\")\nprint(f\"Pathways: {len(result['pathways'])}\")\nprint(f\"Protein interactions: {len(result['protein_interactions'])} (BioGRID + STRING-db)\")\nprint(f\"Clinical variants: {len(result['clinvar'])} (requires API key)\")\n\n# Batch process multiple genes with concurrent workers\ngenes = [\"TP53\", \"BRCA1\", \"EGFR\", \"MYC\", \"KRAS\"]\ndf = gene_info.get_batch_info(genes, max_workers=5)\nprint(df[['gene_symbol', 'chromosome', 'transcript_count', 'go_term_count']].head())\n\n# Export detailed information to JSON\ngene_info.export_detailed_info(genes, \"detailed_results.json\")\n\n# Export to organized directory structure\ngene_info.export_batch_to_directory(genes, \"gene_data/\", max_workers=5)\n```\n\n### Advanced Usage\n\n```python\n# Process large gene lists efficiently\nwith open(\"large_gene_list.txt\") as f:\n    gene_list = 
[line.strip() for line in f if line.strip()]\n\n# Initialize with API keys for full functionality\ngene_info = GeneInfo(\n    email=\"researcher@university.edu\",\n    entrez_api_key=\"your_entrez_key\",\n    omim_api_key=\"your_omim_key\",\n    biogrid_api_key=\"your_biogrid_key\"\n)\n\n# Batch processing with progress tracking\ndf = gene_info.get_batch_info(gene_list, max_workers=10)\n\n# Filter successful results\nsuccessful = df[df['error'].isna()]\nprint(f\"Successfully processed {len(successful)}/{len(gene_list)} genes\")\n\n# Access specific data types\nfor _, gene in successful.iterrows():\n    detailed = gene_info.get_gene_info(gene['query'])\n\n    # Protein domains\n    if detailed['protein_domains']:\n        print(f\"\\n{gene['gene_symbol']} protein domains:\")\n        for domain in detailed['protein_domains'][:3]:\n            print(f\"  - {domain['name']}: {domain['start']}-{domain['end']}\")\n\n    # Protein interactions (dual sources)\n    if detailed['protein_interactions']:\n        biogrid_interactions = [i for i in detailed['protein_interactions']\n                              if i.get('source_database') == 'BioGRID']\n        stringdb_interactions = [i for i in detailed['protein_interactions']\n                               if i.get('source_database') == 'STRING-db']\n        print(f\"  - {len(biogrid_interactions)} BioGRID interactions (experimental)\")\n        print(f\"  - {len(stringdb_interactions)} STRING-db interactions (computational)\")\n\n    # Clinical variants (requires Entrez API key)\n    if detailed['clinvar']:\n        pathogenic = [v for v in detailed['clinvar']\n                     if 'pathogenic' in v.get('clinical_significance', '').lower()]\n        print(f\"  - {len(pathogenic)} pathogenic variants found\")\n\n# Working without API keys (limited functionality)\ngene_info_limited = GeneInfo(\n    entrez_api_key=None,\n    omim_api_key=None,\n    biogrid_api_key=None\n)\n\n# This will still work but return empty for 
clinical/phenotype data\nresult = gene_info_limited.get_gene_info(\"TP53\")\nprint(f\"Basic info available: {bool(result['basic_info'])}\")\nprint(f\"Protein interactions: {len(result['protein_interactions'])} (STRING-db only)\")\nprint(f\"Clinical variants: {len(result['clinvar'])} (empty without API key)\")\nprint(f\"OMIM phenotypes: {bool(result['phenotypes'])} (empty without API key)\")\n```\n\n### Command Line Interface\n\n```bash\n# Single gene information with rich output\ngeneinfo --gene TP53 --output tp53_info.json\n\n# Using API keys via CLI arguments\ngeneinfo --gene TP53 --entrez-api-key YOUR_ENTREZ_KEY --omim-api-key YOUR_OMIM_KEY --biogrid-api-key YOUR_BIOGRID_KEY --output tp53_info.json\n\n# Using environment variables (recommended - create .env file)\ngeneinfo --gene TP53 --output tp53_info.json\n\n# Process multiple genes from file\ngeneinfo --file genes.txt --output results.csv\n\n# Detailed information in JSON format\ngeneinfo --gene BRCA1 --detailed --output brca1_detailed.json\n\n# Batch processing with custom workers and API keys\ngeneinfo --file large_gene_list.txt --workers 10 \\\n  --entrez-api-key YOUR_ENTREZ_KEY \\\n  --omim-api-key YOUR_OMIM_KEY \\\n  --biogrid-api-key YOUR_BIOGRID_KEY \\\n  --email your.email@example.com \\\n  --output batch_results.csv\n\n# Export to organized directory structure\ngeneinfo --file genes.txt --output-dir gene_analysis/ --workers 8\n\n# Verbose output for debugging\ngeneinfo --gene TP53 --verbose --detailed --output tp53_debug.json\n\n# Process Ensembl IDs\ngeneinfo --gene ENSG00000141510 --output tp53_ensembl.json\n\n# Species-specific queries (when supported)\ngeneinfo --gene TP53 --species human --output tp53_human.json\n\n# Check CLI help for all options\ngeneinfo --help\n```\n\n### CLI Output Examples\n\nThe CLI provides beautiful, formatted output with:\n- \ud83d\udcca Progress bars for batch processing\n- \ud83c\udfa8 Colored tables for gene information display\n- \u26a1 Real-time processing 
statistics\n- \ud83d\udcdd Summary reports with success/failure counts\n- \ud83d\udd0d Verbose logging for troubleshooting\n\n## Input Formats & Output\n\n### Supported Input Formats\nThe package accepts multiple gene identifier formats:\n- **Gene symbols**: `TP53`, `BRCA1`, `EGFR` (case-insensitive)\n- **Ensembl Gene IDs**: `ENSG00000141510`, `ENSG00000012048`\n- **Mixed lists**: Can process files containing both symbols and IDs\n\n### Output Formats\n\n#### Summary CSV Output\n```csv\nquery,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error\nTP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,71,1043,\nBRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,45,892,\n```\n\n#### Detailed JSON Output\n```json\n{\n  \"query\": \"TP53\",\n  \"basic_info\": {\n    \"id\": \"ENSG00000141510\",\n    \"display_name\": \"TP53\",\n    \"description\": \"tumor protein p53\",\n    \"seq_region_name\": \"17\",\n    \"start\": 7668421,\n    \"end\": 7687490,\n    \"strand\": -1,\n    \"biotype\": \"protein_coding\"\n  },\n  \"transcripts\": [...],\n  \"protein_domains\": [...],\n  \"gene_ontology\": [...],\n  \"pathways\": [...],\n  \"protein_interactions\": [...],\n  \"paralogs\": [...],\n  \"orthologs\": [...],\n  \"clinvar\": [...],\n  \"gwas\": {...}\n}\n```\n\n#### Directory Export Structure\n```\ngene_data/\n\u251c\u2500\u2500 summary.csv              # Overview of all processed genes\n\u251c\u2500\u2500 TP53_ENSG00000141510.json\n\u251c\u2500\u2500 BRCA1_ENSG00000012048.json\n\u2514\u2500\u2500 EGFR_ENSG00000073756.json\n```\n\n## Data Sources & Architecture\n\n### Primary Data Sources\n- **\ud83e\uddec Ensembl** - Gene annotation, transcripts, genomic coordinates, homologs\n- **\ud83d\udd2c UniProt** - Protein domains, functional annotations, protein features\n- **\ud83c\udfaf Gene Ontology** - GO term annotations and functional classifications\n- **\ud83d\udee4\ufe0f 
Reactome** - Biological pathways and pathway hierarchies\n- **\ud83c\udfe5 ClinVar** - Clinical variant classifications and disease associations\n- **\ud83e\uddea EBI GWAS Catalog** - Genome-wide association study results\n- **\ud83d\udc8a OMIM** - Mendelian disorders and phenotype-genotype relationships\n- **\ud83d\udcda MyGene.info** - Enhanced gene annotation aggregation\n- **\ud83d\udd17 BioGRID** - Experimental protein-protein interactions with evidence\n- **\ud83c\udf10 STRING-db** - Computational + experimental protein interaction networks\n\n### Modular Fetcher Architecture\n\nThe package uses a modular design with specialized fetchers:\n\n```python\n# Genomic data fetchers\nfrom geneinfo.fetchers.genomic import EnsemblFetcher, MyGeneFetcher\n\n# Protein data fetchers\nfrom geneinfo.fetchers.protein import UniProtFetcher, StringDBFetcher, BioGRIDFetcher\n\n# Functional annotation fetchers\nfrom geneinfo.fetchers.functional import GOFetcher, ReactomeFetcher\n\n# Clinical data fetchers\nfrom geneinfo.fetchers.clinical import ClinVarFetcher, GwasFetcher, OMIMFetcher\n```\n\n### Robust Error Handling\n- \ud83d\udd04 **Graceful degradation** - Returns null data when APIs are unavailable or API keys missing\n- \u23f1\ufe0f **Rate limiting** with respectful API usage\n- \ud83d\udee1\ufe0f **SSL/TLS handling** for various certificate configurations\n- \ud83d\udcdd **Comprehensive logging** with different verbosity levels\n- \ud83d\udd0d **Input validation** for gene symbols and Ensembl IDs\n- \ud83d\udd11 **API key management** - Secure environment variable handling\n\n## Performance & Usage Examples\n\n### Performance Characteristics\n- **Throughput**: ~100-500 genes/minute (network dependent)\n- **Concurrency**: Configurable worker threads (default: 5, max recommended: 10)\n- **Memory**: Efficient streaming processing for large gene lists\n- **Rate limiting**: Built-in delays to respect API usage policies\n\n### Real-world Usage Examples\n\n#### Cancer Gene Panel 
Analysis\n```python\n# Process a cancer gene panel with API keys for clinical data\ncancer_genes = [\"TP53\", \"BRCA1\", \"BRCA2\", \"EGFR\", \"KRAS\", \"PIK3CA\", \"AKT1\"]\ngene_info = GeneInfo(\n    email=\"researcher@university.edu\",\n    entrez_api_key=\"your_entrez_key\",\n    omim_api_key=\"your_omim_key\",\n    biogrid_api_key=\"your_biogrid_key\"\n)\n\nresults = gene_info.get_batch_info(cancer_genes)\n# Filter for genes with clinical variants (requires Entrez API key)\ncancer_variants = results[results['clinvar_count'] > 0]\nprint(f\"Found clinical variants in {len(cancer_variants)} cancer genes\")\n\n# Analyze protein interaction networks\nfor gene in cancer_genes:\n    detailed = gene_info.get_gene_info(gene)\n    interactions = detailed['protein_interactions']\n    if interactions:\n        biogrid_count = len([i for i in interactions if i['source_database'] == 'BioGRID'])\n        stringdb_count = len([i for i in interactions if i['source_database'] == 'STRING-db'])\n        print(f\"{gene}: {biogrid_count} experimental + {stringdb_count} predicted interactions\")\n```\n\n#### Pathway Enrichment Preprocessing\n```python\n# Prepare data for pathway analysis\ngene_list = [\"TP53\", \"MDM2\", \"CDKN1A\", \"BAX\", \"BBC3\"]  # p53 pathway genes\ndetailed_results = [gene_info.get_gene_info(gene) for gene in gene_list]\n\n# Extract GO terms for enrichment analysis\nall_go_terms = []\nfor result in detailed_results:\n    for go_term in result['gene_ontology']:\n        all_go_terms.append({\n            'gene': result['query'],\n            'go_id': go_term['go_id'],\n            'go_name': go_term['go_name'],\n            'namespace': go_term['namespace']\n        })\n```\n\n#### Large-scale Genomics Project\n```python\n# Process GWAS significant genes (thousands of genes)\nwith open(\"gwas_significant_genes.txt\") as f:\n    gwas_genes = [line.strip() for line in f]  # 5000+ genes\n\n# Process in batches with progress 
tracking\ngene_info.export_batch_to_directory(\n    gwas_genes,\n    \"gwas_gene_annotation/\",\n    max_workers=8\n)\n# Creates organized directory with individual files + summary\n```\n\n## Development & Testing\n\n### Running Tests\n```bash\n# Install development dependencies\nuv add --dev pytest pytest-cov pytest-asyncio\n\n# Run test suite\nuv run pytest\n\n# Run with coverage\nuv run pytest --cov=geneinfo --cov-report=html\n```\n\n### Project Structure\n```\ngeneinfo/\n\u251c\u2500\u2500 geneinfo/\n\u2502   \u251c\u2500\u2500 __init__.py          # Main package exports\n\u2502   \u251c\u2500\u2500 core.py              # GeneInfo main class\n\u2502   \u251c\u2500\u2500 cli.py               # Command-line interface\n\u2502   \u251c\u2500\u2500 mock_data.py         # Fallback data for offline mode\n\u2502   \u2514\u2500\u2500 fetchers/            # Modular data fetchers\n\u2502       \u251c\u2500\u2500 base.py          # Base fetcher with common functionality\n\u2502       \u251c\u2500\u2500 genomic.py       # Ensembl, MyGene fetchers\n\u2502       \u251c\u2500\u2500 protein.py       # UniProt, STRING-db, BioGRID fetchers\n\u2502       \u251c\u2500\u2500 functional.py    # GO, Reactome fetchers\n\u2502       \u2514\u2500\u2500 clinical.py      # ClinVar, GWAS, OMIM fetchers\n\u251c\u2500\u2500 tests/                   # Comprehensive test suite\n\u251c\u2500\u2500 examples/                # Usage examples and demos\n\u251c\u2500\u2500 docs/                    # Documentation (you are here!)\n\u2514\u2500\u2500 pyproject.toml          # Modern Python packaging\n```\n\n### Contributing\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature-name`\n3. Follow the coding standards in `.github/copilot-instructions.md`\n4. Add tests for new functionality\n5. Run the test suite: `uv run pytest`\n6. 
Submit a pull request\n\n## Dependencies & Requirements\n\n### Core Dependencies\n- **Python 3.11+** - Modern Python features and type hints\n- **requests** - HTTP client for API calls\n- **pandas** - Data manipulation and analysis\n- **numpy** - Numerical computing\n- **typer** - CLI framework with rich features\n- **rich** - Beautiful terminal output and progress bars\n- **biopython** - Bioinformatics tools (for Entrez/ClinVar)\n- **mygene** - Enhanced gene annotation client\n- **python-dotenv** - Environment variable management for API keys\n\n### System Requirements\n- Internet connection for API access\n- API keys for full functionality (NCBI Entrez, OMIM, BioGRID)\n- Sufficient memory for large gene lists (typically <1GB for 10,000 genes)\n- Email address for ClinVar/NCBI Entrez access (required when using API keys)\n\n## Troubleshooting\n\n### Common Issues\n\n#### API Access Problems\n```bash\n# Test API connectivity\ngeneinfo --gene TP53 --verbose\n\n# Working without API keys (limited functionality)\ngeneinfo --gene TP53 --entrez-api-key=None --omim-api-key=None --output results.json\n```\n\n#### API Key Configuration\n```bash\n# Check if API keys are being loaded correctly\ngeneinfo --gene TP53 --verbose\n\n# Set API keys via environment variables (recommended)\necho 'ENTREZ_API_KEY=\"your_key_here\"' > .env\necho 'OMIM_API_KEY=\"your_key_here\"' >> .env\necho 'BIOGRID_API_KEY=\"your_key_here\"' >> .env\necho 'ENTREZ_EMAIL=\"your.email@example.com\"' >> .env\n\n# Or pass via CLI\ngeneinfo --gene TP53 --entrez-api-key YOUR_ENTREZ_KEY --omim-api-key YOUR_OMIM_KEY --biogrid-api-key YOUR_BIOGRID_KEY --email your@email.com\n```\n\n#### Large Gene List Processing\n```bash\n# For very large lists, reduce concurrent workers\ngeneinfo --file huge_gene_list.txt --workers 3 --output results.csv\n\n# Process in smaller batches if memory is limited\nsplit -l 1000 huge_gene_list.txt batch_\n```\n\n### Getting Help\n- \ud83d\udcd6 Check the `examples/` directory for 
usage patterns\n- \ud83d\udc1b Report issues on GitHub with verbose output logs\n- \ud83d\udcac Include gene lists and error messages in bug reports\n- \ud83d\udce7 Use the `--verbose` flag for detailed debugging information\n\n## License & Citation\n\n### License\nMIT License - see LICENSE file for details.\n\n### Citation\nIf you use GeneInfo in your research, please cite:\n\n```bibtex\n@software{geneinfo2025,\n  author = {Liu, Chunjie},\n  title = {GeneInfo: Comprehensive Gene Information Retrieval},\n  url = {https://github.com/chunjie-sam-liu/geneinfo},\n  version = {0.1.0},\n  year = {2025}\n}\n```\n\n### Acknowledgments\nThis package aggregates data from multiple public biological databases. Please also cite the original data sources in your publications:\n\n- **Ensembl**: Cunningham et al. (2022) Nucleic Acids Research\n- **UniProt**: The UniProt Consortium (2023) Nucleic Acids Research\n- **Gene Ontology**: Aleksander et al. (2023) Genetics\n- **Reactome**: Gillespie et al. (2022) Nucleic Acids Research\n- **ClinVar**: Landrum et al. (2020) Nucleic Acids Research\n- **BioGRID**: Oughtred et al. (2021) Protein Science\n- **STRING**: Szklarczyk et al. (2023) Nucleic Acids Research\n- **GWAS Catalog**: Sollis et al. (2023) Nucleic Acids Research\n\n---\n\n**Author**: Chunjie Liu\n**Contact**: chunjie.sam.liu.at.gmail.com\n**Version**: 0.1.0\n**Date**: 2025-08-06\n",
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2025 Chun-Jie Liu\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.",
    "summary": "A comprehensive Python package for retrieving detailed gene information from multiple public databases",
    "version": "0.3.2",
    "project_urls": {
        "Changelog": "https://github.com/chunjie-sam-liu/geneinfo/releases",
        "Documentation": "https://chunjie-sam-liu.github.io/geneinfo",
        "Homepage": "https://chunjie-sam-liu.github.io/geneinfo",
        "Issues": "https://github.com/chunjie-sam-liu/geneinfo/issues",
        "Repository": "https://github.com/chunjie-sam-liu/geneinfo"
    },
    "split_keywords": [
        "bioinformatics",
        " biology",
        " gene-annotation",
        " genetics",
        " genomics",
        " gwas",
        " pathways",
        " protein"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fc9aed35f3ff69417c85c7fd436c430271c5769b411630b360ccaf6a3f88a872",
                "md5": "357303b890f22834b8af271e900fc5aa",
                "sha256": "ed5c84b820db47d147b832f667957ba0cb78184b6e749d08c904924daf32e712"
            },
            "downloads": -1,
            "filename": "genesummary-0.3.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "357303b890f22834b8af271e900fc5aa",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 33250,
            "upload_time": "2025-08-08T18:17:50",
            "upload_time_iso_8601": "2025-08-08T18:17:50.159989Z",
            "url": "https://files.pythonhosted.org/packages/fc/9a/ed35f3ff69417c85c7fd436c430271c5769b411630b360ccaf6a3f88a872/genesummary-0.3.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "507fe50e556d650da5ffc65fc44a7b3a65fddb205307609163b6d98e3ffca92b",
                "md5": "c45404944271cb1d82988c363088afbb",
                "sha256": "0bac7fdb2ecfb466205fefd57130e087c6bb198b7b784fb57f5eedfded5757c0"
            },
            "downloads": -1,
            "filename": "genesummary-0.3.2.tar.gz",
            "has_sig": false,
            "md5_digest": "c45404944271cb1d82988c363088afbb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 148029,
            "upload_time": "2025-08-08T18:17:51",
            "upload_time_iso_8601": "2025-08-08T18:17:51.637680Z",
            "url": "https://files.pythonhosted.org/packages/50/7f/e50e556d650da5ffc65fc44a7b3a65fddb205307609163b6d98e3ffca92b/genesummary-0.3.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-08 18:17:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "chunjie-sam-liu",
    "github_project": "geneinfo",
    "github_not_found": true,
    "lcname": "genesummary"
}
