# SPONGE - Simple Prior Omics Network GEnerator
The SPONGE package generates human prior gene regulatory networks and
protein-protein interaction networks for the involved transcription
factors.
## Table of Contents
- [SPONGE - Simple Prior Omics Network GEnerator](#sponge---simple-prior-omics-network-generator)
  - [Table of Contents](#table-of-contents)
  - [General Information](#general-information)
  - [Features](#features)
  - [Setup](#setup)
  - [Usage](#usage)
    - [File formats](#file-formats)
  - [Project Status](#project-status)
  - [Room for Improvement](#room-for-improvement)
  - [Acknowledgements](#acknowledgements)
  - [Contact](#contact)
  - [License](#license)
## General Information
This repository contains the SPONGE package, which allows the generation
of human prior gene regulatory networks based mainly on the data from
the JASPAR database.
It also uses NCBI to find the human homologs of vertebrate transcription
factors, Ensembl to collect all the promoter regions in the human
genome, UniProt for symbol matching, and STRING to retrieve
protein-protein interactions between transcription factors.
Because it accesses these databases on the fly, it requires internet
access.
Prior gene regulatory networks are useful mainly as an input for tools
that incorporate additional sources of information to refine them.
The prior networks generated by SPONGE are designed to be compatible
with PANDA and related [NetZoo](https://github.com/netZoo/netZooPy)
tools.
The purpose of this project is to make the generation of prior gene
regulatory networks accessible to people who lack the knowledge or
inclination to run a genome-wide motif search themselves, but who would
still like to change some of the parameters used to generate the
publicly available prior gene regulatory networks.
It is also designed to facilitate the inclusion of new information from
database updates into the prior networks.
If you just want to use the prior networks generated by the stable
version of SPONGE with the default settings, they are available on
[Zenodo](https://zenodo.org/records/13628784).
## Features
The features already available are:
- Generation of prior gene regulatory network
- Generation of prior protein-protein interaction network for
transcription factors
- Automatic download of required files during setup
- Parallelised motif filtering
- Command line interface
## Setup
The requirements are provided in a `requirements.txt` file.
SPONGE can be installed via pip:
``` bash
pip install netzoopy-sponge
```
Alternatively, it can be installed by downloading this repository and
then installing with pip (possibly in editable mode):
``` bash
git clone https://github.com/ladislav-hovan/sponge.git
cd sponge
pip install -e .
```
## Usage
SPONGE comes with a `netzoopy-sponge` command line script:
``` bash
# Get information about the available options
netzoopy-sponge --help
# Run the pipeline
netzoopy-sponge
```
The script has many options, but the defaults are designed to be
sensible, so users do not need to change any of them unless desired.
Within Python, the default workflow can be invoked as follows:
``` python
# Import the class definition
from sponge.sponge import Sponge
# Run the default workflow
sponge_obj = Sponge(run_default=True)
```
Much like the command line script, the Sponge class accepts many
parameters that give control over the process, and each of them can be
changed from its default.
For more information, you can run `help(Sponge)` after the import.
In case one needs more control over the individual steps, the workflow
in Python would be as follows:
``` python
# Import the class definition
from sponge.sponge import Sponge
# Create the SPONGE object
sponge_obj = Sponge()
# Select the vertebrate transcription factors from JASPAR
sponge_obj.select_tfs()
# Find human homologs for the TFs if possible
sponge_obj.find_human_homologs()
# Filter the matches of the JASPAR bigbed file to the ones in the
# promoters of human transcripts
sponge_obj.filter_matches()
# Aggregate the filtered matches on promoters to genes
sponge_obj.aggregate_matches()
# Write the final motif prior to a file
sponge_obj.write_motif_prior()
# Retrieve the protein-protein interactions between the transcription
# factors from the STRING database
sponge_obj.retrieve_ppi()
# Write the PPI prior to a file
sponge_obj.write_ppi_prior()
```
SPONGE will attempt to download the files it needs into a temporary
directory (`.sponge_temp` by default).
Paths can be provided if these files were downloaded in advance.
The JASPAR bigbed file required for filtering is huge (> 100 GB), so
the download might take some time.
Make sure you're running SPONGE somewhere that has enough space!
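A quick check of the free disk space before starting the download can save a failed run. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the 150 GB margin are assumptions chosen for illustration, since the only figure stated here is "> 100 GB".

``` python
import shutil

# Assumed safety margin: the bigbed file is only described as "> 100 GB",
# so this threshold is a guess and should be adjusted as needed
REQUIRED_GB = 150


def has_enough_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_gb` gigabytes free (hypothetical helper, not part of SPONGE)."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


# Check the current working directory, where the temporary directory
# would be created by default
if has_enough_space(".", REQUIRED_GB):
    print("Enough free space for the bigbed download.")
else:
    print("Not enough free space; on-the-fly processing may be preferable.")
```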
As an alternative to the bigbed file download, SPONGE can download
tracks for individual TFs on the fly and filter them individually.
This approach is slower than using the bigbed file when all TFs in
the database are considered, but it becomes competitive when only
a subset is used.
It also greatly reduces the storage footprint.
The option is enabled with `on_the_fly_processing=True`.
### File formats
Users are free to provide their own files for the list of regions of
interest (key name `promoter`, default name `promoters.bed`), mapping
of transcripts to genes (`ensembl`: `ensembl.tsv`) and the list of
predicted TF binding sites (`jaspar_bigbed`: `JASPAR.bb`).
By default, if the paths are not provided through the keyword
`paths_to_files`, SPONGE attempts to locate these files in the temporary
folder under the default names.
If it fails to do so, it will proceed to download them.
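The lookup described above can be sketched as follows. The key names and default file names are the ones listed in this section; the helper `collect_existing_paths`, the directory name `my_sponge_inputs`, and the assumption that `paths_to_files` takes a dictionary are illustrative only, so check `help(Sponge)` for the actual signature.

``` python
from pathlib import Path

# Key names and default file names as listed in this README
DEFAULT_NAMES = {
    "promoter": "promoters.bed",
    "ensembl": "ensembl.tsv",
    "jaspar_bigbed": "JASPAR.bb",
}


def collect_existing_paths(directory: str) -> dict[str, str]:
    """Map each key to its file in `directory`, skipping missing files
    (hypothetical helper, not part of SPONGE)."""
    found = {}
    for key, name in DEFAULT_NAMES.items():
        candidate = Path(directory) / name
        if candidate.is_file():
            found[key] = str(candidate)
    return found


# "my_sponge_inputs" is a made-up directory name
paths = collect_existing_paths("my_sponge_inputs")
print(paths)

# Assumed usage (requires netzoopy-sponge; the dict type is an assumption):
# from sponge.sponge import Sponge
# sponge_obj = Sponge(paths_to_files=paths)
```

Files that are not found are simply left out of the dictionary, which leaves SPONGE free to fall back on its own lookup and download for them.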
The list of regions of interest is expected as a bed file in the
6-column format without a header, for example:
```
chr1 11119 12119 ENST00000456328 0 +
chr1 11260 12260 ENST00000450305 0 +
chr1 17186 18186 ENST00000619216 0 -
chr1 24636 25636 ENST00000488147 0 -
```
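Before handing a custom regions file to SPONGE, it can be worth validating it. Below is a minimal sketch of such a check on the example rows above; `Region` and `parse_bed6` are hypothetical names, and the validation rules (exactly six columns, `start < end`) are assumptions about what a well-formed file looks like, not SPONGE's own parser.

``` python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str  # transcript stable ID
    score: int
    strand: str


def parse_bed6(lines):
    """Parse header-less 6-column bed lines into Region tuples
    (hypothetical helper, not part of SPONGE)."""
    regions = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Skip empty lines
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(
                f"line {line_number}: expected 6 columns, got {len(fields)}"
            )
        chrom, start, end, name, score, strand = fields
        if int(start) >= int(end):
            raise ValueError(f"line {line_number}: start must be below end")
        regions.append(
            Region(chrom, int(start), int(end), name, int(score), strand)
        )
    return regions


EXAMPLE_BED = """\
chr1 11119 12119 ENST00000456328 0 +
chr1 11260 12260 ENST00000450305 0 +
chr1 17186 18186 ENST00000619216 0 -
chr1 24636 25636 ENST00000488147 0 -
"""

regions = parse_bed6(EXAMPLE_BED.splitlines())
print(len(regions), regions[0].name)
```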
The mapping of transcripts to genes is expected as a four-column,
tab-separated (tsv) file with a defined header, for example:
```
Transcript stable ID Gene stable ID Gene name Gene type
ENST00000387314 ENSG00000210049 MT-TF Mt_tRNA
ENST00000389680 ENSG00000211459 MT-RNR1 Mt_rRNA
ENST00000387342 ENSG00000210077 MT-TV Mt_tRNA
ENST00000387347 ENSG00000210082 MT-RNR2 Mt_rRNA
ENST00000386347 ENSG00000209082 MT-TL1 Mt_tRNA
ENST00000361390 ENSG00000198888 MT-ND1 protein_coding
```
The `Transcript stable ID` field will be used to match regions of
interest.
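The matching on `Transcript stable ID` can be illustrated with a small lookup table. This is a sketch only: the helper `load_transcript_to_gene` is hypothetical, the choice of `Gene name` as the mapped value is an assumption, and SPONGE's internal handling may differ.

``` python
import csv
import io

# Two rows from the example above, written with explicit tab separators
EXAMPLE_TSV = (
    "Transcript stable ID\tGene stable ID\tGene name\tGene type\n"
    "ENST00000387314\tENSG00000210049\tMT-TF\tMt_tRNA\n"
    "ENST00000361390\tENSG00000198888\tMT-ND1\tprotein_coding\n"
)


def load_transcript_to_gene(handle):
    """Build a transcript ID -> gene name dictionary from a tsv handle
    (hypothetical helper, not part of SPONGE)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Transcript stable ID"]: row["Gene name"] for row in reader}


mapping = load_transcript_to_gene(io.StringIO(EXAMPLE_TSV))

# A transcript present in the mapping resolves to its gene name
print(mapping.get("ENST00000361390"))  # prints MT-ND1
# A region whose transcript ID is absent from the mapping has no match
print(mapping.get("ENST00000456328"))  # prints None
```

Regions whose name column has no entry in the mapping cannot be assigned to a gene, so a custom regions file and a custom mapping file should use the same set of transcript IDs.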
Finally, the predicted TF binding sites are expected in a binary bigbed
file, with the following format when decoded:
```
chrom start end name score strand TFName
chr1 10000 10006 MA0467.3 276 - Crx
chr1 10000 10006 MA0648.2 233 + GSC
chr1 10000 10006 MA0682.3 231 + PITX1
chr1 10000 10006 MA0711.2 198 + OTX1
chr1 10000 10006 MA0714.2 246 + PITX3
```
Effectively, it is an extended bed format with a header, which uses
the `name` column to provide the JASPAR matrix ID and the `TFName`
column to provide the actual name of the transcription factor.
However, SPONGE currently expects a bigbed file and will not work with
a plain bed file.
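To make the decoded layout concrete, the sketch below parses a few of the rows shown above and tests whether a binding site falls inside a region of interest. It works on decoded text only and does not read the binary bigbed file. The names `Hit`, `parse_hits` and `within`, the region coordinates in the usage lines, and the choice of full containment as the criterion are all assumptions for illustration, not a description of how SPONGE filters matches.

``` python
from typing import NamedTuple

EXPECTED_HEADER = ["chrom", "start", "end", "name", "score", "strand", "TFName"]


class Hit(NamedTuple):
    chrom: str
    start: int
    end: int
    matrix_id: str  # the `name` column: JASPAR matrix ID
    score: int
    strand: str
    tf_name: str  # the `TFName` column


DECODED = """\
chrom start end name score strand TFName
chr1 10000 10006 MA0467.3 276 - Crx
chr1 10000 10006 MA0648.2 233 + GSC
chr1 10000 10006 MA0682.3 231 + PITX1
"""


def parse_hits(text):
    """Parse decoded rows (with header) into Hit tuples
    (hypothetical helper, not part of SPONGE)."""
    lines = text.strip().splitlines()
    if lines[0].split() != EXPECTED_HEADER:
        raise ValueError("unexpected header")
    hits = []
    for row in lines[1:]:
        chrom, start, end, name, score, strand, tf_name = row.split()
        hits.append(
            Hit(chrom, int(start), int(end), name, int(score), strand, tf_name)
        )
    return hits


def within(hit, chrom, start, end):
    """True if the hit lies fully inside the region from start to end
    on the same chromosome (containment is an assumed criterion)."""
    return hit.chrom == chrom and start <= hit.start and hit.end <= end


hits = parse_hits(DECODED)
# Made-up region around the example hits
print([h.tf_name for h in hits if within(h, "chr1", 9500, 10500)])
# First promoter from the regions example: none of these hits fall inside
print([h.tf_name for h in hits if within(h, "chr1", 11119, 12119)])
```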
## Project Status
The project is: _in progress_.
## Room for Improvement
Room for improvement:
- Try incorporating unipressed
- Improve overlap computations
To do:
- Support for more species
## Acknowledgements
Many thanks to the members of the
[Kuijjer group](https://www.kuijjerlab.org/)
at NCMM for their feedback and support.
This README is based on a template made by
[@flynerdpl](https://www.flynerd.pl/).
## Contact
Created by Ladislav Hovan (ladislav.hovan@ncmm.uio.no).
Feel free to contact me!
## License
This project is open source and available under the
[GNU General Public License v3](LICENSE).