# seqfold
[![DOI](https://zenodo.org/badge/224018980.svg)](https://zenodo.org/badge/latestdoi/224018980)
Predict the minimum free energy structure of nucleic acids.
`seqfold` is an implementation of the `Zuker, 1981` dynamic programming algorithm, the basis for [UNAFold](http://unafold.rna.albany.edu/?q=DINAMelt/software)/[mfold](https://www.ibridgenetwork.org/#!/profiles/1045554571442/innovations/1/), with energy functions from `SantaLucia, 2004` (DNA) and `Turner, 2009` (RNA).
## Installation
### pypy3 (strongly recommended)
```bash
pypy3 -m ensurepip
pypy3 -m pip install seqfold
```
For a 200bp sequence (on my laptop), [pypy3](https://doc.pypy.org/en/latest/index.html) takes 2.5 seconds versus 15 seconds for CPython.
### Default pip
```bash
pip install seqfold
```
## Usage
### Python
```python
from seqfold import fold, dg, dg_cache, dot_bracket
# just returns minimum free energy
dg("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC", temp = 37.0) # -13.4
# `fold` returns a list of `seqfold.Struct` from the minimum free energy structure
structs = fold("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
print(sum(s.e for s in structs)) # -13.4, same as dg()
for struct in structs:
    print(struct) # prints the i, j, ddg, and description of each structure
# `dg_cache` returns a 2D array where each (i,j) combination returns the MFE from i to j inclusive
cache = dg_cache("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
# `dot_bracket` returns a dot_bracket representation of the folding
print(dot_bracket(structs)) # ((((((((.((((......))))..((((.......)))).))))))))
```
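The dot-bracket string returned above encodes base pairs as matching parentheses. As a minimal sketch, the helper below (`dot_bracket_pairs` is a hypothetical name, not part of `seqfold`) recovers the `(i, j)` index pairs from such a string with a stack:

```python
from typing import List, Tuple


def dot_bracket_pairs(db: str) -> List[Tuple[int, int]]:
    """Return sorted (i, j) pairs for each matched '(' and ')'.

    Hypothetical helper for illustration; it is not a seqfold function.
    """
    stack: List[int] = []  # indices of '(' still waiting for a partner
    pairs: List[Tuple[int, int]] = []
    for j, char in enumerate(db):
        if char == "(":
            stack.append(j)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at index {j}")
            pairs.append((stack.pop(), j))
        elif char != ".":
            raise ValueError(f"unexpected character {char!r} at index {j}")
    if stack:
        raise ValueError(f"unmatched '(' at index {stack[-1]}")
    return sorted(pairs)


pairs = dot_bracket_pairs("((((((((.((((......))))..((((.......)))).))))))))")
print(len(pairs), pairs[0])
```

The indices are 0-based, which matches the `i` and `j` columns of the CLI sub-structure output.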
### CLI
```txt
usage: seqfold [-h] [-t FLOAT] [-d] [-r] [--version] SEQ
Predict the minimum free energy (kcal/mol) of a nucleic acid sequence
positional arguments:
SEQ nucleic acid sequence to fold
optional arguments:
-h, --help show this help message and exit
-t FLOAT, --celcius FLOAT
temperature in Celsius
-d, --dot-bracket write a dot-bracket of the MFE folding to stdout
-r, --sub-structures write each substructure of the MFE folding to stdout
--version show program's version number and exit
```
#### Examples
```bash
$ seqfold GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC --celcius 32
-15.3
```
```bash
$ seqfold GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC --celcius 32 --dot-bracket --sub-structures
GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC
((((((((.((((......))))..((((.......)))).))))))))
i j ddg description
0 48 -1.9 STACK:GG/CC
1 47 -1.9 STACK:GG/CC
2 46 -1.4 STACK:GA/CT
3 45 -1.4 STACK:AG/TC
4 44 -1.9 STACK:GG/CC
5 43 -1.6 STACK:GT/CA
6 42 -1.4 STACK:TC/AG
7 41 -0.5 BIFURCATION:4n/3h
9 22 -1.1 STACK:TT/AA
10 21 -0.7 STACK:TA/AT
11 20 -1.6 STACK:AC/TG
12 19 3.0 HAIRPIN:CA/GG
25 39 -1.9 STACK:CC/GG
26 38 -2.3 STACK:CG/GC
27 37 -1.9 STACK:GG/CC
28 36 3.2 HAIRPIN:GT/CT
-15.3
```
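The `ddg` values of the sub-structures add up to the total reported on the last line. A small sketch that parses the rows printed above and checks the sum (the parsing function `parse_rows` is hypothetical and assumes whitespace-separated columns, as in the example output):

```python
from typing import List, Tuple

# Rows copied from the example output above (header and total omitted).
ROWS_TEXT = """\
0 48 -1.9 STACK:GG/CC
1 47 -1.9 STACK:GG/CC
2 46 -1.4 STACK:GA/CT
3 45 -1.4 STACK:AG/TC
4 44 -1.9 STACK:GG/CC
5 43 -1.6 STACK:GT/CA
6 42 -1.4 STACK:TC/AG
7 41 -0.5 BIFURCATION:4n/3h
9 22 -1.1 STACK:TT/AA
10 21 -0.7 STACK:TA/AT
11 20 -1.6 STACK:AC/TG
12 19 3.0 HAIRPIN:CA/GG
25 39 -1.9 STACK:CC/GG
26 38 -2.3 STACK:CG/GC
27 37 -1.9 STACK:GG/CC
28 36 3.2 HAIRPIN:GT/CT
"""


def parse_rows(text: str) -> List[Tuple[int, int, float, str]]:
    """Parse 'i j ddg description' rows into typed tuples (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        i, j, ddg, desc = line.split(maxsplit=3)
        rows.append((int(i), int(j), float(ddg), desc))
    return rows


rows = parse_rows(ROWS_TEXT)
total = round(sum(ddg for _, _, ddg, _ in rows), 1)
print(total)  # -15.3, the total on the last line of the CLI output
```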
### Notes
- The type of nucleic acid, DNA or RNA, is inferred from the input sequence.
- `seqfold` is case-insensitive with the input sequence.
- The default temperature is 37 degrees Celsius for both the Python and CLI interfaces.
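To make the first two notes concrete, here is one way such inference could work. This is an assumption for illustration only: `infer_type` is a hypothetical function, and its rules (uppercase the input, treat `U` as RNA, otherwise DNA) are not taken from the `seqfold` source.

```python
def infer_type(seq: str) -> str:
    """Guess 'DNA' or 'RNA' from a sequence, ignoring case.

    Hypothetical rules, not seqfold's actual logic:
    - a sequence containing both T and U is rejected
    - a sequence containing U is treated as RNA
    - anything else made of A, C, G, T is treated as DNA
    """
    bases = set(seq.upper())  # case-insensitive comparison
    unknown = bases - set("ACGTU")
    if unknown:
        raise ValueError(f"unknown bases: {sorted(unknown)}")
    if "T" in bases and "U" in bases:
        raise ValueError("sequence mixes T and U")
    return "RNA" if "U" in bases else "DNA"


print(infer_type("gggaggtcgttacatc"))  # DNA
print(infer_type("GGGAGGUCGUUACAUC"))  # RNA
```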
## Motivation
Secondary structure prediction is used for making [PCR primers](https://academic.oup.com/nar/article/40/15/e115/1223759), designing [oligos for MAGE](https://pubs.acs.org/doi/abs/10.1021/acssynbio.5b00219), and tuning [RBS expression rates](https://www.sciencedirect.com/science/article/pii/B9780123851208000024).
While [UNAFold](http://unafold.rna.albany.edu/?q=DINAMelt/software) and [mfold](https://www.ibridgenetwork.org/#!/profiles/1045554571442/innovations/1/) are the most widely used applications for nucleic acid secondary structure prediction, their format and license are restrictive. `seqfold` is meant to be an open-source, minimalist alternative for predicting minimum free energy secondary structure.
| | seqfold | mfold | UNAFold |
| ------------ | --------------------- | -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| License | MIT | [Academic Non-commercial](http://unafold.rna.albany.edu/download/Academic_License.txt) | [\$200-36,000](https://www.ibridgenetwork.org/#!/profiles/1045554571442/innovations/1/products/) |
| OS | Linux, MacOS, Windows | Linux, MacOS | Linux, MacOS, Windows |
| Format | python, CLI python | CLI binary | CLI binary |
| Dependencies | none | (mfold_util) | Perl, (gnuplot, glut/OpenGL) |
| Graphical | no | yes (output) | yes (output) |
| Heterodimers | no | yes | yes |
| Constraints | no | yes | yes |
## Citations
Papers, and how they helped in developing `seqfold`, are listed below.
### Nussinov, 1980
> Nussinov, Ruth, and Ann B. Jacobson. "Fast algorithm for predicting the secondary structure of single-stranded RNA." Proceedings of the National Academy of Sciences 77.11 (1980): 6309-6313.
Framework for the dynamic programming approach. It has a conceptually helpful "Maximal Matching" example that demonstrates the approach on a simple sequence where each base is either matched or unmatched.
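A minimal sketch of maximal matching in that spirit is below. It counts the largest number of non-crossing Watson-Crick pairs rather than computing energies. The function name `max_pairs`, the `min_loop` parameter, and the restriction to A-U and G-C pairs are choices made for this illustration, not details taken from the paper or from `seqfold`.

```python
def max_pairs(seq: str, min_loop: int = 0) -> int:
    """Maximum number of non-crossing base pairs in an RNA sequence.

    min_loop is the number of unpaired bases required between two
    paired positions (0 allows adjacent bases to pair).
    """
    can_pair = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = max pairs within seq[i..j] inclusive
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j is unpaired.
            score = best[i][j - 1]
            # Case 2: base j pairs with some k, splitting the problem in two.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in can_pair:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_pairs("GGGAAACCC"))  # 3
```

The three nested loops make this `O(n³)` in time, the same bound mentioned for the full energy model.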
### Zuker, 1981
> Zuker, Michael, and Patrick Stiegler. "Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information." Nucleic acids research 9.1 (1981): 133-148.
The most cited paper in this space. It goes further than `Nussinov, 1980` by using a nearest-neighbor approach to energies and by considering stacks, bulges, internal loops, and hairpins individually. Its data structure and traceback method are both more intuitive than those of `Nussinov, 1980`.
### Jaeger, 1989
> Jaeger, John A., Douglas H. Turner, and Michael Zuker. "Improved predictions of secondary structures for RNA." Proceedings of the National Academy of Sciences 86.20 (1989): 7706-7710.
Zuker and colleagues expand on the 1981 paper to incorporate penalties for multibranched loops and dangling ends.
### SantaLucia, 2004
> SantaLucia Jr, John, and Donald Hicks. "The thermodynamics of DNA structural motifs." Annu. Rev. Biophys. Biomol. Struct. 33 (2004): 415-440.
The source of almost every DNA energy function in `seqfold` (multibranch loops are the exception). Provides nearest-neighbor entropies and enthalpies for stacks, mismatching stacks, terminal stacks, and dangling stacks, as well as for bulges, internal loops, and hairpins.
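Enthalpy and entropy parameters are combined into a free energy at a given temperature through the standard relation ΔG = ΔH − TΔS, which is why the CLI examples above report a different energy at 32 °C than the Python example does at 37 °C. A minimal sketch, assuming ΔH in kcal/mol and ΔS in cal/(mol·K); the function name `delta_g` and the numeric inputs are illustrative and should not be read as values from `seqfold`'s data files:

```python
def delta_g(dh_kcal: float, ds_cal: float, temp_c: float = 37.0) -> float:
    """Gibbs free energy change in kcal/mol.

    dh_kcal: enthalpy change in kcal/mol
    ds_cal:  entropy change in cal/(mol*K)
    temp_c:  temperature in degrees Celsius
    """
    temp_k = temp_c + 273.15  # convert to Kelvin
    return dh_kcal - temp_k * (ds_cal / 1000.0)  # cal -> kcal


# Illustrative inputs for a single stacked pair (assumed, not from seqfold).
print(round(delta_g(-7.6, -21.3, temp_c=37.0), 2))  # -0.99
print(round(delta_g(-7.6, -21.3, temp_c=32.0), 2))  # -1.1
```

With a negative ΔS, lowering the temperature makes ΔG more negative, so the structure is predicted to be more stable.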
### Turner, 2009
> Turner, Douglas H., and David H. Mathews. "NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure." Nucleic acids research 38.suppl_1 (2009): D280-D282.
Source of the RNA nearest-neighbor parameters for changes in entropy and enthalpy. The data is in `/data`.
### Ward, 2017
> Ward, M., Datta, A., Wise, M., & Mathews, D. H. (2017). Advanced multi-loop algorithms for RNA secondary structure prediction reveal that the simplest model is best. Nucleic acids research, 45(14), 8541-8550.
An investigation of energy functions for multibranch loops. It validates the simple linear approach employed by `Jaeger, 1989`, which keeps runtime within `O(n³)`.
Raw data
{
"_id": null,
"home_page": "https://github.com/Lattice-Automation/seqfold",
"name": "seqfold",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.5",
"maintainer_email": "",
"keywords": "",
"author": "JJTimmons",
"author_email": "jtimmons@latticeautomation.com",
"download_url": "https://files.pythonhosted.org/packages/8a/65/c1506b49a63e1ef31cd85f0a88db49928f4ecb2932498a97b573d0117739/seqfold-0.7.17.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "mit",
"summary": "Predict the minimum free energy structure of nucleic acids",
"version": "0.7.17",
"project_urls": {
"Homepage": "https://github.com/Lattice-Automation/seqfold"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5fe42bfa5bb9ae7cf06cebc028e4fbbc3e4e1cc1fd8eee87a39a68da86ec16c4",
"md5": "72e4d85dc4d624b5fd05d97ffba3f134",
"sha256": "beaf8d6064f69d33d4ad7e53eea4d4f425baf2cbd6e54e67537a12f6ff29c617"
},
"downloads": -1,
"filename": "seqfold-0.7.17-py3-none-any.whl",
"has_sig": false,
"md5_digest": "72e4d85dc4d624b5fd05d97ffba3f134",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 29995,
"upload_time": "2023-05-30T14:26:22",
"upload_time_iso_8601": "2023-05-30T14:26:22.706096Z",
"url": "https://files.pythonhosted.org/packages/5f/e4/2bfa5bb9ae7cf06cebc028e4fbbc3e4e1cc1fd8eee87a39a68da86ec16c4/seqfold-0.7.17-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8a65c1506b49a63e1ef31cd85f0a88db49928f4ecb2932498a97b573d0117739",
"md5": "5be48aa5c38eed700bca022ee8684aa0",
"sha256": "ee0f7ca408496aa9908f60d02ca95e0c6161ec8e3cb48dbedfe52070a832b522"
},
"downloads": -1,
"filename": "seqfold-0.7.17.tar.gz",
"has_sig": false,
"md5_digest": "5be48aa5c38eed700bca022ee8684aa0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 30546,
"upload_time": "2023-05-30T14:26:25",
"upload_time_iso_8601": "2023-05-30T14:26:25.500960Z",
"url": "https://files.pythonhosted.org/packages/8a/65/c1506b49a63e1ef31cd85f0a88db49928f4ecb2932498a97b573d0117739/seqfold-0.7.17.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-30 14:26:25",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Lattice-Automation",
"github_project": "seqfold",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "seqfold"
}