Name | bioino |
Version | 0.0.2.post1 |
Summary | Lightweight IO and conversion for bioinformatics file formats. |
upload_time | 2024-03-19 16:34:29 |
requires_python | >=3.8 |
keywords | biology, bioinformatics, science, io |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# 💻 bioino
![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/bioino/python-publish.yml)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/bioino)
![PyPI](https://img.shields.io/pypi/v/bioino)
Command-line tools and Python API for interconverting FASTA, GFF, and CSV.
**bioino** currently converts tables to FASTA, and GFF to tables. It also provides
a Python API for handling GFF and FASTA files and converting them to table
files.
_Warning_: **bioino** is under active development, and not fully tested, so
things may change, break, or simply not work.
## Installation
### The easy way
Install the pre-compiled version from PyPI:
```bash
pip install bioino
```
### From source
Clone the repository, then `cd` into it. Then run:
```bash
pip install -e .
```
## Usage
### Command line
Convert a table of sequences (CSV, TSV, or XLSX) to a FASTA file. Progress information goes to `stderr`, so you can
pipe the FASTA output to other tools or to a file.
```bash
$ printf 'name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n' | bioino table2fasta -n name -s seq -d data
🚀 Generating FASTA from tables with the following parameters:
subcommand: table2fasta
input: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
format: TSV
sequence: seq
name: ['name']
description: ['data']
worksheet: Sheet 1
output: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
func: <function _table2fasta at 0x7f4b48a43d30>
>Seq1 data=Some-info
AAAAA
⏰ Completed process in 0:00:00.025771
```
Convert GFF tables to TSV (or CSV).
```bash
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | bioino gff2table 2> /dev/null
seqid source feature start end score strand phase ID attr1
test_seq test_source gene 1 10 . + . test01 +
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | bioino gff2table -f CSV 2> /dev/null
seqid,source,feature,start,end,score,strand,phase,ID,attr1
test_seq,test_source,gene,1,10,.,+,.,test01,+
```
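To make the column layout in the output above concrete, here is a minimal standard-library sketch of flattening one GFF feature line into a table row. It is not bioino's implementation: `GFF_COLUMNS` and `gff_line_to_row` are hypothetical names, and the sketch assumes the ninth column holds simple `key=value` pairs separated by `;`.

```python
# Names of the eight fixed GFF columns, as in the header row shown above.
GFF_COLUMNS = ["seqid", "source", "feature", "start", "end",
               "score", "strand", "phase"]

def gff_line_to_row(line):
    """Split one GFF feature line into a flat dict (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(GFF_COLUMNS, fields[:8]))
    # The ninth column holds semicolon-separated key=value attributes;
    # each attribute becomes its own column.
    if len(fields) > 8:
        for item in fields[8].split(";"):
            if item:
                key, _, value = item.partition("=")
                row[key] = value
    return row

line = "test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n"
row = gff_line_to_row(line)
print("\t".join(row))           # header row
print("\t".join(row.values()))  # data row
```

All values stay as strings here; a real converter would probably cast `start` and `end` to integers.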
#### Detailed usage
```bash
$ bioino --help
usage: bioino [-h] {gff2table,table2fasta} ...
Interconvert some bioinformatics file formats.
optional arguments:
-h, --help show this help message and exit
Sub-commands:
{gff2table,table2fasta}
Use these commands to specify the tool you want to use.
gff2table Convert a GFF to a TSV file.
table2fasta Convert a CSV or TSV of sequences to a FASTA file.
```
```bash
$ bioino gff2table --help
usage: bioino gff2table [-h] [--format {TSV,CSV}] [--metadata] [--output OUTPUT] [input]
positional arguments:
input Input file in GFF format. Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".
optional arguments:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
File format. Default: "TSV".
--metadata, -m Write GFF header as commented lines.
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
```
```bash
$ bioino table2fasta --help
usage: bioino table2fasta [-h] [--format {TSV,CSV}] [--sequence SEQUENCE] --name [NAME [NAME ...]]
[--description [DESCRIPTION [DESCRIPTION ...]]] [--worksheet WORKSHEET] [--output OUTPUT]
[input]
positional arguments:
input Input file in GFF format. Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".
optional arguments:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
File format. Default: "TSV".
--sequence SEQUENCE, -s SEQUENCE
Column to take sequence from. Default: "sequence".
--name [NAME [NAME ...]], -n [NAME [NAME ...]]
Column(s) to take sequence name from. Concatenates values with "_", replaces spaces with "-". Required.
--description [DESCRIPTION [DESCRIPTION ...]], -d [DESCRIPTION [DESCRIPTION ...]]
Column(s) to take sequence description from. Concatenates values with ";", replaces spaces with "_".
Default: don't use.
--worksheet WORKSHEET, -w WORKSHEET
For XLSX files, the worksheet to take the table from. Default: "Sheet 1".
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
```
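The naming rules in the `table2fasta` help text can be illustrated with a small standard-library sketch. This is not bioino's implementation: `row_to_fasta` is a hypothetical helper, and the `key=value` layout of the description is inferred from the example output earlier in this section.

```python
import csv
import io

def row_to_fasta(row, name_cols, seq_col, desc_cols=()):
    """Format one table row as a FASTA record (illustrative only)."""
    # Names: join column values with "_" and replace spaces with "-".
    name = "_".join(str(row[c]) for c in name_cols).replace(" ", "-")
    # Descriptions: key=value pairs joined with ";", spaces replaced with "_".
    desc = ";".join(f"{c}={row[c]}" for c in desc_cols).replace(" ", "_")
    # Drop the trailing space when there is no description.
    header = f">{name} {desc}".rstrip()
    return f"{header}\n{row[seq_col]}"

table = io.StringIO("name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n")
records = [row_to_fasta(row, ["name"], "seq", ["data"])
           for row in csv.DictReader(table, delimiter="\t")]
print("\n".join(records))
```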
### Python API
#### FASTA
Read FASTA files (or strings) into iterators of named tuples.
```python
>>> from bioino import FastaSequence, FastaCollection
>>> seq1 = FastaSequence("example", "This is a description", "ATCG")
>>> seq1
FastaSequence(name='example', description='This is a description', sequence='ATCG')
>>> seq2 = FastaSequence("example2", "This is another sequence", "GGGAAAA")
>>> fasta_stream = FastaCollection([seq1, seq2])
>>> fasta_stream
FastaCollection(sequences=[FastaSequence(name='example', description='This is a description', sequence='ATCG'), FastaSequence(name='example2', description='This is another sequence', sequence='GGGAAAA')])
```
These objects render in FASTA format when written, optionally to a file.
```python
>>> fasta_stream.write()
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
```
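For readers unfamiliar with the format, the idea of reading FASTA text into an iterator of named tuples can be sketched with the standard library alone. This is an assumption-laden illustration, not bioino's parser: `Record` and `parse_fasta` are hypothetical names, and the sketch treats the first word of each header as the name and the rest as the description.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; not bioino's FastaSequence.
Record = namedtuple("Record", ["name", "description", "sequence"])

def parse_fasta(lines):
    """Yield one Record per FASTA entry from an iterable of lines."""
    name, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if name is not None:
                yield Record(name, description, "".join(chunks))
            # Header: first word is the name, the rest is the description.
            name, _, description = line[1:].partition(" ")
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if name is not None:
        yield Record(name, description, "".join(chunks))

text = (">example This is a description\nATCG\n"
        ">example2 This is another sequence\nGGG\nAAAA\n")
records = list(parse_fasta(text.splitlines()))
for record in records:
    print(record)
```

Because `parse_fasta` is a generator, entries are produced one at a time, which keeps memory use low for large files.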
#### GFF
Makes an attempt to conform to GFF3 but makes no guarantees.
Similar to the FASTA utilities, GFF is read into an object.
```python
>>> from io import StringIO
>>> from bioino import GffFile
>>> lines = ["##meta1 item1",
... "#meta2 item2 comment",
... '\t'.join("test_seq test_source gene 1 10 . + . ID=test01;attr1=+".split()),
... '\t'.join("test_seq test_source gene 9 100 . + . Parent=test01;attr2=+".split())]
>>> file = StringIO()
>>> for line in lines:
... print(line, file=file)
>>> gff = GffFile.from_file(file)
```
These render as GFF lines when printed.
```python
>>> gff.write()
##meta1 item1
#meta2 item2 comment
test_seq test_source gene 1 10 . + . ID=test01;attr1=+
test_seq test_source gene 9 100 . + . Parent=test01;attr2=+
```
#### GFF lookup table
An iterable of `GffLine`s can be converted into a lookup table mapping
chromosome location to feature annotations. Regions without annotation
are automatically filled with references to upstream or
downstream features.
Just create a `GffFile` with `lookup=True`, or use the `_lookup_table()` method of an instantiated `GffFile`.
There are currently some limitations:
- Only works for single-chromosome files.
- Only references parent features. Child features are not yet indexed.
- Will not work for GFFs with a single parent feature.
- Ignores the following feature types: "region", "repeat_region"
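The lookup-table idea described above can be sketched as follows. This is a deliberately simple illustration under stated assumptions, not bioino's algorithm or data structure: `build_lookup` is a hypothetical function, features are plain `(start, end, feature_id)` tuples with 1-based inclusive coordinates, and "upstream" and "downstream" are taken by coordinate only, ignoring strand.

```python
def build_lookup(features, chromosome_length):
    """Map each 1-based position to feature IDs (illustrative only).

    `features` is a list of (start, end, feature_id) tuples with
    inclusive coordinates on a single chromosome.
    """
    lookup = {}
    for position in range(1, chromosome_length + 1):
        overlapping = [fid for start, end, fid in features
                       if start <= position <= end]
        if overlapping:
            lookup[position] = {"features": overlapping}
            continue
        # Unannotated position: point at the nearest flanking features.
        before = [f for f in features if f[1] < position]
        after = [f for f in features if f[0] > position]
        lookup[position] = {
            "upstream": max(before, key=lambda f: f[1])[2] if before else None,
            "downstream": min(after, key=lambda f: f[0])[2] if after else None,
        }
    return lookup

genes = [(3, 5, "geneA"), (8, 10, "geneB")]
table = build_lookup(genes, chromosome_length=12)
print(table[4])  # inside geneA
print(table[6])  # in the gap between geneA and geneB
```

The sketch scans every feature for every position, which is fine for a toy example but would be slow on a real genome; an interval index or a sorted sweep would scale better.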
#### Interconversion
`GffLine`s can be converted to dictionaries and vice versa.
```python
>>> from bioino import GffLine
>>> d = dict(seqid='TEST', source='test', feature='gene', start=1, end=100, score='.', strand='+', phase='+')
>>> print(GffLine.from_dict(d))
TEST test gene 1 100 . + +
>>> d.update(dict(ID='test001', comment='This is a test'))
>>> GffLine.from_dict(d).write()
TEST test gene 1 100 . + + ID=test001;comment=This is a test
```
```python
>>> from io import StringIO
>>> from bioino import GffFile
>>> file = StringIO()
>>> lines = ["TEST test gene 1 100 . + + ID=test001;comment=Test".split(),
... "TEST2 test2 gene 101 200 . + + ID=test002;comment=Test2".split()]
>>> for line in lines:
... print('\t'.join(line), file=file)
>>> list(GffFile.from_file(file).as_dict())
[{'seqid': 'TEST', 'source': 'test', 'feature': 'gene', 'start': 1, 'end': 100, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test001', 'comment': 'Test'}, {'seqid': 'TEST2', 'source': 'test2', 'feature': 'gene', 'start': 101, 'end': 200, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test002', 'comment': 'Test2'}]
```
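The dictionary-to-line direction shown above can be mimicked in a few lines of plain Python. This is a sketch, not bioino's `GffLine.from_dict`: `dict_to_gff_line` is a hypothetical helper that assumes every key outside the eight fixed columns is an attribute, and it does no escaping of special characters.

```python
GFF_COLUMNS = ("seqid", "source", "feature", "start", "end",
               "score", "strand", "phase")

def dict_to_gff_line(record):
    """Render a flat dict as one tab-separated GFF line (illustrative only)."""
    core = [str(record[column]) for column in GFF_COLUMNS]
    # Any key that is not a fixed column is treated as an attribute.
    attributes = ";".join(f"{key}={value}" for key, value in record.items()
                          if key not in GFF_COLUMNS)
    # The attribute column is omitted entirely when there are no attributes.
    return "\t".join(core + ([attributes] if attributes else []))

d = dict(seqid="TEST", source="test", feature="gene", start=1, end=100,
         score=".", strand="+", phase="+", ID="test001", comment="This is a test")
gff_line = dict_to_gff_line(d)
print(gff_line)
```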
Pandas DataFrames can also be converted to FASTA.
```python
>>> import pandas as pd
>>> from bioino import FastaCollection
>>> df = pd.DataFrame(dict(seq=['atcg', 'aaaa'],
... title=['seq1', 'seq2'],
... info=['SeqA', 'SeqB'],
... score=[1, 2]))
>>> df
seq title info score
0 atcg seq1 SeqA 1
1 aaaa seq2 SeqB 2
>>> FastaCollection.from_pandas(df, sequence='seq',
... names=['title'],
... descriptions=['info', 'score']).write()
>seq1 info=SeqA;score=1
atcg
>seq2 info=SeqB;score=2
aaaa
```
## Suggestions, issues, fixes
File an issue [here](https://github.com/scbirlab/bioino/issues).
## Documentation
Check the API [here](https://bioino.readthedocs.org).
Raw data
{
"_id": null,
"home_page": "",
"name": "bioino",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "biology,bioinformatics,science,io",
"author": "",
"author_email": "Eachan Johnson <eachan.johnson@crick.ac.uk>",
"download_url": "https://files.pythonhosted.org/packages/e4/c8/94bea1143b0d5bf4fe0f309f43eb744fb87ac1058c36b824a15fcedff846/bioino-0.0.2.post1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Lightweight IO and conversion for bioinformatics file formats.",
"version": "0.0.2.post1",
"project_urls": {
"Bug Tracker": "https://github.com/scbirlab/bioino/issues",
"Homepage": "https://github.com/scbirlab/bioino"
},
"split_keywords": [
"biology",
"bioinformatics",
"science",
"io"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "de91b71747acd7cf215dd2f0240515f07d6f429896e8ca5213cb76b1d130beef",
"md5": "f90a26f356df72b10d07a949b2f87162",
"sha256": "50e737582822f27be90006518930a35f0109ca6d2f9a4ec03148d97dd6c07797"
},
"downloads": -1,
"filename": "bioino-0.0.2.post1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f90a26f356df72b10d07a949b2f87162",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 15961,
"upload_time": "2024-03-19T16:34:28",
"upload_time_iso_8601": "2024-03-19T16:34:28.113185Z",
"url": "https://files.pythonhosted.org/packages/de/91/b71747acd7cf215dd2f0240515f07d6f429896e8ca5213cb76b1d130beef/bioino-0.0.2.post1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e4c894bea1143b0d5bf4fe0f309f43eb744fb87ac1058c36b824a15fcedff846",
"md5": "338e618311aa8abe1c2986326801df97",
"sha256": "63bdbed2f2cb1c8defd93c25387e86e9acba0dead51a994fef8dc4e2da060eb5"
},
"downloads": -1,
"filename": "bioino-0.0.2.post1.tar.gz",
"has_sig": false,
"md5_digest": "338e618311aa8abe1c2986326801df97",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 16862,
"upload_time": "2024-03-19T16:34:29",
"upload_time_iso_8601": "2024-03-19T16:34:29.903577Z",
"url": "https://files.pythonhosted.org/packages/e4/c8/94bea1143b0d5bf4fe0f309f43eb744fb87ac1058c36b824a15fcedff846/bioino-0.0.2.post1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-19 16:34:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "scbirlab",
"github_project": "bioino",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "bioino"
}