biocode


:Name: biocode
:Version: 0.12.0
:Home page: http://github.com/jorvis/biocode
:Summary: Bioinformatics code libraries and scripts
:Upload time: 2024-10-17 03:32:11
:Author: Joshua Orvis
:License: MIT
:Keywords: bioinformatics, scripts, modules, gff3, fasta, fastq, bam, sam

Overview
========

This is a collection of bioinformatics scripts many have found useful
and code modules which make writing new ones a lot faster.

Over the years most bioinformatics people amass a collection of small
utility scripts which make their lives easier. Too often they are kept
either in private repositories or as part of a public collection to
which no one else can contribute. Biocode is a curated repository of
general-use utility scripts my colleagues and I have found useful and
want to share with others. I have also developed some code
libraries/modules which have made my scripting work a lot easier. Some
have found these to be more useful than the scripts themselves.

Look below if you want to learn more, contribute code yourself, or just
get the scripts.

-- Joshua Orvis

The scripts
===========

The scope here is intentionally very open. I want to include anything
that developers find generally useful. There are no limitations on
language choice, though the majority are Python. For now, the following
directories make up the initial groupings but will be expanded as
needed:

-  blast - If it uses, massages, or just reformats BLAST output, it goes
   here.
-  chado - Scripts that are tied into the chado schema (gmod.org) should
   be found here.
-  fasta - Filtering, converting, size distribution plots, etc.
-  fastq - Utilities for fasta's newer sister format.
-  genbank - Anything related to the GenBank Flat File Format.
-  general - Utility scripts that may not fit in any other existing
   directory or don't warrant creation of their own. We should be
   selective about what we put here and create or use other directories
   whenever appropriate.
-  gff - Extractions, conversions and manipulations of files in the
   `Generic Feature Format <http://sequenceontology.org/gff3.shtml>`__
-  gtf - From Ensembl/WashU, the GTF format is the focus of scripts
   here.
-  hmm - Merging, manipulating or reading HMM libraries.
-  sam\_bam - Analysis and parsing of SAM/BAM files.
-  sandbox - Each committer gets their own personal directory here to
   add anything they want while testing or waiting to be moved to the
   production directories.
-  sysadmin - While not specifically bioinformatics, our work tends to
   be on Unix machines, and utility scripts are often needed to support
   our work. From file system manipulation to database backup scripts,
   put your generic sysadmin utilities here.
-  taxonomy - Anything related to taxonomic analysis.

The modules
===========

If you're a developer, these modules can save a lot of time. Yes, some
of the functionality duplicates what you'll find in modules like
`Biopython <http://biopython.org/wiki/Main_Page>`__, but these were
written to add features I always wanted, with a more
biologically-focused API.

Three of the primary Python modules:

`biocode.things <https://github.com/jorvis/biocode/blob/master/lib/biocode/things.py>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Classes here represent biological things (as defined by the `Sequence
Ontology <http://sequenceontology.org/>`__) in a way that makes more
sense biologically and hides some of the CS abstraction. What does this
mean? This is a simple example, but compare these syntax approaches:

::

    # This way is typical of other libraries (illustrative call, not a real API)
    genes = assembly.get_subfeatures_by_type( type='genes' )
    mRNAs = assembly.get_subfeatures_by_type( type='mRNA' )

    # And instead, in biocode.things:
    genes = assembly.genes()
    for gene in genes:
        mRNAs = gene.mRNAs()

This more direct approach is held throughout these libraries. It also
adds some shortcuts for tasks that always annoyed me when working with
things that had coordinates. Consider if you wanted to determine if one
gene is before another one on a molecule:

::

    if gene1 < gene2:
        return True

In the background, biocode checks if the two gene objects are located on
the same molecule and, if so, compares their coordinates. There are many
other methods for coordinate comparison, such as:

-  thing1 <= thing2 : thing1 overlaps thing2 on the 5' end
-  thing1.contained\_within( thing2 )
-  thing1.overlaps( thing2 )
-  thing1.overlap\_size\_with( thing2 )

This module also contains readable and detailed documentation within the
`source
code <https://github.com/jorvis/biocode/blob/master/lib/biocode/things.py>`__.
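
The coordinate comparisons described above can be illustrated with a
minimal, self-contained sketch. This is not biocode's implementation:
``LocatedThing`` is a hypothetical stand-in class, the half-open
``fmin``/``fmax`` coordinates are an assumed convention, and the exact
meaning given to ``<`` here (one feature ends before the other begins)
is an assumption for illustration only.

```python
from dataclasses import dataclass


@dataclass
class LocatedThing:
    """Hypothetical stand-in for a feature placed on a molecule (not a biocode class)."""
    molecule_id: str
    fmin: int  # start coordinate, 0-based (assumed convention)
    fmax: int  # end coordinate, exclusive (assumed convention)

    def _check_same_molecule(self, other):
        # Coordinates are only comparable when both features share a molecule
        if self.molecule_id != other.molecule_id:
            raise ValueError("features are located on different molecules")

    def __lt__(self, other):
        # Assumed meaning: this feature ends before the other one begins
        self._check_same_molecule(other)
        return self.fmax <= other.fmin

    def overlaps(self, other):
        self._check_same_molecule(other)
        return self.fmin < other.fmax and other.fmin < self.fmax

    def overlap_size_with(self, other):
        # Number of shared positions, or 0 if the features do not overlap
        self._check_same_molecule(other)
        return max(0, min(self.fmax, other.fmax) - max(self.fmin, other.fmin))

    def contained_within(self, other):
        self._check_same_molecule(other)
        return other.fmin <= self.fmin and self.fmax <= other.fmax


gene1 = LocatedThing("chr1", 100, 500)
gene2 = LocatedThing("chr1", 450, 900)
gene3 = LocatedThing("chr1", 1000, 1200)

print(gene1 < gene3)                   # gene1 ends before gene3 begins
print(gene1.overlaps(gene2))           # the two features share positions
print(gene1.overlap_size_with(gene2))  # size of the shared region
```

Putting the same-molecule check inside every comparison is the design
point: calling code can compare features directly without repeating
that bookkeeping each time.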

`biocode.annotation <https://github.com/jorvis/biocode/blob/master/lib/biocode/annotation.py>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This set of classes allows formal definition of functional annotation
which can be attached to various biothings. These include gene product
names, gene symbols, EC numbers, GO terms, etc. Once annotated, the
biothings can be written out in common formats such as GFF3, GenBank,
NCBI tbl, etc.

`biocode.gff <https://github.com/jorvis/biocode/blob/master/lib/biocode/gff.py>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Much of biocode was written while working with genomic data and
annotation, and one of the more common formats for storing these is
`GFF3 <http://sequenceontology.org/resources/gff3.html>`__. Using this
module, you can parse a GFF3 file of annotations into a set of biothings
with a single line of code. For example:

::

    import biocode.gff

    (assemblies, features) = biocode.gff.get_gff3_features( input_file_path )

That's it. You can then iterate over the assemblies and their children,
or access the 'features' dict, which is keyed on each feature's ID.
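
As a rough sketch of that iteration, the helper below walks each
assembly's genes and counts their mRNAs using the ``genes()`` and
``mRNAs()`` calls shown earlier. Two things here are assumptions rather
than documented behavior: that ``assemblies`` is a dict keyed on
assembly ID, and that each gene exposes its identifier as an ``id``
attribute. The function names ``count_mrnas_per_gene`` and
``summarize_gff3`` are made up for this example.

```python
def count_mrnas_per_gene(assemblies):
    """Return {gene ID: number of mRNAs} across all parsed assemblies."""
    counts = {}
    # Assumption: 'assemblies' is a dict keyed on assembly ID
    for assembly in assemblies.values():
        for gene in assembly.genes():
            # Assumption: each gene exposes its ID as an 'id' attribute
            counts[gene.id] = sum(1 for _ in gene.mRNAs())
    return counts


def summarize_gff3(input_file_path):
    """Parse a GFF3 file with biocode and print one line per gene."""
    # Imported here so the helper above can be used without biocode installed
    import biocode.gff

    (assemblies, features) = biocode.gff.get_gff3_features(input_file_path)
    print("Parsed {0} features".format(len(features)))
    for gene_id, mrna_count in count_mrnas_per_gene(assemblies).items():
        print("{0}\t{1}".format(gene_id, mrna_count))
```

Calling ``summarize_gff3`` with the path to a GFF3 file would then
print a tab-separated gene ID and mRNA count for every gene found.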

Installing dependencies
=======================

On Debian-based systems (like Ubuntu) you can be sure to get all biocode
dependencies like this:

::

   apt-get install -y python3 python3-pip zlib1g-dev libblas-dev liblapack-dev libxml2-dev

Getting the code (pip3, latest release)
=======================================

You can install biocode using pip3 (requires Python 3) like this:

::

    pip3 install biocode

Getting the code (github, current trunk)
========================================

If you want the latest developer version:

::

    git clone https://github.com/jorvis/biocode.git

**Important**: Many of these scripts use the modules in the biocode/lib
directory, so you'll need to point Python to them. Full setup example:

::

    cd /opt
    git clone https://github.com/jorvis/biocode.git

    # You probably want to add this line to your $HOME/.bashrc file
    export PYTHONPATH=/opt/biocode/lib:$PYTHONPATH
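
To confirm that Python can actually see the modules after setting
``PYTHONPATH``, a quick check along these lines may help. It is only a
sketch using the standard library's ``importlib.util.find_spec``; the
helper names ``module_available`` and ``check_biocode`` are made up for
this example.

```python
import importlib.util


def module_available(name):
    """Return True if Python can locate a top-level module or package called 'name'."""
    return importlib.util.find_spec(name) is not None


def check_biocode():
    """Report whether the biocode package is on the current module search path."""
    if module_available("biocode"):
        return "biocode found"
    return "biocode not found: check that PYTHONPATH includes biocode/lib"


print(check_biocode())
```

Run it in a new shell after editing ``.bashrc`` so that the updated
environment is the one being tested.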

Problems / Suggestions?
=======================

If you encounter any issues with the existing code, or would like to
request new features or scripts, please submit them to the `Issue tracking
system <https://github.com/jorvis/biocode/issues>`__.

Contributing
============

If you'd like to contribute code to this collection, have a look at the
`Requirements And Convention
Guide <https://github.com/jorvis/biocode/blob/master/RequirementsAndConventionGuide.md>`__
and then submit a pull request once your code is ready. We'll check your
script and pull it into the production directories. If you'd like to add
your code to the project but aren't sure it's ready for the production
directories yet, we'll happily pull in your sandbox directory instead.



            
