OGU


- Name: OGU
- Version: 1.52
- Summary: Organelle Genome Utilities
- Upload time: 2024-04-23 12:03:41
- Requires Python: >=3.9
- License: AGPL-3.0-or-later
- Keywords: plastid, mitochondria, organelle, genome, analysis, bioinformatics
- Requirements: biopython, certifi, coloredlogs, matplotlib, numpy, primer3-py

[![PyPI version](https://badge.fury.io/py/OGU.svg)](https://badge.fury.io/py/OGU)

# Quick start

- Install Python 3 (3.9 or newer).
- Open a terminal and run:

   ```shell
   # Install, using pip (recommended)
   pip3 install OGU --user

   # Initialize (requires Internet access)
   # Windows
   python -m OGU init
   # Linux and macOS
   python3 -m OGU init

   # Run
   # Windows
   python -m OGU
   # Linux and macOS
   python3 -m OGU
   ```

# Table of Contents

* [Quick start](#quick-start)
* [Features](#features)
* [Prerequisite](#prerequisite)
    * [Hardware](#hardware)
    * [Software](#software)
* [Installation](#installation)
    * [Install with pip](#install-with-pip)
    * [Initialization](#initialization)
* [Usage](#usage)
    * [Graphical user interface](#graphical-user-interface)
    * [Quick examples](#quick-examples)
    * [Sequence ID](#sequence-id)
    * [Command line](#command-line)
    * [Visualize](#visualize)
* [Input](#input)
* [Output](#output)
* [Options](#options)
    * [gb2fasta](#gb2fasta)
    * [Evaluate](#evaluate)
    * [Primer design](#primer-design)
* [Performance](#performance)
* [Citation](#citation)
* [License](#license)
* [Q&A](#qa)

# Features

:heavy_check_mark: Automatically collect, organize and clean sequence data
from NCBI GenBank or local: collect data with abundant options; extract CDS,
intergenic spacer, or any other annotations from original sequence; remove
redundant sequences according to species information; remove invalid or
abnormal sequences/fragments; generate clean dataset with uniform sequence id.

:heavy_check_mark: Evaluate variance of sequences by calculating nucleotide
diversity, observed resolution, Shannon index, tree resolution, phylogenetic
diversity (original and edited version), gap ratio, and others. Support
sliding-window scanning.

:heavy_check_mark: Design universal primer for the alignment. Support
ambiguous bases in primers.

# Prerequisite

## Hardware

`Organelle Genome Utilities (OGU)` requires very few computational resources.
A normal PC or laptop is enough. For downloading large amounts of data, make
sure the Internet connection is stable and fast enough.

## Software

For the portable version, nothing needs to be installed manually.

For installing from pip, [Python](https://www.python.org/downloads/) is
required. Note that the Python version must be **3.9** or newer.

:white_check_mark: All third-party dependencies will be automatically
installed if Internet access is available, including the Python packages
`biopython`, `matplotlib`, `coloredlogs`, `numpy`, and `primer3-py`, as well as
[MAFFT](https://mafft.cbrc.jp/alignment/software/),
[IQTREE](http://www.iqtree.org/), and
[BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).

# Installation

We assume that users have already installed
[Python3](https://www.python.org/downloads/) (3.9 or above).

## Install with pip

1. Install [Python](https://www.python.org/downloads/). 3.9 or newer is
   required.

2. Open the command line and run:

```shell
pip3 install OGU --user
```

## Initialization

On its first run, `OGU` checks and initializes the running environment.
Missing dependencies are installed automatically.

This step requires an Internet connection.

By default, the program finishes initialization automatically. If any error
occurs, users can choose one of the following methods:

### Automatic

Run the following command.

```shell
# Windows
python -m OGU init
# Linux and macOS
python3 -m OGU init
```

### Use prepared package
Download the compressed file that matches your system from [packages](https://github.com/wpwupingwp/OGU/releases).

For Windows users:
1. Paste `%HOMEDRIVE%%HOMEPATH%/` into the Explorer address bar and open it.
2. Create a `.OGU` folder. Don't miss the dot.
3. Open the `.OGU` folder, paste the downloaded compressed file, and unzip it.
   Make sure that after decompression there are three folders in `.OGU`.

For Linux and macOS users, please download and unpack the files into
`~/.OGU`.

### Manually install

For Linux users with root privileges, just use the package manager:

```
# Ubuntu and Debian
sudo apt install mafft ncbi-blast+ iqtree
# Fedora (1)
sudo dnf install mafft ncbi-blast+ iqtree
# Fedora (2)
sudo yum install mafft ncbi-blast+ iqtree
# ArchLinux
sudo pacman -S mafft ncbi-blast+ iqtree
# FreeBSD
sudo pkg install mafft ncbi-blast+ iqtree
```

For macOS users with root privileges, install `brew` if it has not been
installed previously:

```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

If any errors occur, install the Xcode command line tools (`xcode-select --install`) and retry.

Then:

```
brew install blast mafft brewsci/science/iqtree
```

If using Windows or lacking root privileges, users should follow these
instructions:

1. BLAST+

    * [Windows](https://www.ncbi.nlm.nih.gov/books/NBK52637/)
    * [Linux and macOS](https://www.ncbi.nlm.nih.gov/books/NBK52640/)
2. MAFFT

    * [Windows](https://mafft.cbrc.jp/alignment/software/windows.html)

      Choose "All-in-one version", download and unzip. Then follow the steps
      in the BLAST+ installation manual to set the `PATH`.
    * [Linux](https://mafft.cbrc.jp/alignment/software/linux.html)

      Choose "Portable package", download and unzip. Then follow the
      instructions of BLAST+ to set the `PATH` for `MAFFT`.
    * [macOS](https://mafft.cbrc.jp/alignment/software/macosx.html)

      Choose "All-in-one version", download and unzip. Then follow the steps
      in the BLAST+ installation manual to set the `PATH`.
3. IQ-TREE

    * [Download](http://www.iqtree.org/#download)

      Download the package for your operating system. Unzip it and add the
      path of the subfolder `bin` to `PATH`.

# Usage

## Graphical user interface
Open the command line (Windows) or terminal (Linux and macOS),
run

```bash
OGU
```

or 
```bash
# linux and macos
python3 -m OGU
# windows
python -m OGU
```

## Command line
Open the command line (Windows) or terminal (Linux and macOS) and type the
command:

```
# Windows
python -m OGU [input] -[options] -out [out_folder]
# Linux and macOS
python3 -m OGU [input] -[options] -out [out_folder]
```

## Quick examples

1. Download all `rbcL` sequences of species in the Poaceae family and
   preprocess them:

```
# Windows
python -m OGU.gb2fasta -gene rbcL -taxon Poaceae -out rbcL_Poaceae
# Linux and macOS
python3 -m OGU.gb2fasta -gene rbcL -taxon Poaceae -out rbcL_Poaceae
```

2. Download all ITS sequences of the genus _Rosa_, preprocess them, and keep
   redundant sequences:

```
# Windows
python -m OGU.gb2fasta -query internal transcribed spacer -taxon Rosa -out Rosa_its -uniq no
# Linux and macOS
python3 -m OGU.gb2fasta -query internal transcribed spacer -taxon Rosa -out Rosa_its -uniq no
```

3. Download all Lamiaceae chloroplast genomic sequences in the RefSeq database.
   Then preprocess them and evaluate the variance (skip primer design):

```
# Windows
python -m OGU -og cp -refseq yes -taxon Lamiaceae -out Lamiaceae_cp
# Linux and macOS
python3 -m OGU -og cp -refseq yes -taxon Lamiaceae -out Lamiaceae_cp
```

4. Download sequences of _Zea mays_, set the length between 100 bp and 3000 bp,
   and then perform evaluation and primer design. Note that the space in
   the species name is replaced with an underscore ("\_").

```
# Windows
python -m OGU -taxon Zea_mays -min_len 100 -max_len 3000 -out Zea_mays -primer
# Linux and macOS
python3 -m OGU -taxon Zea_mays -min_len 100 -max_len 3000 -out Zea_mays -primer
```

5. Download all _Oryza_ mitochondrial genomes in the RefSeq database, keep the
   longest sequence for each species, and run a full analysis:

```
# Windows
python -m OGU -taxon Oryza -og mt -min_len 50000 -max_len 200000 -uniq longest -out Oryza_cp -refseq yes -primer
# Linux and macOS
python3 -m OGU -taxon Oryza -og mt -min_len 50000 -max_len 200000 -uniq longest -out Oryza_cp -refseq yes -primer
```

## Sequence ID

`Organelle Genome Utilities` uses a uniform sequence id format for input fasta files and all output
sequences.

```
Locus|Kingdom|Phylum|Class|Order|Family|Genus|Species|Accession|SpecimenID_Isolate|Type
# example
rbcL|Viridiplantae|Streptophyta|Magnoliopsida|Poales|Poaceae|Oryza|longistaminata|MF998442|TAN:GB60B-2014|gene
```

The order of the fields is fixed. The fields are separated by vertical bars
("|"). The space character (" ") is not allowed and is replaced by an
underscore ("\_"). Due to missing data, some fields may be empty.

`Locus`: The name of the sequence. Usually it is the gene name. For an
intergenic spacer, an underscore ("\_") connects the names of the two
flanking genes, e.g., "geneA_geneB".

If a valid sequence name cannot be found in the annotations of the GenBank
file, `Organelle Genome Utilities` will use "Unknown" instead.

For chloroplast genes, if "-rename" option is set, the program will try to use
regular expressions to fix potential errors in gene names.

`Kingdom`: The kingdom (_Fungi, Viridiplantae, Metazoa_) of a species. For
convenience, a superkingdom (_Bacteria, Archaea, Eukaryota, Viruses, Viroids_)
may be used if the kingdom information for a sequence is missing.

`Phylum`: The phylum of the species.

`Class`: The class of the species.

Because the class is missing for some species (for instance, basal
angiosperms), `Organelle Genome Utilities` will guess the class of the species.

Given the taxonomy information in GenBank file:

```
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
    Spermatophyta; Magnoliophyta; basal Magnoliophyta; Amborellales;
    Amborellaceae; Amborella.
```

`Organelle Genome Utilities` will use "basal Magnoliophyta" as the class because this
entry appears immediately before the order name ("Amborellales").
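
The guessing rule above can be sketched in a few lines. This is only an illustration, not the program's actual code: the function name `guess_class` is hypothetical, and it assumes that the order can be recognized by the suffix "-ales", which holds for plants and fungi but not for animals.

```python
def guess_class(lineage: str) -> str:
    """Return the lineage entry located right before the first order-like name.

    Assumption: an order name ends with "ales" (true for plants and fungi).
    Returns an empty string if no such entry is found.
    """
    ranks = [item.strip() for item in lineage.rstrip(". \n").split(";")]
    ranks = [item for item in ranks if item]
    for index, name in enumerate(ranks):
        if index > 0 and name.endswith("ales"):
            return ranks[index - 1]
    return ""


lineage = """Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
    Spermatophyta; Magnoliophyta; basal Magnoliophyta; Amborellales;
    Amborellaceae; Amborella."""
print(guess_class(lineage))
```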

`Order`: The order name of the species.

`Family`: The family name of the species.

`Genus`: The genus name of the species, i.e., the first part of the scientific
name.

`Species`: The specific epithet of the species, i.e., the second part of the
scientific name of the species. It may contain the subspecies' name.

`Accession`: The GenBank Accession number for the sequence. It does not
contain the record's version.

`SpecimenID` and `Isolate`: Specimen ID and Isolate ID of the sequence. May be empty.

`Type`: Type of the sequence. It is usually "gene" or "spacer".
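
For downstream scripts, the sequence ID can be split back into its fields. The following is a minimal sketch based only on the format described above; `parse_sequence_id` and `FIELDS` are hypothetical names and are not part of the `OGU` package.

```python
FIELDS = (
    "Locus", "Kingdom", "Phylum", "Class", "Order", "Family",
    "Genus", "Species", "Accession", "SpecimenID_Isolate", "Type",
)


def parse_sequence_id(seq_id: str) -> dict:
    """Split an ID into named fields; empty fields stay as empty strings."""
    values = seq_id.strip().lstrip(">").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


example = ("rbcL|Viridiplantae|Streptophyta|Magnoliopsida|Poales|Poaceae|"
           "Oryza|longistaminata|MF998442|TAN:GB60B-2014|gene")
record = parse_sequence_id(example)
print(record["Genus"], record["Species"], record["Accession"])
```

Because the field order is fixed, a simple split is enough; fields that are missing in the source data come back as empty strings.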

## Command line

:exclamation: In Linux and macOS, Python 2 is `python2` and Python 3 is
`python3`. In Windows, however, Python 3 is called `python`. Please
note the difference.

* Show help information of each module

 ```shell
 # Windows
 python -m OGU -h
 python -m OGU.gb2fasta -h
 python -m OGU.evaluate -h
 python -m OGU.primer -h
 # Linux and macOS
 python3 -m OGU -h
 python3 -m OGU.gb2fasta -h
 python3 -m OGU.evaluate -h
 python3 -m OGU.primer -h
 ```

* Full process

 ```shell
 # Windows
 python -m OGU -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]
 # Linux and macOS
 python3 -m OGU -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]
 ```

* Collect, convert, and clean GenBank data with gb2fasta module

 ```shell
 # Windows
 python -m OGU.gb2fasta -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]
 # Linux and macOS
 python3 -m OGU.gb2fasta -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]
 ```

* Evaluate variance of given fasta files

 ```shell
 # Windows
 python -m OGU.evaluate -fasta [fasta files]
 # Linux and macOS
 python3 -m OGU.evaluate -fasta [fasta files]
 ```

* Design universal primers for given alignments

 ```shell
 # Windows
 python -m OGU.primer -aln [alignment files]
 # Linux and macOS
 python3 -m OGU.primer -aln [alignment files]
 ```

## Visualize
 
Here are two Jupyter notebooks for visualizing the analysis results as a
detailed circular figure:
  - `Visualize/draw_circle_plastid.ipynb`: for plastid genomes
  - `Visualize/draw_circle_mitochondria.ipynb`: for mitochondrial genomes

Since users may want to customize the figure, we provide Jupyter notebooks
instead of packaged code. Users can get the result by following these steps:

0. Run `pip install jupyterlab` to install JupyterLab.
1. Open the notebook in Jupyter, Visual Studio Code, or another IDE you prefer.
2. Set `filename` to the `Evaluation.csv` file you obtained.
3. Set `gb_file` to the extended GenBank file you obtained. Remember to generate
   it with `-out_debug` in `OGU.gb2fasta`.
4. To visualize plastid data, provide the `LSC`, `SSC`, `IRa`, and `IRb`
   lengths, or use the default values, which are for tobacco
   (*Nicotiana tabacum*).
5. Edit the color themes as you wish.
6. Run all cells to output the PDF figure.
 
# Input

`Organelle Genome Utilities` accepts:

1. GenBank queries. Users can use "-query" alone or combine it with any other
   filters.
2. GenBank-format files.
3. Unaligned fasta files. Each file is considered as one locus when evaluating
   the variance.
4. Alignments (fasta format).

# Output

All results will be put in the output folder. If the user does not set the
output path via "-out", `Organelle Genome Utilities` will create a folder labelled "Result".

In the output folder, several sub-folders will be created.

* GenBank

  Raw GenBank files.

* Divide

  Fasta files converted from the GenBank file. Each file represents a
  fragment of the original sequence according to the annotation.

  For instance, a record in a "rbcL.gb" file may also contain atpB gene's
  sequences. The "rbcL.fasta" file does not contain any upstream/downstream
  sequences and "atpB_rbcL.fasta" does not have even one base of the atpB or
  rbcL gene, just the spacer (assuming the annotation is precise).

  Users can skip this dividing step with the option "-no_divide".
* Fasta

  Raw fasta files users provided.
* Unique

  Fasta files after removing redundant sequences.
* Expanded_fasta

  To design primers, `Organelle Genome Utilities` extends a sequence into its
  upstream/downstream regions. Only used in the primer module.
* Alignment

  Aligned fasta files.

  `.aln`: The aligned fasta files.

  `.-consensus.fastq`: The fastq format of the consensus sequence of the
  alignment. Note that it contains alignment gaps ("-"). Using it directly is
  NOT RECOMMENDED because the consensus-generating algorithm is
  optimised for primer design.
* Evaluate

  Contains output files from the evaluation module.

  `.pdf`: The PDF format of the figure containing the sliding-window scan
  result of the alignment.

  `.csv`: The CSV format file of the sliding-window scan result. `"Index"`
  means the location of the base in the alignment.

* Primer

  Contains output files from the primer module.

  `.primer.fastq`: The fastq format file of a primer pair's sequences. It
  contains two sequences, both written 5' to 3'. The first is the forward
  primer, and the second is the reverse primer. The quality of each base is
  equal to its proportion in the corresponding column of the alignment. Note
  that the sequences may contain ambiguous bases unless they were disabled.

  `.primers.csv`: The list of primer pairs in CSV (comma-separated values
  text) format.

  `.candidate.fasta`: The candidate primers. This file may contain
  thousands of records and can usually be ignored.

  `.candidate.fastq`: Again, the candidate primers. This time, each record has
  quality information that equals the proportion of each base in the
  corresponding column of the alignment.

* Temp

  Contains temporary files. It can be safely deleted.
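
The fastq outputs above describe base quality as the proportion of a base within its alignment column. A minimal sketch of that idea follows, using a plain majority vote per column. It is not the program's consensus algorithm (which also handles ambiguous bases and is tuned for primer design), and `column_consensus` is a hypothetical name.

```python
from collections import Counter


def column_consensus(alignment):
    """Return (most common base, its proportion) for every alignment column.

    `alignment` is a list of equal-length aligned sequences. Ties are
    resolved in favor of the base seen first in the column.
    """
    result = []
    for column in zip(*alignment):
        base, count = Counter(column).most_common(1)[0]
        result.append((base, count / len(column)))
    return result


alignment = ["ACGT", "ACGA", "AC-A"]
for base, proportion in column_consensus(alignment):
    print(base, round(proportion, 2))
```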

In the output folder, there are some other important output files:

* Primers.csv

  The list of primer pairs in CSV (comma-separated values text) format.

  Its title:
    ```
    Locus,Samples,Score,AvgProductLength,StdEV,MinProductLength,MaxProductLength,Coverage,Observed_Res,Tree_Res,PD_terminal,Entropy,LeftSeq,LeftTm,LeftAvgBitscore,LeftAvgMismatch,RightSeq,RightTm,RightAvgBitscore,RightAvgMismatch,DeltaTm,AlnStart,AlnEnd,AvgSeqStart,AvgSeqEnd
    ```

  `Locus`: The name of the locus/fragment.

  `Samples`: The number of sequences used to find this pair of primers.

  `Score`: The score of this pair of primers. Usually the higher, the better.

  `AvgProductLength`: The average length of the DNA fragment amplified by
  this pair of primers.

  `StdEV`: The standard deviation of the AvgProductLength. A higher number
  means the primer may amplify different lengths of DNA fragments.

  `MinProductLength`: The minimum length of an amplified fragment.

  `MaxProductLength`: The maximum length of an amplified fragment. Note that
  all of these fields are calculated using given sequences.

  `Coverage`: The coverage of this pair of primers over the sequences it
  used. Calculated with the BLAST result. High coverage means that the pair
  is much more "universal".

  `Observed_Res`: The `observed resolution` of the sub-alignment sliced by
  the primer pair, which is equal to the number of unique sequences divided
  by the number of total sequences. The value is between 0 and 1.

    <img src="https://latex.codecogs.com/svg.latex?\dpi{300}&space;R_{o}=\frac{n_{uniq}}{n_{total}}" title="R_{o}=\frac{n_{uniq}}{n_{total}}" />

  `Tree_Res`: The `tree resolution` of the sub-alignment, which is equal to
  the number of internal nodes on a phylogenetic tree (constructed from the
  alignment) divided by the number of terminal nodes. The value is between 0
  and 1.

    <img src="https://latex.codecogs.com/svg.latex?\dpi{300}&space;R_{T}=\frac{n_{internal}}{n_{terminal}}" title="R_{T}=\frac{n_{internal}}{n_{terminal}}" />

  `PD_terminal`: The average of the terminal branch's length. It's an edited
  version of the `Phylogenetic Diversity` for DNA barcoding evaluation.

  `Entropy`: The `Shannon equitability` index of the sub-alignment. The value
  is between 0 and 1.

    <img src="https://latex.codecogs.com/svg.latex?\dpi{300}&space;E_{H}&space;=&space;\frac{-&space;\sum_{i=1}^{k}{p_{i}&space;\log(p_{i})}}{\log(k)}" title="E_{H} = \frac{- \sum_{i=1}^{k}{p_{i} \log(p_{i})}}{\log(k)}" />

  `LeftSeq`: Sequence of the forward primer. The direction is 5' to 3'.

  `LeftTm`: The melting temperature of the forward primer. The unit is
  degrees Celsius (°C).

  `LeftAvgBitscore`: The average raw bitscore of the forward primer, which
  is calculated by BLAST.

  `LeftAvgMismatch`: The average number of mismatched bases of the forward
  primer, as counted by BLAST.

  `RightSeq`: Sequence of reverse primer. The direction is 5' to 3'.

  `RightTm`: The melting temperature of the reverse primer. The unit is
  degrees Celsius (°C).

  `RightAvgBitscore`: The average raw bitscore of the reverse primer, which
  is calculated by BLAST.

  `RightAvgMismatch`: The average number of mismatched bases of the reverse
  primer, as counted by BLAST.

  `DeltaTm`: The difference in the melting temperatures of the forward and
  reverse primers. A pair of primers with a high DeltaTm may result in
  failure during the PCR experiment.

  `AlnStart`: The location of the beginning of the forward primer (5',
  leftmost of primer pairs) in the entire alignment.

  `AlnEnd`: The location of the end of the reverse primer (5', rightmost of
  primer pairs) in the entire alignment.

  `AvgSeqStart`: The average beginning of the forward primer in the original
  sequences.  *ONLY USED FOR DEBUG*.

  `AvgSeqEnd`: The average end of the reverse primer in the original
  sequences.  *ONLY USED FOR DEBUG*.

  The primer pairs are sorted by `Score`. Since the score may not fully
  satisfy the user's specific considerations, it is suggested that primer
  pairs be chosen manually if the first primer pair fails during the PCR
  experiment.

* Log.txt

  The log file. Contains all the information printed on the screen.

* Evaluation.csv

  The summary of all loci/fragments, which only contains the variance
  information for each fragment. One of the new fields, `GapRatio`, is the
  ratio of gaps ("-") in the alignment. `PD` is the original version
  of the phylogenetic diversity and `PD_stem` is an alternative version
  of it which only counts the length of the stem branches in the
  phylogenetic tree.
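
The `Observed_Res`, `Entropy`, and `GapRatio` fields described above can be reproduced with a short sketch. The function names are hypothetical, and one point is an assumption rather than something stated here: the proportions p_i in the Shannon equitability are taken to be the frequencies of each unique sequence in the sub-alignment.

```python
import math
from collections import Counter


def observed_resolution(seqs):
    """Number of unique sequences divided by the number of sequences."""
    return len(set(seqs)) / len(seqs)


def shannon_equitability(seqs):
    """Shannon entropy of unique-sequence frequencies divided by log(k).

    Assumption: p_i is the frequency of the i-th unique sequence and k is
    the number of unique sequences. Returns 0.0 when k <= 1.
    """
    counts = Counter(seqs)
    k = len(counts)
    if k <= 1:
        return 0.0
    total = len(seqs)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(k)


def gap_ratio(seqs):
    """Share of gap characters ("-") among all characters of the alignment."""
    return sum(s.count("-") for s in seqs) / sum(len(s) for s in seqs)


aligned = ["ACGT", "ACGT", "AC-T", "ACGA"]
print(observed_resolution(aligned))
print(round(shannon_equitability(aligned), 4))
print(gap_ratio(aligned))
```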

# Options

Here are some general options for the program and its submodules:

`-h`: Prints the help message of the program or of one of its modules.

`-gb [filename]`: User-provided GenBank file or files. Could be one or more
files separated by spaces.

For instance,

```
# one file
-gb sequence.gb
# multiple files
-gb matK.gb rbcL.gb Oryza.gb Homo_sapiens.gb
```

`-fasta [filename]`: User-provided unaligned fasta files. Could be one or
multiple.

`-aln [filename]`: Alignment files that the user provides. Could be one or
multiple.

It only supports the fasta format. Ambiguous bases and gaps ("-") are supported.

`-out [folder name]`: The output folder's name. All results will be put into
the output folder. If the user does not set an output path via "-out",
`Organelle Genome Utilities` will create a folder named "Result".

`OGU` does not overwrite the existing folder with the same name.

It is HIGHLY RECOMMENDED to use only letters, numbers and underscores ("\_") in
the folder name to avoid mysterious errors caused by other Unicode characters.

Options below are for specific modules.

## gb2fasta

### Query

Options used for querying NCBI GenBank.

`-taxon [taxonomy name]`: The taxonomy name. It could be any taxonomic rank
from kingdom (same as "-group") to species, as long as the user inputs the
correct name (the scientific name of the species or taxonomic group in Latin,
NOT ENGLISH). It restricts the query to the targeted taxonomic unit. If the
name has more than one word, use quotation marks or replace the space with an
underscore, for instance `"Zea mays"` or `Zea_mays`.

`-gene [gene name]`: The gene's name which the user wants to query in GenBank.
If the user wants to use logical expressions like "OR", "AND", "NOT", s/he
should use "-query" instead. If there is space in the gene's name, make sure
to use quotation marks.

Note that "ITS" is not a gene name--it is "internal transcribed spacer".

Sometimes "-gene" options may bring in unwanted sequences. For example, if a
user queries "rbcL[gene]" in GenBank, spacer sequences may contain _rbcL_ or
_rbcL_'s upstream/downstream gene, such as "atpB_rbcL spacer" or _atpB_.

`-og [ignore|both|no|mt|mitochondrion|cp|chloroplast|pl|plastid]`: Query
organelle sequences or not. The default value is `ignore`.

    - `ignore`: do not consider organelle type, same as GenBank website's
      default setting.

    - `both`: only query organelle sequences, including both plastid and
      mitochondrion.

    - `no`: exclude organelle sequences from the query.

    - `cp` or `chloroplast` or `pl` or `plastid`: only query plastid sequences

    - `mt` or `mitochondrion`: only query mitochondrion sequences.

`-refseq [both|yes|no]`: query in RefSeq database or not. The default value is
`both`.

    - `both`: query all sequences in or not in RefSeq database, same as NCBI
      website's default setting.

    - `yes`: only query sequences in RefSeq database.

    - `no`: exclude sequences in RefSeq database.

[RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) is considered to have
higher sequence and annotation quality than GenBank. This option could be used
for getting nuclear/organelle genomes from NCBI. In this situation (`-refseq
yes`), the length limit will be removed automatically.

`-count [number]`: Limit the number of sequences to download. The default
value `0` means no restriction.

`-min_len [length]`: The minimum length of the records downloaded from
GenBank. The default value is `100` (bp). The number must be an integer.

`-max_len [length]`: The maximum length of the records downloaded from
GenBank. The default value is `10000` (bp). The number must be an integer.

`-date_start [yyyy/mm/dd]`: The beginning of the release date range of the
sequences. The format is yyyy/mm/dd.

`-date_end [yyyy/mm/dd]`: The end of the release date range of the sequences.
The format is yyyy/mm/dd.

`-molecular [all|DNA|RNA]`: The molecular type, which could be DNA or RNA. The
default is `all`--no restriction.

`-email [email address]`: NCBI GenBank requires users to provide an email
address so that NCBI can contact the user if something abnormal happens. For
convenience, `OGU` will use "guest@example.com" if the user does not provide
an email address. _However_, it is better to provide a real email address for
potential contact.

`-query [expression]`: The query string provided by the user. It behaves in
the same manner as the query the user typed into the Search Box in NCBI
GenBank's webpage.

Make sure to follow NCBI's grammar for queries. It can contain several words.
Remember to add quotation marks if an item contains more than one word, for
instance, `"Homo sapiens"[organism]`, or replace the space with an underscore,
`Homo_sapiens[organism]`.

`-exclude [expression]`: Add a negative filter to the query. For instance,
"-exclude Zea [organism]" (do not include quotation marks) will add " NOT
(Zea[organism])" to the query.

This option can be useful for excluding a specific taxon.

```
-taxon Zea -exclude "Zea mays"[organism]
```

This will query all records in genus *Zea* while records of *Zea mays* will be
excluded.

For more complex exclusion rules, please consider using "Advanced search" on
the GenBank website.
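
To make the effect of these options concrete, the sketch below composes an Entrez-style query string from a taxon, a gene, a length range, and an exclusion. It is a hypothetical illustration: `build_query` is not a function of `OGU`, and the exact query the program sends to NCBI may differ. The `[organism]`, `[gene]`, and `[SLEN]` (sequence length) field tags are standard NCBI Nucleotide search fields.

```python
def build_query(taxon="", gene="", min_len=100, max_len=10000, exclude=""):
    """Compose an NCBI Nucleotide query string (illustrative only)."""
    parts = []
    if taxon:
        name = taxon.replace("_", " ")  # Zea_mays -> Zea mays
        parts.append(f'"{name}"[organism]')
    if gene:
        parts.append(f'"{gene}"[gene]')
    parts.append(f"{min_len}:{max_len}[SLEN]")
    query = " AND ".join(parts)
    if exclude:
        # Mirrors the documented behaviour: " NOT (expression)" is appended.
        query += f" NOT ({exclude})"
    return query


print(build_query(taxon="Zea", exclude='"Zea mays"[organism]'))
print(build_query(taxon="Zea_mays", gene="rbcL", max_len=3000))
```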

`-group [all|animals|plants|fungi|protists|bacteria|archaea|viruses]`:
Restrict the query to the given group. The default value is `all`--no
restriction.

It is reported that the "group" filter may return abnormal records, for
instance, return plants' records when the group is "animal" and the
"organelle" is "chloroplast". Furthermore, it may match a great number of
records in GenBank. Hence, we strongly recommend using "-taxon" instead.

### Divide

Options used for converting GenBank files to fasta files.

`-out_debug`: If you are going to use the visualization pipeline to draw a
detailed circular figure, use this option to generate the extended GenBank file.

`-no_divide`: If set, it will analyse the whole sequence instead of the
divided fragments. By default, `OGU` divides one GenBank record into
several fragments according to its annotation.

`-rename`: If set, the program will try to rename genes. For instance, "rbcl"
will be renamed to "rbcL", and "tRNA UAC" will be renamed to "trnVuac", which
consists of "trn", the amino acid's letter and the transcribed codon. This is
helpful when annotations use nonstandard capitalization or naming formats,
because sequences of the same locus under variant names are then merged into
one file.

If using Windows, where file names are case-insensitive, consider using this
option to avoid conflicting filenames.
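
The tRNA part of this renaming rule can be sketched as follows. This is not the program's implementation: `rename_trna` and the regular expression are hypothetical, only names of the form "tRNA" plus a three-letter anticodon are handled, and the amino acid is looked up in the standard genetic code.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rename_trna(name: str) -> str:
    """Turn e.g. "tRNA UAC" into "trnVuac"; return other names unchanged."""
    match = re.fullmatch(r"tRNA[ _-]?([ACGUT]{3})", name.strip(), re.IGNORECASE)
    if match is None:
        return name
    anticodon = match.group(1).upper().replace("U", "T")
    # The codon read by the tRNA is the reverse complement of the anticodon.
    codon = anticodon.translate(COMPLEMENT)[::-1]
    amino_acid = CODON_TABLE[codon]
    return f"trn{amino_acid}{anticodon.replace('T', 'U').lower()}"


print(rename_trna("tRNA UAC"))
print(rename_trna("rbcL"))
```

Anticodons that pair with stop codons would yield "\*" as the amino-acid letter in this sketch; a real implementation needs to treat such cases separately.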

`-unique [longest|first|no]`: The method used to remove redundant sequences.
`OGU` will remove redundant sequences to ensure only one sequence per
species by default. A user can change its behaviour by setting different
methods.

    - `first`: According to the records' order in the original GenBank file,
      only the first sequence of the same species' same locus will be kept.
      Others will be ignored directly. This is the default option due to
      performance considerations.

    - `longest`: Keep the longest sequence for one species. The program will
      compare the sequence's length from the same species' same locus.

    - `no`: Skip this step. All sequences will be kept.
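
The three methods above can be illustrated with a small sketch. It assumes records are `(species, locus, sequence)` tuples, which is a simplification for the example; `remove_redundant` is a hypothetical name, not the program's function.

```python
def remove_redundant(records, method="first"):
    """Keep one record per (species, locus) according to `method`.

    records: iterable of (species, locus, sequence) tuples.
    method: "first", "longest", or "no".
    """
    if method not in ("first", "longest", "no"):
        raise ValueError(f"unknown method: {method}")
    if method == "no":
        return list(records)
    kept = {}
    for species, locus, sequence in records:
        key = (species, locus)
        if key not in kept:
            kept[key] = (species, locus, sequence)
        elif method == "longest" and len(sequence) > len(kept[key][2]):
            kept[key] = (species, locus, sequence)
    return list(kept.values())


records = [
    ("Zea_mays", "rbcL", "ATGC"),
    ("Zea_mays", "rbcL", "ATGCATGC"),
    ("Oryza_sativa", "rbcL", "ATG"),
]
print(remove_redundant(records, "first"))
print(remove_redundant(records, "longest"))
```

With `first`, each record is checked only against what has already been kept, so no length comparison is needed, which matches the performance note above.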

`-allow_mosaic_spacer`: If one gene is nested within another gene, normally
they do not have a spacer. The default value is `False`.

However, some users want the fragments between the two genes' beginnings and
ends. This option is for this specific purpose (e.g., matK-trnK_UUU). It is
*not recommended* for normal usage.

`-expand [number]`: The expansion length in upstream/downstream. If set,
`OGU` will expand the sequence to its upstream/downstream after the
dividing step to find primer candidates. The default value is `0`.

Note that this option is different from "-max_len". This option limits the
length of one annotation's sequence, while "-max_len" limits the length of the
whole sequence of one GenBank record.

`-allow_repeat`: If a gene is repeated downstream, this option allows the
repeated region to be extracted; otherwise, any repeated region is omitted.
The default value is `False`.

`-allow_invert_repeat`: If two genes are invert-repeated downstream, this
option allows the spacers between them to be extracted; otherwise, the spacer
is omitted. The default value is `False`.

For instance, geneA-geneB is located in one inverted-repeat (IR) region of the
chloroplast genome, and geneB-geneA is located in the other IR region. This
option extracts the sequences of the two directions as two unique spacers.

`-max_name_len [number]`: The maximum length of a feature name. Some feature
names in GenBank files are very long, and usually such features are not the
target sequences the user wants. By setting this option, `OGU` will truncate
an annotation's feature name if it is too long. By default, the value is `50`.

`-max_gene_len [value]`: The maximum length of a sequence for one annotation.
Some annotations' sequences are too long (for instance, one gene has two
exons, and its intron is longer than 10 Kb). This option will skip those long
sequences. By default, the value is `20000` (bp).

## Evaluate

`-ig` or `-ignore_gap`: ignore gaps in the alignment.

`-iab` or `-ignore_ambigous`: ignore ambiguous bases in the alignment.

`-quick`: skip sliding-window scan.

`-size [number]`: the window size of the sliding window scan. The default
value is `500`.

`-step [number]`: the step size of the sliding window scan. The default value
is `50`.

`-skip_primer`: skip primer designing. The default value is `False`.
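
The `-size` and `-step` options define how the sliding-window scan moves along the alignment. A minimal sketch of the window coordinates is shown below. It is an assumption that only full-size windows are produced; how the program treats the alignment tail or alignments shorter than the window is not specified here, and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(aln_length, size=500, step=50):
    """Return (start, end) pairs of full-size windows, end exclusive.

    Assumption: windows that would run past the alignment end are dropped,
    so an alignment shorter than `size` yields no windows.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [
        (start, start + size)
        for start in range(0, aln_length - size + 1, step)
    ]


windows = sliding_windows(1000)
print(len(windows), windows[0], windows[-1])
```

Each window would then be evaluated on its own, for example with the variance measures listed in `Evaluation.csv`.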

## Primer design

`-coverage [value]`: The minimum coverage of the base and primer. The default
value is `0.5` (50%). It is used to remove primer candidates if its coverage
among all sequences is smaller than the threshold. The coverage of primers is
calculated by BLAST.

`-res [value]`: The minimum *observed resolution* of the fragments or primer
pairs. The default *value* is 0.3 (30%). The value must be between 0.0 and 1.0.

`OGU` uses the *observed resolution* instead of other measures because it is
fast to compute. It is also considered to be the lower bound of the real
resolution: a fragment with a low *observed resolution* is unlikely to have a
satisfactory tree resolution or phylogenetic diversity, either.

`-pmin [length]`: The minimal length of the primer. The default *value* is 20.

`-pmax [length]`: The maximal length of the primer. The default *value* is 25.

`-topn [number]`: How many pairs of primers are kept for each input alignment.
The default value is `1`, i.e., only keep the _best_ primer pair according to
its `score`. To keep more pairs, set "-topn" to more than 1.

`-amin [length]`: The minimum amplified length (including primers). The default
value is `300` (bp). Note that this limits the PCR product's length instead of
the sub-alignment's length.

`-amax [length]`: The maximum amplified length (including primers). The default
value is `800` (bp).

The "-amin" and "-amax" are used to screen primer candidates. It uses BLAST
results to set the location of primers on each template sequence and
calculates the average lengths of the products. Because of the variance of
species, the same locus may have different lengths in different species, plus
with the stretching of the alignment that gaps were added during the aligning,
please consider adding some *margins* for these two options.

For instance, if a user wants the amplified length to be smaller than 800 and
greater than 500, they could consider setting "-amin" to 550 and "-amax" to
750.
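The screening idea can be sketched as follows. This is a minimal illustration, not `OGU`'s implementation: it assumes each template sequence yields one pair of 1-based inclusive coordinates (start of the forward primer, end of the reverse primer), and the helper names `product_length` and `within_length_limits` are hypothetical.

```python
from statistics import mean


def product_length(forward_start: int, reverse_end: int) -> int:
    """Length of the amplified fragment, primers included (1-based, inclusive)."""
    return reverse_end - forward_start + 1


def within_length_limits(primer_sites: list[tuple[int, int]],
                         amin: int = 300, amax: int = 800) -> bool:
    """Keep a primer pair if its average product length lies in [amin, amax]."""
    lengths = [product_length(start, end) for start, end in primer_sites]
    return amin <= mean(lengths) <= amax


# Hypothetical primer locations on three template sequences.
sites = [(101, 720), (98, 705), (110, 760)]
# Margins from the example above: -amin 550 -amax 750.
print(within_length_limits(sites, amin=550, amax=750))
```

Because the check is applied to the average, individual templates can still fall outside the limits, which is one more reason to leave a margin.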

`-ambiguous [number]`: The maximum number of ambiguous bases allowed in one
primer. The default value is `4`.
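A small sketch of such a limit, assuming that every character other than `A`, `C`, `G`, and `T` counts as an ambiguous base (for example the IUPAC codes `R`, `Y`, or `N`). The helper names are hypothetical and only the default of 4 is taken from this document.

```python
UNAMBIGUOUS = set("ACGT")


def count_ambiguous(primer: str) -> int:
    """Count bases that are not plain A, C, G, or T."""
    return sum(base not in UNAMBIGUOUS for base in primer.upper())


def passes_ambiguous(primer: str, max_ambiguous: int = 4) -> bool:
    """Hypothetical filter mirroring the documented default of -ambiguous (4)."""
    return count_ambiguous(primer) <= max_ambiguous


print(count_ambiguous("ACGTRYACGTACGTACGTAC"))   # contains R and Y
print(passes_ambiguous("ACGTRYACGTNNACGTWSAC"))  # contains R, Y, N, N, W, S
```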

`-mismatch [number]`: The maximum number of mismatched bases in a primer. This
option is used to remove primer candidates if the BLAST results show that
there are too many mismatches. The default value is `4`.
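How the mismatch and coverage thresholds could interact is sketched below. The exact calculation used by `OGU` is not described here, so this rests on two loudly stated assumptions: each template sequence contributes at most one BLAST hit per primer, and coverage is the fraction of templates whose hit has no more than the allowed number of mismatches. The names `primer_coverage` and `keep_candidate` are hypothetical.

```python
from typing import Optional


def primer_coverage(mismatches_per_template: list[Optional[int]],
                    max_mismatch: int = 4) -> float:
    """Fraction of templates with an acceptable hit.

    Each item is the mismatch count of the primer's hit on one template,
    or None if the primer did not hit that template at all (assumption).
    """
    if not mismatches_per_template:
        return 0.0
    acceptable = sum(m is not None and m <= max_mismatch
                     for m in mismatches_per_template)
    return acceptable / len(mismatches_per_template)


def keep_candidate(mismatches_per_template: list[Optional[int]],
                   min_coverage: float = 0.5, max_mismatch: int = 4) -> bool:
    """Hypothetical filter using the documented defaults of -coverage and -mismatch."""
    return primer_coverage(mismatches_per_template, max_mismatch) >= min_coverage


# Eight templates: two without a hit, one with too many mismatches.
hits = [0, 1, None, 5, 2, 0, None, 3]
print(primer_coverage(hits))
print(keep_candidate(hits))
```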

# Performance

For a taxon that is not very large and includes few fragments, `OGU`
can finish the task in *minutes*. For a large taxon (such as the Asteraceae
family or the whole order Poales) containing multiple fragments (such as the
chloroplast genomes), the time to complete may be one hour or more on a PC or
laptop.

`OGU` requires little memory (usually less than 0.5 GB, although BLAST may
require more for a large taxon) and few CPU cores (one core is enough). It can
run very well on a normal PC. Multiple CPU cores may be helpful for the
alignment and tree construction steps.

For Windows users, MAFFT [may be very slow due to antivirus
software](https://mafft.cbrc.jp/alignment/software/windows_without_cygwin.html).
Please consider following [these
instructions](https://mafft.cbrc.jp/alignment/software/ubuntu_on_windows.html)
to install Ubuntu on Windows to obtain better results.

# Citation

As yet unpublished.

# License

The software itself is licensed under
[AGPL-3.0](https://github.com/wpwupingwp/OGU/blob/master/LICENSE)
(**third-party software not included**).

# Q&A

Please submit your questions on the
[Issue](https://github.com/wpwupingwp/OGU/issues) page :smiley:

* Q: The first-time run is slow.

  A: OGU will automatically install third-party software (MAFFT/BLAST/IQTREE)
  from AWS on the first run. Microsoft Windows users, especially in some
  regions, may have a slow connection. Please be patient, or install the
  software manually. See [Initialization](#Initialization).
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "OGU",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Ping Wu <wpwupingwp@outlook.com>",
    "keywords": "plastid, mitochondria, organelle genome, analysis, Bioinformatics",
    "author": null,
    "author_email": "Ping Wu <wpwupingwp@outlook.com>",
    "download_url": "https://files.pythonhosted.org/packages/1b/ee/5b50002c05886bbccb03bb552b25c86e37dba6ff52f9c56bedf046c95602/ogu-1.52.tar.gz",
    "platform": null,
    "description": "[![PyPI version](https://badge.fury.io/py/OGU.svg)](https://badge.fury.io/py/OGU)\n\n# Quick start\n\n- Install Python 3 (3.9 or newer).\n- Open terminal, run\n\n   ```shell\n   # Install, using pip (recommended)\n   pip3 install OGU --user\n\n   # Initialize with Internet\n   # Windows\n   python -m OGU init\n   # Linux and macOS\n   python3 -m OGU init\n\n   # Run\n   # Windows\n   python -m OGU\n   # Linux and macOS\n   python3 -m OGU\n   ```\n\n# Table of Contents\n\n* [Quick start](#quickstart)\n* [Feature](#feature)\n* [Prerequisite](#prerequisite)\n    * [Hardware](#hardware)\n    * [Software](#software)\n* [Installation](#installation)\n    * [Portable](#portable)\n    * [Install with pip](#Installwithpip)\n    * [Install with conda](#Installwithconda)\n    * [Initialization](#Initialization)\n* [Usage](#usage)\n    * [Quick examples](#quick-examples)\n    * [Sequence ID](#sequence-id)\n    * [Command line](#commandline)\n* [Input](#input)\n* [Output](#output)\n* [Options](#options)\n    * [gb2fasta](#gb2fasta)\n    * [evaluate](#evaluate)\n    * [primer](#primer)\n* [Performance](#performance)\n* [Citation](#citation)\n* [License](#license)\n* [Q&A](q&a)\n\n# Features\n\n:heavy_check_mark: Automatically collect, organize and clean sequence data\nfrom NCBI GenBank or local: collect data with abundant options; extract CDS,\nintergenic spacer, or any other annotations from original sequence; remove\nredundant sequences according to species information; remove invalid or\nabnormal sequences/fragments; generate clean dataset with uniform sequence id.\n\n:heavy_check_mark: Evaluate variance of sequences by calculating nucleotide\ndiversity, observed resolution, Shannon index, tree resolution, phylogenetic\ndiversity (original and edited version), gap ratio, and others. Support\nsliding-window scanning.\n\n:heavy_check_mark: Design universal primer for the alignment. 
Support\nambiguous bases in primers.\n\n# Prerequisite\n\n## Hardware\n\n`Organelle Genome Utilities (OGU)` requires very few computational resources.\nA normal PC/laptop is enough. For downloading large amount of data, make sure\nthe Internet connection is stable and fast enough.\n\n## Software\n\nFor the portable version, nothing need to be installed manually.\n\nFor installing from pip, [Python](https://www.python.org/downloads/) is\nrequired. Notice that the python version should be higher than **3.6**.\n\n:white_check_mark: All third-party dependencies will be automatically\ninstalled with Internet, including `biopython`, `matplotlib`, `coloredlogs`,\n`numpy`, `primer3-py`, (python packages), and\n[MAFFT](https://mafft.cbrc.jp/alignment/software/),\n[IQTREE](http://www.iqtree.org/),\n[BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).\n\n# Installation\n\nWe assume that users have already installed\n[Python3](https://www.python.org/downloads/) (3.9 or above).\n\n## Install with pip\n\n1. Install [Python](https://www.python.org/downloads/). 3.9 or newer is\n   required.\n\n2. Open command line, run\n\n```shell\npip3 install OGU --user\n```\n\n## Initialization\n\nDuring the first running, `OGU` will check and initialize the\nrunning environment. Missing dependencies will be automatically installed.\n\nThis step requires Internet connection.\n\nBy default, the program will automatically finish initialization, if any error\noccurs, users can choose one of the following methods:\n\n### Automatic\n\nRun the following command.\n\n```shell\n# Windows\npython -m OGU init\n# Linux and macOS\npython3 -m OGU init\n```\n\n### Use prepared package\nAccording to your system, download related compressed file from [packages](https://github.com/wpwupingwp/OGU/releases).\n\nFor Windows users: \n1. paste `%HOMEDRIVE%%HOMEPATH%/` to explorer's address bar and open.\n2. create `.OGU` folder. Don't miss the dot.\n3. 
open `.OGU` folder, paste downloaded compressed file and unzip. Make sure after\ndecompress there are three folders in `.OGU`.\n\nFor Linux and macOS users, please download and unpack files into\n`~/.OGU`.\n\n### Manually install\n\nFor Linux users with root privileges, just use the package manager:\n\n```\n# Ubuntu and Debian\nsudo apt install mafft ncbi-blast+ iqtree\n# Fedora (1)\nsudo dnf install mafft ncbi-blast+ iqtree\n# Fedora (2)\nsudo yum install mafft ncbi-blast+ iqtree\n# ArchLinux\nsudo pacman -S mafft ncbi-blast+ iqtree\n# FreeBSD\nsudo pkg install mafft ncbi-blast+ iqtree\n```\n\nFor macOS users with root privileges, install `brew` if it has not been\ninstalled previously:\n\n```\n/usr/bin/ruby -e \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)\"\n```\n\nIf any errors occur, install `Xcode-select` and retry.\n\nThen:\n\n```\nbrew install blast mafft brewsci/science/iqtree\n```\n\nIf using Windows or lacking root privileges, users should follow these\ninstructions:\n\n1. BLAST+\n\n    * [Windows](https://www.ncbi.nlm.nih.gov/books/NBK52637/)\n    * [Linux and macOS](https://www.ncbi.nlm.nih.gov/books/NBK52640/)\n2. MAFFT\n\n    * [Windows](https://mafft.cbrc.jp/alignment/software/windows.html)\n\n      Choose \"All-in-one version\", download and unzip. Then follow the steps\n      in the BLAST+ installation manual to set the `PATH`.\n    * [Linux](https://mafft.cbrc.jp/alignment/software/linux.html)\n\n      Choose \"Portable package\", download and unzip. Then follow the\n      instructions of BLAST+ to set the `PATH` for `MAFFT`.\n    * [macOS](https://mafft.cbrc.jp/alignment/software/macosx.html)\n\n      Choose \"All-in-one version\", download and unzip. Then follow the steps\n      in the BLAST+ installation manual to set the `PATH`.\n3. IQ-TREE\n\n    * [Download](http://www.iqtree.org/#download)\n\n      Download the installer according to OS. 
Unzip and add the path of\n      subfolder `bin` to `PATH`.\n\n# Usage\n\n## Graphical user interface\nOpen the command line (Windows) or terminal (Linux and macOS),\nrun\n\n```bash\nOGU\n```\n\nor \n```bash\n# linux and macos\npython3 -m OGU\n# windows\npython -m OGU\n```\n\n## command line\nOnce a user opens the command line (Windows) or terminal (Linux and macOS), \njust type the command:\n\n```\n# Windows\npython -m OGU [input] -[options] -out [out_folder]\n# Linux and macOS\npython3 -m OGU [input] -[options] -out [out_folder]\n```\n\n## Quick examples\n\n1. Download all `rbcL` sequences of species in Poaceae family and do\n   pre-process.\n\n```\n# Windows\npython -m OGU.gb2fasta -gene rbcL -taxon Poaceae -out rbcL_Poaceae\n# Linux and macOS\npython3 -m OGU.gb2fasta -gene rbcL -taxon Poaceae -out rbcL_Poaceae\n```\n\n2. Download all ITS sequences of _Rosa_ genus. Do pre-process and keep redundant\n   sequences:\n\n```\n# Windows\npython -m OGU.gb2fasta -query internal transcribed spacer -taxon Rosa -out Rosa_its -uniq no\n# Linux and macOS\npython3 -m OGU.gb2fasta -query internal transcribed spacer -taxon Rosa -out Rosa_its -uniq no\n```\n\n3. Download all Lamiaceae chloroplast genomic sequences in the RefSeq database.\n   Then do pre-process and evaluation of variance (skip primer designing):\n\n```\n# Windows\npython -m OGU -og cp -refseq yes -taxon Lamiaceae -out Lamiaceae_cp\n# Linux and macOS\npython3 -m OGU -og cp -refseq yes -taxon Lamiaceae -out Lamiaceae_cp\n```\n\n4. Download sequences of _Zea mays_, set length between 100 bp and 3000 bp,\n   and then perform evaluation and primer designing. Note that the space in\n   the species name is replaced with underscore \"\\_\".\n\n```\n# Windows\npython -m OGU -taxon Zea_mays -min_len 100 -max_len 3000 -out Zea_mays -primer\n# Linux and macOS\npython3 -m OGU -taxon Zea_mays -min_len 100 -max_len 3000 -out Zea_mays -primer\n```\n\n5. 
Download all _Oryza_ mitochondria genomes in RefSeq database, keep the\n   longest sequence for each species and run a full analysis:\n\n```\n# Windows\npython -m OGU -taxon Oryza -og mt -min_len 50000 -max_len 200000 -uniq longest -out Oryza_cp -refseq yes -primer\n# Linux and macOS\npython3 -m OGU -taxon Oryza -og mt -min_len 50000 -max_len 200000 -uniq longest -out Oryza_cp -refseq yes -primer\n```\n\n## Sequence ID\n\n`Organelle Genome Utilities` uses a uniform sequence id format for input fasta files and all output\nsequences.\n\n```\nLocus|Kingdom|Phylum|Class|Order|Family|Genus|Species|Accession|SpecimenID_Isolate|Type\n# example\nrbcL|Viridiplantae|Streptophyta|Magnoliopsida|Poales|Poaceae|Oryza|longistaminata|MF998442|TAN:GB60B-2014|gene\n```\n\nThe order of the fields is fixed. The fields are separated by vertical bars\n(\"|\"). The space character (\" \") was disallowed and was replaced by an\nunderscore (\"\\_\"). Due to missing data, some fields may be empty.\n\n`Locus`: SeqName refers to the name of a sequence. Usually it is the gene\nname. For intergenic spacer, an underscore (\"\\_\") is used to connect two\ngene's names, e.g., \"geneA_geneB\".\n\nIf a valid sequence name cannot be found in the annotations of the GenBank\nfile, `Organelle Genome Utilities` will use \"Unknown\" instead.\n\nFor chloroplast genes, if \"-rename\" option is set, the program will try to use\nregular expressions to fix potential errors in gene names.\n\n`Kingdom`: The kingdom (_Fungi, Viridiplantae, Metazoa_) of a species. 
For\nconvenience, a superkingdom (_Bacteria, Archaea, Eukaryota, Viruses, Viroids_)\nmay be used if the kingdom information for a sequence is missing.\n\n`Phylum`: The phylum of the species.\n\n`Class`: The class of the species.\n\nBecause some species' classes are empty (for instance, basal angiosperm),\n`Organelle Genome Utilities` will guess the class of the species.\n\nGiven the taxonomy information in GenBank file:\n\n```\nEukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;\n    Spermatophyta; Magnoliophyta; basal Magnoliophyta; Amborellales;\n    Amborellaceae; Amborella.\n```\n\n`Organelle Genome Utilities` will use \"basal Magnoliophyta\" as the class because this\nexpression locates before the order name (\"Amborellales\").\n\n`Order`: The order name of the species.\n\n`Family`: The family name of the species.\n\n`Genus`: The genus name of the species, i.e., the first part of the scientific\nname.\n\n`Species`: The specific epithet of the species, i.e., the second part of the\nscientific name of the species. It may contain the subspecies' name.\n\n`Accession`: The GenBank Accession number for the sequence. It does not\ncontain the record's version.\n\n`SpecimenID` and `Isolate`: Specimen ID and Isolate ID of the sequence. May be empty.\n\n`Type`: Type of the sequence. It is usually \"gene\" or \"spacer\".\n\n## Command line\n\n:exclamation: In Linux and macOS, Python2 is `python2` and Python3 is\n`python3`. However, in Windows, Python3 is called `python`, too. 
Please\nnotice the difference.\n\n* Show help information of each module\n\n ```shell\n # Windows\n python -m OGU -h\n python -m OGU.gb2fasta -h\n python -m OGU.evaluate -h\n python -m OGU.primer -h\n # Linux and macOS\n python3 -m OGU.gb2fasta -h\n python3 -m OGU.evaluate -h\n python3 -m OGU.primer -h\n ```\n\n* Full process\n\n ```shell\n # Windows\n python -m OGU -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]\n # Linux and macOS\n python3 -m OGU -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]\n ```\n\n* Collect, convert, and clean GenBank data with gb2fasta module\n\n ```shell\n # Windows\n python -m OGU.GB2fasta -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]\n # Linux and macOS\n python3 -m OGU.gb2fasta -gene [gene name] -taxon [taxon name] -og [organelle type] -out [output name]\n ```\n\n* Evaluate variance of given fasta files\n\n ```shell\n # Windows\n python -m OGU.evaluate -fasta [fasta files]\n # Linux and macOS\n python3 -m OGU.evaluate -fasta [input file]\n ```\n\n* Design universal primers of given alignments.\n\n ```shell\n # Windows\n python -m OGU.primer -aln [alignment files]\n # Linux and macOS\n python3 -m OGU.primer -aln [alignment files]\n ```\n\n## Visualize\n \nHere are to jupyter notebooks for visualize analyze result as detailed circle\nfigure:\n  - `Visualize/draw_circle_plastid.ipynb`: for plastid genomes\n  - `Visualize/draw_circle_mitochondria.ipynb`: for mitochondria genomes\n\nSince users may want to customize the figure, we provide jupyter notebooks \ninstead of packaged code. Users can get the result following these steps.\n0. Run `pip install jupyterlab` to install jupyter notebooks\n1. Double click to open in jupyter notebook, Visual Studio Code or other IDEs you prefer.\n2. Edit `filename` to the Evaluation.csv you got\n3. Edit `gb_file` to extended gb file you got. Remember to generate it with \n`-out_debug` in OGU.gb2fasta\n4. 
If visualize plastid data, you need provide `LSC, SSC, IRa, IRb` lengths. Or\nyou can use default value, which is for *Tobacum*.\n5. Edit color themes as your wish\n6. Run all cells to output pdf figure\n \n# Input\n\n`Organelle Genome Utilities` accepts:\n\n1. GenBank queries. Users can use \"-query\" or combine with any other filters;\n2. GenBank-format files.\n3. Unaligned fasta files. Each file is considered as one locus when evaluating\n   the variance;\n4. Alignments (fasta format).\n\n# Output\n\nAll results will be put in the output folder. If the user does not set the\noutput path via \"-out\", `Organelle Genome Utilities` will create a folder labelled \"Result\".\n\nIn the output folder, several sub-folders will be created.\n\n* GenBank\n\n  Raw GenBank files.\n\n* Divide\n\n  Fasta files converted from the GenBank file. Each file represents a\n  fragment of the original sequence according to the annotation.\n\n  For instance, a record in a \"rbcL.gb\" file may also contain atpB gene's\n  sequences. The \"rbcL.fasta\" file does not contain any upstream/downstream\n  sequences and \"atpB_rbcL.fasta\" does not have even one base of the atpB or\n  rbcL gene, just the spacer (assuming the annotation is precise).\n\n  User can skip this dividing step with the option \"-no_divide\".\n* Fasta\n\n  Raw fasta files users provided.\n* Unique\n\n  Fasta files after removing redundant sequences.\n* Expanded_fasta\n\n  To design primers, `Organelle Genome Utilities` extend a sequence to its\n  upstream/downstream. Only used in the primer module.\n* Alignment\n\n  Aligned fasta files.\n\n  `.aln`: The aligned fasta files.\n\n  `.-consensus.fastq`: The fastq format of the consensus sequence of the\n  alignment. Note that it contains alignment gap (\"-\"). 
It is NOT\n  RECOMMENDED to be used directly because the consensus-generating algorithm is\n  optimised for primer design.\n* Evaluate\n\n  Including output files from the evaluation module.\n\n  `.pdf`: The PDF format of the figure containing the sliding-window scan\n  result of the alignment.\n\n  `.csv`: The CSV format file of the sliding-window scan result. `\"Index\"`\n  means the location of the base in the alignment.\n\n* Primer\n\n  Including output files from the primer module.\n\n  `.primer.fastq`: The fastq format file of a primer's sequence. It contains\n  two sequences, and the direction is 5' to 3'. The first is the forward\n  primer, and the second is the reverse primer. The quality of each base is\n  equal to its proportion of the column in the alignment. Note that the\n  sequence may contain ambiguous bases if it was not disabled.\n\n  `.primers.csv`: The list of primer pairs in CSV (comma-separated values\n  text) format.\n\n  `.candidate.fasta`: The candidate primers. This file may contain\n  thousands of records. Do not recommend paying attention to it.\n\n  `.candidate.fastq`: Again, the candidate primers. This time, each file has\n  the quality information that equals to the proportion of the bases in the\n  column of the alignment.\n\n* Temp\n\n  Including temporary files. Could be safely deleted .\n\nIn the output folder, there are some other important output files:\n\n* Primers.csv\n\n  The list of primer pairs in CSV (comma-separated values text) format.\n\n  Its title:\n    ```\n    Locus,Samples,Score,AvgProductLength,StdEV,MinProductLength,MaxProductLength,Coverage,Observed_Res,Tree_Res,PD_terminal,Entropy,LeftSeq,LeftTm,LeftAvgBitscore,LeftAvgMismatch,RightSeq,RightTm,RightAvgBitscore,RightAvgMismatch,DeltaTm,AlnStart,AlnEnd,AvgSeqStart,AvgSeqEnd\n    ```\n\n  `Locus`: The name of the locus/fragment.\n\n  `Samples`: The number of sequences used to find this pair of primers.\n\n  `Score`: The score of this pair of primers. 
Usually the higher, the better.\n\n  `AvgProductLength`: The average length of the DNA fragment amplified by\n  this pair of primers.\n\n  `StdEV`: The standard deviation of the AvgProductLength. A higher number\n  means the primer may amplify different lengths of DNA fragments.\n\n  `MinProductLength`: The minimum length of an amplified fragment.\n\n  `MaxProductLength`: The maximum length of an amplified fragment. Note that\n  all of these fields are calculated using given sequences.\n\n  `Coverage`: The coverage of this pair of primers over the sequences it\n  used. Calculated with the BLAST result. High coverage means that the pair\n  is much more \"universal\".\n\n  `Observed_Res`: The `observed resolution` of the sub-alignment sliced by\n  the primer pair, which is equal to the number of unique sequences divided\n  by the number of total sequences. The value is between 0 and 1.\n\n    <img src=\"https://latex.codecogs.com/svg.latex?\\dpi{300}&space;R_{o}=\\frac{n_{uniq}}{n_{total}}\" title=\"R_{o}=\\frac{n_{uniq}}{n_{total}}\" />\n\n  `Tree_Res`: The `tree resolution` of the sub-alignment, which is equal to\n  the number of internal nodes on a phylogenetic tree (constructed from the\n  alignment) divided by number of terminal nodes. The value is between 0 and\n    1.\n\n    <img src=\"https://latex.codecogs.com/svg.latex?\\dpi{300}&space;R_{T}=\\frac{n_{internal}}{n_{terminal}}\" title=\"R_{T}=\\frac{n_{internal}}{n_{terminal}}\" />\n\n  `PD_terminal`: The average of the terminal branch's length. It's an edited\n  version of the `Phylogenetic Diversity` for DNA barcoding evaluation.\n\n  `Entropy`: The `Shannon equitability` index of the sub-alignment. 
The value\n  is between 0 and 1.\n\n    <img src=\"https://latex.codecogs.com/svg.latex?\\dpi{300}&space;E_{H}&space;=&space;\\frac{-&space;\\sum_{i=1}^{k}{p_{i}&space;\\log(p_{i})}}{\\log(k)}\" title=\"E_{H} = \\frac{- \\sum_{i=1}^{k}{p_{i} \\log(p_{i})}}{\\log(k)}\" />\n\n  `LeftSeq`: Sequence of the forward primer. The direction is 5' to 3'.\n\n  `LeftTm`: The melting temperature of the forward primer. The unit is\n  degree Celsius (\u00b0C).\n\n  `LeftAvgBitscore`: The average raw bitscore of the forward primer, which\n  is calculated by BLAST.\n\n  `LeftAvgMismatch`: The average number of mismatched bases of the forward\n  primer, as counted by BLAST.\n\n  `RightSeq`: Sequence of reverse primer. The direction is 5' to 3'.\n\n  `RightTm`: The melting temperature of the reverse primer. The unit is\n  degrees Celsius (\u00b0C).\n\n  `RightAvgBitscore`: The average raw bitscore of the reverse primer, which\n  is calculated by BLAST.\n\n  `RightAvgMismatch`: The average number of mismatched bases of the reverse\n  primer, as counted by BLAST.\n\n  `DeltaTm`: The difference in the melting temperatures of the forward and\n  reverse primers. A pair of primers with a high DeltaTm may result in\n  failure during the PCR experiment.\n\n  `AlnStart`: The location of the beginning of the forward primer (5',\n  leftmost of primer pairs) in the entire alignment.\n\n  `AlnEnd`: The location of the end of the reverse primer (5', rightmost of\n  primer pairs) in the entire alignment.\n\n  `AvgSeqStart`: The average beginning of the forward primer in the original\n  sequences.  *ONLY USED FOR DEBUG*.\n\n  `AvgSeqEnd`: The average end of the forward primer in the original\n  sequences.  *ONLY USED FOR DEBUG*.\n\n  The primer pairs are sorted by `Score`. Since the score may not fully\n  satisfy the user's specific considerations, it is suggested that primer\n  pairs be chosen manually if the first primer pair fails during the PCR\n  experiment.\n\n* Log.txt\n\n  The log file. 
Contains all the information printed on the screen.\n\n* Evaluation.csv\n\n  The summary of all loci/fragments, which only contains the variance\n  information for each fragment. One of the new field, `GapRatio`, means the\n  ratio of the gap (\"-\") in the alignment. `PD` means the original version\n  of the phylogenetic diversity and `PD_stem` means an alternative version\n  of it which only calculate the length of the stem branch in the\n  phylogenetic tree.\n\n# Options\n\nHere are some general options for the program and submodule:\n\n`-h`: Prints help messages of the program or one of the module.\n\n`-gb [filename]`: User-provided GenBank file or files. Could be one or more\nfiles that separated by space.\n\nFor instance,\n\n```\n# one file\n-gb sequence.gb\n# multiple files\n-gb matK.gb rbcL.gb Oryza.gb Homo_sapiens.gb\n```\n\n`-fasta [filename]`: User-provided unaligned fasta files. Could be one or\nmultiple.\n\n`-aln [filename]`: Alignment files that the user provides. Could be one or\nmultiple.\n\nIt only supports the fasta format. Ambiguous bases and gaps (\"-\") are supported.\n\n`-out [folder name]`: The output folder's name. All results will be put into\nthe output folder. If the user does not set an output path via \"-out\",\n`Organelle Genome Utilities` will create a folder named \"Result\".\n\n`OGU` does not overwrite the existing folder with the same name.\n\nIt is HIGHLY RECOMMENDED to use only letters, numbers and underscores (\"\\_\") in\nthe folder name to avoid mysterious errors caused by other Unicode characters.\n\nOptions below are for specific modules.\n\n## gb2fasta\n\n### Query\n\nOptions used for querying NCBI GenBank.\n\n`-taxon [taxonomy name]`: The taxonomy name. It could be any taxonomic rank\nfrom kingdom (same as \"-group\") to species, as long as the user inputs correct\nname (the scientific name of species or taxonomic group in latin, NOT\nENGLISH). It will restrict the query to the targeted taxonomy unit. 
Make sure\nto use quotation marks if `taxonomy` has more than one word or use underscore\nto replace space, for instance `\"Zea mays\"` or `Zea_mays`.\n\n`-gene [gene name]`: The gene's name which the user wants to query in GenBank.\nIf the user wants to use logical expressions like \"OR\", \"AND\", \"NOT\", s/he\nshould use \"-query\" instead. If there is space in the gene's name, make sure\nto use quotation marks.\n\nNote that \"ITS\" is not a gene name--it is \"internal transcribed spacer\".\n\nSometimes \"-gene\" options may bring in unwanted sequences. For example, if a\nuser queries \"rbcL[gene]\" in GenBank, spacer sequences may contain _rbcL_ or\n_rbcL_'s upstream/downstream gene, such as \"atpB_rbcL spacer\" or _atpB_.\n\n`-og [ignore|both|no|mt|mitochondrion|cp|chloroplast|pl|plastid]`: Query\norganelle sequences or not. The default value is `ignore`.\n\n    - `ignore`: do not consider organelle type, same as GenBank website's\n      default setting.\n\n    - `both`: only query organelle sequences, including both plastid and\n      mitochondrion.\n\n    - `no`: exclude organelle sequences from the query.\n\n    - `cp` or `chloroplast` or `pl` or `plastid`: only query plastid sequences\n\n    - `mt` or `mitochondrion`: only query mitochondrion sequences.\n\n`-refseq [both|yes|no]`: query in RefSeq database or not. The default value is\n`both`.\n\n    - `both`: query all sequences in or not in RefSeq database, same as NCBI\n      website's default setting.\n\n    - `yes`: only query sequences in RefSeq database.\n\n    - `no`: exclude sequences in RefSeq database.\n\n[RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) is considered to have\nhigher sequence and annotation quality than GenBank. This option could be used\nfor getting nuclear/organelle genomes from NCBI. In this situation (`-refseq\nyes`), the length limit will be removed automatically.\n\n`-count [number]`: Restrict numbers of sequences to be downloaded. 
The default\nvalue `0` means no restriction.\n\n`-min_len [length]`: The minimum length of the records downloaded from\nGenBank. The default value is `100` (bp). The number must be an integer.\n\n`-max_len [length]`: The maximum length of the records downloaded from\nGenBank. The default value is `10000` (bp). The number must be an integer.\n\n`-date_start [yyyy/mm/dd]`: The beginning of the release data range of the\nsequences, the format is yyyy/mm/dd.\n\n`-date_end [yyyy/mm/dd]`: The end of the release data range of the sequences,\nthe format is yyyy/mm/dd.\n\n`-molecular [all|DNA|RNA]`: The molecular type,\nwhich could be DNA or RNA. The\ndefault is `all`--no restriction.\n\n`-email [email address]`: NCBI GenBank database requires users to provide\nan email address in case of abnormal situations that NCBI need to contact\nthe user. For convenience, `OGU` will use\n\"guest@example.com\" if the user does not provide an email address. _However_,\nit is better to provide a real email address for potential contact.\n\n`-query [expression]`: The query string provided by the user. It behaves in\nthe same manner as the query the user typed into the Search Box in NCBI\nGenBank's webpage.\n\nMake sure to follow NCBI's grammar for queries. It can contain several words.\nRemember to add quotation marks if an item contains more than one words, for\ninstance, `\"Homo sapiens\"[organism]`, or use underscore to replace space,\n`Homo_sapiens[organism]`.\n\n`-exclude [expression]`: Use this option to use negative option. 
For instance,\n\"-exclude Zea [organism]\" (do not include quotation marks) will add \" NOT\n(Zea[organism])\" to the query.\n\nThis option can be useful for excluding a specific taxon.\n\n```\n-taxon Zea -exclude \"Zea mays\"[organism]\n```\n\nThis will query all records in genus *Zea* while records of *Zea mays* will be\nexcluded.\n\nFor much more complex exclude options, please consider to use \"Advance search\"\nin GenBank website.\n\n`-group [all|animals|plants|fungi|protists|bacteria|archaea|viruses]`: To\nrestrict the query in given group. The default value is `all`--no\nrestriction.\n\nIt is reported that the \"group\" filter may return abnormal records, for\ninstance, return plants' records when the group is \"animal\" and the\n\"organelle\" is \"chloroplast\". Furthermore, it may match a great number of\nrecords in GenBank. Hence, we strongly recommend using \"-taxon\" instead.\n\n### Divide\n\nOptions used for converting GenBank files to fasta files.\n\n`-out_debug`: If you are going to use visualize pipeline to draw detailed circle\nfigure, use this option to generate extended version genbank file.\n\n`-no_divide`: If set, it will analyse the whole sequence instead of the\ndivided fragments. By default, `OGU` divides one GenBank record into\nseveral fragments according to its annotation.\n\n`-rename`: If set, the program will try to rename genes. For instance, \"rbcl\"\nwill be renamed to \"rbcL\", and \"tRNA UAC\" will be renamed to \"trnVuac\", which\nconsists of \"trn\", the amino acid's letter and transcribed codon. 
This may be\nhelpful if the annotation has nonstandard uppercase/lowercase or naming format\nthat it can merge the same sequences to one file for the same locus having\nvariant names.\n\nIf using Windows operating system, consider using this option to avoid\ncontradictory filenames.\n\n`-unique [longest|first|no]`: The method used to remove redundant sequences.\n`OGU` will remove redundant sequences to ensure only one sequence per\nspecies by default. A user can change its behaviour by setting different\nmethods.\n\n    - `first`: According to the records' order in the original GenBank file,\n      only the first sequence of the same species' same locus will be kept.\n      Others will be ignored directly. This is the default option due to\n      performance considerations.\n\n    - `longest`: Keep the longest sequence for one species. The program will\n      compare the sequence's length from the same species' same locus.\n\n    - `no`: Skip this step. All sequences will be kept.\n\n`-allow_mosaic_spacer`: If one gene nested with another gene, normally they\ndo not have spacers. The default value is `False`.\n\nHowever, some users want the fragments between two gene's beginnings and ends.\nThis option is for this specific purpose (e.g., matK-trnK_UUU). For normal\nusage, *do not recommend*.\n\n`-expand [number]`: The expansion length in upstream/downstream. If set,\n`OGU` will expand the sequence to its upstream/downstream after the\ndividing step to find primer candidates. The default value is `0`.\n\nNote that this option is different with \"-max_len\". This option limits the\nlength of one annotation's sequence. 
The \"-max_len\" limits the whole\nsequence's length of one GenBank record.\n\n`-allow_repeat`: If genes repeated in downstream, this option will allow the\nrepeat region to be extracted, otherwise any repeated region will be omitted.\nThe default value is `False`.\n\n`-allow_invert_repeat`: If two genes invert-repeated in downstream, this\noption will allow spacers of them to be extracted, otherwise the spacer\nwill be omitted. The default value is `False`.\n\nFor instance, geneA-geneB located in one invert-repeat region (IR) of\nchloroplast genome. In another IR region, there are geneB-geneA. This option\nwill extract sequences of two different direction as two unique spacers.\n\n`-max_name_len [number]`: The maximum length of a feature name. Some\nannotation's feature name in GenBank file is too long, and usually, they are\nnot the target sequence the user wants. By setting this option, `OGU`\nwill truncate the annotation's feature name if it is too long. By default, the\nvalue is `50`.\n\n`-max_gene_len [value]`: The maximum length of a sequence for one annotation.\nSome annotations' sequences are too long (for instance, one gene has two\nexons, and its intron is longer than 10 Kb). This option will skip those long\nsequences. By default, the value is `20000` (bp).\n\n## Evaluate\n\n`-ig` or `-ignore_gap`: ignore gaps in the alignment.\n\n`-iab` or `-ignore_ambigous`: ignore ambiguous bases in the alignment.\n\n`-quick`: skip sliding-window scan.\n\n`-size [number]`: the window size of the sliding window scan. The default\nvalue is `500`.\n\n`-step [number]`: the step size of the sliding window scan. The default value\nis `50`.\n\n`-skip_primer`: skip primer designing. The default value is `False`.\n\n## Primer design\n\n`-coverage [value]`: The minimum coverage of the base and primer. The default\nvalue is `0.5` (50%). It is used to remove primer candidates if its coverage\namong all sequences is smaller than the threshold. 
The coverage of primers is\ncalculated by BLAST.\n\n`-res [value]`: The minimum *observed resolution* of the fragments or primer\npairs. The default *value* is 0.3 (30%). The value should be between 0.0 and\n1.0.\n\n`OGU` uses the *observed resolution* instead of other measures because it is\nfast to compute. Also, it is considered to be the lower bound of the real\nresolution: a fragment with a low *observed resolution* may not have a\nsatisfactory tree resolution/phylogenetic diversity, either.\n\n`-pmin [length]`: The minimum length of the primer. The default *value* is 20.\n\n`-pmax [length]`: The maximum length of the primer. The default *value* is 25.\n\n`-topn [number]`: How many pairs of primers are kept for each input alignment.\nThe default value is `1`, i.e., only keep the _best_ primer pair according to\nits `score`. To keep more pairs, set \"-topn\" to more than 1.\n\n`-amin [length]`: The minimum amplified length (including primers). The\ndefault value is `300` (bp). Note this limits the PCR product's length instead\nof the sub-alignment's length.\n\n`-amax [length]`: The maximum amplified length (including primers). The\ndefault value is `800` (bp).\n\nThe \"-amin\" and \"-amax\" options are used to screen primer candidates. `OGU`\nuses BLAST results to locate the primers on each template sequence and\ncalculates the average length of the products. The same locus may have\ndifferent lengths in different species, and the alignment is stretched by the\ngaps added during aligning, so please consider adding some *margins* for\nthese two options.\n\nFor instance, if a user wants the amplified length to be smaller than 800 and\ngreater than 500, s/he could consider setting \"-amin\" to 550 and \"-amax\" to\n750.\n\n`-ambiguous [number]`: The maximum number of ambiguous bases allowed in one\nprimer. The default value is `4`.\n\n`-mismatch [number]`: The maximum number of mismatched bases in a primer. 
This\noption is used to remove primer candidates if the BLAST results show that\nthere are too many mismatches. The default value is `4`.\n\n# Performance\n\nFor a taxon that is not very large and includes few fragments, `OGU`\ncan finish the task in *minutes*. For a large taxon (such as the Asteraceae\nfamily or the whole order Poales) containing multiple fragments (such\nas the chloroplast genomes), the time to complete may be one hour or more on a\nPC or laptop.\n\n`OGU` requires little memory (usually less than 0.5 GB, although for a\nlarge taxon BLAST may require more) and few CPUs (one core is enough). It can\nrun very well on a normal PC. Multiple CPU cores may be helpful for the\nalignment and tree construction steps.\n\nFor Windows users, MAFFT [may be very slow due to antivirus\nsoftware](https://mafft.cbrc.jp/alignment/software/windows_without_cygwin.html).\nPlease consider\nfollowing [these instructions](https://mafft.cbrc.jp/alignment/software/ubuntu_on_windows.html) to\ninstall\nUbuntu on Windows to obtain better results.\n\n# Citation\n\nAs yet unpublished.\n\n# License\n\nThe software itself is licensed under\n[AGPL-3.0](https://github.com/wpwupingwp/OGU/blob/master/LICENSE) (**not including\nthird-party\nsoftware**).\n\n# Q&A\n\nPlease submit your questions on the\n[Issue](https://github.com/wpwupingwp/OGU/issues) page :smiley:\n\n* Q: The first-time run is slow.\n\n  A: OGU will automatically install third-party software (MAFFT/BLAST/IQTREE)\n  from AWS on the first run. Microsoft Windows users, especially in some\n  regions, may have a slow connection. Please be patient, or you can manually\n  install them. See [Initialization](#Initialization).",
    "bugtrack_url": null,
    "license": "AGPL-3.0-or-later",
    "summary": "Organelle Genome Utilities",
    "version": "1.52",
    "project_urls": {
        "Bug tracker": "https://github.com/wpwupingwp/OGU/issues",
        "Homepage": "https://github.com/wpwupingwp/OGU"
    },
    "split_keywords": [
        "plastid",
        " mitochondria",
        " organelle genome",
        " analysis",
        " bioinformatics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4cca76968d51fac5370a6ab1576b139ec0abb288486002c04f4b5713bee818bb",
                "md5": "79dde56acd336fbee24e1f1c879dede4",
                "sha256": "d28a0efaffb2ba306882105ec9bb2add4d06c1c80e224dbb8c6350cc8510c1a4"
            },
            "downloads": -1,
            "filename": "ogu-1.52-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "79dde56acd336fbee24e1f1c879dede4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 88999,
            "upload_time": "2024-04-23T12:03:39",
            "upload_time_iso_8601": "2024-04-23T12:03:39.930498Z",
            "url": "https://files.pythonhosted.org/packages/4c/ca/76968d51fac5370a6ab1576b139ec0abb288486002c04f4b5713bee818bb/ogu-1.52-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1bee5b50002c05886bbccb03bb552b25c86e37dba6ff52f9c56bedf046c95602",
                "md5": "eb3c900b5e5a0657b095b3c7f5ca9acb",
                "sha256": "7400447a94005f5e1e16bdd474586ef0b747e07e2cfe5becd922395fcc2c4e6e"
            },
            "downloads": -1,
            "filename": "ogu-1.52.tar.gz",
            "has_sig": false,
            "md5_digest": "eb3c900b5e5a0657b095b3c7f5ca9acb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 96571,
            "upload_time": "2024-04-23T12:03:41",
            "upload_time_iso_8601": "2024-04-23T12:03:41.811374Z",
            "url": "https://files.pythonhosted.org/packages/1b/ee/5b50002c05886bbccb03bb552b25c86e37dba6ff52f9c56bedf046c95602/ogu-1.52.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-23 12:03:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wpwupingwp",
    "github_project": "OGU",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "biopython",
            "specs": [
                [
                    "==",
                    "1.82"
                ]
            ]
        },
        {
            "name": "certifi",
            "specs": [
                [
                    ">=",
                    "2023.11.17"
                ]
            ]
        },
        {
            "name": "coloredlogs",
            "specs": [
                [
                    ">=",
                    "10.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21.6"
                ]
            ]
        },
        {
            "name": "primer3-py",
            "specs": [
                [
                    "==",
                    "2.0.1"
                ]
            ]
        }
    ],
    "lcname": "ogu"
}
        