ncbi-submit

Name	ncbi-submit
Version	0.8.1
home_page	https://github.com/enviro-lab/ncbi-submit
Summary	A tool for submitting to NCBI (SRA, BioSample, & GenBank).
upload_time	2024-03-05 14:15:58
maintainer	Sam Kunkleman
docs_url	None
author	Sam Kunkleman
requires_python	>=3.8
license	MIT
keywords	ncbi submission upload

# ncbi-submit
Submitting data to public databases is important for publicly funded laboratories, but it is not always a quick or intuitive process. `ncbi-submit` provides a simple and repeatable way to make programmatic submissions to NCBI's SRA and GenBank with shared or unique BioProjects and BioSamples. Data can be uploaded as XML or zip files to either the Test or Production environment. Once uploaded, the reports produced by NCBI can be analyzed to check on submission status and get BioSample accessions.

- [ncbi-submit](#ncbi-submit)
  - [Installation:](#installation)
  - [Testing](#testing)
  - [Usage](#usage)
    - [Setup](#setup)
    - [How to create a BioProject accession](#how-to-create-a-bioproject-accession)
    - [File Preparation](#file-preparation)
      - [Python instantiation (not needed on command line):](#python-instantiation-not-needed-on-command-line)
    - [File Submission](#file-submission)
    - [GenBank submission](#genbank-submission)
    - [Check Submission Status](#check-submission-status)
    - [How to get accessions (BioSample, SRA)](#how-to-get-accessions-biosample-sra)
      - [Get accessions by downloading report.xml files](#get-accessions-by-downloading-reportxml-files)
      - [Get accessions from list of report.xml files](#get-accessions-from-list-of-reportxml-files)
  - [Updating samples that have already been submitted](#updating-samples-that-have-already-been-submitted)
    - [Fastq read updates](#fastq-read-updates)
    - [Other metadata updates](#other-metadata-updates)
  - [Input File Paths Explained](#input-file-paths-explained)
    - [Required Files](#required-files)
    - [Optional Files](#optional-files)
    - [Sometimes Required Paths](#sometimes-required-paths)
  - [Links to xml template examples/schema:](#links-to-xml-template-examplesschema)


***
## Installation:
To install from PyPI in a virtual environment `.venv`:
```bash
python3 -m venv .venv
. .venv/bin/activate
pip install ncbi-submit
```
To install from conda (not yet set up) in a new environment `ncbi`:
```bash
conda create -n ncbi ncbi-submit
```

***
## Testing
Add NCBI credentials to file `./.login_credentials` or edit them in either:
* `./example/test.sh` or
* `./config/config.py`

To test creating all example files, run:
```bash
./example/test.sh
```
This script could also be a good starting point for your own NCBI submission pipelines. Note: It contains several blocks of code that can be commented in or out as needed.

***
## Usage

`ncbi_submit` is intended for use on the command line, but the class `ncbi_submit.NCBI` can be imported and used within custom Python scripts.

There are three main actions the script can do:
* `file_prep`: 
  * Prepares .tsv & .xml files for SRA, BioSample, & BioProject submissions
  * Used to prepare all files for initial submission to NCBI
  * To add in BioSample accessions and prepare for GenBank submission, include the flag `--prep_genbank`:
    * Prepares .zip, .sbt, & .tsv files for GenBank Submission
    * Used to add BioSample accessions from a BioSample submission for a GenBank submission
* `ftp` submission or checkup:
  * Interacts with NCBI's ftp host to do one of the following:
    * `submit` data to NCBI databases 
    * `check` on previous ftp submissions
    * `get-accessions` from all previous ftp submissions
* `example`:
  * Writes out example files for one or both of:
    * config.py file (tells `ncbi_submit` lots of important info)
    * template.sbt (used for genbank submission)

### Setup
The required parameters vary by which of the above actions you're attempting, but at minimum a `plate` and an `outdir` are required. To limit the number of parameters required on the command line, a `config` file must be used. When running from the command line, one of the three actions (`file_prep`, `ftp`, or `example`) must be specified. In Python, these correspond to methods you can call on a single `NCBI` object.

Run this command to get an example `config.py` file in a directory called './ncbi':
```bash
ncbi_submit example --config --outdir "./ncbi"
```

### How to create a BioProject accession
A BioProject accession can be created in NCBI's submission portal, but it can also be created by ncbi-submit either as part of a BioSample/SRA submission or all by itself.

Steps for creating a new BioProject accession via ncbi-submit:
1. In your `config.py` file, set `bioproject['create_new'] = True`
2. Follow the below [file preparation](#file-preparation) advice
3. Follow the below [file submission](#file-submission) advice, but if you're only creating a BioProject and don't want to submit any other data, you can omit the `--fastq_dir` and `--plate` options and specify a `--subdir` instead (as the name of the directory to be used in NCBI's ftp site)
4. Once you have results, add the new accession to your `config.py` file at `bioproject['bioproject_accession']` and set `bioproject['create_new'] = False`
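
As a rough illustration of steps 1 and 4, the sketch below shows how the `bioproject` settings might change before and after NCBI assigns an accession. Only the two keys `create_new` and `bioproject_accession` come from the steps above. The `None` placeholder, the helper `record_new_accession`, and the accession `PRJNA000000` are assumptions made for illustration, and a real `config.py` contains more settings than this.

```python
# Hypothetical excerpt of config.py. Only the two keys below are named in the
# README; the None placeholder is an assumption, not a documented default.
bioproject = {
    "create_new": True,            # step 1: ask ncbi-submit to create a BioProject
    "bioproject_accession": None,  # not known until NCBI reports it
}


def record_new_accession(settings: dict, accession: str) -> dict:
    """Return an updated copy reflecting step 4 (accession known, stop creating)."""
    updated = dict(settings)
    updated["bioproject_accession"] = accession
    updated["create_new"] = False
    return updated


# "PRJNA000000" is a placeholder, not a real accession.
bioproject_after = record_new_accession(bioproject, "PRJNA000000")
print(bioproject_after)
```

In practice you would simply edit the two values in `config.py` by hand; the helper only makes the before/after states explicit.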

### File Preparation
#### Python instantiation (not needed on command line):
Note: This is the minimum required info for preparing data. Other parameters may be necessary for more functionality or other tasks.  
```python
from ncbi_submit import ncbi_submit
ncbi = ncbi_submit.NCBI(
    fastq_dir = myFastqDir,
    seq_report = mySeqReport,
    plate = myPlate,
    outdir = myOutdir,
    config_file = myConfig,
    )
ncbi.write_presubmission_metadata()
```

Shell:
```bash
ncbi_submit file_prep \
    --test_mode --test_dir \
    --config "${NCBI_CONFIG}" \
    --seq_report "${SEQ_REPORT}" \
    --primer_map "${PRIMER_MAP}" \
    --primer_scheme "${SCHEME_VERSION}" \
    --outdir "${NCBI_DIR}" \
    --gisaid_log "${GENERIC_GISAID_LOG//PLATE/$PLATE}" \
    --fastq_dir "${FASTQS}" \
    --plate "${PLATE}"
```
Python:
```python
ncbi.write_presubmission_metadata()
```
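
The shell example uses bash substitution (`${GENERIC_GISAID_LOG//PLATE/$PLATE}`) to turn a generic path containing the literal text `PLATE` into a plate-specific path. If you drive `ncbi-submit` from Python instead, the same expansion can be sketched as below. The function name `expand_plate` and the example path are hypothetical and are not part of `ncbi-submit`.

```python
def expand_plate(template: str, plate: str) -> str:
    """Replace every literal 'PLATE' in the template, like bash ${VAR//PLATE/$PLATE}."""
    return template.replace("PLATE", plate)


# Hypothetical generic path; substitute whatever layout your lab uses.
generic_gisaid_log = "/data/PLATE/gisaid/PLATE_upload.log"
print(expand_plate(generic_gisaid_log, "plate42"))
```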

### File Submission
NOTE: Once you're ready, you can drop the `--test_mode` and `--test_dir` flags.

Shell:
```bash
# if submitting to BioSample and SRA (and if creating a new BioProject):
ncbi_submit ftp submit \
    --db bs_sra \
    --test_mode --test_dir \
    --config "${NCBI_CONFIG}" \
    --outdir "${NCBI_DIR}" \
    --fastq_dir "${FASTQS}"

# if only creating a new BioProject:
ncbi_submit ftp submit \
    --db 'bp' \
    --test_mode --test_dir \
    --config "${NCBI_CONFIG}" \
    --subdir "${NCBI_SUBDIR}" \
    --outdir "${NCBI_DIR}" 

# wait a while and try this to download reports and view submission status
ncbi_submit ftp check \
    --plate "${PLATE}" \
    --db bs_sra \
    --test_mode --test_dir \
    --config "${NCBI_CONFIG}" \
    --outdir "${NCBI_DIR}" 
```
Python:
```python
# if submitting to BioSample and SRA (and if creating a new BioProject):
ncbi.submit(db="bs_sra")
# if only creating a new BioProject:
ncbi.submit(db="bp")

# wait a while and try this to download reports and view submission status
ncbi.check(db="bs_sra")
```

### GenBank submission
(NOTE: not fully tested)

To link your fasta in GenBank to the associated reads, you'll want to add in the BioSample accessions before submitting.
* Acquire BioSample accessions via one of these methods:
  * download accessions.tsv file from NCBI and then use `ncbi_submit`
    * (Do this if you submitted to BioSample via NCBI's Submission Portal)
  * use `ncbi_submit` for everything
    * (Do this to avoid manual uploads via NCBI's Submission Portal)

Shell:
```bash
# download report.xml files to get accessions from
ncbi_submit ftp check \
    --db "${DB}" \
    --outdir "${NCBI_DIR}" \
    --config "${NCBI_CONFIG}" \
    -u "${ncbi_username}" \
    -p "${ncbi_password}" \
    --plate "${PLATE}" \
    --fastq_dir "${FASTQS}"

# add accessions to genbank.tsv
ncbi_submit file_prep --prep_genbank \
    --outdir "${NCBI_DIR}" \
    --config "${NCBI_CONFIG}" \
    --fasta "${GENERIC_CONSENSUS//PLATE/$PLATE}" \
    --plate "${PLATE}"

# submit to GenBank (NOTE: db='gb')
ncbi_submit ftp submit \
    --db gb \
    --test_mode --test_dir \
    --config "${NCBI_CONFIG}" \
    --outdir "${NCBI_DIR}" \
    --fastq_dir "${FASTQS}"
```
Python:
```python
# download report.xml files to get accessions
ncbi.check(db="bs_sra")
# prepare genbank submission files and submit
ncbi.submit(db="gb")

## or

# files can also be prepared without submitting via:
ncbi.write_genbank_submission_zip()
```

***
### Check Submission Status
Wait a while (10+ minutes) for NCBI to start processing the submission. Then run this to download reports and view submission status.
This works for whichever db you want to check on. If not specified, you'll get results on all submitted dbs.

Shell:
```bash
# check GenBank submission status (NOTE: db='gb')
ncbi_submit ftp check \
    --db gb \
    --test_mode --test_dir \
    --config "${NCBI_CONFIG}" \
    --outdir "${NCBI_DIR}"
```
Python:
```python
# check GenBank submission status (NOTE: db='gb')
ncbi.check(db='gb')
```

### How to get accessions (BioSample, SRA)
To acquire the accessions for all samples submitted via ftp under your group's account, `ncbi_submit` can download all xml report files and parse out the accession details. A directory will be created in `outdir` containing all submission-specific directories, each containing its report files.

The `-f` or `--files` flag accepts a list of report files to parse. If provided, those files will be parsed for accession details rather than downloading the latest report files.

NCBI only stores uploads for a certain amount of time, so accessions found in newly downloaded reports are combined with those from previously downloaded report files to get the most complete picture. This means it's important that you run `ncbi_submit ftp check` after each submission has been processed to ensure accurate results.

The database can be specified to indicate which accessions are desired. The results are written as a csv (for the BioProject associated with your current `config` file) at `<outdir>/accessions_<bioproject>.csv` with the following fields:

| database | fields |
|-|-|
| 'bs_sra' | sample_name, BioSample, SRA |
| 'bs' | sample_name, BioSample |
| 'sra' | sample_name, SRA |
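
Once the csv exists, it can be read with Python's standard `csv` module. The sketch below assumes the `bs_sra` column layout from the table above and uses made-up accession values in an in-memory file so that it runs on its own. The helper name `load_accessions` is hypothetical and is not part of `ncbi-submit`.

```python
import csv
import io

# Stand-in for <outdir>/accessions_<bioproject>.csv; all values are made up.
example_csv = io.StringIO(
    "sample_name,BioSample,SRA\n"
    "samp1,SAMN00000001,SRR00000001\n"
    "samp2,SAMN00000002,SRR00000002\n"
)


def load_accessions(handle):
    """Map each sample_name to a dict of its remaining columns."""
    reader = csv.DictReader(handle)
    return {
        row["sample_name"]: {k: v for k, v in row.items() if k != "sample_name"}
        for row in reader
    }


accessions = load_accessions(example_csv)
print(accessions["samp1"]["BioSample"])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object. Because the helper keeps whichever columns are present, it should also work for the `bs` and `sra` layouts.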

#### Get accessions by downloading report.xml files
Shell:
```bash
ncbi_submit ftp get-accessions \
    --db "bs_sra" \
    --config "${NCBI_CONFIG}" \
    --outdir "${REPORT_DIR}" \
    -u "${ncbi_username}" \
    -p "${ncbi_password}"
```
Python:
```python
ncbi.get_all_accessions(db="bs_sra")
```

#### Get accessions from list of report.xml files
Shell:
```bash
ncbi_submit ftp get-accessions \
    --db "bs_sra" \
    --config "${NCBI_CONFIG}" \
    --outdir "${REPORT_DIR}" \
    -u "${ncbi_username}" \
    -p "${ncbi_password}" \
    -f s1/report.xml s2/report.xml
```
Python:
```python
ncbi.get_all_accessions(db="bs_sra", report_files=["s1/report.xml", "s2/report.xml"])
```
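
If you want to inspect a report file yourself, something like the following sketch may help. It is not how `ncbi_submit` parses reports. It assumes a report layout in which `Object` elements carry `target_db`, `spuid`, and `accession` attributes, and the embedded XML is a made-up example of that assumed layout, so check it against your own `report.xml` files before relying on it.

```python
import xml.etree.ElementTree as ET

# Made-up report following the ASSUMED layout; real reports may differ.
example_report = """
<SubmissionStatus submission_id="SUB0000000" status="processed-ok">
  <Action action_id="SUB0000000-samp1" target_db="BioSample" status="processed-ok">
    <Response status="processed-ok">
      <Object target_db="BioSample" spuid="samp1" accession="SAMN00000001"/>
    </Response>
  </Action>
</SubmissionStatus>
"""


def accessions_from_report(xml_text):
    """Collect (target_db, spuid, accession) for every Object with an accession."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for obj in root.iter("Object"):
        accession = obj.get("accession")
        if accession:  # skip objects that have not been assigned an accession
            found.append((obj.get("target_db"), obj.get("spuid"), accession))
    return found


print(accessions_from_report(example_report))
```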


## Updating samples that have already been submitted
### Fastq read updates
If you want to update the reads for a sample you've already submitted, you must do the following:
1. Email nlm-support@nlm.nih.gov and supply them with a list of SRA runs to suppress.
2. Once suppressed, you can upload a new version of the sample where the `submission.xml`
  * references the BioSample (rather than submitting a new BioSample block) and
  * has a new, unique SPUID for the SRA action block.

The `submission.xml` can be prepared as shown below and then submitted as discussed previously in [File Submission](#file-submission). Normally an error would occur if a previously-submitted sample appears in the `seq_report` file, but the flag `--update_reads` tells `ncbi_submit` to search for the BioSample accessions of previously-submitted samples and include those samples in the `submission.xml`. In most cases, if you are updating reads for a sample, a new SRA SPUID is required. The `--spuid_endings` flag takes a parameter mapping the samples being updated to a suffix. For any explicitly named samples, the suffix will be added at the end of the automatically-generated SPUID. Usually '2' is a good suffix choice (unless another update has already been made using that same suffix for the sample of interest).
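
On the command line the mapping is written as a string such as `'suffix1:samp1,samp2;suffix2:samp3'`, while the Python method takes a dict keyed by sample name. The sketch below shows one way the two forms correspond. The function `parse_spuid_endings` is a hypothetical helper written for illustration and is not the parser `ncbi_submit` uses internally.

```python
def parse_spuid_endings(text: str) -> dict:
    """Turn 'suffix1:samp1,samp2;suffix2:samp3' into {sample_name: suffix}."""
    mapping = {}
    for group in text.split(";"):
        if not group.strip():
            continue  # tolerate a trailing ';'
        suffix, samples = group.split(":", 1)
        for sample in samples.split(","):
            mapping[sample.strip()] = suffix.strip()
    return mapping


print(parse_spuid_endings("suffix1:samp1,samp2;suffix2:samp3"))
```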

Shell:
```bash
ncbi_submit file_prep \
    --config "${NCBI_CONFIG}" \
    --seq_report "${SEQ_REPORT}" \
    --outdir "${NCBI_DIR}" \
    --fastq_dir "${FASTQS}" \
    --plate "${PLATE}" \
    --update_reads \
    --spuid_endings 'suffix1:samp1,samp2;suffix2:samp3'
```
Python:
```python
ncbi.write_presubmission_metadata(update_reads=True, spuid_endings={"samp1": "suffix1", "samp2": "suffix1", "samp3": "suffix2"})
```

### Other metadata updates
These are not currently supported but could be added in the future if they seem important/useful.

***
## Input File Paths Explained
### Required Files
  * `config`: Contains preset values and details about your lab, team, and submission plans that are necessary for submission.
  * `seq_report`: Main metadata file with sample details - can be equivalent to NCBI's BioSample TSV for use with the Submission Portal.
### Optional Files
  * `exclude_file`: Contains a list of "sample_name"s to exclude from NCBI submission (each one on a new line).
  * `barcode_map`: Used as a cross-reference. If all samples from `barcode_map` appear in `seq_report`, that's great. Otherwise, you'll get a warning with directions for adding samples to the `exclude_file` if they shouldn't be submitted. File should have no headers. Lines must be: "{barcode}\t{sample_name}".
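
To preview which samples would trigger the `barcode_map` warning, the cross-reference can be approximated as in the sketch below. It assumes the headerless, tab-separated `{barcode}\t{sample_name}` format described above. The function `unreported_samples` and the example sample names are hypothetical, and this is not the check `ncbi_submit` itself performs.

```python
import csv
import io


def unreported_samples(barcode_map_handle, seq_report_samples, excluded=()):
    """Return barcode_map sample names missing from seq_report and not excluded."""
    reader = csv.reader(barcode_map_handle, delimiter="\t")
    mapped = [row[1] for row in reader if len(row) >= 2]  # second column is sample_name
    known = set(seq_report_samples) | set(excluded)
    return [name for name in mapped if name not in known]


# Made-up barcode_map contents: no header, one "{barcode}\t{sample_name}" per line.
barcode_map = io.StringIO("barcode01\tsamp1\nbarcode02\tsamp2\nbarcode03\tcontrol\n")
print(unreported_samples(barcode_map, ["samp1", "samp2"]))
```

Any names returned here are candidates for either adding to `seq_report` or listing in the `exclude_file`.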
### Sometimes Required Paths
  * `fastq_dir`: Required for `file_prep` and `ftp` if submitting reads to SRA. Indicates where the fastqs should be gathered from. Any fastqs with "sample_name" values that aren't supposed to be submitted will be ignored.
  * `outdir`: Highly recommended but will default to "./ncbi" or "./ncbi_test". A directory to house output (submission reports, `exclude_file`, output from `file_prep`). Will be created, if needed.
  * `subdir`: Only used for `ftp` tasks. A prefix to use for submissions for the given dataset. Defaults to `plate`, if plate is provided.

***
## Links to xml template examples/schema:
| File type | BioProject | BioSample | SRA | GenBank | Description/Link
|  --- | --- | --- | --- | --- | --- |
| Webpage | &check; | &check; | &check; | &check; | [Protocols & TSVs for use at Submission Portal](https://www.protocols.io/view/overview-of-ncbi-39-s-sars-cov-2-submission-proces-3byl476e2lo5/v5)
| XML | create | create | create |  | [SRA submission w/ new BioSample & BioProject](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/sra/samples/sra.submission.bs.bp.run.xml?view=co)
| XML | link | create | create |  | [SRA submission w/ new BioSample & existing BioProject](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/sra/samples/sra.submission.bs.run.xml?view=co)
| XML | link | link | create |  | [SRA submission w/ existing BioSample & BioProject](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/sra/samples/sra.submission.run.xml?view=co)
| XML |  |  |  | create | [GenBank XML](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/genbank/SARS-CoV-2/submission.xml?view=co)
| doc |  |  |  | example | [Example GenBank submission zip](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/genbank/SARS-CoV-2/)
| XSD |  | schema |  |  | [BioSample XML Schema](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/biosample/biosample.xsd?revision=71107&view=co)
| XSD | schema |  |  |  | [BioProject XML Schema](https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/bioproject/bioproject.xsd?view=co)
| err | validate |  |  |  | [Submission Error Explanations](https://www.ncbi.nlm.nih.gov/projects/biosample/docs/submission/validation/errors.xml)

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/enviro-lab/ncbi-submit",
    "name": "ncbi-submit",
    "maintainer": "Sam Kunkleman",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "skunklem@uncc.edu",
    "keywords": "ncbi,submission,upload",
    "author": "Sam Kunkleman",
    "author_email": "skunklem@uncc.edu",
    "download_url": "https://files.pythonhosted.org/packages/c5/35/eceec50ef93e2c80bce6f3d336108160b558302d329c0202c0f7333fe8bb/ncbi_submit-0.8.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for submitting to NCBI (SRA, BioSample, & GenBank).",
    "version": "0.8.1",
    "project_urls": {
        "Homepage": "https://github.com/enviro-lab/ncbi-submit",
        "Repository": "https://github.com/enviro-lab/ncbi-submit"
    },
    "split_keywords": [
        "ncbi",
        "submission",
        "upload"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ae6dcb39a1ca81bce0fa44b0c1080f89d97f34cafb29d5e4295dc04fbbe8cf0f",
                "md5": "66008acfd76965159adb2f833530de80",
                "sha256": "da984d2ff0911a1fc7d815d094fbaf73a092076cf76eef55bf8ab681fecdd8cf"
            },
            "downloads": -1,
            "filename": "ncbi_submit-0.8.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "66008acfd76965159adb2f833530de80",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 456360,
            "upload_time": "2024-03-05T14:15:56",
            "upload_time_iso_8601": "2024-03-05T14:15:56.944769Z",
            "url": "https://files.pythonhosted.org/packages/ae/6d/cb39a1ca81bce0fa44b0c1080f89d97f34cafb29d5e4295dc04fbbe8cf0f/ncbi_submit-0.8.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c535eceec50ef93e2c80bce6f3d336108160b558302d329c0202c0f7333fe8bb",
                "md5": "6c770fadb8b026c24e40b0828a81ad39",
                "sha256": "54003c5f344f872a809d0fb36870f8e81c8441ef4f569a24f989e1704795c8e0"
            },
            "downloads": -1,
            "filename": "ncbi_submit-0.8.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6c770fadb8b026c24e40b0828a81ad39",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 432034,
            "upload_time": "2024-03-05T14:15:58",
            "upload_time_iso_8601": "2024-03-05T14:15:58.983808Z",
            "url": "https://files.pythonhosted.org/packages/c5/35/eceec50ef93e2c80bce6f3d336108160b558302d329c0202c0f7333fe8bb/ncbi_submit-0.8.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-05 14:15:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "enviro-lab",
    "github_project": "ncbi-submit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "ncbi-submit"
}
