| Field | Value |
| --- | --- |
| Name | pod5 |
| Version | 0.3.10 |
| home_page | None |
| Summary | Oxford Nanopore Technologies Pod5 File Format Python API and Tools |
| upload_time | 2024-03-25 13:21:56 |
| maintainer | None |
| docs_url | None |
| author | None |
| author_email | Oxford Nanopore Technologies plc <support@nanoporetech.com> |
| requires_python | ~=3.8 |
| license | None |
| keywords | nanopore |
# POD5 Python Package
The `pod5` Python package contains the tools and Python API wrapping the compiled bindings
for the POD5 file format from `lib_pod5`.
## Installation
The `pod5` package is available on [PyPI](https://pypi.org/project/pod5/) and is
installed using `pip`:
``` console
> pip install pod5
```
## Usage
### Reading a POD5 File
To read a `pod5` file, provide the `Reader` class with the input `pod5` file path
and call `Reader.reads()` to iterate over the read records in the file. The example below
prints the read_id of every record in the input `pod5` file.
``` python
import pod5 as p5

with p5.Reader("example.pod5") as reader:
    for read_record in reader.reads():
        print(read_record.read_id)
```
To iterate over a selection of read_ids supply `Reader.reads()` with a collection
of read_ids which must be `UUID` compatible:
``` python
from typing import List

import pod5 as p5

# Create a collection of read_id UUIDs
read_ids: List[str] = [
    "00445e58-3c58-4050-bacf-3411bb716cc3",
    "00520473-4d3d-486b-86b5-f031c59f6591",
]

with p5.Reader("example.pod5") as reader:
    for read_record in reader.reads(selection=read_ids):
        assert str(read_record.read_id) in read_ids
```
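
Because `Reader.reads()` expects `UUID` compatible read_ids, it may help to check a collection before passing it as `selection`. The sketch below uses only the standard library `uuid` module; the helper name `valid_read_ids` is hypothetical and is not part of the `pod5` API.

```python
from typing import Iterable, List
from uuid import UUID


def valid_read_ids(candidates: Iterable[str]) -> List[str]:
    """Return the candidates that parse as UUIDs, in canonical string form."""
    valid = []
    for candidate in candidates:
        try:
            valid.append(str(UUID(candidate.strip())))
        except ValueError:
            # Not UUID compatible, so it cannot identify a read
            continue
    return valid


read_ids = valid_read_ids(
    ["00445e58-3c58-4050-bacf-3411bb716cc3", "not-a-read-id"]
)
print(read_ids)
```

Only the first entry survives the check, so the resulting list could then be passed as `selection=read_ids`.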
### Plotting Signal Data Example
Here is an example of how a user may plot a read’s signal data against time.
``` python
import matplotlib.pyplot as plt
import numpy as np

import pod5 as p5

# Using the example pod5 file provided
example_pod5 = "test_data/multi_fast5_zip.pod5"
selected_read_id = '0000173c-bf67-44e7-9a9c-1ad0bc728e74'

with p5.Reader(example_pod5) as reader:

    # Read the selected read from the pod5 file
    # next() is required here as Reader.reads() returns a Generator
    read = next(reader.reads(selection=[selected_read_id]))

    # Get the signal data and sample rate
    sample_rate = read.run_info.sample_rate
    signal = read.signal

    # Compute the time steps over the sampling period
    time = np.arange(len(signal)) / sample_rate

    # Plot using matplotlib
    plt.plot(time, signal)
```
### Writing a POD5 File
The `pod5` package provides the functionality to write POD5 files.
It is strongly recommended that users first look at the available tools when
manipulating existing datasets, as there may already be a tool to meet your needs.
New tools may be added to support our users and if you have a suggestion for a
new tool or feature please submit a request on the
[pod5-file-format GitHub issues page](https://github.com/nanoporetech/pod5-file-format/issues).
Below is an example of how one may add reads to a new POD5 file using the `Writer`
and its `add_read()` method.
```python
from uuid import UUID

import pod5 as p5

# Populate container classes for read metadata
pore = p5.Pore(channel=123, well=3, pore_type="pore_type")
calibration = p5.Calibration(offset=0.1, scale=1.1)
end_reason = p5.EndReason(name=p5.EndReasonEnum.SIGNAL_POSITIVE, forced=False)
run_info = p5.RunInfo(
    acquisition_id=...,
    acquisition_start_time=...,
    adc_max=...,
    # ... remaining RunInfo fields
)
signal = ...  # some signal data as a numpy np.int16 array

read = p5.Read(
    read_id=UUID("0000173c-bf67-44e7-9a9c-1ad0bc728e74"),
    end_reason=end_reason,
    calibration=calibration,
    pore=pore,
    run_info=run_info,
    # ... remaining Read fields
    signal=signal,
)

with p5.Writer("example.pod5") as writer:
    # Write the read object
    writer.add_read(read)
```
## Tools
1. [pod5 view](#pod5-view)
2. [pod5 inspect](#pod5-inspect)
3. [pod5 merge](#pod5-merge)
4. [pod5 filter](#pod5-filter)
5. [pod5 subset](#pod5-subset)
6. [pod5 repack](#pod5-repack)
7. [pod5 recover](#pod5-recover)
8. [pod5 convert fast5](#pod5-convert-fast5)
9. [pod5 convert to_fast5](#pod5-convert-to_fast5)
10. [pod5 update](#pod5-update)

The ``pod5`` package provides the tools listed above for inspecting and manipulating
POD5 files as well as converting between ``.pod5`` and ``.fast5`` file formats.

To disable the [tqdm](https://github.com/tqdm/tqdm) progress bar, set the environment
variable ``POD5_PBAR=0``.

To enable debugging output, which may also write detailed log files, set the environment
variable ``POD5_DEBUG=1``.
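
When the tools are launched from a Python script, the two environment variables above can be set on a copy of the environment rather than globally. This is a minimal sketch using only the standard library; the helper name ``pod5_env`` and its parameters are hypothetical.

```python
import os
from typing import Dict, Optional


def pod5_env(
    progress_bar: bool = True,
    debug: bool = False,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Copy an environment and apply the POD5_PBAR / POD5_DEBUG switches."""
    env = dict(os.environ if base is None else base)
    if not progress_bar:
        env["POD5_PBAR"] = "0"  # disable the tqdm progress bar
    if debug:
        env["POD5_DEBUG"] = "1"  # enable debugging output
    return env


# An empty base is used here only to keep the printed result short
print(pod5_env(progress_bar=False, debug=True, base={}))
```

The returned mapping could be handed to ``subprocess.run([...], env=pod5_env(progress_bar=False))`` so that only the child process is affected.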
### Pod5 View
The ``pod5 view`` tool is used to produce a table similar to a sequencing summary
from the contents of ``.pod5`` files. The default output is a tab-separated table
written to stdout with all available fields.
This tool is intended to replace ``pod5 inspect reads`` and is over 200x faster.
``` bash
> pod5 view --help
# View the list of fields with a short description in-order (shortcut -L)
> pod5 view --list-fields
# Write the summary to stdout
> pod5 view input.pod5
# Write the summary of multiple pod5s to a file
> pod5 view *.pod5 --output summary.tsv
# Write the summary as a csv
> pod5 view *.pod5 --output summary.csv --separator ','
# Write only the read_ids with no header (shorthand -IH)
> pod5 view input.pod5 --ids --no-header
# Write only the listed fields
# Note: The field order is fixed to the order shown in --list-fields
> pod5 view input.pod5 --include "read_id, channel, num_samples, end_reason"
# Exclude some unwanted fields
> pod5 view input.pod5 --exclude "filename, pore_type"
```
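
The tab-separated table written by ``pod5 view`` can be consumed with the standard library ``csv`` module. The sketch below parses a small, made-up sample of output (the rows are illustrative, with values borrowed from the ``pod5 inspect reads`` example in this document) and totals the ``num_samples`` column.

```python
import csv
import io

# Hypothetical output of:
#   pod5 view input.pod5 --include "read_id, channel, num_samples, end_reason"
view_output = (
    "read_id\tchannel\tnum_samples\tend_reason\n"
    "00445e58-3c58-4050-bacf-3411bb716cc3\t908\t65582\tsignal_positive\n"
    "00520473-4d3d-486b-86b5-f031c59f6591\t220\t167769\tsignal_positive\n"
)

# Each row becomes a dict keyed by the header fields
rows = list(csv.DictReader(io.StringIO(view_output), delimiter="\t"))
total_samples = sum(int(row["num_samples"]) for row in rows)
print(len(rows), total_samples)
```

In practice ``io.StringIO(view_output)`` would be replaced by an open file handle on the ``--output`` file.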
### Pod5 inspect
The ``pod5 inspect`` tool can be used to extract details and summaries of
the contents of ``.pod5`` files. Two of its subcommands, ``reads`` and ``read``,
are described below.
``` bash
> pod5 inspect --help
> pod5 inspect {reads, read, summary} --help
```
#### Pod5 inspect reads
> :warning: This tool is deprecated and has been replaced by ``pod5 view`` which is significantly faster.
Inspect all reads and print a csv table of the details of all reads in the given ``.pod5`` files.
``` bash
> pod5 inspect reads pod5_file.pod5
read_id,channel,well,pore_type,read_number,start_sample,end_reason,median_before,calibration_offset,calibration_scale,sample_count,byte_count,signal_compression_ratio
00445e58-3c58-4050-bacf-3411bb716cc3,908,1,not_set,100776,374223800,signal_positive,205.3,-240.0,0.1,65582,58623,0.447
00520473-4d3d-486b-86b5-f031c59f6591,220,1,not_set,7936,16135986,signal_positive,192.0,-233.0,0.1,167769,146495,0.437
...
```
#### Pod5 inspect read
Inspect the pod5 file, find a specific read and print its details.
``` console
> pod5 inspect read pod5_file.pod5 00445e58-3c58-4050-bacf-3411bb716cc3
File: out-tmp/output.pod5
read_id: 0e5d6827-45f6-462c-9f6b-21540eef4426
read_number: 129227
start_sample: 367096601
median_before: 171.889404296875
channel data:
channel: 2366
well: 1
pore_type: not_set
end reason:
name: signal_positive
forced False
calibration:
offset: -243.0
scale: 0.1462070643901825
samples:
sample_count: 81040
byte_count: 71989
compression ratio: 0.444
run info
acquisition_id: 2ca00715f2e6d8455e5174cd20daa4c38f95fae2
acquisition_start_time: 2021-07-23 13:48:59.780000
adc_max: 0
adc_min: 0
context_tags
barcoding_enabled: 0
basecall_config_filename: dna_r10.3_450bps_hac_prom.cfg
experiment_duration_set: 2880
...
```
### Pod5 merge
``pod5 merge`` is a tool for merging multiple ``.pod5`` files into one monolithic pod5 file.
The contents of the input files are checked for duplicate read_ids to avoid
accidentally merging identical reads. To override this check, set the argument
``-D / --duplicate-ok``.
``` bash
# View help
> pod5 merge --help
# Merge a pair of pod5 files
> pod5 merge example_1.pod5 example_2.pod5 --output merged.pod5
# Merge a glob of pod5 files
> pod5 merge *.pod5 -o merged.pod5
# Merge a glob of pod5 files ignoring duplicate read ids
> pod5 merge *.pod5 -o merged.pod5 --duplicate-ok
```
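
The duplicate check described above can be pictured as counting how often each read_id occurs across all inputs. This is only an illustration of the idea in plain Python, not the implementation used by ``pod5 merge``; the function name and the input structure are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_duplicates(ids_per_file: Dict[str, Iterable[str]]) -> List[str]:
    """Return the read_ids that occur more than once across all inputs."""
    counts = Counter(
        read_id for ids in ids_per_file.values() for read_id in ids
    )
    return sorted(read_id for read_id, seen in counts.items() if seen > 1)


inputs = {
    "example_1.pod5": ["read_a", "read_b"],
    "example_2.pod5": ["read_b", "read_c"],
}
print(find_duplicates(inputs))
```

A non-empty result corresponds to the situation in which a merge would stop unless ``--duplicate-ok`` is set.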
### Pod5 filter
``pod5 filter`` is a simpler alternative to ``pod5 subset``. It selects reads from
one or more input ``.pod5`` files using a list of read ids provided with the ``--ids`` argument
and writes those reads to a *single* ``--output`` file.
See ``pod5 subset`` for more advanced subsetting.
``` bash
> pod5 filter example.pod5 --output filtered.pod5 --ids read_ids.txt
```
The ``--ids`` selection text file must be a simple list of valid UUID read_ids with
one read_id per line. Only records which match the UUID regex (lower-case) are used.
Lines beginning with a ``#`` (hash / pound symbol) are interpreted as comments.
Empty lines are not valid and may cause errors during parsing.
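
Since empty lines may cause parsing errors, a hand-edited list can be cleaned before it is given to ``--ids``. The sketch below keeps only lines that match a lower-case UUID pattern and drops comments and empty lines; the exact regular expression used by the tool is not shown in this document, so ``UUID_RE`` is an assumption, and ``clean_ids`` is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Assumed lower-case UUID pattern (8-4-4-4-12 hexadecimal digits)
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def clean_ids(lines: Iterable[str]) -> List[str]:
    """Keep valid lower-case UUID lines; drop comments and empty lines."""
    kept = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        if UUID_RE.match(text):
            kept.append(text)
    return kept


raw = [
    "# reads of interest\n",
    "00445e58-3c58-4050-bacf-3411bb716cc3\n",
    "\n",
    "00520473-4D3D-486B-86B5-F031C59F6591\n",  # upper-case, so not kept
]
print(clean_ids(raw))
```

Writing the returned list to a file with one read_id per line gives an input in the format described above.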
> The ``filter`` and ``subset`` tools will assert that any requested read_ids are
> present in the inputs. If a requested read_id is missing from the inputs
> then the tool will issue the following error:
>
> ``` bash
> POD5 has encountered an error: 'Missing read_ids from inputs but --missing-ok not set'
> ```
>
> To disable this error, set the ``-M / --missing-ok`` argument.

When supplying multiple input files to ``filter`` or ``subset``, the tool is
effectively performing a ``merge`` operation. The ``merge`` tool is better suited
for handling very large numbers of input files.
#### Example filtering pipeline
This is a trivial example of how to select a random sample of 1000 read_ids from a
pod5 file using ``pod5 view`` and ``pod5 filter``.
``` bash
# Get a random selection of read_ids
> pod5 view all.pod5 --ids --no-header --output all_ids.txt
> sort --random-sort all_ids.txt | head --lines 1000 > 1k_ids.txt
# Filter to that selection
> pod5 filter all.pod5 --ids 1k_ids.txt --output 1k.pod5
# Check the output
> pod5 view 1k.pod5 -IH | wc -l
1000
```
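
The random selection step of the pipeline above can also be done in Python, which may be convenient on systems without GNU ``sort --random-sort``. This is a sketch using the standard library; ``sample_read_ids`` is a hypothetical helper and the read_ids are generated placeholders rather than real data.

```python
import random
import uuid
from typing import List, Optional, Sequence


def sample_read_ids(
    read_ids: Sequence[str], count: int, seed: Optional[int] = None
) -> List[str]:
    """Pick up to `count` distinct read_ids at random."""
    pool = list(read_ids)
    return random.Random(seed).sample(pool, min(count, len(pool)))


# Placeholder ids standing in for the contents of all_ids.txt
all_ids = [str(uuid.uuid4()) for _ in range(5000)]
subset = sample_read_ids(all_ids, 1000, seed=0)
print(len(subset))
```

Writing ``subset`` to a text file, one read_id per line, produces the equivalent of ``1k_ids.txt``.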
### Pod5 subset
``pod5 subset`` is a tool for subsetting reads in ``.pod5`` files into one or more
output ``.pod5`` files. See also ``pod5 filter``.

The ``pod5 subset`` tool requires a *mapping* which defines which read_ids should be
written to which output. The mapping can be given either directly as a ``.csv`` file
(``--csv``) or derived from a ``--table`` (csv or tsv) together with instructions on
how to interpret it (``--columns``).
``pod5 subset`` aims to be a generic tool to subset from multiple inputs to multiple outputs.
If your use-case is to ``filter`` read_ids from one or more inputs into a single output
then ``pod5 filter`` might be a more appropriate tool as the only input is a list of read_ids.
``` bash
# View help
> pod5 subset --help
# Subset input(s) using a pre-defined mapping
> pod5 subset example_1.pod5 --csv mapping.csv
# Subset input(s) using a dynamic mapping created at runtime
> pod5 subset example_1.pod5 --table table.txt --columns barcode
```
> Care should be taken to ensure that when providing multiple input ``.pod5`` files to ``pod5 subset``
> that there are no read_id UUID clashes. If a duplicate read_id is detected an exception
> will be raised unless the ``--duplicate-ok`` argument is set. If ``--duplicate-ok`` is
> set then both reads will be written to the output, although this is not recommended.
#### Note on positional arguments
> The ``--columns`` argument will greedily consume values and as such, care should be taken
> with the placement of any positional arguments. The following line will result in an error
> as the input pod5 file is consumed by ``--columns`` resulting in no input file being set.
```bash
# Invalid placement of positional argument example.pod5
> pod5 subset --table table.txt --columns barcode example.pod5
```
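
The greedy behaviour can be reproduced with a small ``argparse`` parser. The parser below is a hypothetical stand-in built for illustration and is not the parser used by ``pod5``; it only shows how an option that accepts one or more values swallows a trailing positional argument.

```python
import argparse
from typing import List


class SketchParser(argparse.ArgumentParser):
    """Raise instead of exiting so the failure is easy to observe."""

    def error(self, message: str):
        raise ValueError(message)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the command line, not pod5's own parser
    parser = SketchParser(prog="subset-sketch")
    parser.add_argument("inputs", nargs="+")
    parser.add_argument("--table")
    parser.add_argument("--columns", nargs="+")
    return parser


def try_parse(argv: List[str]) -> str:
    try:
        args = build_parser().parse_args(argv)
    except ValueError as exc:
        return f"error: {exc}"
    return f"inputs={args.inputs} columns={args.columns}"


# Positional argument first: parsed as intended
print(try_parse(["example.pod5", "--table", "table.txt", "--columns", "barcode"]))
# Positional argument last: consumed by --columns, so no input remains
print(try_parse(["--table", "table.txt", "--columns", "barcode", "example.pod5"]))
```

Placing the input files before the options, as in the other examples in this section, avoids the problem.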
#### Creating a Subset Mapping
##### Target Mapping (.csv)
The example below shows a ``.csv`` subset target mapping. Any lines (e.g. the header line)
which do not have a read_id matching the UUID regex (lower-case) in the second
column are ignored.
``` text
target, read_id
output_1.pod5,132b582c-56e8-4d46-9e3d-48a275646d3a
output_1.pod5,12a4d6b1-da6e-4136-8bb3-1470ef27e311
output_2.pod5,0ff4dc01-5fa4-4260-b54e-1d8716c7f225
output_2.pod5,0e359c40-296d-4edc-8f4a-cca135310ab2
output_2.pod5,0e9aa0f8-99ad-40b3-828a-45adbb4fd30c
```
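
To make the mapping format concrete, the sketch below parses the example above into a dictionary of output target to read_ids, skipping any line whose second column is not a lower-case UUID. The regular expression and the function name ``parse_mapping`` are assumptions for illustration and do not reflect the tool's internals.

```python
import csv
import io
import re
from collections import defaultdict
from typing import Dict, List

# Assumed lower-case UUID pattern (8-4-4-4-12 hexadecimal digits)
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def parse_mapping(text: str) -> Dict[str, List[str]]:
    """Group read_ids by target, ignoring lines without a valid read_id."""
    targets = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue
        read_id = row[1].strip()
        if UUID_RE.match(read_id):
            targets[row[0].strip()].append(read_id)
    return dict(targets)


mapping_csv = """target, read_id
output_1.pod5,132b582c-56e8-4d46-9e3d-48a275646d3a
output_1.pod5,12a4d6b1-da6e-4136-8bb3-1470ef27e311
output_2.pod5,0ff4dc01-5fa4-4260-b54e-1d8716c7f225
"""
for target, read_ids in parse_mapping(mapping_csv).items():
    print(target, len(read_ids))
```

The header line is dropped because ``read_id`` does not match the pattern, which mirrors the behaviour described above.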
##### Target Mapping from Table
``pod5 subset`` can dynamically generate output targets and collect associated reads
based on a text file containing a table (csv or tsv) parsable by ``polars``.
This table file could be the output from ``pod5 view`` or from a sequencing summary.
The table must contain a header row and a series of columns on which to group unique
collections of values. Internally this process uses the
[polars.DataFrame.group_by](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html)
function where the ``by`` parameter is the sequence of column names specified with
the ``--columns`` argument.
Given the following example ``--table`` file, observe the resultant outputs given various
arguments:
``` text
read_id mux barcode length
read_a 1 barcode_a 4321
read_b 1 barcode_b 1000
read_c 2 barcode_b 1200
read_d 2 barcode_c 1234
```
``` bash
> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode
> ls barcode_subset
barcode-barcode_a.pod5 # Contains: read_a
barcode-barcode_b.pod5 # Contains: read_b, read_c
barcode-barcode_c.pod5 # Contains: read_d
> pod5 subset example_1.pod5 --output mux_subset --table table.txt --columns mux
> ls mux_subset
mux-1.pod5 # Contains: read_a, read_b
mux-2.pod5 # Contains: read_c, read_d
> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux
> ls barcode_mux_subset
barcode-barcode_a_mux-1.pod5 # Contains: read_a
barcode-barcode_b_mux-1.pod5 # Contains: read_b
barcode-barcode_b_mux-2.pod5 # Contains: read_c
barcode-barcode_c_mux-2.pod5 # Contains: read_d
```
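
The grouping shown above can be imitated in plain Python to see which reads end up in which file. This sketch uses a dictionary instead of ``polars`` and assumes the default ``column_name-column_value`` naming seen in the listings; ``group_reads`` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# The example --table from above, one dict per row
TABLE = [
    {"read_id": "read_a", "mux": "1", "barcode": "barcode_a"},
    {"read_id": "read_b", "mux": "1", "barcode": "barcode_b"},
    {"read_id": "read_c", "mux": "2", "barcode": "barcode_b"},
    {"read_id": "read_d", "mux": "2", "barcode": "barcode_c"},
]


def group_reads(
    rows: Sequence[Dict[str, str]], columns: Sequence[str]
) -> Dict[str, List[str]]:
    """Group read_ids by the values found in the chosen columns."""
    groups = defaultdict(list)
    for row in rows:
        name = "_".join(f"{col}-{row[col]}" for col in columns) + ".pod5"
        groups[name].append(row["read_id"])
    return dict(groups)


for filename, reads in group_reads(TABLE, ["barcode", "mux"]).items():
    print(filename, reads)
```

Calling ``group_reads(TABLE, ["barcode"])`` or ``group_reads(TABLE, ["mux"])`` reproduces the other two listings.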
##### Output Filename Templating
When subsetting using a table the output filename is generated from a template
string. The automatically generated template is the sequential concatenation of
``column_name-column_value`` followed by the ``.pod5`` file extension.
The user can set their own filename template using the ``--template`` argument.
This argument accepts a string in the [Python f-string style](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals)
where the subsetting variables are used for keyword placeholder substitution.
Keywords should be placed within curly-braces. For example:
``` bash
# default template used = "barcode-{barcode}.pod5"
> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode
# default template used = "barcode-{barcode}_mux-{mux}.pod5"
> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux
> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode --template "{barcode}.subset.pod5"
> ls barcode_subset
barcode_a.subset.pod5 # Contains: read_a
barcode_b.subset.pod5 # Contains: read_b, read_c
barcode_c.subset.pod5 # Contains: read_d
```
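
The placeholder substitution can be tried out with ``str.format``, which fills curly-brace keywords in the same way. This sketch assumes the default template is built as shown in the comments above; ``default_template`` and ``render`` are hypothetical helpers, not part of the ``pod5`` API.

```python
from typing import Dict, Sequence


def default_template(columns: Sequence[str]) -> str:
    """Build e.g. 'barcode-{barcode}_mux-{mux}.pod5' from column names."""
    return "_".join(f"{col}-{{{col}}}" for col in columns) + ".pod5"


def render(template: str, values: Dict[str, str]) -> str:
    """Substitute the subsetting variables into the template."""
    return template.format(**values)


row = {"barcode": "barcode_a", "mux": "1"}
print(default_template(["barcode", "mux"]))
print(render(default_template(["barcode", "mux"]), row))
print(render("{barcode}.subset.pod5", row))
```

Unused keys such as ``mux`` in the last call are simply ignored by ``str.format``.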
##### Example subsetting from ``pod5 inspect reads``
The ``pod5 inspect reads`` tool will output a csv table summarising the content of the
specified ``.pod5`` file which can be used for subsetting. The example below shows
how to split a ``.pod5`` file by the well field.
``` bash
# Create the csv table from inspect reads
> pod5 inspect reads example.pod5 > table.csv
> pod5 subset example.pod5 --table table.csv --columns well
```
### Pod5 repack
``pod5 repack`` will simply repack ``.pod5`` files into one-for-one output files of the same name.
``` bash
> pod5 repack pod5s/*.pod5 repacked_pods/
```
### Pod5 Recover
``pod5 recover`` will attempt to recover data from corrupted or truncated ``.pod5`` files
by copying all valid table batches and cleanly closing the new files. New files are written
as siblings to the inputs with the `_recovered.pod5` suffix.
``` bash
> pod5 recover --help
> pod5 recover broken.pod5
> ls
broken.pod5 broken_recovered.pod5
```
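
For scripts that need to locate the recovered file afterwards, the sibling naming rule can be expressed with ``pathlib``. A minimal sketch, assuming the ``_recovered.pod5`` suffix replaces the original ``.pod5`` extension as in the listing above; ``recovered_path`` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def recovered_path(path: str) -> str:
    """Return the sibling path carrying the _recovered.pod5 suffix."""
    source = PurePosixPath(path)
    return str(source.with_name(source.stem + "_recovered.pod5"))


print(recovered_path("data/broken.pod5"))
```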
### Pod5 convert fast5
The ``pod5 convert fast5`` tool takes one or more ``.fast5`` files and converts them
to one or more ``.pod5`` files.
If the tool detects single-read fast5 files, please convert them into multi-read
fast5 files using the tools available in the ``ont_fast5_api`` project.
The progress bar shown during conversion assumes the number of reads in an input
``.fast5`` is 4000. The progress bar will update the total value during runtime if
required.
> Some content previously stored in ``.fast5`` files is **not** compatible with the POD5
> format and will not be converted. This includes all analyses stored in the
> ``.fast5`` file.
>
> Please ensure that any other data is recovered from ``.fast5`` before deletion.
By default ``pod5 convert fast5`` will show exceptions raised during conversion as *warnings*
to the user. This is to gracefully handle potentially corrupt input files or other
runtime errors in long-running conversion tasks. The ``--strict`` argument allows
users to opt-in to strict runtime assertions where any exception raised will promptly
stop the conversion process with an error.
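
The difference between the default and ``--strict`` behaviour can be sketched as a loop that either warns and continues or re-raises. This is a generic illustration of the pattern, not the tool's implementation; ``convert_all`` and the ``convert`` callable are hypothetical.

```python
import warnings
from typing import Callable, Iterable, List


def convert_all(
    paths: Iterable[str],
    convert: Callable[[str], None],
    strict: bool = False,
) -> List[str]:
    """Convert each path; warn on failure unless strict, then re-raise."""
    converted = []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            if strict:
                raise  # stop promptly with the original error
            warnings.warn(f"Skipping {path}: {exc}")
            continue
        converted.append(path)
    return converted


def fake_convert(path: str) -> None:
    # Stand-in converter that fails on one input
    if "corrupt" in path:
        raise OSError("unreadable file")


print(convert_all(["a.fast5", "corrupt.fast5", "b.fast5"], fake_convert))
```

With ``strict=True`` the same call would raise ``OSError`` at the second file instead of finishing the remaining inputs.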
``` bash
# View help
> pod5 convert fast5 --help
# Convert fast5 files into a monolithic output file
> pod5 convert fast5 ./input/*.fast5 --output converted.pod5
# Convert fast5 files into a monolithic output in an existing directory
> pod5 convert fast5 ./input/*.fast5 --output outputs/
> ls outputs/
output.pod5 # default name
# Convert each fast5 to its relative converted output. The output files are written
# into the output directory at paths relative to the path given to the
# --one-to-one argument. Note: This path must be a relative parent to all
# input paths.
> ls input/*.fast5
file_1.fast5 file_2.fast5 ... file_N.fast5
> pod5 convert fast5 ./input/*.fast5 --output output_pod5s/ --one-to-one ./input/
> ls output_pod5s/
file_1.pod5 file_2.pod5 ... file_N.pod5
# Note the different --one-to-one path which is now the current working directory.
# The new sub-directory output_pod5s/input is created.
> pod5 convert fast5 ./input/*.fast5 --output output_pod5s --one-to-one ./
> ls output_pod5s/
input/file_1.pod5 input/file_2.pod5 ... input/file_N.pod5
# Convert all inputs so that they have neighbouring pod5 files in the current directory
> pod5 convert fast5 *.fast5 --output . --one-to-one .
> ls
file_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5 ... file_N.fast5 file_N.pod5
# Convert all inputs so that they have neighbouring pod5 files from a parent directory
> pod5 convert fast5 ./input/*.fast5 --output ./input/ --one-to-one ./input/
> ls input/*
file_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5 ... file_N.fast5 file_N.pod5
```
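
The ``--one-to-one`` path mapping in the examples above can be reasoned about with ``pathlib``: the input path is taken relative to the ``--one-to-one`` directory and re-rooted under ``--output``. The function below is a sketch of that rule as it appears in the listings, not the tool's own code; ``one_to_one_output`` is a hypothetical name.

```python
from pathlib import PurePosixPath


def one_to_one_output(fast5: str, output: str, one_to_one: str) -> str:
    """Map an input .fast5 path to its .pod5 path under the output directory."""
    # relative_to raises ValueError if one_to_one is not a parent of fast5
    relative = PurePosixPath(fast5).relative_to(PurePosixPath(one_to_one))
    return str(PurePosixPath(output) / relative.with_suffix(".pod5"))


print(one_to_one_output("./input/file_1.fast5", "output_pod5s/", "./input/"))
print(one_to_one_output("./input/file_1.fast5", "output_pod5s/", "./"))
```

The ``ValueError`` raised for an unrelated directory corresponds to the requirement that the ``--one-to-one`` path be a relative parent of all input paths.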
### Pod5 convert to_fast5
The ``pod5 convert to_fast5`` tool takes one or more ``.pod5`` files and converts them
to multiple ``.fast5`` files. The default behaviour is to write 4000 reads per output file
but this can be controlled with the ``--file-read-count`` argument.
``` bash
# View help
> pod5 convert to_fast5 --help
# Convert pod5 files to fast5 files with default 4000 reads per file
> pod5 convert to_fast5 example.pod5 --output pod5_to_fast5/
> ls pod5_to_fast5/
output_1.fast5 output_2.fast5 ... output_N.fast5
```
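
The number of ``.fast5`` files to expect follows from the read count and the ``--file-read-count`` value. A small arithmetic sketch, assuming reads are written in consecutive batches with only the last file partially filled; ``fast5_file_count`` is a hypothetical helper.

```python
import math


def fast5_file_count(num_reads: int, file_read_count: int = 4000) -> int:
    """Number of output files needed for num_reads reads."""
    return math.ceil(num_reads / file_read_count)


print(fast5_file_count(10000))
print(fast5_file_count(10000, file_read_count=1000))
```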
### Pod5 Update
The ``pod5 update`` tool is used to update old pod5 files to use the latest schema.
Currently the latest schema version is version 3.
Files are written into the ``--output`` directory with the same name.
``` bash
> pod5 update --help
# Update a named file
> pod5 update my.pod5 --output updated/
> ls updated
my.pod5
# Update an entire directory
> pod5 update old/ -o updated/
```
Raw data
{
"_id": null,
"home_page": null,
"name": "pod5",
"maintainer": null,
"docs_url": null,
"requires_python": "~=3.8",
"maintainer_email": null,
"keywords": "nanopore",
"author": null,
"author_email": "Oxford Nanopore Technologies plc <support@nanoporetech.com>",
"download_url": "https://files.pythonhosted.org/packages/69/b0/b5c4ca9cec24b982e72d5c805a9605e7eab4e39333c8cc77295a5eae412d/pod5-0.3.10.tar.gz",
"platform": null,
"description": "# POD5 Python Package\n\nThe `pod5` Python package contains the tools and python API wrapping the compiled bindings\nfor the POD5 file format from `lib_pod5`.\n\n## Installation\n\nThe `pod5` package is available on [pypi](https://pypi.org/project/pod5/) and is\ninstalled using `pip`:\n\n``` console\n > pip install pod5\n```\n\n## Usage\n\n### Reading a POD5 File\n\nTo read a `pod5` file provide the the `Reader` class with the input `pod5` file path\nand call `Reader.reads()` to iterate over read records in the file. The example below\nprints the read_id of every record in the input `pod5` file.\n\n``` python\nimport pod5 as p5\n\nwith p5.Reader(\"example.pod5\") as reader:\n for read_record in reader.reads():\n print(read_record.read_id)\n```\n\nTo iterate over a selection of read_ids supply `Reader.reads()` with a collection\nof read_ids which must be `UUID` compatible:\n\n``` python\nimport pod5 as p5\n\n# Create a collection of read_id UUIDs\nread_ids: List[str] = [\n \"00445e58-3c58-4050-bacf-3411bb716cc3\",\n \"00520473-4d3d-486b-86b5-f031c59f6591\",\n]\n\nwith p5.Reader(\"example.pod5\") as reader:\n for read_record in reader.reads(selection=read_ids):\n assert str(read_record.read_id) in read_ids\n```\n\n### Plotting Signal Data Example\n\nHere is an example of how a user may plot a read\u2019s signal data against time.\n\n``` python\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nimport pod5 as p5\n\n# Using the example pod5 file provided\nexample_pod5 = \"test_data/multi_fast5_zip.pod5\"\nselected_read_id = '0000173c-bf67-44e7-9a9c-1ad0bc728e74'\n\nwith p5.Reader(example_pod5) as reader:\n\n # Read the selected read from the pod5 file\n # next() is required here as Reader.reads() returns a Generator\n read = next(reader.reads(selection=[selected_read_id]))\n\n # Get the signal data and sample rate\n sample_rate = read.run_info.sample_rate\n signal = read.signal\n\n # Compute the time steps over the sampling period\n time = 
np.arange(len(signal)) / sample_rate\n\n # Plot using matplotlib\n plt.plot(time, signal)\n```\n\n### Writing a POD5 File\n\nThe `pod5` package provides the functionality to write POD5 files.\n\nIt is strongly recommended that users first look at the available tools when\nmanipulating existing datasets, as there may already be a tool to meet your needs.\nNew tools may be added to support our users and if you have a suggestion for a\nnew tool or feature please submit a request on the\n[pod5-file-format GitHub issues page](https://github.com/nanoporetech/pod5-file-format/issues).\n\nBelow is an example of how one may add reads to a new POD5 file using the `Writer`\nand its `add_read()` method.\n\n```python\nimport pod5 as p5\n\n# Populate container classes for read metadata\npore = p5.Pore(channel=123, well=3, pore_type=\"pore_type\")\ncalibration = p5.Calibration(offset=0.1, scale=1.1)\nend_reason = p5.EndReason(name=p5.EndReasonEnum.SIGNAL_POSITIVE, forced=False)\nrun_info = p5.RunInfo(\n acquisition_id = ...\n acquisition_start_time = ...\n adc_max = ...\n ...\n)\nsignal = ... # some signal data as numpy np.int16 array\n\nread = p5.Read(\n read_id=UUID(\"0000173c-bf67-44e7-9a9c-1ad0bc728e74\"),\n end_reason=end_reason,\n calibration=calibration,\n pore=pore,\n run_info=run_info,\n ...\n signal=signal,\n)\n\nwith p5.Writer(\"example.pod5\") as writer:\n # Write the read object\n writer.add_read(read)\n```\n\n## Tools\n\n1. [pod5 view](#pod5-view)\n2. [pod5 inspect](#pod5-inspect)\n3. [pod5 merge](#pod5-merge)\n4. [pod5 filter](#pod5-filter)\n5. [pod5 subset](#pod5-subset)\n6. [pod5 repack](#pod5-repack)\n7. [pod5 recover](#pod5-recover)\n8. [pod5 convert fast5](#pod5-convert-fast5)\n9. [pod5 convert to_fast5](#pod5-convert-to_fast5)\n10. 
[pod5 update](#pod5-update)\n\nThe ``pod5`` package provides the following tools for inspecting and manipulating\nPOD5 files as well as converting between ``.pod5`` and ``.fast5`` file formats.\n\nTo disable the `tqdm <https://github.com/tqdm/tqdm>`_ progress bar set the environment\nvariable ``POD5_PBAR=0``.\n\nTo enable debugging output which may also output detailed log files, set the environment\nvariable ``POD5_DEBUG=1``\n\n### Pod5 View\n\nThe ``pod5 view`` tool is used to produce a table similarr to a sequencing summary\nfrom the contents of ``.pod5`` files. The default output is a tab-separated table\nwritten to stdout with all available fields.\n\nThis tools is indented to replace ``pod5 inspect reads`` and is over 200x faster.\n\n``` bash\n> pod5 view --help\n\n# View the list of fields with a short description in-order (shortcut -L)\n> pod5 view --list-fields\n\n# Write the summary to stdout\n> pod5 view input.pod5\n\n# Write the summary of multiple pod5s to a file\n> pod5 view *.pod5 --output summary.tsv\n\n# Write the summary as a csv\n> pod5 view *.pod5 --output summary.csv --separator ','\n\n# Write only the read_ids with no header (shorthand -IH)\n> pod5 view input.pod5 --ids --no-header\n\n# Write only the listed fields\n# Note: The field order is fixed the order shown in --list-fields\n> pod5 view input.pod5 --include \"read_id, channel, num_samples, end_reason\"\n\n# Exclude some unwanted fields\n> pod5 view input.pod5 --exclude \"filename, pore_type\"\n```\n\n### Pod5 inspect\n\nThe ``pod5 inspect`` tool can be used to extract details and summaries of\nthe contents of ``.pod5`` files. 
There are two programs for users within ``pod5 inspect``\nand these are read and reads\n\n``` bash\n> pod5 inspect --help\n> pod5 inspect {reads, read, summary} --help\n```\n\n#### Pod5 inspect reads\n\n> :warning: This tool is deprecated and has been replaced by ``pod5 view`` which is significantly faster.\n\nInspect all reads and print a csv table of the details of all reads in the given ``.pod5`` files.\n\n``` bash\n> pod5 inspect reads pod5_file.pod5\n\n read_id,channel,well,pore_type,read_number,start_sample,end_reason,median_before,calibration_offset,calibration_scale,sample_count,byte_count,signal_compression_ratio\n 00445e58-3c58-4050-bacf-3411bb716cc3,908,1,not_set,100776,374223800,signal_positive,205.3,-240.0,0.1,65582,58623,0.447\n 00520473-4d3d-486b-86b5-f031c59f6591,220,1,not_set,7936,16135986,signal_positive,192.0,-233.0,0.1,167769,146495,0.437\n ...\n```\n\n#### Pod5 inspect read\n\nInspect the pod5 file, find a specific read and print its details.\n\n``` console\n> pod5 inspect read pod5_file.pod5 00445e58-3c58-4050-bacf-3411bb716cc3\n\n File: out-tmp/output.pod5\n read_id: 0e5d6827-45f6-462c-9f6b-21540eef4426\n read_number: 129227\n start_sample: 367096601\n median_before: 171.889404296875\n channel data:\n channel: 2366\n well: 1\n pore_type: not_set\n end reason:\n name: signal_positive\n forced False\n calibration:\n offset: -243.0\n scale: 0.1462070643901825\n samples:\n sample_count: 81040\n byte_count: 71989\n compression ratio: 0.444\n run info\n acquisition_id: 2ca00715f2e6d8455e5174cd20daa4c38f95fae2\n acquisition_start_time: 2021-07-23 13:48:59.780000\n adc_max: 0\n adc_min: 0\n context_tags\n barcoding_enabled: 0\n basecall_config_filename: dna_r10.3_450bps_hac_prom.cfg\n experiment_duration_set: 2880\n ...\n```\n\n### Pod5 merge\n\n``pod5 merge`` is a tool for merging multiple ``.pod5`` files into one monolithic pod5 file.\n\nThe contents of the input files are checked for duplicate read_ids to avoid\naccidentally merging identical 
reads. To override this check set the argument\n``-D / --duplicate-ok``\n\n``` bash\n# View help\n> pod5 merge --help\n\n# Merge a pair of pod5 files\n> pod5 merge example_1.pod5 example_2.pod5 --output merged.pod5\n\n# Merge a glob of pod5 files\n> pod5 merge *.pod5 -o merged.pod5\n\n# Merge a glob of pod5 files ignoring duplicate read ids\n> pod5 merge *.pod5 -o merged.pod5 --duplicate-ok\n```\n\n### Pod5 filter\n\n``pod5 filter`` is a simpler alternative to ``pod5 subset`` where reads are subset from\none or more input ``.pod5`` files using a list of read ids provided using the ``--ids`` argument\nand writing those reads to a *single* ``--output`` file.\n\nSee ``pod5 subset`` for more advanced subsetting.\n\n``` bash\n> pod5 filter example.pod5 --output filtered.pod5 --ids read_ids.txt\n```\n\nThe ``--ids`` selection text file must be a simple list of valid UUID read_ids with\none read_id per line. Only records which match the UUID regex (lower-case) are used.\nLines beginning with a ``#`` (hash / pound symbol) are interpreted as comments.\nEmpty lines are not valid and may cause errors during parsing.\n\n> The ``filter`` and ``subset`` tools will assert that any requested read_ids are\n> present in the inputs. If a requested read_id is missing from the inputs\n> then the tool will issue the following error:\n>\n> ``` bash\n> POD5 has encountered an error: 'Missing read_ids from inputs but --missing-ok not set'\n> ```\n>\n> To disable this warning then the '-M / --missing-ok' argument.\n\nWhen supplying multiple input files to 'filter' or 'subset', the tools is\neffectively performing a ``merge`` operation. 
The 'merge' tool is better suited\nfor handling very large numbers of input files.\n\n#### Example filtering pipeline\n\nThis is a trivial example of how to select a random sample of 1000 read_ids from a\npod5 file using ``pod5 view`` and ``pod5 filter``.\n\n``` bash\n# Get a random selection of read_ids\n> pod5 view all.pod5 --ids --no-header --output all_ids.txt\n> all_ids.txt sort --random-sort | head --lines 1000 > 1k_ids.txt\n\n# Filter to that selection\n> pod5 filter all.pod5 --ids 1k_ids.txt --output 1k.pod5\n\n# Check the output\n> pod5 view 1k.pod5 -IH | wc -l\n1000\n```\n\n### Pod5 subset\n\n``pod5 subset`` is a tool for subsetting reads in ``.pod5`` files into one or more\noutput ``.pod5`` files. See also ``pod5 filter``\n\nThe ``pod5 subset`` tool requires a *mapping* which defines which read_ids should be\nwritten to which output. There are multiple ways of specifying this mapping which are\ndefined in either a ``.csv`` file or by using a ``--table`` (csv or tsv)\nand instructions on how to interpret it.\n\n``pod5 subset`` aims to be a generic tool to subset from multiple inputs to multiple outputs.\nIf your use-case is to ``filter`` read_ids from one or more inputs into a single output\nthen ``pod5 filter`` might be a more appropriate tool as the only input is a list of read_ids.\n\n``` bash\n# View help\n> pod5 subset --help\n\n# Subset input(s) using a pre-defined mapping\n> pod5 subset example_1.pod5 --csv mapping.csv\n\n# Subset input(s) using a dynamic mapping created at runtime\n> pod5 subset example_1.pod5 --table table.txt --columns barcode\n```\n\n> Care should be taken to ensure that when providing multiple input ``.pod5`` files to ``pod5 subset``\n> that there are no read_id UUID clashes. If a duplicate read_id is detected an exception\n> will be raised unless the ``--duplicate-ok`` argument is set. 
If ``--duplicate-ok`` is\n> set then both reads will be written to the output, although this is not recommended.\n\n#### Note on positional arguments\n\n> The ``--columns`` argument will greedily consume values and as such, care should be taken\n> with the placement of any positional arguments. The following line will result in an error\n> as the input pod5 file is consumed by ``--columns`` resulting in no input file being set.\n\n```bash\n# Invalid placement of positional argument example.pod5\n$ pod5 subset --table table.txt --columns barcode example.pod5\n```\n\n#### Creating a Subset Mapping\n\n##### Target Mapping (.csv)\n\nThe example below shows a ``.csv`` subset target mapping. Any lines (e.g. header line)\nwhich do not have a read_id which matches the UUID regex (lower-case) in the second\ncolumn is ignored.\n\n``` text\ntarget, read_id\noutput_1.pod5,132b582c-56e8-4d46-9e3d-48a275646d3a\noutput_1.pod5,12a4d6b1-da6e-4136-8bb3-1470ef27e311\noutput_2.pod5,0ff4dc01-5fa4-4260-b54e-1d8716c7f225\noutput_2.pod5,0e359c40-296d-4edc-8f4a-cca135310ab2\noutput_2.pod5,0e9aa0f8-99ad-40b3-828a-45adbb4fd30c\n```\n\n##### Target Mapping from Table\n\n``pod5 subset`` can dynamically generate output targets and collect associated reads\nbased on a text file containing a table (csv or tsv) parsible by ``polars``.\nThis table file could be the output from ``pod5 view`` or from a sequencing summary.\nThe table must contain a header row and a series of columns on which to group unique\ncollections of values. 
Internally this process uses the\n[`polars.DataFrame.group_by`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html)\nfunction where the ``by`` parameter is the sequence of column names specified with\nthe ``--columns`` argument.\n\nGiven the following example ``--table`` file, observe the resultant outputs given various\narguments:\n\n``` text\nread_id mux barcode length\nread_a 1 barcode_a 4321\nread_b 1 barcode_b 1000\nread_c 2 barcode_b 1200\nread_d 2 barcode_c 1234\n```\n\n``` bash\n> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode\n> ls barcode_subset\nbarcode-barcode_a.pod5 # Contains: read_a\nbarcode-barcode_b.pod5 # Contains: read_b, read_c\nbarcode-barcode_c.pod5 # Contains: read_d\n\n> pod5 subset example_1.pod5 --output mux_subset --table table.txt --columns mux\n> ls mux_subset\nmux-1.pod5 # Contains: read_a, read_b\nmux-2.pod5 # Contains: read_c, read_d\n\n> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux\n> ls barcode_mux_subset\nbarcode-barcode_a_mux-1.pod5 # Contains: read_a\nbarcode-barcode_b_mux-1.pod5 # Contains: read_b\nbarcode-barcode_b_mux-2.pod5 # Contains: read_c\nbarcode-barcode_c_mux-2.pod5 # Contains: read_d\n```\n\n##### Output Filename Templating\n\nWhen subsetting using a table the output filename is generated from a template\nstring. The automatically generated template is the sequential concatenation of\n``column_name-column_value`` pairs, joined by underscores and followed by the\n``.pod5`` file extension.\n\nThe user can set their own filename template using the ``--template`` argument.\nThis argument accepts a string in the [Python f-string style](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals)\nwhere the subsetting variables are used for keyword placeholder substitution.\nKeywords should be placed within curly-braces. 
For example:\n\n``` bash\n# default template used = \"barcode-{barcode}.pod5\"\n> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode\n\n# default template used = \"barcode-{barcode}_mux-{mux}.pod5\"\n> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux\n\n> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode --template \"{barcode}.subset.pod5\"\n> ls barcode_subset\nbarcode_a.subset.pod5 # Contains: read_a\nbarcode_b.subset.pod5 # Contains: read_b, read_c\nbarcode_c.subset.pod5 # Contains: read_d\n```\n\n##### Example subsetting from ``pod5 inspect reads``\n\nThe ``pod5 inspect reads`` tool will output a csv table summarising the content of the\nspecified ``.pod5`` file which can be used for subsetting. The example below shows\nhow to split a ``.pod5`` file by the well field.\n\n``` bash\n# Create the csv table from inspect reads\n> pod5 inspect reads example.pod5 > table.csv\n> pod5 subset example.pod5 --table table.csv --columns well\n```\n\n### Pod5 repack\n\n``pod5 repack`` will simply repack ``.pod5`` files into one-for-one output files of the same name.\n\n``` bash\n> pod5 repack pod5s/*.pod5 repacked_pods/\n```\n\n### Pod5 Recover\n\n``pod5 recover`` will attempt to recover data from corrupted or truncated ``.pod5`` files\nby copying all valid table batches and cleanly closing the new files. 
New files are written\nas siblings to the inputs with the `_recovered.pod5` suffix.\n\n``` bash\n> pod5 recover --help\n> pod5 recover broken.pod5\n> ls\nbroken.pod5 broken_recovered.pod5\n```\n\n### Pod5 convert fast5\n\nThe ``pod5 convert fast5`` tool takes one or more ``.fast5`` files and converts them\nto one or more ``.pod5`` files.\n\nIf the tool detects single-read fast5 files, please convert them into multi-read\nfast5 files using the tools available in the ``ont_fast5_api`` project.\n\nThe progress bar shown during conversion assumes the number of reads in an input\n``.fast5`` is 4000. The progress bar will update the total value during runtime if\nrequired.\n\n> Some content previously stored in ``.fast5`` files is **not** compatible with the POD5\n> format and will not be converted. This includes all analyses stored in the\n> ``.fast5`` file.\n>\n> Please ensure that any other data is recovered from ``.fast5`` before deletion.\n\nBy default ``pod5 convert fast5`` will show exceptions raised during conversion as *warnings*\nto the user. This is to gracefully handle potentially corrupt input files or other\nruntime errors in long-running conversion tasks. The ``--strict`` argument allows\nusers to opt in to strict runtime assertions where any exception raised will promptly\nstop the conversion process with an error.\n\n``` bash\n# View help\n> pod5 convert fast5 --help\n\n# Convert fast5 files into a monolithic output file\n> pod5 convert fast5 ./input/*.fast5 --output converted.pod5\n\n# Convert fast5 files into a monolithic output in an existing directory\n> pod5 convert fast5 ./input/*.fast5 --output outputs/\n> ls outputs/\noutput.pod5 # default name\n\n# Convert each fast5 to its relative converted output. The output files are written\n# into the output directory at paths relative to the path given to the\n# --one-to-one argument. Note: This path must be a relative parent to all\n# input paths.\n> ls input/*.fast5\nfile_1.fast5 file_2.fast5 ... 
file_N.fast5\n> pod5 convert fast5 ./input/*.fast5 --output output_pod5s/ --one-to-one ./input/\n> ls output_pod5s/\nfile_1.pod5 file_2.pod5 ... file_N.pod5\n\n# Note the different --one-to-one path which is now the current working directory.\n# The new sub-directory output_pod5s/input is created.\n> pod5 convert fast5 ./input/*.fast5 --output output_pod5s --one-to-one ./\n> ls output_pod5s/\ninput/file_1.pod5 input/file_2.pod5 ... input/file_N.pod5\n\n# Convert all inputs so that they have neighbouring pod5 files in the current directory\n> pod5 convert fast5 *.fast5 --output . --one-to-one .\n> ls\nfile_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5 ... file_N.fast5 file_N.pod5\n\n# Convert all inputs so that they have neighbouring pod5 files from a parent directory\n> pod5 convert fast5 ./input/*.fast5 --output ./input/ --one-to-one ./input/\n> ls input/*\nfile_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5 ... file_N.fast5 file_N.pod5\n```\n\n### Pod5 convert to_fast5\n\nThe ``pod5 convert to_fast5`` tool takes one or more ``.pod5`` files and converts them\nto multiple ``.fast5`` files. The default behaviour is to write 4000 reads per output file\nbut this can be controlled with the ``--file-read-count`` argument.\n\n``` bash\n# View help\n> pod5 convert to_fast5 --help\n\n# Convert pod5 files to fast5 files with default 4000 reads per file\n> pod5 convert to_fast5 example.pod5 --output pod5_to_fast5/\n> ls pod5_to_fast5/\noutput_1.fast5 output_2.fast5 ... output_N.fast5\n```\n\n### Pod5 Update\n\nThe ``pod5 update`` tool is used to update old pod5 files to use the latest schema.\nCurrently the latest schema version is version 3.\n\nFiles are written into the ``--output`` directory with the same name.\n\n``` bash\n> pod5 update --help\n\n# Update a named file\n> pod5 update my.pod5 --output updated/\n> ls updated\nupdated/my.pod5\n\n# Update an entire directory\n> pod5 update old/ -o updated/\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "Oxford Nanopore Technologies Pod5 File Format Python API and Tools",
"version": "0.3.10",
"project_urls": {
"Documentation": "https://pod5-file-format.readthedocs.io/en/latest/",
"Homepage": "https://github.com/nanoporetech/pod5-file-format",
"Issues": "https://github.com/nanoporetech/pod5-file-format/issues"
},
"split_keywords": [
"nanopore"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "407b5baef5b0627a14d78ec511b9ea5776a0e99ab8c54fb59ff391836de6597e",
"md5": "1bec9a0dd62a319315aea38d24c99943",
"sha256": "3ecfce9d4d4b2574242b1effc313f3fd25ef4651c44385beb68ad5ba8f539b11"
},
"downloads": -1,
"filename": "pod5-0.3.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1bec9a0dd62a319315aea38d24c99943",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "~=3.8",
"size": 69421,
"upload_time": "2024-03-25T13:21:53",
"upload_time_iso_8601": "2024-03-25T13:21:53.331261Z",
"url": "https://files.pythonhosted.org/packages/40/7b/5baef5b0627a14d78ec511b9ea5776a0e99ab8c54fb59ff391836de6597e/pod5-0.3.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "69b0b5c4ca9cec24b982e72d5c805a9605e7eab4e39333c8cc77295a5eae412d",
"md5": "4ee69f31a39c962cc4d3c4b112736177",
"sha256": "f2dcb1938fcf51c725393345e480c1d12711089d542a27446fb92fbe2e18ae60"
},
"downloads": -1,
"filename": "pod5-0.3.10.tar.gz",
"has_sig": false,
"md5_digest": "4ee69f31a39c962cc4d3c4b112736177",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "~=3.8",
"size": 65719,
"upload_time": "2024-03-25T13:21:56",
"upload_time_iso_8601": "2024-03-25T13:21:56.096417Z",
"url": "https://files.pythonhosted.org/packages/69/b0/b5c4ca9cec24b982e72d5c805a9605e7eab4e39333c8cc77295a5eae412d/pod5-0.3.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-25 13:21:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nanoporetech",
"github_project": "pod5-file-format",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pod5"
}