pod5

- Name: pod5
- Version: 0.3.10
- Summary: Oxford Nanopore Technologies Pod5 File Format Python API and Tools
- Upload time: 2024-03-25 13:21:56
- Requires Python: ~=3.8
- Keywords: nanopore

# POD5 Python Package

The `pod5` Python package contains the tools and Python API wrapping the compiled bindings
for the POD5 file format from `lib_pod5`.

## Installation

The `pod5` package is available on [PyPI](https://pypi.org/project/pod5/) and is
installed using `pip`:

``` console
  > pip install pod5
```

## Usage

### Reading a POD5 File

To read a `pod5` file, provide the `Reader` class with the input `pod5` file path
and call `Reader.reads()` to iterate over read records in the file. The example below
prints the read_id of every record in the input `pod5` file.

``` python
import pod5 as p5

with p5.Reader("example.pod5") as reader:
    for read_record in reader.reads():
        print(read_record.read_id)
```

To iterate over a selection of read_ids supply `Reader.reads()` with a collection
of read_ids which must be `UUID` compatible:

``` python
from typing import List

import pod5 as p5

# Create a collection of read_id UUIDs
read_ids: List[str] = [
  "00445e58-3c58-4050-bacf-3411bb716cc3",
  "00520473-4d3d-486b-86b5-f031c59f6591",
]

with p5.Reader("example.pod5") as reader:
    for read_record in reader.reads(selection=read_ids):
        assert str(read_record.read_id) in read_ids
```
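
Because the selection must be `UUID` compatible, it can help to validate the strings before passing them to `Reader.reads()`. The sketch below uses only the standard library `uuid` module; the helper name `normalise_read_ids` is hypothetical and is not part of the `pod5` API.

``` python
from typing import Iterable, List
from uuid import UUID


def normalise_read_ids(read_ids: Iterable[str]) -> List[str]:
    """Return canonical lower-case UUID strings.

    uuid.UUID raises ValueError for any string that is not a valid UUID.
    """
    return [str(UUID(read_id)) for read_id in read_ids]


read_ids = normalise_read_ids(
    [
        "00445E58-3C58-4050-BACF-3411BB716CC3",  # upper-case input is accepted
        "00520473-4d3d-486b-86b5-f031c59f6591",
    ]
)
print(read_ids)
```

The returned list could then be supplied as the `selection` argument shown above.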

### Plotting Signal Data Example

Here is an example of how a user may plot a read’s signal data against time.

``` python
import matplotlib.pyplot as plt
import numpy as np

import pod5 as p5

# Using the example pod5 file provided
example_pod5 = "test_data/multi_fast5_zip.pod5"
selected_read_id = '0000173c-bf67-44e7-9a9c-1ad0bc728e74'

with p5.Reader(example_pod5) as reader:

    # Read the selected read from the pod5 file
    # next() is required here as Reader.reads() returns a Generator
    read = next(reader.reads(selection=[selected_read_id]))

    # Get the signal data and sample rate
    sample_rate = read.run_info.sample_rate
    signal = read.signal

    # Compute the time steps over the sampling period
    time = np.arange(len(signal)) / sample_rate

    # Plot using matplotlib
    plt.plot(time, signal)
```

### Writing a POD5 File

The `pod5` package provides the functionality to write POD5 files.

It is strongly recommended that users first look at the available tools when
manipulating existing datasets, as there may already be a tool to meet your needs.
New tools may be added to support our users and if you have a suggestion for a
new tool or feature please submit a request on the
[pod5-file-format GitHub issues page](https://github.com/nanoporetech/pod5-file-format/issues).

Below is an example of how one may add reads to a new POD5 file using the `Writer`
and its `add_read()` method.

```python
from uuid import UUID

import pod5 as p5

# Populate container classes for read metadata
pore = p5.Pore(channel=123, well=3, pore_type="pore_type")
calibration = p5.Calibration(offset=0.1, scale=1.1)
end_reason = p5.EndReason(name=p5.EndReasonEnum.SIGNAL_POSITIVE, forced=False)
run_info = p5.RunInfo(
    acquisition_id=...,
    acquisition_start_time=...,
    adc_max=...,
    # ... remaining RunInfo fields
)
signal = ...  # some signal data as a numpy np.int16 array

read = p5.Read(
    read_id=UUID("0000173c-bf67-44e7-9a9c-1ad0bc728e74"),
    end_reason=end_reason,
    calibration=calibration,
    pore=pore,
    run_info=run_info,
    # ... remaining Read fields
    signal=signal,
)

with p5.Writer("example.pod5") as writer:
    # Write the read object
    writer.add_read(read)
```

## Tools

1. [pod5 view](#pod5-view)
2. [pod5 inspect](#pod5-inspect)
3. [pod5 merge](#pod5-merge)
4. [pod5 filter](#pod5-filter)
5. [pod5 subset](#pod5-subset)
6. [pod5 repack](#pod5-repack)
7. [pod5 recover](#pod5-recover)
8. [pod5 convert fast5](#pod5-convert-fast5)
9. [pod5 convert to_fast5](#pod5-convert-to_fast5)
10. [pod5 update](#pod5-update)

The ``pod5`` package provides the following tools for inspecting and manipulating
POD5 files as well as converting between ``.pod5`` and ``.fast5`` file formats.

To disable the [tqdm](https://github.com/tqdm/tqdm) progress bar set the environment
variable ``POD5_PBAR=0``.

To enable debugging output, which may also write detailed log files, set the environment
variable ``POD5_DEBUG=1``.
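
When the tools are launched from a Python script, these variables can be set on a copy of the environment rather than globally. The following is a minimal sketch using only the standard library; the helper name `pod5_environment` is hypothetical, and only the two values documented above (``POD5_PBAR=0`` and ``POD5_DEBUG=1``) are ever set.

``` python
import os
from typing import Dict, Mapping, Optional


def pod5_environment(
    base: Optional[Mapping[str, str]] = None,
    progress_bar: bool = True,
    debug: bool = False,
) -> Dict[str, str]:
    """Copy an environment and apply the pod5 settings described above."""
    env = dict(os.environ if base is None else base)
    if not progress_bar:
        env["POD5_PBAR"] = "0"  # disable the tqdm progress bar
    if debug:
        env["POD5_DEBUG"] = "1"  # enable debugging output
    return env


env = pod5_environment({"PATH": "/usr/bin"}, progress_bar=False, debug=True)
print(env)
# The result can be passed on, for example:
# subprocess.run(["pod5", "view", "input.pod5"], env=env)
```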

### Pod5 View

The ``pod5 view`` tool is used to produce a table similar to a sequencing summary
from the contents of ``.pod5`` files. The default output is a tab-separated table
written to stdout with all available fields.

This tool is intended to replace ``pod5 inspect reads`` and is over 200x faster.

``` bash
> pod5 view --help

# View the list of fields with a short description in-order (shortcut -L)
> pod5 view --list-fields

# Write the summary to stdout
> pod5 view input.pod5

# Write the summary of multiple pod5s to a file
> pod5 view *.pod5 --output summary.tsv

# Write the summary as a csv
> pod5 view *.pod5 --output summary.csv --separator ','

# Write only the read_ids with no header (shorthand -IH)
> pod5 view input.pod5 --ids --no-header

# Write only the listed fields
# Note: The field order is fixed to the order shown in --list-fields
> pod5 view input.pod5 --include "read_id, channel, num_samples, end_reason"

# Exclude some unwanted fields
> pod5 view input.pod5 --exclude "filename, pore_type"
```

### Pod5 inspect

The ``pod5 inspect`` tool can be used to extract details and summaries of
the contents of ``.pod5`` files. Two programs within ``pod5 inspect`` are described
below: ``reads`` and ``read``.

``` bash
> pod5 inspect --help
> pod5 inspect {reads, read, summary} --help
```

#### Pod5 inspect reads

> :warning: This tool is deprecated and has been replaced by ``pod5 view`` which is significantly faster.

Inspect all reads and print a csv table of the details of all reads in the given ``.pod5`` files.

``` bash
> pod5 inspect reads pod5_file.pod5

  read_id,channel,well,pore_type,read_number,start_sample,end_reason,median_before,calibration_offset,calibration_scale,sample_count,byte_count,signal_compression_ratio
  00445e58-3c58-4050-bacf-3411bb716cc3,908,1,not_set,100776,374223800,signal_positive,205.3,-240.0,0.1,65582,58623,0.447
  00520473-4d3d-486b-86b5-f031c59f6591,220,1,not_set,7936,16135986,signal_positive,192.0,-233.0,0.1,167769,146495,0.437
    ...
```

#### Pod5 inspect read

Inspect the pod5 file, find a specific read and print its details.

``` console
> pod5 inspect read pod5_file.pod5 00445e58-3c58-4050-bacf-3411bb716cc3

  File: out-tmp/output.pod5
  read_id: 0e5d6827-45f6-462c-9f6b-21540eef4426
  read_number:    129227
  start_sample:   367096601
  median_before:  171.889404296875
  channel data:
  channel: 2366
  well: 1
  pore_type: not_set
  end reason:
  name: signal_positive
  forced False
  calibration:
  offset: -243.0
  scale: 0.1462070643901825
  samples:
  sample_count: 81040
  byte_count: 71989
  compression ratio: 0.444
  run info
      acquisition_id: 2ca00715f2e6d8455e5174cd20daa4c38f95fae2
      acquisition_start_time: 2021-07-23 13:48:59.780000
      adc_max: 0
      adc_min: 0
      context_tags
      barcoding_enabled: 0
      basecall_config_filename: dna_r10.3_450bps_hac_prom.cfg
      experiment_duration_set: 2880
      ...
```

### Pod5 merge

``pod5 merge`` is a tool for merging multiple ``.pod5`` files into one monolithic pod5 file.

The contents of the input files are checked for duplicate read_ids to avoid
accidentally merging identical reads. To override this check set the argument
``-D / --duplicate-ok``.

``` bash
# View help
> pod5 merge --help

# Merge a pair of pod5 files
> pod5 merge example_1.pod5 example_2.pod5 --output merged.pod5

# Merge a glob of pod5 files
> pod5 merge *.pod5 -o merged.pod5

# Merge a glob of pod5 files ignoring duplicate read ids
> pod5 merge *.pod5 -o merged.pod5 --duplicate-ok
```
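
To see which read_ids would trigger the duplicate check before merging, the ids from each input can be compared directly. This is only an illustrative sketch, not the check that ``pod5 merge`` itself performs; the function name `find_duplicate_read_ids` is hypothetical, and the per-file id lists are assumed to have been collected beforehand (for example with ``pod5 view --ids --no-header``).

``` python
from collections import Counter
from typing import Iterable, List


def find_duplicate_read_ids(read_ids_per_file: Iterable[Iterable[str]]) -> List[str]:
    """Return the sorted read_ids that occur more than once across all inputs."""
    counts = Counter(
        read_id for read_ids in read_ids_per_file for read_id in read_ids
    )
    return sorted(read_id for read_id, count in counts.items() if count > 1)


duplicates = find_duplicate_read_ids(
    [
        ["00445e58-3c58-4050-bacf-3411bb716cc3", "00520473-4d3d-486b-86b5-f031c59f6591"],
        ["00445e58-3c58-4050-bacf-3411bb716cc3"],
    ]
)
print(duplicates)
```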

### Pod5 filter

``pod5 filter`` is a simpler alternative to ``pod5 subset``. Reads are selected from
one or more input ``.pod5`` files using a list of read ids provided with the ``--ids`` argument
and are written to a *single* ``--output`` file.

See ``pod5 subset`` for more advanced subsetting.

``` bash
> pod5 filter example.pod5 --output filtered.pod5 --ids read_ids.txt
```

The ``--ids`` selection text file must be a simple list of valid UUID read_ids with
one read_id per line. Only records which match the UUID regex (lower-case) are used.
Lines beginning with a ``#`` (hash / pound symbol) are interpreted as comments.
Empty lines are not valid and may cause errors during parsing.
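
The rules above can be mirrored in a few lines of Python when preparing or checking an ``--ids`` file. This is a sketch under stated assumptions: the regular expression is a guess at a "lower-case UUID regex" and may not be the exact pattern used by ``pod5``, and the helper name `parse_read_ids` is hypothetical. Unlike the tool, it silently skips anything that does not match, including empty lines.

``` python
import re
from typing import Iterable, List

# Assumed pattern: canonical 8-4-4-4-12 hexadecimal form, lower-case only
UUID_REGEX = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def parse_read_ids(lines: Iterable[str]) -> List[str]:
    """Keep lines that look like lower-case UUIDs, skipping comments."""
    read_ids = []
    for line in lines:
        candidate = line.strip()
        if candidate.startswith("#"):
            continue  # comment line
        if UUID_REGEX.match(candidate):
            read_ids.append(candidate)
    return read_ids


example_lines = [
    "# my selection",
    "00445e58-3c58-4050-bacf-3411bb716cc3",
    "00520473-4D3D-486B-86B5-F031C59F6591",  # upper-case, so not matched
    "not-a-read-id",
]
print(parse_read_ids(example_lines))
```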

> The ``filter`` and ``subset`` tools will assert that any requested read_ids are
> present in the inputs. If a requested read_id is missing from the inputs
> then the tool will issue the following error:
>
> ``` bash
> POD5 has encountered an error: 'Missing read_ids from inputs but --missing-ok not set'
> ```
>
> To disable this check, set the ``-M / --missing-ok`` argument.

When supplying multiple input files to ``filter`` or ``subset``, the tool is
effectively performing a ``merge`` operation. The ``merge`` tool is better suited
for handling very large numbers of input files.

#### Example filtering pipeline

This is a trivial example of how to select a random sample of 1000 read_ids from a
pod5 file using ``pod5 view`` and ``pod5 filter``.

``` bash
# Get a random selection of read_ids
> pod5 view all.pod5 --ids --no-header --output all_ids.txt
> sort --random-sort all_ids.txt | head --lines 1000 > 1k_ids.txt

# Filter to that selection
> pod5 filter all.pod5 --ids 1k_ids.txt --output 1k.pod5

# Check the output
> pod5 view 1k.pod5 -IH | wc -l
1000
```
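
The random selection step can also be done in Python, which is convenient where GNU ``sort --random-sort`` is not available. A minimal sketch using the standard library follows; the function name `sample_read_ids` and the generated stand-in ids are assumptions for illustration. In practice the ids would be read from ``all_ids.txt`` and the selection written to ``1k_ids.txt``.

``` python
import random
import uuid
from typing import List, Sequence


def sample_read_ids(all_ids: Sequence[str], count: int, seed: int = 42) -> List[str]:
    """Return up to `count` distinct read_ids chosen at random."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(list(all_ids), min(count, len(all_ids)))


# Stand-in read_ids used only to make the example self-contained
all_ids = [str(uuid.UUID(int=index)) for index in range(5000)]
selected = sample_read_ids(all_ids, 1000)
print(len(selected))
```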

### Pod5 subset

``pod5 subset`` is a tool for subsetting reads in ``.pod5`` files into one or more
output ``.pod5`` files. See also ``pod5 filter``.

The ``pod5 subset`` tool requires a *mapping* which defines which read_ids should be
written to which output. This mapping can be supplied either as a ``.csv`` file
(``--csv``) or as a ``--table`` (csv or tsv) together with instructions on how to
interpret it (``--columns``).

``pod5 subset`` aims to be a generic tool to subset from multiple inputs to multiple outputs.
If your use-case is to ``filter`` read_ids from one or more inputs into a single output
then ``pod5 filter`` might be a more appropriate tool as the only input is a list of read_ids.

``` bash
# View help
> pod5 subset --help

# Subset input(s) using a pre-defined mapping
> pod5 subset example_1.pod5 --csv mapping.csv

# Subset input(s) using a dynamic mapping created at runtime
> pod5 subset example_1.pod5 --table table.txt --columns barcode
```

> Care should be taken to ensure that when providing multiple input ``.pod5`` files to ``pod5 subset``
> that there are no read_id UUID clashes. If a duplicate read_id is detected an exception
> will be raised unless the ``--duplicate-ok`` argument is set. If ``--duplicate-ok`` is
> set then both reads will be written to the output, although this is not recommended.

#### Note on positional arguments

> The ``--columns`` argument will greedily consume values and as such, care should be taken
> with the placement of any positional arguments. The following line will result in an error
> as the input pod5 file is consumed by ``--columns`` resulting in no input file being set.

```bash
# Invalid placement of positional argument example.pod5
> pod5 subset --table table.txt --columns barcode example.pod5
```

#### Creating a Subset Mapping

##### Target Mapping (.csv)

The example below shows a ``.csv`` subset target mapping. Any line (e.g. a header line)
which does not have a read_id matching the UUID regex (lower-case) in the second
column is ignored.

``` text
target, read_id
output_1.pod5,132b582c-56e8-4d46-9e3d-48a275646d3a
output_1.pod5,12a4d6b1-da6e-4136-8bb3-1470ef27e311
output_2.pod5,0ff4dc01-5fa4-4260-b54e-1d8716c7f225
output_2.pod5,0e359c40-296d-4edc-8f4a-cca135310ab2
output_2.pod5,0e9aa0f8-99ad-40b3-828a-45adbb4fd30c
```
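
A mapping file in this layout can be generated with the standard library `csv` module. The sketch below is illustrative only; the function name `write_target_mapping` is hypothetical, and the dictionary of targets is assumed to have been built elsewhere.

``` python
import csv
import io
from typing import Dict, List, TextIO


def write_target_mapping(targets: Dict[str, List[str]], handle: TextIO) -> None:
    """Write one `target,read_id` row per read_id, preceded by a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["target", "read_id"])
    for target, read_ids in targets.items():
        for read_id in read_ids:
            writer.writerow([target, read_id])


buffer = io.StringIO()  # use open("mapping.csv", "w", newline="") to write a file
write_target_mapping(
    {
        "output_1.pod5": ["132b582c-56e8-4d46-9e3d-48a275646d3a"],
        "output_2.pod5": [
            "0ff4dc01-5fa4-4260-b54e-1d8716c7f225",
            "0e359c40-296d-4edc-8f4a-cca135310ab2",
        ],
    },
    buffer,
)
mapping_csv = buffer.getvalue()
print(mapping_csv)
```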

##### Target Mapping from Table

``pod5 subset`` can dynamically generate output targets and collect associated reads
based on a text file containing a table (csv or tsv) parsable by ``polars``.
This table file could be the output from ``pod5 view`` or from a sequencing summary.
The table must contain a header row and a series of columns on which to group unique
collections of values. Internally this process uses the
[polars.DataFrame.group_by](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html)
function where the ``by`` parameter is the sequence of column names specified with
the ``--columns`` argument.

Given the following example ``--table`` file, observe the resultant outputs given various
arguments:

``` text
read_id    mux    barcode      length
read_a     1      barcode_a    4321
read_b     1      barcode_b    1000
read_c     2      barcode_b    1200
read_d     2      barcode_c    1234
```

``` bash
> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode
> ls barcode_subset
barcode-barcode_a.pod5     # Contains: read_a
barcode-barcode_b.pod5     # Contains: read_b, read_c
barcode-barcode_c.pod5     # Contains: read_d

> pod5 subset example_1.pod5 --output mux_subset --table table.txt --columns mux
> ls mux_subset
mux-1.pod5     # Contains: read_a, read_b
mux-2.pod5     # Contains: read_c, read_d

> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux
> ls barcode_mux_subset
barcode-barcode_a_mux-1.pod5    # Contains: read_a
barcode-barcode_b_mux-1.pod5    # Contains: read_b
barcode-barcode_b_mux-2.pod5    # Contains: read_c
barcode-barcode_c_mux-2.pod5    # Contains: read_d
```
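
The grouping behaviour shown above can be reproduced in plain Python to preview which reads end up together. This sketch deliberately uses the standard library instead of ``polars``, so it is not the implementation used by ``pod5 subset``; the function name `group_reads` is hypothetical and the table is assumed to be tab-separated.

``` python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# The example table from above, written as tab-separated text
TABLE = (
    "read_id\tmux\tbarcode\tlength\n"
    "read_a\t1\tbarcode_a\t4321\n"
    "read_b\t1\tbarcode_b\t1000\n"
    "read_c\t2\tbarcode_b\t1200\n"
    "read_d\t2\tbarcode_c\t1234\n"
)


def group_reads(
    table_text: str, columns: Sequence[str]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group read_ids by the unique combinations of values in `columns`."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        key = tuple(row[column] for column in columns)
        groups[key].append(row["read_id"])
    return dict(groups)


print(group_reads(TABLE, ["barcode"]))
print(group_reads(TABLE, ["barcode", "mux"]))
```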

##### Output Filename Templating

When subsetting using a table the output filename is generated from a template
string. The automatically generated template is the sequential concatenation of
``column_name-column_value`` followed by the ``.pod5`` file extension.

The user can set their own filename template using the ``--template`` argument.
This argument accepts a string in the [Python f-string style](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals)
where the subsetting variables are used for keyword placeholder substitution.
Keywords should be placed within curly-braces. For example:

``` bash
# default template used = "barcode-{barcode}.pod5"
> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode

# default template used = "barcode-{barcode}_mux-{mux}.pod5"
> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux

> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode --template "{barcode}.subset.pod5"
> ls barcode_subset
barcode_a.subset.pod5    # Contains: read_a
barcode_b.subset.pod5    # Contains: read_b, read_c
barcode_c.subset.pod5    # Contains: read_d
```
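
The substitution can be tried out in Python before running the tool. The sketch below uses `str.format` for the keyword placeholders, which is an assumption about how the template is evaluated; the helper names `default_template` and `render_filename` are hypothetical.

``` python
from typing import Mapping, Sequence


def default_template(columns: Sequence[str]) -> str:
    """Build the `column_name-{column_name}` template described above."""
    return "_".join(f"{column}-{{{column}}}" for column in columns) + ".pod5"


def render_filename(template: str, values: Mapping[str, str]) -> str:
    """Substitute the subsetting values into the keyword placeholders."""
    return template.format(**values)


template = default_template(["barcode", "mux"])
print(template)
print(render_filename(template, {"barcode": "barcode_b", "mux": "2"}))
print(render_filename("{barcode}.subset.pod5", {"barcode": "barcode_a"}))
```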

##### Example subsetting from ``pod5 inspect reads``

The ``pod5 inspect reads`` tool will output a csv table summarising the content of the
specified ``.pod5`` file which can be used for subsetting. The example below shows
how to split a ``.pod5`` file by the well field.

``` bash
# Create the csv table from inspect reads
> pod5 inspect reads example.pod5 > table.csv
> pod5 subset example.pod5 --table table.csv --columns well
```

### Pod5 repack

``pod5 repack`` will simply repack ``.pod5`` files into one-for-one output files of the same name.

``` bash
> pod5 repack pod5s/*.pod5 repacked_pods/
```

### Pod5 Recover

``pod5 recover`` will attempt to recover data from corrupted or truncated ``.pod5`` files
by copying all valid table batches and cleanly closing the new files. New files are written
as siblings to the inputs with the `_recovered.pod5` suffix.

``` bash
> pod5 recover --help
> pod5 recover broken.pod5
> ls
broken.pod5 broken_recovered.pod5
```

### pod5 convert fast5

The ``pod5 convert fast5`` tool takes one or more ``.fast5`` files and converts them
to one or more ``.pod5`` files.

If the tool detects single-read fast5 files, please convert them into multi-read
fast5 files using the tools available in the ``ont_fast5_api`` project.

The progress bar shown during conversion assumes the number of reads in an input
``.fast5`` is 4000. The progress bar will update the total value during runtime if
required.

> Some content previously stored in ``.fast5`` files is **not** compatible with the POD5
> format and will not be converted. This includes all analyses stored in the
> ``.fast5`` file.
>
> Please ensure that any other data is recovered from ``.fast5`` before deletion.

By default ``pod5 convert fast5`` will show exceptions raised during conversion as *warnings*
to the user. This is to gracefully handle potentially corrupt input files or other
runtime errors in long-running conversion tasks. The ``--strict`` argument allows
users to opt-in to strict runtime assertions where any exception raised will promptly
stop the conversion process with an error.

``` bash
# View help
> pod5 convert fast5 --help

# Convert fast5 files into a monolithic output file
> pod5 convert fast5 ./input/*.fast5 --output converted.pod5

# Convert fast5 files into a monolithic output in an existing directory
> pod5 convert fast5 ./input/*.fast5 --output outputs/
> ls outputs/
output.pod5 # default name

# Convert each fast5 to its relative converted output. The output files are written
# into the output directory at paths relative to the path given to the
# --one-to-one argument. Note: This path must be a relative parent to all
# input paths.
> ls input/*.fast5
file_1.fast5 file_2.fast5 ... file_N.fast5
> pod5 convert fast5 ./input/*.fast5 --output output_pod5s/ --one-to-one ./input/
> ls output_pod5s/
file_1.pod5 file_2.pod5 ... file_N.pod5

# Note the different --one-to-one path which is now the current working directory.
# The new sub-directory output_pod5s/input is created.
> pod5 convert fast5 ./input/*.fast5 --output output_pod5s --one-to-one ./
> ls output_pod5s/
input/file_1.pod5 input/file_2.pod5 ... input/file_N.pod5

# Convert all inputs so that they have neighbouring pod5 files in the current directory
> pod5 convert fast5 *.fast5 --output . --one-to-one .
> ls
file_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5  ... file_N.fast5 file_N.pod5

# Convert all inputs so that they have neighbouring pod5 files from a parent directory
> pod5 convert fast5 ./input/*.fast5 --output ./input/ --one-to-one ./input/
> ls input/*
file_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5  ... file_N.fast5 file_N.pod5
```
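
The ``--one-to-one`` path arithmetic in the examples above can be previewed with `pathlib`. This is a sketch of the documented behaviour rather than the tool's own code; the function name `one_to_one_output` is hypothetical, and `PurePosixPath` is used so that no files need to exist.

``` python
from pathlib import PurePosixPath


def one_to_one_output(fast5: str, output_dir: str, one_to_one: str) -> PurePosixPath:
    """Map an input fast5 path to its output pod5 path.

    Raises ValueError if `one_to_one` is not a parent of `fast5`.
    """
    relative = PurePosixPath(fast5).relative_to(PurePosixPath(one_to_one))
    return PurePosixPath(output_dir) / relative.with_suffix(".pod5")


print(one_to_one_output("./input/file_1.fast5", "output_pod5s/", "./input/"))
print(one_to_one_output("./input/file_1.fast5", "output_pod5s/", "./"))
```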

### Pod5 convert to_fast5

The ``pod5 convert to_fast5`` tool takes one or more ``.pod5`` files and converts them
to multiple ``.fast5`` files. The default behaviour is to write 4000 reads per output file
but this can be controlled with the ``--file-read-count`` argument.

``` bash
# View help
> pod5 convert to_fast5 --help

# Convert pod5 files to fast5 files with default 4000 reads per file
> pod5 convert to_fast5 example.pod5 --output pod5_to_fast5/
> ls pod5_to_fast5/
output_1.fast5 output_2.fast5 ... output_N.fast5
```

### Pod5 Update

The ``pod5 update`` tool is used to update old pod5 files to use the latest schema.
Currently the latest schema version is version 3.

Files are written into the ``--output`` directory with the same name.

``` bash
> pod5 update --help

# Update a named file
> pod5 update my.pod5 --output updated/
> ls updated
updated/my.pod5

# Update an entire directory
> pod5 update old/ -o updated/
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pod5",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "~=3.8",
    "maintainer_email": null,
    "keywords": "nanopore",
    "author": null,
    "author_email": "Oxford Nanopore Technologies plc <support@nanoporetech.com>",
    "download_url": "https://files.pythonhosted.org/packages/69/b0/b5c4ca9cec24b982e72d5c805a9605e7eab4e39333c8cc77295a5eae412d/pod5-0.3.10.tar.gz",
    "platform": null,
    "description": "# POD5 Python Package\n\nThe `pod5` Python package contains the tools and python API wrapping the compiled bindings\nfor the POD5 file format from `lib_pod5`.\n\n## Installation\n\nThe `pod5` package is available on [pypi](https://pypi.org/project/pod5/) and is\ninstalled using `pip`:\n\n``` console\n  > pip install pod5\n```\n\n## Usage\n\n### Reading a POD5 File\n\nTo read a `pod5` file provide the the `Reader` class with the input `pod5` file path\nand call `Reader.reads()` to iterate over read records in the file. The example below\nprints the read_id of every record in the input `pod5` file.\n\n``` python\nimport pod5 as p5\n\nwith p5.Reader(\"example.pod5\") as reader:\n    for read_record in reader.reads():\n        print(read_record.read_id)\n```\n\nTo iterate over a selection of read_ids supply `Reader.reads()` with a collection\nof read_ids which must be `UUID` compatible:\n\n``` python\nimport pod5 as p5\n\n# Create a collection of read_id UUIDs\nread_ids: List[str] = [\n  \"00445e58-3c58-4050-bacf-3411bb716cc3\",\n  \"00520473-4d3d-486b-86b5-f031c59f6591\",\n]\n\nwith p5.Reader(\"example.pod5\") as reader:\n    for read_record in reader.reads(selection=read_ids):\n        assert str(read_record.read_id) in read_ids\n```\n\n### Plotting Signal Data Example\n\nHere is an example of how a user may plot a read\u2019s signal data against time.\n\n``` python\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nimport pod5 as p5\n\n# Using the example pod5 file provided\nexample_pod5 = \"test_data/multi_fast5_zip.pod5\"\nselected_read_id = '0000173c-bf67-44e7-9a9c-1ad0bc728e74'\n\nwith p5.Reader(example_pod5) as reader:\n\n    # Read the selected read from the pod5 file\n    # next() is required here as Reader.reads() returns a Generator\n    read = next(reader.reads(selection=[selected_read_id]))\n\n    # Get the signal data and sample rate\n    sample_rate = read.run_info.sample_rate\n    signal = read.signal\n\n    # Compute the time 
steps over the sampling period\n    time = np.arange(len(signal)) / sample_rate\n\n    # Plot using matplotlib\n    plt.plot(time, signal)\n```\n\n### Writing a POD5 File\n\nThe `pod5` package provides the functionality to write POD5 files.\n\nIt is strongly recommended that users first look at the available tools when\nmanipulating existing datasets, as there may already be a tool to meet your needs.\nNew tools may be added to support our users and if you have a suggestion for a\nnew tool or feature please submit a request on the\n[pod5-file-format GitHub issues page](https://github.com/nanoporetech/pod5-file-format/issues).\n\nBelow is an example of how one may add reads to a new POD5 file using the `Writer`\nand its `add_read()` method.\n\n```python\nimport pod5 as p5\n\n# Populate container classes for read metadata\npore = p5.Pore(channel=123, well=3, pore_type=\"pore_type\")\ncalibration = p5.Calibration(offset=0.1, scale=1.1)\nend_reason = p5.EndReason(name=p5.EndReasonEnum.SIGNAL_POSITIVE, forced=False)\nrun_info = p5.RunInfo(\n    acquisition_id = ...\n    acquisition_start_time = ...\n    adc_max = ...\n    ...\n)\nsignal = ... # some signal data as numpy np.int16 array\n\nread = p5.Read(\n    read_id=UUID(\"0000173c-bf67-44e7-9a9c-1ad0bc728e74\"),\n    end_reason=end_reason,\n    calibration=calibration,\n    pore=pore,\n    run_info=run_info,\n    ...\n    signal=signal,\n)\n\nwith p5.Writer(\"example.pod5\") as writer:\n    # Write the read object\n    writer.add_read(read)\n```\n\n## Tools\n\n1. [pod5 view](#pod5-view)\n2. [pod5 inspect](#pod5-inspect)\n3. [pod5 merge](#pod5-merge)\n4. [pod5 filter](#pod5-filter)\n5. [pod5 subset](#pod5-subset)\n6. [pod5 repack](#pod5-repack)\n7. [pod5 recover](#pod5-recover)\n8. [pod5 convert fast5](#pod5-convert-fast5)\n9. [pod5 convert to_fast5](#pod5-convert-to_fast5)\n10. 
[pod5 update](#pod5-update)\n\nThe ``pod5`` package provides the following tools for inspecting and manipulating\nPOD5 files as well as converting between ``.pod5`` and ``.fast5`` file formats.\n\nTo disable the `tqdm <https://github.com/tqdm/tqdm>`_  progress bar set the environment\nvariable ``POD5_PBAR=0``.\n\nTo enable debugging output which may also output detailed log files, set the environment\nvariable ``POD5_DEBUG=1``\n\n### Pod5 View\n\nThe ``pod5 view`` tool is used to produce a table similarr to a sequencing summary\nfrom the contents of ``.pod5`` files. The default output is a tab-separated table\nwritten to stdout with all available fields.\n\nThis tools is indented to replace ``pod5 inspect reads`` and is over 200x faster.\n\n``` bash\n> pod5 view --help\n\n# View the list of fields with a short description in-order (shortcut -L)\n> pod5 view --list-fields\n\n# Write the summary to stdout\n> pod5 view input.pod5\n\n# Write the summary of multiple pod5s to a file\n> pod5 view *.pod5 --output summary.tsv\n\n# Write the summary as a csv\n> pod5 view *.pod5 --output summary.csv --separator ','\n\n# Write only the read_ids with no header (shorthand -IH)\n> pod5 view input.pod5 --ids --no-header\n\n# Write only the listed fields\n# Note: The field order is fixed the order shown in --list-fields\n> pod5 view input.pod5 --include \"read_id, channel, num_samples, end_reason\"\n\n# Exclude some unwanted fields\n> pod5 view input.pod5 --exclude \"filename, pore_type\"\n```\n\n### Pod5 inspect\n\nThe ``pod5 inspect`` tool can be used to extract details and summaries of\nthe contents of ``.pod5`` files. 
There are two programs for users within ``pod5 inspect``\nand these are read and reads\n\n``` bash\n> pod5 inspect --help\n> pod5 inspect {reads, read, summary} --help\n```\n\n#### Pod5 inspect reads\n\n> :warning: This tool is deprecated and has been replaced by ``pod5 view`` which is significantly faster.\n\nInspect all reads and print a csv table of the details of all reads in the given ``.pod5`` files.\n\n``` bash\n> pod5 inspect reads pod5_file.pod5\n\n  read_id,channel,well,pore_type,read_number,start_sample,end_reason,median_before,calibration_offset,calibration_scale,sample_count,byte_count,signal_compression_ratio\n  00445e58-3c58-4050-bacf-3411bb716cc3,908,1,not_set,100776,374223800,signal_positive,205.3,-240.0,0.1,65582,58623,0.447\n  00520473-4d3d-486b-86b5-f031c59f6591,220,1,not_set,7936,16135986,signal_positive,192.0,-233.0,0.1,167769,146495,0.437\n    ...\n```\n\n#### Pod5 inspect read\n\nInspect the pod5 file, find a specific read and print its details.\n\n``` console\n> pod5 inspect read pod5_file.pod5 00445e58-3c58-4050-bacf-3411bb716cc3\n\n  File: out-tmp/output.pod5\n  read_id: 0e5d6827-45f6-462c-9f6b-21540eef4426\n  read_number:    129227\n  start_sample:   367096601\n  median_before:  171.889404296875\n  channel data:\n  channel: 2366\n  well: 1\n  pore_type: not_set\n  end reason:\n  name: signal_positive\n  forced False\n  calibration:\n  offset: -243.0\n  scale: 0.1462070643901825\n  samples:\n  sample_count: 81040\n  byte_count: 71989\n  compression ratio: 0.444\n  run info\n      acquisition_id: 2ca00715f2e6d8455e5174cd20daa4c38f95fae2\n      acquisition_start_time: 2021-07-23 13:48:59.780000\n      adc_max: 0\n      adc_min: 0\n      context_tags\n      barcoding_enabled: 0\n      basecall_config_filename: dna_r10.3_450bps_hac_prom.cfg\n      experiment_duration_set: 2880\n      ...\n```\n\n### Pod5 merge\n\n``pod5 merge`` is a tool for merging multiple  ``.pod5`` files into one monolithic pod5 file.\n\nThe contents of the input files 
are checked for duplicate read_ids to avoid\naccidentally merging identical reads. To override this check set the argument\n``-D / --duplicate-ok``\n\n``` bash\n# View help\n> pod5 merge --help\n\n# Merge a pair of pod5 files\n> pod5 merge example_1.pod5 example_2.pod5 --output merged.pod5\n\n# Merge a glob of pod5 files\n> pod5 merge *.pod5 -o merged.pod5\n\n# Merge a glob of pod5 files ignoring duplicate read ids\n> pod5 merge *.pod5 -o merged.pod5 --duplicate-ok\n```\n\n### Pod5 filter\n\n``pod5 filter`` is a simpler alternative to ``pod5 subset`` where reads are subset from\none or more input ``.pod5`` files using a list of read ids provided using the ``--ids`` argument\nand writing those reads to a *single* ``--output`` file.\n\nSee ``pod5 subset`` for more advanced subsetting.\n\n``` bash\n> pod5 filter example.pod5 --output filtered.pod5 --ids read_ids.txt\n```\n\nThe ``--ids`` selection text file must be a simple list of valid UUID read_ids with\none read_id per line. Only records which match the UUID regex (lower-case) are used.\nLines beginning with a ``#`` (hash / pound symbol) are interpreted as comments.\nEmpty lines are not valid and may cause errors during parsing.\n\n> The ``filter`` and ``subset`` tools will assert that any requested read_ids are\n> present in the inputs. If a requested read_id is missing from the inputs\n> then the tool will issue the following error:\n>\n> ``` bash\n> POD5 has encountered an error: 'Missing read_ids from inputs but --missing-ok not set'\n> ```\n>\n> To disable this warning then the '-M / --missing-ok' argument.\n\nWhen supplying multiple input files to 'filter' or 'subset', the tools is\neffectively performing a ``merge`` operation. 
The 'merge' tool is better suited\nfor handling very large numbers of input files.\n\n#### Example filtering pipeline\n\nThis is a trivial example of how to select a random sample of 1000 read_ids from a\npod5 file using ``pod5 view`` and ``pod5 filter``.\n\n``` bash\n# Get a random selection of read_ids\n> pod5 view all.pod5 --ids --no-header --output all_ids.txt\n> all_ids.txt sort --random-sort | head --lines 1000 > 1k_ids.txt\n\n# Filter to that selection\n> pod5 filter all.pod5 --ids 1k_ids.txt --output 1k.pod5\n\n# Check the output\n> pod5 view 1k.pod5 -IH | wc -l\n1000\n```\n\n### Pod5 subset\n\n``pod5 subset`` is a tool for subsetting reads in ``.pod5`` files into one or more\noutput ``.pod5`` files. See also ``pod5 filter``\n\nThe ``pod5 subset`` tool requires a *mapping* which defines which read_ids should be\nwritten to which output. There are multiple ways of specifying this mapping which are\ndefined in either a ``.csv`` file or by using a ``--table`` (csv or tsv)\nand instructions on how to interpret it.\n\n``pod5 subset`` aims to be a generic tool to subset from multiple inputs to multiple outputs.\nIf your use-case is to ``filter`` read_ids from one or more inputs into a single output\nthen ``pod5 filter`` might be a more appropriate tool as the only input is a list of read_ids.\n\n``` bash\n# View help\n> pod5 subset --help\n\n# Subset input(s) using a pre-defined mapping\n> pod5 subset example_1.pod5 --csv mapping.csv\n\n# Subset input(s) using a dynamic mapping created at runtime\n> pod5 subset example_1.pod5 --table table.txt --columns barcode\n```\n\n> Care should be taken to ensure that when providing multiple input ``.pod5`` files to ``pod5 subset``\n> that there are no read_id UUID clashes. If a duplicate read_id is detected an exception\n> will be raised unless the ``--duplicate-ok`` argument is set. 
If ``--duplicate-ok`` is\n> set then both reads will be written to the output, although this is not recommended.\n\n#### Note on positional arguments\n\n> The ``--columns`` argument will greedily consume values and as such, care should be taken\n> with the placement of any positional arguments. The following line will result in an error\n> as the input pod5 file is consumed by ``--columns`` resulting in no input file being set.\n\n```bash\n# Invalid placement of positional argument example.pod5\n$ pod5 subset --table table.txt --columns barcode example.pod5\n```\n\n#### Creating a Subset Mapping\n\n##### Target Mapping (.csv)\n\nThe example below shows a ``.csv`` subset target mapping. Any lines (e.g. header line)\nwhich do not have a read_id which matches the UUID regex (lower-case) in the second\ncolumn is ignored.\n\n``` text\ntarget, read_id\noutput_1.pod5,132b582c-56e8-4d46-9e3d-48a275646d3a\noutput_1.pod5,12a4d6b1-da6e-4136-8bb3-1470ef27e311\noutput_2.pod5,0ff4dc01-5fa4-4260-b54e-1d8716c7f225\noutput_2.pod5,0e359c40-296d-4edc-8f4a-cca135310ab2\noutput_2.pod5,0e9aa0f8-99ad-40b3-828a-45adbb4fd30c\n```\n\n##### Target Mapping from Table\n\n``pod5 subset`` can dynamically generate output targets and collect associated reads\nbased on a text file containing a table (csv or tsv) parsible by ``polars``.\nThis table file could be the output from ``pod5 view`` or from a sequencing summary.\nThe table must contain a header row and a series of columns on which to group unique\ncollections of values. 
Internally this process uses the\n`polars.DataFrame.group_by <https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html>`_\nfunction where the ``by`` parameter is the sequence of column names specified with\nthe ``--columns`` argument.\n\nGiven the following example ``--table`` file, observe the resultant outputs given various\narguments:\n\n``` text\nread_id    mux    barcode      length\nread_a     1      barcode_a    4321\nread_b     1      barcode_b    1000\nread_c     2      barcode_b    1200\nread_d     2      barcode_c    1234\n```\n\n``` bash\n> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode\n> ls barcode_subset\nbarcode-barcode_a.pod5     # Contains: read_a\nbarcode-barcode_b.pod5     # Contains: read_b, read_c\nbarcode-barcode_c.pod5     # Contains: read_d\n\n> pod5 subset example_1.pod5 --output mux_subset --table table.txt --columns mux\n> ls mux_subset\nmux-1.pod5     # Contains: read_a, read_b\nmux-2.pod5     # Contains: read_c, read_d\n\n> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux\n> ls barcode_mux_subset\nbarcode-barcode_a_mux-1.pod5    # Contains: read_a\nbarcode-barcode_b_mux-1.pod5    # Contains: read_b\nbarcode-barcode_b_mux-2.pod5    # Contains: read_c\nbarcode-barcode_c_mux-2.pod5    # Contains: read_d\n```\n\n##### Output Filename Templating\n\nWhen subsetting using a table the output filename is generated from a template\nstring. 
The automatically generated template is the sequential concatenation of\n``column_name-column_value`` followed by the ``.pod5`` file extension.\n\nThe user can set their own filename template using the ``--template`` argument.\nThis argument accepts a string in the `Python f-string style <https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals>`_\nwhere the subsetting variables are used for keyword placeholder substitution.\nKeywords should be placed within curly-braces. For example:\n\n``` bash\n# default template used = \"barcode-{barcode}.pod5\"\n> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode\n\n# default template used = \"barcode-{barcode}_mux-{mux}.pod5\"\n> pod5 subset example_1.pod5 --output barcode_mux_subset --table table.txt --columns barcode mux\n\n> pod5 subset example_1.pod5 --output barcode_subset --table table.txt --columns barcode --template \"{barcode}.subset.pod5\"\n> ls barcode_subset\nbarcode_a.subset.pod5    # Contains: read_a\nbarcode_b.subset.pod5    # Contains: read_b, read_c\nbarcode_c.subset.pod5    # Contains: read_d\n```\n\n##### Example subsetting from ``pod5 inspect reads``\n\nThe ``pod5 inspect reads`` tool will output a csv table summarising the content of the\nspecified ``.pod5`` file which can be used for subsetting. The example below shows\nhow to split a ``.pod5`` file by the well field.\n\n``` bash\n# Create the csv table from inspect reads\n> pod5 inspect reads example.pod5 > table.csv\n> pod5 subset example.pod5 --table table.csv --columns well\n```\n\n### Pod5 repack\n\n``pod5 repack`` will simply repack ``.pod5`` files into one-for-one output files of the same name.\n\n``` bash\n> pod5 repack pod5s/*.pod5 repacked_pods/\n```\n\n### Pod5 Recover\n\n``pod5 recover`` will attempt to recover data from corrupted or truncated ``.pod5`` files\nby copying all valid table batches and cleanly closing the new files. 
New files are written\nas siblings to the inputs with the `_recovered.pod5` suffix.\n\n``` bash\n> pod5 recover --help\n> pod5 recover broken.pod5\n> ls\nbroken.pod5 broken_recovered.pod5\n```\n\n### Pod5 convert fast5\n\nThe ``pod5 convert fast5`` tool takes one or more ``.fast5`` files and converts them\nto one or more ``.pod5`` files.\n\nIf the tool detects single-read fast5 files, please convert them into multi-read\nfast5 files using the tools available in the ``ont_fast5_api`` project.\n\nThe progress bar shown during conversion assumes the number of reads in an input\n``.fast5`` is 4000. The progress bar will update the total value during runtime if\nrequired.\n\n> Some content previously stored in ``.fast5`` files is **not** compatible with the POD5\n> format and will not be converted. This includes all analyses stored in the\n> ``.fast5`` file.\n>\n> Please ensure that any other data is recovered from ``.fast5`` before deletion.\n\nBy default ``pod5 convert fast5`` will show exceptions raised during conversion as *warnings*\nto the user. This is to gracefully handle potentially corrupt input files or other\nruntime errors in long-running conversion tasks. The ``--strict`` argument allows\nusers to opt in to strict runtime assertions where any exception raised will promptly\nstop the conversion process with an error.\n\n``` bash\n# View help\n> pod5 convert fast5 --help\n\n# Convert fast5 files into a monolithic output file\n> pod5 convert fast5 ./input/*.fast5 --output converted.pod5\n\n# Convert fast5 files into a monolithic output in an existing directory\n> pod5 convert fast5 ./input/*.fast5 --output outputs/\n> ls outputs/\noutput.pod5 # default name\n\n# Convert each fast5 to its relative converted output. The output files are written\n# into the output directory at paths relative to the path given to the\n# --one-to-one argument. Note: This path must be a relative parent to all\n# input paths.\n> ls input/*.fast5\nfile_1.fast5 file_2.fast5 ... 
file_N.fast5\n> pod5 convert fast5 ./input/*.fast5 --output output_pod5s/ --one-to-one ./input/\n> ls output_pod5s/\nfile_1.pod5 file_2.pod5 ... file_N.pod5\n\n# Note the different --one-to-one path which is now the current working directory.\n# The new sub-directory output_pod5s/input is created.\n> pod5 convert fast5 ./input/*.fast5 --output output_pod5s --one-to-one ./\n> ls output_pod5s/\ninput/file_1.pod5 input/file_2.pod5 ... input/file_N.pod5\n\n# Convert all inputs so that they have neighbouring pod5 files in the current directory\n> pod5 convert fast5 *.fast5 --output . --one-to-one .\n> ls\nfile_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5  ... file_N.fast5 file_N.pod5\n\n# Convert all inputs so that they have neighbouring pod5 files from a parent directory\n> pod5 convert fast5 ./input/*.fast5 --output ./input/ --one-to-one ./input/\n> ls input/*\nfile_1.fast5 file_1.pod5 file_2.fast5 file_2.pod5  ... file_N.fast5 file_N.pod5\n```\n\n### Pod5 convert to_fast5\n\nThe ``pod5 convert to_fast5`` tool takes one or more ``.pod5`` files and converts them\nto multiple ``.fast5`` files. The default behaviour is to write 4000 reads per output file\nbut this can be controlled with the ``--file-read-count`` argument.\n\n``` bash\n# View help\n> pod5 convert to_fast5 --help\n\n# Convert pod5 files to fast5 files with default 4000 reads per file\n> pod5 convert to_fast5 example.pod5 --output pod5_to_fast5/\n> ls pod5_to_fast5/\noutput_1.fast5 output_2.fast5 ... output_N.fast5\n```\n\n### Pod5 Update\n\nThe ``pod5 update`` tool is used to update old pod5 files to use the latest schema.\nCurrently the latest schema version is version 3.\n\nFiles are written into the ``--output`` directory with the same name.\n\n``` bash\n> pod5 update --help\n\n# Update a named file\n> pod5 update my.pod5 --output updated/\n> ls updated\nupdated/my.pod5\n\n# Update an entire directory\n> pod5 update old/ -o updated/\n```\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Oxford Nanopore Technologies Pod5 File Format Python API and Tools",
    "version": "0.3.10",
    "project_urls": {
        "Documentation": "https://pod5-file-format.readthedocs.io/en/latest/",
        "Homepage": "https://github.com/nanoporetech/pod5-file-format",
        "Issues": "https://github.com/nanoporetech/pod5-file-format/issues"
    },
    "split_keywords": [
        "nanopore"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "407b5baef5b0627a14d78ec511b9ea5776a0e99ab8c54fb59ff391836de6597e",
                "md5": "1bec9a0dd62a319315aea38d24c99943",
                "sha256": "3ecfce9d4d4b2574242b1effc313f3fd25ef4651c44385beb68ad5ba8f539b11"
            },
            "downloads": -1,
            "filename": "pod5-0.3.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1bec9a0dd62a319315aea38d24c99943",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "~=3.8",
            "size": 69421,
            "upload_time": "2024-03-25T13:21:53",
            "upload_time_iso_8601": "2024-03-25T13:21:53.331261Z",
            "url": "https://files.pythonhosted.org/packages/40/7b/5baef5b0627a14d78ec511b9ea5776a0e99ab8c54fb59ff391836de6597e/pod5-0.3.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "69b0b5c4ca9cec24b982e72d5c805a9605e7eab4e39333c8cc77295a5eae412d",
                "md5": "4ee69f31a39c962cc4d3c4b112736177",
                "sha256": "f2dcb1938fcf51c725393345e480c1d12711089d542a27446fb92fbe2e18ae60"
            },
            "downloads": -1,
            "filename": "pod5-0.3.10.tar.gz",
            "has_sig": false,
            "md5_digest": "4ee69f31a39c962cc4d3c4b112736177",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "~=3.8",
            "size": 65719,
            "upload_time": "2024-03-25T13:21:56",
            "upload_time_iso_8601": "2024-03-25T13:21:56.096417Z",
            "url": "https://files.pythonhosted.org/packages/69/b0/b5c4ca9cec24b982e72d5c805a9605e7eab4e39333c8cc77295a5eae412d/pod5-0.3.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-25 13:21:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "nanoporetech",
    "github_project": "pod5-file-format",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pod5"
}
        