triplix


Nametriplix JSON
Version 0.0.4 PyPI version JSON
download
home_page
SummaryA python package to efficiently process, store and retrieve spatial genomics multi-contact data
upload_time2023-09-17 14:18:57
maintainer
docs_urlNone
author
requires_python>=3.9
licenseMIT License Copyright (c) 2022 Amin Allahyar Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords bioinformatics genome 3d genome conformation multi-contact
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Triplix
[![PyPI version](https://badge.fury.io/py/triplix.svg)](https://badge.fury.io/py/triplix)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/triplix.svg)
[![MIT](https://img.shields.io/pypi/l/triplix.svg?color=green)](https://opensource.org/licenses/MIT)

A python package to efficiently process, store and retrieve spatial genomics multi-contact data.

## Introduction:
Triplix provides access to multi-contact data, that could be stored in three different spaces (i.e., resolution): 
1. **Concatemers**: An HDF5 container that stores multi-contact data in a table-like format. 
Each row in this table is an alignment that is captured by a particular `concatemer` (i.e., a sequenced long `read`). 
There is an index (i.e., `read_idx`) column in this table that specifies alignments that originate from each specific `concatemer`.

2. **Tri-alignments**:
Every possible three cis-alignments of each `concatemer` is stored in the Tri-alignment file. Each three cis-alignment is called a `tri-alignment`. 
Such a data structure facilitates 3-way interaction analyses, and allow random-access. 
To this end, `Triplix` has a simple wrapper around [pairix](https://github.com/4dn-dcic/pairix).
In particular, the relevant `tri-alignments` are loaded by checking the first two given coordinates using `pairix`, and then filtering for the third coordinate via an ordinary overlap check.

3. **Triplets**:
Triples are essentially Tri-alignment files, where `tri-alignments` are aggregated into genomic bins. 
This not only reduces the disk (and memory) footprint, but also facilitates random-access. Considering the sparcity of the multi-contact data, most (if not all) multi-contact analyses are done in this "binned" space (rather than `alignment` space).

## Installation
You may run the following commands in your command line to install Triplix.
```bash
pip3 install triplix

# or
python3 -m pip install triplix
```

## Extracting data:
`triplix` works in three types of data: `Concatermers`, `Tri-Alignments` and `Triplets`.
Below, you can see how you can extract each type of data.

### 1. `Concatemer` space  (*.concatemers.h5)
Provides random-access to read (i.e., `Concatemer`) level data.
In order to have access to the content of a `Concatemers` container, a handle should be initialized via:
```python
import triplix

concatemers_path = './concatemers/mESC.MC-HiC.concatemers.h5'
concatemers_container = triplix.ConcatemersContainer(file_path=concatemers_path)
```

Then, we can access the content, by requesting a specific concatemer/read index, or range of concatemer indices:
```python
print(concatemers_container[6])
# <Concatemers>
#    read_idx  chrom_num     start       end  strand  map_quality  flag  seq_start  seq_end                             read_name  read_length
# 0         6          9  28049720  28051533      -1           60  2064         22     1833  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 1         6          9  27993951  27996370       1           60     0       2233     4617  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 2         6          9  28001381  28003048      -1           60  2064       4602     6245  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 3         6          9  28003034  28003230      -1           60  2064       6429     6621  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 4         6          9  34934155  34935542       1           60  2048       6615     7929  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 5         6          9  36134828  36135055      -1           42  2064       7925     8149  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 6         6         15  48738095  48738498      -1           60  2064       8145     8536  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 7         7          2  63927407  63930186       1           60  2048         23     2803  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151
# 8         7          2  64210922  64221347       1           60     0       2793     8134  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151


print(concatemers_container[6:9])
# <Concatemers>
#     read_idx  chrom_num      start        end  strand  map_quality  flag  seq_start  seq_end                             read_name  read_length
# 0          6          9   28049720   28051533      -1           60  2064         22     1833  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 1          6          9   27993951   27996370       1           60     0       2233     4617  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 2          6          9   28001381   28003048      -1           60  2064       4602     6245  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 3          6          9   28003034   28003230      -1           60  2064       6429     6621  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 4          6          9   34934155   34935542       1           60  2048       6615     7929  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 5          6          9   36134828   36135055      -1           42  2064       7925     8149  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 6          6         15   48738095   48738498      -1           60  2064       8145     8536  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538
# 7          7          2   63927407   63930186       1           60  2048         23     2803  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151
# 8          7          2   64210922   64221347       1           60     0       2793     8134  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151
# 9          8         10  103377945  103378320       1           60  2048         25      391  000a93a0-5d61-42a1-81e2-848b7b197f60         3710
# 10         8         10  105640948  105642721      -1           60    16        387     2134  000a93a0-5d61-42a1-81e2-848b7b197f60         3710
# 11         8         15   43523758   43523837      -1           38  2064       2130     2209  000a93a0-5d61-42a1-81e2-848b7b197f60         3710
# 12         8         13    4296589    4296895       1           60  2048       2205     2501  000a93a0-5d61-42a1-81e2-848b7b197f60         3710
# 13         8         19   48817930   48819166       1            1  2048       2497     3708  000a93a0-5d61-42a1-81e2-848b7b197f60         3710
```

Alternatively, one can iterate over all `concatemers`:
```python
import pandas as pd
concatemers_iter = concatemers_container.iter(return_name=True, return_length=True)
n_concatemer = len(concatemers_container)
for idx, concatemer in enumerate(concatemers_iter):
    df = pd.DataFrame(concatemer)
    print(df[['chrom_num', 'start', 'strand', 'read_name']])
    if idx > 0:
        break

#    chrom_num     start  strand                             read_name
# 0          6  19951796       1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 1          6  18571279       1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 2          6  19944712      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 3          6  18954688      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 4          6  18958868       1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 5          6  20007053      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 6          6  19139306       1  0000b7ad-14d7-4285-a1a3-65a87eef213d
# 7         11  74436453      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d
#    chrom_num     start  strand                             read_name
# 0          5  92933191      -1  0000eea6-4554-40c3-b4ab-ecaa5cf5795b
# 1          5  23899224       1  0000eea6-4554-40c3-b4ab-ecaa5cf5795b
# 2          5  33962641      -1  0000eea6-4554-40c3-b4ab-ecaa5cf5795b
# 
```

#### Description of stored columns:
  * `read_idx`: A Concatemer index (staring from 0). All alignments originating from a particular Concatemer (i.e. sequenced read) are annotated with an identical index.
  * `chrom_num`: A numeric value for the chromosome that the current alignment is mapped to (e.g. `chr1`=1, `chr2`=2, etc.). 
    The ordered list of chromosomes is available in the HDF5 container's attributes (i.e. `concatemers_container.h5_file['concatemers'].attrs['chrom_names']`)

  * `start`: The first basepair that the alignment is mapped to in the reference genome
  * `end`: The last basepair that the alignment is mapped to in the reference genome
  * `strand`: The strand in which the alignment is mapped to
  * `map_quality`: The mapping quality that is assigned by the mapper to this alignment. Note that mappers have a different range of mapping qualities (e.g. minimap2 may assign a value between [0 - 60], while bwa-sw assigns a value between [0 - 255])
  * `flag`: A series of bit-wise flags. These bits are assigned by the mapper, following [SAM file specifications](https://doi.org/10.1093/gigascience/giab008).
  * `seq_start`: The first nucleotide over the sequenced read from which the alignment is collected from. 
  * `seq_end`: The last nucleotide over the sequenced read from which the alignment is collected from.
  * `read_name`: The name of the sequenced read (aka. `read id`) that is defined by the sequence machine.
  * `read_length`: Length of the read (in terms of basepairs) from which the alignment is originating from.

### 2. `Tri-alignment` space (*.tri-alignments.tsv.bgz):
Provides access to alignment (i.e., `tri-alignments`) level data. 
These files are essentially formatted as compressed [pairix files](https://github.com/4dn-dcic/pairix). The difference is that they contain three coordinates instead of two coordinates (as is the case for regular pairix files)

> **Note:** The tri-alignment files need to be indexed if you like to have random access to them. You can do this using `triplix index --input=input-file.tri-alignments.tsv.bgz`

An example of this usage is as follows:

```python
import triplix

tri_aln_path = './tri-alignments/GM12878.Merged.tri-alignments.tsv.bgz'
tri_container = triplix.TriAlignmentsContainer(trialn_path=tri_aln_path)

tri_alignments_iter = tri_container.fetch(chrom_a='chrY', start_a=0e6, end_a=10e6, batch_size=10)
for idx, tri_alignments in enumerate(tri_alignments_iter):
    df = tri_alignments.to_dataframe()
    print(df[['read_name', 'chrom_A', 'start_A', 'start_B', 'start_C']])
    if idx > 0:
        break
#                               read_name chrom_A  start_A  start_B  start_C
# 0  b9129419-3e9d-47ea-8ae3-bf65c468ef43    chrY    10710   363575   363575
# 1  b9129419-3e9d-47ea-8ae3-bf65c468ef43    chrY    10710   363575   449218
# 2  b9129419-3e9d-47ea-8ae3-bf65c468ef43    chrY    10710   363575   449218
# 3  689000cd-3deb-47f9-8c8d-64eb039d0be1    chrY    10710  1203147  1302472
# 4                  SRR11589401.15490287    chrY    11129  1774864  1806803
# 5                  SRR11589391.12389645    chrY    13063   493484  1644842
# 6  998d7719-0d64-4202-ba7e-0a4b9d49284e    chrY    13337   310549   672618
# 7  998d7719-0d64-4202-ba7e-0a4b9d49284e    chrY    13337   310549   754824
# 8  998d7719-0d64-4202-ba7e-0a4b9d49284e    chrY    13337   672618   754824
#                               read_name chrom_A  start_A  start_B  start_C
# 0  e7603585-2e15-423e-a26f-8b114d6bab87    chrY    13338    17122   266818
# 1                  SRR11589390.11433849    chrY    13394    13813  2515291
# 2  e7f93e28-9540-462b-817c-d8acd563b94a    chrY    13394    13844   318989
# 3                   SRR11589394.4086145    chrY    13441  1231515  2705073
# 4  429af33b-3260-4dc5-9cf3-fc7b96f0b932    chrY    13508   252336   323450
# 5  0f526116-e8fc-459b-aac9-d9d6d2843289    chrY    13584   278514   364567
# 6  0f526116-e8fc-459b-aac9-d9d6d2843289    chrY    13584   278514   366665
# 7  6589f1d9-cd7d-4cba-bfc9-06f123956157    chrY    13584   282655   440783
# 8  0f526116-e8fc-459b-aac9-d9d6d2843289    chrY    13584   364567   366665
# 9  11f46754-fbfe-4f9f-8aa6-7f03c7db48df    chrY    13584  1180971  1812799

```

#### Description of stored columns:
In the column names below, the letter `X` refers to either `A`, `B` or `C`. 
For example, by `chrom_X`, we state that each Tri-alignment contains `chrom_A`, `chrom_B` and `chrom_C` columns.
 * `chrom_X`: The chromosome on which the alignment `X` is mapped to.
 * `start_X`: The first basepair on which the alignment `X` is mapped to.
 * `end_X`: The last basepair on which the alignment `X` is mapped to.
 * `strand_X`: The strand on which the alignment `X` is mapped to.
 * `mapq_X`: The mapping quality assigned by the aligner for alignment `X`.
 * `frag_index_X`: An index that refers the order of the alignments collected from the read (sorted by their starting basepair relative to the read). For example, `frag_index_A`=1 indicates that this is the second alignment originating from the read.
 * `read_name`: Name of Concatemer (i.e., sequenced read), as present in the `Concatemers` container.
 * `read_length_nbp`: Length of the read in terms of basepairs.
 * `read_length_nfrag`: Length of the read in terms of number of alignments.
 * `experiment_name`: Name of the experiment, given by the user (or taken from the `Concatemers` filename).


### 3. `Triplet` space (*.triplets.h5):
Provides access to triplet-level data. A usage example is:

```python
import pandas as pd
import triplix

trpl_path = './triplets/GM12878.MC-HiC.hg38.triplets.h5'
trpl_obj = triplix.TripletsContainer(container_path=trpl_path)
triplets = trpl_obj.fetch(
    chrom_a='chr8', start_a=119e6, end_a=120e6,
    columns=['start_A', 'start_B', 'start_C', 'count_ABC'],
)
df = pd.DataFrame(triplets)
print(df)
# output:
#           start_A    start_B    start_C  count_ABC
# 0       119000000  119000000  119000000         73
# 1       119000000  119000000  119025000        136
# 2       119000000  119000000  119050000         82
# 3       119000000  119000000  119075000         55
# 4       119000000  119000000  119100000         54
# ...           ...        ...        ...        ...
# 136156  120000000  121950000  121975000          8
# 136157  120000000  121950000  122000000          2
# 136158  120000000  121975000  121975000          5
# 136159  120000000  121975000  122000000          7
# 136160  120000000  122000000  122000000          6
# 
# [136161 rows x 4 columns]
```

Alternatively, you may iterate over stored [chunks](https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage) of triplets as follows:
```python
for triplets in trpl_obj.iter_chunks():
  print(triplets)
#        flag  start_A  start_B  start_C  distance_AB  ...
# 0         0        0        0        0            0  ...
# 1         0        0        0    25000            0  ...
# 2         0        0        0    50000            0  ...
# 3         0        0        0    75000            0  ...
# 4         0        0        0   100000            0  ...
# ...     ...      ...      ...      ...          ...  ...
```
This will iterate over all stored triplets, chunk by chunk. 
In addition, you can limit the chromosomes or columns that are retured (to improve load speed). This can be done as follows:
```python
for current_chromosome, triplets in trpl_obj.iter_chunks(chroms=['chr3'], columns=['start_A', 'start_B', 'start_C']):
  print(current_chromosome, triplets)
# chr3 Triplets <50000 x 3>:
#       start_A start_B  start_C
# 0           0       0        0
# 1           0       0    25000
# 2           0       0    50000
# 3           0       0    75000
# 4           0       0   100000
# ***       ***     ***      ***
```
Note that the `.iter_chunks()` function returns a tuple of `(current_chromosome, triplets)`.

#### Description of stored columns:
 * `flag`: A set of binary flags (in the range of [0x0000 - 0xffff]) that specify a property of each triplet:
   * `0x0000`: No flag is set. This is the default.
   * `0x0001`: At least one of the anchors coordinate of the Triplet is located out of the chromosome length.
   * `0x0002`: At least one column in the triplet has an `undefined` value.
 * `start_X`: The starting basepair on which the anchor `A` is defined on.
 * `start_B`: The starting basepair on which the anchor `B` is defined on.
 * `start_C`: The starting basepair on which the anchor `C` is defined on.
 * `distance_AB`, `distance_AC` and `distance_BC`: Distance (in terms of base pairs) between anchor `A-B`, `A-C` and `B-C`. 
 * `count_A`, `count_B`, `count_C`: Number of alignments covering anchor `A`, `B` and `C` respectively. This is simply the 1D coverage track at anchor `A`, `B` and `C` respectively.
 * `count_AB`, `count_AC`, `count_BC`: Number of pairwise contacts found between anchors `A-B`, `A-C` and `B-C`. This count is basically the number of interation found in a HiC representation of the multi-contact data.
 * `count_ABC`: Number of times a 3-way contact is found between anchors `A-B-C`.
 * The following columns are assigned after the `transformation` step of the Triplix pipeline:
   + `observed_A`, `observed_B`, `observed_C`: The `counts_A`, `counts_B` or `counts_C` after performing the transformation. The default transformation is `Gaussian`, which smooths the counts in the 3D space of cube.
   + `observed_AB`, `observed_AC`, `observed_BC`: The `count_AB`, `count_AC`, `count_BC` after performing the transformation.
   + `observed_ABC`: The `count_ABC` after performing the transformation.
 * Correction factors: The following columns are assigned after the `KDTree` construction and neighbor search steps of the Triplix pipeline. 
   For the neighbor search, different factors could be considered to perform the neighbor search. For instance:
   + `singletons`: Triplets are considered neighbors (i.e. similar), if they have similar 1D coverages across their anchors (i.e., `observed_A`, `observed_B` and `observed_C`) as well as similar pairwise anchor distances (i.e., `distance_AB`, `distance_AC` and `distance_BC`).
   + `pairs`: Triplets are considered similar if their pairwise contacts are similar (i.e., `observed_AB`, `observed_AC` and `observed_BC`) as well as their pairwise anchor distances.
   + `pairs_surround`: Triplets are similar if their pairwise contacts, pairwise anchor distances as well as their surrounding 3-way contacts (i.e., the `observed_ABC` of adjacent Triplets) are similar.
 * Based on each correction factors, several statistics are estimated. For example, in case the `pairs` correction factor is chosen:
   + `pairs.neighbors_distance_med`: The median of the K-neighbors Euclidean distance to the current triplet, in the high-dimensional space of the constructed KDTree.
   + `pairs.neighbors_distance_std`: The standard deviation of the K-neighbors Euclidean distance to the current triplet, in the high-dimensional space of the constructed KDTree.
   + `pairs.neighbors_expected_avg`: The average of `observed_ABC` across the identified similar triplets (i.e., neighbors).
   + `pairs.neighbors_expected_std`: The standard deviation of `observed_ABC` across the identified similar triplets (i.e., neighbors).
   + `pairs.enrichment`: The enrichment score calculated after comparing the observed and expected number of 3-way contacts (i.e., `observed_ABC` vs. `pairs.neighbors_expected_avg`) while considering the standard deviation of these observations (i.e., `pairs.neighbors_expected_std`)

## Drawing virtual-hic plots (experimental features):
`triplix` can be used to draw virtual-HiC plots. 
Note that this is an experimental feature. So the arguments may change in future releases.
A minimal example of doing this is as follows:

```python
from triplix.core import plot

call = {
  'chrom': 'chr8',
  'start_A': int(119.825e6),
  'start_B': int(120.100e6),
  'start_C': int(120.425e6),
}
ant_coords = [[call['start_B'] - 50e3, call['start_B'] + 75e3, call['start_C'] - 50e3, call['start_C'] + 75e3]]

plot.plot_virtual_hic(
  input_patterns=f"./GM12878.MC-HiC.hg38.triplets.h5",
  view_chrom=call['chrom'], 
  view_start=call['start_A'] - 750e3, 
  view_end=call['start_C'] + 750e3 ,
  view_point=call['start_A'],
  plot_names=(
      'count_ABC', 
      'observed_ABC', 
      'singletons.enrichment',
      'pairs.enrichment',
      'pairs_surround.enrichment',
  ),
  output_dir='./',
  output_name='virtual-hic_example.pdf',
  plot_params={"sin.*enrichment": {"annotations":ant_coords}}
)
```
The above code would produce a plot like below. Notice the black (rectangle) annotation in the middle plot which is drawn
as requested in `plot_params` argument.
![Example of virtual-HiC plot](./examples/virtual-hic_example.png)

#### Description of arguments:
The majority of the arguments are assumed to be self-explanatory, but some may not be clear
* `input_patterns`: Path to `Triplet` container.
* `plot_names`: The column in the `Triplet` container to be used in the plot.
* `plot_params`: Can be used send varied parameters to the plot (see source code for details). 
  Most importantly, you can define (using regular expressions) which plot should be annotated. Annotation format can be seen in the example above.


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "triplix",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "Bioinformatics,Genome,3D Genome Conformation,Multi-contact",
    "author": "",
    "author_email": "Amin Allahyar <amin.allahyar@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f7/c0/f1ff0704b84a1df373afff2f58ac7f5cc065402b2e1398006048486ea035/triplix-0.0.4.tar.gz",
    "platform": null,
    "description": "# Triplix\n[![PyPI version](https://badge.fury.io/py/triplix.svg)](https://badge.fury.io/py/triplix)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/triplix.svg)\n[![MIT](https://img.shields.io/pypi/l/triplix.svg?color=green)](https://opensource.org/licenses/MIT)\n\nA python package to efficiently process, store and retrieve spatial genomics multi-contact data.\n\n## Introduction:\nTriplix provides access to multi-contact data, that could be stored in three different spaces (i.e., resolution): \n1. **Concatemers**: An HDF5 container that stores multi-contact data in a table-like format. \nEach row in this table is an alignment that is captured by a particular `concatemer` (i.e., a sequenced long `read`). \nThere is an index (i.e., `read_idx`) column in this table that specifies alignments that originate from each specific `concatemer`.\n\n2. **Tri-alignments**:\nEvery possible three cis-alignments of each `concatemer` is stored in the Tri-alignment file. Each three cis-alignment is called a `tri-alignment`. \nSuch a data structure facilitates 3-way interaction analyses, and allow random-access. \nTo this end, `Triplix` has a simple wrapper around [pairix](https://github.com/4dn-dcic/pairix).\nIn particular, the relevant `tri-alignments` are loaded by checking the first two given coordinates using `pairix`, and then filtering for the third coordinate via an ordinary overlap check.\n\n3. **Triplets**:\nTriples are essentially Tri-alignment files, where `tri-alignments` are aggregated into genomic bins. \nThis not only reduces the disk (and memory) footprint, but also facilitates random-access. Considering the sparcity of the multi-contact data, most (if not all) multi-contact analyses are done in this \"binned\" space (rather than `alignment` space).\n\n## Installation\nYou may run the following commands in your command line to install Triplix.\n```bash\npip3 install triplix\n\n# or\npython3 -m pip install triplix\n```\n\n## Extracting data:\n`triplix` works in three types of data: `Concatermers`, `Tri-Alignments` and `Triplets`.\nBelow, you can see how you can extract each type of data.\n\n### 1. `Concatemer` space  (*.concatemers.h5)\nProvides random-access to read (i.e., `Concatemer`) level data.\nIn order to have access to the content of a `Concatemers` container, a handle should be initialized via:\n```python\nimport triplix\n\nconcatemers_path = './concatemers/mESC.MC-HiC.concatemers.h5'\nconcatemers_container = triplix.ConcatemersContainer(file_path=concatemers_path)\n```\n\nThen, we can access the content, by requesting a specific concatemer/read index, or range of concatemer indices:\n```python\nprint(concatemers_container[6])\n# <Concatemers>\n#    read_idx  chrom_num     start       end  strand  map_quality  flag  seq_start  seq_end                             read_name  read_length\n# 0         6          9  28049720  28051533      -1           60  2064         22     1833  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 1         6          9  27993951  27996370       1           60     0       2233     4617  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 2         6          9  28001381  28003048      -1           60  2064       4602     6245  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 3         6          9  28003034  28003230      -1           60  2064       6429     6621  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 4         6          9  34934155  34935542       1           60  2048       6615     7929  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 5         6          9  36134828  36135055      -1           42  2064       7925     8149  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 6         6         15  48738095  48738498      -1           60  2064       8145     8536  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 7         7          2  63927407  63930186       1           60  2048         23     2803  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151\n# 8         7          2  64210922  64221347       1           60     0       2793     8134  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151\n\n\nprint(concatemers_container[6:9])\n# <Concatemers>\n#     read_idx  chrom_num      start        end  strand  map_quality  flag  seq_start  seq_end                             read_name  read_length\n# 0          6          9   28049720   28051533      -1           60  2064         22     1833  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 1          6          9   27993951   27996370       1           60     0       2233     4617  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 2          6          9   28001381   28003048      -1           60  2064       4602     6245  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 3          6          9   28003034   28003230      -1           60  2064       6429     6621  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 4          6          9   34934155   34935542       1           60  2048       6615     7929  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 5          6          9   36134828   36135055      -1           42  2064       7925     8149  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 6          6         15   48738095   48738498      -1           60  2064       8145     8536  000a9e3e-5a06-4691-acf7-f71c0b707bc9         8538\n# 7          7          2   63927407   63930186       1           60  2048         23     2803  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151\n# 8          7          2   64210922   64221347       1           60     0       2793     8134  000a76b2-216e-4c40-b0f5-35c7ce10d30c         8151\n# 9          8         10  103377945  103378320       1           60  2048         25      391  000a93a0-5d61-42a1-81e2-848b7b197f60         3710\n# 10         8         10  105640948  105642721      -1           60    16        387     2134  000a93a0-5d61-42a1-81e2-848b7b197f60         3710\n# 11         8         15   43523758   43523837      -1           38  2064       2130     2209  000a93a0-5d61-42a1-81e2-848b7b197f60         3710\n# 12         8         13    4296589    4296895       1           60  2048       2205     2501  000a93a0-5d61-42a1-81e2-848b7b197f60         3710\n# 13         8         19   48817930   48819166       1            1  2048       2497     3708  000a93a0-5d61-42a1-81e2-848b7b197f60         3710\n```\n\nAlternatively, one can iterate over all `concatemers`:\n```python\nimport pandas as pd\nconcatemers_iter = concatemers_container.iter(return_name=True, return_length=True)\nn_concatemer = len(concatemers_container)\nfor idx, concatemer in enumerate(concatemers_iter):\n    df = pd.DataFrame(concatemer)\n    print(df[['chrom_num', 'start', 'strand', 'read_name']])\n    if idx > 0:\n        break\n\n#    chrom_num     start  strand                             read_name\n# 0          6  19951796       1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 1          6  18571279       1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 2          6  19944712      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 3          6  18954688      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 4          6  18958868       1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 5          6  20007053      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 6          6  19139306       1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n# 7         11  74436453      -1  0000b7ad-14d7-4285-a1a3-65a87eef213d\n#    chrom_num     start  strand                             read_name\n# 0          5  92933191      -1  0000eea6-4554-40c3-b4ab-ecaa5cf5795b\n# 1          5  23899224       1  0000eea6-4554-40c3-b4ab-ecaa5cf5795b\n# 2          5  33962641      -1  0000eea6-4554-40c3-b4ab-ecaa5cf5795b\n# \n```\n\n#### Description of stored columns:\n  * `read_idx`: A Concatemer index (staring from 0). All alignments originating from a particular Concatemer (i.e. sequenced read) are annotated with an identical index.\n  * `chrom_num`: A numeric value for the chromosome that the current alignment is mapped to (e.g. `chr1`=1, `chr2`=2, etc.). \n    The ordered list of chromosomes is available in the HDF5 container's attributes (i.e. `concatemers_container.h5_file['concatemers'].attrs['chrom_names']`)\n\n  * `start`: The first basepair that the alignment is mapped to in the reference genome\n  * `end`: The last basepair that the alignment is mapped to in the reference genome\n  * `strand`: The strand in which the alignment is mapped to\n  * `map_quality`: The mapping quality that is assigned by the mapper to this alignment. Note that mappers have a different range of mapping qualities (e.g. minimap2 may assign a value between [0 - 60], while bwa-sw assigns a value between [0 - 255])\n  * `flag`: A series of bit-wise flags. These bits are assigned by the mapper, following [SAM file specifications](https://doi.org/10.1093/gigascience/giab008).\n  * `seq_start`: The first nucleotide over the sequenced read from which the alignment is collected from. \n  * `seq_end`: The last nucleotide over the sequenced read from which the alignment is collected from.\n  * `read_name`: The name of the sequenced read (aka. `read id`) that is defined by the sequence machine.\n  * `read_length`: Length of the read (in terms of basepairs) from which the alignment is originating from.\n\n### 2. `Tri-alignment` space (*.tri-alignments.tsv.bgz):\nProvides access to alignment (i.e., `tri-alignments`) level data. \nThese files are essentially formatted as compressed [pairix files](https://github.com/4dn-dcic/pairix). The difference is that they contain three coordinates instead of two coordinates (as is the case for regular pairix files)\n\n> **Note:** The tri-alignment files need to be indexed if you like to have random access to them. You can do this using `triplix index --input=input-file.tri-alignments.tsv.bgz`\n\nAn example of this usage is as follows:\n\n```python\nimport triplix\n\ntri_aln_path = './tri-alignments/GM12878.Merged.tri-alignments.tsv.bgz'\ntri_container = triplix.TriAlignmentsContainer(trialn_path=tri_aln_path)\n\ntri_alignments_iter = tri_container.fetch(chrom_a='chrY', start_a=0e6, end_a=10e6, batch_size=10)\nfor idx, tri_alignments in enumerate(tri_alignments_iter):\n    df = tri_alignments.to_dataframe()\n    print(df[['read_name', 'chrom_A', 'start_A', 'start_B', 'start_C']])\n    if idx > 0:\n        break\n#                               read_name chrom_A  start_A  start_B  start_C\n# 0  b9129419-3e9d-47ea-8ae3-bf65c468ef43    chrY    10710   363575   363575\n# 1  b9129419-3e9d-47ea-8ae3-bf65c468ef43    chrY    10710   363575   449218\n# 2  b9129419-3e9d-47ea-8ae3-bf65c468ef43    chrY    10710   363575   449218\n# 3  689000cd-3deb-47f9-8c8d-64eb039d0be1    chrY    10710  1203147  1302472\n# 4                  SRR11589401.15490287    chrY    11129  1774864  1806803\n# 5                  SRR11589391.12389645    chrY    13063   493484  1644842\n# 6  998d7719-0d64-4202-ba7e-0a4b9d49284e    chrY    13337   310549   672618\n# 7  998d7719-0d64-4202-ba7e-0a4b9d49284e    chrY    13337   310549   754824\n# 8  998d7719-0d64-4202-ba7e-0a4b9d49284e    chrY    13337   672618   754824\n#                               read_name chrom_A  start_A  start_B  start_C\n# 0  e7603585-2e15-423e-a26f-8b114d6bab87    chrY    13338    17122   266818\n# 1                  SRR11589390.11433849    chrY    13394    13813  2515291\n# 2  e7f93e28-9540-462b-817c-d8acd563b94a    chrY    13394    13844   318989\n# 3                   SRR11589394.4086145    chrY    13441  1231515  2705073\n# 4  429af33b-3260-4dc5-9cf3-fc7b96f0b932    chrY    13508   252336   323450\n# 5  0f526116-e8fc-459b-aac9-d9d6d2843289    chrY    13584   278514   364567\n# 6  0f526116-e8fc-459b-aac9-d9d6d2843289    chrY    13584   278514   366665\n# 7  6589f1d9-cd7d-4cba-bfc9-06f123956157    chrY    13584   282655   440783\n# 8  0f526116-e8fc-459b-aac9-d9d6d2843289    chrY    13584   364567   366665\n# 9  11f46754-fbfe-4f9f-8aa6-7f03c7db48df    chrY    13584  1180971  1812799\n\n```\n\n#### Description of stored columns:\nIn the column names below, the letter `X` refers to either `A`, `B` or `C`. \nFor example, by `chrom_X`, we state that each Tri-alignment contains `chrom_A`, `chrom_B` and `chrom_C` columns.\n * `chrom_X`: The chromosome on which the alignment `X` is mapped to.\n * `start_X`: The first basepair on which the alignment `X` is mapped to.\n * `end_X`: The last basepair on which the alignment `X` is mapped to.\n * `strand_X`: The strand on which the alignment `X` is mapped to.\n * `mapq_X`: The mapping quality assigned by the aligner for alignment `X`.\n * `frag_index_X`: An index that refers the order of the alignments collected from the read (sorted by their starting basepair relative to the read). For example, `frag_index_A`=1 indicates that this is the second alignment originating from the read.\n * `read_name`: Name of Concatemer (i.e., sequenced read), as present in the `Concatemers` container.\n * `read_length_nbp`: Length of the read in terms of basepairs.\n * `read_length_nfrag`: Length of the read in terms of number of alignments.\n * `experiment_name`: Name of the experiment, given by the user (or taken from the `Concatemers` filename).\n\n\n### 3. `Triplet` space (*.triplets.h5):\nProvides access to triplet-level data. A usage example is:\n\n```python\nimport pandas as pd\nimport triplix\n\ntrpl_path = './triplets/GM12878.MC-HiC.hg38.triplets.h5'\ntrpl_obj = triplix.TripletsContainer(container_path=trpl_path)\ntriplets = trpl_obj.fetch(\n    chrom_a='chr8', start_a=119e6, end_a=120e6,\n    columns=['start_A', 'start_B', 'start_C', 'count_ABC'],\n)\ndf = pd.DataFrame(triplets)\nprint(df)\n# output:\n#           start_A    start_B    start_C  count_ABC\n# 0       119000000  119000000  119000000         73\n# 1       119000000  119000000  119025000        136\n# 2       119000000  119000000  119050000         82\n# 3       119000000  119000000  119075000         55\n# 4       119000000  119000000  119100000         54\n# ...           ...        ...        ...        ...\n# 136156  120000000  121950000  121975000          8\n# 136157  120000000  121950000  122000000          2\n# 136158  120000000  121975000  121975000          5\n# 136159  120000000  121975000  122000000          7\n# 136160  120000000  122000000  122000000          6\n# \n# [136161 rows x 4 columns]\n```\n\nAlternatively, you may iterate over stored [chunks](https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage) of triplets as follows:\n```python\nfor triplets in trpl_obj.iter_chunks():\n  print(triplets)\n#        flag  start_A  start_B  start_C  distance_AB  ...\n# 0         0        0        0        0            0  ...\n# 1         0        0        0    25000            0  ...\n# 2         0        0        0    50000            0  ...\n# 3         0        0        0    75000            0  ...\n# 4         0        0        0   100000            0  ...\n# ...     ...      ...      ...      ...          ...  ...\n```\nThis will iterate over all stored triplets, chunk by chunk. \nIn addition, you can limit the chromosomes or columns that are retured (to improve load speed). This can be done as follows:\n```python\nfor current_chromosome, triplets in trpl_obj.iter_chunks(chroms=['chr3'], columns=['start_A', 'start_B', 'start_C']):\n  print(current_chromosome, triplets)\n# chr3 Triplets <50000 x 3>:\n#       start_A start_B  start_C\n# 0           0       0        0\n# 1           0       0    25000\n# 2           0       0    50000\n# 3           0       0    75000\n# 4           0       0   100000\n# ***       ***     ***      ***\n```\nNote that the `.iter_chunks()` function returns a tuple of `(current_chromosome, triplets)`.\n\n#### Description of stored columns:\n * `flag`: A set of binary flags (in the range of [0x0000 - 0xffff]) that specify a property of each triplet:\n   * `0x0000`: No flag is set. This is the default.\n   * `0x0001`: At least one of the anchors coordinate of the Triplet is located out of the chromosome length.\n   * `0x0002`: At least one column in the triplet has an `undefined` value.\n * `start_X`: The starting basepair on which the anchor `A` is defined on.\n * `start_B`: The starting basepair on which the anchor `B` is defined on.\n * `start_C`: The starting basepair on which the anchor `C` is defined on.\n * `distance_AB`, `distance_AC` and `distance_BC`: Distance (in terms of base pairs) between anchor `A-B`, `A-C` and `B-C`. \n * `count_A`, `count_B`, `count_C`: Number of alignments covering anchor `A`, `B` and `C` respectively. This is simply the 1D coverage track at anchor `A`, `B` and `C` respectively.\n * `count_AB`, `count_AC`, `count_BC`: Number of pairwise contacts found between anchors `A-B`, `A-C` and `B-C`. This count is basically the number of interation found in a HiC representation of the multi-contact data.\n * `count_ABC`: Number of times a 3-way contact is found between anchors `A-B-C`.\n * The following columns are assigned after the `transformation` step of the Triplix pipeline:\n   + `observed_A`, `observed_B`, `observed_C`: The `counts_A`, `counts_B` or `counts_C` after performing the transformation. The default transformation is `Gaussian`, which smooths the counts in the 3D space of cube.\n   + `observed_AB`, `observed_AC`, `observed_BC`: The `count_AB`, `count_AC`, `count_BC` after performing the transformation.\n   + `observed_ABC`: The `count_ABC` after performing the transformation.\n * Correction factors: The following columns are assigned after the `KDTree` construction and neighbor search steps of the Triplix pipeline. \n   For the neighbor search, different factors could be considered to perform the neighbor search. For instance:\n   + `singletons`: Triplets are considered neighbors (i.e. similar), if they have similar 1D coverages across their anchors (i.e., `observed_A`, `observed_B` and `observed_C`) as well as similar pairwise anchor distances (i.e., `distance_AB`, `distance_AC` and `distance_BC`).\n   + `pairs`: Triplets are considered similar if their pairwise contacts are similar (i.e., `observed_AB`, `observed_AC` and `observed_BC`) as well as their pairwise anchor distances.\n   + `pairs_surround`: Triplets are similar if their pairwise contacts, pairwise anchor distances as well as their surrounding 3-way contacts (i.e., the `observed_ABC` of adjacent Triplets) are similar.\n * Based on each correction factors, several statistics are estimated. For example, in case the `pairs` correction factor is chosen:\n   + `pairs.neighbors_distance_med`: The median of the K-neighbors Euclidean distance to the current triplet, in the high-dimensional space of the constructed KDTree.\n   + `pairs.neighbors_distance_std`: The standard deviation of the K-neighbors Euclidean distance to the current triplet, in the high-dimensional space of the constructed KDTree.\n   + `pairs.neighbors_expected_avg`: The average of `observed_ABC` across the identified similar triplets (i.e., neighbors).\n   + `pairs.neighbors_expected_std`: The standard deviation of `observed_ABC` across the identified similar triplets (i.e., neighbors).\n   + `pairs.enrichment`: The enrichment score calculated after comparing the observed and expected number of 3-way contacts (i.e., `observed_ABC` vs. `pairs.neighbors_expected_avg`) while considering the standard deviation of these observations (i.e., `pairs.neighbors_expected_std`)\n\n## Drawing virtual-hic plots (experimental features):\n`triplix` can be used to draw virtual-HiC plots. \nNote that this is an experimental feature. So the arguments may change in future releases.\nA minimal example of doing this is as follows:\n\n```python\nfrom triplix.core import plot\n\ncall = {\n  'chrom': 'chr8',\n  'start_A': int(119.825e6),\n  'start_B': int(120.100e6),\n  'start_C': int(120.425e6),\n}\nant_coords = [[call['start_B'] - 50e3, call['start_B'] + 75e3, call['start_C'] - 50e3, call['start_C'] + 75e3]]\n\nplot.plot_virtual_hic(\n  input_patterns=f\"./GM12878.MC-HiC.hg38.triplets.h5\",\n  view_chrom=call['chrom'], \n  view_start=call['start_A'] - 750e3, \n  view_end=call['start_C'] + 750e3 ,\n  view_point=call['start_A'],\n  plot_names=(\n      'count_ABC', \n      'observed_ABC', \n      'singletons.enrichment',\n      'pairs.enrichment',\n      'pairs_surround.enrichment',\n  ),\n  output_dir='./',\n  output_name='virtual-hic_example.pdf',\n  plot_params={\"sin.*enrichment\": {\"annotations\":ant_coords}}\n)\n```\nThe above code would produce a plot like below. Notice the black (rectangle) annotation in the middle plot which is drawn\nas requested in `plot_params` argument.\n![Example of virtual-HiC plot](./examples/virtual-hic_example.png)\n\n#### Description of arguments:\nThe majority of the arguments are assumed to be self-explanatory, but some may not be clear\n* `input_patterns`: Path to `Triplet` container.\n* `plot_names`: The column in the `Triplet` container to be used in the plot.\n* `plot_params`: Can be used send varied parameters to the plot (see source code for details). \n  Most importantly, you can define (using regular expressions) which plot should be annotated. Annotation format can be seen in the example above.\n\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2022 Amin Allahyar  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "A python package to efficiently process, store and retrieve spatial genomics multi-contact data",
    "version": "0.0.4",
    "project_urls": {
        "Homepage": "https://github.com/deLaatLab/triplix"
    },
    "split_keywords": [
        "bioinformatics",
        "genome",
        "3d genome conformation",
        "multi-contact"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "351f0f1d7529d30acda4c58f55d1551e805c8292db1d1bf4caaafbbc176a0762",
                "md5": "f12bbc946bc1950b7dac4287a6ddc031",
                "sha256": "fdf81f7f2839b58fe7173dadf1c45e5f46db95f61cece6f18bd2a4419af1e0d6"
            },
            "downloads": -1,
            "filename": "triplix-0.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f12bbc946bc1950b7dac4287a6ddc031",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 73001,
            "upload_time": "2023-09-17T14:18:55",
            "upload_time_iso_8601": "2023-09-17T14:18:55.871071Z",
            "url": "https://files.pythonhosted.org/packages/35/1f/0f1d7529d30acda4c58f55d1551e805c8292db1d1bf4caaafbbc176a0762/triplix-0.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f7c0f1ff0704b84a1df373afff2f58ac7f5cc065402b2e1398006048486ea035",
                "md5": "a07c019de913938f70f230e67a22f414",
                "sha256": "ab51631be21accd9906753142e1b546853f3575e0eded885160f3f16949b9f2a"
            },
            "downloads": -1,
            "filename": "triplix-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "a07c019de913938f70f230e67a22f414",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 67797,
            "upload_time": "2023-09-17T14:18:57",
            "upload_time_iso_8601": "2023-09-17T14:18:57.863824Z",
            "url": "https://files.pythonhosted.org/packages/f7/c0/f1ff0704b84a1df373afff2f58ac7f5cc065402b2e1398006048486ea035/triplix-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-17 14:18:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "deLaatLab",
    "github_project": "triplix",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "triplix"
}
        
Elapsed time: 0.23126s