pasio

Name: pasio
Version: 1.1.3
Home page: https://github.com/autosome-ru/pasio
Summary: PASIO is a tool for segmentation and denoising of DNA coverage profiles coming from high-throughput sequencing data.
Upload time: 2024-12-30 04:48:51
Author: Andrey Lando, Ilya Vorontsov
Requires Python: !=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,>=2.7.1
License: Pasio is licensed under WTFPL, but if you prefer more standard licenses, feel free to treat it as MIT-licensed.
Keywords: bioinformatics, NGS, coverage, segmentation, denoise
Requirements: numpy, scipy, future, pytest, pytest-benchmark
![Pasio logo](logos/pasio_256.png?raw=true "Logo")
# PASIO

Pasio is a tool for denoising DNA coverage profiles coming from high-throughput sequencing data.
Examples of experiments PASIO works well on are ChIP-seq, DNase-seq, and ATAC-seq.

It takes a bedgraph file of counts (integer values; normalization is not supported) and produces
a TSV file with the genome split into segments whose coverage can be treated as equal.

Pasio runs on both Python 2 and 3 (Python 2 interpreter runs a bit faster).
The only dependencies are numpy and scipy.

The defaults are chosen for fast yet nearly precise computation, so it is usually enough to run:

```
pasio input.bedgraph
```

These defaults are subject to change, so if you want results to be reproducible between versions, please specify all substantial parameters (especially α and β) explicitly!

Note that PASIO processes a bedgraph contig by contig, so the bedgraph must be sorted by contig/chromosome!

PASIO can read and write to gzipped files (filename should have `.gz` extension).

PASIO can also read a bedgraph from stdin if `-` is supplied instead of a filename.
This can be useful for transcriptomic data, where contigs are short enough to be processed on the fly.


## Installation
PASIO works with Python 2.7.1+ and Python 3.4+. The tool is available on PyPI, so you can install it using pip:

```
  python -m pip install pasio
```


Note that pip installs a wrapper, so pasio can be run without specifying the python interpreter. Either of these two commands runs it:

```
pasio <options>...
python -m pasio <options>...
```

The latter option can be useful if you want to run it with a specific Python version.

## Underlying math

PASIO is a program to segment a chromatin accessibility profile. It accepts a bedgraph file
with per-position coverage by DNase cuts (e.g. by 5'-ends of DNase-seq reads)
and splits each contig/chromosome into segments with different accessibilities in an optimal way.

The method is based on two assumptions:
* cuts are introduced by a Poisson process `P(λ)` with `λ` depending on the segment
* `λ` is distributed as `λ ~ Γ(α, β)` with density `β^α * λ^{α - 1} * exp(-βλ) / Γ(α)`

`α` and `β` are the only parameters of segmentation.
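
This generative model can be sketched in a few lines (an illustration, not part of PASIO; the segment lengths and parameter values below are arbitrary):

```python
import numpy as np

# Simulate coverage under the assumed model: each segment's rate lambda
# is drawn from a Gamma(alpha, beta) prior (beta is the rate parameter),
# and per-position counts are Poisson(lambda).
rng = np.random.default_rng(42)
alpha, beta = 1.0, 1.0
segment_lengths = [300, 50, 650]  # hypothetical segment structure

coverage = np.concatenate([
    rng.poisson(lam=rng.gamma(shape=alpha, scale=1.0 / beta), size=length)
    for length in segment_lengths
])
# coverage is a 1-nt-resolution profile of non-negative integer counts
```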

Then we can derive the (logarithmic) marginal likelihood `logML` to be optimized.
`logML` for a single segment `S` of length `L` with coverages `(S_1, S_2, ...S_L)` and total coverage `C = \sum_i(S_i)` will be:
`logML(S,α,β) = α*log(β) − log(Γ(α)) + log(Γ(C + α)) − (C + α) * log(L + β) − \sum_i log (S_i!)`

Here `α*log(β) − log(Γ(α))` can be treated as a penalty for segment creation (and is approximately proportional to `α*log(β/α)`).
Total `logML(α,β)` is a sum of logarithmic marginal likelihoods for all segments: `logML(α,β) = \sum_S logML(S,α,β)`.
Given a chromosome coverage profile, the term `\sum_S {\sum_i log (S_i!)}` doesn't depend on the segmentation.
The value `log(Γ(C + α)) − (C + α) * log(L + β)` is referred to as the `self score` of a segment.
We optimize only the segmentation-dependent part of `logML`, which is termed simply the `score`.
This score is the sum, over all segments, of the segment's self score and the penalty for segment creation.
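
The per-segment `logML` formula above can be sketched with `scipy.special.gammaln` (an illustrative reimplementation, not PASIO's actual code):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alpha, beta):
    """logML of a single segment under the Gamma-Poisson model.

    counts: integer coverage values S_1..S_L of the segment.
    gammaln(x) computes log(Gamma(x)); note log(S_i!) = gammaln(S_i + 1).
    """
    counts = np.asarray(counts)
    C = counts.sum()   # total coverage of the segment
    L = len(counts)    # segment length
    return (alpha * np.log(beta) - gammaln(alpha)
            + gammaln(C + alpha) - (C + alpha) * np.log(L + beta)
            - gammaln(counts + 1).sum())

# A uniformly covered segment scores better kept whole than split in
# half, because the split pays the segment-creation penalty twice:
whole = log_marginal_likelihood([5] * 10, alpha=1, beta=1)
split = (log_marginal_likelihood([5] * 5, alpha=1, beta=1)
         + log_marginal_likelihood([5] * 5, alpha=1, beta=1))
assert whole > split
```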

## Program design

`split_bedgraph` loads a bedgraph file chromosome by chromosome, splits each chromosome into segments, and writes the result into the output TSV file.
Coverage counts are stored internally with 1-nt resolution.

Splitting proceeds in two steps: (a) reduce the list of candidate split points (this step is sometimes omitted);
(b) choose splits from the list of candidates and calculate the score of the segmentation.
The first step is performed by one of the so-called *reducers*; the second step is performed
by one of the *splitters* (each splitter also implements the reducer interface, but not vice versa).

Splitters and reducers:
* The most basic splitter is `SquareSplitter`, which implements a dynamic programming algorithm
   with `O(N^2)` complexity, where `N` is the number of split candidates. Other splitters/reducers perform
   heuristic optimizations on top of `SquareSplitter`.
* `SlidingWindowReducer` tries to segment not an entire contig (chromosome) but shorter parts of it.
   It scans the sequence with a sliding window and removes split candidates that are unlikely.
   Each window is processed using some base splitter (typically `SquareSplitter`).
   Candidates from different windows are then aggregated.
* `RoundReducer` performs the same procedure and repeats it for several rounds or until the list of split candidates converges.
* `NotZeroReducer` discards all splits if all points of the interval under consideration are zero.
* `NotConstantReducer` discards splits between same-valued points.
* `ReducerCombiner` accepts a list of reducers to be applied sequentially. The last reducer can also be a splitter;
in that case the combiner allows for splitting and scoring a segmentation. To turn any reducer into a splitter, one can combine
that reducer with `NopSplitter`, so that the split candidates obtained by the reducer are treated as the
final splitting, and `NopSplitter` makes it possible to calculate its score.

Splits denote segment boundaries to the left of a position. Adjacent splits `a` and `b` form the semi-closed interval `[a, b)`.
E.g. for coverage counts `[99,99,99, 1,1,1]` the splits should be `[0, 3, 6]`,
giving two segments: `[0, 3)` and `[3, 6)`.

Splits and split candidates are stored as numpy arrays and always include both inner split points and segment boundaries, i.e. the points just before the contig start and right after the contig end.

One can also treat splits as positions between elements (as in Python slices):
```
counts:            |  99   99   99  |   1    1     1  |
splits candidates: 0     1    2     3     4     5     6
splits:            0                3                 6
```
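This convention can be illustrated with a short sketch (`segments_from_splits` is a hypothetical helper, not part of PASIO's API):

```python
import numpy as np

def segments_from_splits(counts, splits):
    # splits always include 0 and len(counts); adjacent splits a, b
    # delimit the half-open segment [a, b).
    return [(a, b, counts[a:b]) for a, b in zip(splits[:-1], splits[1:])]

counts = np.array([99, 99, 99, 1, 1, 1])
splits = np.array([0, 3, 6])
for start, stop, seg in segments_from_splits(counts, splits):
    print(start, stop, seg.tolist())
# 0 3 [99, 99, 99]
# 3 6 [1, 1, 1]
```
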
Splitters invoke `LogMarginalLikelyhoodComputer`, which can compute `logML` for a splitting (and for each segment).
`LogMarginalLikelyhoodComputer` stores cumulative sums of coverage counts at split candidates,
as well as distances between candidates. This allows one to compute `logML` efficiently without
recalculating total segment coverages each time.
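
The cumulative-sum trick can be sketched as follows (an illustration with hypothetical names, not PASIO's internals):

```python
import numpy as np

counts = np.array([99, 99, 99, 1, 1, 1])
# prefix[i] = sum(counts[:i]); with this array the total coverage of any
# candidate segment [a, b) is one subtraction instead of a fresh sum.
prefix = np.concatenate([[0], np.cumsum(counts)])

def segment_total(a, b):
    return prefix[b] - prefix[a]

assert segment_total(0, 3) == 297
assert segment_total(3, 6) == 3
assert segment_total(0, 6) == counts.sum()
```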

In order to compute `log(x)` and `log(Γ(x))` efficiently, we precompute values for roughly the first million integer values of `x`.
Computational efficiency restricts us to integer values of `α` and `β`. Segment lengths are naturally integer,
and coverage counts (and total segment counts) are also integer because they represent numbers of cuts.
`LogComputer` and `LogGammaComputer` store the precomputed values and know how to calculate them efficiently.
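
The precomputation idea can be sketched like this (the cutoff and function name are illustrative, not PASIO's actual classes):

```python
import numpy as np
from scipy.special import gammaln

# Tabulate log(Gamma(x)) for integer x below a cutoff; fall back to
# direct evaluation for rare huge values.
CUTOFF = 1_000_000
table = gammaln(np.arange(1, CUTOFF, dtype=float))  # table[x-1] = log(Gamma(x))

def log_gamma_int(x):
    if x < CUTOFF:
        return table[x - 1]   # O(1) lookup for the common case
    return gammaln(float(x))  # fallback beyond the table

assert np.isclose(log_gamma_int(5), np.log(24.0))  # Gamma(5) = 4! = 24
```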

## See also
Predecessor of our approach — “[Segmentation of long genomic sequences into domains with homogeneous composition with BASIO software](https://doi.org/10.1093/bioinformatics/17.11.1065)”.

## Development

Bumping a new version:
```
VERSION='1.2.3'
echo "__version__ = '${VERSION}'"  >  src/pasio/version.py
git commit -am "bump version ${VERSION}"
git tag "${VERSION}"
git push
git push --tags
rm dist/ build/ -r
python3 setup.py sdist
python3 setup.py bdist_wheel --universal
twine upload dist/*
```



            
