DAStk


NameDAStk JSON
Version 1.0.1.post1 PyPI version JSON
download
home_pagehttps://github.com/Dowell-Lab/DAStk
SummaryDifferential ATAC-seq toolkit
upload_time2023-01-26 22:13:59
maintainer
docs_urlNone
authorIgnacio Tripodi
requires_python
licenseBSD
keywords bioinformatics genomics chromatin atac-seq motif transcription_factor tf
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            DAStk
=====

The Differential
`ATAC-seq <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374986/>`__
Toolkit (DAStk) is a set of command-line tools to aid analyzing
differential ATAC-seq data. By leveraging changes in accessible
chromatin, we can detect significant changes in transcription factor
(TF) activity. This is a simple but powerful tool for cellular
perturbation analysis. In fact, the applications of DAStk are not
necessarily limited to ATAC-seq, but can extend to comparing any pair of
bedGraphs containing regions of interest, e.g. transcription regulatory
elements (TRE) from nascent transcription assays like PRO-seq, or others
like ChIP-seq peaks.

You will need the following inputs:

-  A pair of files listing peaks of ATAC-seq signal in two biological
   conditions (e.g. DMSO and drug-treated) in any BedGraph-compatible
   format (tab-delimited)
-  A set of files listing the putative binding sites across the
   reference genome of choice, one file per transcription factor motif,
   also in any BedGraph-like format. These are normally generated from
   position weight matrices (PWMs) available at TF model databases like
   `HOCOMOCO <http://hocomoco11.autosome.ru>`__. These files are
   expected to have a ``.bed``, ``.BedGraph`` or ``.txt`` extension.

**IMPORTANT: All files mentioned above (ATAC-seq peaks and computed
motif sites) MUST be sorted by the same criteria. Different
bioinformatics tools use different lexical sorting criteria (e.g. chr10
after chr1, or chr2 after chr1) so please ensure the sorting criteria is
uniform.**

Install
~~~~~~~

You can install DAStk using ``pip``:

::

   $ pip install DAStk

… or:

::

   $ pip install --upgrade DAStk 

This is the simplest option, and it will also create the executable
commands ``process_atac`` and ``differential_md_score``. Alternatively,
you can clone this repository by running:

::

   $ git clone https://github.com/Dowell-Lab/DAStk.git

Required Python libraries (automatically taken care of if installed thru ``pip``):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  numpy
-  argparse
-  matplotlib
-  scipy
-  adjustText
-  pandas
-  multiprocessing
-  pybedtools
-  futures
-  scikit-learn
-  itertools
-  networkx
-  upsetplot

These scripts feature comprehensive help when called with the ``--help``
argument. Every argument provides a short and long form (i.e. ``-t`` or
``--threads``). There are normally two steps in a typical workflow:

1. Process the ATAC-seq peak files (process_atac) to calculate the
   `MD-score
   statistic <https://genome.cshlp.org/content/28/3/334.short>`__ for
   each motif provided.
2. Detect the most statistically significant changes in MD-score
   (differential_md_score) between both biological conditions (taking
   into the account the peaks involved), and generate MA and barcode
   plots.

TL;DR;
~~~~~~

Process each ATAC-seq peak bedGraph file, the genome abbreviations are
“hg38”, “mm10”, etc:

::

   $ process_atac -e PEAKS_FILENAME_1 -m MOTIF_SITES_DIRECTORY -g GENOME_ABBREVIATION -o OUTPUT_DIRECTORY
   $ process_atac -e PEAKS_FILENAME_2 -m MOTIF_SITES_DIRECTORY -g GENOME_ABBREVIATION -o OUTPUT_DIRECTORY

Perform differential analysis on the calculated motif displacement
scores, highlighting the significant motifs at your p-value threshold of
choice:

::

   $ differential_md_score -1 MD_OUTPUT_FOR_FILENAME_1 -2 MD_OUTPUT_FOR_FILENAME_2 -p 0.0000001 -m "DMSO" -n "Drug treatment" -b -o OUTPUT_DIRECTORY

Perform downstream analysis to discover what is common among those
significant TFs:

::

   $ tf_result_explanations -p 0.0000001 -d DIFFERENTIAL_MD_SCORE_OUTPUT_FILENAME -o EXPLANATION_FILENAME

Questions?

::

   $ process_atac --help
   $ differential_md_score --help

Data Processing and Plotting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are three plotting functions to assess the results from
``process_atac`` and ``differential_md_score``:

-  ma_plot
-  barcode_plot
-  tf_intersect

The first two allow you to generate custom MA and Barcode plots using
the stats file produced using the differential_md_score module. While
the MA and Barcode plots will be generated by default when running the
differential_md_score module, it may be useful to adjust the labels,
adjust the p-value threshold, compare motifs with non-significant
p-values etc. The default plots will generate labels based on your file
basenames (everything before the \_md_score.txt extension if you kept
the default file names from ``process_atac``). However, to keep plots
scaled appropriately, this label is limited to 19 characters. If it is
longer, it will be truncated.

The third plotting module, ``tf_intersect`` can be used to generate a
venn diagram (3 or fewer conditions) or an upset plot using multiple
result files as inputs to compare common motifs between conditions. For
example, using the -e/–enriched argument and by providing multiple
\*_md_score.txt files files from ``process_atac``, all motifs with an MD
score >0.2 will be intersected. Additionally, if more than three input
stats files are specified, three catplots will be generated alongside
the upset plot by taking the mean value of all of the stats files for
total genome-wide motif hits, motif hits within 3kb of the region file,
and the MD score.

**IMPORTANT** Either raw MD score outputs from ``process_atac`` or
differential MD scores from ``differential_md_score`` may be used,
however it is not recommended to mix file input types. There are also
some plotting functions unique to each (e.g. intersecting significant
motifs is only relevant to the ``differential_md_score`` outputs.
Furthermore, -e/–enriched and -d/–depleted have different relative
values between the two output types. Using the ``process_atac`` results,
the raw MD score will be used and defaults are relative to a background
singal of 0.1. Using the ``differential_md_score`` stats output, the
enriched/depleted is based on the *differential* MD score (i.e. the
difference in MD scores between the two conditions) and therefore the
thresholds are specified by a different argument.

A stats file of the intersection data will also be generated in the
format needed to generate the upset plot (see
https://pypi.org/project/pyupset/0.1.1.post4/). There is a limit of 12
input files for plotting. However, if more than 12 files are specified,
a stats file output will still be generated specifying motif
intersections.

The following will provide a full list of arguments and usage
instructions instructions for each of these plotting tools:

::

   $ ma_plot --help
   $ barcode_plot --help
   $ tf_intersect --help

While these plots will be generated by default when running the
differential_md_score module, it may be useful to adjust the labels,
adjust the p-value threshold, compare motifs with non-significant
p-values etc. The default plots will generate labels based on your file
basenames (everything before the \_md_score.txt extension if you kept
the default file names from process_atac). However, to keep plots scaled
appropriately, this label is limited to 19 characters. If it is longer,
it will be truncated.

Inquiring about TF-TF relations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In release v1.0.0 a new functionality has been included, if using assays
on human tissue and the TF motifs from HOCOMOCO:

::

   $ tf_result_explanations --help

This build now includes a graph that combines all human proteins in
`Uniprot <https://www.uniprot.org/>`__, with their respective
annotations, and all human `Reactome <https://reactome.org/>`__ data on
biological pathways, biochemical reactions, and complex formation. This
allows the user to query for existing knowledge about what aspects of
the significantly changing TFs are shared among two or more of them.
They may participate in the same pathway or biological process, be
members of the same complex, have shared cofactors, directly interact
with each other, etc.

You can optionally provide a list of “uninteresting” intersections by
listing the ontology concept URIs in a file (see `our
example <./DAStk/uninteresting_relations.txt>`__ list, for things in
common between TFs that are too obvious), so that they are omitted from
the output. You can also specify additional entities to include in the
search, besides the TFs showing a change of activity beyond the provided
cutoff, also in a different file listing ontology URIs. The `list of
labels for each concept
used <./DAStk/public_knowledge/all_labels.tsv>`__ shows the possible
URIs that could be included as an additional search endpoint. For
example, if we end up with 5 different TFs that are significantly
changing between conditions and we also want to search how they all
relate to zinc, we can provide a file with the URI
`<http://purl.obolibrary.org/obo/CHEBI_29105> <http://purl.obolibrary.org/obo/CHEBI_29105>`__.
This tool produces a report that looks like the excerpt below:

::

   Transcription factors displaying a significant difference in activity:
   CEBPB, CEBPD, CEBPE, ELK1, HLF, NFIA, NFIB, NFIX, NFYB, NRF1, TP53, ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, ZNF93

   Here's what we know about these TFs presenting significant activity changes (p=1.00E-03):

   Direct interactions between each of these TFs:
   NFYB interacts with TP53
   TP53 interacts with CEBPB

   Other ways these TFs are related:
   CEBPB, CEBPE, ELK1, HLF, NFIA, NFIB, NFIX, NFYB, NRF1, TP53, ZNF180, ZNF341, ZNF396, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: located in nucleus

   ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: has component Zinc finger, C2H2 type

   ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, and ZNF540: has function metal ion binding

   ZNF180, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: has component KRAB box

   CEBPB, CEBPD, CEBPE, NFYB, TP53, and ZNF396: has function protein heterodimerization activity

   ZNF180, ZNF432, ZNF441, ZNF519, ZNF529, and ZNF540: molecularly interacts with KRAB-ZNF / KAP Complex [nucleoplasm]

   CEBPB, CEBPD, CEBPE, and HLF: has component Basic region leucine zipper

   [...]

   CEBPB and CEBPD: interacts with ATF4

   CEBPB and CEBPD: interacts with CEBPA

   CEBPB and CEBPD: participates in positive regulation of osteoblast differentiation

   CEBPB and CEBPE: participates in cellular response to lipopolysaccharide

   CEBPB and CEBPE: participates in defense response to bacterium

Usage examples
~~~~~~~~~~~~~~

Unpack the motif files (see below for how to create your own, instead):

::

   $ mkdir /path/to/grch38_motifs
   $ tar xvfz motifs/HOCOMOCO_v11_p1e-6_grch38.tar.gz --directory /path/to/grch38_motifs

Calculate the MD-scores for the first biological condition:

::

   $ process_atac --threads 8 --atac-peaks /path/to/DMSO/ATAC/peaks/file \
     --genome hg38 \
     --motif-path /path/to/directory/containing/motif/files \
     --output /path/to/output/directory

The above command generates a file called ``BASENAME_md_scores.txt``.
It’s generally a good idea to use the cell type (or sample number) and a
brief condition description (e.g. ``k562_DMSO`` or
``SRR1234123_Metronidazole``) in the file name provided.

We would then generate the same file, for the other biological condition
we are comparing against:

::

   $ process_atac --threads 8 --atac-peaks /path/to/treatment/ATAC/peaks/file \
     --genome hg38 \
     --motif-path /path/to/directory/containing/motif/files \
     --output /path/to/output/directory

The above generates a file called ``BASENAME_md_scores.txt``. Finally:

::

   $ differential_md_score --assay-1 DMSO --assay-2 Treatment --p-value 0.0000001 --barcodes --output /path/to/output/directory

The above generates a tab-delimited file with all differential MD scores
for each motif and their p-values (sorted by p-value), an MA plot that
labels the most significant TF activity changes, at a p-value cutoff of
1e-7. Note that better plot-friendly condition names (say, “DMSO” and
“Treatment”) can be provided using the ``--label-1`` and ``--label-2``
arguments. The plots look like the example below:

.. figure:: ./doc_files/sample_MA_plot.png
   :alt: Sample MA plot

   Sample MA plot

The ``-b`` flag also generates a “barcode plot” of each of these
statistically significat motif differences that depicts how close the
ATAC-seq peaks were to the (putative TF-binding) motif sites, within a
1500 base-pair radius of the motif center:

.. figure:: ./doc_files/sample_barcode_plot.png
   :alt: Sample barcode plot

   Sample barcode plot

If you can take advantage of multiprocessing, you can calculate
MD-scores for both conditions simultaneously, assigning several threads
to each, then generate the plots once both ``*_md_scores.txt`` files are
ready.

The columns for the tab-separated output file from
``differential_md_score`` are:

::

   Motif name , p-value , # total motif hits, # nearby peaks in condition 1 , # nearby peaks in condition 2 , MD-score in condition 1 , MD-score in condition 2, differential MD-score

To query what we know about these highlighted TFs displaying a
significant difference in activity, we can use:

::

   $ tf_result_explanations -p 0.0000001 \
     -d /path/to/output/directory/your_condition1_vs_condition2_differential_md_score.txt \
     -o /path/to/report_file.txt

Additional Arguments
--------------------

Genome File
~~~~~~~~~~~

If your genome is not incuded in the UCSC genome repository, you will
instead need to provide a chromosome sizes file in processess_atac. This
can be generated using
`samtools <http://www.htslib.org/doc/samtools.html>`__ faidx as follows:

::

   $ samtools faidx genome.fa
   $ cut -f1,2 genome.fa.fai > genome.chrom.sizes

This file can then be specified using the ``-c``/``--chromosomes``
argument in ``process_atac``. Scaffold chromosomes will be removed.
Please note that the ``--genome`` and ``--chromosomes`` arguments are
mutually exclusive, and that the ``--genome`` argument assumes the
``[fetchChromsizes](https://anaconda.org/bioconda/ucsc-fetchchromsizes)``
tool is installed.

Normalization
~~~~~~~~~~~~~

If the -g/–global-normalization argument is used in the
``differential_md_score`` module, then the total number of genome-wide
motif hits will be used to normalize the barcode plots. This may be
helpful in better assessing changes across different
perturbations/experiments or cell lines. Otherwise, the barcode plots
will simply be set to the same max heat to facilitate better
visualization of relative motif density between conditions. This
normalization argument has also been implemented in the ``barcode_plot``
plotting module and as such the output stats file from
``differential_md_score`` now include the total number of genome-wide
motif hits for each motif.

**NOTE** Using the -g/–global-normalization argument will also result in
a different heatmap color (YlOrRd) rather than the default (cividis) so
that the two normalization types are easily differentiated.

Altering Window Size
~~~~~~~~~~~~~~~~~~~~

While we strongly recommend using the default 1500bp radius window in
calculating the MD score (and differential MD score), as of v0.3.0 we
now have a radius argument (``-r``/``--radius``) which will allow you to
expand or shrink this window. If changed, the MD score calculation will
follow the same principle in that it will be a ratio of motifs hits
within the cetner 1/10th of the window relative divided by the number of
total motif hits within the window. For example, if the user specifies a
radius of 2000bp, there will be a window size of 4000bp, a center of
400bp around the features of interest, and the MD score will be # motif
hits within 400bp/ # motifs within 4000bp. Keep in mind that expanding
this window may be useful in visualization, but will result in an MD
score approaching 0.1 (background).

Motif Files
~~~~~~~~~~~

Feel free to use the pre-scanned motif files provided,
HOCOMOCO_v11_p1e-6_grch38.tar.gz(`mirror
1 <http://dowell.colorado.edu/pubs/DAStk/motifs/HOCOMOCO_v11_p1e-6_grch38.tar.gz>`__,
`mirror
2 <https://drive.google.com/file/d/19D1iW9x0mswiFLoj6hrDFjVfhYmAbLqG/view?usp=sharing>`__),
HOCOMOCO_v11_p1e-6_hg19.tar.gz(`mirror
1 <http://dowell.colorado.edu/pubs/DAStk/motifs/HOCOMOCO_v11_p1e-6_hg19.tar.gz>`__,
`mirror
2 <https://drive.google.com/file/d/10_0kuPQbswmhoazjvEJ1KfGRJDuJ-O0y/view?usp=sharing>`__)
and HOCOMOCO_v11_p1e-6_mm10.tar.gz (`mirror
1 <http://dowell.colorado.edu/pubs/DAStk/motifs/HOCOMOCO_v11_p1e-6_mm10.tar.gz>`__,
`mirror
2 <https://drive.google.com/file/d/1qCirs0AfHzFwnbXMEa8vTd06tEiyE42Z/view?usp=sharing>`__
for the ``GRCh38/hg38``, ``hg19`` and ``mm10`` reference genomes,
respectively. They have been generated from HOCOMOCO’s v11
mononucleotide model, background-corrected for each reference genome. To
generate your own ``bed`` files for each motif from this or any other
source, you can use FIMO in combination with the downloaded ``.meme``
files from your TF database of choice. For example, if using HOCOMOCO,
you can create the motif file for TP53 using their mononucleotide model
with a p-value threshold of 0.000001 by:

::

   $ fimo -max-stored-scores 10000000 --thresh 1e-6 -oc /path/to/output/directory -motif /path/to/motif/file \
     /path/to/HOCOMOCOv11_HUMAN_mono_meme_format.meme /path/to/whole_genome.fa

Please refer to the complete
`FIMO <http://meme-suite.org/doc/fimo.html?man_type=web>`__
documentation for any questions.

--------------

Citation
~~~~~~~~

Please cite DAStk if you have used it in your research!

*Tripodi, I.J.; Allen, M.A.; Dowell, R.D.* `Detecting Differential
Transcription Factor Activity from ATAC-Seq
Data <https://www.mdpi.com/1420-3049/23/5/1136>`__\ *. Molecules 2018,
23, 1136.*

If you have used the provided scanned motif regions from HOCOMOCO,
please cite them as well:

*Kulakovskiy, I.V., Vorontsov, I.E., Yevshin, I.S., Sharipov, R.N.,
Fedorova, A.D., Rumynskiy, E.I., Medvedeva, Y.A., Magana-Mora, A.,
Bajic, V.B., Papatsenko, D.A., et al. (2018).* `HOCOMOCO: towards a
complete collection of transcription factor binding models for human and
mouse via large-scale ChIP-Seq
analysis <https://academic.oup.com/nar/article/46/D1/D252/4616875>`__\ *.
Nucleic Acids Res 46, D252–D259.*

For any questions or bug reports, please use the Issue Tracker.

| *Ignacio Tripodi (ignacio.tripodi at colorado.edu)*
| *Computer Science Department, BioFrontiers Institute*
| *University of Colorado, Boulder, USA*

| *Margaret Gruca (margaret.gruca at colorado.edu)*
| *BioFrontiers Institute*
| *University of Colorado, Boulder, USA*
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Dowell-Lab/DAStk",
    "name": "DAStk",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "bioinformatics genomics chromatin ATAC-seq motif transcription_factor TF",
    "author": "Ignacio Tripodi",
    "author_email": "ignacio.tripodi@colorado.edu",
    "download_url": "https://files.pythonhosted.org/packages/6b/d9/3ff879c202f7a9cebfc5ec605c39f6ef5c06aa264cec092e1851fb034f7a/DAStk-1.0.1.post1.tar.gz",
    "platform": null,
    "description": "DAStk\n=====\n\nThe Differential\n`ATAC-seq <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374986/>`__\nToolkit (DAStk) is a set of command-line tools to aid analyzing\ndifferential ATAC-seq data. By leveraging changes in accessible\nchromatin, we can detect significant changes in transcription factor\n(TF) activity. This is a simple but powerful tool for cellular\nperturbation analysis. In fact, the applications of DAStk are not\nnecessarily limited to ATAC-seq, but can extend to comparing any pair of\nbedGraphs containing regions of interest, e.g.\u00a0transcription regulatory\nelements (TRE) from nascent transcription assays like PRO-seq, or others\nlike ChIP-seq peaks.\n\nYou will need the following inputs:\n\n-  A pair of files listing peaks of ATAC-seq signal in two biological\n   conditions (e.g.\u00a0DMSO and drug-treated) in any BedGraph-compatible\n   format (tab-delimited)\n-  A set of files listing the putative binding sites across the\n   reference genome of choice, one file per transcription factor motif,\n   also in any BedGraph-like format. These are normally generated from\n   position weight matrices (PWMs) available at TF model databases like\n   `HOCOMOCO <http://hocomoco11.autosome.ru>`__. These files are\n   expected to have a ``.bed``, ``.BedGraph`` or ``.txt`` extension.\n\n**IMPORTANT: All files mentioned above (ATAC-seq peaks and computed\nmotif sites) MUST be sorted by the same criteria. Different\nbioinformatics tools use different lexical sorting criteria (e.g.\u00a0chr10\nafter chr1, or chr2 after chr1) so please ensure the sorting criteria is\nuniform.**\n\nInstall\n~~~~~~~\n\nYou can install DAStk using ``pip``:\n\n::\n\n   $ pip install DAStk\n\n\u2026 or:\n\n::\n\n   $ pip install --upgrade DAStk \n\nThis is the simplest option, and it will also create the executable\ncommands ``process_atac`` and ``differential_md_score``. Alternatively,\nyou can clone this repository by running:\n\n::\n\n   $ git clone https://github.com/Dowell-Lab/DAStk.git\n\nRequired Python libraries (automatically taken care of if installed thru ``pip``):\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n-  numpy\n-  argparse\n-  matplotlib\n-  scipy\n-  adjustText\n-  pandas\n-  multiprocessing\n-  pybedtools\n-  futures\n-  scikit-learn\n-  itertools\n-  networkx\n-  upsetplot\n\nThese scripts feature comprehensive help when called with the ``--help``\nargument. Every argument provides a short and long form (i.e.\u00a0``-t`` or\n``--threads``). There are normally two steps in a typical workflow:\n\n1. Process the ATAC-seq peak files (process_atac) to calculate the\n   `MD-score\n   statistic <https://genome.cshlp.org/content/28/3/334.short>`__ for\n   each motif provided.\n2. Detect the most statistically significant changes in MD-score\n   (differential_md_score) between both biological conditions (taking\n   into the account the peaks involved), and generate MA and barcode\n   plots.\n\nTL;DR;\n~~~~~~\n\nProcess each ATAC-seq peak bedGraph file, the genome abbreviations are\n\u201chg38\u201d, \u201cmm10\u201d, etc:\n\n::\n\n   $ process_atac -e PEAKS_FILENAME_1 -m MOTIF_SITES_DIRECTORY -g GENOME_ABBREVIATION -o OUTPUT_DIRECTORY\n   $ process_atac -e PEAKS_FILENAME_2 -m MOTIF_SITES_DIRECTORY -g GENOME_ABBREVIATION -o OUTPUT_DIRECTORY\n\nPerform differential analysis on the calculated motif displacement\nscores, highlighting the significant motifs at your p-value threshold of\nchoice:\n\n::\n\n   $ differential_md_score -1 MD_OUTPUT_FOR_FILENAME_1 -2 MD_OUTPUT_FOR_FILENAME_2 -p 0.0000001 -m \"DMSO\" -n \"Drug treatment\" -b -o OUTPUT_DIRECTORY\n\nPerform downstream analysis to discover what is common among those\nsignificant TFs:\n\n::\n\n   $ tf_result_explanations -p 0.0000001 -d DIFFERENTIAL_MD_SCORE_OUTPUT_FILENAME -o EXPLANATION_FILENAME\n\nQuestions?\n\n::\n\n   $ process_atac --help\n   $ differential_md_score --help\n\nData Processing and Plotting\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThere are three plotting functions to assess the results from\n``process_atac`` and ``differential_md_score``:\n\n-  ma_plot\n-  barcode_plot\n-  tf_intersect\n\nThe first two allow you to generate custom MA and Barcode plots using\nthe stats file produced using the differential_md_score module. While\nthe MA and Barcode plots will be generated by default when running the\ndifferential_md_score module, it may be useful to adjust the labels,\nadjust the p-value threshold, compare motifs with non-significant\np-values etc. The default plots will generate labels based on your file\nbasenames (everything before the \\_md_score.txt extension if you kept\nthe default file names from ``process_atac``). However, to keep plots\nscaled appropriately, this label is limited to 19 characters. If it is\nlonger, it will be truncated.\n\nThe third plotting module, ``tf_intersect`` can be used to generate a\nvenn diagram (3 or fewer conditions) or an upset plot using multiple\nresult files as inputs to compare common motifs between conditions. For\nexample, using the -e/\u2013enriched argument and by providing multiple\n\\*_md_score.txt files files from ``process_atac``, all motifs with an MD\nscore >0.2 will be intersected. Additionally, if more than three input\nstats files are specified, three catplots will be generated alongside\nthe upset plot by taking the mean value of all of the stats files for\ntotal genome-wide motif hits, motif hits within 3kb of the region file,\nand the MD score.\n\n**IMPORTANT** Either raw MD score outputs from ``process_atac`` or\ndifferential MD scores from ``differential_md_score`` may be used,\nhowever it is not recommended to mix file input types. There are also\nsome plotting functions unique to each (e.g.\u00a0intersecting significant\nmotifs is only relevant to the ``differential_md_score`` outputs.\nFurthermore, -e/\u2013enriched and -d/\u2013depleted have different relative\nvalues between the two output types. Using the ``process_atac`` results,\nthe raw MD score will be used and defaults are relative to a background\nsingal of 0.1. Using the ``differential_md_score`` stats output, the\nenriched/depleted is based on the *differential* MD score (i.e.\u00a0the\ndifference in MD scores between the two conditions) and therefore the\nthresholds are specified by a different argument.\n\nA stats file of the intersection data will also be generated in the\nformat needed to generate the upset plot (see\nhttps://pypi.org/project/pyupset/0.1.1.post4/). There is a limit of 12\ninput files for plotting. However, if more than 12 files are specified,\na stats file output will still be generated specifying motif\nintersections.\n\nThe following will provide a full list of arguments and usage\ninstructions instructions for each of these plotting tools:\n\n::\n\n   $ ma_plot --help\n   $ barcode_plot --help\n   $ tf_intersect --help\n\nWhile these plots will be generated by default when running the\ndifferential_md_score module, it may be useful to adjust the labels,\nadjust the p-value threshold, compare motifs with non-significant\np-values etc. The default plots will generate labels based on your file\nbasenames (everything before the \\_md_score.txt extension if you kept\nthe default file names from process_atac). However, to keep plots scaled\nappropriately, this label is limited to 19 characters. If it is longer,\nit will be truncated.\n\nInquiring about TF-TF relations\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn release v1.0.0 a new functionality has been included, if using assays\non human tissue and the TF motifs from HOCOMOCO:\n\n::\n\n   $ tf_result_explanations --help\n\nThis build now includes a graph that combines all human proteins in\n`Uniprot <https://www.uniprot.org/>`__, with their respective\nannotations, and all human `Reactome <https://reactome.org/>`__ data on\nbiological pathways, biochemical reactions, and complex formation. This\nallows the user to query for existing knowledge about what aspects of\nthe significantly changing TFs are shared among two or more of them.\nThey may participate in the same pathway or biological process, be\nmembers of the same complex, have shared cofactors, directly interact\nwith each other, etc.\n\nYou can optionally provide a list of \u201cuninteresting\u201d intersections by\nlisting the ontology concept URIs in a file (see `our\nexample <./DAStk/uninteresting_relations.txt>`__ list, for things in\ncommon between TFs that are too obvious), so that they are omitted from\nthe output. You can also specify additional entities to include in the\nsearch, besides the TFs showing a change of activity beyond the provided\ncutoff, also in a different file listing ontology URIs. The `list of\nlabels for each concept\nused <./DAStk/public_knowledge/all_labels.tsv>`__ shows the possible\nURIs that could be included as an additional search endpoint. For\nexample, if we end up with 5 different TFs that are significantly\nchanging between conditions and we also want to search how they all\nrelate to zinc, we can provide a file with the URI\n`<http://purl.obolibrary.org/obo/CHEBI_29105> <http://purl.obolibrary.org/obo/CHEBI_29105>`__.\nThis tool produces a report that looks like the excerpt below:\n\n::\n\n   Transcription factors displaying a significant difference in activity:\n   CEBPB, CEBPD, CEBPE, ELK1, HLF, NFIA, NFIB, NFIX, NFYB, NRF1, TP53, ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, ZNF93\n\n   Here's what we know about these TFs presenting significant activity changes (p=1.00E-03):\n\n   Direct interactions between each of these TFs:\n   NFYB interacts with TP53\n   TP53 interacts with CEBPB\n\n   Other ways these TFs are related:\n   CEBPB, CEBPE, ELK1, HLF, NFIA, NFIB, NFIX, NFYB, NRF1, TP53, ZNF180, ZNF341, ZNF396, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: located in nucleus\n\n   ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: has component Zinc finger, C2H2 type\n\n   ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, and ZNF540: has function metal ion binding\n\n   ZNF180, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: has component KRAB box\n\n   CEBPB, CEBPD, CEBPE, NFYB, TP53, and ZNF396: has function protein heterodimerization activity\n\n   ZNF180, ZNF432, ZNF441, ZNF519, ZNF529, and ZNF540: molecularly interacts with KRAB-ZNF / KAP Complex [nucleoplasm]\n\n   CEBPB, CEBPD, CEBPE, and HLF: has component Basic region leucine zipper\n\n   [...]\n\n   CEBPB and CEBPD: interacts with ATF4\n\n   CEBPB and CEBPD: interacts with CEBPA\n\n   CEBPB and CEBPD: participates in positive regulation of osteoblast differentiation\n\n   CEBPB and CEBPE: participates in cellular response to lipopolysaccharide\n\n   CEBPB and CEBPE: participates in defense response to bacterium\n\nUsage examples\n~~~~~~~~~~~~~~\n\nUnpack the motif files (see below for how to create your own, instead):\n\n::\n\n   $ mkdir /path/to/grch38_motifs\n   $ tar xvfz motifs/HOCOMOCO_v11_p1e-6_grch38.tar.gz --directory /path/to/grch38_motifs\n\nCalculate the MD-scores for the first biological condition:\n\n::\n\n   $ process_atac --threads 8 --atac-peaks /path/to/DMSO/ATAC/peaks/file \\\n     --genome hg38 \\\n     --motif-path /path/to/directory/containing/motif/files \\\n     --output /path/to/output/directory\n\nThe above command generates a file called ``BASENAME_md_scores.txt``.\nIt\u2019s generally a good idea to use the cell type (or sample number) and a\nbrief condition description (e.g.\u00a0``k562_DMSO`` or\n``SRR1234123_Metronidazole``) in the file name provided.\n\nWe would then generate the same file, for the other biological condition\nwe are comparing against:\n\n::\n\n   $ process_atac --threads 8 --atac-peaks /path/to/treatment/ATAC/peaks/file \\\n     --genome hg38 \\\n     --motif-path /path/to/directory/containing/motif/files \\\n     --output /path/to/output/directory\n\nThe above generates a file called ``BASENAME_md_scores.txt``. Finally:\n\n::\n\n   $ differential_md_score --assay-1 DMSO --assay-2 Treatment --p-value 0.0000001 --barcodes --output /path/to/output/directory\n\nThe above generates a tab-delimited file with all differential MD scores\nfor each motif and their p-values (sorted by p-value), an MA plot that\nlabels the most significant TF activity changes, at a p-value cutoff of\n1e-7. Note that better plot-friendly condition names (say, \u201cDMSO\u201d and\n\u201cTreatment\u201d) can be provided using the ``--label-1`` and ``--label-2``\narguments. The plots look like the example below:\n\n.. figure:: ./doc_files/sample_MA_plot.png\n   :alt: Sample MA plot\n\n   Sample MA plot\n\nThe ``-b`` flag also generates a \u201cbarcode plot\u201d of each of these\nstatistically significat motif differences that depicts how close the\nATAC-seq peaks were to the (putative TF-binding) motif sites, within a\n1500 base-pair radius of the motif center:\n\n.. figure:: ./doc_files/sample_barcode_plot.png\n   :alt: Sample barcode plot\n\n   Sample barcode plot\n\nIf you can take advantage of multiprocessing, you can calculate\nMD-scores for both conditions simultaneously, assigning several threads\nto each, then generate the plots once both ``*_md_scores.txt`` files are\nready.\n\nThe columns for the tab-separated output file from\n``differential_md_score`` are:\n\n::\n\n   Motif name , p-value , # total motif hits, # nearby peaks in condition 1 , # nearby peaks in condition 2 , MD-score in condition 1 , MD-score in condition 2, differential MD-score\n\nTo query what we know about these highlighted TFs displaying a\nsignificant difference in activity, we can use:\n\n::\n\n   $ tf_result_explanations -p 0.0000001 \\\n     -d /path/to/output/directory/your_condition1_vs_condition2_differential_md_score.txt \\\n     -o /path/to/report_file.txt\n\nAdditional Arguments\n--------------------\n\nGenome File\n~~~~~~~~~~~\n\nIf your genome is not incuded in the UCSC genome repository, you will\ninstead need to provide a chromosome sizes file in processess_atac. This\ncan be generated using\n`samtools <http://www.htslib.org/doc/samtools.html>`__ faidx as follows:\n\n::\n\n   $ samtools faidx genome.fa\n   $ cut -f1,2 genome.fa.fai > genome.chrom.sizes\n\nThis file can then be specified using the ``-c``/``--chromosomes``\nargument in ``process_atac``. Scaffold chromosomes will be removed.\nPlease note that the ``--genome`` and ``--chromosomes`` arguments are\nmutually exclusive, and that the ``--genome`` argument assumes the\n``[fetchChromsizes](https://anaconda.org/bioconda/ucsc-fetchchromsizes)``\ntool is installed.\n\nNormalization\n~~~~~~~~~~~~~\n\nIf the -g/\u2013global-normalization argument is used in the\n``differential_md_score`` module, then the total number of genome-wide\nmotif hits will be used to normalize the barcode plots. This may be\nhelpful in better assessing changes across different\nperturbations/experiments or cell lines. Otherwise, the barcode plots\nwill simply be set to the same max heat to facilitate better\nvisualization of relative motif density between conditions. This\nnormalization argument has also been implemented in the ``barcode_plot``\nplotting module and as such the output stats file from\n``differential_md_score`` now include the total number of genome-wide\nmotif hits for each motif.\n\n**NOTE** Using the -g/\u2013global-normalization argument will also result in\na different heatmap color (YlOrRd) rather than the default (cividis) so\nthat the two normalization types are easily differentiated.\n\nAltering Window Size\n~~~~~~~~~~~~~~~~~~~~\n\nWhile we strongly recommend using the default 1500bp radius window in\ncalculating the MD score (and differential MD score), as of v0.3.0 we\nnow have a radius argument (``-r``/``--radius``) which will allow you to\nexpand or shrink this window. If changed, the MD score calculation will\nfollow the same principle in that it will be a ratio of motifs hits\nwithin the cetner 1/10th of the window relative divided by the number of\ntotal motif hits within the window. For example, if the user specifies a\nradius of 2000bp, there will be a window size of 4000bp, a center of\n400bp around the features of interest, and the MD score will be # motif\nhits within 400bp/ # motifs within 4000bp. Keep in mind that expanding\nthis window may be useful in visualization, but will result in an MD\nscore approaching 0.1 (background).\n\nMotif Files\n~~~~~~~~~~~\n\nFeel free to use the pre-scanned motif files provided,\nHOCOMOCO_v11_p1e-6_grch38.tar.gz(`mirror\n1 <http://dowell.colorado.edu/pubs/DAStk/motifs/HOCOMOCO_v11_p1e-6_grch38.tar.gz>`__,\n`mirror\n2 <https://drive.google.com/file/d/19D1iW9x0mswiFLoj6hrDFjVfhYmAbLqG/view?usp=sharing>`__),\nHOCOMOCO_v11_p1e-6_hg19.tar.gz(`mirror\n1 <http://dowell.colorado.edu/pubs/DAStk/motifs/HOCOMOCO_v11_p1e-6_hg19.tar.gz>`__,\n`mirror\n2 <https://drive.google.com/file/d/10_0kuPQbswmhoazjvEJ1KfGRJDuJ-O0y/view?usp=sharing>`__)\nand HOCOMOCO_v11_p1e-6_mm10.tar.gz (`mirror\n1 <http://dowell.colorado.edu/pubs/DAStk/motifs/HOCOMOCO_v11_p1e-6_mm10.tar.gz>`__,\n`mirror\n2 <https://drive.google.com/file/d/1qCirs0AfHzFwnbXMEa8vTd06tEiyE42Z/view?usp=sharing>`__\nfor the ``GRCh38/hg38``, ``hg19`` and ``mm10`` reference genomes,\nrespectively. They have been generated from HOCOMOCO\u2019s v11\nmononucleotide model, background-corrected for each reference genome. To\ngenerate your own ``bed`` files for each motif from this or any other\nsource, you can use FIMO in combination with the downloaded ``.meme``\nfiles from your TF database of choice. For example, if using HOCOMOCO,\nyou can create the motif file for TP53 using their mononucleotide model\nwith a p-value threshold of 0.000001 by:\n\n::\n\n   $ fimo -max-stored-scores 10000000 --thresh 1e-6 -oc /path/to/output/directory -motif /path/to/motif/file \\\n     /path/to/HOCOMOCOv11_HUMAN_mono_meme_format.meme /path/to/whole_genome.fa\n\nPlease refer to the complete\n`FIMO <http://meme-suite.org/doc/fimo.html?man_type=web>`__\ndocumentation for any questions.\n\n--------------\n\nCitation\n~~~~~~~~\n\nPlease cite DAStk if you have used it in your research!\n\n*Tripodi, I.J.; Allen, M.A.; Dowell, R.D.* `Detecting Differential\nTranscription Factor Activity from ATAC-Seq\nData <https://www.mdpi.com/1420-3049/23/5/1136>`__\\ *. Molecules 2018,\n23, 1136.*\n\nIf you have used the provided scanned motif regions from HOCOMOCO,\nplease cite them as well:\n\n*Kulakovskiy, I.V., Vorontsov, I.E., Yevshin, I.S., Sharipov, R.N.,\nFedorova, A.D., Rumynskiy, E.I., Medvedeva, Y.A., Magana-Mora, A.,\nBajic, V.B., Papatsenko, D.A., et al.\u00a0(2018).* `HOCOMOCO: towards a\ncomplete collection of transcription factor binding models for human and\nmouse via large-scale ChIP-Seq\nanalysis <https://academic.oup.com/nar/article/46/D1/D252/4616875>`__\\ *.\nNucleic Acids Res 46, D252\u2013D259.*\n\nFor any questions or bug reports, please use the Issue Tracker.\n\n| *Ignacio Tripodi (ignacio.tripodi at colorado.edu)*\n| *Computer Science Department, BioFrontiers Institute*\n| *University of Colorado, Boulder, USA*\n\n| *Margaret Gruca (margaret.gruca at colorado.edu)*\n| *BioFrontiers Institute*\n| *University of Colorado, Boulder, USA*",
    "bugtrack_url": null,
    "license": "BSD",
    "summary": "Differential ATAC-seq toolkit",
    "version": "1.0.1.post1",
    "split_keywords": [
        "bioinformatics",
        "genomics",
        "chromatin",
        "atac-seq",
        "motif",
        "transcription_factor",
        "tf"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6bd93ff879c202f7a9cebfc5ec605c39f6ef5c06aa264cec092e1851fb034f7a",
                "md5": "e921d3f92206a0be75c5fcc217b62f36",
                "sha256": "99dc29ab7ca7ccc6bf56e0f1ce0aa3be310f5c3513c67e011465533554a8ac8f"
            },
            "downloads": -1,
            "filename": "DAStk-1.0.1.post1.tar.gz",
            "has_sig": false,
            "md5_digest": "e921d3f92206a0be75c5fcc217b62f36",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 1160389,
            "upload_time": "2023-01-26T22:13:59",
            "upload_time_iso_8601": "2023-01-26T22:13:59.850655Z",
            "url": "https://files.pythonhosted.org/packages/6b/d9/3ff879c202f7a9cebfc5ec605c39f6ef5c06aa264cec092e1851fb034f7a/DAStk-1.0.1.post1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-26 22:13:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "Dowell-Lab",
    "github_project": "DAStk",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "dastk"
}
        
Elapsed time: 0.04929s