pyLoopSage

Name	pyLoopSage JSON
Version	1.0.17 JSON
	download
home_page	https://github.com/SFGLab/pyLoopSage
Summary	An energy-based stochastic model of loop extrusion in chromatin.
upload_time	2024-11-21 09:02:35
maintainer	None
docs_url	None
author	Sebastian Korsak
requires_python	>=3.10
license	GNU General Public License v3.0
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # pyLoopSage
Updated version of the stochastic loop extrusion model: LoopSage with capability to run incredibly fast, parallelized across CPU cores. This package is even more user-friendly and it can be installed via PyPI.

## New features

- More user-friendly environment.
- Installable with `pip install pyLoopSage`.
- Parralelization of the stochastic simulation across multiple CPUs.
- Capability of modelling the whole chromosome.
- Visualization functions.
- Capability to run form terminal with a simple command `loopsage -c config.ini`.

## Publication
Please cite the method paper in case that you would like to use this model for your work,

* Korsak, Sevastianos, and Dariusz Plewczynski. "LoopSage: An energy-based Monte Carlo approach for the loop extrusion modelling of chromatin." Methods (2024).

## The model

### Stochastic Simulation
Before simulating 3D structures, a stochastic simulation takes place that it purely applies in the locations of cohesins. The purpose of it is to create realistic ensemble of cohesin locations, and recontruct their trajectories. Otherwise, a molecular simulation alone, could not have enough variability to reconstruct the experimental heatmaps with the same fixed positions. Therefore, the stochastic simulation can be seen as a generator of cohesin possible configurations.

We have a polymer chain with $N_{\text{beads}}$ number of monomers. In general we can scale by deault the granularity of the simulation so as to give reasonable results. Therefore if we have a `region` counted in genomic coordinates, we can assume that `N_beads=(region[1]-region[0])//2000`.

Let's assume that each cohesin $i$ can be represented of two coordinates $(m_{i},n_{i})$ we allow three moves in our simulation:

* Slide both locations randomly (linearly or as 1D random walk) or
* Rebind somewhere else.

In general a good working assumption is that the number of cohesins (or loop extrusion factors LEFs) is $N_{\text{lef}}=2N_{\text{CTCF}}$.

The main idea of the algorithm is to ensemble loop extrusion from a Boltzmann probability distribution, with Hamiltonian,

$$E = c_{\text{fold}}\sum_{i=1}^{N_{\text{coh}}}\log(n_i-m_i)+c_{\text{cross}}\sum_{i,j}K(m_i,n_i;m_j,n_j)+c_{\text{bind}}\sum_{i=1}^{N_{\text{coh}}}\left(L(m_i)+R(n_i)\right)$$

The first term corresponds to the folding of chromatin, and the second term is a penalty for the appearance of crosss. Therefore, we have the function,
$K(m_{i},n_{i};m_{j},n_{j})$ which takes the value 1 when $m_{i} < m_{j} < n_{i} < n_{j}$ or $m_{i}=m_{j}$ or $m_{i}=n_{j}$.

These $L(\cdot), R(\cdot)$ functions are two functions that define the binding potential and they are orientation specific - so they are different for left and right position of cohesin (because CTCF motifs are orientation specific), therefore when we have a gap in these functions, it means presence of CTCF. These two functions are derived from data with CTCF binning and by running the script for probabilistic orientation. Moreover, by $N_{(\cdot)}$ we symbolize the normalization constants for each factor,

$$c_{\text{fold}}=-\dfrac{N_{\text{beads}}f}{N_{\text{lef}}\log(N_{\text{beads}}/N_{\text{lef}})},\quad c_{\text{bind}}=-\dfrac{N_{\text{beads}}b}{\sum_i \left(L(m_i)+R(n_i)\right)},\quad c_{\text{cross}}=\kappa \times 10^4.$$

The parameters are defined in such a way that when $f=b=\kappa=1$, the three terms of the stochastic energy are balanced. To clarrify, this parametrization is slightly different than this of the original paper. We multiply with $N_{\text{beads}}$, so as to keep the input parameters $f,b,\kappa$ simpler close to 1. Therefore, a value $f=0.1$ leads to unfolded structures, and a value $f=2$ it leads to very fast propagating long loops. 

And the energy difference can be expressed as the energy difference of each term,

$$\Delta E = \Delta E_{\text{fold}}+\Delta E_{\text{cross}}+\Delta E_{\text{bind}}.$$

In this manner we accept a move in two cases:

* If $\Delta E<0$ or,
* if $\Delta E>0$ with probability $e^{-\Delta E/kT}$.

And of course, the result - the distribution of loops in equilibrium -  depends on temperature of Boltzmann distribution $T$.

#### Modelling with two families of cohesins

In this version of LoopSage it is possible to have two families of cohesins (LEFs) as well, with different folding coefficients (which is equivalent to different cohesin speeds). Therefore, we could write the energy of folding,

$$E_{\text{fold}} = c_{\text{fold,1}}\sum_{i=1}^{N_{\text{coh,1}}}\log(n_i-m_i)+c_{\text{fold,2}}\sum_{i=N_{\text{coh,1}}}^{N_{\text{coh,1}}+N_{\text{coh},2}}\log(n_i-m_i)$$

The coefficient of the first family of LEFs is set to one by default $f_1 = 1$, whereas the other one is set to zero $f_2 = 0$. Usually a small amount of fast cohesins with $f_2>f_1$ can reconstruct more long-range loops, because the cohesins have time to extrude. Contrary, leads in very stable small loops.

### Molecular Simulation

Let us consider a system comprising $N_{\text{lef}}$ LEFs, as well as two matrices, $M$ and $N$, both of which possess dimensions $N_{\text{lef}}\times N_{\text{steps}}$. These matrices represent the respective constraints associated with each LEF. Consequently, we define a time-dependent force field as follows:

$$E(t) = E_{\text{bond}}+E_{\text{angle}}+E_{\text{rep}}+E_{\text{loop}}(t)$$

where,

- $E_{\text{bond}}$ corresponds to a typical harmonic bond force field that connects adjacent beads $i$ and $i+1$, with an equilibrium length of $x_{0}=0.1 \text{nm}$ and a Hook constant assumed to be $k=3\times 10^5 \text{kJ/(mol}\cdot \text{nm}^2)$.
- $E_{\text{angle}}$ a harmonic angle force that connects beads $i-1,i,i+1$, and has equilibrium angle $\theta_{0}=\pi$ and Hook constant $200  kJ/(mol\cdot nm^2)$. The strength of the angle force, can be tuned by the user.
- $E_{\text{rep}}$ which is a repelling forcefield of the form: $$E_{\text{rep}} = \epsilon\left(\frac{\sigma_{1}+\sigma_{2}}{r}\right)^{\alpha}$$ where $\epsilon=10 kJ/mol$ and $\sigma_{1}=\sigma_{2}=0.05 nm$. The power $\alpha$ is a parameter of the simulation, but it is set as $\alpha=3$ by default.
- $E_{\text{loop}}(t)$ represents a time-dependent LE force. This force reads the matrices $M$ and $N$, applying a distinct set of constraints $C_{t_i}=(m_j(t_i),n_j(t_i))$ at each time step $t_i$. Each LEF $j$ is subjected to specific constraints $m_{j,t_i}$ and $n_{j,t_i}$. The functional form of this force is also a harmonic bond force, with parameters $x_{0}=0.1 \text{nm}$ and a Hook constant assumed to be $k=5\times 10^4 \text{kJ/(mol}\cdot \text{nm}^2)$. The strength and equillibrium distance of the looping bonds can also be set in different way, if the user wishes.
  
For the implementation of this model in python, we used OpenMM and CUDA acceleration. To minimize the energy Langevin dynamics were used, in temperature of $T_{L}=310 K$, friction coefficient $\gamma = 0.05  psec^{-1}$ and time step $t_{s}=100 fsec$. Note that the temperature of molecular dynamics simulation is independent from the temperature of stochastic simulation and they represent different physical realities. 

In general the user can run simulation in two different ways:

1. **Energy minimization (EM)**: It means that for each sample of cohesin positions $C_{t_i}=(m_j(t_i),n_j(t_i))$ start from a different initial structure (usually 3D random walk) and we apply the forcefield. For each structure we start from a different initial condition. In general, it is suggested to run the model in this way because it is faster, less prone to errors and the structures are not correlated to each other.
2. **Molecular Dynamics (MD)**: In this case we have only one initial structure, we minimize the energy according to the forcefield only once and then we run a molecular dynamics simulation over time. This creates a continuous trajectory of structures, and it is cool for visualization pruposes. Novertheness, it is not suggested because it is more prone to errors and can lead to autocorrelated structures. We suggest our users to set the power of excluded volume $\alpha=1$ for the purposes of MD simulation. However, this depends on the research interests of the user.

## Installation

Can be easily installed with `pip install pyLoopSage`. To have CUDA acceleration, it is needed to have cuda-toolkit installed in case that you use nvidia drivers (otherwise you can use OpenCL or parallelization across CPU cores).

## How to use?

### Python Implementation

The main script is `stochastic_simulation.py`. However, the implementation of the code is very easy and it can be described in the following lines,

```python
# Definition of Monte Carlo parameters
import loopsage.stochastic_simulation as lps

N_steps, MC_step, burnin, T, T_min = int(4e4), int(5e2), 1000, 2.5, 1.0
mode = 'Metropolis'

# Simulation Strengths
f, b, kappa = 1.0, 1.0, 1.0

# Definition of region
region, chrom = [15550000,16850000], 'chr6'

# Definition of data
output_name='../HiChIP_Annealing_T1_MD_region'
bedpe_file = '/home/skorsak/Data/HiChIP/Maps/hg00731_smc1_maps_2.bedpe'

sim = lps.StochasticSimulation(region,chrom,bedpe_file,out_dir=output_name,N_beads=1000)
Es, Ms, Ns, Bs, Ks, Fs, ufs = sim.run_energy_minimization(N_steps,MC_step,burnin,T,T_min,mode=mode,viz=True,save=True)
sim.run_EM('CUDA')
```

Firstly, we need to define the input files from which LoopSage would take the information to construct the potential. We define also the specific region that we would like to model. Therefore, in the code script above we define a `bedpe_file` from which information about the CTCF loops it is imported.

Note that the `.bedpe_file` must be in the following format,

```
chr1	903917	906857	chr1	937535	939471	16	3.2197903072213415e-05	0.9431392038374097
chr1	979970	987923	chr1	1000339	1005916	56	0.00010385804708107556	0.9755733944997329
chr1	980444	982098	chr1	1063024	1065328	12	0.15405319074060866	0.999801529750033
chr1	981076	985322	chr1	1030933	1034105	36	9.916593137693526e-05	0.01617512105347667
chr1	982171	985182	chr1	990837	995510	27	2.7536240913152036e-05	0.5549511180231224
chr1	982867	987410	chr1	1061124	1066833	71	1.105408615726611e-05	0.9995462969421808
chr1	983923	985322	chr1	1017610	1019841	11	1.7716275555648395e-06	0.10890923034907056
chr1	984250	986141	chr1	1013038	1015474	14	1.7716282101935205e-06	0.025665007111095667
chr1	990949	994698	chr1	1001076	1003483	34	0.5386388489931403	0.9942742844900859
chr1	991375	993240	chr1	1062647	1064919	15	1.0	0.9997541297643132
```

where the last two columns represent the probabilites for left and right anchor respectively to be tandem right. If the probability is negative it means that no CTCF motif was detected in this anchor. You can extract these probabilities from the repo: https://github.com/SFGLab/3d-analysis-toolkit, with `find_motifs.py` file. Please set `probabilistic=True` `statistics=False`.


To use this script you need to have a python 3.8 environment, and install the following dependencies from this file: [mateusz_script_env.txt](https://github.com/user-attachments/files/17568248/mateusz_script_env.txt).

You also need to download the reference genome from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

*Alternativelly, it is possible to import a .bedpe file without the last two columns (CTCF orientation). In this case, CTCF would act as an orientation independent barrier. This might affect slightly the results, but it is an easier option, if you do not want to run a CTCF motif finding script.*

Then, we define the main parameters of the simulation `N_beads,N_coh,kappa,f,b` or we can choose the default ones (take care because it might be the case that they are not the appropriate ones and they need to be changed), the parameters of Monte Carlo `N_steps, MC_step, burnin, T`, and we initialize the class `LoopSage()`. The command `sim.run_energy_minimization()` corresponds to the stochastic Monte Carlo simulation, and it produces a set of cohesin constraints as a result (`Ms, Ns`). Note that the stochastic simulation has two modes: `Annealing` and `Metropolis`. We feed cohesin constraints to the molecular simulation part of and we run `MD_LE()` or `EM_LE()` simulation which produces a trajectory of 3d-structures, and the average heatmap. `MD_LE()` function can produce an actual trajectory and a `.dcd` video of how simulation changes over time. However, the implementation needs a high amount of memory since we need to define a bond for each time step, and it may not work for large systems. `EM_LE()` is suggested for large simulations, because it does not require so big amount of memory. *We would like to advise to our users that in case that they would like to use molecular dynamics (`MD`), it is better to choose small `MC_step`, because the trajectory of LEFs should be continuous.*

### Running LoopSage from command-line
To run LoopSage from command-line, you only need to type a command

```bash
loopsage -c config.ini
```

With this command the model will run with parameters specified in `config.ini` file. An example of a `config.ini` file would be the following,

```txt
[Main]

; Input Data and Information
BEDPE_PATH = /home/skorsak/Data/HiChIP/Maps/hg00731_smc1_maps_2.bedpe
REGION_START = 15550000
REGION_END = 16850000
CHROM = chr6
OUT_PATH = ../HiChIP_Annealing_T15_MD_region

; Simulation Parameters
N_BEADS = 1000
N_STEPS = 40000
MC_STEP = 500
BURNIN = 1000
T_INIT = 1.5
T_FINAL = 0.01
METHOD = Metropolis

; Molecular Dynamics
PLATFORM = CUDA
INITIAL_STRUCTURE_TYPE = rw
SIMULATION_TYPE = EM 
TOLERANCE = 1.0
``` 

### Visualization with PyVista

There are many tools for visualization of polymer structures. A very good one is UCSF chimera: https://www.cgl.ucsf.edu/chimera/. Usually these visualization tools work well for proteins, but we can use them for chromatin as well.

LoopSage offers its own visualization which relies in the `pyvista` python library. To visualize a structure, you can run some very simple commands, which call LoopSage functions,

```python
import loopsage.vizualization_tools as vz
import loopsage.utils as uts

V = uts.get_coordinates_cif('/home/skorsak/Projects/mine/LoopSage/HiChIP_Annealing_T15_EM_region/ensemble/EMLE_1.cif')
vz.viz_structure(V)
```

The output should be something like that,

![image](https://github.com/user-attachments/assets/457d2b1f-e037-4ff1-8cec-c9eef7de789d)


In case that you would like to create a continuous video from the enseble of structures, you can use the following command, which would generate an interactive video in gif format which would show how the structure changes in 3D. The command includes quaternion Kabsch aligment as well.

```python
import loopsage.vizualization_tools as vz

vz.interactive_plot('/home/skorsak/Projects/mine/LoopSage/HiChIP_Annealing_T15_EM_region/ensemble')
```

### Output Files
In the output files, simulation produces one folder with 4 subfolders. In subfolder `plots`, you can find plots that are the diagnostics of the algorithm. One of the most basic results you should see is the trajectories of cohesins (LEFs). this diagram should look like that,

![coh_trajectories](https://github.com/SFGLab/LoopSage/assets/49608786/f73ffd2b-8359-4c6d-958b-9a770d4834ba)

In this diagram, each LEF is represented by a different colour. In case of Simulated Annealing, LEFs should shape shorter loops in the first simulation steps, since they have higher kinetic energy due to the high temperature, and very stable large loops in the final steps if the final temperature $T_f$ is low enough. Horizontal lines represent the presence of CTCF points. In case of Metropolis, the distribution of LEFs should look invariant in respect to computational time,

![coh_trajectories](https://github.com/SFGLab/LoopSage/assets/49608786/b48e2383-2509-4c68-a726-91a5e61aabf3)

Good cohesin trajectory diagrams should be like the ones previously shown, which means that we do not want to see many unoccupied (white) regions, but we also do not like to see static loops. If the loops are static then it is better to choose higher temperature, or bigger number of LEFs. If the loops are too small, maybe it is better to choose smaller temperature.

Now, to reassure that our algorithm works well we need to observe some fundamental diagnostics of Monte Carlo algorithms. Some other important diagnostics can be seen in the following picture,

![github_diagnostics](https://github.com/SFGLab/LoopSage/assets/49608786/b90e2355-be84-47c4-906f-5a2a62497b26)

In graph A, we can see the plot of the energy as a function of simulation time. In Metropolis after some steps, the simulation should reach equillibrium after the defined burnin period (blue vertical line). In case of Simulated Annealing, the energy should decrease as function of time because the temperature decreases, and thus the energy should not be conserved. Autocorrelations (B), show us if the thermodyncamic ensembles that we obtained are autocorrelated or not. The Monte Carlo step (sampling frequency) should be big enough so as to have small autocorrelations (<0.5). The averaged heatmap (C), shows the simulated heatmap, after averaging inversed all-versus-all distances of the region of interest. Finally, (D) shows the Pearson correlation between projected 1D signal of heatmap of experimental and simulated data.

In the output folder there are another three subfolders: 
* `ensembles` has the ensembles of 3D structures in `.cif` format (it can open with vizualization software Chimera: https://www.cgl.ucsf.edu/chimera/) or `pyvista`,
* `heatmaps` with the inversed all-versus-all distance heatmap of each one of these structures.
* `other` here are some numpy arrays and some computed statistics. Numpy arrays like `Ms` and `Ns` have the degrees of freedoms of LEFs over time, then `Fs, Ks, Es` they have folding, corssing energy and total energy over time. `Ts` is the temperature over time. And finally in `other.txt` you can see the statistics of the simulation and the input parameters. In `correlations.txt` you can find a file with Pearson, Spearmann and Kendau tau correlations between estimated and experimental data. We provide an optimistic simulations where zeros of the signal are taken into account, and a pessimistic one where the zeros are not taken into account.

An example, illustrated with Chimera software, simulated trajectory of structures after running Simulated Annealing and molecular dynamics.

![siman_traj_GitHub](https://github.com/SFGLab/LoopSage/assets/49608786/c6626641-f709-46e0-b01b-42566b1829ef)

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/SFGLab/pyLoopSage",
    "name": "pyLoopSage",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Sebastian Korsak",
    "author_email": "s.korsak@datascience.edu.pl",
    "download_url": "https://files.pythonhosted.org/packages/1c/f1/3270386e220bea717406fafd6ef786871af3932da048bc9a7607e4d4a17b/pyloopsage-1.0.17.tar.gz",
    "platform": null,
    "description": "# pyLoopSage\nUpdated version of the stochastic loop extrusion model: LoopSage with capability to run incredibly fast, parallelized across CPU cores. This package is even more user-friendly and it can be installed via PyPI.\n\n## New features\n\n- More user-friendly environment.\n- Installable with `pip install pyLoopSage`.\n- Parralelization of the stochastic simulation across multiple CPUs.\n- Capability of modelling the whole chromosome.\n- Visualization functions.\n- Capability to run form terminal with a simple command `loopsage -c config.ini`.\n\n## Publication\nPlease cite the method paper in case that you would like to use this model for your work,\n\n* Korsak, Sevastianos, and Dariusz Plewczynski. \"LoopSage: An energy-based Monte Carlo approach for the loop extrusion modelling of chromatin.\" Methods (2024).\n\n## The model\n\n### Stochastic Simulation\nBefore simulating 3D structures, a stochastic simulation takes place that it purely applies in the locations of cohesins. The purpose of it is to create realistic ensemble of cohesin locations, and recontruct their trajectories. Otherwise, a molecular simulation alone, could not have enough variability to reconstruct the experimental heatmaps with the same fixed positions. Therefore, the stochastic simulation can be seen as a generator of cohesin possible configurations.\n\nWe have a polymer chain with $N_{\\text{beads}}$ number of monomers. In general we can scale by deault the granularity of the simulation so as to give reasonable results. Therefore if we have a `region` counted in genomic coordinates, we can assume that `N_beads=(region[1]-region[0])//2000`.\n\nLet's assume that each cohesin $i$ can be represented of two coordinates $(m_{i},n_{i})$ we allow three moves in our simulation:\n\n* Slide both locations randomly (linearly or as 1D random walk) or\n* Rebind somewhere else.\n\nIn general a good working assumption is that the number of cohesins (or loop extrusion factors LEFs) is $N_{\\text{lef}}=2N_{\\text{CTCF}}$.\n\nThe main idea of the algorithm is to ensemble loop extrusion from a Boltzmann probability distribution, with Hamiltonian,\n\n$$E = c_{\\text{fold}}\\sum_{i=1}^{N_{\\text{coh}}}\\log(n_i-m_i)+c_{\\text{cross}}\\sum_{i,j}K(m_i,n_i;m_j,n_j)+c_{\\text{bind}}\\sum_{i=1}^{N_{\\text{coh}}}\\left(L(m_i)+R(n_i)\\right)$$\n\nThe first term corresponds to the folding of chromatin, and the second term is a penalty for the appearance of crosss. Therefore, we have the function,\n$K(m_{i},n_{i};m_{j},n_{j})$ which takes the value 1 when $m_{i} < m_{j} < n_{i} < n_{j}$ or $m_{i}=m_{j}$ or $m_{i}=n_{j}$.\n\nThese $L(\\cdot), R(\\cdot)$ functions are two functions that define the binding potential and they are orientation specific - so they are different for left and right position of cohesin (because CTCF motifs are orientation specific), therefore when we have a gap in these functions, it means presence of CTCF. These two functions are derived from data with CTCF binning and by running the script for probabilistic orientation. Moreover, by $N_{(\\cdot)}$ we symbolize the normalization constants for each factor,\n\n$$c_{\\text{fold}}=-\\dfrac{N_{\\text{beads}}f}{N_{\\text{lef}}\\log(N_{\\text{beads}}/N_{\\text{lef}})},\\quad c_{\\text{bind}}=-\\dfrac{N_{\\text{beads}}b}{\\sum_i \\left(L(m_i)+R(n_i)\\right)},\\quad c_{\\text{cross}}=\\kappa \\times 10^4.$$\n\nThe parameters are defined in such a way that when $f=b=\\kappa=1$, the three terms of the stochastic energy are balanced. To clarrify, this parametrization is slightly different than this of the original paper. We multiply with $N_{\\text{beads}}$, so as to keep the input parameters $f,b,\\kappa$ simpler close to 1. Therefore, a value $f=0.1$ leads to unfolded structures, and a value $f=2$ it leads to very fast propagating long loops. \n\nAnd the energy difference can be expressed as the energy difference of each term,\n\n$$\\Delta E = \\Delta E_{\\text{fold}}+\\Delta E_{\\text{cross}}+\\Delta E_{\\text{bind}}.$$\n\nIn this manner we accept a move in two cases:\n\n* If $\\Delta E<0$ or,\n* if $\\Delta E>0$ with probability $e^{-\\Delta E/kT}$.\n\nAnd of course, the result - the distribution of loops in equilibrium -  depends on temperature of Boltzmann distribution $T$.\n\n#### Modelling with two families of cohesins\n\nIn this version of LoopSage it is possible to have two families of cohesins (LEFs) as well, with different folding coefficients (which is equivalent to different cohesin speeds). Therefore, we could write the energy of folding,\n\n$$E_{\\text{fold}} = c_{\\text{fold,1}}\\sum_{i=1}^{N_{\\text{coh,1}}}\\log(n_i-m_i)+c_{\\text{fold,2}}\\sum_{i=N_{\\text{coh,1}}}^{N_{\\text{coh,1}}+N_{\\text{coh},2}}\\log(n_i-m_i)$$\n\nThe coefficient of the first family of LEFs is set to one by default $f_1 = 1$, whereas the other one is set to zero $f_2 = 0$. Usually a small amount of fast cohesins with $f_2>f_1$ can reconstruct more long-range loops, because the cohesins have time to extrude. Contrary, leads in very stable small loops.\n\n### Molecular Simulation\n\nLet us consider a system comprising $N_{\\text{lef}}$ LEFs, as well as two matrices, $M$ and $N$, both of which possess dimensions $N_{\\text{lef}}\\times N_{\\text{steps}}$. These matrices represent the respective constraints associated with each LEF. Consequently, we define a time-dependent force field as follows:\n\n$$E(t) = E_{\\text{bond}}+E_{\\text{angle}}+E_{\\text{rep}}+E_{\\text{loop}}(t)$$\n\nwhere,\n\n- $E_{\\text{bond}}$ corresponds to a typical harmonic bond force field that connects adjacent beads $i$ and $i+1$, with an equilibrium length of $x_{0}=0.1 \\text{nm}$ and a Hook constant assumed to be $k=3\\times 10^5 \\text{kJ/(mol}\\cdot \\text{nm}^2)$.\n- $E_{\\text{angle}}$ a harmonic angle force that connects beads $i-1,i,i+1$, and has equilibrium angle $\\theta_{0}=\\pi$ and Hook constant $200  kJ/(mol\\cdot nm^2)$. The strength of the angle force, can be tuned by the user.\n- $E_{\\text{rep}}$ which is a repelling forcefield of the form: $$E_{\\text{rep}} = \\epsilon\\left(\\frac{\\sigma_{1}+\\sigma_{2}}{r}\\right)^{\\alpha}$$ where $\\epsilon=10 kJ/mol$ and $\\sigma_{1}=\\sigma_{2}=0.05 nm$. The power $\\alpha$ is a parameter of the simulation, but it is set as $\\alpha=3$ by default.\n- $E_{\\text{loop}}(t)$ represents a time-dependent LE force. This force reads the matrices $M$ and $N$, applying a distinct set of constraints $C_{t_i}=(m_j(t_i),n_j(t_i))$ at each time step $t_i$. Each LEF $j$ is subjected to specific constraints $m_{j,t_i}$ and $n_{j,t_i}$. The functional form of this force is also a harmonic bond force, with parameters $x_{0}=0.1 \\text{nm}$ and a Hook constant assumed to be $k=5\\times 10^4 \\text{kJ/(mol}\\cdot \\text{nm}^2)$. The strength and equillibrium distance of the looping bonds can also be set in different way, if the user wishes.\n  \nFor the implementation of this model in python, we used OpenMM and CUDA acceleration. To minimize the energy Langevin dynamics were used, in temperature of $T_{L}=310 K$, friction coefficient $\\gamma = 0.05  psec^{-1}$ and time step $t_{s}=100 fsec$. Note that the temperature of molecular dynamics simulation is independent from the temperature of stochastic simulation and they represent different physical realities. \n\nIn general the user can run simulation in two different ways:\n\n1. **Energy minimization (EM)**: It means that for each sample of cohesin positions $C_{t_i}=(m_j(t_i),n_j(t_i))$ start from a different initial structure (usually 3D random walk) and we apply the forcefield. For each structure we start from a different initial condition. In general, it is suggested to run the model in this way because it is faster, less prone to errors and the structures are not correlated to each other.\n2. **Molecular Dynamics (MD)**: In this case we have only one initial structure, we minimize the energy according to the forcefield only once and then we run a molecular dynamics simulation over time. This creates a continuous trajectory of structures, and it is cool for visualization pruposes. Novertheness, it is not suggested because it is more prone to errors and can lead to autocorrelated structures. We suggest our users to set the power of excluded volume $\\alpha=1$ for the purposes of MD simulation. However, this depends on the research interests of the user.\n\n## Installation\n\nCan be easily installed with `pip install pyLoopSage`. To have CUDA acceleration, it is needed to have cuda-toolkit installed in case that you use nvidia drivers (otherwise you can use OpenCL or parallelization across CPU cores).\n\n## How to use?\n\n### Python Implementation\n\nThe main script is `stochastic_simulation.py`. However, the implementation of the code is very easy and it can be described in the following lines,\n\n```python\n# Definition of Monte Carlo parameters\nimport loopsage.stochastic_simulation as lps\n\nN_steps, MC_step, burnin, T, T_min = int(4e4), int(5e2), 1000, 2.5, 1.0\nmode = 'Metropolis'\n\n# Simulation Strengths\nf, b, kappa = 1.0, 1.0, 1.0\n\n# Definition of region\nregion, chrom = [15550000,16850000], 'chr6'\n\n# Definition of data\noutput_name='../HiChIP_Annealing_T1_MD_region'\nbedpe_file = '/home/skorsak/Data/HiChIP/Maps/hg00731_smc1_maps_2.bedpe'\n\nsim = lps.StochasticSimulation(region,chrom,bedpe_file,out_dir=output_name,N_beads=1000)\nEs, Ms, Ns, Bs, Ks, Fs, ufs = sim.run_energy_minimization(N_steps,MC_step,burnin,T,T_min,mode=mode,viz=True,save=True)\nsim.run_EM('CUDA')\n```\n\nFirstly, we need to define the input files from which LoopSage would take the information to construct the potential. We define also the specific region that we would like to model. Therefore, in the code script above we define a `bedpe_file` from which information about the CTCF loops it is imported.\n\nNote that the `.bedpe_file` must be in the following format,\n\n```\nchr1\t903917\t906857\tchr1\t937535\t939471\t16\t3.2197903072213415e-05\t0.9431392038374097\nchr1\t979970\t987923\tchr1\t1000339\t1005916\t56\t0.00010385804708107556\t0.9755733944997329\nchr1\t980444\t982098\tchr1\t1063024\t1065328\t12\t0.15405319074060866\t0.999801529750033\nchr1\t981076\t985322\tchr1\t1030933\t1034105\t36\t9.916593137693526e-05\t0.01617512105347667\nchr1\t982171\t985182\tchr1\t990837\t995510\t27\t2.7536240913152036e-05\t0.5549511180231224\nchr1\t982867\t987410\tchr1\t1061124\t1066833\t71\t1.105408615726611e-05\t0.9995462969421808\nchr1\t983923\t985322\tchr1\t1017610\t1019841\t11\t1.7716275555648395e-06\t0.10890923034907056\nchr1\t984250\t986141\tchr1\t1013038\t1015474\t14\t1.7716282101935205e-06\t0.025665007111095667\nchr1\t990949\t994698\tchr1\t1001076\t1003483\t34\t0.5386388489931403\t0.9942742844900859\nchr1\t991375\t993240\tchr1\t1062647\t1064919\t15\t1.0\t0.9997541297643132\n```\n\nwhere the last two columns represent the probabilites for left and right anchor respectively to be tandem right. If the probability is negative it means that no CTCF motif was detected in this anchor. You can extract these probabilities from the repo: https://github.com/SFGLab/3d-analysis-toolkit, with `find_motifs.py` file. Please set `probabilistic=True` `statistics=False`.\n\n\nTo use this script you need to have a python 3.8 environment, and install the following dependencies from this file: [mateusz_script_env.txt](https://github.com/user-attachments/files/17568248/mateusz_script_env.txt).\n\nYou also need to download the reference genome from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa\n\n*Alternativelly, it is possible to import a .bedpe file without the last two columns (CTCF orientation). In this case, CTCF would act as an orientation independent barrier. This might affect slightly the results, but it is an easier option, if you do not want to run a CTCF motif finding script.*\n\nThen, we define the main parameters of the simulation `N_beads,N_coh,kappa,f,b` or we can choose the default ones (take care because it might be the case that they are not the appropriate ones and they need to be changed), the parameters of Monte Carlo `N_steps, MC_step, burnin, T`, and we initialize the class `LoopSage()`. The command `sim.run_energy_minimization()` corresponds to the stochastic Monte Carlo simulation, and it produces a set of cohesin constraints as a result (`Ms, Ns`). Note that the stochastic simulation has two modes: `Annealing` and `Metropolis`. We feed cohesin constraints to the molecular simulation part of and we run `MD_LE()` or `EM_LE()` simulation which produces a trajectory of 3d-structures, and the average heatmap. `MD_LE()` function can produce an actual trajectory and a `.dcd` video of how simulation changes over time. However, the implementation needs a high amount of memory since we need to define a bond for each time step, and it may not work for large systems. `EM_LE()` is suggested for large simulations, because it does not require so big amount of memory. *We would like to advise to our users that in case that they would like to use molecular dynamics (`MD`), it is better to choose small `MC_step`, because the trajectory of LEFs should be continuous.*\n\n### Running LoopSage from command-line\nTo run LoopSage from command-line, you only need to type a command\n\n```bash\nloopsage -c config.ini\n```\n\nWith this command the model will run with parameters specified in `config.ini` file. An example of a `config.ini` file would be the following,\n\n```txt\n[Main]\n\n; Input Data and Information\nBEDPE_PATH = /home/skorsak/Data/HiChIP/Maps/hg00731_smc1_maps_2.bedpe\nREGION_START = 15550000\nREGION_END = 16850000\nCHROM = chr6\nOUT_PATH = ../HiChIP_Annealing_T15_MD_region\n\n; Simulation Parameters\nN_BEADS = 1000\nN_STEPS = 40000\nMC_STEP = 500\nBURNIN = 1000\nT_INIT = 1.5\nT_FINAL = 0.01\nMETHOD = Metropolis\n\n; Molecular Dynamics\nPLATFORM = CUDA\nINITIAL_STRUCTURE_TYPE = rw\nSIMULATION_TYPE = EM \nTOLERANCE = 1.0\n``` \n\n### Visualization with PyVista\n\nThere are many tools for visualization of polymer structures. A very good one is UCSF chimera: https://www.cgl.ucsf.edu/chimera/. Usually these visualization tools work well for proteins, but we can use them for chromatin as well.\n\nLoopSage offers its own visualization which relies in the `pyvista` python library. To visualize a structure, you can run some very simple commands, which call LoopSage functions,\n\n```python\nimport loopsage.vizualization_tools as vz\nimport loopsage.utils as uts\n\nV = uts.get_coordinates_cif('/home/skorsak/Projects/mine/LoopSage/HiChIP_Annealing_T15_EM_region/ensemble/EMLE_1.cif')\nvz.viz_structure(V)\n```\n\nThe output should be something like that,\n\n![image](https://github.com/user-attachments/assets/457d2b1f-e037-4ff1-8cec-c9eef7de789d)\n\n\nIn case that you would like to create a continuous video from the enseble of structures, you can use the following command, which would generate an interactive video in gif format which would show how the structure changes in 3D. The command includes quaternion Kabsch aligment as well.\n\n```python\nimport loopsage.vizualization_tools as vz\n\nvz.interactive_plot('/home/skorsak/Projects/mine/LoopSage/HiChIP_Annealing_T15_EM_region/ensemble')\n```\n\n### Output Files\nIn the output files, simulation produces one folder with 4 subfolders. In subfolder `plots`, you can find plots that are the diagnostics of the algorithm. One of the most basic results you should see is the trajectories of cohesins (LEFs). this diagram should look like that,\n\n![coh_trajectories](https://github.com/SFGLab/LoopSage/assets/49608786/f73ffd2b-8359-4c6d-958b-9a770d4834ba)\n\nIn this diagram, each LEF is represented by a different colour. In case of Simulated Annealing, LEFs should shape shorter loops in the first simulation steps, since they have higher kinetic energy due to the high temperature, and very stable large loops in the final steps if the final temperature $T_f$ is low enough. Horizontal lines represent the presence of CTCF points. In case of Metropolis, the distribution of LEFs should look invariant in respect to computational time,\n\n![coh_trajectories](https://github.com/SFGLab/LoopSage/assets/49608786/b48e2383-2509-4c68-a726-91a5e61aabf3)\n\nGood cohesin trajectory diagrams should be like the ones previously shown, which means that we do not want to see many unoccupied (white) regions, but we also do not like to see static loops. If the loops are static then it is better to choose higher temperature, or bigger number of LEFs. If the loops are too small, maybe it is better to choose smaller temperature.\n\nNow, to reassure that our algorithm works well we need to observe some fundamental diagnostics of Monte Carlo algorithms. Some other important diagnostics can be seen in the following picture,\n\n![github_diagnostics](https://github.com/SFGLab/LoopSage/assets/49608786/b90e2355-be84-47c4-906f-5a2a62497b26)\n\nIn graph A, we can see the plot of the energy as a function of simulation time. In Metropolis after some steps, the simulation should reach equillibrium after the defined burnin period (blue vertical line). In case of Simulated Annealing, the energy should decrease as function of time because the temperature decreases, and thus the energy should not be conserved. Autocorrelations (B), show us if the thermodyncamic ensembles that we obtained are autocorrelated or not. The Monte Carlo step (sampling frequency) should be big enough so as to have small autocorrelations (<0.5). The averaged heatmap (C), shows the simulated heatmap, after averaging inversed all-versus-all distances of the region of interest. Finally, (D) shows the Pearson correlation between projected 1D signal of heatmap of experimental and simulated data.\n\nIn the output folder there are another three subfolders: \n* `ensembles` has the ensembles of 3D structures in `.cif` format (it can open with vizualization software Chimera: https://www.cgl.ucsf.edu/chimera/) or `pyvista`,\n* `heatmaps` with the inversed all-versus-all distance heatmap of each one of these structures.\n* `other` here are some numpy arrays and some computed statistics. Numpy arrays like `Ms` and `Ns` have the degrees of freedoms of LEFs over time, then `Fs, Ks, Es` they have folding, corssing energy and total energy over time. `Ts` is the temperature over time. And finally in `other.txt` you can see the statistics of the simulation and the input parameters. In `correlations.txt` you can find a file with Pearson, Spearmann and Kendau tau correlations between estimated and experimental data. We provide an optimistic simulations where zeros of the signal are taken into account, and a pessimistic one where the zeros are not taken into account.\n\nAn example, illustrated with Chimera software, simulated trajectory of structures after running Simulated Annealing and molecular dynamics.\n\n![siman_traj_GitHub](https://github.com/SFGLab/LoopSage/assets/49608786/c6626641-f709-46e0-b01b-42566b1829ef)\n",
    "bugtrack_url": null,
    "license": "GNU General Public License v3.0",
    "summary": "An energy-based stochastic model of loop extrusion in chromatin.",
    "version": "1.0.17",
    "project_urls": {
        "Homepage": "https://github.com/SFGLab/pyLoopSage"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "89560ab1e78ae4cf32b2676de4290ca9a08e996ea817f8f94d2bbf57497e079c",
                "md5": "614a880d8c735c61bdbe6b37a9ff6c64",
                "sha256": "2b90a5f2a36c3ecd76fa3f03b4933650469761ab00703a6d17f66a259a117954"
            },
            "downloads": -1,
            "filename": "pyLoopSage-1.0.17-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "614a880d8c735c61bdbe6b37a9ff6c64",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 51133,
            "upload_time": "2024-11-21T09:02:33",
            "upload_time_iso_8601": "2024-11-21T09:02:33.565537Z",
            "url": "https://files.pythonhosted.org/packages/89/56/0ab1e78ae4cf32b2676de4290ca9a08e996ea817f8f94d2bbf57497e079c/pyLoopSage-1.0.17-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1cf13270386e220bea717406fafd6ef786871af3932da048bc9a7607e4d4a17b",
                "md5": "58df38edbf49d4d82ef16857d35a2f6d",
                "sha256": "435480cf2b7aa4e3087ad32847215729291e709c353b87e9c6f5880716d559ba"
            },
            "downloads": -1,
            "filename": "pyloopsage-1.0.17.tar.gz",
            "has_sig": false,
            "md5_digest": "58df38edbf49d4d82ef16857d35a2f6d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 52647,
            "upload_time": "2024-11-21T09:02:35",
            "upload_time_iso_8601": "2024-11-21T09:02:35.308422Z",
            "url": "https://files.pythonhosted.org/packages/1c/f1/3270386e220bea717406fafd6ef786871af3932da048bc9a7607e4d4a17b/pyloopsage-1.0.17.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-21 09:02:35",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "SFGLab",
    "github_project": "pyLoopSage",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pyloopsage"
}

Sebastian Korsak