| Field | Value |
| --- | --- |
| Name | phosx |
| Version | 0.20.0 |
| Summary | Differential kinase activity inference from phosphoproteomics data |
| upload_time | 2025-10-09 21:45:58 |
| author | Alessandro Lussana |
| requires_python | <4.0,>=3.10 |

<p align="center">
<img width="250" src="https://raw.githubusercontent.com/alussana/phosx/refs/heads/main/phosx/assets/logo.png">
<br><br>
Kinase activity inference from phosphoproteomics data based on substrate sequence specificity
<br><br>
</p>
 
> Research paper: [https://doi.org/10.1093/bioinformatics/btae697](https://doi.org/10.1093/bioinformatics/btae697) (NOTE: outdated; the current method is vastly improved and includes new features)
> Benchmark: [https://github.com/alussana/phosx-benchmark](https://github.com/alussana/phosx-benchmark)
> Data: [https://github.com/alussana/kinase_pssms](https://github.com/alussana/kinase_pssms)
# Overview
<p align="center">
<br>
<img width="900" src="https://raw.githubusercontent.com/alussana/phosx/refs/heads/main/phosx/assets/workflow.png">
<br>
</p>
PhosX infers differential kinase activities from phosphoproteomics data without requiring any prior knowledge database of kinase-phosphosite associations. PhosX assigns the detected phosphopeptides to potential upstream kinases based on experimentally determined substrate sequence specificities, and it tests the enrichment of a kinase's potential substrates in the extremes of a ranked list of phosphopeptides using a Kolmogorov-Smirnov-like statistic. A _p_ value for this statistic is extracted empirically by random permutations of the phosphosite ranks. By considering the A-loop sequence of kinase domains, PhosX refines the inferred kinase activity changes by computing the [_upstream activation evidence_](#upstream-activation-evidence), further improving accuracy.
In the [benchmark](https://github.com/alussana/phosx-benchmark) PhosX consistently outperformed popular alternative methods, including KSTAR, KSEA, Z-score, Kinex, and PTM-SEA, in identifying expected regulated kinases in over a hundred phosphoproteomics perturbation experiments. The performance gain was especially remarkable in identifying upregulated kinases, potentially making PhosX an ideal tool to discover therapeutic targets for kinase inhibitors. All evaluated methods except Kinex and PhosX are based on prior knowledge of kinase-substrate associations.
# Installation
## From [PyPI](https://pypi.org/project/phosx/)
```bash
pip install phosx
```
## From source (requires [Poetry](https://python-poetry.org))
```bash
poetry build
pip install dist/*.whl
```
# Usage
PhosX can be used as a command line tool (`phosx`) with minimal effort. By default its output is written to `STDOUT`, making it easy to use in bioinformatics pipelines. Alternatively, the user can specify an output filename (option `-o`).
Example: run PhosX with default parameters on an example dataset, using up to 4 cores, and redirecting the output table to `kinase_activities.tmp`:
```bash
phosx -c 4 tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk > kinase_activities.tmp
```
<details>
<summary>A brief description of the command line options can be viewed with `phosx -h`:</summary>
```bash
██████╗░██╗░░██╗░█████╗░░██████╗██╗░░██╗
██╔══██╗██║░░██║██╔══██╗██╔════╝╚██╗██╔╝
██████╔╝███████║██║░░██║╚█████╗░░╚███╔╝░
██╔═══╝░██╔══██║██║░░██║░╚═══██╗░██╔██╗░
██║░░░░░██║░░██║╚█████╔╝██████╔╝██╔╝╚██╗
╚═╝░░░░░╚═╝░░╚═╝░╚════╝░╚═════╝░╚═╝░░╚═╝
Version 0.20.0
Copyright (C) 2025 Alessandro Lussana
Licence Apache 2.0
Command: /home/alussana/Xiv_local/venvs/phosx/bin/phosx -h
usage: phosx [-h] [-yp Y_PSSM] [-stp S_T_PSSM] [-yq Y_PSSM_QUANTILES] [-stq S_T_PSSM_QUANTILES] [-no-uae] [-meta KINASE_METADATA] [-n N_PERMUTATIONS] [-s] [-t] [-stk S_T_N_TOP_KINASES] [-yk Y_N_TOP_KINASES] [-astqth A_LOOP_S_T_QUANTILE_THRESHOLD]
[-ayqth A_LOOP_Y_QUANTILE_THRESHOLD] [-urt UPREG_REDUNDANCY_THRESHOLD] [-drt DOWNREG_REDUNDANCY_THRESHOLD] [-mh MIN_N_HITS] [-stmq S_T_MIN_QUANTILE] [-ymq Y_MIN_QUANTILE] [-df1 DECAY_FACTOR] [-c N_PROC] [--plot-figures] [-d OUTPUT_DIR] [-nd NETWORK_PATH]
[-o OUTPUT_PATH] [-v]
seqrnk
Data-driven differential kinase activity inference from phosphosproteomics data
positional arguments:
seqrnk Path to the seqrnk file.
options:
-h, --help show this help message and exit
-yp Y_PSSM, --y-pssm Y_PSSM
Path to the h5 file storing custom Tyr PSSMs; defaults to built-in PSSMs
-stp S_T_PSSM, --s-t-pssm S_T_PSSM
Path to the h5 file storing custom Ser/Thr PSSMs; defaults to built-in PSSMs
-yq Y_PSSM_QUANTILES, --y-pssm-quantiles Y_PSSM_QUANTILES
Path to the h5 file storing custom Tyr kinases PSSM score quantile distributions under the key 'pssm_scores'; defaults to built-in PSSM scores quantiles
-stq S_T_PSSM_QUANTILES, --s-t-pssm-quantiles S_T_PSSM_QUANTILES
Path to the h5 file storing custom Ser/Thr kinases PSSM score quantile distributions under the key 'pssm_scores'; defaults to built-in PSSM scores quantiles
-no-uae, --no-upstream-activation-evidence
Do not compute upstream activation evidence to modify the activity scores of kinases with correlated activity; default: False
-meta KINASE_METADATA, --kinase-metadata KINASE_METADATA
Path to the h5 file storing kinase metadata ("aloop_seq"); defaults to built-in metadata
-n N_PERMUTATIONS, --n-permutations N_PERMUTATIONS
Number of random permutations; default: 20000
-s, --ser-thr-only Only compute Ser/Thr kinases activity; default: False
-t, --tyr-only Only compute Tyr kinases activity; default: False
-stk S_T_N_TOP_KINASES, --s-t-n-top-kinases S_T_N_TOP_KINASES
Number of top-scoring Ser/Thr kinases potentially associatiated to a given phosphosite; default: 5
-yk Y_N_TOP_KINASES, --y-n-top-kinases Y_N_TOP_KINASES
Number of top-scoring Tyr kinases potentially associatiated to a given phosphosite; default: 5
-astqth A_LOOP_S_T_QUANTILE_THRESHOLD, --a-loop-s-t-quantile-threshold A_LOOP_S_T_QUANTILE_THRESHOLD
Minimum Ser/Thr PSSM score quantile for an activation loop to be considered potential substrate of a kinase; default: 0.95
-ayqth A_LOOP_Y_QUANTILE_THRESHOLD, --a-loop-y-quantile-threshold A_LOOP_Y_QUANTILE_THRESHOLD
Minimum Tyr PSSM score quantile for an activation loop to be considered potential substrate of a kinase; default: 0.95
-urt UPREG_REDUNDANCY_THRESHOLD, --upreg-redundancy-threshold UPREG_REDUNDANCY_THRESHOLD
Minimum Jaccard index of target substrates to consider two upregulated kinases having potentially correlated activity; upstream activation evidence is used to prioritize the activity of individual ones; default: 0.5
-drt DOWNREG_REDUNDANCY_THRESHOLD, --downreg-redundancy-threshold DOWNREG_REDUNDANCY_THRESHOLD
Minimum Jaccard index of target substrates to consider two downregulated kinases having potentially correlated activity; upstream activation evidence is used to prioritize the activity of individual ones; default: 0.5
-mh MIN_N_HITS, --min-n-hits MIN_N_HITS
Minimum number of phosphosites associated with a kinase for the kinase to be considered in the analysis; default: 4
-stmq S_T_MIN_QUANTILE, --s-t-min-quantile S_T_MIN_QUANTILE
Minimum PSSM score quantile that a phosphosite has to satisfy to be potentially assigned to a Ser/Thr kinase; default: 0.95
-ymq Y_MIN_QUANTILE, --y-min-quantile Y_MIN_QUANTILE
Minimum PSSM score quantile that a phosphosite has to satisfy to be potentially assigned to a Tyr kinase; default: 0.90
-df1 DECAY_FACTOR, --decay-factor DECAY_FACTOR
Decay factor for the exponential decay of the activation evidence when competing kinases have different activation scores. See utils.decay_from_1(); default: 256
-c N_PROC, --n-proc N_PROC
Number of cores used for multithreading; default: 1
--plot-figures Save figures in pdf format; see also --output_dir
-d OUTPUT_DIR, --output-dir OUTPUT_DIR
Output files directory; only relevant if used with --plot_figures; defaults to 'phosx_output/'
-nd NETWORK_PATH, --network-path NETWORK_PATH
Output file path for the inferred Kinase-->A-loop network; if not specified, the network will not be saved
-o OUTPUT_PATH, --output-path OUTPUT_PATH
Main output table with differential kinase activitiy scores; if not specified it will be printed in STDOUT
-v, --version Print package version and exit
```
</details>
## Input
### _seqrnk_
PhosX's input format is a simple text file that we name _seqrnk_. It consists of 2 tab-separated columns containing phosphopeptide sequences and values, respectively. The values should be biologically relevant measures of differential phosphorylation, typically intensity log fold changes obtained by comparing two conditions in mass spectrometry experiments. Amino acid sequences should be of length $10$, with the phosphorylated residue in position $6$ (1-based), in order to match the Position Specific Scoring Matrix (PSSM) models (see [Ser/Thr PSSMs](https://doi.org/10.1038/s41586-022-05575-3) & [Tyr PSSMs](https://doi.org/10.1038/s41586-024-07407-y)). Undefined amino acids are represented by the character `_`. Every other residue is represented by the corresponding 1-letter symbol according to the IUPAC nomenclature for amino acids. Additional phosphorylated Serine, Threonine, or Tyrosine residues, which may act as priming sites and are therefore not in the $6^{th}$ position of the peptide, are represented with the lowercase symbols `s`, `t`, and `y`, respectively. An example is included in this repository:
```bash
$ head tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk
QEEAEYVRAL 5.646644
ANFSAYPSEE 4.33437
YLNRNYWEKK 4.174151
AENAEYLRVA 3.685413
STYTSYPKAE 3.491975
SFLQRYSSDP 3.295341
AAEPGSPTAA 3.202242
EPAHAYAQPQ 3.160899
RQKSTYTSYP 3.114077
ETKSLYPSSE 3.04653
```
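The format rules above can be checked with a few lines of Python. This is only a sketch: `parse_seqrnk` is a hypothetical helper, not part of the PhosX package, and it assumes (based on the example file) that the residue in position 6 is written as an uppercase `S`, `T`, or `Y`.

```python
import io

PHOSPHO_ACCEPTORS = {"S", "T", "Y"}


def parse_seqrnk(handle):
    """Yield (sequence, value) pairs from a seqrnk stream, validating each row."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Two tab-separated columns; split() also tolerates spaces
        sequence, value = line.split()
        if len(sequence) != 10:
            raise ValueError(f"line {line_number}: expected 10 residues, got {len(sequence)}")
        # Position 6 (1-based) is index 5; assumed to be an uppercase S, T or Y
        if sequence[5] not in PHOSPHO_ACCEPTORS:
            raise ValueError(f"line {line_number}: position 6 is {sequence[5]!r}, expected S, T or Y")
        yield sequence, float(value)


# Two rows taken from the example file shown above
example = io.StringIO("QEEAEYVRAL\t5.646644\nAAEPGSPTAA\t3.202242\n")
records = list(parse_seqrnk(example))
print(records)
```

Running such a check before calling `phosx` can catch peptides that are not centered on the phosphosite.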
Alongside the main program, this package also installs `make-seqrnk`. This utility can be used to help generate a _seqrnk_ file given a list of phosphosites, each one identified by a UniProt Accession Number and residue coordinate. `make-seqrnk` will query the [UniProt](https://www.uniprot.org) database to fetch the appropriate subsequences to build the _seqrnk_ file.
Run an example:
```bash
cat tests/p_list/15_3.tsv | make-seqrnk > 15_3.seqrnk
```
<details>
<summary>See `make-seqrnk -h` for more details:</summary>
```bash
usage: make-seqrnk [-h] [-i INPUT] [-o OUTPUT]
Make a seqrnk file to be used to compute differential kinase activity with PhosX
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Path of the input phosphosites to be converted in seqrnk format. It should be a TSV file where the 1st column is the UniProtAC (str), the 2nd is the sequence coordinate (int), and the 3rd is the logFC (float); defaults to STDIN
-o OUTPUT, --output OUTPUT
Path of the seqrnk file; if not specified it will be printed in STDOUT
```
</details>
### PSSMs
PhosX estimates the affinity between human kinases and phosphopeptides based on the substrate sequence specificity encoded in Position Specific Scoring Matrices (PSSMs). A kinase PSSM is a $10 \times 23$ matrix containing amino acid affinity scores at each one of the $10$ positions of the substrate. The $6^{th}$ position corresponds to the modified residue and should have non-$0$ values only for Serine, Threonine (for Ser/Thr kinases), or Tyrosine (for Tyr kinases) residues.
PhosX comes with built-in, default PSSMs for human kinases, which can be found at `phosx/data/*_PSSMs.h5`. The user can also run PhosX using custom PSSMs, whose path can be specified with the options `-yp` and `-stp`.
<details>
<summary> Open and inspect the structure of the HDF5 files containing the PSSMs </summary>
```python
import os

import h5py
import pandas as pd
import phosx

# Column order of each PSSM: 20 amino acids plus phosphorylated s, t, y
AA_LIST = [
    "G","P","A","V","L","I",
    "M","C","F","Y","W","H",
    "K","R","Q","N","E","D",
    "S","T","s","t","y",
]
# Row order of each PSSM: substrate positions relative to the phosphosite
POSITIONS_LIST = list(range(-5, 5))

s_t_pssms_h5_file = f"{os.path.dirname(phosx.__file__)}/data/S_T_PSSMs.h5"
y_pssms_h5_file = f"{os.path.dirname(phosx.__file__)}/data/Y_PSSMs.h5"


def read_pssms(pssms_h5_file: str):
    """Return a dict mapping each kinase name to its PSSM as a DataFrame."""
    pssm_df_dict = {}
    with h5py.File(pssms_h5_file, "r") as pssms_h5:
        for kinase in pssms_h5.keys():
            # Read the dataset into memory before the file is closed
            pssm_df_dict[kinase] = pd.DataFrame(pssms_h5[kinase][:])
            pssm_df_dict[kinase].columns = AA_LIST
            pssm_df_dict[kinase].index = POSITIONS_LIST
    return pssm_df_dict


if __name__ == "__main__":
    s_t_pssms = read_pssms(s_t_pssms_h5_file)
    y_pssms = read_pssms(y_pssms_h5_file)
```
</details>
### Background PSSM score distributions
Similarly, PhosX also has built-in kinase PSSM score quantile distributions computed on a reference human phosphoproteome from the [PhosphoSitePlus](https://www.phosphosite.org/) database. These can be found at `phosx/data/*_PSSM_score_quantiles.h5`. When supplying custom PSSMs, it is necessary to also specify the appropriate background distributions with the options `-yq` and `-stq`.
<details>
<summary>Open and inspect the HDF5 files containing the background PSSM scores</summary>
```python
import os

import pandas as pd
import phosx

s_t_pssm_score_quantiles_h5_file = f"{os.path.dirname(phosx.__file__)}/data/S_T_PSSM_score_quantiles.h5"
y_pssm_score_quantiles_h5_file = f"{os.path.dirname(phosx.__file__)}/data/Y_PSSM_score_quantiles.h5"


def read_pssm_score_quantiles(pssm_score_quantiles_h5_file: str):
    # The quantile table is stored under the key "pssm_scores"
    pssm_bg_scores_df = pd.read_hdf(pssm_score_quantiles_h5_file, key="pssm_scores")
    return pssm_bg_scores_df


if __name__ == "__main__":
    s_t_pssm_score_quantiles = read_pssm_score_quantiles(s_t_pssm_score_quantiles_h5_file)
    y_pssm_score_quantiles = read_pssm_score_quantiles(y_pssm_score_quantiles_h5_file)
```
</details>
### Kinases metadata
While keeping the inference fully data-driven, PhosX can benefit from kinase sequence information. For each kinase, the activation loop (A-loop) sequence of the kinase domain, if present, is used to assess the [_upstream activation evidence_](#upstream-activation-evidence) associated with that kinase, and consequently to compute the final differential activity scores. This step can substantially improve the inference accuracy, as observed in the [benchmark](https://github.com/alussana/phosx-benchmark).
Built-in A-loop sequences are supplied in a dictionary saved under the key `"aloop_seq"` of the metadata HDF5 file, found at `phosx/data/kinase_metadata.h5`. This file could be expanded to also contain different types of metadata in future versions.
It is possible to pass custom A-loop sequences by specifying a suitable `.h5` metadata file with the command line option `-meta`.
<details>
<summary>Open and inspect the HDF5 file containing the A-loop sequences</summary>
```python
import os

import pandas as pd
import phosx

metadata_h5_file = f"{os.path.dirname(phosx.__file__)}/data/kinase_metadata.h5"


def read_aloops(metadata_h5_file):
    # A-loop sequences are stored as a dictionary under the key "aloop_seq"
    kinase_metadata_dict = phosx.utils.hdf5_to_dict(metadata_h5_file)
    aloop_df = pd.DataFrame.from_dict(
        kinase_metadata_dict["aloop_seq"], orient="index", columns=["Sequence"]
    )
    return aloop_df


if __name__ == "__main__":
    aloop_df = read_aloops(metadata_h5_file)
```
</details>
## Output
### Differential kinase activity score
PhosX's main output is a text file reporting the computed kinase activities with associated statistics as described in the [Method](#method) section. For each kinase, the KS statistic, the _p_ value, the FDR _q_ value, and the Activity Score are reported. Kinases for which a differential activity could not be computed due to a low number of assigned phosphosites (see option `-mh MIN_N_HITS`) are reported as having Activity Score = `NA`. See an output example from the command executed [above](#usage):
<details>
<summary>head kinase_activities.tmp</summary>
```bash
KS p value FDR q value Legacy Activity Score Activity Score
ACVR2A 0.23331 0.4396 1.0 0.35694 0.35694
ACVR2B 0.45501 0.024 1.0 1.61979 1.61979
ALK2 0.25047 0.3692 1.0 0.43274 0.43274
ALK4 -0.35766 0.1504 1.0 -0.82275 -0.82275
ALPHAK3 -0.2778 0.3792 1.0 -0.42113 -0.42113
AMPKA2 0.44011 0.296 1.0 0.52871 0.52871
ATM -0.51674 0.1008 1.0 -0.99654 -0.99654
ATR -0.49195 0.1272 1.0 -0.89551 -0.89551
AURA 0.64746 0.0952 1.0 1.02136 1.02136
```
</details>
### Statistics plots
Additionally, PhosX can also save plots of the weighted running sum and of the KS statistic compared to its empirical null distribution, similarly to the ones shown [above](#overview), for each kinase. To enable this behavior the option `--plot-figures` must be specified. A custom directory to save the plots can be passed with `-d`.
### Kinase $\rightarrow$ A-loop network
A kinase-kinase network, where directed edges represent the potential of a source node to phosphorylate the A-loop of a target node, is inferred as part of evaluating the [upstream activation evidence](#upstream-activation-evidence). After a PhosX run, this network can be saved to disk by specifying a path with the command line argument `--network-path`. See an output example generated by running:
```bash
phosx -c 4 --network-path aloop_network.tsv tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk > kinase_activities.tmp
```
<details>
<summary>head aloop_network.tsv</summary>
```bash
source target complementarity source_activity_score target_activity_score
ACVR2A LATS1 1.0 0.35694 0.56543
ACVR2A LATS2 1.0 0.35694 0.66474
ACVR2A MARK1 1.0 0.35694 -1.43415
ACVR2A MARK2 1.0 0.35694 -1.22475
ACVR2A MARK3 1.0 0.35694 -0.91593177
ACVR2A MARK4 1.0 0.35694 -1.19382
ACVR2A NLK 1.0 0.35694 1.79588
ACVR2A PINK1 1.0 0.35694 0.62051
ACVR2A BLK 0.98 0.35694 -2.4437
```
</details>
# Method
## Phosphopeptide scoring
For each kinase PSSM, a score is assigned to each phosphopeptide sequence $S$ that quantifies its similarity to the PSSM. First, a "raw PSSM score" is computed as:
```math
\texttt{score}(S,k) := \prod_{i=-5}^{4}
\begin{cases}
M^{k}_{i,S_i}, & \text{if } S_i \neq \texttt{'\_'} \\
1, & \text{if } S_i = \texttt{'\_'}
\end{cases}
```
where $S_i$ is the amino acid residue at position $i$ of the phosphopeptide sequence $S$; $M^k_{i,j}$ is the value of the PSSM for kinase $k$ at position $i$ for residue $j$. Raw PSSM scores for each kinase are then transformed between $0$ and $1$ based on the quantile they fall in, considering a background distribution of proteome-wide raw PSSM scores. For each kinase, phosphopeptides with raw PSSM score equal to $0$ are discarded, and the remaining are used to determine the values of the $10,000$-quantiles of the raw PSSM score distribution. The background $10,000$-quantiles raw PSSM scores for each kinase PSSM are used to derive the final PSSM scores for each phosphopeptide.
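The scoring step can be sketched as follows, assuming a PSSM stored as a $10 \times 23$ NumPy array whose columns follow the `AA_LIST` order used in the [PSSMs](#pssms) section. The toy PSSM, the made-up background values, and the helper names `raw_pssm_score` and `quantile_score` are illustrative and not part of the PhosX API; the quantile lookup via `searchsorted` is one plausible reading of the transformation described above, not necessarily the exact one used internally.

```python
import numpy as np

AA_LIST = list("GPAVLIMCFYWHKRQNEDSTsty")
AA_INDEX = {aa: j for j, aa in enumerate(AA_LIST)}


def raw_pssm_score(sequence, pssm):
    """Product of PSSM entries along the peptide; '_' contributes a factor of 1."""
    score = 1.0
    for i, residue in enumerate(sequence):
        if residue != "_":
            score *= pssm[i, AA_INDEX[residue]]
    return score


def quantile_score(raw_score, background_quantiles):
    """Fraction of the sorted background quantile values that are <= raw_score."""
    n_below = np.searchsorted(background_quantiles, raw_score, side="right")
    return n_below / len(background_quantiles)


# Toy PSSM: neutral everywhere except a preference for R at position -3 (row 2)
toy_pssm = np.ones((10, 23))
toy_pssm[2, AA_INDEX["R"]] = 4.0

# Made-up background: 10,000 evenly spaced raw scores between 0 and 10
toy_background = np.linspace(0.0, 10.0, 10_000)

raw = raw_pssm_score("__RQRSPSDP", toy_pssm)
final = quantile_score(raw, toy_background)
print(raw, final)
```

Because undefined residues (`_`) are skipped, peptides close to a protein terminus are not penalized for their missing flanking positions.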
## Weighted running sum statistics
PhosX uses the PSSM scores to link kinases to their potential substrates. Each phosphopeptide is assigned as a potential target to its $n$ top-scoring kinases, with a default value of $5$ (options `-stk` and `-yk`). The method has little sensitivity to this parameter in the range $[5,15]$. The activity change of a given kinase is estimated by calculating a running sum statistic over the ranked list of phosphosites, and by estimating its significance based on an empirical distribution generated by random permutations of the ranks. Let $C$ be the set of indexes corresponding to the ranked phosphosites associated with kinase $k$; $N$ the total number of phosphosites; $N_h$ the size of $C$; $r_i$ the value of the ranking metric of the phosphosite at rank $i$, where $r_0$ is the highest value. Then, the running sum ($RS$) up to the phosphosite rank $n$ is given by
```math
RS(k,n) := \sum_{i=0}^{n}
\begin{cases}
\frac{|r_i|}{N_R}, & \text{if } i \in C \\
-\frac{1}{N - N_h}, & \text{if } i \not\in C
\end{cases}
```
where
```math
N_R = \sum_{i \in C} |r_i|
```
The kinase enrichment score ($ES$) corresponds to the maximum deviation from $0$ of $RS$.
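As an illustration, the running sum and the enrichment score can be computed with NumPy as sketched below. The function name `enrichment_score` and the toy data are assumptions made for this example; the sketch requires at least one substrate and one non-substrate in the ranked list.

```python
import numpy as np


def enrichment_score(rank_values, is_substrate):
    """Weighted running sum over a ranked list and its maximum deviation from 0."""
    r = np.abs(np.asarray(rank_values, dtype=float))
    hit = np.asarray(is_substrate, dtype=bool)
    n_total, n_hits = len(r), int(hit.sum())
    # Substrates add |r_i| / N_R, every other phosphosite subtracts 1 / (N - N_h)
    increments = np.where(hit, r / r[hit].sum(), -1.0 / (n_total - n_hits))
    running_sum = np.cumsum(increments)
    es = running_sum[np.argmax(np.abs(running_sum))]
    return running_sum, es


# Six ranked phosphosites; the two top-ranked ones are assigned to the kinase
ranked_values = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
substrate_mask = [True, True, False, False, False, False]
running_sum, es = enrichment_score(ranked_values, substrate_mask)
print(running_sum, es)
```

In this toy case the running sum climbs to its maximum after the two substrates and then returns to $0$ at the end of the list, so the $ES$ is positive.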
## Empirical _p_ values
For each kinase, PhosX computes an empirical _p_ value of the $ES$ by generating a null distribution of the $ES$ through random permutations of the phosphosite ranks. A False Discovery Rate (FDR) _q_ value is also calculated by applying the Bonferroni method considering the number of kinases independently tested. The number of permutations is a tunable parameter but we recommend performing at least $10^4$ random permutations to be able to compute FDR values $< 0.05$.
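The permutation test can be sketched as below. This is an illustration under stated assumptions, not the PhosX implementation: the two-sided definition of the empirical _p_ value (fraction of permuted $|ES|$ at least as large as the observed one) is assumed, shuffling the substrate labels stands in for permuting the phosphosite ranks, and all function names and data are made up.

```python
import numpy as np


def enrichment_score(values, hit):
    """Maximum deviation from 0 of the weighted running sum."""
    r = np.abs(values)
    steps = np.where(hit, r / r[hit].sum(), -1.0 / (len(r) - hit.sum()))
    running_sum = np.cumsum(steps)
    return running_sum[np.argmax(np.abs(running_sum))]


def empirical_p_value(values, hit, n_permutations=1000, seed=0):
    """Two-sided empirical p value from shuffled substrate labels (assumed definition)."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(values, hit)
    null = np.array(
        [enrichment_score(values, rng.permutation(hit)) for _ in range(n_permutations)]
    )
    return float(np.mean(np.abs(null) >= abs(observed)))


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    p_values = np.asarray(p_values, dtype=float)
    return np.minimum(1.0, p_values * len(p_values))


values = np.linspace(3.0, -3.0, 50)  # ranked log fold changes (made up)
hit = np.zeros(50, dtype=bool)
hit[:5] = True                       # the 5 top-ranked sites are putative substrates
p = empirical_p_value(values, hit)
q = bonferroni([p, 0.5, 0.2])        # pretend three kinases were tested
print(p, q)
```

With $n$ permutations the smallest non-zero _p_ value is $1/n$, which is why the correction over a few hundred kinases needs on the order of $10^4$ permutations to reach values below $0.05$.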
## Differential activity scores
The activity score (before correction based on the [upstream activation evidence](#upstream-activation-evidence)) for a given kinase $i$ is defined as:
```math
a_i = -\log_{10}{\left(p_i\right)} \cdot \texttt{sign}\left(ES_i\right)
```
where $\texttt{sign}$ is the sign function, and $p_i$ is the [_p_ value](#empirical-p-values) associated with kinase $i$, capped at the smallest computable _p_ value different from $0$, _i.e._ the inverse of the number of random permutations.
Activity scores greater than $0$ denote kinase activation, while the opposite corresponds to kinase inhibition.
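A small sketch of this formula, using a hypothetical helper `activity_score` and two rows from the example output shown in the [Output](#output) section (default of $20000$ permutations):

```python
import math


def activity_score(p_value, es, n_permutations):
    """Signed -log10(p), with p capped at the smallest non-zero empirical p value."""
    p_capped = max(p_value, 1.0 / n_permutations)
    sign = (es > 0) - (es < 0)  # sign function: 1, 0 or -1
    return -math.log10(p_capped) * sign


# ACVR2B: KS = 0.45501, p = 0.024
print(round(activity_score(0.024, 0.45501, 20000), 5))
# ALK4: KS = -0.35766, p = 0.1504
print(round(activity_score(0.1504, -0.35766, 20000), 5))
# A p value of 0 is capped at 1 / 20000 before taking the logarithm
print(round(activity_score(0.0, 0.9, 20000), 5))
```

The cap also bounds the absolute score: with $20000$ permutations it cannot exceed $-\log_{10}(1/20000) \approx 4.3$ before the upstream activation evidence correction.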
## Upstream activation evidence
Kinases that are more closely evolutionarily related tend to have more similar PSSMs, leading to a correlation in their inferred differential activities which might not be biologically real. PhosX attempts to find these instances in any given experiment and discriminate the truly differentially active kinases from the ones whose activity is falsely correlated with them.
In doing so, PhosX first builds a [directed network of kinases](#kinase--a-loop-network) to represent the potential of each kinase to phosphorylate the activation loop (A-loop) of any other except itself. Edges are inferred based on the same [PSSM score](#phosphopeptide-scoring) logic used to link the phosphosites to the putative upstream kinases, except that the maximum score is taken by sliding a PSSM over each possible position along an A-loop.
If kinases have highly overlapping sets of assigned phosphosites in the experiment, and also a similar differential activity score, then their activity changes are considered to be potentially correlated mostly because of PSSM similarity. In order to prioritize a putative "true" regulated kinase between those candidates, we look for other kinases that target the A-loops of the candidates. The inferred differential activity of such upstream kinases is treated as evidence for the regulation of their downstream targets. If such evidence supports the activity of a specific kinase, then the activity change of the other candidates is dampened down, reducing the false positive rate of identifying differentially regulated kinases.
The logic above is implemented in PhosX using the following procedure, which is applied separately to kinases that are inferred to be upregulated ($a_i > 0$) or downregulated ($a_i < 0$). For the downregulated kinases we take the absolute value of their activity score and then negate the final modified score.
Let $a$ be the activity vector of the kinases. If we are considering upregulated kinases, each $a_i$ is the [Activity Score](#differential-activity-scores) of kinase $i$ if $a_i > 0$, otherwise we assign a "pseudo-null" activity, setting $a_i=0.01$. If we are considering downregulated kinases, we first take the opposite of the Activity Scores and then modify $a$ analogously. Let $a'$ be the vector where each element $a' _i$ is the reciprocal of $a_i$; $A$ the diagonal matrix of $a$; $D^T$ the transposed adjacency matrix of the directed kinase network, i.e. a Boolean matrix indicating for each kinase which other kinases may target its A-loop; $C$ the redundancy matrix, a Boolean matrix indicating for each kinase which other kinases have an extensive overlap of substrates and therefore a potentially correlated activity. By default $C_{ij}=1$ if the phosphosites assigned to kinases $i$ and $j$ in the experiment have a Jaccard Index $J > 0.5$ (options `-urt` and `-drt`).
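The redundancy matrix $C$ can be illustrated with a short sketch. The kinase names, the phosphosite identifiers, and the helper `redundancy_matrix` are hypothetical; leaving the diagonal at `False` is also an assumption, which is harmless here because a kinase compared with itself never changes its own score.

```python
import numpy as np


def redundancy_matrix(substrates_by_kinase, threshold=0.5):
    """Boolean matrix with C[i, j] = True when the Jaccard index of the
    phosphosite sets of kinases i and j exceeds the threshold."""
    kinases = list(substrates_by_kinase)
    n = len(kinases)
    c = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal left as False (assumption)
            a = substrates_by_kinase[kinases[i]]
            b = substrates_by_kinase[kinases[j]]
            union = a | b
            if union and len(a & b) / len(union) > threshold:
                c[i, j] = True
    return kinases, c


# Hypothetical assignments of phosphosite identifiers to three kinases
assigned = {
    "K1": {"s1", "s2", "s3", "s4"},
    "K2": {"s2", "s3", "s4", "s5"},  # Jaccard with K1: 3 / 5 = 0.6
    "K3": {"s7", "s8", "s9"},        # no overlap with K1 or K2
}
kinases, C = redundancy_matrix(assigned)
print(kinases)
print(C.astype(int))
```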
We compute $E = D^T \cdot A$, which is the activation evidence matrix, indicating for each kinase the activation coming from every other kinase.
We then obtain $e$, the evidence vector, as the row-wise maximum of $E$, containing the upstream activation evidence of each kinase.
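A toy example of these two steps, assuming `D[i, j] = 1` when kinase $i$ may phosphorylate the A-loop of kinase $j$ (the network and the activity values are made up):

```python
import numpy as np

# Activity vector for three kinases; the second one carries the pseudo-null value
a = np.array([2.0, 0.01, 1.5])

# Directed network: D[i, j] = 1 if kinase i may target the A-loop of kinase j
D = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# E[i, j] is the activation evidence that kinase i receives from kinase j
E = D.T @ np.diag(a)

# Upstream activation evidence: strongest evidence received by each kinase
e = E.max(axis=1)
print(E)
print(e)
```

Here the first kinase has no upstream kinase and therefore no evidence, while the second one inherits the larger of the two activities pointing at its A-loop.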
Let $e'$ be a vector where each element $e'_i$ is the reciprocal of $e_i$.
Eventually, we want to modify $a_i$ (the activity of a kinase $i$) proportionally to:
```math
m_{ij} = 1 - \left\{ C_{ij} \cdot \left(1 - \frac{e_i}{e_j} \right) \cdot \exp\left[ -d \left( \frac{a_i}{a_j} \right)^2 \right] \right\}
```
for whatever kinase $j$ gives the minimum $m_{ij}$, and where $d \in \mathbb{N}$ is the decay factor. By default $d=256$ (option `-df1`); it controls how fast $\exp [ -d ( a_i/a_j)^2 ]$ decays from 1 to 0 as $a_i / a_j$ becomes different from $1$.
Namely, we want to disregard the differential activity of kinase $i$ only if kinase $j$ is potentially correlated ($C_{ij} = 1$), _and_ if kinase $i$ and $j$ have similar inferred activities (_i.e._ $a_i/a_j$ is close to $1$), by a degree that is greater when the upstream activation evidence of the kinase $i$ is smaller than the one of kinase $j$ (_i.e._ $e_i/e_j$ is small).
Before applying this formula, we need to consider some cases regarding the value of $e_i / e_j$:
* $0 < e_i/e_j < 1$, no change is needed;
* $e_j=0 \implies e_i/e_j = \infty$, the competing kinase has no upstream activation evidence, we set $e_i / e_j= 1$ (leading to $m_{ij} = 1$ );
* $e_i=0 \land e_j=0 \implies e_i/e_j$ is undefined, both kinases have no upstream activation evidence, we set $e_i / e_j= 1$ (leading to $m_{ij} = 1$ );
* $e_i/e_j \ge 1$, the upstream activation evidence of kinase $i$ is greater than kinase $j$, therefore we don't want to correct $a_i$ and we set $e_i / e_j= 1$ (leading to $m_{ij} = 1$ ).
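The cases above can be handled in a vectorized way, as in the following sketch. The helper name `evidence_ratio` is hypothetical, and the evidence vector is a made-up example.

```python
import numpy as np


def evidence_ratio(e):
    """Pairwise ratios e_i / e_j with the special cases mapped to 1."""
    e = np.asarray(e, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        f = e[:, None] / e[None, :]  # f[i, j] = e_i / e_j
    f[~np.isfinite(f)] = 1.0         # e_j = 0 (infinite) or 0 / 0 (undefined)
    f[f >= 1.0] = 1.0                # kinase i has at least as much evidence as j
    return f


F = evidence_ratio([0.0, 2.0, 0.01])
print(F)
```

Only entries strictly below $1$ survive, so a kinase can be dampened only by a competitor with stronger upstream activation evidence.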
We can then obtain $F = ee'^T$, the outer product of the evidence vector and its reciprocal, containing the ratio of upstream evidences for each kinase pair. To the elements of $F$ the conditional transformations above have been applied.
Let also $B = aa'^T$ be the outer product of the activity vector and its reciprocal, containing the ratio of inferred activities for each kinase pair; and $X = \exp ( -d B^2 )$ a matrix of values between $0$ and $1$ indicating how similar the inferred activity changes of any two kinases are.
We can then rewrite the equation for $m_{ij}$ more simply as:
```math
m_{ij} = 1 - \left\{ C_{ij} \cdot \left(1 - F_{ij} \right) \cdot X_{ij} \right\}
```
Therefore, to find all possible $m_{ij}$ and then, for each $i$, select the minimum, we first compute the matrix $M$:
```math
M = 1 - \left\{ C \circ \left(1 - F \right) \circ X \right\}
```
and then get $z$, the activity modifier vector indicating the modifier factor for each kinase, by taking the row-wise minimum of $M$.
Lastly, we set $z_i=1$ for each kinase $i$ that doesn't have a regulatory A-loop. Only the differential activity of kinases that have an A-loop reported in the [metadata](#kinases-metadata) will be modified, as only their activities are assumed to depend on such a regulatory feature.
The final kinase differential activity scores are given by $a \circ z$.
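The last steps can be sketched with element-wise NumPy operations. In this illustration the matrices $C$, $F$, and $X$ are passed in as precomputed inputs with made-up values, and `apply_modifier` is a hypothetical helper rather than a PhosX function.

```python
import numpy as np


def apply_modifier(a, C, F, X, has_aloop):
    """Return a ∘ z, where z is the row-wise minimum of M = 1 - C ∘ (1 - F) ∘ X."""
    M = 1.0 - C * (1.0 - F) * X      # element-wise (Hadamard) products
    z = M.min(axis=1)                # strongest dampening found for each kinase
    z = np.where(has_aloop, z, 1.0)  # kinases without an A-loop are left unchanged
    return a * z


a = np.array([2.0, 2.0])    # two kinases with the same inferred activity
C = np.array([[0, 1],
              [1, 0]])      # they share most of their assigned phosphosites
F = np.array([[1.0, 0.25],
              [1.0, 1.0]])  # kinase 0 has a quarter of the evidence of kinase 1
X = np.ones((2, 2))         # activities treated as fully similar (made up)

print(apply_modifier(a, C, F, X, has_aloop=np.array([True, True])))
print(apply_modifier(a, C, F, X, has_aloop=np.array([False, True])))
```

In the first call the kinase with weaker upstream evidence is scaled down to a quarter of its score; in the second call it is left untouched because it has no A-loop in the metadata.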
# Cite
Please cite one of the following references if you use PhosX in your work.
## Bioinformatics
BibTeX:
```bibtex
@article{10.1093/bioinformatics/btae697,
author = {Lussana, Alessandro and Müller-Dott, Sophia and Saez-Rodriguez, Julio and Petsalaki, Evangelia},
title = {PhosX: data-driven kinase activity inference from phosphoproteomics experiments},
journal = {Bioinformatics},
volume = {40},
number = {12},
pages = {btae697},
year = {2024},
month = {11},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae697},
url = {https://doi.org/10.1093/bioinformatics/btae697},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/12/btae697/60972735/btae697.pdf},
}
```
## bioRxiv
BibTeX:
```bibtex
@article{Lussana2024,
title = {PhosX: data-driven kinase activity inference from phosphoproteomics experiments},
url = {http://dx.doi.org/10.1101/2024.03.22.586304},
DOI = {10.1101/2024.03.22.586304},
publisher = {Cold Spring Harbor Laboratory},
author = {Lussana, Alessandro and Petsalaki, Evangelia},
year = {2024},
month = mar
}
```
Ser/Thr PSSM score quantile for an activation loop to be considered potential substrate of a kinase; default: 0.95\n -ayqth A_LOOP_Y_QUANTILE_THRESHOLD, --a-loop-y-quantile-threshold A_LOOP_Y_QUANTILE_THRESHOLD\n Minimum Tyr PSSM score quantile for an activation loop to be considered potential substrate of a kinase; default: 0.95\n -urt UPREG_REDUNDANCY_THRESHOLD, --upreg-redundancy-threshold UPREG_REDUNDANCY_THRESHOLD\n Minimum Jaccard index of target substrates to consider two upregulated kinases having potentially correlated activity; upstream activation evidence is used to prioritize the activity of individual ones; default: 0.5\n -drt DOWNREG_REDUNDANCY_THRESHOLD, --downreg-redundancy-threshold DOWNREG_REDUNDANCY_THRESHOLD\n Minimum Jaccard index of target substrates to consider two downregulated kinases having potentially correlated activity; upstream activation evidence is used to prioritize the activity of individual ones; default: 0.5\n -mh MIN_N_HITS, --min-n-hits MIN_N_HITS\n Minimum number of phosphosites associated with a kinase for the kinase to be considered in the analysis; default: 4\n -stmq S_T_MIN_QUANTILE, --s-t-min-quantile S_T_MIN_QUANTILE\n Minimum PSSM score quantile that a phosphosite has to satisfy to be potentially assigned to a Ser/Thr kinase; default: 0.95\n -ymq Y_MIN_QUANTILE, --y-min-quantile Y_MIN_QUANTILE\n Minimum PSSM score quantile that a phosphosite has to satisfy to be potentially assigned to a Tyr kinase; default: 0.90\n -df1 DECAY_FACTOR, --decay-factor DECAY_FACTOR\n Decay factor for the exponential decay of the activation evidence when competing kinases have different activation scores. 
See utils.decay_from_1(); default: 256\n -c N_PROC, --n-proc N_PROC\n Number of cores used for multithreading; default: 1\n --plot-figures Save figures in pdf format; see also --output_dir\n -d OUTPUT_DIR, --output-dir OUTPUT_DIR\n Output files directory; only relevant if used with --plot_figures; defaults to 'phosx_output/'\n -nd NETWORK_PATH, --network-path NETWORK_PATH\n Output file path for the inferred Kinase-->A-loop network; if not specified, the network will not be saved\n -o OUTPUT_PATH, --output-path OUTPUT_PATH\n Main output table with differential kinase activitiy scores; if not specified it will be printed in STDOUT\n -v, --version Print package version and exit\n ```\n</details>\n\n## Input\n\n### _seqrnk_\n\nPhosX's input format is a simple text file that we name _seqrnk_. It consists of 2 tab-separated columns containing phosphopeptide sequences and values, respectively. The values should be biologically relevant measures of differential phosphorylation, typically intensity log fold changes as obtained when comparing two conditions in mass spectrometry experiments. Amino acid sequences should be of length $10$, with the phosphorylated residue in position $6$ (1-based), in order to match the Position Specific Scoring Matrix (PSSM) models (see [Ser/Thr PSSMs](https://doi.org/10.1038/s41586-022-05575-3) & [Tyr PSSMs](https://doi.org/10.1038/s41586-024-07407-y)). Undefined amino acids are represented by the character `_`. Every other residue is represented by the corresponding 1-letter symbol according to the IUPAC nomenclature for amino acids and additional phosphorylated Serine, Threonine or Tyrosine residues are represented with the symbols `s`, `t`, and `y`, respectively. Phosphorylated residues that act as potential priming sites and are therefore not in the $6^{th}$ position of the peptide are represented with lowercase letters. 
An example is included in this repository:\n\n```bash\n$ head tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk\n\nQEEAEYVRAL 5.646644\nANFSAYPSEE 4.33437\nYLNRNYWEKK 4.174151\nAENAEYLRVA 3.685413\nSTYTSYPKAE 3.491975\nSFLQRYSSDP 3.295341\nAAEPGSPTAA 3.202242\nEPAHAYAQPQ 3.160899\nRQKSTYTSYP 3.114077\nETKSLYPSSE 3.04653\n```\n\nAlongside the main program, this package also installs `make-seqrnk`. This utility helps generate a _seqrnk_ file from a list of phosphosites, each one identified by a UniProt Accession Number and a residue coordinate. `make-seqrnk` will query the [UniProt](https://www.uniprot.org) database to fetch the appropriate subsequences to build the _seqrnk_ file. \n\nRun an example:\n\n```bash\ncat tests/p_list/15_3.tsv | make-seqrnk > 15_3.seqrnk\n```\n\n<details>\n  <summary>See `make-seqrnk -h` for more details:</summary>\n\n  ```bash\n  usage: make-seqrnk [-h] [-i INPUT] [-o OUTPUT]\n\n  Make a seqrnk file to be used to compute differential kinase activity with PhosX\n\n  options:\n    -h, --help            show this help message and exit\n    -i INPUT, --input INPUT\n                          Path of the input phosphosites to be converted in seqrnk format. It should be a TSV file where the 1st column is the UniProtAC (str), the 2nd is the sequence coordinate (int), and the 3rd is the logFC (float); defaults to STDIN\n    -o OUTPUT, --output OUTPUT\n                          Path of the seqrnk file; if not specified it will be printed in STDOUT\n  ```\n</details>\n\n\n### PSSMs\n\nPhosX estimates the affinity between human kinases and phosphopeptides based on the substrate sequence specificity encoded in Position Specific Scoring Matrices (PSSMs). A kinase PSSM is a $10 \\times 23$ matrix containing amino acid affinity scores at each one of the $10$ positions of the substrate. 
The $6^{th}$ position corresponds to the modified residue and should have non-$0$ values only for Serine, Threonine (for Ser/Thr kinases), or Tyrosine (for Tyr kinases) residues.\n\nPhosX comes with built-in default PSSMs for human kinases, which can be found at `phosx/data/*_PSSMs.h5`. The user can also run PhosX using custom PSSMs, whose path can be specified with the options `-yp` and `-stp`. \n\n<details>\n  <summary> Open and inspect the structure of the HDF5 files containing the PSSMs </summary>\n\n  ```python\n  import os\n\n  import phosx\n  import pandas as pd\n  import h5py\n\n\n  # Column labels: the 23 residue symbols scored by each PSSM\n  AA_LIST = [\n      \"G\",\"P\",\"A\",\"V\",\"L\",\"I\",\n      \"M\",\"C\",\"F\",\"Y\",\"W\",\"H\",\n      \"K\",\"R\",\"Q\",\"N\",\"E\",\"D\",\n      \"S\",\"T\",\"s\",\"t\",\"y\",\n  ]\n\n\n  # Row labels: the 10 substrate positions, from -5 to +4 (0 is the phosphosite)\n  POSITIONS_LIST = list(range(-5, 5))\n\n\n  s_t_pssms_h5_file = f\"{os.path.dirname(phosx.__file__)}/data/S_T_PSSMs.h5\"\n  y_pssms_h5_file = f\"{os.path.dirname(phosx.__file__)}/data/Y_PSSMs.h5\"\n\n\n  def read_pssms(pssms_h5_file: str):\n      # Load each kinase PSSM into a DataFrame labelled by position and residue\n      pssms_h5 = h5py.File(pssms_h5_file, \"r\")\n      pssm_df_dict = {}\n\n      for kinase in pssms_h5.keys():\n          pssm_df_dict[kinase] = pd.DataFrame(pssms_h5[kinase])\n          pssm_df_dict[kinase].columns = AA_LIST\n          pssm_df_dict[kinase].index = POSITIONS_LIST\n\n      return pssm_df_dict\n\n\n  if __name__ == \"__main__\":\n      s_t_pssms = read_pssms(s_t_pssms_h5_file)\n      y_pssms = read_pssms(y_pssms_h5_file)\n  ```\n</details>\n\n### Background PSSM score distributions\n\nSimilarly, PhosX also has built-in kinase PSSM score quantile distributions computed on a reference human phosphoproteome from the [PhosphoSitePlus](https://www.phosphosite.org/) database. These can be found at `phosx/data/*_PSSM_score_quantiles.h5`. 
When supplying custom PSSMs, it is necessary to also specify the appropriate background distributions with the options `-yq` and `-stq`.\n\n<details>\n <summary>Open and inspect the HDF5 files containing the background PSSM scores</summary>\n\n ```python\n import phosx\n import pandas as pd\n import os\n\n\n s_t_pssm_score_quantiles_h5_file = f\"{os.path.dirname(phosx.__file__)}/data/S_T_PSSM_score_quantiles.h5\"\n y_pssm_score_quantiles_h5_file = f\"{os.path.dirname(phosx.__file__)}/data/Y_PSSM_score_quantiles.h5\"\n\n\n def read_pssm_score_quantiles(pssm_score_quantiles_h5_file: str):\n pssm_bg_scores_df = pd.read_hdf(pssm_score_quantiles_h5_file, key=\"pssm_scores\")\n return pssm_bg_scores_df\n\n\n if __name__ == \"__main__\":\n s_t_pssm_score_quantiles = read_pssm_score_quantiles(s_t_pssm_score_quantiles_h5_file)\n y_pssm_score_quantiles = read_pssm_score_quantiles(y_pssm_score_quantiles_h5_file)\n ```\n</details>\n\n\n### Kinases metadata\n\nWhile keeping a fully data-driven inference, PhosX can benefit from kinase sequence information. For each kinase, the activation loop (A-loop) sequence of the kinase domain, if present, is used to enable the assessment of the [_upstream activation evidence_](#upstream-activation-evidence) associated to that kinase, and consequently compute the final differential activity scores. This step can substantially improve the inference accuracy, as observed in the [benchmark](https://github.com/alussana/phosx-benchmark).\n\nBuilt-in A-loop sequences are supplied in a dictionary saved under the key `\"aloop_seq\"` of the metadata HDF5 file, found at `phosx/data/kinase_metadata.h5`. 
This file could be expanded to also contain different types of metadata in future versions.\n\nIt is possible to pass custom A-loop sequences by specifying a suitable `.h5` metadata file with the command line option `-meta`.\n\n<details>\n  <summary>Open and inspect the HDF5 file containing the A-loop sequences</summary>\n\n  ```python\n  import phosx\n  import pandas as pd\n  import os\n\n\n  metadata_h5_file = f\"{os.path.dirname(phosx.__file__)}/data/kinase_metadata.h5\"\n\n\n  def read_aloops(metadata_h5_file):\n      kinase_metadata_dict = phosx.utils.hdf5_to_dict(metadata_h5_file)\n      aloop_df = pd.DataFrame.from_dict(\n          kinase_metadata_dict[\"aloop_seq\"], orient=\"index\", columns=[\"Sequence\"]\n      )\n      return aloop_df\n\n\n  if __name__ == \"__main__\":\n      aloop_df = read_aloops(metadata_h5_file)\n  ```\n</details>\n\n## Output\n\n### Differential kinase activity score\n\nPhosX's main output is a text file reporting the computed kinase activities with associated statistics as described in the [Method](#method) section. For each kinase, the KS statistic, the _p_ value, the FDR _q_ value, and the Activity Score are reported. Kinases for which a differential activity could not be computed due to a low number of assigned phosphosites (see option `-mh MIN_N_HITS`) are reported as having Activity Score = `NA`. 
See an output example from the command executed [above](#usage):\n\n<details>\n  <summary>head kinase_activities.tmp</summary>\n\n  ```bash\n           KS        p value  FDR q value  Legacy Activity Score  Activity Score\n  ACVR2A   0.23331   0.4396   1.0          0.35694                0.35694\n  ACVR2B   0.45501   0.024    1.0          1.61979                1.61979\n  ALK2     0.25047   0.3692   1.0          0.43274                0.43274\n  ALK4     -0.35766  0.1504   1.0          -0.82275               -0.82275\n  ALPHAK3  -0.2778   0.3792   1.0          -0.42113               -0.42113\n  AMPKA2   0.44011   0.296    1.0          0.52871                0.52871\n  ATM      -0.51674  0.1008   1.0          -0.99654               -0.99654\n  ATR      -0.49195  0.1272   1.0          -0.89551               -0.89551\n  AURA     0.64746   0.0952   1.0          1.02136                1.02136\n  ```\n</details>\n\n### Statistics plots\n\nPhosX can also save, for each kinase, plots of the weighted running sum and of the KS statistic compared to its empirical null distribution, similar to the ones shown [above](#overview). To enable this behavior the option `--plot-figures` must be specified. A custom directory to save the plots can be passed with `-d`.\n\n### Kinase $\\rightarrow$ A-loop network\n\nA kinase-kinase network, where directed edges represent the potential of a source node to phosphorylate the A-loop of a target node, is inferred as part of evaluating the [upstream activation evidence](#upstream-activation-evidence). After a PhosX run, this network can be saved to disk by specifying a path with the command line argument `--network-path`. 
See an output example generated by running:\n\n```bash\nphosx -c 4 --network-path aloop_network.tsv tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk > kinase_activities.tmp\n```\n\n<details>\n  <summary>head aloop_network.tsv</summary>\n\n  ```bash\n  source  target  complementarity  source_activity_score  target_activity_score\n  ACVR2A  LATS1   1.0              0.35694                0.56543\n  ACVR2A  LATS2   1.0              0.35694                0.66474\n  ACVR2A  MARK1   1.0              0.35694                -1.43415\n  ACVR2A  MARK2   1.0              0.35694                -1.22475\n  ACVR2A  MARK3   1.0              0.35694                -0.91593177\n  ACVR2A  MARK4   1.0              0.35694                -1.19382\n  ACVR2A  NLK     1.0              0.35694                1.79588\n  ACVR2A  PINK1   1.0              0.35694                0.62051\n  ACVR2A  BLK     0.98             0.35694                -2.4437\n  ```\n</details>\n\n# Method\n\n## Phosphopeptide scoring\n\nFor each kinase PSSM, a score is assigned to each phosphopeptide sequence $S$ that quantifies its similarity to the PSSM. First, a \"raw PSSM score\" is computed as:\n\n```math\n\\texttt{score}(S,k) := \\prod_{i=-5}^{4} \n\\begin{cases}\n M^{k}_{i,S_i}, & \\text{if } S_i \\neq \\texttt{'\\_'} \\\\\n 1, & \\text{if } S_i = \\texttt{'\\_'} \n\\end{cases}\n```\n\nwhere $S_i$ is the amino acid residue at position $i$ of the phosphopeptide sequence $S$; $M^k_{i,j}$ is the value of the PSSM for kinase $k$ at position $i$ for residue $j$. Raw PSSM scores for each kinase are then transformed between $0$ and $1$ based on the quantile they fall in, considering a background distribution of proteome-wide raw PSSM scores. For each kinase, phosphopeptides with raw PSSM score equal to $0$ are discarded, and the remaining are used to determine the values of the $10,000$-quantiles of the raw PSSM score distribution. The background $10,000$-quantiles raw PSSM scores for each kinase PSSM are used to derive the final PSSM scores for each phosphopeptide.\n\n\n## Weighted running sum statistics\n\nPhosX uses the PSSM scores to link kinases to their potential substrates. Each phosphopeptide is assigned as a potential target to its $n$ top-scoring kinases, with a default value of $n = 5$ (options `-stk` and `-yk`). 
The method has little sensitivity to this parameter in the range $[5,15]$. The activity change of a given kinase is estimated by calculating a running sum statistic over the ranked list of phosphosites, and by estimating its significance based on an empirical distribution generated by random permutations of the ranks. Let $C$ be the set of indexes corresponding to the ranked phosphosites associated with kinase $k$; $N$ the total number of phosphosites; $N_h$ the size of $C$; $r_i$ the value of the ranking metric of the phosphosite at rank $i$, where $r_0$ is the highest value. Then, the running sum ($RS$) up to the phosphosite rank $n$ is given by\n\n```math\nRS(k,n) := \\sum_{i=0}^{n} \n\\begin{cases}\n \\frac{|r_i|}{N_R}, & \\text{if } i \\in C \\\\\n -\\frac{1}{N - N_h}, & \\text{if } i \\not\\in C \n\\end{cases}\n```\n\nwhere\n\n```math\nN_R = \\sum_{i \\in C} |r_i|\n```\n\nThe kinase enrichment score ($ES$) corresponds to the maximum deviation from $0$ of $RS$.\n\n## Empirical _p_ values\n\nFor each kinase, PhosX computes an empirical _p_ value of the $ES$ by generating a null distribution of the $ES$ through random permutations of the phosphosite ranks. A False Discovery Rate (FDR) _q_ value is also calculated by applying the Bonferroni method considering the number of kinases independently tested. 
The number of permutations is a tunable parameter but we recommend performing at least $10^4$ random permutations to be able to compute FDR values $< 0.05$.\n\n## Differential activity scores\n\nThe activity score (before correction based on the [upstream activation evidence](#upstream-activation-evidence)) for a given kinase $i$ is defined as:\n\n```math\na_i = -\\log_{10}{\\left(p_i\\right)} \\cdot \\texttt{sign}\\left(ES_i\\right)\n```\n\nwhere $\\texttt{sign}$ is the sign function, and $p_i$ is the [_p_ value](#empirical-p-values) associated with kinase $i$, capped at the smallest computable _p_ value different from $0$, _i.e._ the inverse of the number of random permutations.\nActivity scores greater than $0$ denote kinase activation, while the opposite corresponds to kinase inhibition.\n\n## Upstream activation evidence\n\nKinases that are more closely evolutionarily related tend to have more similar PSSMs, leading to a correlation in their inferred differential activities which might not be biologically real. PhosX attempts to find these instances in any given experiment and discriminate the truly differentially active kinases from the ones whose activity is falsely correlated with them. \n\nIn doing so, PhosX first builds a [directed network of kinases](#kinase--a-loop-network) to represent the potential of each kinase to phosphorylate the activation loop (A-loop) of any other except itself. Edges are inferred based on the same [PSSM score](#phosphopeptide-scoring) logic used to link the phosphosites to the putative upstream kinases, except that the maximum score is taken by sliding a PSSM over each possible position along an A-loop.\n\nIf kinases have highly overlapping sets of assigned phosphosites in the experiment, and also a similar differential activity score, then their activity changes are considered to be potentially correlated mostly because of PSSM similarity. 
In order to prioritize a putative \"true\" regulated kinase among those candidates, we look for other kinases that target the A-loops of the candidates. The inferred differential activity of such upstream kinases is treated as evidence for the regulation of their downstream targets. If such evidence supports the activity of a specific kinase, then the activity change of the other candidates is dampened, reducing the false positive rate of identifying differentially regulated kinases.\n\nThe logic above is implemented in PhosX using the following procedure, which is applied separately to kinases that are inferred to be upregulated ($a_i > 0$) or downregulated ($a_i < 0$). For the downregulated kinases we take the absolute value of their activity score and then negate the final modified score.\n\nLet $a$ be the activity vector of the kinases. If we are considering upregulated kinases, each $a_i$ is the [Activity Score](#differential-activity-scores) of kinase $i$ if $a_i > 0$, otherwise we assign a \"pseudo-null\" activity, setting $a_i=0.01$. If we are considering downregulated kinases, we first take the opposite of the Activity Scores and then modify $a$ analogously. Let $a'$ be the vector where each element $a'_i$ is the reciprocal of $a_i$; $A$ the diagonal matrix of $a$; $D^T$ the transposed adjacency matrix of the directed kinase network, _i.e._ a Boolean matrix indicating for each kinase which other kinases may target its A-loop; $C$ the redundancy matrix, a Boolean matrix indicating for each kinase which other kinases have an extensive overlap of substrates and therefore a potentially correlated activity. 
By default $C_{ij}=1$ if the phosphosites assigned to kinases $i$ and $j$ in the experiment have a Jaccard Index $J > 0.5$ (options `-urt` and `-drt`).\n\nWe compute $E = D^T \\cdot A$, which is the activation evidence matrix, indicating for each kinase the activation coming from every other kinase.\n\nWe then obtain $e$, the evidence vector, as the row-wise maximum of $E$, containing the upstream activation evidence of each kinase.\nLet $e'$ be a vector where each element $e'_i$ is the reciprocal of $e_i$.\n\nUltimately, we want to modify $a_i$ (the activity of a kinase $i$) proportionally to:\n\n```math\nm_{ij} = 1 - \\left\\{ C_{ij} \\cdot \\left(1 - \\frac{e_i}{e_j} \\right) \\cdot \\exp\\left[ -d \\left( \\frac{a_i}{a_j} \\right)^2 \\right] \\right\\}\n```\n\nfor whatever kinase $j$ gives the minimum $m_{ij}$, and where $d \\in \\mathbb{N}$ is the decay factor. $d=256$ by default (option `-df1`), and controls how fast $\\exp [ -d ( a_i/a_j)^2 ]$ decays from $1$ to $0$ as $a_i / a_j$ becomes different from $1$.\n\nNamely, we want to disregard the differential activity of kinase $i$ only if kinase $j$ is potentially correlated ($C_{ij} = 1$), _and_ if kinases $i$ and $j$ have similar inferred activities (_i.e._ $a_i/a_j$ is close to $1$), by a degree that is greater when the upstream activation evidence of kinase $i$ is smaller than that of kinase $j$ (_i.e._ $e_i/e_j$ is small).\n\nBefore applying this formula, we need to consider some cases regarding the value of $e_i / e_j$:\n\n* $0 < e_i/e_j < 1$, no change is needed;\n* $e_j=0 \\implies e_i/e_j = \\infty$, the competing kinase has no upstream activation evidence, we set $e_i / e_j= 1$ (leading to $m_{ij} = 1$ );\n* $e_i=0 \\land e_j=0 \\implies e_i/e_j$ is undefined, both kinases have no upstream activation evidence, we set $e_i / e_j= 1$ (leading to $m_{ij} = 1$ );\n* $e_i/e_j \\ge 1$, the upstream activation evidence of kinase $i$ is greater than or equal to that of kinase $j$, therefore we don't want to correct $a_i$ and we set $e_i / e_j= 1$ (leading to $m_{ij} = 1$ ).\n\nWe 
can then obtain $F = ee'^T$, the outer product of the evidence vector and its reciprocal, containing the ratio of upstream evidences for each kinase pair. The conditional transformations above are applied to the elements of $F$.\n\nLet also $B = aa'^T$ be the outer product of the activity vector and its reciprocal, containing the ratio of inferred activities for each kinase pair; and $X = \\exp ( -d B^2 )$, with the square and the exponential taken element-wise, a matrix of values between $0$ and $1$ indicating how similar the inferred activity changes of any two kinases are. \n\nWe can then rewrite the equation for $m_{ij}$ more simply as:\n\n```math\nm_{ij} = 1 - \\left\\{ C_{ij} \\cdot \\left(1 - F_{ij} \\right) \\cdot X_{ij} \\right\\}\n```\n\nTherefore, to find all possible $m_{ij}$ and then, for each $i$, select the minimum, we first compute the matrix $M$:\n\n```math\nM = 1 - \\left\\{ C \\circ \\left(1 - F \\right) \\circ X \\right\\}\n```\n\nand then get $z$, the activity modifier vector indicating the modifier factor for each kinase, by taking the row-wise minimum of $M$.\n\nLastly, we set $z_i=1$ for each kinase $i$ that doesn't have a regulatory A-loop. 
Only the differential activity of kinases that have an A-loop reported in the [metadata](#kinases-metadata) will be modified, as only their activities are assumed to depend on such a regulatory feature.\n\nThe final kinase differential activity scores are given by $a \\circ z$.\n\n# Cite\n\nPlease cite one of the following references if you use PhosX in your work.\n\n## Bioinformatics\n\nBibTeX:\n\n```bibtex\n@article{10.1093/bioinformatics/btae697,\n author = {Lussana, Alessandro and M\u00fcller-Dott, Sophia and Saez-Rodriguez, Julio and Petsalaki, Evangelia},\n title = {PhosX: data-driven kinase activity inference from phosphoproteomics experiments},\n journal = {Bioinformatics},\n volume = {40},\n number = {12},\n pages = {btae697},\n year = {2024},\n month = {11},\n issn = {1367-4811},\n doi = {10.1093/bioinformatics/btae697},\n url = {https://doi.org/10.1093/bioinformatics/btae697},\n eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/12/btae697/60972735/btae697.pdf},\n}\n```\n\n## BioRxiv\n\nBibTeX:\n\n```bibtex\n@article{Lussana2024,\n title = {PhosX: data-driven kinase activity inference from phosphoproteomics experiments},\n url = {http://dx.doi.org/10.1101/2024.03.22.586304},\n DOI = {10.1101/2024.03.22.586304},\n publisher = {Cold Spring Harbor Laboratory},\n author = {Lussana, Alessandro and Petsalaki, Evangelia},\n year = {2024},\n month = mar \n}\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "Differential kinase activity inference from phosphosproteomics data",
"version": "0.20.0",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "8539681315e67e8669761f3f81f42dec2a03c04cd24a6817935ad3fd26de8a89",
"md5": "b13c2507fc70a509202e8f5c0a38d3e5",
"sha256": "ef2ec437444bd0bd2981cee4ed51e4086bb8cecf429a94c1592b3062f70d1999"
},
"downloads": -1,
"filename": "phosx-0.20.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b13c2507fc70a509202e8f5c0a38d3e5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 31601789,
"upload_time": "2025-10-09T21:45:56",
"upload_time_iso_8601": "2025-10-09T21:45:56.011442Z",
"url": "https://files.pythonhosted.org/packages/85/39/681315e67e8669761f3f81f42dec2a03c04cd24a6817935ad3fd26de8a89/phosx-0.20.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "510dd45785ec5cead0c7c9cb89606464a1ffbd76798e9637575c89972a54e351",
"md5": "9389dd476b334f9edec8b2e486ff22d6",
"sha256": "7cf6ffdaa02b227403e5a3a9509d50c493545391ef973bef61c6d11e6866a8f4"
},
"downloads": -1,
"filename": "phosx-0.20.0.tar.gz",
"has_sig": false,
"md5_digest": "9389dd476b334f9edec8b2e486ff22d6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 31604693,
"upload_time": "2025-10-09T21:45:58",
"upload_time_iso_8601": "2025-10-09T21:45:58.440393Z",
"url": "https://files.pythonhosted.org/packages/51/0d/d45785ec5cead0c7c9cb89606464a1ffbd76798e9637575c89972a54e351/phosx-0.20.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-09 21:45:58",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "phosx"
}