python-katlas


- Name: python-katlas
- Version: 2025.10.20.2
- Home page: https://github.com/sky1ove/katlas
- Summary: tools for predicting kinome specificities
- Upload time: 2025-10-20 22:13:33
- Author: lily
- Requires Python: >=3.7
- License: Apache Software License 2.0
- Keywords: nbdev, jupyter, notebook, python

# KATLAS


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

<img alt="Katlas logo" width="600" caption="Katlas logo" src="https://github.com/sky1ove/katlas/raw/main/logo.png" id="logo"/>

KATLAS is a repository of Python tools for predicting which kinases
phosphorylate a given substrate sequence. It also contains datasets of
kinase substrate specificities and human phosphoproteomics.

***References***: Please cite the appropriate papers if KATLAS is
helpful to your research.

- KATLAS was described in the paper \[Computational Decoding of Human
  Kinome Substrate Specificities and Functions\]

- The positional scanning peptide array (PSPA) data come from the papers [An
  atlas of substrate specificities for the human serine/threonine
  kinome](https://www.nature.com/articles/s41586-022-05575-3) and
  [The intrinsic substrate specificity of the human tyrosine
  kinome](https://www.nature.com/articles/s41586-024-07407-y)

- The kinase substrate datasets used for generating PSSMs are derived
  from
  [PhosphoSitePlus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245126/)
  and the paper [Large-scale Discovery of Substrates of the Human
  Kinome](https://www.nature.com/articles/s41598-019-46385-4)

- Phosphorylation sites are acquired from
  [PhosphoSitePlus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245126/),
  the paper [The functional landscape of the human
  phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3),
  and [CPTAC](https://pdc.cancer.gov/pdc/cptac-pancancer) /
  [LinkedOmics](https://academic.oup.com/nar/article/46/D1/D956/4607804)

## Reproduce datasets & figures

Follow the instructions in katlas_raw:
https://github.com/sky1ove/katlas_raw

## Web applications

Users can now run the analysis directly on the web without needing to
code.

Check out our latest web platform:
[kinase-atlas.com](https://kinase-atlas.com/)

## Install

UV:

``` bash
uv add -U python-katlas
```

pip:

``` bash
pip install -U python-katlas
```

If you use the machine-learning related modules, install the development
version: `pip install -U "python-katlas[dev]"`

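To confirm which release ended up in your environment, a minimal sketch using only the standard library is shown below. `installed_version` is a hypothetical helper written for this README, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata  # standard library, Python 3.8+
from typing import Optional


def installed_version(dist_name: str = "python-katlas") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string such as the one listed on PyPI, or None if not installed.
print(installed_version())
```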
## Import

``` python
from katlas.common import *
```

## Quick start

We provide two methods to score a substrate sequence against kinases:

- Computational Data-Driven Method (CDDM)
- Positional Scanning Peptide Array (PSPA)

We consider the input in two formats:

- a single input string (phosphorylation site)
- a csv/dataframe that contains a column of phosphorylation sites

Input sequences can be written in two ways:

- all capital letters
- with lowercase letters indicating phosphorylation status

### Site scoring

CDDM, all capital

``` python
predict_kinase('AAAAAAASGAGSDN',**Params("CDDM_upper"))
```

    considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2A', '3G', '4S', '5D', '6N']

    GCN2      4.556
    MPSK1     4.425
    MEKK2     4.253
    WNK3      4.213
    WNK1      4.064
              ...  
    PDK1    -25.077
    PDHK3   -25.346
    CLK2    -27.251
    ROR2    -27.582
    DDR1    -53.581
    Length: 328, dtype: float64
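The `considering string` line printed above labels every residue with its offset from the central phospho-acceptor. A minimal sketch of that labeling follows. It assumes the center sits at index `len(seq) // 2`, which matches the printed examples in this README but is not a confirmed description of KATLAS internals; `label_positions` is a hypothetical helper.

```python
from typing import List


def label_positions(seq: str) -> List[str]:
    """Label each residue with its offset from the central residue.

    Assumption: the center is at index len(seq) // 2.
    """
    center = len(seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


print(label_positions("AAAAAAASGAGSDN"))
```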

CDDM, with lower case indicating phosphorylation status

``` python
predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))
```

    considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

    ROR1       8.355
    WNK1       4.907
    WNK2       4.782
    ERK5       4.466
    RIPK2      4.045
               ...  
    DDR1     -29.393
    TNNI3K   -29.884
    CHAK1    -31.775
    VRK1     -45.287
    BRAF     -49.403
    Length: 328, dtype: float64
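When lowercase letters carry phosphorylation status, it can help to check which offsets are marked before scoring. The sketch below is illustrative only: `phospho_offsets` is a hypothetical helper, and treating only lowercase `s`, `t` and `y` as phosphorylated residues is an assumption made here.

```python
from typing import List

# Assumption: only lowercase s/t/y are read as phosphorylated residues.
PHOSPHO = set("sty")


def phospho_offsets(seq: str) -> List[int]:
    """Return offsets (relative to the central residue) of lowercase s/t/y."""
    center = len(seq) // 2
    return [i - center for i, aa in enumerate(seq) if aa in PHOSPHO]


site = "AAAAAAAsGGAGsDN"
print(phospho_offsets(site))  # offsets of the marked residues
print(site.upper())           # the same site written in all capitals
```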

PSPA, with lower case indicating phosphorylation status

``` python
predict_kinase('AEEKEyHsEGG',**Params("PSPA"))
```

    considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

    kinase
    EGFR          4.013
    FGFR4         3.568
    ZAP70         3.412
    CSK           3.241
    SYK           3.209
                  ...  
    JAK1         -3.837
    DDR2         -4.421
    TNK2         -4.534
    TNNI3K_TYR   -4.651
    TNK1         -5.320
    Length: 93, dtype: float64

To replicate the results from The Kinase Library (PSPA):

Open this link: [The Kinase
Library](https://kinase-library.mit.edu/site?s=AEEKEy*HSEGG&pp=false&scp=true),
and rank the kinases by log2(score). The ranking matches the output
below, apart from slight differences due to rounding.

``` python
out = predict_kinase('AEEKEyHSEGG',**Params("PSPA"))
out
```

    considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

    kinase
    EGFR     3.181
    FGFR4    2.390
    CSK      2.308
    ZAP70    2.068
    SYK      1.998
             ...  
    EPHA1   -3.501
    FES     -3.699
    TNK1    -4.269
    TNK2    -4.577
    DDR2    -4.920
    Length: 93, dtype: float64
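To make the idea of a `log2(score)` ranking concrete, here is a toy sketch of scoring a site against position-specific values. Everything in it is assumed for illustration: the numbers in `toy_matrix` are invented, position/residue pairs missing from the matrix are treated as neutral (1.0), and summing log2 of per-position values is a common PSSM-style convention rather than a confirmed description of what `predict_kinase` computes.

```python
import math
from typing import Dict, Tuple


def log2_score(seq: str, matrix: Dict[Tuple[int, str], float]) -> float:
    """Sum log2 of the matrix value at every flanking position of the site."""
    center = len(seq) // 2
    total = 0.0
    for i, aa in enumerate(seq):
        offset = i - center
        if offset == 0:
            continue  # skip the central phospho-acceptor itself
        # Assumption: unknown (offset, residue) pairs are neutral.
        total += math.log2(matrix.get((offset, aa), 1.0))
    return total


# Invented values for a single hypothetical kinase.
toy_matrix = {(-1, "E"): 2.0, (1, "H"): 4.0, (2, "S"): 0.5}

print(log2_score("AEEKEyHSEGG", toy_matrix))  # unphosphorylated S at +2
print(log2_score("AEEKEyHsEGG", toy_matrix))  # phosphorylated s at +2
```

The two calls differ only in the case of the residue at +2, which is why the same site can rank kinases differently depending on its phosphorylation status.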

- So far [The Kinase Library](https://kinase-library.phosphosite.org)
  treats all ***tyr sequences*** as all capital, even when they contain
  lowercase residues. This is a small bug and should be fixed soon.
- A kinase name ending in “\_TYR” indicates a dual-specificity kinase
  tested in the PSPA tyrosine setting. These kinases are not yet included
  in kinase-library.

We can also calculate a percentile score using a reference score
sheet.

``` python
# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()
```
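As a rough picture of what a percentile reference does, the sketch below places one raw score within a list of reference scores. It assumes the percentile is the share of reference scores at or below the query score; the exact definition used by `get_pct` may differ, and `percentile_of` and `toy_reference` are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_of(score: float, reference: Sequence[float]) -> float:
    """Percent of reference scores that are less than or equal to `score`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Invented reference scores for one hypothetical kinase.
toy_reference = [-4.0, -2.0, 0.0, 1.0, 3.0]
print(percentile_of(1.5, toy_reference))
```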

``` python
get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))
```

    considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']

<div>

|       | log2(score) | percentile |
|-------|-------------|------------|
| EGFR  | 3.181       | 96.787423  |
| FGFR4 | 2.390       | 94.012303  |
| CSK   | 2.308       | 95.201640  |
| ZAP70 | 2.068       | 88.380041  |
| SYK   | 1.998       | 85.522898  |
| ...   | ...         | ...        |
| EPHA1 | -3.501      | 12.139440  |
| FES   | -3.699      | 21.216678  |
| TNK1  | -4.269      | 5.481887   |
| TNK2  | -4.577      | 2.050581   |
| DDR2  | -4.920      | 10.403281  |

<p>93 rows × 2 columns</p>
</div>

### Site scoring in a df

Load your csv:

``` python
# df = pd.read_csv('your_file.csv')
```
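Before scoring your own file, it can be worth checking that the site column contains only expected characters. The sketch below uses the standard library `csv` module. The allowed alphabet (20 amino acids, lowercase `s`/`t`/`y`, and `_` padding) is an assumption, `invalid_sites` is a hypothetical helper, and the second row of `demo_csv` is deliberately malformed for the demonstration.

```python
import csv
import io
from typing import List

# Assumed alphabet: 20 amino acids, lowercase s/t/y, and '_' padding.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "sty" + "_")


def invalid_sites(csv_text: str, column: str) -> List[str]:
    """Return values of `column` that contain characters outside ALLOWED."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if not set(row[column]) <= ALLOWED]


demo_csv = (
    "site_seq,gene_site\n"
    "VDDEKGDSNDDYDSA,A0A075B6Q4_S24\n"
    "VDDEKGD5NDDYDSA,bad_row\n"  # invented row with a stray digit
)
print(invalid_sites(demo_csv, "site_seq"))
```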

Or load a demo df

``` python
# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
```

<div>

|     | site_seq        | gene_site      |
|-----|-----------------|----------------|
| 0   | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
| 1   | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
| 2   | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
| 3   | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
| 4   | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |

</div>

Set the column name and the parameter set to use for scoring.

Here we choose `Params("CDDM_upper")`, as the sequences in the demo df are
all in capital letters. You can also choose other params.

``` python
results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results
```

    input dataframe has a length 5
    Preprocessing
    Finish preprocessing
    Merging reference
    Finish merging

<div>

|  | SRC | EPHA3 | FES | NTRK3 | ALK | ABL1 | FLT3 | EPHA8 | EPHB2 | EPHB1 | ... | VRK1 | PKMYT1 | GRK3 | CAMK1B | CDC7 | SMMLCK | ROR1 | GAK | MAST2 | BRAF |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | -2.440640 | -0.818753 | -1.663990 | -0.738991 | -2.047628 | -3.602344 | -3.200998 | -0.935176 | -1.388444 | -1.859450 | ... | -17.103237 | -113.698143 | -16.848783 | -41.520172 | -41.646187 | 1.284159 | -26.566362 | -69.165062 | -17.706400 | -87.763214 |
| 1 | -3.838486 | -2.735969 | -2.533986 | -2.150399 | -3.792498 | -4.725527 | -5.711791 | -4.534240 | -3.148449 | -2.511518 | ... | -67.889053 | -68.652641 | -45.833855 | -64.171600 | -39.465572 | -65.061722 | -109.561707 | -85.911224 | -60.105064 | -63.889122 |
| 2 | -2.610423 | -2.370090 | -3.235637 | -1.508413 | -2.571347 | -3.740941 | -3.025596 | -3.373504 | -2.776297 | -3.060740 | ... | -15.798462 | -45.905319 | -61.440742 | -67.695694 | -55.047962 | -42.135216 | -38.501572 | -62.624382 | -56.119389 | -107.060989 |
| 3 | -5.180541 | -4.201880 | -5.766463 | -3.038421 | -3.836897 | -4.249900 | -5.029885 | -5.411311 | -4.713308 | -4.827825 | ... | -96.978317 | -83.419777 | -22.559393 | -110.611588 | -63.283070 | -37.240440 | -24.497492 | -112.878151 | -43.538158 | -60.348518 |
| 4 | -2.844254 | -3.322700 | -3.681745 | -1.766435 | -2.666579 | -3.748774 | -4.083619 | -3.912834 | -3.724181 | -3.948160 | ... | -35.824612 | -87.983566 | -83.312317 | -107.162407 | -61.478374 | -85.793571 | -43.738819 | -47.004211 | -42.281624 | -59.518513 |

<p>5 rows × 328 columns</p>
</div>

``` python
results.iloc[0].sort_values(ascending=False)
```

    TLK2        8.264621
    GCN2        8.101542
    TLK1        7.693897
    HRI         6.691402
    PLK3        6.579368
                 ...    
    NIK       -64.605148
    SRPK2     -67.300667
    GAK       -69.165062
    BRAF      -87.763214
    PKMYT1   -113.698143
    Name: 0, Length: 328, dtype: float32
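Sorting one row, as above, is the basic step for picking the best-scoring kinases of a site. A library-free sketch of the same idea is shown below; `top_kinases` is a hypothetical helper, and `row0` copies a handful of the scores printed above.

```python
from typing import Dict, List, Tuple


def top_kinases(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (kinase, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# A few of the values printed above for the first site.
row0 = {
    "BRAF": -87.763214,
    "TLK1": 7.693897,
    "PKMYT1": -113.698143,
    "TLK2": 8.264621,
    "GCN2": 8.101542,
}
print(top_kinases(row0, k=3))
```

With the pandas `results` frame above, `results.iloc[0].nlargest(3)` gives a comparable top-three view.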

## Dataset

Besides calculating sequence scores, KATLAS also provides multiple
datasets of phosphorylation sites.

### CPTAC pan-cancer phosphoproteomics

``` python
df = Data.get_cptac_ensembl_site()
df.head(3)
```

<div>

|  | gene | site | site_seq | protein | gene_name | gene_site | protein_site |
|----|----|----|----|----|----|----|----|
| 0 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000000412.3 | M6PR | M6PR_S267 | ENSP00000000412_S267 |
| 1 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000440488.2 | M6PR | M6PR_S267 | ENSP00000440488_S267 |
| 2 | ENSG00000048028.11 | S1053 | PPTIRPNSPYDLCSR | ENSP00000003302.4 | USP28 | USP28_S1053 | ENSP00000003302_S1053 |

</div>

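In the table above, `gene_site` and `protein_site` look like an identifier joined to the site label, with the Ensembl version suffix dropped for proteins. The sketch below reproduces that pattern. How the columns were actually built is an assumption, and `site_key` is a hypothetical helper.

```python
def site_key(identifier: str, site: str, strip_version: bool = False) -> str:
    """Join an identifier and a site label with '_'.

    If strip_version is True, a trailing '.N' version suffix is removed first.
    """
    if strip_version:
        identifier = identifier.split(".", 1)[0]
    return f"{identifier}_{site}"


print(site_key("M6PR", "S267"))
print(site_key("ENSP00000000412.3", "S267", strip_version=True))
```

Because the first two rows share the same `gene_site` but differ in `protein_site`, such keys are one way to decide whether to treat isoform rows as duplicates.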
### [Ochoa et al. human phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3)

``` python
df = Data.get_ochoa_site()
df.head(3)
```

<div>

|  | uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | A0A075B6Q4 | 24 | S | 1.0 | 0.91 | 6.839384 | 1.0 | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
| 1 | A0A075B6Q4 | 35 | S | 1.0 | 0.87 | 9.192622 | 0.0 | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
| 2 | A0A075B6Q4 | 57 | S | 0.0 | 0.28 | 0.818834 | 0.0 | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |

</div>

### PhosphoSitePlus human phosphorylation sites

``` python
df = Data.get_psp_human_site()
df.head(3)
```

<div>

|  | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | \_\_\_\_\_\_MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
| 1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | \_\_MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
| 2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |

</div>

### Unique sites of combined Ochoa & PhosphoSitePlus

``` python
df = Data.get_combine_site_psp_ochoa()
df.head(3)
```

<div>

|  | uniprot | gene | site | site_seq | source | AM_pathogenicity | CDDM_upper | CDDM_max_score |
|----|----|----|----|----|----|----|----|----|
| 0 | A0A024R4G9 | C19orf48 | S20 | ITGSRLLSMVPGPAR | psp | NaN | PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H... | 2.407041 |
| 1 | A0A075B6Q4 | None | S24 | VDDEKGDSNDDYDSA | ochoa | NaN | CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA... | 2.295654 |
| 2 | A0A075B6Q4 | None | S35 | YDSAGLLSDEDCMSV | ochoa | NaN | CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK... | 2.488683 |

</div>

## Phosphorylation site sequence examples

***All capital - 15 length (-7 to +7)***

- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARyGPQFTLQ

***All capital - 10 length (-5 to +4)***

- SRDPHYQDPH
- LDNPDyQQDF
- AAAAAsGGAG

***With lowercase - (-7 to +7)***

- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ

***With lowercase - (-5 to +4)***

- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG

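The 15-residue examples above are windows from -7 to +7 around the site. A minimal sketch of cutting such a window out of a protein sequence is shown below, padding with `_` when the site is near a terminus, as in the PhosphoSitePlus table. `site_window` is a hypothetical helper, not a KATLAS function; `ochoa_prefix` is only the visible start of the truncated sequence in the Ochoa table, and the short YWHAB string is the N-terminus implied by the PhosphoSitePlus rows.

```python
def site_window(protein_seq: str, position: int, flank: int = 7, pad: str = "_") -> str:
    """Return residues position-flank .. position+flank (position is 1-based)."""
    idx = position - 1
    # Pad both ends so windows near the termini keep a fixed length.
    padded = pad * flank + protein_seq + pad * flank
    return padded[idx : idx + 2 * flank + 1]


# Visible prefix of the sequence shown in the Ochoa table (truncated there).
ochoa_prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"
print(site_window(ochoa_prefix, 24))

# N-terminal residues implied by the PhosphoSitePlus rows, in capitals.
print(site_window("MTMDKSELVQKAK", 2))
```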
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/sky1ove/katlas",
    "name": "python-katlas",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "nbdev jupyter notebook python",
    "author": "lily",
    "author_email": "lcai888666@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/9a/9b/bb087ef004e42ca232351ffb69a086ff924aa6f85ed128ed95fc80b7507e/python_katlas-2025.10.20.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "tools for predicting kinome specificities",
    "version": "2025.10.20.2",
    "project_urls": {
        "Homepage": "https://github.com/sky1ove/katlas"
    },
    "split_keywords": [
        "nbdev",
        "jupyter",
        "notebook",
        "python"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "795df85f3e6ea87aac900fd8b1b686a93dd33ca90b036794c2dfdb7b375d2e59",
                "md5": "6274b091d2321544cde01f47342c5458",
                "sha256": "d5c4057ea066da320fffd70054da4c98a84a5242b3c4c7eefe1c1391181a9692"
            },
            "downloads": -1,
            "filename": "python_katlas-2025.10.20.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6274b091d2321544cde01f47342c5458",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 59905,
            "upload_time": "2025-10-20T22:13:31",
            "upload_time_iso_8601": "2025-10-20T22:13:31.955285Z",
            "url": "https://files.pythonhosted.org/packages/79/5d/f85f3e6ea87aac900fd8b1b686a93dd33ca90b036794c2dfdb7b375d2e59/python_katlas-2025.10.20.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9a9bbb087ef004e42ca232351ffb69a086ff924aa6f85ed128ed95fc80b7507e",
                "md5": "bca1b1c768a3042905a9faec1b16cf47",
                "sha256": "4f16ecd7b19bd3f0e75bf49f0292fee163d0155a119c4038b1536cea776aa6e9"
            },
            "downloads": -1,
            "filename": "python_katlas-2025.10.20.2.tar.gz",
            "has_sig": false,
            "md5_digest": "bca1b1c768a3042905a9faec1b16cf47",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 61399,
            "upload_time": "2025-10-20T22:13:33",
            "upload_time_iso_8601": "2025-10-20T22:13:33.312154Z",
            "url": "https://files.pythonhosted.org/packages/9a/9b/bb087ef004e42ca232351ffb69a086ff924aa6f85ed128ed95fc80b7507e/python_katlas-2025.10.20.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-20 22:13:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sky1ove",
    "github_project": "katlas",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "python-katlas"
}
        