msstitch

Name	msstitch JSON
Version	3.17 JSON
	download
home_page	https://github.com/lehtiolab/msstitch
Summary	MS proteomics post processing utilities
upload_time	2025-02-01 10:04:50
maintainer	Jorrit Boekel
docs_url	None
author	Jorrit Boekel
requires_python	None
license	MIT
keywords	mass spectrometry proteomics processing
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # msstitch - MS proteomics post-processing utilities

Shotgun proteomics has a number of bioinformatic tools available for identification 
and quantification of peptides, and the subsequent protein inference. `msstitch` is a 
tool to integrate a number of these tools, generating ready to use result files. Supported
formats are currently:

- mzML files
- Fasta peptides
- MSGF+ PSM tables (and mzIdentML)
- Sage PSM tables (not for quant yet)
- Percolator output
- OpenMS consensusXML isobaric quant
- Hardklor/Kronik MS1 quant output
- Dinosaur MS1 quant output

If you need support for another format, there is limited time but infinite gratitude :)

## Usage

### Storing data
An example command flow would first store mzML spectra data in an SQLite file:

```
msstitch storespectra --spectra file1.mzML file2.mzML \
  --setnames sampleset1 sampleset2 -o db.sqlite
```

Or, to add spectra to an existing SQLite lookup:

```
msstitch storespectra --dbfile lookup.sqlite --spectra file3.mzML file4.mzML \
  --setnames sampleset2 sampleset3
```

Then store quantification data from dinosaur (MS1 precursor quant) and isobaric 
quantification (including precursor purities, but use centroided MS1 for this)
from OpenMS together with the spectra. MS1 precursor quant features will be aligned
with MS2 spectra by selecting the best m/z match (an m/z window of 2x m/z 
tolerance), from all features within the retention time window of 2x RT tolerance.
--rttol is specified in seconds.

```
msstitch storequant --dbfile db.sqlite --spectra file1.mzML file2.mzML \
  --mztol 10 --mztoltype ppm --rttol 10 \
  --dinosaur file1.dinosaur file2.dinosaur \
  --isobaric file1.consensusXML file2.consensusXML
```

When using Hardklor/Kronik instead of Dinosaur, you can instead use:

```
msstitch storequant --dbfile db.sqlite --spectra file1.mzML file2.mzML \
  --kronik file1.kronik file2.kronik \
  --isobaric file1.consensusXML file2.consensusXML
```

For both Dinosaur and Kronik, the MS1 peak sum is used which theoretically would be more correct
when having differently shaped envelopes. If you'd rather use the envelope apex, pass `--apex`
in the above command.


### Handling MS search engines
Create a decoy database where peptides are reversed between tryptic residues. Decoy
peptides will be shuffled if they match a target sequence, but if they after shuffling
a number of times (max as in `--maxshuffle`) still match a target sequence, they will 
be removed. To avoid removal (e.g. for keeping database sizes identical), pass `--keep-target'.

```
msstitch makedecoy uniprot.fasta -o decoy.fasta --scramble tryp_rev --maxshuffle 10
```

Or without even trying to shuffle peptide sequences that match to the target DB:

```
msstitch makedecoy uniprot.fasta -o decoy.fasta --scramble tryp_rev --ignore-target-hits
```


After running two samples of MSGF and percolator, we can start making 
a more proper set of PSM tables by adding percolator data and filtering
on FDR. You need percolator data, a PSM table and a matching mzIdentML file.
The PSM table / mzIdentML should not have been filtered differently, as each PSM
is expected to be a single item in the mzIdentML file and vice versa. PSMs not found
in the percolator data are not output. The following adds percolator svm-score,
q-value (FDR), and posterior error as columns to the PSM table:

```
# Add percolator data, filter 0.01 FDR
msstitch perco2psm -i psms1.txt \
  --perco percolator1.xml --mzid psms1.mzIdentML \
  --filtpsm 0.01 --filtpep 0.01
msstitch perco2psm -i psms2.txt \
  --perco percolator2.xml --mzid psms2.mzIdentML \
  --filtpsm 0.01 --filtpep 0.01
# Combine the two sets and split to a target and decoy file
msstitch concat -i psms1.txt psms2.txt -o allpsms.txt
msstitch split -i allpsms.txt --splitcol TD
```

This also adds ExplainedIonCurrentRatio, which is parsed from the mzIdentML file as delivered by MSGF+.

If you're running Sage instead of MSGFplus, you will not need the mzIdentML, so the `perco2psm`
commands above can be written e.g.:
```
msstitch perco2psm -i psms1.sage.txt -o psms.sage.percolated.txt \
  --perco percolator1.xml \
  --filtpsm 0.01 --filtpep 0.01
... etc
```

Now refine the PSM tables, using the earlier created SQLite DB, 
adding more information (sample name, MS1 precursor quant,
isobaric quant, proteingroups, genes). In this example we set isobaric
quantitation intensities to NA if the precursor purity measured is <0.3.

```
cp db.sqlite decoy_db.sqlite
msstitch psmtable -i target.tsv -o target_psmtable.txt --fasta uniprot.fasta \
  --dbfile db.sqlite --addmiscleav --addbioset --ms1quant --isobaric \
  --min-precursor-purity 0.3 --proteingroup --genes
msstitch psmtable -i decoy.tsv -o decoy_psmtable.txt --fasta decoy.fasta \
  --dbfile decoy_db.sqlite --proteingroup --genes --addbioset
```

When using Sage results, you will have missed cleavages out of the box (and retention time, ion mobility)! So leave
out `--addmiscleav`. Otherwise msstitch will accept the Sage file format as well.


If necessary (e.g. multiple TMT sample sets), split the table before making
protein/peptide tables:

```
msstitch split -i target_psmtable.txt --splitcol bioset
```

### Summarizing PSMs
Create a peptide table, with summarized median isobaric quant ratios,
highest MS1 intensity PSM as the peptide MS1 quant intensity, and an additional
linear-modeled q-value column:

```
msstitch peptides -i set1_target_psms.txt -o set1_target_peptides.txt \
  --scorecolpattern svm --modelqvals --ms1quant \
  --isobquantcolpattern tmt10plex --denompatterns _126 _127C 
```

The same peptide table can also be made using median sweeping, which takes the median
intensity channel for each PSM as a denominator. Here is also exemplified how
to do channel-median centering of the ratios to normalize and use log2 intensity
values before calculating ratios:

```
msstitch peptides -i set1_target_psms.txt -o set1_target_peptides.txt \
  --scorecolpattern svm --modelqvals --ms1quant \
  --isobquantcolpattern tmt10plex --mediansweep --logisoquant --median-normalize
```

Or, if you only want the median PSM intensity per peptide summarized, use `--medianintensity`
Here is also illustrated that you can use the --keep-psms-na-quant flag to NOT
throw out the PSMs which have isobaric intensity below the mininum intensity 
(default 0, here 100) IN ANY channel:

```
msstitch peptides -i set1_target_psms.txt -o set1_target_peptides.txt \
  --scorecolpattern svm --modelqvals --ms1quant \
  --isobquantcolpattern tmt10plex --medianintensity \
  --minint 100 --keep-psms-na-quant
```

In case of analyzing peptides with PTMs, you may want to process a subset of
PSMs (those with the PTMs) to create a separate peptide table from. In
that case, there is an option to divide (or subtract for log2 data) isobaric 
quant values to a protein (or gene) table from a non-PTM search, often done on
another non-enriched sample. This allows discerning PTM-peptide 
differential expression from its respective protein differential expression in the sample.
The protein/gene table should obviously contain the same samples/channel,
and for example be from an `msstitch proteins` or `msstitch isosummarize` command,
using `--median-normalize` to get median centered ratios for the proteins or genes.
After that, create a peptide table from PTM-PSMs as follows:

```
msstitch peptides -i set1_ptm_psms.txt -o set1_ptm_peptides.txt \
  --scorecolpattern svm --isobquantcolpattern tmt10plex --denompatterns _126 _127C \
  --logisoquant --totalproteome set1_proteins.txt
```

For proper normalizing of this table (as it would otherwise be impacted by
sample differences per channel), you may want to median-center. In the case of 
small and possibly noisy PTM tables, it can be advisable to use another table
from a global search (or e.g. the full peptide or protein table from the PTM search)
for determining channel medians. This is possible by specifying the above command
plus:

```
  --median-normalize --normalization-factors-table /path/to/set1_global_proteins.txt
```

To create a protein table, with isobaric quantification as for peptides, the
average of the top-3 highest intensity peptides for MS1 quantification.
For all of these, summarizing isobaric PSM data to peptide, protein, gene features 
is done using medians of log2 PSM quantification values per feature (e.g. a protein). If you'd
rather use averages, use `--summarize-average` as below, where we also show log2
transformation of intensities before summarizing and subsequent median-centering.
FDR (q-values) for the protein table is here calculated 'classically', by ranking
target and decoy proteins, taking their best scoring peptide's q-value as a score.
`--logscore` is used since q-value is used (higher is better when comparing peptides).

```
msstitch proteins -i set1_target_peptides.txt --decoyfn set1_decoy_peptides \
  --psmtable set1_target_psms.txt \
  -o set1_proteins.txt \
  --scorecolpattern '^q-value' --logscore \
  --ms1quant \
  --isobquantcolpattern tmt10plex --denompatterns _126 _127C \
  --summarize-average --logisoquant --median-normalize
```

Or the analogous process for genes, using median sweeping to get intensity ratios instead of denominators:
As for peptides above, one can use the --keep-psms-na-quant flag to NOT
throw out the PSMs which have isobaric intensity below the mininum intensity
(default 0 used here) in any channel. Here we use picked FDR (Savitski et al. 2015 MCP)
to define q-values for the genes, for which you need a target and decoy fasta to form pairs.
The fasta files need to be analogous, i.e. the same order for matching T/D pair genes.

```
msstitch genes -i set1_target_peptides.txt --decoyfn set1_decoy_peptides \
  --psmtable set1_target_psms.txt \
  -o set1_genes.txt \
  --scorecolpattern '^q-value' --logscore \
  --fdrtype picked --targetfasta tdb.fa --decoyfasta decoy.fa \
  --ms1quant \
  --isobquantcolpattern tmt10plex --mediansweep \
  --keep-psms-na-quant
```

Or when there are ENSEMBL entries in the fasta search database, even for ENSG, here with summarized median PSM intensity per ENSG:

```
msstitch ensg -i set1_target_peptides.txt --decoyfn set1_decoy_peptides \
  --psmtable set1_target_psms.txt \
  -o set1_ensg.txt \
  --scorecolpattern '^q-value' --logscore \
  --ms1quant \
  --isobquantcolpattern tmt10plex --medianintensity \
  --median-normalize
```

Finally, merge multiple sets of proteins (or genes/ENSG) into a single output.
Here we set an cutoff so that features with FDR > 0.01 are set to NA for the 
respective sample set.

```
msstitch merge -i set1_proteins.txt set2_proteins.txt -o protein_table.txt \
  --setnames sampleset1 sampleset2 \
  --dbfile db.sqlite \
  --fdrcolpattern 'q-value' --mergecutoff 0.01 \
   --ms1quantcolpattern area --isobquantcolpattern plex
```


### Some other useful commands
Trypsinize a fasta file (minimum retained peptide length, do cut K/RP, allow 1 missed cleavage, 
also generate peptides from protein N-term with methionine loss).
This also treats stop codons as separators and will not read through them if they are in a protein.
Pass `--ignore-stop-codons` if you would like to treat an `*` as any other amino acid.

```
msstitch trypsinize -i uniprot.fasta -o tryp_up.fasta --minlen 7 \
  --cutproline --miscleav 1 --nterm-meth-loss
```

Create an SQLite file with tryptic sequences for filtering out e.g. known-sequence data.
Options and behaviour as for `trypsinize`, plus `--insourcefrag` which builds a lookup with support for 
in-source fragmented peptides that have lost some N-terminal residues. If one instead of 
in-source fragmentation only wishes to include protein N-term methionine loss, use `--nterm-meth-loss`

```
msstitch storeseq -i canonical.fa -o tryptic.sqlite --cutproline --minlen 7 \
  --miscleav 1 --insourcefrag
```

Filter a percolator output file, or a PSM file using the created SQLite,
removing sequences
that match those stored in the SQLite. The below also removes sequences in the 
sample which are deamidated (i.e. D -> N), and sequences that have lost at most
2 N-terminal amino acids due to in-source fragmentation (DB must have been 
built with support for that).

```
# Percolator:
msstitch filterperco -i perco.xml --dbfile tryptic.sqlite \
  --insourcefrag 2 --deamidate -o filtered.xml

# PSM file:
msstitch seqfilt -i psms.txt --dbfile tryptic.sqlite \
  --insourcefrag 2 --deamidate -o filtered.psms.txt
```

Now that you have filtered percolator, the svm-score populations will have changed,
and you may want to rerun its `qvality` component for the PSM and peptide score populations
to obtain fresh FDRs. To put those into the PSM table, you can use:

```
msstitch perco2psm -i psms.txt \
  --perco filtered.xml --mzid psms1.mzIdentML \
  --qvalitypsms qvality_psms.txt --qvalitypeps qvality_peptides.txt \
  --filtpsm 0.01 --filtpep 0.01
```
Note that the qvality PSMs and peptides files containing svm-scores and FDR, have to
be made of the filtered.xml percolator output, as their content will be zipped, thus
each line in the qvality PSMs file represents a PSM present in percolator XML, 
and vice versa. The same goes for peptides.

Create an SQLite file with full-protein sequences for filtering any peptide of 
a minimum length specified that matches to those. Stop codons are also here treated
as separating peptides. Slower than filtering tryptic sequences but more comprehensive:

```
msstitch storeseq -i canonical.fa -o proteins.sqlite --fullprotein --minlen 7
```

Filter a percolator output or PSM file on protein sequences using the SQLite, removing 
sequences in sample which match to anywhere in the protein. Sequences may be 
deamidated, and minimum length parameter must match the one the database is 
built with.

```
# Percolator
msstitch filterperco -i perco.xml --dbfile proteins.sqlite \
  --fullprotein --deamidate --minlen 7 -o filtered.xml

# PSM file:
msstitch seqfilt -i psms.txt --dbfile proteins.sqlite \
  --fullprotein --deamidate --minlen 7 -o filtered.psms.txt
```

Using a similar DB from `storeseq` with `--map-accessions`, we can add a column in a PSM file which contains
sequence-matches from a fasta file, so any peptide matching these sequences can
get annotated with the fasta ID for that sequences as provided in e.g. `external.fa`:

```
msstitch storeseq -i external.fa -o tryptic.sqlite --cutproline --minlen 7 \
  --miscleav 1 --insourcefrag --map-accessions

# Or use a full protein DB to match non-tryptic peptides:
msstitch storeseq -i external.fa -o fullprotein.sqlite --fullprotein \
  --minlen 7 --map-accessions

# Now add the annotations
msstitch seqmatch -i psms.txt --dbfile tryptic.sqlite --matchcolname COLUMN_NAME \
  --insourcefrag 3 --deamidate -o annotated_psms.txt

# For the fullprotein DB
msstitch seqmatch -i psms.txt --dbfile fullprotein.sqlite --matchcolname ANOTHER_COLUMN_NAME \
  --minlen 7 --fullprotein --enforce-tryptic -o annotated_psms.txt
```

Split a percolator file with PSMs and peptides into files with specific protein headers.
The below will split perco.xml and output two files: perco.xml_h0.xml, containing all
PSMs/peptides that have at least one mapping to a `novel_p` protein, and perco.xml_h1.xml,
which will contain all PSMs/peptides mapping to either `lncRNA` or `intergenic` proteins,
but which will __NOT__ contain PSMs/peptides mapping to also either/or `ENSP` and `sp`. 
More files, _h2.xml etc can be created by adding more headers.

```
msstitch splitperco -i perco.xml --protheaders "novel_p" "known:ENSP;sp|novel:lncRNA;intergenic"`
```

Remove duplicate peptides from percolator output (if youve introduced these somehow)
by only retaining the best peptides:

```
msstitch dedupperco -i perco.xml -o dedup_peptides.xml
```

Remove duplicate PSMs from the output if you've introduced those, for really edge cases 
like removing identical bare peptides where the only difference is 
a modification position.
```
msstitch deduppsms -i psms.txt --peptidecolpattern 'Peptide' -o dedup_psms.txt
```

Create an isobaric ratio table median-summarizing the PSMs by any column number 
you want in a PSM table. E.g. you have added a column with exons. The following 
uses average of two channels as denominator, outputs a new table with first column
the features found in column nr.20 of the PSM table:

```
msstitch isosummarize -i psm_table.txt --featcol 20 \
  --isobquantcolpattern tmt10plex --denompatterns 126 127C
```

We can also use this command to create a table multi-mapping PSMs are counted towards
identification and quant of all its mappings, e.g. proteins separated by `;`. This is
not recommended for conventional protein table construction, use at your own risk
in cases that benefit from this behaviour.

```
msstitch isosummarize -i psm_table.txt --featcol 20 \
  --isobquantcolpattern tmt10plex --denompatterns 126 127C \
  --split-multi-entries
```





Re-use an earlier PSM table and add PSMs from searched spectra files of a new or
re-searched sample. Saves time so you won't have to re-search all the spectra 
in case of a big analysis. In the example below, new PSMs are the result of a 
sample set that has been re-searched, (e.g. when MS reruns are done in case of 
bad spectra), so we delete the existing sample set before continuing. 
Protein grouping is done after regenerating the PSM table, to illustrate you 
can do protein grouping on the entire table
instead of only on the sample set. Since the new table is the one which supplies
the header, the columns not supplied in the command (here protein groups) will
be removed from the final result. This function assumes all PSMs presented are
in the same order in the table, so they should not have been inserted in parallel,
safest is to not generate the lookup table by hand.

```
msstitch deletesets -i old_psmtable.txt -o cleaned_psmtable.txt \
    --dbfile db.sqlite --setnames bad_set
msstitch psmtable -i rerun_target.tsv --oldpsms cleaned_psmtable.txt \
   -o new_almost_done_psmtable.txt --fasta uniprot.fasta \
  --dbfile db.sqlite --addmiscleav --addbioset --ms1quant --isobaric
msstitch psmtable -i new_almost_done_psmtable.txt -o new_target_psms.txt \
  --proteingroup
```

It is also possible to only pass a PSM table to `deletesets`:

```
msstitch deletesets -i old_psmtable.txt -o cleaned_psmtable.txt \
  --setnames bad_set
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/lehtiolab/msstitch",
    "name": "msstitch",
    "maintainer": "Jorrit Boekel",
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": "jorrit.boekel@scilifelab.se",
    "keywords": "mass spectrometry, proteomics, processing",
    "author": "Jorrit Boekel",
    "author_email": "jorrit.boekel@scilifelab.se",
    "download_url": "https://files.pythonhosted.org/packages/43/f2/0f41bd1068855bcbc55eb1a1bf702c5a3a566639dc5efff8a5bbdf74a83a/msstitch-3.17.tar.gz",
    "platform": null,
    "description": "# msstitch - MS proteomics post-processing utilities\n\nShotgun proteomics has a number of bioinformatic tools available for identification \nand quantification of peptides, and the subsequent protein inference. `msstitch` is a \ntool to integrate a number of these tools, generating ready to use result files. Supported\nformats are currently:\n\n- mzML files\n- Fasta peptides\n- MSGF+ PSM tables (and mzIdentML)\n- Sage PSM tables (not for quant yet)\n- Percolator output\n- OpenMS consensusXML isobaric quant\n- Hardklor/Kronik MS1 quant output\n- Dinosaur MS1 quant output\n\nIf you need support for another format, there is limited time but infinite gratitude :)\n\n## Usage\n\n### Storing data\nAn example command flow would first store mzML spectra data in an SQLite file:\n\n```\nmsstitch storespectra --spectra file1.mzML file2.mzML \\\n  --setnames sampleset1 sampleset2 -o db.sqlite\n```\n\nOr, to add spectra to an existing SQLite lookup:\n\n```\nmsstitch storespectra --dbfile lookup.sqlite --spectra file3.mzML file4.mzML \\\n  --setnames sampleset2 sampleset3\n```\n\nThen store quantification data from dinosaur (MS1 precursor quant) and isobaric \nquantification (including precursor purities, but use centroided MS1 for this)\nfrom OpenMS together with the spectra. MS1 precursor quant features will be aligned\nwith MS2 spectra by selecting the best m/z match (an m/z window of 2x m/z \ntolerance), from all features within the retention time window of 2x RT tolerance.\n--rttol is specified in seconds.\n\n```\nmsstitch storequant --dbfile db.sqlite --spectra file1.mzML file2.mzML \\\n  --mztol 10 --mztoltype ppm --rttol 10 \\\n  --dinosaur file1.dinosaur file2.dinosaur \\\n  --isobaric file1.consensusXML file2.consensusXML\n```\n\nWhen using Hardklor/Kronik instead of Dinosaur, you can instead use:\n\n```\nmsstitch storequant --dbfile db.sqlite --spectra file1.mzML file2.mzML \\\n  --kronik file1.kronik file2.kronik \\\n  --isobaric file1.consensusXML file2.consensusXML\n```\n\nFor both Dinosaur and Kronik, the MS1 peak sum is used which theoretically would be more correct\nwhen having differently shaped envelopes. If you'd rather use the envelope apex, pass `--apex`\nin the above command.\n\n\n### Handling MS search engines\nCreate a decoy database where peptides are reversed between tryptic residues. Decoy\npeptides will be shuffled if they match a target sequence, but if they after shuffling\na number of times (max as in `--maxshuffle`) still match a target sequence, they will \nbe removed. To avoid removal (e.g. for keeping database sizes identical), pass `--keep-target'.\n\n```\nmsstitch makedecoy uniprot.fasta -o decoy.fasta --scramble tryp_rev --maxshuffle 10\n```\n\nOr without even trying to shuffle peptide sequences that match to the target DB:\n\n```\nmsstitch makedecoy uniprot.fasta -o decoy.fasta --scramble tryp_rev --ignore-target-hits\n```\n\n\nAfter running two samples of MSGF and percolator, we can start making \na more proper set of PSM tables by adding percolator data and filtering\non FDR. You need percolator data, a PSM table and a matching mzIdentML file.\nThe PSM table / mzIdentML should not have been filtered differently, as each PSM\nis expected to be a single item in the mzIdentML file and vice versa. PSMs not found\nin the percolator data are not output. The following adds percolator svm-score,\nq-value (FDR), and posterior error as columns to the PSM table:\n\n```\n# Add percolator data, filter 0.01 FDR\nmsstitch perco2psm -i psms1.txt \\\n  --perco percolator1.xml --mzid psms1.mzIdentML \\\n  --filtpsm 0.01 --filtpep 0.01\nmsstitch perco2psm -i psms2.txt \\\n  --perco percolator2.xml --mzid psms2.mzIdentML \\\n  --filtpsm 0.01 --filtpep 0.01\n# Combine the two sets and split to a target and decoy file\nmsstitch concat -i psms1.txt psms2.txt -o allpsms.txt\nmsstitch split -i allpsms.txt --splitcol TD\n```\n\nThis also adds ExplainedIonCurrentRatio, which is parsed from the mzIdentML file as delivered by MSGF+.\n\nIf you're running Sage instead of MSGFplus, you will not need the mzIdentML, so the `perco2psm`\ncommands above can be written e.g.:\n```\nmsstitch perco2psm -i psms1.sage.txt -o psms.sage.percolated.txt \\\n  --perco percolator1.xml \\\n  --filtpsm 0.01 --filtpep 0.01\n... etc\n```\n\nNow refine the PSM tables, using the earlier created SQLite DB, \nadding more information (sample name, MS1 precursor quant,\nisobaric quant, proteingroups, genes). In this example we set isobaric\nquantitation intensities to NA if the precursor purity measured is <0.3.\n\n```\ncp db.sqlite decoy_db.sqlite\nmsstitch psmtable -i target.tsv -o target_psmtable.txt --fasta uniprot.fasta \\\n  --dbfile db.sqlite --addmiscleav --addbioset --ms1quant --isobaric \\\n  --min-precursor-purity 0.3 --proteingroup --genes\nmsstitch psmtable -i decoy.tsv -o decoy_psmtable.txt --fasta decoy.fasta \\\n  --dbfile decoy_db.sqlite --proteingroup --genes --addbioset\n```\n\nWhen using Sage results, you will have missed cleavages out of the box (and retention time, ion mobility)! So leave\nout `--addmiscleav`. Otherwise msstitch will accept the Sage file format as well.\n\n\nIf necessary (e.g. multiple TMT sample sets), split the table before making\nprotein/peptide tables:\n\n```\nmsstitch split -i target_psmtable.txt --splitcol bioset\n```\n\n### Summarizing PSMs\nCreate a peptide table, with summarized median isobaric quant ratios,\nhighest MS1 intensity PSM as the peptide MS1 quant intensity, and an additional\nlinear-modeled q-value column:\n\n```\nmsstitch peptides -i set1_target_psms.txt -o set1_target_peptides.txt \\\n  --scorecolpattern svm --modelqvals --ms1quant \\\n  --isobquantcolpattern tmt10plex --denompatterns _126 _127C \n```\n\nThe same peptide table can also be made using median sweeping, which takes the median\nintensity channel for each PSM as a denominator. Here is also exemplified how\nto do channel-median centering of the ratios to normalize and use log2 intensity\nvalues before calculating ratios:\n\n```\nmsstitch peptides -i set1_target_psms.txt -o set1_target_peptides.txt \\\n  --scorecolpattern svm --modelqvals --ms1quant \\\n  --isobquantcolpattern tmt10plex --mediansweep --logisoquant --median-normalize\n```\n\nOr, if you only want the median PSM intensity per peptide summarized, use `--medianintensity`\nHere is also illustrated that you can use the --keep-psms-na-quant flag to NOT\nthrow out the PSMs which have isobaric intensity below the mininum intensity \n(default 0, here 100) IN ANY channel:\n\n```\nmsstitch peptides -i set1_target_psms.txt -o set1_target_peptides.txt \\\n  --scorecolpattern svm --modelqvals --ms1quant \\\n  --isobquantcolpattern tmt10plex --medianintensity \\\n  --minint 100 --keep-psms-na-quant\n```\n\nIn case of analyzing peptides with PTMs, you may want to process a subset of\nPSMs (those with the PTMs) to create a separate peptide table from. In\nthat case, there is an option to divide (or subtract for log2 data) isobaric \nquant values to a protein (or gene) table from a non-PTM search, often done on\nanother non-enriched sample. This allows discerning PTM-peptide \ndifferential expression from its respective protein differential expression in the sample.\nThe protein/gene table should obviously contain the same samples/channel,\nand for example be from an `msstitch proteins` or `msstitch isosummarize` command,\nusing `--median-normalize` to get median centered ratios for the proteins or genes.\nAfter that, create a peptide table from PTM-PSMs as follows:\n\n```\nmsstitch peptides -i set1_ptm_psms.txt -o set1_ptm_peptides.txt \\\n  --scorecolpattern svm --isobquantcolpattern tmt10plex --denompatterns _126 _127C \\\n  --logisoquant --totalproteome set1_proteins.txt\n```\n\nFor proper normalizing of this table (as it would otherwise be impacted by\nsample differences per channel), you may want to median-center. In the case of \nsmall and possibly noisy PTM tables, it can be advisable to use another table\nfrom a global search (or e.g. the full peptide or protein table from the PTM search)\nfor determining channel medians. This is possible by specifying the above command\nplus:\n\n```\n  --median-normalize --normalization-factors-table /path/to/set1_global_proteins.txt\n```\n\nTo create a protein table, with isobaric quantification as for peptides, the\naverage of the top-3 highest intensity peptides for MS1 quantification.\nFor all of these, summarizing isobaric PSM data to peptide, protein, gene features \nis done using medians of log2 PSM quantification values per feature (e.g. a protein). If you'd\nrather use averages, use `--summarize-average` as below, where we also show log2\ntransformation of intensities before summarizing and subsequent median-centering.\nFDR (q-values) for the protein table is here calculated 'classically', by ranking\ntarget and decoy proteins, taking their best scoring peptide's q-value as a score.\n`--logscore` is used since q-value is used (higher is better when comparing peptides).\n\n```\nmsstitch proteins -i set1_target_peptides.txt --decoyfn set1_decoy_peptides \\\n  --psmtable set1_target_psms.txt \\\n  -o set1_proteins.txt \\\n  --scorecolpattern '^q-value' --logscore \\\n  --ms1quant \\\n  --isobquantcolpattern tmt10plex --denompatterns _126 _127C \\\n  --summarize-average --logisoquant --median-normalize\n```\n\nOr the analogous process for genes, using median sweeping to get intensity ratios instead of denominators:\nAs for peptides above, one can use the --keep-psms-na-quant flag to NOT\nthrow out the PSMs which have isobaric intensity below the mininum intensity\n(default 0 used here) in any channel. Here we use picked FDR (Savitski et al. 2015 MCP)\nto define q-values for the genes, for which you need a target and decoy fasta to form pairs.\nThe fasta files need to be analogous, i.e. the same order for matching T/D pair genes.\n\n```\nmsstitch genes -i set1_target_peptides.txt --decoyfn set1_decoy_peptides \\\n  --psmtable set1_target_psms.txt \\\n  -o set1_genes.txt \\\n  --scorecolpattern '^q-value' --logscore \\\n  --fdrtype picked --targetfasta tdb.fa --decoyfasta decoy.fa \\\n  --ms1quant \\\n  --isobquantcolpattern tmt10plex --mediansweep \\\n  --keep-psms-na-quant\n```\n\nOr when there are ENSEMBL entries in the fasta search database, even for ENSG, here with summarized median PSM intensity per ENSG:\n\n```\nmsstitch ensg -i set1_target_peptides.txt --decoyfn set1_decoy_peptides \\\n  --psmtable set1_target_psms.txt \\\n  -o set1_ensg.txt \\\n  --scorecolpattern '^q-value' --logscore \\\n  --ms1quant \\\n  --isobquantcolpattern tmt10plex --medianintensity \\\n  --median-normalize\n```\n\nFinally, merge multiple sets of proteins (or genes/ENSG) into a single output.\nHere we set an cutoff so that features with FDR > 0.01 are set to NA for the \nrespective sample set.\n\n```\nmsstitch merge -i set1_proteins.txt set2_proteins.txt -o protein_table.txt \\\n  --setnames sampleset1 sampleset2 \\\n  --dbfile db.sqlite \\\n  --fdrcolpattern 'q-value' --mergecutoff 0.01 \\\n   --ms1quantcolpattern area --isobquantcolpattern plex\n```\n\n\n### Some other useful commands\nTrypsinize a fasta file (minimum retained peptide length, do cut K/RP, allow 1 missed cleavage, \nalso generate peptides from protein N-term with methionine loss).\nThis also treats stop codons as separators and will not read through them if they are in a protein.\nPass `--ignore-stop-codons` if you would like to treat an `*` as any other amino acid.\n\n```\nmsstitch trypsinize -i uniprot.fasta -o tryp_up.fasta --minlen 7 \\\n  --cutproline --miscleav 1 --nterm-meth-loss\n```\n\nCreate an SQLite file with tryptic sequences for filtering out e.g. known-sequence data.\nOptions and behaviour as for `trypsinize`, plus `--insourcefrag` which builds a lookup with support for \nin-source fragmented peptides that have lost some N-terminal residues. If one instead of \nin-source fragmentation only wishes to include protein N-term methionine loss, use `--nterm-meth-loss`\n\n```\nmsstitch storeseq -i canonical.fa -o tryptic.sqlite --cutproline --minlen 7 \\\n  --miscleav 1 --insourcefrag\n```\n\nFilter a percolator output file, or a PSM file using the created SQLite,\nremoving sequences\nthat match those stored in the SQLite. The below also removes sequences in the \nsample which are deamidated (i.e. D -> N), and sequences that have lost at most\n2 N-terminal amino acids due to in-source fragmentation (DB must have been \nbuilt with support for that).\n\n```\n# Percolator:\nmsstitch filterperco -i perco.xml --dbfile tryptic.sqlite \\\n  --insourcefrag 2 --deamidate -o filtered.xml\n\n# PSM file:\nmsstitch seqfilt -i psms.txt --dbfile tryptic.sqlite \\\n  --insourcefrag 2 --deamidate -o filtered.psms.txt\n```\n\nNow that you have filtered percolator, the svm-score populations will have changed,\nand you may want to rerun its `qvality` component for the PSM and peptide score populations\nto obtain fresh FDRs. To put those into the PSM table, you can use:\n\n```\nmsstitch perco2psm -i psms.txt \\\n  --perco filtered.xml --mzid psms1.mzIdentML \\\n  --qvalitypsms qvality_psms.txt --qvalitypeps qvality_peptides.txt \\\n  --filtpsm 0.01 --filtpep 0.01\n```\nNote that the qvality PSMs and peptides files containing svm-scores and FDR, have to\nbe made of the filtered.xml percolator output, as their content will be zipped, thus\neach line in the qvality PSMs file represents a PSM present in percolator XML, \nand vice versa. The same goes for peptides.\n\nCreate an SQLite file with full-protein sequences for filtering any peptide of \na minimum length specified that matches to those. Stop codons are also here treated\nas separating peptides. Slower than filtering tryptic sequences but more comprehensive:\n\n```\nmsstitch storeseq -i canonical.fa -o proteins.sqlite --fullprotein --minlen 7\n```\n\nFilter a percolator output or PSM file on protein sequences using the SQLite, removing \nsequences in sample which match to anywhere in the protein. Sequences may be \ndeamidated, and minimum length parameter must match the one the database is \nbuilt with.\n\n```\n# Percolator\nmsstitch filterperco -i perco.xml --dbfile proteins.sqlite \\\n  --fullprotein --deamidate --minlen 7 -o filtered.xml\n\n# PSM file:\nmsstitch seqfilt -i psms.txt --dbfile proteins.sqlite \\\n  --fullprotein --deamidate --minlen 7 -o filtered.psms.txt\n```\n\nUsing a similar DB from `storeseq` with `--map-accessions`, we can add a column in a PSM file which contains\nsequence-matches from a fasta file, so any peptide matching these sequences can\nget annotated with the fasta ID for that sequences as provided in e.g. `external.fa`:\n\n```\nmsstitch storeseq -i external.fa -o tryptic.sqlite --cutproline --minlen 7 \\\n  --miscleav 1 --insourcefrag --map-accessions\n\n# Or use a full protein DB to match non-tryptic peptides:\nmsstitch storeseq -i external.fa -o fullprotein.sqlite --fullprotein \\\n  --minlen 7 --map-accessions\n\n# Now add the annotations\nmsstitch seqmatch -i psms.txt --dbfile tryptic.sqlite --matchcolname COLUMN_NAME \\\n  --insourcefrag 3 --deamidate -o annotated_psms.txt\n\n# For the fullprotein DB\nmsstitch seqmatch -i psms.txt --dbfile fullprotein.sqlite --matchcolname ANOTHER_COLUMN_NAME \\\n  --minlen 7 --fullprotein --enforce-tryptic -o annotated_psms.txt\n```\n\nSplit a percolator file with PSMs and peptides into files with specific protein headers.\nThe below will split perco.xml and output two files: perco.xml_h0.xml, containing all\nPSMs/peptides that have at least one mapping to a `novel_p` protein, and perco.xml_h1.xml,\nwhich will contain all PSMs/peptides mapping to either `lncRNA` or `intergenic` proteins,\nbut which will __NOT__ contain PSMs/peptides mapping to also either/or `ENSP` and `sp`. \nMore files, _h2.xml etc can be created by adding more headers.\n\n```\nmsstitch splitperco -i perco.xml --protheaders \"novel_p\" \"known:ENSP;sp|novel:lncRNA;intergenic\"`\n```\n\nRemove duplicate peptides from percolator output (if youve introduced these somehow)\nby only retaining the best peptides:\n\n```\nmsstitch dedupperco -i perco.xml -o dedup_peptides.xml\n```\n\nRemove duplicate PSMs from the output if you've introduced those, for really edge cases \nlike removing identical bare peptides where the only difference is \na modification position.\n```\nmsstitch deduppsms -i psms.txt --peptidecolpattern 'Peptide' -o dedup_psms.txt\n```\n\nCreate an isobaric ratio table median-summarizing the PSMs by any column number \nyou want in a PSM table. E.g. you have added a column with exons. The following \nuses average of two channels as denominator, outputs a new table with first column\nthe features found in column nr.20 of the PSM table:\n\n```\nmsstitch isosummarize -i psm_table.txt --featcol 20 \\\n  --isobquantcolpattern tmt10plex --denompatterns 126 127C\n```\n\nWe can also use this command to create a table multi-mapping PSMs are counted towards\nidentification and quant of all its mappings, e.g. proteins separated by `;`. This is\nnot recommended for conventional protein table construction, use at your own risk\nin cases that benefit from this behaviour.\n\n```\nmsstitch isosummarize -i psm_table.txt --featcol 20 \\\n  --isobquantcolpattern tmt10plex --denompatterns 126 127C \\\n  --split-multi-entries\n```\n\n\n\n\n\nRe-use an earlier PSM table and add PSMs from searched spectra files of a new or\nre-searched sample. Saves time so you won't have to re-search all the spectra \nin case of a big analysis. In the example below, new PSMs are the result of a \nsample set that has been re-searched, (e.g. when MS reruns are done in case of \nbad spectra), so we delete the existing sample set before continuing. \nProtein grouping is done after regenerating the PSM table, to illustrate you \ncan do protein grouping on the entire table\ninstead of only on the sample set. Since the new table is the one which supplies\nthe header, the columns not supplied in the command (here protein groups) will\nbe removed from the final result. This function assumes all PSMs presented are\nin the same order in the table, so they should not have been inserted in parallel,\nsafest is to not generate the lookup table by hand.\n\n```\nmsstitch deletesets -i old_psmtable.txt -o cleaned_psmtable.txt \\\n    --dbfile db.sqlite --setnames bad_set\nmsstitch psmtable -i rerun_target.tsv --oldpsms cleaned_psmtable.txt \\\n   -o new_almost_done_psmtable.txt --fasta uniprot.fasta \\\n  --dbfile db.sqlite --addmiscleav --addbioset --ms1quant --isobaric\nmsstitch psmtable -i new_almost_done_psmtable.txt -o new_target_psms.txt \\\n  --proteingroup\n```\n\nIt is also possible to only pass a PSM table to `deletesets`:\n\n```\nmsstitch deletesets -i old_psmtable.txt -o cleaned_psmtable.txt \\\n  --setnames bad_set\n```\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "MS proteomics post processing utilities",
    "version": "3.17",
    "project_urls": {
        "Homepage": "https://github.com/lehtiolab/msstitch"
    },
    "split_keywords": [
        "mass spectrometry",
        " proteomics",
        " processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dfc5faa86f86d981743832740afc035820772d5042f56cd120d158d0af36d616",
                "md5": "575df9950526bb4519d06afc0091fb1f",
                "sha256": "bdae0582c1e0ce5ade05550a94efec7a1f430fc5cdbcb9eb458703ca52033c5c"
            },
            "downloads": -1,
            "filename": "msstitch-3.17-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "575df9950526bb4519d06afc0091fb1f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 98574,
            "upload_time": "2025-02-01T10:04:47",
            "upload_time_iso_8601": "2025-02-01T10:04:47.782577Z",
            "url": "https://files.pythonhosted.org/packages/df/c5/faa86f86d981743832740afc035820772d5042f56cd120d158d0af36d616/msstitch-3.17-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "43f20f41bd1068855bcbc55eb1a1bf702c5a3a566639dc5efff8a5bbdf74a83a",
                "md5": "90bb36f10ba94cf6b1f90e5f414b2130",
                "sha256": "b3676265e7b1f8f04d8237eb20de6b2e12db57a35d469326c0cc71e99b7b8764"
            },
            "downloads": -1,
            "filename": "msstitch-3.17.tar.gz",
            "has_sig": false,
            "md5_digest": "90bb36f10ba94cf6b1f90e5f414b2130",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7433227,
            "upload_time": "2025-02-01T10:04:50",
            "upload_time_iso_8601": "2025-02-01T10:04:50.861985Z",
            "url": "https://files.pythonhosted.org/packages/43/f2/0f41bd1068855bcbc55eb1a1bf702c5a3a566639dc5efff8a5bbdf74a83a/msstitch-3.17.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-01 10:04:50",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lehtiolab",
    "github_project": "msstitch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "msstitch"
}

Jorrit Boekel