Cleavage method (enzymatic digestion, chemical cleavage)
Trypsin cleaves C-terminal to Arg and Lys unless the next residue is Pro
Specific cleavage, unspecific cleavage (e.g., protease contamination),
missed cleavage (e.g., charged residues in proximity)
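A minimal sketch of an in silico tryptic digest, to make the cleavage rule and the missed-cleavage setting concrete. The function name `tryptic_digest` and the parameter `missed_cleavages` are made up for these notes; real search engines expose the same idea under their own option names.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Cleave after K/R unless the next residue is P; allow missed cleavages."""
    # Step 1: cut at every allowed site to get the fully cleaved fragments.
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder

    # Step 2: join up to `missed_cleavages` neighbouring fragments.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


print(tryptic_digest("AKRPGKLM"))
print(tryptic_digest("AKRPGKLM", missed_cleavages=1))
```

Note that the R in "RP" is not a cleavage site, so "RPGK" stays in one piece even with zero missed cleavages.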
Modifications
Fixed modifications: consistent change in mass of an amino acid
Variable modifications: phosphorylation, glycosylation, etc.
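A small sketch of how fixed and variable modifications affect the candidate masses of a peptide. Assumptions for illustration only: carbamidomethylation of Cys as the fixed modification, phosphorylation of S/T/Y as the variable one, a residue table covering just a few amino acids, and the hypothetical helper name `candidate_masses`.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (subset, enough for the example).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "C": 103.00919, "K": 128.09496, "Y": 163.06333,
}
WATER = 18.01056
FIXED = {"C": 57.02146}   # fixed: added to every Cys (carbamidomethyl)
PHOSPHO = 79.96633        # variable: present or absent on each S/T/Y


def candidate_masses(peptide, max_variable=2):
    """Return (modified_positions, mass) for every variable-mod combination."""
    base = WATER + sum(RESIDUE_MASS[aa] + FIXED.get(aa, 0.0) for aa in peptide)
    sites = [i for i, aa in enumerate(peptide) if aa in "STY"]
    results = []
    for n in range(min(max_variable, len(sites)) + 1):
        for chosen in combinations(sites, n):
            results.append((chosen, base + n * PHOSPHO))
    return results


for positions, mass in candidate_masses("ACSTK"):
    print(positions, round(mass, 4))
```

The fixed modification changes one number per residue, while each variable site multiplies the number of candidates, which is why variable modifications enlarge the search space.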
Mass accuracy
Monoisotopic or average mass
Average mass is useful for low-resolution instruments
Often specified in ppm or Da
Global calibration vs local calibration vs internal standard calibration
Retention time alignment: adds a second dimension besides m/z alignment, potentially improving accuracy
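A short sketch of the two tolerance units. The function names `ppm_error` and `within_tolerance` are hypothetical; the formula is the usual definition of a parts-per-million mass error.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol, unit="ppm"):
    """Check a match against a tolerance given in ppm or Da."""
    if unit == "ppm":
        return abs(ppm_error(observed_mz, theoretical_mz)) <= tol
    if unit == "Da":
        return abs(observed_mz - theoretical_mz) <= tol
    raise ValueError("unit must be 'ppm' or 'Da'")


# The same 0.01 Da deviation is a different ppm error at different m/z.
print(ppm_error(500.01, 500.0))
print(ppm_error(2000.01, 2000.0))
```

A ppm tolerance scales with m/z, whereas a Da tolerance is an absolute window.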
Data storage and format
Two data types contained within raw MS data sets:
Numeric data: m/z, peak intensities, and retention times
Metadata: instrument and experiment settings
mzML: unified format built upon mzXML and mzData supported by the Proteomics Standards Initiative (PSI)
File size many-fold larger than the original proprietary vendor format
Slower read and write speeds
Metadata accurately and unambiguously annotated using PSI-MS controlled vocabulary
Stored in text-based XML format
Numeric data get converted to text strings with Base64 encoding
zlib compression possible before encoding
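The encoding chain above (binary numbers, optional zlib, then Base64 text) can be sketched with the standard library. This assumes 64-bit little-endian floats; the helper names `encode_array` and `decode_array` are made up for these notes and are not part of any mzML library.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian doubles, optionally zlib, then Base64."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse of encode_array."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


mz = [100.0 + 0.5 * i for i in range(1000)]
plain = encode_array(mz, compress=False)
packed = encode_array(mz, compress=True)
print(len(mz) * 8, len(plain), len(packed))  # raw bytes, Base64 chars, zlib+Base64 chars
```

Base64 alone inflates the data by about one third, which is part of why the text-based file is larger than the binary vendor file.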
Numpress: an encoding scheme to compress the binary numeric data before Base64 encoding
Truncation of numbers and rounding to integers
File size reduction by 61%
Relatively constant relative error
Appropriate choice of threshold can minimize effect on downstream analysis
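A simplified illustration of lossy fixed-point encoding with a roughly constant relative error. This is not the actual Numpress algorithm, only a sketch of the idea of rounding a log-transformed value to an integer; the `scale` value of 3000 and the function names are assumptions chosen for the example.

```python
import math


def encode_log_fixed(values, scale=3000.0):
    """Store round(ln(x + 1) * scale) as an integer (lossy)."""
    return [int(round(math.log(v + 1.0) * scale)) for v in values]


def decode_log_fixed(codes, scale=3000.0):
    """Invert the transform; the rounding error cannot be recovered."""
    return [math.exp(c / scale) - 1.0 for c in codes]


intensities = [150.0, 2.5e3, 4.0e5, 7.1e7]
codes = encode_log_fixed(intensities)
restored = decode_log_fixed(codes)
for original, value in zip(intensities, restored):
    print(original, abs(value - original) / original)
```

Because the rounding happens on the log scale, the relative error is bounded by about 0.5/scale regardless of the magnitude of the value. A larger scale gives a smaller error at the cost of larger integers.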
mz5: data format based on HDF5 (hierarchical data format version 5)
Binary format allowing complex data relationships and dependencies
50% file size reduction
Multiple data sets can be stored within a hierarchical group structure
A group is a container construct
Data sets are multidimensional arrays of data elements
Metadata stored as attributes (key-value pairs)
Large data sets are stored in chunks of variable size that can be cached for repeated access
Requires complete reimplementation of mzML where tags and numeric data have to be remapped to structures that mimic tables in a relational database
mzDB: lightweight SQLite relational database
2-dimensional data blocks
25% file size reduction
mzDB-access and pwiz-mzDB for DDA, and mzDB-Swath for DIA
Metadata are not compressed but stored in param_tree fields in XML format
Numeric data are also not compressed (can be compressed with SQLite)
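A toy sketch of storing spectra in a SQLite database, to show why a relational layout makes range queries convenient. The table and column layout here is hypothetical and much simpler than the real mzDB schema; only the idea of an XML `param_tree` text field next to uncompressed numeric blobs is taken from the notes above.

```python
import sqlite3
import struct


def pack(values):
    """Uncompressed little-endian doubles, stored as a BLOB."""
    return struct.pack("<%dd" % len(values), *values)


def unpack(blob):
    return list(struct.unpack("<%dd" % (len(blob) // 8), blob))


def create_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spectrum ("
        " id INTEGER PRIMARY KEY, ms_level INTEGER, rt REAL,"
        " param_tree TEXT, mz BLOB, intensity BLOB)"
    )
    return conn


def add_spectrum(conn, ms_level, rt, param_tree, mz, intensity):
    conn.execute(
        "INSERT INTO spectrum (ms_level, rt, param_tree, mz, intensity)"
        " VALUES (?, ?, ?, ?, ?)",
        (ms_level, rt, param_tree, pack(mz), pack(intensity)),
    )


def spectra_in_rt_window(conn, rt_min, rt_max, ms_level=1):
    """Fetch spectra of one MS level inside a retention time window."""
    rows = conn.execute(
        "SELECT rt, mz, intensity FROM spectrum"
        " WHERE ms_level = ? AND rt BETWEEN ? AND ? ORDER BY rt",
        (ms_level, rt_min, rt_max),
    )
    return [(rt, unpack(mz), unpack(inten)) for rt, mz, inten in rows]
```

Usage: create the store with `create_store()`, add spectra with `add_spectrum(...)`, then call `spectra_in_rt_window(conn, 5.0, 15.0)` to pull only the spectra in that window without parsing the whole file.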
imzML: similar to mzML for easier visualization of the data using third-party software
Data parsing
Search algorithms
Peptide sequence tag by Mann et al. (1994)
Cross-correlation function for comparing theoretical and observed spectra by Eng et al. (SEQUEST first paper 1994)
Iterative search approach by Jensen et al. (1997) for peptide mass mapping to sequence databases
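A simplified sketch of a cross-correlation score between an observed and a theoretical spectrum, in the spirit of the approach by Eng et al. It is not the SEQUEST implementation: the unit-width binning, the 0/1 peak intensities, the offset window, and all function names are assumptions for illustration.

```python
def binned(peaks_mz, n_bins, bin_width=1.0):
    """Turn a peak list into a 0/1 vector of m/z bins."""
    vector = [0.0] * n_bins
    for mz in peaks_mz:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            vector[index] = 1.0
    return vector


def cross_correlation(observed, theoretical, offset):
    """Dot product of observed with theoretical shifted by `offset` bins."""
    total = 0.0
    for i, value in enumerate(theoretical):
        j = i + offset
        if 0 <= j < len(observed):
            total += observed[j] * value
    return total


def xcorr_score(observed, theoretical, max_offset=75):
    """Correlation at zero offset minus the mean correlation over all offsets."""
    offsets = range(-max_offset, max_offset + 1)
    background = sum(
        cross_correlation(observed, theoretical, t) for t in offsets
    ) / len(offsets)
    return cross_correlation(observed, theoretical, 0) - background


observed = binned([10.0, 20.0, 30.0], n_bins=50)
good = binned([10.0, 20.0, 30.0], n_bins=50)
shifted = binned([11.0, 21.0, 31.0], n_bins=50)
print(xcorr_score(observed, good, max_offset=5))
print(xcorr_score(observed, shifted, max_offset=5))
```

Subtracting the background term penalizes candidates that only match when the spectra are shifted against each other, so the correctly aligned candidate scores higher.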
Many MS databases exist but are commercial. How to validate the search results?
Log likelihood ratio
Hidden Markov Models for scoring peptide sequence matches to observed MS/MS spectra (Ref 31 of 1)
Receiver Operating Characteristics Plot
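A minimal sketch of how a receiver operating characteristic curve can be computed for search results whose correctness is known (for example from a reference data set). The function names are hypothetical, and the sketch assumes distinct scores and at least one correct and one incorrect match.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    `labels[i]` is True if match i is correct. The threshold is swept
    from the highest score downwards.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, correct in pairs if correct)
    n_neg = len(pairs) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, correct in pairs:
        if correct:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6]
print(auc(roc_points(scores, [True, True, False, False])))
print(auc(roc_points(scores, [True, False, True, False])))
```

An area close to 1 means the score separates correct from incorrect matches well; an area near 0.5 means it does no better than chance.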
References
Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007) DOI: 10.1002/pmic.200700116
Chen et al., Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis (2020) DOI: 10.3390/ijms21082873
Bhamber et al., mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements, J. Proteome Res. 2021, 20, 172−183