The centroiding process assigns a single m/z value to each peak
The assigned m/z of the centroided peak may be different from the real m/z of the ion – mass error tolerances
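As a rough illustration of the two points above, the sketch below computes an intensity-weighted centroid for one profile peak and compares it to a theoretical m/z using a ppm tolerance. The function names, the example data points, and the 10 ppm default are assumptions chosen for illustration, not values from any particular instrument or software.

```python
def centroid(mz_values, intensities):
    """Intensity-weighted mean m/z of the data points of one profile peak."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("peak has no positive intensity")
    return sum(m * i for m, i in zip(mz_values, intensities)) / total


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=10.0):
    """True if the observed m/z lies within the ppm tolerance (assumed default)."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# Hypothetical profile data points for a single peak.
profile_mz = [500.248, 500.250, 500.252, 500.254, 500.256]
profile_int = [10.0, 60.0, 100.0, 60.0, 10.0]

peak_mz = centroid(profile_mz, profile_int)
error = ppm_error(peak_mz, 500.250)  # hypothetical theoretical m/z
print(round(peak_mz, 4), round(error, 2), within_tolerance(peak_mz, 500.250))
```

A search engine would typically apply such a tolerance window to both precursor and fragment m/z values, possibly with different widths.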
Detection range: usually 100 Da to a few thousand Da, as proteins are first enzymatically digested into peptides
Dynamic range: detect proteins and peptides with vast differences in relative abundance
Orbitrap typically 5-6 orders of magnitude, while peptide and protein concentrations can span > 10 orders of magnitude in a biological matrix
Sensitivity – the ability to detect low level protein and peptide signals
Ionization efficiency: peak intensity or ion current count is not directly proportional to a molecule’s abundance in the sample
Ion source used: ESI or MALDI
Mass analyzer: LIT, QQQ, TOF, FTICR, Orbitrap
Fragmentation method: CID/CAD, ETD, HCD, etc.
Proteolysis results in unanticipated cleavage products
Only a small number of peptides/proteins are recovered and identified
The protein inference problem – multiple proteins can share the same subset of peptides under proteolysis (peptide degeneracy)
Peptides are not detected
Issue during protein digestion
Lost on the column during LC
Unexpected PTMs on the peptides
Inefficient ionization
Inefficient fragmentation during MS
Data processing
Data analysis:
First, each MS/MS spectrum is searched against a database to identify a peptide sequence.
Then, the peptides are grouped together to identify the proteins.
Qualitative analysis: identification of proteins present in a complex biological matrix
Erroneous assignments
Data variations: proteins might not be reproducibly identified from the same sample?
The absence of evidence for the presence of a peptide is not necessarily evidence for the absence of the peptide
Throughput
Challenges in data processing and analysis
Data quality: low-quality data result in a high percentage of spectra being unmatched/unassigned
Noise: poor fragmentation of the peptide ions, poorly selected precursor ions, etc.
Data processing
Better peak picking algorithm
Deconvolution algorithms
Charge deconvolution: conversion of multiply charged ions to singly charged ions
Noise deconvolution
Isotope deconvolution
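A minimal sketch of charge deconvolution, assuming positive-mode ions that carry one proton per charge. The charge estimate from isotope spacing uses the approximate 13C/12C mass difference; all function names are hypothetical.

```python
PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def charge_from_isotope_spacing(mz_a, mz_b):
    """Estimate charge from two adjacent isotope peaks (spacing is about 1.00335 / z)."""
    spacing = abs(mz_b - mz_a)
    if spacing == 0:
        raise ValueError("peaks must have different m/z")
    return round(ISOTOPE_SPACING / spacing)


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass of an [M + zH]z+ ion."""
    return (mz - PROTON_MASS) * charge


def to_singly_charged(mz, charge):
    """m/z the same species would have as an [M + H]+ ion."""
    return neutral_mass(mz, charge) + PROTON_MASS


z = charge_from_isotope_spacing(500.5000, 501.0017)  # hypothetical isotope pair
mh = to_singly_charged(500.5000, z)
print(z, round(mh, 4))
```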
Large amount of input data with hundreds of thousands of MS/MS spectra generated per sample
Handling of large datasets: the data size exceeds what the software can scale to, often requiring the dataset to be divided, run in batches, and merged at the end
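The divide, batch, and merge pattern described above can be sketched as follows. The `search_batch` callable is a hypothetical stand-in for whatever per-batch search or processing step is used, and the batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_batches(spectra, search_batch, size=1000):
    """Run `search_batch` (hypothetical) on each batch and merge the results."""
    merged = []
    for chunk in batches(spectra, size):
        merged.extend(search_batch(chunk))
    return merged


print(list(batches(range(5), 2)))
```

Because `batches` consumes an iterator lazily, the full set of spectra never has to be held in memory at once, only the merged results.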
Data compression: no direct access to individual spectrum
Lossy vs lossless compression
Data storage and format
mzXML and mzData double to triple data file size compared to proprietary raw data (has mzML solved this?)
mzML stores its binary peak arrays base64-encoded – not easily human-readable
raw data is binary and can break XML structure
converting binary to text for readability will increase file size
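The size trade-off between binary and text encodings can be illustrated with a small sketch. It packs a hypothetical list of m/z values as 64-bit floats, base64-encodes the bytes so they are safe to embed in XML, and compares the sizes with a plain decimal text rendering. The values and the six-decimal text format are assumptions for illustration; real files differ in precision and compression settings.

```python
import base64
import struct
import zlib

# Hypothetical peak list: 1000 m/z values.
mz_values = [100.0 + 0.01 * i for i in range(1000)]

# Raw binary: 8 bytes per value (little-endian 64-bit floats).
raw = struct.pack("<%dd" % len(mz_values), *mz_values)

# Base64 turns arbitrary bytes into XML-safe ASCII at about 4/3 the size.
encoded = base64.b64encode(raw)

# Optional lossless compression before base64 encoding.
compressed = base64.b64encode(zlib.compress(raw))

# Human-readable decimal text with six decimals (an assumed format).
as_text = " ".join("%.6f" % v for v in mz_values).encode("ascii")

# Decoding restores the original values exactly (lossless).
decoded = list(struct.unpack("<%dd" % len(mz_values), base64.b64decode(encoded)))

print(len(raw), len(encoded), len(compressed), len(as_text))
```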
Database search method
Database needs to be error free
Peptide-matching algorithms:
Accurate simulation of the MS/MS spectrum of a peptide given a specific proteolytic digestion from the peptide sequence
Type of mass spectrometer
Annotated spectral library for easier comparison with observed spectra
Indexed relational database
Scoring function to accurately assess the match between a peptide and a given MS/MS spectrum
Accurate retention time (RT) prediction, with RT included in the scoring function, can help increase identification accuracy
Combining RT with m/z values from traditional mass fingerprinting method
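To make the peptide-matching steps above concrete, here is a minimal sketch: an in-silico tryptic digest (cleave after K or R unless followed by P), theoretical singly charged b and y ions, and a naive shared-peak-count score. The residue table is a subset, the 0.02 Da tolerance is an assumption, and the scoring function is deliberately simplistic; it is not the scoring scheme of any particular search engine.

```python
# Monoisotopic residue masses in Da (subset, for illustration only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Cleave C-terminal to K/R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def fragment_ions(peptide):
    """Theoretical singly charged b and y ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y


def shared_peak_count(observed_mz, peptide, tol=0.02):
    """Number of theoretical fragments matched by an observed peak."""
    b, y = fragment_ions(peptide)
    return sum(any(abs(o - t) <= tol for o in observed_mz) for t in b + y)


peptides = tryptic_digest("GAKPLRTE")      # hypothetical sequence
b_ions, y_ions = fragment_ions("GAK")
score = shared_peak_count([58.03, 147.11, 300.0], "GAK")
print(peptides, score)
```

Real scoring functions also weigh peak intensities, ion types, and the probability of random matches; the count here only shows the basic comparison.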
Certain modified peptides are not present in database
PTMs of peptides: especially for a variable PTM where a residue may or may not be modified resulting in exponential growth of search space
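The growth of the search space with variable modifications can be sketched by enumerating every combination of modified sites. The example assumes phosphorylation-like sites on S, T, and Y; the function name and the `max_mods` cap are hypothetical, though capping the number of modifications per peptide is a common way to limit the explosion.

```python
from itertools import combinations


def modified_forms(peptide, target_residues="STY", max_mods=None):
    """Return every set of modified positions for a variable modification."""
    sites = [i for i, aa in enumerate(peptide) if aa in target_residues]
    limit = len(sites) if max_mods is None else min(max_mods, len(sites))
    forms = []
    for k in range(limit + 1):
        forms.extend(combinations(sites, k))
    return forms


all_forms = modified_forms("SAYTK")                 # 3 candidate sites
capped_forms = modified_forms("SAYTK", max_mods=1)  # at most one modification
print(len(all_forms), len(capped_forms))
```

With n candidate sites and no cap there are 2^n forms per peptide, each of which needs its own theoretical spectrum.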
Low quality data due to poor fragmentation of peptide
Try different fragmentation methods together
Multistage MS \((MS^n)\)
Multiple co-isolated precursor ions can result in a complex composite spectrum of fragment ions
Difficulty/bias when matching low-abundance and/or modified proteins
Search criteria/parameters: too much flexibility results in too large a search space, which increases search time and false-positive IDs.
Proteotypic peptides: those most confidently observed after proteins are cleaved by enzymatic digestion or chemical cleavage
Facilitate search when possible
Interpretation of result: is the match real? True/False Positive/Negative
Receiver Operating Characteristic (ROC) plot: Sensitivity vs Specificity
Decoy database method (or target-decoy search)
Result validation method to correctly estimate false discovery rate
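A minimal sketch of target-decoy FDR estimation, assuming a search against a concatenated target and decoy database where each peptide-spectrum match (PSM) is a `(score, is_decoy)` pair and higher scores are better. The simple decoys/targets estimate and the function names are assumptions; published variants differ in the exact formula.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at or above the threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 1.0 if decoys else 0.0
    return decoys / targets


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR does not exceed max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Hypothetical PSMs: (score, is_decoy)
psms = [(9.0, False), (8.0, False), (7.0, False),
        (6.0, True), (5.0, False), (4.0, True)]

print(fdr_at_threshold(psms, 5.0), score_cutoff_for_fdr(psms, 0.01))
```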
Protein identification from peptides: sequence tag search
Extrapolation when data are absent or missing
“one hit wonders”: when a protein only has one peptide identified
Identified peptides are not unique to a protein (protein inference problem)
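One common way to approach the protein inference problem is parsimony: report a small set of proteins that explains all identified peptides. The sketch below uses a greedy set-cover heuristic, which is an assumption chosen for brevity; it is not guaranteed to find the smallest set, and the protein and peptide names are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedily pick proteins until every identified peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


def unique_peptides(protein_to_peptides):
    """Peptides that map to exactly one protein."""
    counts = {}
    for peptides in protein_to_peptides.values():
        for pep in peptides:
            counts[pep] = counts.get(pep, 0) + 1
    return {pep for pep, n in counts.items() if n == 1}


# Hypothetical identifications.
evidence = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB", "pepC"},   # all peptides shared with P1
    "P3": {"pepD"},           # a "one hit wonder"
}
print(parsimonious_proteins(evidence), sorted(unique_peptides(evidence)))
```

Here P2 is not reported because P1 already explains all of its peptides, although P2 may still be present in the sample.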
Target protein sequence is slightly different from existing protein in the database (different species for example)
Homology search?
Peptide de novo sequencing algorithms when target proteins/peptides are not present in the database
Imperfect data
Incomplete ladders with missing fragment ions
A lot more peaks than just the N-terminal and C-terminal ion ladders
The only choice for a new peptide that is not present in a database
Explore multiple fragmentation modes for the same peptide and compare results from de novo sequencing of the different spectra
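The core idea of de novo sequencing, reading residues from mass differences between consecutive ladder peaks, can be sketched as below. The residue table is a subset, the 0.02 Da tolerance is an assumption, and the input is assumed to be an already isolated, singly charged ion ladder of one type; real spectra contain many additional peaks that must be dealt with first.

```python
# Monoisotopic residue masses in Da (subset, for illustration only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "K": 128.09496, "E": 129.04259,
}


def read_ladder(peaks, tol=0.02):
    """Map each gap between consecutive peaks to residue(s); '?' if none fits."""
    ordered = sorted(peaks)
    calls = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - delta) <= tol]
        calls.append("/".join(matches) if matches else "?")
    return calls


# Hypothetical b-ion ladder (b1..b4) of a peptide starting with S-A-V-L.
complete = [88.039306, 159.076416, 258.144826, 371.228886]
# Same ladder with the b3 ion missing.
incomplete = [88.039306, 159.076416, 371.228886]

print(read_ladder(complete))
print(read_ladder(incomplete))
```

The complete ladder yields A, V, and an ambiguous L/I call (leucine and isoleucine are isobaric), while the missing fragment leaves a gap that no single residue explains.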
Correctly identifying the proteins is not the same as providing accurate protein sequence
The original protein sequence might be slightly different from that in the database due to mutations
Protein sequencing
Edman degradation
Digest target protein with multiple enzymes, measure with MS/MS, followed by de novo sequencing and finally assembly of peptides informatically
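The final assembly step can be sketched as a greedy merge of overlapping peptide sequences, in the spirit of overlap-based sequence assembly. This is a simplified assumption of how such assembly might work: it requires exact overlaps of at least `min_len` residues and ignores sequencing errors and ambiguous residues. The peptide sequences are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(peptides, min_len=3):
    """Greedily merge the pair with the largest overlap until none remain."""
    contigs = list(dict.fromkeys(peptides))  # de-duplicate, keep order
    # Drop peptides fully contained in another peptide.
    contigs = [c for c in contigs
               if not any(c != other and c in other for other in contigs)]
    while True:
        best_k, best_a, best_b = 0, None, None
        for a in contigs:
            for b in contigs:
                if a != b:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_a, best_b = k, a, b
        if best_k == 0:
            return contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_k:])


# Hypothetical de novo peptides from digests with different enzymes.
reads = ["MKTAYIAK", "YIAKQRQ", "QRQISFVK", "TAYIAK"]
print(assemble(reads))
```

Using several enzymes with different cleavage rules is what produces the overlapping peptides that make this kind of merging possible.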
References
Lill et al., Proteomics in the pharmaceutical and biotechnology industry: a look to the next decade (2021) (https://doi.org/10.1080/14789450.2021.1962300)
Duncan et al., The pros and cons of peptide centric proteomics (2010) (https://www.nature.com/articles/nbt0710-659)
Sadygov, Cociorva, and Yates III, Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book (2004) (https://www.nature.com/articles/nmeth725i)
Ma, Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics (2010) (https://doi.org/10.1007/s11390-010-9309-1)
Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007) (https://doi.org/10.1002/pmic.200700116)