Anh H. Reynolds

Physical/Analytical Chemist

Challenges in Bottom-up Proteomics

working on this...

Sample preparation
Instrumentation: mass spectrometry

Resolution: ability to distinguish different m/z
Mass accuracy
Centroiding assigns each peak a single m/z value
The assigned m/z of the centroided peak may be different from the real m/z of the ion – mass error tolerances
Detection range: usually 100 Da to a few thousand Da, as proteins are first enzymatically digested into peptides
Dynamic range: the ability to detect proteins and peptides with vastly different relative abundances
Orbitrap analyzers typically cover 5-6 orders of magnitude, whereas peptide and protein concentrations in a biological matrix can span more than 10 orders of magnitude
Sensitivity – the ability to detect low level protein and peptide signals
Ionization efficiency: peak intensity or ion current count is not directly proportional to the molecule’s abundance in the sample
Ion source used: ESI or MALDI
Mass analyzer: LIT, QQQ, TOF, FTICR, Orbitrap
Fragmentation method: CID/CAD, ETD, HCD, etc.
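
The resolution, centroiding, and mass accuracy items above can be made concrete with a little arithmetic. The following is a minimal sketch, not any vendor's algorithm: it assumes an intensity-weighted mean as the centroid, expresses mass error in parts per million (ppm), and takes resolving power as m/z divided by the peak's full width at half maximum. All function names and the 10 ppm default are illustrative assumptions.

```python
def centroid(mzs, intensities):
    """Intensity-weighted mean m/z of the points that make up one profile peak."""
    total = sum(intensities)
    return sum(m * i for m, i in zip(mzs, intensities)) / total


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=10.0):
    """True if the observed m/z falls inside a symmetric ppm window."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


def resolving_power(mz, fwhm):
    """Resolving power R = m / delta_m, with delta_m taken as the peak FWHM."""
    return mz / fwhm


if __name__ == "__main__":
    peak_mz = centroid([500.000, 500.002, 500.004], [1.0, 2.0, 1.0])
    print(round(peak_mz, 4), round(ppm_error(peak_mz, 500.0), 2))
```

A search engine would apply a check like `within_tolerance` to every candidate peptide mass, which is why the chosen tolerance directly controls the size of the search space.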

Proteolysis results in unanticipated cleavage products

Only a small number of peptides/proteins are recovered and identified

The protein inference problem – multiple proteins can share the same subset of peptides under proteolysis (peptide degeneracy)
Peptides are not detected

Issues during protein digestion
Lost on the column during LC
Unexpected PTMs on the peptides
Inefficient ionization
Inefficient fragmentation during MS
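
Digestion issues such as missed cleavages are one reason an expected peptide never shows up. A rough sketch of an in silico tryptic digest is given below; it assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P, and the function name and parameters are hypothetical.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0, min_length=1):
    """Return peptides from cleaving after K/R (not before P), in sequence order."""
    # Zero-width split points: just after K or R, when the next residue is not P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = []
    for start in range(len(pieces)):
        last = min(start + missed_cleavages, len(pieces) - 1)
        for end in range(start, last + 1):
            peptide = "".join(pieces[start:end + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AAKRPGGRCC"))
    print(tryptic_digest("AAKRPGGRCC", missed_cleavages=1))
```

Allowing even one missed cleavage noticeably increases the number of candidate peptides, which feeds into the search-space concerns discussed later.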

Data processing
Data analysis: First, each MS/MS spectrum is searched against a database to identify a peptide sequence. Then, the peptides are grouped together to identify the proteins.
Qualitative analysis: identification of proteins present in a complex biological matrix

Erroneous assignments
Data variations: proteins might not be reproducibly identified from the same sample
The absence of evidence for the presence of a peptide is not necessarily evidence for the absence of the peptide

Throughput

Challenges in data processing and analysis

Data quality: low-quality data leave a high percentage of spectra unmatched/unassigned

Noise: poor fragmentation of the peptide ions, poorly selected precursor ions, etc.

Data processing

Better peak picking algorithm
Deconvolution algorithms

Charge deconvolution: conversion of multiply charged ions to singly charged ions
Noise deconvolution
Isotope deconvolution
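
Charge and isotope deconvolution both rest on simple relations between m/z, charge, and mass. The sketch below assumes positive-mode protonated ions, a proton mass of 1.007276466 Da, and an isotope spacing of about 1.00335 Da divided by the charge; the function names are illustrative.

```python
PROTON_MASS = 1.007276466   # Da
ISOTOPE_SPACING = 1.00335   # Da, approximate 13C - 12C mass difference


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass of an [M + zH]z+ ion."""
    return (mz - PROTON_MASS) * charge


def to_singly_charged(mz, charge):
    """m/z the same species would have as an [M + H]+ ion."""
    return neutral_mass(mz, charge) + PROTON_MASS


def charge_from_isotope_spacing(spacing):
    """Estimate charge from the m/z gap between adjacent isotope peaks."""
    return round(ISOTOPE_SPACING / spacing)


if __name__ == "__main__":
    z = charge_from_isotope_spacing(0.5017)
    print(z, round(to_singly_charged(500.0, z), 4))
```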

Large amount of input data with hundreds of thousands of MS/MS spectra generated per sample

Handling large datasets: the data size can exceed what the software scales to, often forcing users to divide the dataset, run it in batches, and merge the results at the end
Data compression: no direct access to individual spectra

Lossy vs lossless compression
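
The divide, run in batches, and merge workflow mentioned above can be sketched in a few lines. This is only an outline under assumptions: `search_fn` stands in for whatever per-batch search a real tool performs, and the spectra are treated as an arbitrary iterable.

```python
from itertools import islice


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def process_in_batches(spectra, batch_size, search_fn):
    """Run search_fn on each batch and merge the per-batch results in order."""
    merged = []
    for chunk in batches(spectra, batch_size):
        merged.extend(search_fn(chunk))
    return merged


if __name__ == "__main__":
    # Hypothetical stand-in for a search: just label each spectrum id.
    print(process_in_batches(range(5), 2, lambda chunk: [f"scan-{s}" for s in chunk]))
```

Because the generator never holds more than one batch in memory, the same pattern works when the spectra are streamed from disk.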

Data storage and format

mzXML and mzData double to triple data file size compared to proprietary raw data (has mzML solved this?)
mzML stores peak data as base64-encoded binary arrays – not easily readable by humans

raw data is binary and can break XML structure
converting binary to text for readability will increase file size
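
To illustrate the trade-off, the sketch below packs an m/z array as little-endian 64-bit floats, optionally compresses it losslessly with zlib, and wraps it in base64 so it can sit safely inside XML. This mirrors the general approach of XML-based formats, but the helper names are hypothetical and this is not a reader or writer for any real file format.

```python
import base64
import struct
import zlib


def encode_array(values, compress=False):
    """Pack floats as little-endian doubles, optionally zlib-compress, then base64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=False):
    """Inverse of encode_array; the round trip is lossless."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


if __name__ == "__main__":
    mzs = [100.0, 200.5, 300.25]
    text = encode_array(mzs)
    print(len(text), decode_array(text) == mzs)
```

Base64 turns every 3 bytes into 4 characters, so the text form is about a third larger than the raw binary before any compression is applied.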

Database search method

Database needs to be error free
Peptide-matching algorithms:

Accurate simulation of the MS/MS spectrum of a peptide given a specific proteolytic digestion from the peptide sequence

Type of mass spectrometer
Annotated spectrum library for easier comparison with spectral library
Indexed relational database
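
A very simple form of spectrum simulation is to compute the singly charged b- and y-ion m/z values for a candidate sequence. The sketch below uses standard monoisotopic residue masses and ignores modifications, neutral losses, higher charge states, and intensities, all of which a realistic simulation would need; the function name is illustrative.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1) and y_ions[i] is y(i+1).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    print([round(m, 4) for m in b])
    print([round(m, 4) for m in y])
```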

Scoring function to accurately assess the match between a peptide and a given MS/MS spectrum

Accurate retention time (RT) prediction, with RT included in the scoring function, can help increase identification accuracy
Combining RT with m/z values from traditional mass fingerprinting method
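
One of the simplest scoring functions is a shared peak count: the number of theoretical fragment m/z values that have an observed peak within a tolerance. The sketch below pairs it with a linear retention-time penalty. Both the penalty form and its weight are arbitrary choices made for illustration, and real search engines use considerably more elaborate scores.

```python
from bisect import bisect_left


def shared_peak_count(observed_mzs, theoretical_mzs, tol_da=0.02):
    """Count theoretical peaks that have an observed peak within +/- tol_da."""
    observed = sorted(observed_mzs)
    matched = 0
    for target in theoretical_mzs:
        # First observed peak at or above the lower edge of the window.
        idx = bisect_left(observed, target - tol_da)
        if idx < len(observed) and observed[idx] <= target + tol_da:
            matched += 1
    return matched


def rt_adjusted_score(peak_score, observed_rt, predicted_rt, weight=0.1):
    """Subtract a penalty proportional to the retention-time deviation."""
    return peak_score - weight * abs(observed_rt - predicted_rt)


if __name__ == "__main__":
    spc = shared_peak_count([100.0, 200.01, 300.5], [100.01, 200.0, 400.0])
    print(spc, rt_adjusted_score(spc, observed_rt=30.0, predicted_rt=32.0))
```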

Certain modified peptides are not present in database

PTMs of peptides: especially for a variable PTM where a residue may or may not be modified resulting in exponential growth of search space
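
The exponential growth is easy to see by enumeration: a peptide with n modifiable residues has 2^n possible forms for a single variable modification. The sketch below lists the per-residue mass shifts for each form; the function name is hypothetical, and the phosphorylation shift of 79.96633 Da is used purely as an example.

```python
from itertools import product

PHOSPHO = 79.96633  # Da, mass shift of phosphorylation (HPO3)


def modified_forms(peptide, variable_mods):
    """Enumerate every combination of per-residue mass shifts.

    variable_mods maps a residue letter to its optional mass shift.
    Each returned tuple has one entry per residue (0.0 means unmodified).
    """
    options = []
    for aa in peptide:
        if aa in variable_mods:
            options.append((0.0, variable_mods[aa]))
        else:
            options.append((0.0,))
    return list(product(*options))


if __name__ == "__main__":
    mods = {"S": PHOSPHO, "T": PHOSPHO, "Y": PHOSPHO}
    for peptide in ("PEPTIDE", "SAT", "STYSTY"):
        print(peptide, len(modified_forms(peptide, mods)))
```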

Low quality data due to poor fragmentation of peptide

Try different fragmentation methods together
Multistage MS \((MS^n)\)

Co-isolation of multiple precursor ions can result in a complex composite spectrum of fragment ions
Difficulty/bias when matching low-abundance and/or modified proteins
Search criteria/parameters: too much flexibility makes the search space too large, which increases search time and false-positive IDs.
Proteotypic peptides: the peptides most confidently observed after a protein is cleaved by enzymatic digestion or chemical cleavage

Facilitate search when possible

Interpretation of result: is the match real? True/False Positive/Negative

Receiver Operating Characteristic (ROC) plot: Sensitivity vs Specificity
Decoy database method (or target-decoy search)
Result validation method to correctly estimate false discovery rate
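
A bare-bones version of the target-decoy idea is sketched below. It assumes decoys are made by reversing target sequences, that targets and decoys are searched together, and that the false discovery rate (FDR) at a score threshold is estimated as the number of decoy hits divided by the number of target hits. Other decoy constructions and FDR formulas exist, so treat the names and choices here as illustrative.

```python
def make_decoy(sequence):
    """Build a decoy by reversing the target sequence."""
    return sequence[::-1]


def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among PSMs scoring >= threshold.

    psms is a list of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return float("inf") if decoys else 0.0
    return decoys / targets


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Lowest observed score whose estimated FDR is still <= max_fdr, or None."""
    cutoff = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            cutoff = score
    return cutoff


if __name__ == "__main__":
    psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
    print(score_cutoff_for_fdr(psms, max_fdr=0.25))
```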

Protein identification from peptides: sequence tag search

Extrapolation when data are absent or missing
“One-hit wonders”: when a protein has only one peptide identified
Identified peptides are not unique to a protein (protein inference problem)
Target protein sequence is slightly different from existing protein in the database (different species for example)
Homology search?
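
The protein inference problem can be illustrated with a toy example: map each identified peptide to every protein that contains it, then pick a small set of proteins that explains all peptides. The sketch below uses a greedy set-cover heuristic as one possible parsimony approach. It is not the method of any particular tool, and the protein names and sequences are made up.

```python
def peptide_to_proteins(proteins, peptides):
    """Map each peptide to the sorted names of proteins whose sequence contains it."""
    return {
        pep: sorted(name for name, seq in proteins.items() if pep in seq)
        for pep in peptides
    }


def parsimonious_proteins(pep_map):
    """Greedily choose proteins until every mapped peptide is explained."""
    uncovered = {pep for pep, prots in pep_map.items() if prots}
    chosen = []
    while uncovered:
        counts = {}
        for pep in uncovered:
            for prot in pep_map[pep]:
                counts[prot] = counts.get(prot, 0) + 1
        # Most peptides explained wins; ties go to the alphabetically first name.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {pep for pep in uncovered if best not in pep_map[pep]}
    return chosen


if __name__ == "__main__":
    proteins = {"A": "MKAAAKCCCK", "B": "MKAAAKDDDK", "C": "MKEEEK"}
    pep_map = peptide_to_proteins(proteins, ["AAAK", "CCCK", "EEEK"])
    print(pep_map)
    print(parsimonious_proteins(pep_map))
```

Here the shared peptide AAAK cannot distinguish protein A from protein B; only the unique peptide CCCK supports A, and nothing specifically supports B.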

Peptide de novo sequencing algorithms when target proteins/peptides are not present in the database

Imperfect data

Incomplete ladders with missing fragment ions
A lot more peaks than just the N-terminal and C-terminal ion ladders

The only choice for a new peptide that is not present in a database
Explore multiple fragmentation modes for the same peptide and compare results from de novo sequencing of the different spectra
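
The core step of de novo sequencing is reading residues from the mass differences between neighbouring peaks of one ion ladder. The sketch below assumes the ladder peaks are singly charged, belong to a single ion series, and have already been picked out of the spectrum, which is the hard part in practice. It also shows how a missing fragment ion leaves a gap and why leucine and isoleucine cannot be told apart by mass.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ladder_mzs, tol_da=0.01):
    """Translate gaps between consecutive ladder peaks into residue labels.

    Isobaric residues are joined with "/"; a gap that matches no single
    residue is reported as "[mass]".
    """
    peaks = sorted(ladder_mzs)
    labels = []
    for low, high in zip(peaks, peaks[1:]):
        delta = high - low
        hits = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol_da]
        labels.append("/".join(hits) if hits else f"[{delta:.3f}]")
    return labels


if __name__ == "__main__":
    b_ladder = [98.06004, 227.10263, 324.15539, 425.20307, 538.28713]
    print(read_ladder(b_ladder))                 # complete ladder
    print(read_ladder([98.06004, 324.15539]))    # one fragment ion missing
```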

Correctly identifying the proteins is not the same as providing accurate protein sequence

The original protein sequence might be slightly different from that in the database due to mutations
Protein sequencing

Edman degradation
Digest target protein with multiple enzymes, measure with MS/MS, followed by de novo sequencing and finally assembly of peptides informatically
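
The final assembly step can be sketched as a greedy merge of overlapping peptides, similar in spirit to overlap-based sequence assembly. The example assumes error-free peptide sequences and a minimum overlap of three residues. Real de novo output contains errors and L/I ambiguity, so this is only an illustration, and the function names are hypothetical.

```python
def overlap(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(peptides, min_overlap=3):
    """Greedily merge the pair with the largest overlap until none remain."""
    contigs = list(dict.fromkeys(peptides))  # drop duplicates, keep order
    while True:
        # Discard peptides fully contained in another contig.
        contigs = [p for p in contigs if not any(p != q and p in q for q in contigs)]
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(assemble(["MKTAYIAK", "YIAKQRQI", "QRQISFVK"]))
```

Digesting with several enzymes matters because peptides from a single enzyme do not overlap each other, so there would be nothing to merge on.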

References

Lill et al., Proteomics in the pharmaceutical and biotechnology industry: a look to the next decade (2021) (https://doi.org/10.1080/14789450.2021.1962300)
Duncan et al., The pros and cons of peptide-centric proteomics (2010) (https://www.nature.com/articles/nbt0710-659)
Sadygov, Cociorva, and Yates III, Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book (2004) (https://www.nature.com/articles/nmeth725i)
Ma, Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics (2010) (https://doi.org/10.1007/s11390-010-9309-1)
Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007) (https://doi.org/10.1002/pmic.200700116)