Anh H. Reynolds

Physical/Analytical Chemist

Challenges in Bottom-up Proteomics



working on this...

  1. Sample preparation
  2. Instrumentation: mass spectrometry
    • Resolution: ability to distinguish different m/z
    • Mass accuracy
      • Centroiding assigns each peak a single m/z value
      • The assigned m/z of the centroided peak may differ from the true m/z of the ion, hence the need for mass error tolerances
    • Detection range: usually 100 Da to a few thousand Da, since proteins are first enzymatically digested into peptides
    • Dynamic range: ability to detect proteins and peptides that differ vastly in relative abundance
      • An Orbitrap typically covers 5-6 orders of magnitude, while peptide and protein concentrations can span > 10 orders of magnitude in a biological matrix
    • Sensitivity – the ability to detect low-level protein and peptide signals
    • Ionization efficiency: peak intensity or ion current count is not directly proportional to a molecule’s abundance in the sample
    • Ion source used: ESI or MALDI
    • Mass analyzer: LIT, QQQ, TOF, FTICR, Orbitrap
    • Fragmentation method: CID/CAD, ETD, HCD, etc.
  3. Proteolysis results in unanticipated cleavage products
    • Only a small number of peptides/proteins are recovered and identified
  4. The protein inference problem – multiple proteins can share the same subset of peptides under proteolysis (peptide degeneracy)
  5. Peptides are not detected
    • Issue during protein digestion
    • Lost on the column during LC
    • Unexpected PTMs on the peptides
    • Inefficient ionization
    • Inefficient fragmentation during MS
  6. Data processing
  7. Data analysis: First, each MS/MS spectrum is searched against a database to identify a peptide sequence. Then the identified peptides are grouped together to identify the proteins.
  8. Qualitative analysis: identification of proteins present in a complex biological matrix
    • Erroneous assignments
    • Data variations: proteins might not be reproducibly identified from the same sample?
    • The absence of evidence for the presence of a peptide is not necessarily evidence for the absence of the peptide
  9. Throughput
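
The mass accuracy and mass error tolerance points under item 2 can be made concrete with a little arithmetic. The following is a minimal sketch, not a vendor algorithm: the function names are hypothetical, the ions are assumed to be protonated ([M + zH]z+), and the 10 ppm default tolerance is only an illustrative value.

```python
PROTON_MASS = 1.007276466  # mass of a proton in Da


def mz_from_mass(neutral_mass: float, charge: int) -> float:
    """Theoretical m/z of a protonated ion [M + zH]z+ (assumed ion type)."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error of a centroided peak in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 10.0) -> bool:
    """True if the peak lies within the mass error tolerance.

    The 10 ppm default is an illustrative assumption, not a recommendation.
    """
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


if __name__ == "__main__":
    theoretical = mz_from_mass(1000.0, 2)
    print(f"theoretical m/z: {theoretical:.4f}")
    print(f"error: {ppm_error(theoretical + 0.002, theoretical):.2f} ppm")
```

Expressing the error in ppm rather than in Da keeps a single tolerance meaningful across the whole detection range, because the same absolute error is a larger relative error at low m/z.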

Challenges in data processing and analysis

  1. Data quality: low-quality data results in a high percentage of spectra being unmatched/unassigned
    • Noise: poor fragmentation of the peptide ions, poorly selected precursor ions, etc.
  2. Data processing
    • Better peak picking algorithm
    • Deconvolution algorithms
      • Charge deconvolution: conversion of multiply charged ions to singly charged ions
      • Noise deconvolution
      • Isotope deconvolution
  3. Large amounts of input data, with hundreds of thousands of MS/MS spectra generated per sample
    • Handling of large datasets: the size of the data exceeds the scalability of the software, often resulting in a need to divide the dataset, run it in batches, and merge the results at the end
    • Data compression: compressed data gives no direct access to an individual spectrum
      • Lossy vs lossless compression
    • Data storage and format
      • mzXML and mzData files are two to three times the size of the proprietary raw data (has mzML solved this?)
      • mzML stores peak data as Base64-encoded binary – not easily readable by humans
        • raw data is binary and can break the XML structure
        • converting binary to text for readability will increase file size
  4. Database search method
    • Database needs to be error free
    • Peptide-matching algorithms:
      • Accurate simulation, from the peptide sequence, of the MS/MS spectrum of a peptide produced by a specific proteolytic digestion
        • Type of mass spectrometer
        • Annotated spectral library for easier comparison with experimental spectra
        • Indexed relational database
      • Scoring function to accurately assess the match between a peptide and a given MS/MS spectrum
        • Accurate retention time (RT) prediction, with RT included in the scoring function, can help increase identification accuracy
        • Combining RT with m/z values from the traditional mass fingerprinting method
      • Certain modified peptides are not present in database
        • PTMs of peptides: especially variable PTMs, where a residue may or may not be modified, resulting in exponential growth of the search space
      • Low-quality data due to poor fragmentation of the peptide
        • Try different fragmentation methods together
        • Multistage MS (\(\mathrm{MS}^n\))
      • Multiple possible precursor ions can result in a complex composite spectrum of fragment ions
      • Difficulty/bias when matching low-abundance and/or modified proteins
      • Search criteria/parameters: too much flexibility results in too large a search space, which increases search time and false-positive IDs
      • Proteotypic peptides: the peptides most confidently observed after proteins are cleaved by enzymatic digestion or chemical cleavage
        • Facilitate search when possible
      • Interpretation of result: is the match real? True/False Positive/Negative
        • Receiver Operating Characteristic (ROC) plot: sensitivity vs. 1 - specificity
        • Decoy database method (or target-decoy search)
        • Result validation method to correctly estimate false discovery rate
    • Protein identification from peptides: sequence tag search
      • Extrapolation when data are absent or missing
      • “one hit wonders”: when a protein only has one peptide identified
      • Identified peptides are not unique to a protein (protein inference problem)
      • Target protein sequence is slightly different from the existing protein in the database (a different species, for example)
      • Homology search?
  5. Peptide de novo sequencing algorithms when target proteins/peptides are not present in the database
    • Imperfect data
      • Incomplete ladders with missing fragment ions
      • A lot more peaks than just the N-terminal and C-terminal ion ladders
    • The only choice for a new peptide that is not present in a database
    • Explore multiple fragmentation modes for the same peptide and compare results from de novo sequencing of the different spectra
  6. Correctly identifying the proteins is not the same as providing an accurate protein sequence
    • The original protein sequence might be slightly different from that in the database due to mutations
    • Protein sequencing
      • Edman degradation
      • Digest the target protein with multiple enzymes, measure with MS/MS, then perform de novo sequencing and finally assemble the peptides computationally
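
The decoy database method and false discovery rate estimation listed under item 4 can be sketched as follows. This is a simplified illustration, not the procedure of any particular search engine: it assumes a concatenated target-decoy search with a decoy database of the same size as the target database, so that the number of decoy hits above a score threshold estimates the number of false target hits. The function names and the `(score, is_decoy)` tuple layout are hypothetical.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (match score, True if the hit is a decoy)


def q_values(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) sorted by decreasing score."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Estimated FDR at each score threshold: decoys / targets accepted so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which this match would still be accepted,
    # i.e. the running minimum of the FDR taken from the lowest score upward.
    result = []
    running_min = float("inf")
    for (score, is_decoy), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        result.append((score, is_decoy, running_min))
    result.reverse()
    return result


def accepted_targets(psms: List[PSM], max_fdr: float = 0.01) -> List[float]:
    """Scores of target matches whose q-value is at or below max_fdr."""
    return [s for s, is_decoy, q in q_values(psms) if not is_decoy and q <= max_fdr]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True),
               (7.0, False), (6.0, False), (5.0, True)]
    for row in q_values(example):
        print(row)
```

The 1% default for `max_fdr` is only an illustrative threshold. Using q-values rather than the raw FDR makes the accepted set monotonic in the score, so a higher-scoring match is never rejected while a lower-scoring one is kept.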
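
The exponential growth of the search space caused by variable PTMs (item 4) follows from simple counting: if a peptide has n residues that may or may not carry a modification, there are 2^n candidate forms to score. The sketch below enumerates them for illustration only. The function name is hypothetical, the `*` marker is an ad hoc notation, and treating S, T, and Y as modifiable sites (as for phosphorylation) is an assumed example rather than a search setting taken from the text.

```python
from itertools import product
from typing import List


def modified_forms(peptide: str, variable_sites: str = "STY",
                   mark: str = "*") -> List[str]:
    """Enumerate every modified/unmodified combination of a peptide.

    Each residue in variable_sites may or may not carry the modification,
    so a peptide with n such residues yields 2**n candidate forms.
    """
    options = [
        (aa, aa + mark) if aa in variable_sites else (aa,)
        for aa in peptide
    ]
    return ["".join(form) for form in product(*options)]


if __name__ == "__main__":
    for peptide in ("PEPTIDES", "SSTTYY"):
        forms = modified_forms(peptide)
        print(peptide, len(forms))
```

Allowing several different variable modifications at once multiplies the number of options per residue, which is why search engines usually cap the number of modifications considered per peptide.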

References

  1. Lill et al., Proteomics in the pharmaceutical and biotechnology industry: a look to the next decade (2021) (https://doi.org/10.1080/14789450.2021.1962300)
  2. Duncan et al., The pros and cons of peptide centric proteomics (2010) (https://www.nature.com/articles/nbt0710-659)
  3. Sadygov, Cociorva, and Yates III, Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book (2004) (https://www.nature.com/articles/nmeth725i)
  4. Ma, Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics (2010) (https://doi.org/10.1007/s11390-010-9309-1)
  5. Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007) (https://doi.org/10.1002/pmic.200700116)

© 2022 · Anh H. Reynolds
