Anh H. Reynolds

Physical/Analytical Chemist

Bottom-up Proteomics: An Overview


MS-based proteomics falls into two categories: bottom-up proteomics and top-down proteomics. In top-down proteomics, intact proteins are analyzed directly using high-resolution mass spectrometry, and the highly charged protein ion is fragmented to produce an MS/MS spectrum. In bottom-up proteomics, also called shotgun or peptide-centric proteomics, proteins are digested via chemical cleavage or proteolysis prior to MS analysis. The goal of bottom-up proteomics is often to identify and/or quantify all proteins in a complex biological matrix, along with their complete sequences, including post-translational modifications.

Identifying the proteins in a complex mixture is challenging: each protein is one of more than a hundred thousand encoded proteins, before even counting possible post-translational modifications. Bottom-up proteomics approaches the problem through LC-MS/MS analysis of the peptide mixture produced by site-specific cleavage or proteolysis of the original proteins.

This solves 2 problems:

Solubility: Peptides are generally more soluble than proteins
Sensitivity: LC-MS can detect peptides at much lower levels than the parent proteins

The resulting MS/MS spectra can be matched against a database of simulated MS/MS spectra of peptides generated through in silico digestion of proteins. Protein identification done this way is called automated sequence database searching. Alternatives are isolation and purification of individual proteins followed by residue-by-residue sequencing, and sequence inference by de novo interpretation of fragment ion spectra or by means of sequence tags. Stable isotope-labelled peptides can be added for quantitative analysis of specific proteins. Label-free quantitation can be done with spectral counting or ion-current measurement.
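The in silico digestion step can be illustrated with a minimal Python sketch. It assumes the commonly used trypsin rule (cleave C-terminal to K or R, except when the next residue is P); the function name tryptic_digest and the missed_cleavages parameter are hypothetical and not taken from any search engine.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from an in silico tryptic digest of a protein sequence.

    Assumed rule: cleave after K or R unless the next residue is P.
    """
    # Collect cleavage positions, including both ends of the protein.
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        # Allow up to `missed_cleavages` skipped sites per peptide.
        for missed in range(missed_cleavages + 1):
            end = start + 1 + missed
            if end < len(sites):
                peptide = sequence[sites[start]:sites[end]]
                if peptide:
                    peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    # The R at position 11 is followed by P, so it is not cleaved.
    print(tryptic_digest("MKWVTFISLLRPAGKDE"))
    # ['MK', 'WVTFISLLRPAGK', 'DE']
```

Real search engines also apply peptide length and mass filters before generating theoretical spectra; those are left out here.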

Standard protocol for protein identification

Digestion of proteins into peptides
Peptide separation with liquid chromatography
MS/MS spectra acquisition

Data-dependent acquisition (DDA) or data-independent acquisition (DIA)
LC or 1D/2D gel separation may have to be performed off-line for complex samples
Hundreds of thousands of MS/MS spectra are obtained for one sample: some arise from noise, and low-abundance proteins may not produce any spectra

Protein identification through matching with a database or de novo sequencing

Database search

Mascot, PEAKS, SEQUEST, X!Tandem, OMSSA, Phenyx, etc.
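To make the matching idea concrete, here is a toy Python sketch that scores candidate peptides by counting shared peaks between an observed peak list and theoretical singly charged b- and y-ions. This is not the scoring function of any engine listed above; the function names, the 0.02 m/z tolerance, and the shared-peak-count score are assumptions chosen for illustration.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I": 113.08406,
    "L": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        prefix = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        suffix = sum(RESIDUE_MASS[aa] for aa in peptide[i:])
        b_ions.append(prefix + PROTON)
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, tol=0.02):
    """Count theoretical fragments that have an observed peak within tol."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(
        any(abs(theo - obs) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )


def best_match(observed_mz, candidates, tol=0.02):
    """Return the candidate peptide with the highest shared-peak count."""
    return max(candidates, key=lambda p: shared_peak_count(observed_mz, p, tol))


if __name__ == "__main__":
    # Simulate a noise-free spectrum of PEPTIDE and search three candidates.
    b, y = fragment_ions("PEPTIDE")
    print(best_match(b + y, ["AAAAAAA", "PEPTIDR", "PEPTIDE"]))
```

In practice the candidate list would first be restricted to peptides whose precursor mass matches the measured one, and the score would account for peak intensities and random matches.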

De novo sequencing: constructing ladders of fragment ions, for example y-ions (or b-ions) for a CID MS/MS spectrum. The amino acid sequence of the peptide is determined from the mass differences between adjacent peaks in the ladders.

PEAKS, PepNovo, Lutefisk
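The ladder-reading step of de novo sequencing can be sketched as follows. This is a simplified illustration that assumes a clean, complete ladder of singly charged y-ions has already been picked out of the spectrum; the function name read_ladder and the 0.01 Da tolerance are hypothetical. It is not how PEAKS, PepNovo, or Lutefisk work internally.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I": 113.08406,
    "L": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.01):
    """Translate mass differences between adjacent ladder peaks into residues.

    Returns one entry per gap, in order of increasing m/z. Isobaric residues
    (I and L) are reported together; unexplained gaps are reported as '?'.
    """
    peaks = sorted(peaks)
    residues = []
    for low, high in zip(peaks, peaks[1:]):
        delta = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    # y1..y6 ions (1+) of the peptide PEPTIDE.
    y_ladder = [148.060431, 263.087371, 376.171431,
                477.219111, 574.271871, 703.314461]
    calls = read_ladder(y_ladder)
    print(calls)                    # ['D', 'I/L', 'T', 'P', 'E']
    # A y-ion ladder grows toward the N-terminus, so reverse it to read N to C.
    print(list(reversed(calls)))    # ['E', 'P', 'T', 'I/L', 'D']
```

The first and last residues are not recovered from the gaps alone: the lowest peak (y1) and the precursor mass are needed to close the sequence, and real spectra contain missing and spurious peaks that make the problem much harder.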

Assumptions in bottom-up analysis

Proteolysis by a specific protease reproducibly results in a small number of peptides
Unique identification of a protein precursor is possible with a small subset of peptides
This relationship holds whether the protein is purified or in a complex protein mixture
The target protein and all its variants are in the database
Peptides resulting from protein cleavage are fully recovered and detected

Practical applications

Protein identification of a biological sample

Of limited use, since comprehensive, consistent, and reproducible results are not attainable

Single-analyte assay: differences in protein levels between two or more sample populations

Method validation

Accuracy, precision (repeatability, intermediate precision, and reproducibility)
Specificity
LOD, LOQ, linearity and range
Ruggedness and robustness
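As an illustration of how LOD, LOQ, and linearity might be estimated from a calibration curve, the sketch below fits a straight line by least squares and applies the widely used convention LOD = 3.3 σ/S and LOQ = 10 σ/S, where σ is the residual standard deviation and S is the slope. That convention (from the ICH Q2 guideline) and the function name calibration_limits are assumptions not stated in the notes above; other definitions, such as signal-to-noise based ones, are also in use.

```python
import math


def calibration_limits(conc, signal):
    """Fit signal = intercept + slope * conc and estimate LOD and LOQ."""
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, signal))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard deviation with n - 2 degrees of freedom.
    residuals = [y - (intercept + slope * x) for x, y in zip(conc, signal)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))

    return {
        "slope": slope,
        "intercept": intercept,
        "sigma": sigma,
        "lod": 3.3 * sigma / slope,
        "loq": 10.0 * sigma / slope,
    }


if __name__ == "__main__":
    # Hypothetical calibration data: concentration vs. peak area.
    fit = calibration_limits([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
    print(fit)
```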

Label-free quantification: fast, cost-effective, simple

Samples are analyzed in separate experiments under the same conditions

Digest the protein mixture with a protease such as trypsin

Spectral counting or ion-current measurement (chromatographic peak intensity)

Ion-current measurement: difficult in practice due to presence of multiple confounding factors that compromise precision: sample preparation, injection volume, retention time, coeluting species, temperature and pressure fluctuations, etc.
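Spectral counting can be sketched in a few lines. Raw counts favor long proteins, which yield more peptides, so the example uses one common normalization, the normalized spectral abundance factor (NSAF): each protein's count is divided by its length and then by the sum of that quantity over all proteins. NSAF is not named in the notes above, and the function and variable names are hypothetical.

```python
def nsaf(spectral_counts, protein_lengths):
    """Return the normalized spectral abundance factor for each protein.

    spectral_counts: dict of protein id -> number of matched MS/MS spectra
    protein_lengths: dict of protein id -> sequence length in residues
    """
    per_length = {
        protein: count / protein_lengths[protein]
        for protein, count in spectral_counts.items()
    }
    total = sum(per_length.values())
    return {protein: value / total for protein, value in per_length.items()}


if __name__ == "__main__":
    # Protein B has twice the spectra but is four times longer than A.
    counts = {"A": 10, "B": 20}
    lengths = {"A": 100, "B": 400}
    print(nsaf(counts, lengths))
```

Comparing NSAF values for the same protein across runs then gives a rough relative abundance estimate, subject to the precision issues listed above.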

Identify and map characteristic peaks for the same peptide from different spectral data (peptide features)

Retention time correction due to LC variations
Spectra/sequence alignment algorithms (Smith-Waterman)
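For reference, a minimal Smith-Waterman local alignment is sketched below. It operates on plain strings with an assumed scoring scheme (match +2, mismatch -1, linear gap -2); aligning spectra or peptide features would need a domain-specific similarity score in place of the character comparison.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores are floored at zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAPEPTIDEKK", "GGPEPTIDERR"))
    # (14, 'PEPTIDE', 'PEPTIDE')
```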

Compute protein ratios from averaged peptide ratios

Intensity splitting when a peptide feature is shared by multiple proteins
Outlier removal: errors from incorrect peptide identification or overlapping of peptide features
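The ratio computation with outlier removal can be sketched as below. The specific choices are assumptions, not something the notes prescribe: ratios are averaged in log2 space, and a peptide is treated as an outlier when its log ratio lies more than three median absolute deviations (MAD) from the median. Intensity splitting for shared peptides is not covered.

```python
import math
import statistics


def protein_ratio(peptide_ratios, max_mad=3.0):
    """Combine peptide-level ratios into one protein ratio.

    Averages log2 ratios after discarding values further than
    max_mad median absolute deviations from the median.
    """
    logs = [math.log2(r) for r in peptide_ratios if r > 0]
    median = statistics.median(logs)
    mad = statistics.median([abs(x - median) for x in logs])

    if mad > 0:
        kept = [x for x in logs if abs(x - median) <= max_mad * mad]
    else:
        # More than half the values are identical; skip filtering.
        kept = logs
    return 2 ** statistics.mean(kept)


if __name__ == "__main__":
    # Five consistent peptide ratios and one misidentified peptide.
    ratios = [2.0, 2.2, 1.8, 2.1, 1.9, 20.0]
    print(round(protein_ratio(ratios), 3))
```

With these hypothetical ratios the value 20.0 is discarded and the protein ratio comes out close to 2, whereas a plain arithmetic mean of all six values would be about 5.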

Stable isotope labels: improved precision at the expense of higher cost and complexity

Isotope-Coded Affinity Tags (ICAT)
Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC)
Isobaric Tag for Relative and Absolute Quantitation (iTRAQ)
Relative intensities between the characteristic peaks are used to compute the quantity ratio
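For SILAC-style labeling, the characteristic peaks are light/heavy pairs separated by a known mass shift. The sketch below predicts where the heavy partner of a peptide should appear and computes the intensity ratio. It assumes the commonly used 13C6 15N2 lysine (+8.014199 Da) and 13C6 15N4 arginine (+10.008269 Da) labels; other label sets exist, and the function names are hypothetical.

```python
# Assumed heavy-label mass shifts (Da) per labeled residue.
HEAVY_SHIFT = {"K": 8.014199, "R": 10.008269}


def heavy_mz(light_mz, peptide, charge):
    """Predict the m/z of the heavy-labeled partner of a light peptide ion."""
    shift = sum(HEAVY_SHIFT.get(aa, 0.0) for aa in peptide)
    return light_mz + shift / charge


def heavy_to_light_ratio(light_intensity, heavy_intensity):
    """Relative quantity of the heavy sample versus the light sample."""
    return heavy_intensity / light_intensity


if __name__ == "__main__":
    # A doubly charged tryptic peptide with one C-terminal lysine.
    print(heavy_mz(500.0, "PEPTIDEK", charge=2))
    print(heavy_to_light_ratio(light_intensity=1.0e6, heavy_intensity=2.5e6))
```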

PTM characterization: report post-translational modifications (PTMs) in the target protein

Most common PTMs: phosphorylation, glycosylation, methylation, acetylation, and acylation
Modification sites are not defined by the genome, so often not present in protein database

Searching for all possible PTMs exponentially increases computational complexity
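A short sketch shows where the exponential growth comes from: every modifiable residue can independently carry any one of its allowed modifications or none, so the number of candidate forms of a peptide is a product over its residues. The modification table below is a hypothetical example, not a complete list.

```python
def count_modified_forms(peptide, variable_mods):
    """Count distinct modification states of a peptide, unmodified form included.

    variable_mods maps a residue letter to the list of variable
    modifications allowed on that residue.
    """
    total = 1
    for aa in peptide:
        # Each site is either unmodified or carries one of its allowed mods.
        total *= 1 + len(variable_mods.get(aa, ()))
    return total


if __name__ == "__main__":
    phospho_only = {"S": ["phospho"], "T": ["phospho"], "Y": ["phospho"]}
    print(count_modified_forms("SSTTYY", phospho_only))   # 64

    more_mods = dict(phospho_only, K=["acetyl", "methyl"])
    print(count_modified_forms("PEPTIDEK", more_mods))    # 6
```

Six phosphorylatable sites already give 2^6 = 64 forms for one peptide, and each form needs its own theoretical spectrum, which is why search engines typically cap the number of variable modifications per peptide.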

Proteins with PTMs may have low abundance and may not be detectable with MS/MS
Peptides with PTMs can produce spectra that are too complex to interpret

Phosphorylation

β-elimination

Glycosylation

Glycans have a tree structure rather than a linear structure like peptides
Glycans can have variable structures and mass values depending on the modification sites
Cleave glycans enzymatically, then identify the structures of the released glycans?

The existence and extent of PTMs in a given sample are unknown

PTM quantification: what percentage of proteins are modified by a certain variable PTM
Sequencing of non-standard peptides with non-linear sequence: peptides with disulfide bonds or non-ribosomal peptides

References

Lill et al., Proteomics in the pharmaceutical and biotechnology industry: a look to the next decade (2021) (https://doi.org/10.1080/14789450.2021.1962300)
Duncan et al., The pros and cons of peptide centric proteomics (2010) (https://www.nature.com/articles/nbt0710-659)
Sadygov, Cociorva, and Yates III, Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book (2004) (https://www.nature.com/articles/nmeth725i)
Ma, Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics (2010) (https://doi.org/10.1007/s11390-010-9309-1)
Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007) (https://doi.org/10.1002/pmic.200700116)