Anh H. Reynolds

Physical/Analytical Chemist

Database-Dependent Searching



working on this...

  1. Choice of database
  2. Search parameters
    • Fragmentation mode (CID, ETD, etc.)
    • Cleavage method (enzymatic digestion, chemical cleavage)
      • Trypsin cleaves after Arg and Lys if they’re not followed by a Pro
      • Specific cleavage, nonspecific cleavage (e.g., protease contamination), missed cleavage (e.g., charged residues in proximity)
    • Modifications
      • Fixed modifications: consistent change in mass of an amino acid
      • Variable modifications: phosphorylation, glycosylation, etc.
    • Mass accuracy
      • Monoisotopic or average mass
      • Average mass is useful for low-resolution instruments
      • Mass tolerance is often specified in ppm or Da
      • Global calibration vs local calibration vs internal standard calibration
    • Retention time alignment: potentially better accuracy as a second dimension besides m/z alignment
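To make the search parameters above concrete, here is a minimal Python sketch of how cleavage rule, missed cleavages, a fixed modification, and a ppm mass tolerance combine in a simple peptide mass lookup. The function names, the example sequence, and the choice of carbamidomethylated Cys as the fixed modification are assumptions for illustration only; real search engines handle many more cases (variable modifications, charge states, fragment ions).

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
# Assumed fixed modification for this sketch: carbamidomethylation of Cys.
FIXED_MODS = {"C": 57.02146}


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after K or R unless followed by P, allowing missed cleavages."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass including the fixed modifications."""
    return WATER + sum(MONO[aa] + FIXED_MODS.get(aa, 0.0) for aa in peptide)


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_mass(observed_mass, protein, tol_ppm=10.0, missed_cleavages=1):
    """Return the peptides whose theoretical mass lies within the tolerance."""
    return [
        p for p in tryptic_digest(protein, missed_cleavages)
        if abs(ppm_error(observed_mass, peptide_mass(p))) <= tol_ppm
    ]


if __name__ == "__main__":
    protein = "AKRPGKC"  # hypothetical sequence
    print(tryptic_digest(protein, missed_cleavages=1))
    observed = peptide_mass("RPGK") * (1 + 5e-6)  # simulate a 5 ppm error
    print(match_mass(observed, protein))
```

A ppm tolerance scales with the mass being matched, whereas a Da tolerance is a fixed window; the sketch uses ppm, which is the more common choice for high-resolution data.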
  3. Data storage and format
    • Two data types contained within raw MS data sets:
      • Numeric data: m/z, peak intensities, and retention times
      • Metadata: instrument and experiment settings
    • mzML: unified format built upon mzXML and mzData supported by the Proteomics Standards Initiative (PSI)
      • Many-fold larger file size than the original proprietary vendor format
      • Slower read and write speeds
      • Metadata accurately and unambiguously annotated using PSI-MS controlled vocabulary
      • Stored in text-based XML format
      • Numeric data get converted to text strings with Base64 encoding
      • zlib compression possible before encoding
      • Numpress: an encoding scheme to compress the binary numeric data before Base64 encoding
        • Truncation of numbers and rounding to integers
        • File size reduction by 61%
        • Relatively constant relative error
        • Appropriate choice of threshold can minimize effect on downstream analysis
    • mz5: data format based on HDF5 (hierarchical data format version 5)
      • Binary format allowing complex data relationships and dependencies
      • 50% file size reduction
      • Multiple data sets can be stored within a hierarchical group structure
      • A group is a container that holds data sets and other groups
      • Data sets are multidimensional arrays of data elements
      • Metadata stored as attributes (key-value pairs)
      • Large data sets stored in chunks of variable sizes storable in cache for repeated access
      • Requires complete reimplementation of mzML where tags and numeric data have to be remapped to structures that mimic tables in a relational database
    • mzDB: lightweight SQLite relational database
      • 2-dimensional data blocks
      • 25% file size reduction
      • mzDB-access and pwiz-mzDB (for DDA) and mzDB-Swath (for DIA)
      • Metadata are not compressed but stored in param_tree fields in XML format
      • Numeric data are also not compressed (can be compressed with SQLite)
    • imzML: similar to mzML, for easier visualization of the data using third-party software
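The way mzML turns numeric arrays into text (optional zlib compression followed by Base64 encoding) can be sketched with the Python standard library. This is an illustration under assumptions, not a parser for real mzML files: the function names are hypothetical, the values are packed as little-endian 64-bit floats, and Numpress is not included.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian doubles, optionally zlib-compress, then Base64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse encode_array: Base64-decode, optionally decompress, unpack doubles."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


if __name__ == "__main__":
    mz = [100.0, 200.5, 300.25]  # hypothetical m/z values
    encoded = encode_array(mz)
    print(encoded)
    print(decode_array(encoded))
    # Base64 alone turns every 3 bytes into 4 characters.
    print(len(encode_array(mz, compress=False)))
```

Base64 by itself inflates the binary data by about a third, which is one reason the text-based format ends up larger than a binary one unless compression is applied first.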
  4. Data parsing
  5. Search algorithms
    • Peptide sequence tag by Mann et al. (1994)
    • Cross-correlation function for comparing theoretical and observed spectra by Eng et al. (first SEQUEST paper, 1994)
    • Iterative search approach by Jensen et al. (1997) for peptide mass mapping to sequence databases
    • Many MS databases exist but are commercial. How can the search results be validated?
      • Log likelihood ratio
      • Hidden Markov models for scoring peptide sequence matches to observed MS/MS spectra (ref. 31 in reference 1)
      • Receiver operating characteristic (ROC) plot
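As a rough illustration of the ROC plot listed above, the sketch below computes ROC points and the area under the curve from match scores with known labels (1 for a correct identification, 0 for an incorrect one). The scores and labels are made up, the function names are hypothetical, and the code assumes distinct scores (ties are not grouped) and at least one example of each label.

```python
def roc_points(scores, labels):
    """Sweep the score threshold from high to low and collect (FPR, TPR) points."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]  # hypothetical match scores
    labels = [1, 0, 1, 0]          # hypothetical ground truth
    pts = roc_points(scores, labels)
    print(pts)
    print(area_under_curve(pts))
```

An area of 1.0 means the score separates correct from incorrect matches perfectly, while 0.5 corresponds to random ranking.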

References

  1. Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007), doi: 10.1002/pmic.200700116
  2. Chen et al., Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis (2020), doi: 10.3390/ijms21082873
  3. Bhamber et al., mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements, J. Proteome Res. 2021, 20, 172–183

© 2022 · Anh H. Reynolds
