Anh H. Reynolds

Database-Dependent Searching

working on this...

Trypsin cleaves C-terminal to Arg and Lys, unless the next residue is Pro
Specific cleavage, unspecific cleavage (e.g., protease contamination), missed cleavage (e.g., charged residues in proximity)
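
A minimal sketch of an in silico tryptic digest following the rule above (cut after K or R unless the next residue is P), with an optional number of missed cleavages. The function names and the example sequence are illustrative assumptions, not taken from any particular search engine.

```python
def cleavage_sites(sequence):
    """Positions where trypsin is expected to cut (after K/R, not before P)."""
    return [
        i + 1
        for i, residue in enumerate(sequence[:-1])
        if residue in "KR" and sequence[i + 1] != "P"
    ]


def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides allowing up to `missed_cleavages` skipped sites."""
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    # Fully cleaved fragments between consecutive cut positions.
    fragments = [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
    peptides = []
    for i in range(len(fragments)):
        # Join each fragment with up to `missed_cleavages` following fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence; the R at position 3 is followed by P and is not cut.
    print(tryptic_digest("MKRPAKGR"))
    print(tryptic_digest("MKRPAKGR", missed_cleavages=1))
```

Unspecific cleavage is not modeled here; a search engine would typically handle it by relaxing the rule at one or both peptide termini.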

Retention time alignment: potentially better accuracy as a second dimension besides m/z alignment
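
To illustrate why retention time helps as a second dimension, the sketch below pairs features from two runs only when both the m/z error and the retention time difference fall within tolerance. This is not a full alignment algorithm (there is no warping of the retention time axis); the function name, tolerances, and feature values are assumptions chosen for illustration.

```python
def match_features(run_a, run_b, ppm_tol=10.0, rt_tol=30.0):
    """Pair (mz, rt) features from two runs within both tolerances.

    ppm_tol is the m/z tolerance in parts per million; rt_tol is the
    retention time tolerance in seconds (both hypothetical defaults).
    """
    matches = []
    for mz_a, rt_a in run_a:
        for mz_b, rt_b in run_b:
            ppm_error = abs(mz_a - mz_b) / mz_a * 1e6
            if ppm_error <= ppm_tol and abs(rt_a - rt_b) <= rt_tol:
                matches.append(((mz_a, rt_a), (mz_b, rt_b)))
    return matches


if __name__ == "__main__":
    run_a = [(500.2500, 1200.0)]
    # Both candidates agree in m/z, but only the first also agrees in time.
    run_b = [(500.2510, 1210.0), (500.2505, 2400.0)]
    print(match_features(run_a, run_b))
    print(match_features(run_a, run_b, rt_tol=float("inf")))
```

With the retention time constraint switched off, both candidates match on m/z alone, which is the ambiguity the second dimension is meant to resolve.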

mzML: unified format built upon mzXML and mzData supported by the Proteomics Standards Initiative (PSI)

Many fold larger file size than the original proprietary vendor format
Slower read and write speeds
Metadata accurately and unambiguously annotated using PSI-MS controlled vocabulary
Stored in text-based XML format
Numeric data get converted to text strings with Base64 encoding
zlib compression possible before encoding
Numpress: an encoding scheme to compress the binary numeric data before Base64 encoding
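
The encoding path described above (numeric array, optional zlib compression, then Base64 text) can be sketched with the Python standard library. This assumes 64-bit little-endian floats; the helper names are hypothetical, and Numpress is not shown.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian 64-bit, optionally zlib-compress, then Base64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse of encode_array: Base64-decode, optionally inflate, unpack."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


if __name__ == "__main__":
    mz_values = [100.0, 200.5, 300.25]  # illustrative m/z values
    encoded = encode_array(mz_values, compress=False)
    print(encoded)
    print(decode_array(encoded, compressed=False))
```

Base64 turns every 3 bytes into 4 text characters, so the encoded text is about a third larger than the binary data it represents, which is one contribution to the file size overhead noted above.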

Binary format allowing complex data relationships and dependencies
50% file size reduction
Multiple data sets can be stored within a hierarchical group structure
A group is a container construct
Data sets are multidimensional arrays of data elements
Metadata stored as attributes (key-value pairs)
Large data sets stored in chunks of variable sizes storable in cache for repeated access
Requires complete reimplementation of mzML where tags and numeric data have to be remapped to structures that mimic tables in a relational database

imzML: similar to mzML, for easier visualization of the data using third-party software

Peptide sequence tag by Mann et al. (1994)
Cross-correlation function for comparing theoretical and observed spectra by Eng et al. (SEQUEST first paper 1994)
Iterative search approach by Jensen et al. (1997) for peptide mass mapping to sequence databases
Many MS databases exist but are commercial. How to validate the search results?
- Log likelihood ratio
- Hidden Markov Models for scoring peptide sequence matches to observed MS/MS spectra (Ref 31 of 1)
- Receiver operating characteristic (ROC) plot
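
As a rough illustration of the ROC plot as a validation tool, the sketch below computes ROC points and the area under the curve from match scores. It assumes each match is already labeled correct or incorrect (for example, from a reference data set with known answers); the function names and scores are hypothetical, and tied scores are not given special treatment.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    labels: True for a correct match, False for an incorrect one.
    The threshold sweeps from the highest score to the lowest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, is_correct in ranked:
        if is_correct:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]  # illustrative match scores
    labels = [True, False, True, False]
    print(roc_points(scores, labels))
    print(auc(roc_points(scores, labels)))
```

An area of 1.0 means the score separates correct from incorrect matches perfectly, while 0.5 is what a random score would give.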

Matthiesen, Methods, algorithms and tools in computational proteomics: A practical point of view (2007) DOI: 10.1002/pmic.200700116
Chen et al., Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis (2020) DOI: 10.3390/ijms21082873
Bhamber et al., mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements, J. Proteome Res. 2021, 20, 172−183