The Global Proteome Machine Organization
   GPM Blog
Interesting data: PXD001197 (2017/3/5)
The raw data set was made available through the ProteomeXchange site, PXD001197. It was published as part of the study entitled "Cellular Signature of SIL1 Depletion: Disease Pathogenesis due to Alterations in Protein Composition Beyond the ER Machinery" by Roos A, Kollipara L, Buchkremer S, Labisch T, Brauers E, Gatz C, Lentz C, Gerardo-Nava J, Weis J and Zahedi RP, published in Mol Neurobiol. 2016 Oct;53(8):5527-41 (PubMed). These experiments were represented in GPMDB by 18 files, each an individual LC/MS/MS run.
This data set is a good demonstration of what can be obtained by using label-free 1D HPLC/MS/MS to profile differences induced in the common cell line HEK-293. The study reliably identifies about 3,000 distinct protein groups per LC/MS/MS experiment from about 30,000 high-quality peptide-to-sequence matches (PSMs). The PSMs are remarkable in that there were very few experimental artifacts, allowing the reliable detection of phosphorylations, acetylations and dimethyl-arginines, as well as a good distribution of the SAVs commonly observed in HEK-293 cells. As is normal in HEK-293 cells, the E1B 55K and E1B 19K proteins from Human mastadenovirus C are both prominently observed, e.g., 50 PSMs were associated with these two proteins in GPM64230001481. Anyone interested in pushing the confidence limits in protein detection should consider using this data set as an example of unusually good quality data from a hybrid linear quadrupole ion trap/orbitrap instrument.
Interesting data: MSV000079017 (2017/1/29)
The raw data set was made available through the MassIVE ProteomeXchange site, MSV000079017. It was published as part of the study entitled "Lenalidomide causes selective degradation of IKZF1 and IKZF3 in multiple myeloma cells" by Krönke J, Udeshi ND, Narla A, Grauman P, Hurst SN, McConkey M, Svinkina T, Heckl D, Comer E, Li X, Ciarlo C, Hartman E, Munshi N, Schenone M, Schreiber SL, Carr SA, and Ebert BL, published in Science. 2014 Jan 17;343(6168):301-5 (PubMed). These experiments were represented in GPMDB by 90 files, each an individual LC/MS/MS run.
This data set is a great example of how well current methods work for isolating ubiquitinylated peptides. Many of the analyses that target the lysine epsilon-amino-GG remnant result in more than 70% of the identified peptides carrying this modification. The results clearly show the folly of the often-quoted canard about the incompatibility of iodoacetamide cysteine blocking and ubiquitination detection: the problem only arises if the cysteine-blocking reaction is done very poorly. Attempts to replace iodoacetamide with the less reactive chloroacetamide usually result in an unacceptable loss of cysteine-containing peptides, which constitute 20% of observable tryptic peptides.
Interesting data: MSV000080368 (2017/1/14)
The raw data set was made available through the MassIVE ProteomeXchange site, MSV000080368. It was published as part of the study entitled "Proteomic Analysis of Pemphigus Autoantibodies Indicates a Larger, More Diverse, and More Dynamic Repertoire than Determined by B Cell Genetics" by Chen J, Zheng Q, Hammers CM, Ellebrecht CT, Mukherjee EM, Tang HY, Lin C, Yuan H, Pan M, Langenhan J, Komorowski L, Siegel DL, Payne AS and Stanley JR, published in Cell Rep. 2017 Jan 3;18(1):237-247 (PubMed). These experiments were represented in GPMDB by 64 files, each an individual LC/MS/MS run.
This research was carried out to characterize the antibodies responsible for an autoimmune disease known as pemphigus, in which antibodies form against a common family of epidermal proteins, the desmogleins. The experiments involved pull-downs using desmoglein as the bait to obtain samples enriched in anti-desmoglein antibodies from serum derived from six patients. The results generated small lists of proteins, averaging about 200 per run, but they are very complex to interpret, with many immunoglobulin-related sequences that share extensive regions of overlapping tryptic peptides. This data is ideal for anyone interested in developing algorithms for coping with this type of protein reassembly complexity ("protein inference"). It is also a good set of data to work through if your main interest is applying proteomics to the immunology of antibody response.
How Deep is It Really: Mitochondrial Chromosome-Encoded Proteins (2017/1/12)
Last week (see below) it was suggested that cytosolic aminoacyl tRNA synthetases were a group of proteins that could be used to characterize sampling-related effects (particularly "under-sampling") in proteomics result sets. This week a group of proteins that can be used to check the level-of-detection for integral membrane proteins will be proposed, namely the thirteen protein subunits encoded on the mitochondrial chromosome.
These thirteen proteins are translated inside the mitochondrion, using the mitochondrial ribosome (mitoribosome). All of these proteins are inner mitochondrial membrane protein subunits involved in the electron transport chain and are required for oxidative phosphorylation. The proteins contain membrane-spanning domains and include some of the most hydrophobic proteins in the human proteome. The members of this group in Homo sapiens are listed in Table 1.
Protein   Observations   Peptides   Length (residues)
MT-ATP6          5,646         11                 226
MT-ATP8          2,306          4                  68
MT-CO1           1,870         26                 513
MT-CO2          16,978         50                 227
MT-CO3           1,096          5                 261
MT-CYB             968          3                 380
MT-ND1           2,414         10                 318
MT-ND2           1,435          4                 347
MT-ND3           1,582          1                 115
MT-ND4           2,287          7                 459
MT-ND4L              0          0                  98
MT-ND5           3,368         22                 603
MT-ND6             561          1                 174
Table 1. The thirteen human proteins encoded by the mitochondrial chromosome, how often they have been observed and the number of confidently observed peptides (taken from GPMDB).
These protein subunits are easy to locate in a result list, as they have the only gene names that begin with "MT-". They have a wide range of observability, ranging from MT-CO2:p (16,978 ×) to MT-ND4L:p (0 ×). Counting the number of these sequences that are present in a particular result set obtained from a cell lysate or membrane preparation indicates how well an experimental protocol performed at obtaining peptides from integral membrane proteins.
Please note that the observability of these proteins varies from species to species because of minor changes in the amino acid sequence. For example, in mice MT-ND4L:p is observable while MT-ND6:p is not.
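As a rough illustration of the counting described above, here is a minimal Python sketch that tallies how many of the thirteen human "MT-" subunits appear in a list of gene names. The gene names come from Table 1; the function name mt_coverage and the example result list are hypothetical and only there to show the idea.

```python
# Gene names of the thirteen human proteins encoded on the mitochondrial
# chromosome (taken from Table 1).
MT_GENES = frozenset({
    "MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB",
    "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6",
})


def mt_coverage(observed_genes):
    """Return (found, total) for the MT- subunits in a result list.

    `observed_genes` is any iterable of gene names from a result set.
    Names are compared case-insensitively after stripping whitespace.
    `found` is the sorted list of MT- gene names that were present.
    """
    observed = {g.strip().upper() for g in observed_genes}
    found = sorted(MT_GENES & observed)
    return found, len(MT_GENES)


if __name__ == "__main__":
    # Hypothetical result list, for illustration only.
    example = ["ACTB", "MT-CO2", "MT-ATP6", "GAPDH", "mt-nd5"]
    found, total = mt_coverage(example)
    print(f"{len(found)} of {total} MT- subunits observed: {', '.join(found)}")
```

For a species other than human the reference set would need to be adjusted, since observability differs between species.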
Interesting data: PXD003818 (2017/1/4)
The raw data set was made available through the PRIDE ProteomeXchange site, PXD003818. It was published as part of the study entitled "Nuclear Proteomics Uncovers Diurnal Regulatory Landscapes in Mouse Liver" by Wang J, Mauvoisin D, Martin E, Atger F, Galindo AN, Dayon L, Sizzano F, Palini A, Kussmann M, Waridel P, Quadroni M, Dulić V, Naef F, and Gachon F, published in Cell Metab. 2016 Oct 31. pii: S1550-4131(16)30534-4 (PubMed). These experiments were represented in GPMDB by 20 result files, each a composite of 12 fractions obtained by off-gel focusing.
These results demonstrate that the experiment succeeded in enriching nuclear proteins from Mus musculus hepatocytes. They also show that +6 Da lysine SILAC labelling works well for liver samples from mice fed labelled lysine chow. The mass spectrometry was good quality with good calibration stability over the course of the multiple fraction measurements. This stability and good parent ion peak shape allowed the confident assignment of N and Q deamidations and a significant number of common protein phosphorylations. The experimental protocol resulted in some urea-generated amine carbamylations (3–4 % of identifiable peptides) but kept the IAA-generated amine carboxamidomethylations to a minimum (~ 0.2 %).
How Deep is It Really: Cytosolic Aminoacyl tRNA Synthetases (2017/1/2)
For some reason that I have never really understood, journal editors started to allow the use of the descriptive term "deep" as a qualifier for larger proteomics result sets. While it has been used frequently in the literature, as far as I know there has never been any discussion as to how to qualify a data set as being "deep" (or not). While this term may simply be a reflection of the current trend towards increasingly baroque terms used to promote a particular group's work, it does suggest an interesting question: How can you easily state the LOD and the extent of undersampling in a set of results so that it can be readily explained to biomedical collaborators?
One way to characterize a data set is to compare the proteins observed with a list of proteins that should be present in a sample. Many groups use this approach, but tend to be rather coy about the lists of proteins that they use. These lists often are based on the research interests of the particular group, so they may be difficult to adapt to general proteomics results. Over the next few weeks, I'll propose a few lists of protein groups that can be used for specific purposes in proteomics result analysis.
The first of these protein groups is the Cytosolic Aminoacyl tRNA Synthetases. These enzymes are responsible for charging tRNA with the appropriate amino acid for use in protein synthesis. All of these enzymes must be present for protein synthesis to occur. Most of these enzymes require only one subunit, with the exception of Phe-tRNA synthetase, which is a heterodimer composed of FARSA:p & FARSB:p. Most of these enzymes only charge one specific tRNA, with the exceptions of EPRS (charges both Glu- and Pro-tRNA) and SARS (charges both Ser-tRNA and Sec-tRNA with serine). This enzyme group is useful for characterizing samples composed mainly of cell contents that were prepared without affinity purification. The twenty members of this group in Homo sapiens are listed in Table 1.
Protein   Observations   f (%)   Length (residues)
CARS            16,048     3.3                 831
HARS            17,225     3.5                 509
FARSA           18,746     3.9                 508
SARS            19,315     4.0                 514
FARSB           19,344     4.0                 589
NARS            21,285     4.4                 548
YARS            23,442     4.8                 528
MARS            23,641     4.9                 900
WARS            24,445     5.0                 471
QARS            24,497     5.0                 775
LARS            24,566     5.1               1,176
TARS            25,214     5.2                 723
KARS            25,914     5.3                 597
GARS            25,968     5.3                 739
RARS            26,798     5.5                 660
VARS            27,525     5.7               1,264
IARS            27,757     5.7               1,262
AARS            28,017     5.8                 968
DARS            29,059     6.0                 501
EPRS            36,915     7.6               1,512
Table 1. The twenty human cytosolic aminoacyl tRNA synthetase subunits and their relative frequency of observation in GPMDB.
This table shows that the most frequently observed enzyme, EPRS:p, has been seen a little more than twice as often as the least frequently observed, CARS:p (36,915:16,048), but none of the subunits are inherently difficult to find in MS/MS proteomics data. They are all mid-sized, soluble cytosolic proteins with many peptides that can be used for identification in either data-dependent or data-independent experiments. Simply counting the number of these subunits observed and dividing by 20 gives a very quick estimate of how well an experiment has performed. The higher this value, the less an experiment has been affected by undersampling.
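The count-and-divide estimate is simple enough to script. The sketch below is one hypothetical way to do it in Python: the gene names are those in Table 1, while the function name ars_fraction and the example input are made up for illustration.

```python
# The twenty human cytosolic aminoacyl tRNA synthetase subunits (Table 1).
ARS_GENES = (
    "AARS", "CARS", "DARS", "EPRS", "FARSA", "FARSB", "GARS", "HARS",
    "IARS", "KARS", "LARS", "MARS", "NARS", "QARS", "RARS", "SARS",
    "TARS", "VARS", "WARS", "YARS",
)


def ars_fraction(observed_genes):
    """Fraction of the twenty subunits present in a result list.

    `observed_genes` is any iterable of gene names; matching ignores
    case and surrounding whitespace. A value near 1.0 suggests little
    undersampling, a low value suggests a lot.
    """
    observed = {g.strip().upper() for g in observed_genes}
    hits = sum(1 for gene in ARS_GENES if gene in observed)
    return hits / len(ARS_GENES)


if __name__ == "__main__":
    # Hypothetical result list, for illustration only.
    example = ["EPRS", "DARS", "AARS", "ACTB", "IARS", "VARS"]
    print(f"ARS coverage: {ars_fraction(example):.2f}")
```

The result is only meaningful for the kind of sample described above, i.e. mainly cell contents prepared without affinity purification.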
Tips & Tricks: Trypsin methylation (2016/12/18)
A recent article by M. Schittmayer, et al. (Cleaning out the Litterbox of Proteomic Scientists' Favorite Pet: Optimized Data Analysis Avoiding Trypsin Artifacts, J Proteome Res. 2016 Apr 1;15(4):1222-9, PubMed) has stimulated some interest in testing proteomics data for peptides associated with chemically modified trypsin. Trypsin can be chemically N-methylated at lysine to reduce the amount of autolytic cleavage, resulting in enzymatic activity being sustained over longer periods of time. In practice, the sequencing grade trypsin offered by Promega has become a standard reagent because of this property.
If you want to be sure to catch the modified trypsin peaks in your data using X! Tandem, you should use the following steps:
  1. Download the file crap_ptm.xml from the GPM FTP site and place it in the directory you are currently using for amino acid variation information, e.g. "../saps". This file contains the residues in porcine trypsin that can be methylated.
  2. Add the following line to the taxonomy.xml file in your installation:
    <file format="saps" URL="../saps/crap_ptm.xml" />
    in each taxon entry that you want to include checking for methylated tryptic peptides. The taxon must include the "crap.fasta.pro" sequence file.
You are done! Now any search that uses that taxon will include the porcine trypsin mono- and di-methylated lysine residues that were present in the data described by the manuscript.
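If you have many taxon entries, step 2 can be scripted. The Python sketch below is only an illustration and makes some assumptions: that taxonomy.xml has <taxon> elements containing <file> children with a URL attribute (the example document, its taxon labels and its FASTA paths are made up), and that the helper name add_trypsin_ptm is hypothetical. Check the output against your own installation before using it.

```python
import xml.etree.ElementTree as ET

SAPS_URL = "../saps/crap_ptm.xml"  # location used in step 1
CRAP_FASTA = "crap.fasta.pro"      # sequence file the taxon must include


def add_trypsin_ptm(root):
    """Add the saps <file> line to every taxon that lists crap.fasta.pro.

    Taxa that already reference SAPS_URL are left alone, so the function
    can be run more than once. Returns the labels of the changed taxa.
    """
    changed = []
    for taxon in root.iter("taxon"):
        files = taxon.findall("file")
        has_crap = any(f.get("URL", "").endswith(CRAP_FASTA) for f in files)
        has_saps = any(f.get("URL") == SAPS_URL for f in files)
        if has_crap and not has_saps:
            ET.SubElement(taxon, "file", {"format": "saps", "URL": SAPS_URL})
            changed.append(taxon.get("label"))
    return changed


if __name__ == "__main__":
    # Assumed layout of a taxonomy.xml file; labels and paths are invented.
    example = """<bioml label="x! taxon-to-file matching list">
  <taxon label="human">
    <file format="peptide" URL="../fasta/human.fasta.pro" />
    <file format="peptide" URL="../fasta/crap.fasta.pro" />
  </taxon>
  <taxon label="yeast">
    <file format="peptide" URL="../fasta/yeast.fasta.pro" />
  </taxon>
</bioml>"""
    root = ET.fromstring(example)
    print("changed taxa:", add_trypsin_ptm(root))
    print(ET.tostring(root, encoding="unicode"))
```

Only the "human" taxon in this example is modified, because it is the only one that includes the crap.fasta.pro file.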

P.S. The letter "O" is already used for the rare genomically encoded amino acid pyrrolysine (Pyl) and it should never be used in FASTA files to substitute for lysine (K).
Interesting data: PXD002121 (2016/12/13)
The raw data set was made available through the PRIDE ProteomeXchange site, PXD002121. It was published as part of the study entitled "Sensing Small Changes in Protein Abundance: Stimulation of Caco-2 Cells by Human Whey Proteins" by Cundiff JK, McConnell EJ, Lohe KJ, Maria SD, McMahon RJ and Zhang Q, published in J Proteome Res. 2016 Jan 4;15(1):125-43 (PubMed). These experiments were represented in GPMDB by 16 result files.
While this paper may not be well known, it contains many of the best identifications of human cellular proteins currently available. The data is composed of 857 RAW files, which have been organized into 16 multidimensional chromatography runs using 6-plex TMT for relative quantitation. The experiments were performed using the human male colon adenocarcinoma cell line Caco-2, which produces significant amounts of L1RE1:p. The only significant experimental artifact was the commonly found off-target carbamidomethylation of lysine amino groups and peptide N-termini. It is an excellent data set in which to find exemplar spectra for peptides derivatized with 6-plex TMT, with HCD fragmentation and high accuracy parent & fragment mass determination (Q-Exactive).
Interesting data: PXD003700 (2016/12/8)
The raw data set was made available through the PRIDE ProteomeXchange site, PXD003700. It was published as part of the study entitled "Proteome-wide analysis of arginine monomethylation reveals widespread occurrence in human cells" by Larsen SC, Sylvestersen KB, Mund A, Lyon D, Mullari M, Madsen MV, Daniel JA, Jensen LJ and Nielsen ML, published in Sci Signal. 2016 Aug 30;9(443):rs9 (PubMed). These experiments were represented in GPMDB by 10 result files.
These experiments were undertaken to determine the extent of arginine monomethylation in a normally functioning cellular proteome. HEK293-T cells were chosen as a stand-in for normal cells. The methods used do a good job of enriching for monomethyl-arginine and the modified residue was easily detectable in the resulting MS/MS data. Dimethyl-arginine was also easily detectable, although in lesser amounts. The samples were also enriched for the rare PTM hypusine (which only occurs on one residue of EIF5A2:p). The data makes an excellent case study for testing algorithms that attempt to find single amino acid variants (SAVs), as methylation mimics many common SAVs, which can lead to the over-prediction of SAVs by naïve algorithms.
New Hardware Added to GPMDB (2016/12/7)
GPMDB is the largest source of detailed information about the evidence supporting the observation of proteins, peptides, PTMs and SAVs using modern tandem mass spectrometry-based proteomics. This means that it has to keep up with the large amount of raw data being made available through resources like ProteomeXchange, PeptideAtlas, jPOST, Chorus and others. While actually doing the data analysis to convert the raw data into results can be a bit of a chore, it is something that can be done by simply adding more computers to solve the problems. Recording that information in the GPMDB database is a very linear process that cannot be easily parallelized, making this database loading step a potential bottleneck.
The system GPMDB had been using for the last four years had been optimized repeatedly, but the maximum data recording rate that could be achieved was about 0.4 million peptide identifications per hour. At this rate, the results generated by analyzing public data frequently required 24 hour-a-day operation and still there were days when all of the results could not be added: they had to wait for a pause in raw data availability to complete. This situation has only been getting worse as the size, complexity and tempo of proteomics data set release increases.
To resolve this problem, it was necessary to create a new hardware solution to increase the speed of loading results into GPMDB. Last week the new hardware was assembled, installed and tested. This new equipment has a proven result loading rate of 5 million peptide identifications per hour, which gives GPMDB a maximum loading capacity of about 40 billion peptide identifications a year. This capacity should be sufficient for at least the next three years of efficient operation.
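The yearly capacity follows directly from the hourly rate. As a back-of-the-envelope check, the Python sketch below assumes uninterrupted, round-the-clock loading for 365 days, which is an idealization; the constant and function names are just for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumes loading never pauses

OLD_RATE = 400_000    # peptide identifications per hour, previous system
NEW_RATE = 5_000_000  # peptide identifications per hour, new hardware


def yearly_capacity(rate_per_hour):
    """Upper bound on identifications loaded in a year at a constant rate."""
    return rate_per_hour * HOURS_PER_YEAR


if __name__ == "__main__":
    for label, rate in (("old", OLD_RATE), ("new", NEW_RATE)):
        print(f"{label}: {yearly_capacity(rate) / 1e9:.1f} billion per year")
    print(f"speed-up: {NEW_RATE / OLD_RATE:.1f}x")
```

Under these assumptions the new hardware has a ceiling of roughly 43.8 billion identifications a year, in line with the "about 40 billion" figure quoted above, and the loading rate is 12.5 times that of the old system.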
Copyright © 2016, The Global Proteome Machine Organization.