The GPM Data Set of the Week, 2010

The Global Proteome Machine Organization

GPMDB Data set of the week

The GPMDB contains thousands of data sets contributed by researchers around the world. Every week, we select a data set because of its technical excellence, biological interest or simply because we think it is of general interest to the proteomics community.


Data sets of the year: (2010/12/26)
Technical, Biological and Clinical.

This week we are awarding the title "Data set of the year" to three outstanding examples of publicly available proteomics experimental data. These awards are in three categories:

Technical data: Nagaraj N, et al.
Feasibility of large scale phosphoproteomics with HCD fragmentation.
This data is the most convincing evidence yet that HCD is in the process of revolutionizing the experimental and instrumental requirements for top quality proteomics.
Biological data: Merrihew GE, et al.
Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations.
A clear demonstration that the interaction between protein-level and genome-level experimental data is valuable, even in very well studied model species.
Clinical data: Drake RR, et al.
In-depth proteomic analyses of direct expressed prostatic secretions.
The very model of a well-designed proteomics clinical study.

Data set of the week: (2010/12/19)
An Expanded Oct4 Interaction Network: Implications for Stem Cell Biology, Development, and Disease.

This study contains 7 LC/MS/MS runs, from pull-down experiments. The manuscript describing this work was published by Pardo M, Lang B, Yu L, Prosser H, Bradley A, Babu MM, and Choudhary J, Cell Stem Cell. 2010 6:382-95 (PubMed).

This study contains very high-quality pull-down results that represent rarely observed Mus musculus proteins and peptides. Unfortunately, the original data was not made publicly available: only spectra that resulted in identifications were stored in PRIDE. Hopefully the authors will make the original data available at some point so that a more thorough analysis can be performed.

Nota bene: In looking through these results, some may notice that there was no observation of a protein named "Oct4". This seemingly odd fact is due to the confusing nature of protein naming: "Oct4" is not a currently accepted name for any mouse protein. The current name for that gene product is "Pou5f1" (POU domain, class 5, transcription factor 1), corresponding to ENSMUSP00000025271. Inspection of the current observations shows clearly that this protein has been over-represented in samples coming from mouse embryonic stem cells.

Data set of the week: (2010/12/12)
Nucleosome-interacting proteins regulated by DNA and histone methylation.

This study contains 160 LC/MS/MS runs, grouped into sets of SDS-PAGE bands. The manuscript describing this work was published by Bartke T, Vermeulen M, Xhemalce B, Robson SC, Mann M, and Kouzarides T, Cell 2010 143:470-84 (PubMed).

This work demonstrates the extent to which SILAC quantitation has become a mainstream technique in molecular biology. The study addresses a biologically important question, uses an excellent lab to perform the proteomics instrumental analysis and applies straightforward, established informatics methods to interpret the proteomics data in the context of the biological question.

Data set of the week: (2010/12/05)
Comparative shotgun proteomics using spectral count data and quasi-likelihood modeling.

This study contains 153 LC/MS/MS runs, grouped into sets of MudPit experiments. The analysis for each individual LC/MS/MS and summaries of the MudPit runs were recorded. The manuscript describing this work was published by Li M, Gray W, Zhang H, Chung CH, Billheimer D, Yarbrough WG, Liebler DC, Shyr Y, and Slebos RJ, J Proteome Res. 2010 9:4295-305 (PubMed).

While this set of data was generated for a specific statistical study, it also represents a very good resource for anyone interested in the study of the bioinformatics and statistics of proteomics experimental analysis. The tissues selected were of clinical interest (head and neck carcinomas), the equipment was state-of-the-art and the experimental groups involved were first rate. Many data sets generated for bioinformatics analysis are not really representative of current best laboratory practices, but this one genuinely exceeds expectations.

Data set of the week: (2010/11/28)
Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics.

This study contains 28 tissue sample data sets. The manuscript describing this work was published by Baerenfaller K, Grossmann J, Grobei MA, Hull R, Hirsch-Hoffmann M, Yalovsky S, Zimmermann P, Grossniklaus U, Gruissem W, and Baginsky S, Science 2008 320:938-41 (PubMed).

This work is still probably the most comprehensive proteomics study of Arabidopsis thaliana tissues available. Each of the individual samples corresponds to > 9,000 peptide identifications and > 1,000 non-redundant protein identifications. It can be used as a reliable catalogue of observable peptides and proteins for the corresponding A. thaliana tissues and cell-culture samples.

Data set of the week: (2010/11/21)
Prioritization of candidate protein biomarkers from an in vitro model system of breast tumor progression toward clinical verification.

This study contains 5 individual LC/MS/MS runs. The manuscript describing this work was published by Lau TY, Power KA, Dijon S, de Gardelle I, McDonnell S, Duffy MJ, Pennington SR, and Gallagher WM, J Proteome Res. 2010 9:1450-9 (PubMed).

The data is a good example of what can be achieved using a QTOF-style instrument for analyzing gel bands. The relatively good resolution obtained on the fragment ions makes peptide identifications more positive (FDR ≈ 0.1%) and generally improves the confidence of the resulting protein identifications. The approach used in the paper has some merit for determining the suitability of proteins as biomarkers, although much of the comparative work could have been done using existing databases of observable plasma and serum proteins.

Data set of the week: (2010/11/14)
Proteomic Analysis of Human Nail Plate.

This study contains 40 individual LC/MS/MS runs. The manuscript describing this work was published by Rice RH, Xia Y, Alvarado RJ, and Phinney BS, J Proteome Res. 2010 Nov 1 (Epub ahead of print, PubMed).

The data investigates the proteins present in two common but sparsely investigated human tissues: hair and nail plate. These non-cellular tissues are composed mainly of high-sulphur (hard) keratins and keratin-associated proteins in different proportions. These proteins are unusually abundant on Chromosome 17, with more than 60 genes clustered between chromosome coordinates 38,810,917-39,780,829 (see the Human Proteome Guide for the gene names, positions and frequency of observation).

Data set of the week: (2010/11/07)
Proteomic screen defines the Polo-box domain interactome and identifies Rock2 as a Plk1 substrate.

This study contains 24 individual result sets derived from SDS-PAGE gel bands. The manuscript describing this work was published by Lowery DM, Clauser KR, Hjerrild M, Lim D, Alexander J, Kishi K, Ong SE, Gammeltoft S, Carr SA, and Yaffe MB in EMBO J. 2007 26:2262-73 (PubMed).

This study demonstrates the power of protein affinity methods for enriching relatively rare, but biologically important proteins. The result sets contain many of the best identifications observed for proteins such as GRIPAP1, ROCK2, ANLN, EPB41L3, CLIP2 and the minichromosome maintenance complex. The methodology used here was relatively simple, but it revealed an interesting, high quality interactome that will take years of biological research to thoroughly investigate and understand.

Data set of the week: (2010/10/31)
Genome analysis and genome-wide proteomics of Thermococcus gammatolerans, the most radioresistant organism known amongst the Archaea.

This study contains 7 individual result sets; each set is the union of all spectra collected from a single SDS-PAGE gel. The manuscript describing this work was published by Zivanovic Y, Armengaud J, Lagorce A, Leplat C, Guérin P, Dutertre M, Anthouard V, Forterre P, Wincker P, and Confalonieri F. in Genome Biol. 2009;10(6):R70 (PubMed).

This study was a straightforward analysis of the proteome of a previously unexamined archaeon, T. gammatolerans. What set this study apart was the level of competence displayed by the research team in obtaining this data. The methodology used was straightforward, but they were able to consistently generate spectra good enough so that ~50% of the spectra resulted in high quality identifications. Generally, this type of strategy results in high levels of human keratins 1, 2, 9 and 10 identified, but not in this case. The data corresponded to >1000 T. gammatolerans proteins, with the largest of the individual gel sets having >60,000 identified peptides.

Data set of the week: (2010/10/24)
Feasibility of large scale phosphoproteomics with HCD fragmentation.

This study contains 25 individual samples, contrasting two methods for phosphopeptide detection. The manuscript describing this work was published by Nagaraj N, D'Souza RC, Cox J, Olsen JV, and Mann M in J. Proteome Res. 2010 (Epub ahead of print, PubMed).

This data set is a major game-changer for any group interested in high-throughput phosphopeptide detection. The combination of HCD fragmentation with high accuracy parent and fragment ion mass measurement described in the associated publication results in a level of sequence and PTM assignment accuracy that simply cannot be matched by the conventional CID approach using a low accuracy LTQ for fragment ion analysis. It is also clearly superior to ETD for high throughput phosphoproteomics: the physical chemistry of ETD makes it much better suited to the detailed characterization of difficult cases rather than broad surveys of large mixtures.

Data set of the week: (2010/10/17)
Coupled global and targeted proteomics of human embryonic stem cells during induced differentiation.

This data set contains 18 sample analyses. The manuscript describing this work was published by Yocum AK, Gratsch TE, Leff N, Strahler JR, Hunter CL, Walker AK, Michailidis G, Omenn GS, O'Shea KS, and Andrews PC in Mol Cell Proteomics 2008 7:750-67 (PubMed).

This study utilizes MALDI TOF-TOF technology to provide an excellent survey of proteins in embryonic stem cells. While MALDI has become a secondary ionization method compared with electrospray, it is still a robust method for protein identification and it provides the most reliable source for library spectra of singly charged peptide ions.

Data set of the week: (2010/10/10)
Glycosylation signatures in Drosophila: fishing with lectins.

This data set contains 1 LC/MS/MS result. The manuscript describing this work was published by Vandenborre G, Van Damme EJ, Ghesquière B, Menschaert G, Hamshou M, Rao RN, Gevaert K, and Smagghe G. in J Proteome Res. 2010 9:3235-42 (PubMed).

A carefully selected set of lectins was used to purify glycoproteins by affinity capture from Drosophila melanogaster samples. The results show that this method was able to obtain an unusually high quality set of identifications for proteins of this species, as demonstrated by the very large fraction of "best ever" identifications for the proteins reported. The peptides identified also show significantly more chymotryptic peptide cleavage than would be typical for such a study.

Data set of the week: (2010/10/03)
Global analysis of lysine ubiquitination by ubiquitin remnant immunoaffinity profiling.

This data set contains 1 LC/MS/MS result. The manuscript describing this work was published by Xu G, Paige JS, and Jaffrey SR in Nat Biotechnol. 2010 28:868-73 (PubMed).

This data was obtained from a very interesting study that describes the utility of an immunoaffinity method for purifying the peptides generated by the trypsin digest of proteins that have N-lysyl-ubiquitination. Trypsin cleaves away most of the ubiquitin bound to the lysine sidechain, leaving a Gly-Gly sequence attached. By generating an antibody that was specific for this type of modified lysine sidechain, they were able to isolate peptides from ubiquitinated proteins. This purification allowed them to overcome the large concentration ratio between the modified and unmodified proteins that has made identifying this type of modification difficult in the past. The availability of this antibody should make many interesting studies of the ubiquitin-mediated protein degradation pathway possible.

Data set of the week: (2010/09/26)
The Asia Oceania Human Proteome Organisation Membrane Proteomics Initiative. Preparation and characterisation of the carbonate-washed membrane standard.

This data set contains 2 LC/MS/MS results. The manuscript describing this work was published by Peng L, Kapp EA, Fenyö D, Kwon MS, Jiang P, Wu S, Jiang Y, Aguilar MI, Ahmed N, Baker MS, Cai Z, Chen YJ, Van Chi P, Chung MC, He F, Len AC, Liao PC, Nakamura K, Ngai SM, Paik YK, Pan TL, Poon TC, Hosseini Salekdeh G, Simpson RJ, Sirdeshmukh R, Srisomsap C, Svasti J, Tyan YC, Dreyer FS, McLauchlan D, Rawson P, and Jordan TW. in Proteomics. 2010 May 18 (PubMed).

This study, the result of a HUPO-affiliated AOHUPO project, demonstrates the effectiveness of a standardized, relatively simple protocol for the enrichment of membrane proteins. A quick inspection of the GO displays for unwashed and carbonate washed samples proves this point very nicely. Many groups still seem to believe that membrane proteins are difficult to observe using proteomics methods, so a straightforward study such as this one demonstrating the contrary is a welcome addition to the field and an excellent subject for a HUPO study.

Data set of the week: (2010/09/19)
Insulin receptor substrate influences female caste development in honeybees.

This data set contains 23 LC/MS/MS results. The original data was obtained from Peptidome (Study PSE129). The manuscript describing this work was published by Wolschin F, Mutti NS, and Amdam GV. in Biol Lett. 2011 Feb 23;7(1):112-5 (PubMed).

This study explores the insulin/insulin-like signalling (IIS) network in honeybees. Apis mellifera is an economically important species with a complete genome, but it has received only limited attention from the proteomics community. Fortunately, bee proteomics scientists have been very active in contributing their data to public repositories. Inspection of the list of all A. mellifera proteins in GPMDB shows that more than 2450 proteins have been observed and a surprising number of them have been observed more than 500 times.

Data set of the week: (2010/09/12)
Identification of pathways associated with invasive behavior by ovarian cancer cells using multidimensional protein identification technology (MudPIT).

This data set contains 252 LC/MS/MS results. The original data was obtained from TRANCHE. The manuscript describing this work was published by Sodek KL, Evangelou AI, Ignatchenko A, Agochiya M, Brown TJ, Ringuette MJ, Jurisica I, and Kislinger T. in Mol Biosyst. 2008 4:762-73 (PubMed).

This study contains probably the best information set for the detailed exploration of proteomics as a reproducible technology. Six different ovarian cancer cell lines were examined, each of which is analyzed in six replicates, each replicate containing six SCX fractions. While this study was designed to explore the differences between these cell lines, it also affords a truly useful collection of data for anyone interested in proteomics sample preparation reproducibility, measurement undersampling, search engine effectiveness, peak finding efficacy or any other aspect of proteomics data generation and handling.

The GPM results are grouped according to cell line replicates, with each replicate having six entries corresponding to the individual SCX fractions, followed by a summary result generated from those six analyses. A description containing a statement like "Data directory: SKOV_5" indicates that the result was obtained from replicate "5" of cell line "SKOV".
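The grouping described above can be checked with a little arithmetic. The following is a minimal sketch, not part of GPM: the function names are hypothetical, and it assumes that every description carries a "Data directory: NAME_N" statement and that each replicate has exactly one summary result. Under those assumptions, six cell lines, six replicates and six SCX fractions plus one summary give the 252 results listed for this data set.

```python
import re

# Hypothetical helper: pull the cell line and replicate number out of a
# description containing a statement such as "Data directory: SKOV_5".
def parse_data_directory(description):
    match = re.search(r"Data directory:\s*(\S+)_(\d+)", description)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

# Each replicate contributes one result per SCX fraction plus one summary
# result (assumption taken from the grouping described in the text).
def expected_result_count(cell_lines=6, replicates=6, fractions=6):
    return cell_lines * replicates * (fractions + 1)

print(parse_data_directory("Data directory: SKOV_5"))
print(expected_result_count())
```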

Data set of the week: (2010/09/05)
A quantitative proteomics design for systematic identification of protease cleavage events.

This data set contains three (3) COFRADIC analyses (COmbined FRActional DIagonal Chromatography). The original manuscript describing this work was published by Impens F, Colaert N, Helsens K, Ghesquiere B, Timmerman E, De Bock PJ, Chain BM, Vandekerckhove J, and Gevaert K in Mol Cell Proteomics. 2010 Jul 13 (PubMed).

The study demonstrates a relatively straightforward method for determining the cleavage specificity of proteolytic enzymes. The data analysis technique used in the original paper is somewhat complex, but the more flexible modes of analysis available in the GPM simplified the process considerably. Simple inspection of the AAA display allows the assignment of the appropriate cleavage specificities for the enzymes.

Data set of the week: (2010/08/29)
Human Ccr4-Not complexes contain variable deadenylase subunits.

This data set contains nine (9) LC/MS/MS analyses. The original manuscript describing this work was published by Lau NC, Kolkman A, van Schaik FM, Mulder KW, Pijnappel WW, Heck AJ, and Timmers HT. in Biochem J. 2009 422:443-53 (PubMed).

The study contained eight (8) pull-down experiments and one (1) control. Each pull-down is annotated with the bait protein. The experiment uses the combination of Lys-C and bovine trypsin characteristic of the Heck group, which generates a rather complete set of tryptic peptides, although there were a significant number of non-tryptic peptides generated. The sample preparation method used urea, so there was also a significant number of carbamylated peptides detected. Neither of these artifacts affects the conclusions of the study.

The study also contains a surprising number of protein identifications that are the best so far obtained in GPMDB, e.g., TNKS1BP1, RAVER1, FHL2, RQCD1, RNF219, UBAP2L, BAG3 as well as the bait CNOT proteins. Pull-down experiments, with their ability to purify an unusual fraction of proteins, seem to be very effective at obtaining the best observations of rare proteins, compared to large MudPit-style survey experiments.

Data set of the week: (2010/08/22)
Low abundance proteome of human red blood cells captured by combinatorial peptide libraries. Behavior of mono- to hexapeptides.

This data set contains 19 LC/MS/MS analyses. The original manuscript describing this work was published by Sim C, Bachi A, Cattaneo A, Guerrier L, Fortis F, Boschetti E, Podtelejnikov A, and Righetti PG. in Anal Chem 2008 80:3547-56 (PubMed).

This study is an excellent example of a very important class of study: attempting to use novel separation strategies to increase the dynamic range of tissue proteomics. The particular strategy used in this case appears to work quite well at obtaining distributions of proteins with limited specificity, while at the same time producing fractions depleted in high abundance proteins. Technically, the data is also of very high quality and it contains an unusual number of high confidence identifications of relatively small peptides (< 1000 Da).

Data set of the week: (2010/08/15)
Quantitative analysis of kinase-proximal signaling in lipopolysaccharide-induced innate immune response.

This data set contains 73 LC/MS/MS data sets obtained from mouse RAW 264.7 cells (macrophage cell line) that have been treated with lipopolysaccharide to simulate infection with Gram-negative bacteria. This data was published by Sharma K, Kumar C, Kéri G, Breitkopf SB, Oppermann FS, and Daub H in J Proteome Res. 2010, 9:2539-49 (PubMed).

The goal of the paper was to follow TOLL-like receptor phospho-signaling during this sort of simulated infection using SILAC: a combination of unlabelled and labelled samples with two different isotopic tag pairs (K(4),R(6) and K(8),R(10)) was used to detect differential protein and phosphopeptide concentrations.

In addition to the biological conclusions, this data contains some excellent examples of a common analytical artifact associated with the use of titanium dioxide phosphopeptide enrichment. Metal oxide columns work by binding peptides with low pIs (i.e., acidic peptides). While phosphopeptides certainly fill the bill as being acidic relative to most peptides, normal peptide sequences with multiple acidic sidechains are also strongly enriched by these columns. This effect can be clearly seen by using the pI vs. RT and the amino acid analysis displays. In the example used here, most of the peptides detected have a pI < 5. Aspartic acid (D) and glutamic acid (E) residues in the detected peptides are enriched to 250% and 215% of their expected composition, based on the composition of the associated proteins.
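A residue enrichment figure of this kind can be approximated by comparing the fraction of a residue in the detected peptides with its fraction in the parent proteins. The sketch below is an illustration only: the function names are hypothetical, the sequences are toy data, and the exact calculation used by the GPM amino acid analysis display may differ.

```python
from collections import Counter

def residue_fraction(sequences, residue):
    """Fraction of all residues in `sequences` that are `residue`."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return counts[residue] / total if total else 0.0

def enrichment_percent(peptides, proteins, residue):
    """Observed residue fraction in the peptides as a percentage of the
    fraction expected from the composition of the parent proteins."""
    expected = residue_fraction(proteins, residue)
    if expected == 0:
        raise ValueError("residue does not occur in the parent proteins")
    return 100.0 * residue_fraction(peptides, residue) / expected

# Toy data: one D in eight protein residues, one D in four peptide residues.
proteins = ["AKDAAKAA"]
peptides = ["ADAK"]
print(enrichment_percent(peptides, proteins, "D"))
```

A value above 100 indicates that the residue is over-represented in the detected peptides relative to the proteins they came from.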

Data set of the week: (2010/08/08)
Comparative proteome profiling of Mycobacterium tuberculosis: the response of drug-resistant and drug-sensitive strains.

This data set contains 6 (six) MudPit data sets of two different strains of M. tuberculosis, A12998 (daughter strain, drug-resistant) and A7494 (parent strain, drug-sensitive). This data was published via upload to Peptidome as Study PSE133 by Moo-Jin Suh, Rembert Pieper, and Shih-Ting Huang from the J. Craig Venter Institute.

From Peptidome: The study describes the analysis of proteins from drug-resistant and -sensitive strains of Mycobacterium tuberculosis. An LC-MS-based proteomics approach was combined with APEX to quantitatively measure relative protein abundance and to compare the cellular protein composition of Mycobacterium tuberculosis strains A12998 (daughter strain, drug-resistant) and A7494 (parent strain, drug-sensitive).

The results are probably the most thorough analysis of proteins from this important pathogen and they make up a large fraction of the Annotated Spectrum Libraries available from M. tuberculosis strains.

An unexpected piece of information made available through this data set is a good initial measurement of the phosphoproteome of this prokaryote. M. tuberculosis is known to have a serine/threonine kinase and this data set has a number of very good phosphopeptides generated by this kinase. The kinase appears to prefer threonine phosphorylation, with an S:T ratio of about 1:3. This ratio is the reverse of typical eukaryote kinases, which seem to prefer serine by about 3:1. The phosphoproteome generated from this study is available in either Excel, html or tab-separated text formats, as projected on to the proteome of strain CDC1551. Note: the original analysis in Peptidome did not include phosphorylation, so these results are only present in the GPMDB re-analysis. It would be very useful to have an IMAC-type study done on these and other M. tuberculosis strains.
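An S:T ratio like the one quoted above is simply a tally of which residue carries the phosphate at each assigned site. The following sketch assumes a hypothetical input format, a list of (peptide sequence, zero-based site position) pairs, and uses invented toy peptides rather than identifications from this data set.

```python
from collections import Counter

def phospho_residue_counts(sites):
    """Count phosphorylated S, T and Y residues.

    `sites` is an iterable of (peptide_sequence, zero_based_position)
    pairs; this input format is an assumption made for illustration.
    """
    counts = Counter(sequence[position] for sequence, position in sites)
    return counts["S"], counts["T"], counts["Y"]

# Toy example: one phosphoserine and three phosphothreonines (S:T = 1:3).
toy_sites = [("ASK", 1), ("ATK", 1), ("TTR", 0), ("GTR", 1)]
print(phospho_residue_counts(toy_sites))
```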

Data set of the week: (2010/08/01)
In-depth proteomic analyses of direct expressed prostatic secretions.

This data set contains 9 (nine) MudPit data sets, each measured from a different prostatic fluid sample from individuals with prostate cancer. The original raw data was obtained from TRANCHE. It was published by Drake RR, Elschenbroich S, Lopez-Perez O, Kim Y, Ignatchenko V, Ignatchenko A, Nyalwidhe JO, Basu G, Wilkins CE, Gjurich B, Lance RS, Semmes OJ, Medin JA, and Kislinger T. in J Proteome Res. 2010, 9:2109-16 (PubMed).

The results show the amount of variability that can be expected when analyzing biological replicates of clinically sampled material. The identifications were very high quality and are the best quality measurements of many rather rare proteins, such as KLK3 (prostate-specific antigen) and ACPP (Prostatic acid phosphatase). The data shows moderate levels of carbamylation from the urea solubilization method used. There were also significant concentrations of peptides generated by non-tryptic cleavage, probably from the presence of proteases in the sample itself as the cleavage sites were not chymotryptic. An examination of the AAA page (e.g., sample #2) showed that the "Pre" and "C-terminal" columns were broadly populated for most residues, not just the K and R residues normally expected in a trypsin cleavage experiment.
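The kind of tally that reveals non-tryptic cleavage can be sketched as follows. This is an illustration under stated assumptions, not the GPM implementation: "Pre" is taken to mean the protein residue immediately before the peptide and "C-terminal" the last residue of the peptide, the function names are hypothetical, and the sequences are toy data.

```python
from collections import Counter

def cleavage_tallies(protein, peptides):
    """Tally the residue preceding each peptide ("Pre") and the peptide's
    own last residue ("C-terminal"), skipping the protein termini."""
    pre, c_terminal = Counter(), Counter()
    for peptide in peptides:
        start = protein.find(peptide)
        if start < 0:
            continue  # peptide not found in this protein
        if start > 0:
            pre[protein[start - 1]] += 1
        if start + len(peptide) < len(protein):
            c_terminal[peptide[-1]] += 1
    return pre, c_terminal

def tryptic_fraction(tally):
    """Fraction of tallied cleavage residues that are K or R."""
    total = sum(tally.values())
    return (tally["K"] + tally["R"]) / total if total else 0.0

protein = "MKAAGRLLDEKFFW"
pre, c_terminal = cleavage_tallies(protein, ["AAGR", "LLDEK", "LDEK"])
print(tryptic_fraction(pre), tryptic_fraction(c_terminal))
```

In a clean trypsin digest both fractions would be close to 1; a broadly populated "Pre" tally, as produced by the ragged toy peptide "LDEK", pulls the first fraction down.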

Interestingly for a sample obtained from prostate secretions, no proteins originating from genes on the Y chromosome were detected. This fact points out a general feature of proteomics: there does not seem to be any "common sense" association between tissue-specific protein concentrations and chromosomes.

Data set of the week: (2010/07/25)
Proteomic analysis of the secretome of human umbilical vein endothelial cells using a combination of free-flow electrophoresis and nanoflow LC-MS/MS.

This data set contains a single LC/MS/MS data set, using a combination of free-flow electrophoresis and nanoflow HPLC separations. The original raw data was made available as a Scaffold file from a web site maintained by the authors (www.vascular-proteomics.com). It was published by Tunica DG, Yin X, Sidibe A, Stegemann C, Nissum M, Zeng L, Brunet M, and Mayr M in Proteomics. 2009, 9:4991-6 (PubMed).

This study attempts to discover a difficult thing: the secretome of human umbilical vein endothelial cells in the face of the background proteins in a complex growth medium. The results provide a good basis for the examination of this important cell type, with a very good set of identifications that provides a broad survey of the proteins that can be readily obtained from these cells.

Data set of the week: (2010/07/18)
Proteomics Analysis of the Causative Agent of Typhoid Fever.

This data set contains 313 LC/MS/MS runs using Thermo LTQ mass spectrometers. The original raw files were obtained from the Resource Center for Biodefense Proteomics Research, which has been superseded by the Pathogen Portal (raw data). It was published by Ansong C, Yoon H, Norbeck AD, Gustin JK, McDermott JE, Mottaz HM, Rue J, Adkins JN, Heffron F, and Smith RD in J Proteome Res. 2008, 7:546-57 (PubMed).

This very thorough data set is the primary large collection of information that has allowed for the creation of the rather comprehensive annotated spectrum libraries that are now available for S. enterica related species, including S. typhi and S. typhimurium. The Pacific Northwest National Laboratory group was an early proponent of making publicly-funded proteomics raw data widely available and their efforts legitimized the idea for many other groups.

Data set of the week: (2010/07/11)
Discovery of Anthrax Biomarkers Using Label-Free Quantitative Phosphoproteomics via Mass Spectrometry.

This data set contains 66 individual phosphopeptide enriched LC/MS/MS runs made using a Thermo Orbitrap hybrid mass spectrometer. The original raw files were transferred from TRANCHE. The data was credited to Nathan P. Manes, Li Dong, Weidong Zhou, Xiuxia Du, Nikitha Reghu, Arjan C. Kool, Dahan Choi, Charles L. Bailey, Emanuel F. Petricoin III, Lance A. Liotta, and Serguei G. Popov. It was made available prior to publication, although some part of the data was presented at the 2010 ASMS conference.

The analyzed results are simply the best, most consistent set of phosphopeptide results that we have ever seen. The combination of sample preparation, HPLC and mass spectrometry used by the authors has generated what can only be considered a milestone in the application of phospho-proteomics technique to real tissue samples.

Data set of the week: (2010/07/04)
Quantitative proteomics combined with BAC TransgeneOmics reveals in vivo protein interactions.

This data set contains 61 individual experiments using both SILAC and label-free quantitation. The experimental protocols used either trypsin or endo-LysC to digest the proteins, depending on the type of protocol being used. The original raw files were transferred from TRANCHE. The data was published by Hubner NC, Bird AW, Cox J, Splettstoesser B, Bandilla P, Poser I, Hyman A, Mann M in J Cell Biol. 2010 189:739-54 (PubMed).

The data was generated to demonstrate the utility of a new technique for protein quantitation developed by the authors: "quantitative BAC-green fluorescent protein interactomics" (QUBIC). The technique is meant to be applied to the quantitative study of protein-protein interactions, several of which are demonstrated here. The technical quality of the MS/MS data is excellent, with many ids for individual proteins in the top 10% of all GPMDB observations.

Data set of the week: (2010/06/27)
mTAL Phosphoproteome Data.

This data set contains metal oxide enriched LC/MS/MS observations of phosphopeptides from R. rattus medullary Thick Ascending Limb (mTAL) cells. The raw files were transferred from TRANCHE. The original analysis was reported by Ruwan Gunaratne, Guozhong Ma, Trairak Pisitkun, and Mark A. Knepper as part of the mTAL-PD project. It appears to be closely related to the Collecting Duct Phosphoproteome Database.

The phosphorylated domains obtained are interesting because there is surprisingly little publicly available data from rat cell lines or tissue samples. The phosphopeptide enrichment here was somewhat less effective than in some other studies; however, overall it is quite typical of IMAC phosphopeptide enrichment studies. This study has significantly added to the known phosphorylated domains available for R. rattus through GPMDB's pSYT interface.

Added 2010/09/08: This data has been published in "Quantitative phosphoproteomic analysis reveals cAMP/vasopressin-dependent signaling pathways in native renal thick ascending limb cells." Proc Natl Acad Sci U S A. 2010 107:15653-8 (PubMed).

Data set of the week: (2010/06/20)
Proteomic analysis of mouse brain microsomes: identification and bioinformatic characterization of endoplasmic reticulum proteins in the mammalian central nervous system.

This data set contains 1 2DLC MS/MS and 3 1DLC MS/MS runs obtained from mouse brain microsomal preparations. The original data was transferred from TRANCHE. The original data analysis was reported by Stevens SM Jr, Duncan RS, Koulen P, Prokai L. in J Proteome Res. 2008 7:1046-54. (PubMed).

This data set is interesting in a number of ways. It shows the difference in the depth of analysis available using multi-dimensional chromatographic analysis versus simple, single separation HPLC. The three repetitions of the 1D LCMS approach give a good indication of the statistical variability to be expected from the under-sampling inherent in this type of measurement. A Gene Ontology analysis of the data (e.g., GPM33080005862) shows the complexity of real microsomal samples, compared to simply believing that they contain only membrane and membrane-associated proteins. A similar study can be compared, showing some significant differences in microsome proteome composition, which are most likely due to variations in the sample preparation methods.

Data set of the week: (2010/06/13)
The minor salivary gland proteome in Sjögren's syndrome.

This data set contains 2 LC-MS-MS runs obtained from human salivary gland tissue. The original data was transferred from PRIDE entries 7962-3. The data was reported by Hjelmervik TO, Jonsson R, Bolstad AI. in Oral Dis. 2009 15:342-53. (PubMed).

The two sets of identifications are meant to show the differences in the protein complement of salivary glands caused by the autoimmune disease, Sjögren's syndrome. Technically, the data is a good example of the use of a high resolution MS/MS device (ESI-QTOF, Ultima Global) applied to tissue samples. The high accuracy fragment ion masses significantly improve the quality of the identifications.

Data set of the week: (2010/06/06)
Identification of Ricin and Concanavalin A-binding Trypanosoma brucei Glycoproteins.

This data set contains 1 data set obtained from T. brucei. The original data was transferred from PRIDE 9223. A portion of the data was reported by Izquierdo L, Schulz BL, Rodrigues JA, Güther ML, Procter JB, Barton GJ, Aebi M, Ferguson MA in EMBO J. 2009 28:2650-61 (PubMed).

The data was obtained by using the lectins concanavalin A and ricin to pull down glycoproteins from T. brucei (blood stream form). Glycosidases were then used to remove the N-linked glycosylation, leaving a deamidated asparagine residue behind. Any deamidated N residue that was associated with the N-{P}-[ST] glycosylation motif should be considered a potential N-linked glycosylation site. You can see just these peptides by clicking here.
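The motif test described above can be sketched in a few lines of Python. This is an illustration only, not the search engine's own implementation: the function name `find_sequons` and the example peptide are hypothetical, and the pattern simply encodes N-{P}-[ST] (asparagine, then any residue except proline, then serine or threonine).

```python
import re

# N-{P}-[ST]: N, then any residue except P, then S or T.
# The lookahead consumes only the N, so overlapping motifs are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence):
    """Return 1-based positions of N residues that sit in an N-{P}-[ST] motif."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]

# Hypothetical peptide, for illustration only.
peptide = "AANGSLNPTKNNTS"
print(find_sequons(peptide))
```

A deamidated N reported at one of the returned positions would be a candidate N-linked glycosylation site; a deamidated N elsewhere would not satisfy the motif.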

Data set of the week: (2010/05/30)
Use of fluorescence-activated vesicle sorting for isolation of naked2-associated, basolaterally-targeted exocytic vesicles for proteomic analysis.

This data set contains 6 experiments obtained from C. familiaris and it is probably the best single data set we have in GPMDB from the domestic dog proteome. This work was transferred from TRANCHE and it was published by Cao Z, Li C, Higginbotham JN, Franklin JL, Tabb DL, Graves-Deal R, Hill S, Cheek K, Jerome WG, Lapierre LA, Goldenring JR, Ham AJ, Coffey RJ. in Mol. Cell. Proteomics 2008, 7:1651-67 (PubMed).

The individual experiments show how well fairly straightforward proteomics techniques can perform on vesicular membrane proteins. They also demonstrate the type of comprehensive results that can be obtained using a proteome sequence that is almost completely the result of genome annotation.

Data set of the week: (2010/05/23)
A Global Protein Kinase and Phosphatase Interaction Network in Yeast.

This data set contains 450 pull-down experiments obtained from S. cerevisiae. This work was transferred from TRANCHE and it was published by Ashton Breitkreutz, Hyungwon Choi, Jeffrey R. Sharom, Lorrie Boucher, Victor Neduva, Brett Larsen, Zhen-Yuan Lin, Bobby-Joe Breitkreutz, Chris Stark, Guomin Liu, Jessica Ahn, Danielle Dewar-Darch, Teresa Reguly, Xiaojing Tang, Ricardo Almeida, Zhaohui Steve Qin, Tony Pawson, Anne-Claude Gingras, Alexey I. Nesvizhskii, and Mike Tyers in Science 2010 328:1043-6.

Each of the individual results is annotated with the identity of the bait used in the pull-down experiment. L-A and L-BC virus proteins are present in some of the pull-downs. The group did a remarkable job of detecting phosphopeptides for a study that did not do any specific enrichment for these peptides.

Data set of the week: (2010/05/16)
Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast.

This data set contains 505 LC/MS/MS runs obtained from S. cerevisiae diploid and haploid populations. This work was transferred from TRANCHE and it was published in de Godoy LM, Olsen JV, Cox J, Nielsen ML, Hubner NC, Fröhlich F, Walther TC, Mann M. Nature. 2008 455:1251-4. (PubMed).

The results give a good indication of the relative abundance and observability of yeast proteins in both haploid and diploid cells using either trypsin or endopeptidase LysC to generate peptides and SILAC labels to provide relative quantitation. The data also shows very good examples of the major proteins observable from the double-stranded RNA viruses L-A and L-BC that are almost ubiquitously present in yeast cell cultures. In some cases, these proteins are very strongly observed (e.g. protein #3 in GPM77711001229) and the SILAC labelling can be used to estimate the relative amounts of virus present in the two cell types. To locate the virus and virus-related proteins in any of the individual runs, type "virus" into the Find box at the top of any model page (click here for an example).
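As a rough illustration of how SILAC pairs can give a relative amount for a protein (such as a viral coat protein) between two cell types, the sketch below takes the median of the per-peptide log2(heavy/light) intensity ratios. This is not the quantitation method of the original study; the function name `protein_log2_ratio` and the intensity values are made up for the example.

```python
import math
from statistics import median

def protein_log2_ratio(peptide_pairs):
    """Median log2(heavy/light) over peptide pairs with both intensities above zero.

    peptide_pairs: iterable of (light_intensity, heavy_intensity) tuples.
    """
    logs = [math.log2(heavy / light)
            for light, heavy in peptide_pairs
            if light > 0 and heavy > 0]
    if not logs:
        raise ValueError("no usable peptide pairs")
    return median(logs)

# Hypothetical (light, heavy) intensities for three peptides of one protein.
pairs = [(1.0e6, 2.0e6), (5.0e5, 1.0e6), (2.0e6, 8.0e6)]
log2_ratio = protein_log2_ratio(pairs)
print(f"log2(heavy/light) = {log2_ratio:.2f}, fold change = {2 ** log2_ratio:.2f}")
```

Using the median rather than the mean keeps a single outlying peptide ratio from dominating the protein-level estimate.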

Data set of the week: (2010/05/09)
Phosphoproteome analysis of Drosophila melanogaster embryo.

This data set contains 24 LC/MS/MS runs obtained from D. melanogaster embryos. This work was transferred from TRANCHE and it was published in Zhai B, Villén J, Beausoleil SA, Mintseris J, Gygi SP, J Proteome Res. 2008 7:1675-82 (PubMed).

The assignments in this data set give a good overview of phosphorylation in D. melanogaster and they are good examples of phosphopeptides identified using an Orbitrap-LTQ hybrid instrument with CID. The mapped phosphorylation sites from this data set were a major contribution to the pSYT annotation now available for the fruit fly. The predominance of yolk proteins and other larvae-specific proteins in the identified peptides gives a good view of the phosphorylation patterns on proteins that may be under-represented or absent from studies that use mature flies or cells from tissue culture.

Data set of the week: (2010/05/02)
Activated Macrophage Proteomics

This data set contains 9 merged results obtained from human macrophages under various conditions. This work was transferred from a TRANCHE project of the same name, created and maintained by Maureen M. Goodenow, Dept. of Pathology, Immunology and Laboratory Medicine University of Florida.

The experiments reported by Dr. Goodenow are proteomics survey studies of macrophages, in which the proteomes of treated cells are separated by SDS-PAGE and the resulting gel is sliced into 15 pieces. The proteins are then digested, the peptides extracted and run using LC/MS/MS. Each one of the entries in GPMDB corresponds to the merged results of the 15 bands. They are good examples of what can be done using gel-slicing experiments to obtain proteomics information about a cell type. It is also an admirable example of valuable data being made available to the general community by an individual investigator.

Data set of the week: (2010/04/25)
Large-scale quantitative LC-MS/MS analysis of detergent-resistant membrane proteins from rat renal collecting duct.

This data set contains 78 LC/MS/MS runs obtained from membrane-enriched fractions of tissue samples from rat renal ducts. It was originally published by Yu MJ, Pisitkun T, Wang G, Aranda JF, Gonzales PA, Tchapyjnikov D, Shen RF, Alonso MA, Knepper MA. in Am J Physiol Cell Physiol. 2008 295:C661-78 (PubMed). The data was transferred to GPMDB from TRANCHE.

This study demonstrates that it is possible to generate very good results from membrane proteins isolated from tissue, even those that do not readily dissolve in detergent solutions, such as lipid raft proteins. GO analysis of the resulting protein identifications shows very significant enrichments in proteins known to be either integral membrane, membrane associated or part of the extracellular matrix.

Data set of the week: (2010/04/18)
Targeted tandem affinity purification of PSD-95 recovers core postsynaptic complexes and schizophrenia susceptibility proteins.

This data set contains 70 LC/MS/MS runs obtained using TAP-tag protein isolation, SDS-PAGE separation followed by tandem mass spectrometry. It was originally published by Fernández E, Collins MO, Uren RT, Kopanitsa MV, Komiyama NH, Croning MD, Zografos L, Armstrong JD, Choudhary JS, Grant SG. Mol Syst Biol. 2009;5:269 (PubMed). The data corresponds to the PeptideAtlas accession PAe001454 and was transferred to GPMDB.

The results are a good demonstration of the depth and detail of a particular molecular system that can be obtained by coupling TAP-tagging with protein and subsequent peptide separations. The use of multiple gel slices allows a depth of proteome coverage that would be difficult to obtain using other techniques.

Data set of the week: (2010/04/11)
Proteomics of mouse liver microsomes

This data set contains 9 LC/MS/MS runs obtained using SDS-PAGE separation followed by tandem mass spectrometry. It was originally published by Zgoda VG, Moshkovskii SA, Ponomarenko EA, Andreewski TV, Kopylov AT, Tikhonova OV, Melnik SA, Lisitsa AV, and Archakov AI in Proteomics, 2009, 9:4102-5 (PubMed). The data corresponds to the PRIDE accessions 8848-8856 and was transferred to GPMDB.

This data set is an example of the isolation of a specific experimental fraction (mouse liver microsome from the endoplasmic reticulum) that provides a good representation of proteins not commonly observed, in this case the cytochrome P450 family of metabolic oxidases. The quality of the isolation can be easily seen when viewed as either KEGG pathways or GO cellular components.

Data set of the week: (2010/04/04)
Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations

This data set contains 369 LC/MS/MS runs obtained using a Thermo Finnigan LTQ instrument. It was originally published by Merrihew GE, Davis C, Ewing B, Williams G, Käll L, Frewen BE, Noble WS, Green P, Thomas JH, MacCoss MJ. in Genome Res. 2008, 18:1660-9 (PubMed). The data was obtained directly from the authors' web site and it is not currently held in any of the other data sites.

The original analysis of this data set in the publication used the C. elegans WS150 proteome sequence and it was found to indicate the presence of additional coding sequences. The analysis in GPMDB was performed using the WS200 proteome (ENSEMBL v. 55), which has taken into account the original work. It serves as a good example of the proteins that can be seen using conventional proteomics techniques in C. elegans.

Data set of the week: (2010/03/28)
Global proteomic profiling of Shigella dysenteriae Sd1617

This data corresponds to Peptidome Study PSE140, comprised of samples PSM1302, PSM1303 and PSM1304. The data was obtained by Rembert Pieper, Srilatha Kuntumalla, and Shih-Ting Huang at the J. Craig Venter Institute and it was transferred from Peptidome.

Each of the samples is composed of 3 replicate multidimensional chromatography runs of soluble proteins obtained from S. dysenteriae. The tandem mass spectra are of good quality, obtained using a Thermo LTQ instrument. The results give a good indication of the depth and reproducibility that can be expected in this type of straightforward analysis of soluble proteins from an enterobacterial culture.

Data set of the week: (2010/03/21)
Global Impact of Oncogenic Src on a Phosphotyrosine Proteome

The data is composed of 31 separate runs. The data was obtained from a study published in J. Proteome Res., 2008, 7 (8), pp 3447–3460, by Weifeng Luo, Robbert J. Slebos, Salisha Hill, Ming Li, Jan Brábek, Ramars Amanchy, Raghothama Chaerkady, Akhilesh Pandey, Amy-Joan L. Ham and Steven K. Hanks (DOI: 10.1021/pr800187n). This information was transferred from TRANCHE.

The data investigates the impact of Src transformation of mouse cells by determining the tyrosine phosphorylation differences between control and transformed cells. The data also demonstrates the utility of using multiple peptidases to increase the coverage of peptides, compared to trypsin alone. The data is very high quality LTQ data and it is an excellent reference work for what is to be expected when looking for mouse tyrosine phosphorylation.

Data set of the week: (2010/03/14)
Quantitative phosphoproteomic analysis reveals vasopressin V2-receptor-dependent signaling pathways in renal collecting duct cells.

The data is composed of 2 separate sets, corresponding to the Peptidome accession numbers PSM1275 and PSM1276. The data was obtained from a study published in Proc Natl Acad Sci U S A. 2010 Feb 23;107(8):3882-7, by Rinschen MM, Yu MJ, Wang G, Boja ES, Hoffert JD, Pisitkun T, and Knepper MA (PubMed). This information was transferred from TRANCHE. The data is of high quality, containing good identifications of serine and threonine phosphorylation sites in M. musculus proteins and it is an excellent example of the use of SILAC to monitor the relative quantitation of protein phosphorylation.

Data set of the week: (2010/03/07)
Phosphorylation dynamics during early differentiation of human embryonic stem cells.

The data is composed of 12 individual LC/MS/MS runs obtained from a study published in Cell Stem Cell, Volume 5, Issue 2, 214-226, 7 August 2009 by Van Hoof D, Muñoz J, Braam SR, Pinkse MW, Linding R, Heck AJ, Mummery CL, and Krijgsveld J. (PubMed). This information was transferred from TRANCHE. Each of these data sets is large and contains significant numbers of phosphorylated peptides.

The experiments performed were to investigate how "pluripotent stem cells self-renew indefinitely and possess characteristic protein-protein networks that remodel during differentiation. How this occurs is poorly understood. Using quantitative mass spectrometry, the (phospho)proteome of human embryonic stem cells (hESCs) was analyzed during differentiation induced by bone morphogenetic protein (BMP) and removal of hESC growth factors."

Data set of the week: (2010/02/28)
A Lectin HPLC Method to Enrich Selectively-glycosylated Peptides from Complex Biological Samples.

The data is composed of 83 individual LC/MS/MS runs obtained from a study published in J Vis Exp. 2009 Oct 1;(32). pii: 1398 by Johansen E, Schilling B, Lerch M, Niles RK, Liu H, Li B, Allen S, Hall SC, Witkowska HE, Regnier FE, Gibson BW, Fisher SJ, and Drake PM (PubMed). This information was transferred from TRANCHE.

Briefly, plasma was depleted of the fourteen most abundant proteins using a multiple affinity removal system. Depleted plasma was trypsin-digested and separated into flow-through and bound fractions by SNA or AAL HPLC. The fractions were treated with PNGaseF to remove N-linked glycans, and analyzed by LC-MS/MS on a QStar Elite. There is an accompanying video explaining the methods used.

Data set of the week: (2010/02/21)
Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors.

The data is composed of 729 individual LC/MS/MS runs obtained from a study published in Nature Biotechnology by Bantscheff M, Eberhard D, Abraham Y, Bastuck S, Boesche M, Hobson S, Mathieson T, Perrin J, Raida M, Rau C, Reader V, Sweetman G, Bauer A, Bouwmeester T, Hopf C, Kruse U, Neubauer G, Ramsden N, Rick J, Kuster B, and Drewes G. (DOI: 10.1038/nbt1328). This information was transferred from PRIDE (PRIDE accession numbers 2445-3178).

Labelling with iTRAQ is used for quantitative profiling of the effects of introducing the drugs imatinib (Gleevec), dasatinib (Sprycel) and bosutinib into K562 cells. The profiling confirms known targets, including ABL and SRC family kinases.

Data set of the week: (2010/02/14)
Cell-Specific Information Processing in Segregating Populations of Eph Receptor Ephrin-Expressing Cells.

This dataset was transferred to GPMDB from PRIDE. The data is composed of 2 large LC/MS/MS runs from a study published in Science by Jørgensen C, Sherman A, Chen GI, Pasculescu A, Poliakov A, Hsiung M, Larsen B, Wilkinson DG, Linding R, and Pawson T (DOI: 10.1126/science.1176615).

The data is from a set of quantitative mass spectrometric analyses of mixed populations of EphB2- and ephrin-B1–expressing cells that were labeled with different isotopes, which revealed cell-specific tyrosine phosphorylation events. The data is of very high quality and it has a very rich set of tyrosine phosphorylated peptides.

Data set of the week: (2010/02/07)
The value of using multiple proteases for large-scale mass spectrometry-based proteomics.

This dataset was transferred to GPMDB from TRANCHE. The data is composed of 15 LC/MS/MS runs from a study published in J. Proteome Research by Danielle L. Swaney, Craig D. Wenger and Joshua J. Coon (DOI: 10.1021/pr900863u).

The data is from experiments in which an S. cerevisiae whole cell lysate was digested with one of five enzymes (trypsin, LysC, ArgC, AspN, and GluC), in triplicate. The results clearly show that any of these proteases can be used very effectively with standard proteomics equipment, giving very similar protein identifications.
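To make the difference between the five enzymes concrete, the sketch below performs a simplified in silico digestion of a short sequence. The cleavage rules are textbook approximations chosen for illustration (real enzyme specificity has exceptions and missed cleavages that are ignored here), and the `digest` function, the `CLEAVAGE_RULES` table and the example sequence are hypothetical, not taken from the study.

```python
import re

# Simplified cleavage rules (assumptions for illustration only).
# Each pattern matches the zero-width position at which the chain is cut.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",  # C-terminal to K or R, but not before P
    "LysC": r"(?<=K)",             # C-terminal to K
    "ArgC": r"(?<=R)",             # C-terminal to R
    "AspN": r"(?=D)",              # N-terminal to D
    "GluC": r"(?<=E)",             # C-terminal to E
}

def digest(sequence, enzyme):
    """Return the peptides from a complete digestion with no missed cleavages."""
    cuts = [m.start() for m in re.finditer(CLEAVAGE_RULES[enzyme], sequence)]
    # Ignore cut positions at the very start or end of the sequence.
    bounds = [0] + [c for c in cuts if 0 < c < len(sequence)] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical sequence, for illustration only.
protein = "MKRPDEAKLR"
for enzyme in CLEAVAGE_RULES:
    print(enzyme, digest(protein, enzyme))
```

Each enzyme yields a different set of peptides from the same sequence, which is why different proteases can reach complementary regions of a protein while still identifying largely the same proteins.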

Data set of the week: (2010/01/31)
Identifying blood biomarkers and physiological processes that distinguish humans with superior performance under psychological stress.

This dataset was transferred to GPMDB from PRIDE (PRIDE accessions 10075-10092). The data (GPM77710000113-GPM77710000130) is from a study published in PLoS One by Cooksey AM, Momen N, Stocker R, and Burgess SC (PLoS One. 2009 Dec 18;4(12):e8371 PubMed).

The results show the plasma proteins that change in response to the Modular Egress Training psychological stress test, given to a group of naval aviation students. The data was obtained using an LCQ DECA XP Plus and analyzed using X! Hunter (annotated spectrum library searches).

Data set of the week: (2010/01/24)
High quality catalog of proteotypic peptides from human heart

This dataset was transferred to GPMDB from the authors' web site, corresponding to the manuscript of the same name, Kline, KG, et al., J Proteome Res. 2008 Nov;7(11):5055-61. PubMed. This data is not currently available in other repositories.

The data consists of 96 LCMS runs analyzed with a ThermoFinnigan LTQ mass spectrometer. It is a good example of the type of data that can be obtained from cardiac muscle using multidimensional chromatography directly on tissue lysate.

Data set of the week: (2010/01/17)
A Mitochondrial Protein Compendium Elucidates Complex I Disease Biology

This dataset was transferred to GPMDB from TRANCHE, corresponding to the manuscript of the same name, Pagliarini, DJ, et al., Cell 134:112-123 doi:10.1016/j.cell.2008.06.016.

The data consists of 26 individual data sets, composed of replicates of mitochondrial proteins obtained from a variety of mouse tissues (cerebellum, cerebrum, brainstem, spinal cord, kidney, liver, heart, skeletal muscle, testis and placenta). It is a good example of high quality proteomics data, obtained using a Thermo-Finnigan Orbitrap hybrid mass spectrometer.

Data set of the week: (2010/01/10)
Comparative analysis of the human and mouse placental transcriptome and proteome

This dataset was transferred to GPMDB from Peptidome, from the Peptidome entries PSM1063 (mouse) and PSM1064 (human). The cells in the tissue were separated from extracellular proteins and various subcellular fractions were analyzed separately. The data was originally published in Cox B, et al., Mol Syst Biol 2009;5:279. PMID: 19536202.

Note: the Peptidome entry misidentifies the mass spectrometry platform as being a "TRAP-FTMS" while it is actually a Thermo-Finnigan LTQ (with no additional hybrid component).

Data set of the week: (2010/01/03)
Large-scale phosphorylation analysis of mouse liver

This dataset was transferred to GPMDB from TRANCHE and it is not currently held in any other repository (see data). It is credited to Villén J, Beausoleil SA, Gerber SA, and Gygi SP, and it is described in Proc Natl Acad Sci U S A. 2007 Jan 30;104(5):1488-93.

This data set is a good example of the quality of phosphorylation data that can be obtained using SCX separation of a tissue extract, followed by IMAC phosphopeptide enrichment of each fraction, when using an LTQ-Orbitrap mass spectrometer. The data view that is obtained from the link above shows all of the detected phosphopeptides, with a peptide false positive rate of ~ 0.14%, i.e., about 10 times more stringent than the analysis in the original paper.

Data set of the week: (2009/12/28)
Community proteogenomics reveals insights into the physiology of phyllosphere bacteria

This dataset was transferred to GPMDB from PRIDE (see data). It is credited to Delmotte N, et al. and it is described in Proc Natl Acad Sci U S A. 2009 Sep 22;106(38):16428-33.

Data-set-of-the-week is a new feature for GPMDB, started with the intent of highlighting high quality data sets that have been made available via GPMDB and other proteomics repositories. Data sets will be selected by a panel, but any suggestions (email to dsotw@thegpm.org) of suitable data will be considered.