Interesting data: PXD003700 (2016/12/8)
The raw data set was made available through the PRIDE ProteomeXchange site, PXD003700. It was published as part of the study entitled "Proteome-wide analysis of arginine monomethylation reveals widespread occurrence in human cells" by Larsen SC, Sylvestersen KB, Mund A, Lyon D, Mullari M, Madsen MV, Daniel JA, Jensen LJ and Nielsen ML, published in Sci Signal. 2016 Aug 30;9(443):rs9 (PubMed). These experiments were represented in GPMDB by 10 result files.
These experiments were undertaken to determine the extent of arginine monomethylation in a normally functioning cellular proteome. HEK293-T cells were chosen as a stand-in for normal cells. The methods used do a good job of enriching for monomethyl-arginine and the modified residue was easily detectable in the resulting MS/MS data. Dimethyl-arginine was also easily detectable, although in lesser amounts. The samples were also enriched for the rare PTM hypusine (which only occurs on one residue of EIF5A2:p). The data makes an excellent case study for testing algorithms attempting to find single amino acid variants (SAVs), as methylation mimics many common SAVs, which can lead to the over-prediction of SAVs with naïve algorithms.
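To illustrate why methylation can be mistaken for an SAV, the sketch below lists the single-residue substitutions whose mass difference equals that of one added methyl group (+14.01565 Da). It is only a minimal illustration: the function name `methyl_mimics` and the 0.001 Da tolerance are assumptions made for this example, and the sketch is not a description of how GPMDB or any particular search engine handles the problem.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Mass added by a single methylation (net addition of CH2), in Da.
METHYL_SHIFT = 14.01565


def methyl_mimics(tolerance=0.001):
    """Return (from, to) substitutions whose mass change matches one methyl group.

    The tolerance (in Da) is an arbitrary value chosen for this illustration.
    """
    pairs = []
    for a, mass_a in RESIDUE_MASS.items():
        for b, mass_b in RESIDUE_MASS.items():
            if a != b and abs((mass_b - mass_a) - METHYL_SHIFT) <= tolerance:
                pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    for a, b in methyl_mimics():
        print(f"{a} -> {b} has the same mass shift as methylation")
```

Under these assumptions the function reports Asp to Glu, Gly to Ala, Asn to Gln, Ser to Thr and Val to Leu/Ile. An algorithm that considers only the precursor and fragment mass shift cannot separate these substitutions from a methylated residue at or near the same position.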
New Hardware Added to GPMDB (2016/12/7)
GPMDB is the largest source of detailed information about the evidence supporting the observation of proteins, peptides, PTMs and SAVs using modern tandem mass spectrometry-based proteomics. This means that it has to keep up with the large amount of raw data being made available through resources like ProteomeXchange, PeptideAtlas, jPOST, Chorus and others. While actually doing the data analysis to convert the raw data into results can be a bit of a chore, it is something that can be done by simply adding more computers to solve the problems. Recording that information in the GPMDB database is a very linear process that cannot be easily parallelized, making this database loading step a potential bottleneck.
The system GPMDB had been using for the last four years had been optimized repeatedly, but the maximum data recording rate that could be achieved was about 0.4 million peptide identifications per hour. At this rate, the results generated by analyzing public data were frequently requiring 24 hour-a-day operation and still there were days when all of the results could not be added: they had to wait for a pause in raw data availability to complete. This situation has only been getting worse as the size, complexity and tempo of proteomics data set release increase.
To resolve this problem, it was necessary to create a new hardware solution to increase the speed of loading results into GPMDB. Last week the new hardware was assembled, installed and tested. This new equipment has a proven result loading rate of 5 million peptide identifications per hour, which gives GPMDB a maximum loading capacity of about 40 billion peptide identifications a year. This capacity should be sufficient for at least the next three years of efficient operation.
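The yearly capacity figure follows from simple arithmetic on the hourly loading rates quoted in this announcement. The short sketch below reproduces it, assuming uninterrupted 24 hour-a-day operation for a 365-day year; the function and variable names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # assumes continuous operation, no downtime


def yearly_capacity(ids_per_hour):
    """Peptide identifications that can be loaded in one year at a given hourly rate."""
    return ids_per_hour * HOURS_PER_YEAR


OLD_RATE = 400_000     # about 0.4 million identifications per hour
NEW_RATE = 5_000_000   # 5 million identifications per hour

old_capacity = yearly_capacity(OLD_RATE)   # 3,504,000,000 per year
new_capacity = yearly_capacity(NEW_RATE)   # 43,800,000,000 per year
speedup = NEW_RATE / OLD_RATE              # 12.5

if __name__ == "__main__":
    print(f"old system: {old_capacity:,} identifications per year")
    print(f"new system: {new_capacity:,} identifications per year")
    print(f"speed-up:   {speedup:.1f}x")
```

The theoretical ceiling of roughly 44 billion identifications a year is consistent with the "about 40 billion" quoted above once some allowance is made for maintenance and idle time.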
Recent updates to GPM & GPMDB (2016/01/17)
GPM and GPMDB are being continuously updated to increase the amount of information that can be extracted by users, to keep up with changes to external data sources and to better utilize external data interfaces. In the last few weeks, we have made the following changes:
GPMDB data sources (2016/01/13)
GPMDB derives its information from the re-analysis of raw data from many sources: we try to find all of the publicly available proteomics data, download it and see what it tells us. If you would like to see where we normally check for new data, as well as take a look at the publications from which GPMDB has already drawn data, please take a look at our new GPMDB Data Sources page on our project Wiki. The page will be refreshed every week, adding in any new data sources we have found and any new publications that have been included in the collection. If you want to refer to the data in GPMDB as a whole, please use the URL for this page as a reference.
Release of v. 21 of both the Human & Mouse Proteome Guides (2016/01/13)
We would like to announce the release of the 21st version of the Human and Mouse Proteome Guides. These spreadsheets summarize the information about the identification of all genes and their associated splice variants currently available in GPMDB for these two species.
Release of v. 20 of both the Human & Mouse Proteome Guides (2015/09/30)
We would like to announce the release of the 20th version of the Human and Mouse Proteome Guides. These spreadsheets summarize the information about the identification of all genes and their associated splice variants currently available in GPMDB for these two species.
FTP site service interruption (2015/09/17)
The power system in the building that houses the GPM FTP site (ftp.thegpm.org) is undergoing some needed equipment upgrades which are going to be extended to Sept. 18, 2015. The FTP site will be unavailable from 2:00 – 13:00 UTC during this maintenance process. Hopefully the maintenance work will be complete and no further interruptions will be necessary.
FTP site service interruption (2015/09/11)
The power system in the building that houses the GPM FTP site (ftp.thegpm.org) is undergoing some needed equipment upgrades during the period Sept. 14 to Sept. 17, 2015. The FTP site will be unavailable for significant periods of time during this maintenance process. There will be an announcement on this page once the maintenance is complete.
Proteomics basics 101 #6, by Ron Beavis (2015/05/20)
Monte Carlo methods for evaluating LC/MS/MS data sets: II. Peptides & derivatives.
In the previous blog, the use of Monte Carlo simulations to understand the relationship between the number of spectra taken and the number of unique proteins identified was introduced. In this entry, the same type of analysis will be applied to the peptides, once again trying to give a clearer idea of how the number of spectra taken is related to the number of unique peptides identified.
Monte Carlo simulations for this type of analysis are performed in the same way as they were for proteins. more ...
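The full procedure is given in the linked post and is not reproduced on this page. As a rough illustration of the general idea only, the sketch below draws simulated spectra at random from a set of peptides with unequal detection probabilities and counts how many distinct peptides have been seen after a given number of spectra. The 1/rank weighting, the pool of 2000 peptides and the function name `unique_ids_vs_spectra` are all assumptions made for this example, not the author's method.

```python
import random


def unique_ids_vs_spectra(weights, spectrum_counts, seed=0):
    """Simulate drawing spectra and count unique identifications.

    weights         -- relative chance that a spectrum comes from each peptide
    spectrum_counts -- numbers of spectra at which to report the unique count
    seed            -- fixed seed so the simulation is repeatable
    """
    rng = random.Random(seed)
    n_max = max(spectrum_counts)
    # Each simulated spectrum is assigned to one peptide, chosen by weight.
    draws = rng.choices(range(len(weights)), weights=weights, k=n_max)

    wanted = set(spectrum_counts)
    seen = set()
    results = {}
    for n, peptide in enumerate(draws, start=1):
        seen.add(peptide)
        if n in wanted:
            results[n] = len(seen)
    return results


if __name__ == "__main__":
    # Hypothetical peptide pool: the weight falls off as 1/rank.
    weights = [1.0 / rank for rank in range(1, 2001)]
    curve = unique_ids_vs_spectra(weights, [100, 1000, 10000])
    for n_spectra, n_unique in sorted(curve.items()):
        print(f"{n_spectra:>6} spectra -> {n_unique} unique peptides")
```

Repeating the simulation with different seeds gives a spread of values at each spectrum count, which is the kind of relationship between spectra taken and unique identifications that the post discusses.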
Proteomics basics 101 #5, by Ron Beavis (2015/05/05)
Monte Carlo methods for evaluating LC/MS/MS data sets: I. Protein ids.
A common problem in practical proteomics data analysis is the characterization of data from a particular experiment using some set of parameters that indicate how well the experiment was performed. For example, the number of unique proteins identified, the number of unique peptides identified and the total number of spectra assigned to peptide sequences are frequently reported. These parameters can be used to compare results within a set of similar experiments, but they are limited in their utility when comparing different experimental protocols, more ...
Guides to the Human & Mouse Proteomes v. 18 released (2015/4/20)
The Guide to the Mouse Proteome (v. 18) and the Guide to the Human Proteome (v. 18), based on ENSEMBL v. 76, which uses the GRCm38 and GRCh38 genome assemblies (respectively) as the scaffolds for the associated genes, have been released: human and mouse. The new versions include protein-coding genes associated with haplotypes, patches and DNA that has not been assigned to a reference chromosome (in the "CHROT" tab of the spreadsheets). Evidence codes for this release used the current version of the NBS algorithm.
X! Tandem PILEDRIVER released (2015/4/17)
The latest version of X! Tandem (PILEDRIVER) has been released. This version includes updates to the mzML input types allowed: the new NumPress numerical compression data types described in Teleman, 2014. A new output file type, mzIdent, has been added, although this implementation should be considered a beta test until the mzIdent standard has been finalized to include more details about the assembly of proteins from identified peptides.
Guide to the Mouse Proteome v. 17 released (2015/1/27)
The first version of the Guide to the Mouse Proteome (v. 17) based on ENSEMBL v. 76, which uses the GRCm38 genome assembly as the scaffold for the associated genes, has been released here. The new version includes for the first time protein-coding genes associated with haplotypes, patches and DNA that has not been assigned to a chromosome (in the "OTHER" tab of the spreadsheets). Evidence codes for this release used the current version of the NBS algorithm.
Guide to the Human Proteome v. 17 released (2015/1/26)
The first version of the Guide to the Human Proteome (v. 17) based on ENSEMBL v. 76, which uses the GRCh38 genome assembly as the scaffold for the associated genes, has been released here. The new version includes for the first time protein-coding genes associated with haplotypes, patches and DNA that has not been assigned to a chromosome (in the "OTHER" tab of the spreadsheets). Evidence codes for this release used the current version of the NBS algorithm.
The following histogram is a stacked plot that provides a chromosome-by-chromosome breakdown of the identification status of the genes in the human proteome.