The GPM Lists index

The Global Proteome Machine Organization

Index of GPMDB lists

Proteomics often requires the assembly of category-wide lists of things. These categories can be proteins associated with particular sequence or biological properties, post-translational modifications, or types of experiments. GPMDB can be used to generate such lists, and this page serves as an index to the lists announced for the system.

Available lists of things

Post-translational modifications:

C. elegans: phosphorylation
D. melanogaster: phosphorylation
M. musculus: acetylation, phosphorylation
Mycobacterium tuberculosis: phosphorylation
H. sapiens: acetylation, phosphorylation, ubiquitination
S. cerevisiae: acetylation, phosphorylation

Amino acid polymorphisms:

List of all amino acid polymorphisms in GPMDB

Proteins by classifiers:

GPMDB Guide to the Human Proteome.
GPMDB Guide to the Mouse Proteome.
GPMDB Guide to the Saccharomyces cerevisiae Proteome.
Human, mouse and yeast proteins by GO category.
Human and mouse proteins by chromosome.
Human proteins by BTO tissue type.
Top 1,000 human & mouse proteins.

Proteotypic peptides and annotated spectrum libraries:

GPMDB Guide to the Saccharomyces cerevisiae Proteome v. 2 (2016/8/4)

The Saccharomyces cerevisiae protein identification information in GPMDB has been summarized into a collection of spreadsheets, the GPMDB Guide to the S. cerevisiae Proteome (GYP). This guide has the information organized into separate spreadsheets for each evidence code, as well as an overall listing. All of the spreadsheets are sorted by chromosome and by the centromere-relative naming convention commonly used for yeast ORFs. The protein accession numbers and other information were obtained from ENSEMBL's EF4.72 release of the yeast proteome. The NBS v. 2 algorithm was used to determine the evidence codes for this edition. This 2nd edition of the Guide (GYP 2016.08.01) is available in the following formats:

The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/yeast_proteome_guide/

GPMDB Guide to the Human Proteome v. 22 (2016/8/3)

The human protein identification information in GPMDB has been summarized into a collection of spreadsheets, the GPMDB Guide to the Human Proteome (GHP). This guide has the information organized into separate spreadsheets for each chromosome, as well as mitochondrial DNA. The protein accession numbers, HGNC names and chromosomal coordinates were taken from ENSEMBL v. 76 (genome assembly GRCh38). The NBS v. 2 algorithm was used to determine the evidence codes for this edition. This 22nd edition of the Guide (GHP 2016.7.01) is available in the following formats:

The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/human_proteome_guide/

GPMDB Guide to the Mouse Proteome v. 22 (2016/8/4)

The mouse protein identification information in GPMDB has been summarized into a collection of spreadsheets, the GPMDB Guide to the Mouse Proteome (GMP). This guide has the information organized into separate spreadsheets for each chromosome, as well as mitochondrial DNA. The protein accession numbers, MGI names and chromosomal coordinates were taken from ENSEMBL v. 76 (genome assembly GRCm38). The new NBS v. 2 algorithm was used to determine the evidence codes for this edition. This 22nd edition of the Guide (GMP 2016.7.01) is available in the following formats:

The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/mouse_proteome_guide/

C. elegans protein phosphorylation sites (2010/08/11)

These files represent a comprehensive list of all C. elegans protein phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.

The files associated with the annotation for a merged list of all chromosomes are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:

unique proteins	25,076
total genes	19,788
phospho-proteins	997
phospho-genes	609
phosphorylation sites	3,069

Fruit fly protein phosphorylation sites (2013/05/13)

These files represent a comprehensive list of all D. melanogaster protein phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.

unique proteins	21,223
total genes	13,937
phospho-proteins	2,283
phospho-genes	1,565
phosphorylation sites	8,774

Yeast protein acetylation sites (2013/06/17)

These files represent a comprehensive list of all S. cerevisiae protein N-terminal and lysine acetylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.

The files associated with the annotation and a merged list of all chromosomes are now available by FTP for lysine & N-terminal acetylation. A description of the format of these files is available in the associated "README.txt" file in the same directory. A short summary of the number of acetylated proteins, genes and sites of each type is given in the "stats/stats.txt" file.

Yeast protein phosphorylation sites (2013/05/13)

These files represent a comprehensive list of all S. cerevisiae protein phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.

unique proteins	6,692
total genes	6,692
phospho-proteins	2,449
phospho-genes	2,449
phosphorylation sites	16,664

Mouse protein acetylation sites (2013/06/17)

These files represent a comprehensive list of all Mus musculus protein N-terminal and lysine acetylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.

Human protein acetylation sites (2013/06/17)

These files represent a comprehensive list of all Homo sapiens protein N-terminal and lysine acetylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.

Mycobacterium tuberculosis protein phosphorylation sites (2010/08/10)

This list is a compilation of observed serine/threonine phosphorylation sites for the Mycobacterium tuberculosis proteome (strain CDC1551), based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 41 phosphorylation sites on 35 protein sequences, with the following composition:

serine: 18; and
threonine: 23.

Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row, that looks like the following:

gi|15840936|	aconitate hydratase	S[716]4

The columns have the following interpretation:

The NCBI gi accession number for the protein splice variant;
The NCBI gene description associated with that accession number; and
The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
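
The "X[nnn]C" notation described above can be read programmatically. The following is a minimal Python sketch, not code from the GPM itself. It assumes that the tab-separated text version of the list has exactly three tab-separated columns per row and that several sites may share the third column; the separator between sites is not stated on this page, so the sketch simply collects every "X[nnn]C" token it finds. The names Site, parse_sites and parse_row are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class Site(NamedTuple):
    residue: str      # "X": one-letter residue type, e.g. "S" or "T"
    position: int     # "nnn": sequence position of the residue
    confidence: int   # "C": relative confidence number (higher is better)


# One residue letter, a bracketed position, then the confidence digits.
SITE_PATTERN = re.compile(r"([A-Z])\[(\d+)\](\d+)")


def parse_sites(field: str) -> List[Site]:
    """Collect every X[nnn]C token in a sites column (separator assumed unknown)."""
    return [Site(r, int(p), int(c)) for r, p, c in SITE_PATTERN.findall(field)]


def parse_row(row: str) -> Tuple[str, str, List[Site]]:
    """Split one row into accession, description and parsed sites.

    Assumes three tab-separated columns, as in the example row on this page.
    """
    accession, description, sites = row.rstrip("\n").split("\t")
    return accession, description, parse_sites(sites)


# The example row shown above, written with tab separators.
example_row = "gi|15840936|\taconitate hydratase\tS[716]4"
accession, description, sites = parse_row(example_row)
print(accession, description, sites)
```

Because the confidence number is only described as relative, the sketch keeps it as an integer and does not apply any threshold.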

We have to again thank all of the data contributors who have made these comprehensive lists possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.

Mouse protein phosphorylation sites (2013/05/13)

These files represent a comprehensive list of all mouse protein phosphorylation sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome basis, using ENSEMBL v. 71 as the source of the protein and gene sequences. All of the splice variants listed by ENSEMBL have been annotated.

The files associated with the annotation for each chromosome (and a merged list of all chromosomes) are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:

unique proteins	45,557
total genes	22,796
phospho-proteins	10,134
phospho-genes	5,277
phosphorylation sites	49,416

Human protein phosphorylation sites (2013/05/12)

As part of our contribution to the Human Proteome Project, we have compiled a comprehensive list of all human protein phosphorylation sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome basis, using ENSEMBL v. 70 as the source of the protein and gene sequences. All of the splice variants listed by ENSEMBL have been annotated.

unique proteins	87,222
total genes	23,287
phospho-proteins	22,621
phospho-genes	7,563
phosphorylation sites	142,832

Human protein ubiquitination sites (2013/09/01)

We have compiled a comprehensive list of all human protein ubiquitination sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome basis, using ENSEMBL v. 70 as the source of the protein and gene sequences. All of the splice variants listed by ENSEMBL have been annotated.

The files associated with the annotation for each chromosome (and a merged list of all chromosomes) are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of ubiquitin-modified proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:

unique proteins	87,222
total genes	23,287
ubiquitin-modified proteins	21,436
ubiquitin-modified genes	6,282
ubiquitin-modified sites	77,684

Amino acid polymorphisms in GPMDB (2013/1/2)

The GPM has been generating information about amino acid polymorphisms in model species for the last 5 years. This information has been recorded in GPMDB, which as of Jan. 1, 2013 had approximately 4.8 million observations of amino acid polymorphisms. The information about these observations has been dumped into files that are available via FTP in either tab-separated value (.txt) or SQLite (.db) format. The specific entries in these files are as follows:

SNP id	GPMDB obs. id	HGVS id
rs34037627	199905983	ENSP00000333994:p.V55D

If available, the first column corresponds to an identifier for the associated single nucleotide polymorphism. In cases where there was no associated SNP information, the "HGVS id" information was repeated in this column. The "GPMDB obs. id" is the unique id for the specific peptide sequence identification that was the evidence for each polymorphism.
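
As an illustration of the three columns described above, here is a minimal Python sketch that parses one line of the tab-separated form. It is an assumption-laden example rather than a reader for the real file: it assumes each "HGVS id" is a single-residue substitution written with one-letter codes, as in the sample entry, and it does not know whether the real file begins with a header row. The names Polymorphism and parse_line are hypothetical.

```python
import csv
import io
import re
from typing import List, NamedTuple, Optional


class Polymorphism(NamedTuple):
    snp_id: Optional[str]   # None when the file repeated the HGVS id here
    observation_id: int     # "GPMDB obs. id"
    protein: str            # accession before ":p."
    reference: str          # reference residue (one-letter code)
    position: int           # residue position in the protein
    variant: str            # observed residue (one-letter code)


# Assumed form: <accession>:p.<ref><position><variant>, e.g. ENSP00000333994:p.V55D
HGVS_PATTERN = re.compile(r"^(\S+):p\.([A-Z])(\d+)([A-Z])$")


def parse_line(fields: List[str]) -> Polymorphism:
    """Convert one three-column record into a Polymorphism."""
    snp_id, obs_id, hgvs = fields
    match = HGVS_PATTERN.match(hgvs)
    if match is None:
        raise ValueError(f"unrecognized HGVS id: {hgvs!r}")
    protein, reference, position, variant = match.groups()
    # When no SNP is known, the first column repeats the HGVS id.
    known_snp = None if snp_id == hgvs else snp_id
    return Polymorphism(known_snp, int(obs_id), protein,
                        reference, int(position), variant)


# The sample entry from this page, as a tab-separated line.
sample = "rs34037627\t199905983\tENSP00000333994:p.V55D\n"
rows = [parse_line(f) for f in csv.reader(io.StringIO(sample), delimiter="\t")]
print(rows[0])
```

Records whose HGVS id does not fit the assumed pattern raise an error instead of being skipped, so unexpected notations are noticed.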

Observed proteins categorized by Gene Ontology terms (2010/05/01)

The ENSEMBL protein accessions used in GPMDB can be readily assigned to specific Gene Ontology (GO) terms, using ENSEMBL's BioMart utility. These lists for all available GO terms have been constructed for three species:

The lists are divided up into the three main GO categories: biological process; cellular component; and molecular function. Each individual GO term has an entry like:

GO:0006189 [7/7]	'de novo' IMP biosynthetic process

The first column has a link to the list of proteins associated with the GO term accession number. The notation following the accession number "[n/m]" indicates that "n" proteins have been observed in GPMDB out of the "m" proteins in the proteome assigned to this category. The second column is the controlled vocabulary description of each GO category.
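
The "[n/m]" notation lends itself to a simple coverage calculation. The sketch below, in Python, parses the first column of an entry and reports the percentage of assigned proteins that have been observed. It assumes the first column always has the form of the example entry (a GO accession followed by the bracketed counts); the name parse_go_entry is hypothetical.

```python
import re
from typing import Tuple

# GO accession (seven digits), optional whitespace, then "[n/m]".
GO_ENTRY = re.compile(r"^(GO:\d{7})\s*\[(\d+)/(\d+)\]$")


def parse_go_entry(text: str) -> Tuple[str, int, int]:
    """Return (accession, observed n, total m) for an entry such as 'GO:0006189 [7/7]'."""
    match = GO_ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized GO entry: {text!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


accession, observed, total = parse_go_entry("GO:0006189 [7/7]")
# Percentage of the proteins assigned to this term that were observed in GPMDB.
coverage = 100.0 * observed / total if total else 0.0
print(f"{accession}: {observed} of {total} proteins observed ({coverage:.1f}%)")
```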

Observed human proteins by tissue type (2010/05/01)

The lists below were constructed from data supplied by the Normal Clinical Tissue Alliance. Proteomics data from selected studies of clinical tissue were analyzed and conservative lists of identified proteins were constructed. The lists are organized by the best available BRENDA ontology term for the tissue, with the exception of red blood cells, which are not currently in BRENDA.

The lists given below have the proteins in plasma removed (with the exception of the plasma list).

BRENDA ID	Description
BTO:0000131	blood plasma
BTO:0000132	blood platelet
BTO:0000133	blood serum
BTO:0000140	bone
BTO:0000142	brain
BTO:0000155	bronchoalveolar lavage
BTO:0000237	cerebrospinal fluid
CL:0000232	erythrocyte
BTO:0000502	gastric fundus
BTO:0001501	hair
BTO:0000723	lens
BTO:0000759	liver
BTO:0000763	lung
BTO:0001202	saliva
BTO:0001419	urine

The 1,000 most observed human & mouse proteins (updated 2010/07/07)

These spreadsheets (top_1000_human_100707.xls and top_1000_mouse_100707.xls) list protein sequences that have been observed most often by GPM users who used the "human" or "mouse" ENSEMBL proteome sequences. The columns in the spreadsheet are as follows:

Column A: ENSEMBL protein accession number for the sequences;
Column B: HUGO Gene Nomenclature Committee symbol for the associated gene;
Column C: NCBI gene number for the associated gene;
Column D: International Protein Index accession number for the sequence;
Column E: SwissProt/Uniprot accession for the sequence;
Column F: the probability that a protein will be found in a dataset (%);
Column G: the base-10 log of the minimum protein expectation value found; and
Column H: a text description of the protein.

A "dataset" corresponds to a submitted set of MS/MS spectra, which results in a GPM result file, so it is roughly equivalent to the set of data from an LC/MS/MS run. A protein can only be observed once in a dataset. The value in Column F was calculated by taking the number of times (n_i) that the protein was observed in the approximately 24,000 (N) datasets examined and doing the simple calculation:

p_i = 100(n_i/N)
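
The calculation above can be sketched in a few lines of Python. The function name observation_probability and the observation count in the usage lines are hypothetical and chosen only for illustration; N is set to the approximate figure of 24,000 datasets quoted in the text.

```python
def observation_probability(n_i: int, n_datasets: int) -> float:
    """Return p_i = 100 * (n_i / N), the percentage of datasets containing the protein."""
    if n_datasets <= 0:
        raise ValueError("the number of datasets N must be positive")
    # A protein can only be observed once per dataset, so n_i cannot exceed N.
    if not 0 <= n_i <= n_datasets:
        raise ValueError("n_i must lie between 0 and N")
    return 100.0 * n_i / n_datasets


N = 24_000                  # approximate number of datasets examined (from the text)
hypothetical_count = 6_000  # made-up n_i, not a value from the spreadsheets
print(observation_probability(hypothetical_count, N))
```

The range check reflects the statement that a protein can only be observed once in a dataset, so the result always lies between 0 and 100.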