The Global Proteome Machine The home of proteomics crowd-sourced "Big Data" |
Index of GPMDB lists
Proteomics often requires the assembly of category-wide lists of things. These
categories can be proteins associated with particular sequence or biological
properties, post-translational modifications, or types of experiments. GPMDB can be
used to generate many of these lists, and this page serves as an index to the lists announced
for the system.
Available lists of things
Post-translational modifications:
Amino acid polymorphisms:
Proteins by classifiers:
Proteotypic peptides and annotated spectrum libraries:
The human protein identification
information in GPMDB has been summarized into a collection of spreadsheets that we are
calling the GPMDB Guide to the Human Proteome (GHP). This guide has the information
organized into separate spreadsheets for each chromosome, as well as two transposons
and mitochondrial DNA. The protein accession numbers, HGNC names and chromosomal
coordinates were taken from ENSEMBL v. 65. Protein sequences corresponding to transcripts
labelled as non-stop or nonsense-mediated decay products have been removed.
This edition of the Guide (GHP 2013.04.01) is available in the following formats:
The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/human_proteome_guide/
The mouse protein identification
information in GPMDB has been summarized into a collection of spreadsheets that we are
calling the GPMDB Guide to the Mouse Proteome (GMP). This guide has the information organized into
separate spreadsheets for each chromosome, as well as NT transcripts and mitochondrial
DNA. The protein accession numbers, MGI names and chromosomal coordinates were taken
from ENSEMBL v. 65. Protein sequences corresponding to transcripts
labelled as non-stop or nonsense-mediated decay products have been removed.
This edition of the Guide (GMP 2013.04.01) is available in the
following formats:
The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/mouse_proteome_guide/
We have also compiled a list for the fruit fly proteome acetylation, based on the data
in GPMDB. This list is available in Excel
spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine
acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed
sites in a single row, that looks like the following:
The columns have the following interpretation:
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site
assignments.
These files represent a comprehensive list of all C. elegans protein
phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed
by ENSEMBL have been annotated.
The files associated with the annotation for a merged list of all chromosomes are now available
by FTP. A description of the format of these files (README.txt) is
in the same directory. A short summary of the number of phospho-proteins, genes and sites is given
here.
For unique protein sequences in the proteome, the overall totals are as follows:
These files represent a comprehensive list of all D. melanogaster protein
phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed
by ENSEMBL have been annotated.
The files associated with the annotation for a merged list of all chromosomes are now available
by FTP. A description of the format of these files (README.txt) is
in the same directory. A short summary of the number of phospho-proteins, genes and sites is given
here.
For unique protein sequences in the proteome, the overall totals are as follows:
We have also compiled a list for the yeast proteome acetylation, based on the data in
GPMDB. This list is available in Excel
spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine
acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed
sites in a single row, that looks like the following:
The columns have the following interpretation:
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site
assignments.
These files represent a comprehensive list of all S. cerevisiae protein
phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed
by ENSEMBL have been annotated.
The files associated with the annotation for a merged list of all chromosomes are now available
by FTP. A description of the format of these files (README.txt) is
in the same directory. A short summary of the number of phospho-proteins, genes and sites is given
here.
For unique protein sequences in the proteome, the overall totals are as follows:
We have also compiled a list for the mouse proteome acetylation, based on the data in
GPMDB. This list is available in Excel
spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine
acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed
sites in a single row, that looks like the following:
The columns have the following interpretation:
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site
assignments.
We have also compiled a list for the human proteome acetylation, based on the data in
GPMDB. This list is available in Excel
spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine
acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed
sites in a single row, that looks like the following:
The columns have the following interpretation:
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site
assignments.
This list is a compilation of observed serine/threonine phosphorylation sites for the
Mycobacterium tuberculosis proteome (strain CDC1551), based on the data in
GPMDB. This list is available in Excel spreadsheet, tab-separated text
and HTML
formats. It contains 41 phosphorylation sites on 35 protein sequences, with the
following composition:
Each ENSEMBL splice variant protein accession number has a listing of all observed
sites in a single row, that looks like the following:
The columns have the following interpretation:
We again thank all of the data contributors who have made these comprehensive
lists possible. When using this type of information, please use normal caution.
Click here for our recommendations for using lists
of site assignments.
These files represent a comprehensive list of all mouse protein
phosphorylation sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome
basis, using ENSEMBL v. 71 as the source of the protein and gene sequences. All of the splice variants listed
by ENSEMBL have been annotated.
The files associated with the annotation for each chromosome (and a merged list of all chromosomes) are now available
by FTP. A description of the format of these files (README.txt) is
in the same directory. A short summary of the number of phospho-proteins, genes and sites is given
here.
For unique protein sequences in the proteome, the overall totals are as follows:
As part of our contribution to the Human Proteome Project, we have compiled a comprehensive list of all human protein
phosphorylation sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome
basis, using ENSEMBL v. 70 as the source of the protein and gene sequences. All of the splice variants listed
by ENSEMBL have been annotated.
The files associated with the annotation for each chromosome (and a merged list of all chromosomes) are now available
by FTP. A description of the format of these files (README.txt) is
in the same directory. A short summary of the number of phospho-proteins, genes and sites is given
here.
For unique protein sequences in the proteome, the overall totals are as follows:
The GPM has been generating information about amino acid polymorphisms in
model species for the last 5 years. This information has been recorded in
GPMDB, which as of Jan. 1, 2013 had approximately 4.8 million observations of
amino acid polymorphisms. The information about these observations has been
exported to files in either tab-separated value (.txt)
or SQLite (.db) format, available via FTP.
The specific entries in these files are as follows:
If available, the first column corresponds to an identifier for the associated single nucleotide polymorphism. In cases
where there was no associated SNP information, the "HGVS id" information was repeated in this column. The "GPMDB obs. id"
is the unique id for the specific peptide sequence identification that was the evidence for each polymorphism.
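The fallback rule described above (the "HGVS id" is repeated in the first column when no SNP identifier exists) can be used to separate the two kinds of rows. The following is a minimal sketch only: the column names `snp_id`, `hgvs_id` and `obs_id`, the column order, and the placeholder values are assumptions made for illustration and are not taken from the actual files.

```python
import csv
import io

# Hypothetical tab-separated sample; the header names and values are
# placeholders, not real GPMDB records.
SAMPLE = (
    "snp_id\thgvs_id\tobs_id\n"
    "SNP_PLACEHOLDER_1\tHGVS_PLACEHOLDER_A\t1\n"
    "HGVS_PLACEHOLDER_B\tHGVS_PLACEHOLDER_B\t2\n"
)


def split_by_snp(handle):
    """Split rows into those with and without an associated SNP identifier.

    A row is treated as having no SNP information when its first column
    simply repeats the HGVS id, as the text above describes.
    """
    with_snp, without_snp = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["snp_id"] == row["hgvs_id"]:
            without_snp.append(row)
        else:
            with_snp.append(row)
    return with_snp, without_snp


with_snp, without_snp = split_by_snp(io.StringIO(SAMPLE))
print(len(with_snp), len(without_snp))
```

To use this on a downloaded file, pass an open file handle instead of the `io.StringIO` sample and adjust the column names to match the real header.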
The ENSEMBL protein accessions used in GPMDB can be readily assigned to specific Gene
Ontology (GO) terms, using ENSEMBL's BioMart utility. These lists for all available GO
terms have been constructed for three species:
The lists are divided up into the three main GO categories: biological process;
cellular component; and molecular function. Each individual GO term has an entry like:
The first column has a link to the list of proteins associated with the GO term
accession number. The notation following the accession number "[n/m]" indicates that
"n" proteins have been observed in GPMDB out of the "m" proteins in the proteome
assigned to this category. The second column is the controlled vocabulary
description of each GO category.
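As an illustration of the "[n/m]" notation, the sketch below extracts the two counts from an entry string. The entry text used here (`GO:0000000 [12/48]`) is a made-up placeholder, and the function name `parse_coverage` is hypothetical; only the bracketed "[n/m]" pattern comes from the description above.

```python
import re
from typing import Tuple

# Matches the "[n/m]" notation: observed proteins / proteins in the proteome.
_COVERAGE = re.compile(r"\[(\d+)/(\d+)\]")


def parse_coverage(entry: str) -> Tuple[int, int]:
    """Return (n, m) from an entry containing "[n/m]"."""
    match = _COVERAGE.search(entry)
    if match is None:
        raise ValueError("no [n/m] notation found in entry: %r" % entry)
    return int(match.group(1)), int(match.group(2))


# Placeholder entry, not a real GO term or real counts.
observed, total = parse_coverage("GO:0000000 [12/48]")
print(observed, total)
```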
The lists below were constructed from data supplied by the Normal
Clinical Tissue Alliance. Proteomics data from selected studies of clinical tissue
were analyzed and conservative lists of identified proteins were constructed. The
lists are organized by the best available BRENDA ontology term for the tissue, with the
exception of red blood cells, which are not currently in BRENDA.
The lists given below have the proteins in plasma removed (with the exception of the
plasma list).
These spreadsheets (top_1000_human_100707.xls
and top_1000_mouse_100707.xls)
list protein sequences that have been observed most often by GPM users who used the
"human" or "mouse" ENSEMBL proteome sequences. The columns in the spreadsheet are as
follows:
A "dataset" corresponds to a submitted set of MS/MS spectra, which results in a GPM
result file, so it is roughly equivalent to the set of data from an LC/MS/MS run. A
protein can only be observed once in a dataset. The value in Column F was calculated by taking the number of times (n_i) that
the protein was observed in the approximately 24,000 (N) datasets examined and doing
the simple calculation:
p_i = 100 × (n_i / N)
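The percentage calculation above can be sketched in a few lines. The function name and the example counts are assumptions chosen for illustration; only the formula itself and the approximate value of N (24,000) come from the text.

```python
def observation_percentage(n_i: int, n_datasets: int) -> float:
    """Return 100 * (n_i / N): the percentage of datasets in which a protein was observed."""
    if n_datasets <= 0:
        raise ValueError("the number of datasets N must be positive")
    if not 0 <= n_i <= n_datasets:
        # A protein can be observed at most once per dataset, so n_i <= N.
        raise ValueError("n_i must lie between 0 and N")
    return 100.0 * n_i / n_datasets


# Hypothetical example: a protein observed in 6,000 of 24,000 datasets.
print(observation_percentage(6000, 24000))  # 25.0
```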
Copyright © 2010-2011, The Global Proteome Machine Organization