The X! search engine project

X! Search Engine Development

  The COMMON MS/MS file format

The COMMON file format project aims to determine the true information content of a set of tandem mass spectrum data files and to define a simple compression scheme that takes advantage of that knowledge. The resulting software, named COMMON, is available through both the GPM SVN repository and the GPM ftp site.

The current version of this compression scheme (CMN 1.0) uses the X! series peak processing system and a simple differential compression scheme. The most recent versions of the X! series search engines (2007.07.07.2) have been updated to read CMN 1.0 files directly. The file compression ratios vary depending on the input file type; a simple example is as follows:

                 Orbitrap RAW file   mzXML file     CMN 1.0 file
File size        222.5 MBytes        530.1 MBytes   1.8 MBytes
bytes/spectrum   19,189              45,727         155
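The compression ratios implied by these figures can be worked out directly. The short sketch below only divides the file sizes reported in the table; the function and variable names are illustrative and are not part of the COMMON software.

```python
# Figures copied from the table above (sizes in MBytes, as reported).
SIZES_MB = {"RAW": 222.5, "mzXML": 530.1, "CMN 1.0": 1.8}


def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed file is than the original."""
    return original_mb / compressed_mb


def ratios_vs_cmn(sizes: dict) -> dict:
    """Compare every non-CMN entry against the CMN 1.0 file size."""
    cmn_size = sizes["CMN 1.0"]
    return {
        name: compression_ratio(size, cmn_size)
        for name, size in sizes.items()
        if name != "CMN 1.0"
    }


if __name__ == "__main__":
    for name, ratio in ratios_vs_cmn(SIZES_MB).items():
        print(f"{name} -> CMN 1.0: about {ratio:.1f} times smaller")
```

For this one example, the CMN 1.0 file is about 123.6 times smaller than the RAW file and about 294.5 times smaller than the mzXML file. As the text notes, the ratios vary with the input file type.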

Note: the mzXML file was generated from the RAW file using the reAdw software made available by the Sashimi project.
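The details of the "simple differential compression scheme" used by CMN 1.0 are not given on this page. As a generic illustration of the idea only, the sketch below shows plain delta encoding: the first value is stored as-is and each later value is stored as the difference from its predecessor, which tends to produce small numbers for sorted data such as peak lists. The function names and the sample values are hypothetical, and this is not the actual CMN 1.0 encoding.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the stored differences."""
    values = []
    running_total = 0
    for delta in deltas:
        running_total += delta
        values.append(running_total)
    return values


if __name__ == "__main__":
    # Hypothetical sorted peak positions, scaled to integers for illustration.
    peaks = [17503, 17604, 18911, 20312]
    encoded = delta_encode(peaks)
    print(encoded)                         # [17503, 101, 1307, 1401]
    print(delta_decode(encoded) == peaks)  # True
```

The differences are smaller than the original values and so need fewer bytes to store, which is the general reason differential schemes suit ordered numeric data.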

Binary executable versions of the compression software are available for Windows, Linux and OS X. The source code for the three platforms is also available. To perform a compression on a Windows platform, the simplest method is to place the binary executable in a suitable location and type the following at the console:

>common FILENAME

where FILENAME is the name of the file you wish to compress (note: on Linux or OS X, use "./common FILENAME"). The file containing the MS/MS spectrum information can be in any one of the following formats:

  1. Mascot Generic Format;
  2. mzXML;
  3. mzData;
  4. DTA (single file or concatenated);
  5. PKL; or
  6. CMN 1.0.

The result will be a compressed file named "FILENAME.cmn". The utility has several input flags that control the output. A list of these flags, with a brief description of each, can be obtained by running the program with no command line parameters (or with the flag -h). To extract the information from a CMN file back into a simple ASCII format, such as Mascot Generic Format, type:

>common FILENAME.cmn -dmgf -oFILENAME.mgf

This generates a file named "FILENAME.mgf" that can be analyzed by other search engines.
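When many files need to be processed, the two commands shown on this page can be driven from a script. The following is a minimal sketch that uses only the invocations documented here ("common FILENAME" and "common FILENAME.cmn -dmgf -oFILENAME.mgf"). The helper names are hypothetical, the executable is assumed to be reachable under the name passed in, and the meaning of the program's exit code is not documented here, so the script simply reports it.

```python
import subprocess
from typing import List


def compress_command(input_file: str, executable: str = "common") -> List[str]:
    """Build the documented compression call: common FILENAME."""
    return [executable, input_file]


def extract_mgf_command(cmn_file: str, mgf_file: str,
                        executable: str = "common") -> List[str]:
    """Build the documented extraction call: common FILE.cmn -dmgf -oFILE.mgf."""
    return [executable, cmn_file, "-dmgf", f"-o{mgf_file}"]


def run(command: List[str]) -> int:
    """Run a command and return its exit code (its meaning is an assumption)."""
    return subprocess.run(command).returncode


if __name__ == "__main__":
    # On Linux or OS X the page suggests "./common" instead of "common".
    print(" ".join(compress_command("FILENAME")))
    print(" ".join(extract_mgf_command("FILENAME.cmn", "FILENAME.mgf")))
```

Building the argument list separately from running it makes it easy to check the exact command before anything is executed, for example by printing it as above and calling run() only once it looks right.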

Copyright © 2004-2011, The Global Proteome Machine Organization