Contents © 2004-2011 Massachusetts General Hospital

Capsule Summary

The Data Interpretation Core (DIC) performs computational genomics and proteomics analyses, conducts statistical pathway data analysis for gene expression microarray and high-throughput mass spectrometric proteomics, and develops tools for data computation and modeling. This core serves as a tool for integrative analysis of the comprehensive human genomic and proteomic data generated by the Program. Our approaches to data interpretation include the concept of clinical data interpretation groups, or Clinical DIGs: multidisciplinary teams of biostatisticians, systems biologists, basic science investigators knowledgeable about pathways, and clinicians, each formed to study a particular aspect of the clinical problem (for example, metabolic perturbations in burns). Such interactions are necessary for the discovery process to work and for the biological significance of the results to be uncovered.

The Challenge

New knowledge regarding the host response to trauma and burns will be acquired only by identifying the important dynamic relationships among multiple molecular and genetic interactions that vary over time. The work of the DIC represents a pioneering effort in clinical research to develop and apply pattern analysis and data mining strategies that correlate comprehensive genomic and proteomic data with human physiologic and clinical information. The Program believes that high-throughput genomics and proteomics can be used both to predict class membership and to identify clinical trajectories in patients with severe injury, and hopes to demonstrate that these changes in gene expression and the proteome can reveal novel insights into the biology of injury.

The Approach

The DIC relies heavily on established statistical methods in several broad areas: variance analysis, such as Analysis of Variance (ANOVA); dimension reduction and clustering, such as Principal Component Analysis (PCA) and Hierarchical Clustering; signal-to-noise metrics, such as the Intra-class Correlation Coefficient (ICC) and the Coefficient of Variation (CoV); and tests for significance and validation, such as t-tests, permutation tests, and various forms of cross-validation.
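As an illustration of one of the significance tests named above, the following is a minimal sketch of a two-sided permutation test on the difference between two group means. The function name, the number of permutations, and the fixed random seed are assumptions made for this example; it is not the DIC's own implementation.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for the difference of group means.

    Group labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable results
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical expression values for one gene in two patient groups.
control = [10.1, 10.3, 9.9, 10.2, 10.0]
injured = [12.0, 12.2, 11.9, 12.1, 12.3]
p_value = permutation_test(control, injured)
```

Because the test makes no distributional assumption, it is a common companion to the t-test when sample sizes are small.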

This core is also responsible for developing novel statistical approaches, either by applying existing statistical techniques in new ways or by developing new methods to analyze the data. Examples include the use of various filters to reduce the data sets to groups of more significant or interesting genes, Fisher Discriminant Analysis to rank the discriminatory genes, and PCA to find time-profile patterns of expression.
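The idea of ranking discriminatory genes can be sketched with the single-gene form of Fisher's criterion: the squared difference between class means divided by the sum of the within-class variances. This is a simplified, per-gene stand-in for a full Fisher Discriminant Analysis; the function names, the two-class 0/1 labels, and the sample data are assumptions for illustration only.

```python
from statistics import mean, variance

def fisher_score(values_a, values_b):
    """Univariate Fisher criterion for one gene (needs >= 2 samples per class)."""
    numerator = (mean(values_a) - mean(values_b)) ** 2
    denominator = variance(values_a) + variance(values_b)
    if denominator == 0:
        # No within-class spread: perfectly separating or uninformative.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

def rank_genes(expression, labels):
    """Return gene names sorted from most to least discriminatory.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels, one per sample (pre-classified data).
    """
    scores = {}
    for gene, values in expression.items():
        class_0 = [v for v, lab in zip(values, labels) if lab == 0]
        class_1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = fisher_score(class_0, class_1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: three genes measured on six samples.
expression = {
    "geneA": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "geneB": [2.0, 3.0, 2.5, 2.2, 2.9, 2.6],
    "geneC": [1.0, 2.0, 3.0, 2.5, 3.5, 4.5],
}
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_genes(expression, labels)
```

A gene scores highly when its class means are far apart relative to the spread within each class, which is the property a discriminatory gene list is meant to capture.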


Figure: Gene Expression Clustergram

Different software tools have been used by each of the different DIC groups to analyze the large data sets that have been generated to date. The following is a list of the main statistical and biocomputing suites used:

Excel: Microsoft Excel has several built-in statistical functions that can be used to quickly screen the data for potentially faulty chips and to compute the Coefficient of Variation. It is also ideal as a first pass for ordering and labeling chip data.
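For readers who prefer a script to a spreadsheet, the Coefficient of Variation check can be sketched as below. It mirrors dividing a standard deviation by a mean, as one would with Excel's STDEV and AVERAGE functions. The function names, the 0.25 threshold, and the sample intensities are assumptions for illustration and do not describe the DIC's actual screening rules.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return stdev(values) / m

def flag_variable_probes(replicates, threshold=0.25):
    """Return probe names whose CoV across replicate chips exceeds threshold.

    replicates: dict mapping a probe name to its intensities on replicate chips.
    """
    return [probe for probe, values in replicates.items()
            if coefficient_of_variation(values) > threshold]

# Hypothetical intensities for two probes on three replicate chips.
replicates = {
    "probe_1": [100, 102, 98],
    "probe_2": [50, 100, 150],
}
flagged = flag_variable_probes(replicates)
```

A probe that varies widely across technical replicates is a candidate for closer inspection, as is a chip on which many probes are flagged at once.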

MAS-5: Affymetrix Microarray Suite v.5 software is part of the Affymetrix GeneChip instrument and probe array platform. It is used to analyze experimental data for GeneChip probe array assays; generate reports that summarize intensity data, analysis output, and algorithm settings; and publish data (LIMS mode only) to a database that can be queried using the Affymetrix Data Mining Tool.

SAS: Statistical Analysis System, a powerful, comprehensive statistics package for analysis and data manipulation. The major advantage of SAS is its ability to use data from a wide variety of sources, and to manipulate the data to suit nearly any analysis need a user might have. In addition to reading raw data, SAS can use or convert system files created by other statistical or database packages such as Excel, Access and others.

dChip: DNA-Chip Analyzer (dChip) is a software package implementing model-based expression analysis of oligonucleotide arrays (Li and Wong 2001a) and several high-level analysis procedures. The model-based approach allows probe-level analysis on multiple arrays. By pooling information across multiple arrays, it is possible to assess standard errors for the expression indexes. This approach also allows automatic probe selection in the analysis stage to reduce errors due to cross-hybridizing probes and image contamination. High-level analysis in dChip includes comparative analysis and hierarchical clustering.

Cluster: Performs a variety of types of cluster analysis and other types of processing on large microarray datasets. Currently includes hierarchical clustering, self-organizing maps (SOMs), k-means clustering, and principal component analysis. Hierarchical clustering methods are described in Eisen et al. (1998) PNAS 95:14863.
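The following is a minimal sketch of agglomerative hierarchical clustering with average linkage and a correlation-based distance (1 minus the Pearson correlation). It is written from scratch for illustration and does not reproduce the Cluster program's code; the function names and the sample profiles are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)

def average_linkage(profiles):
    """Cluster profiles bottom-up; return the merges in the order they occur.

    profiles: dict mapping gene name -> list of expression values.
    Each merge is (members_a, members_b, distance), where the distance is
    the mean of 1 - r over all cross-cluster gene pairs.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(1 - pearson(profiles[a], profiles[b])
                        for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical time profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.5, 4.0, 2.0],
}
merges = average_linkage(profiles)
```

The sequence of merges is the information a tree viewer draws as a dendrogram: early merges join genes with similar profiles, and the final merge joins the most dissimilar groups.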

TreeView: Graphically browses the results of clustering and other analyses from Cluster. Supports tree-based and image-based browsing of hierarchical trees, and has multiple output formats for generation of images for publications.

BiosystAnSe: The Biological Systems Analysis Suite is a software package developed at the Bioinformatics & Metabolic Engineering Laboratory at MIT for different aspects of biological systems analysis, such as RNA expression data and metabolic network data. It currently has functions that filter the PCA results of multiple-array data to refine discriminatory gene lists, and that perform Fisher Discriminant Analysis (FDA) to rank discriminatory genes found in pre-classified microarray data.

SAM: Significance Analysis of Microarrays, or “supervised learning software for genomic expression data mining,” is a program developed at Stanford University and based on the paper by Tusher, Tibshirani and Chu (2001), “Significance analysis of microarrays applied to the ionizing radiation response,” PNAS 98:5116-5121. The software is used to correlate gene expression data to a wide variety of clinical parameters including treatment, diagnosis categories, survival time and time trends, and provides estimates of false discovery rate for multiple testing. It works with data from both cDNA and oligo microarrays, and can also be applied to protein expression data and SNP chip data.
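The statistic at the heart of SAM for a two-class comparison is the relative difference described by Tusher, Tibshirani and Chu: the difference in group means divided by a pooled standard error plus a small positive constant s0. The sketch below computes only that statistic. It takes s0 as a caller-supplied parameter (an assumption; SAM chooses it from the data) and omits the permutation step SAM uses to estimate the false discovery rate. The function name and sample values are hypothetical.

```python
from math import sqrt

def relative_difference(group_i, group_u, s0=0.0):
    """SAM-style relative difference d for one gene in a two-class design.

    d = (mean_i - mean_u) / (s + s0), where s is the pooled standard error
    of the difference in means and s0 damps genes with very small variance.
    """
    n1, n2 = len(group_i), len(group_u)
    m1 = sum(group_i) / n1
    m2 = sum(group_u) / n2
    sum_sq = (sum((x - m1) ** 2 for x in group_i)
              + sum((x - m2) ** 2 for x in group_u))
    s = sqrt((1 / n1 + 1 / n2) * sum_sq / (n1 + n2 - 2))
    if s + s0 == 0:
        raise ValueError("statistic is undefined when s + s0 is zero")
    return (m1 - m2) / (s + s0)

# Hypothetical expression values for one gene in two states.
d_plain = relative_difference([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
d_damped = relative_difference([5.0, 6.0, 7.0], [1.0, 2.0, 3.0], s0=1.0)
```

With s0 set to zero the statistic reduces to the ordinary two-sample t-statistic with pooled variance; a positive s0 shrinks the scores of genes whose apparent significance comes mainly from a very small variance.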

S-Plus: S-PLUS is software for exploratory data analysis and statistical modeling. At its core is the S version 4 language from Lucent Technologies’ Bell Labs, a language designed specifically for data visualization and modeling. It contains over 4,200 built-in functions for graphics, statistics, and program control, as well as built-in object-oriented data types for vectors, matrices, and data arrays.