A Versatile MicroArray Data Base: Goals, Design, and Implementation

Marc Rejali, Marco Antoniotti, Vera Cherpinsky, and Bud Mishra

Abstract

Many problems in functional genomics are being tackled using Microarray Technology. While this approach holds much promise for answering open questions in Biology, it poses significant problems from the "Data Management" point of view.

Our Bioinformatics group at NYU has been involved in several projects that use Microarray technology, for instance:

To address the needs of these collaborative research groups and others, we have developed the NYU Microarray Database (NYUMAD). Its functionality ranges from storage of the data in a relational database management system to front-end capabilities for the presentation and maintenance of the data.

Collaborating groups can share data and analysis results immediately. The system ensures the security and integrity of the data while making complex queries easy to pose through a powerful and dynamic front-end.

The database is a unified platform for understanding microarray-based gene expression data. The data can be output to a wide class of clustering algorithms, based on various "similarity measures" and various approaches to grouping. In particular, we have developed a new statistically robust similarity measure based on James-Stein shrinkage estimators and provided a Bayesian explanation for its superior performance. Additional research is focused on incorporating statistical tests for validation and for measuring significance (e.g., jackknife and bootstrap tests). Finally, we plan to add an experiment design module that suggests how future array experiments should be organized, given our understanding of how past experiments have performed.
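
To make the idea of a shrinkage-based similarity measure concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the NYUMAD implementation: it assumes the measure is a correlation coefficient in which each expression profile is centered on its sample mean multiplied by a shrinkage factor gamma between 0 and 1. The function name, the parameter name, and the treatment of gamma as a user-supplied constant are all hypothetical; how gamma would be estimated from the data is not described in this abstract.

```python
import math


def shrinkage_similarity(x, y, gamma):
    """Correlation-like similarity between two expression profiles.

    Each profile is centered on gamma * (its sample mean), i.e. the mean
    is shrunk toward zero. gamma = 1 gives the ordinary Pearson
    correlation; gamma = 0 gives the uncentered correlation.
    NOTE: gamma is an assumed, user-supplied shrinkage factor.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")

    n = len(x)
    # Shrunken offsets: the sample means pulled toward zero by gamma.
    x_off = gamma * sum(x) / n
    y_off = gamma * sum(y) / n

    dx = [v - x_off for v in x]
    dy = [v - y_off for v in y]

    norm_x = math.sqrt(sum(d * d for d in dx))
    norm_y = math.sqrt(sum(d * d for d in dy))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("similarity is undefined for a zero-norm profile")

    return sum(a * b for a, b in zip(dx, dy)) / (norm_x * norm_y)


if __name__ == "__main__":
    rising = [1.0, 2.0, 3.0]
    falling = [3.0, 2.0, 1.0]
    for g in (0.0, 0.5, 1.0):
        print(g, shrinkage_similarity(rising, falling, g))
```

Under these assumptions, the two extremes recover familiar measures, so a clustering algorithm could be run with any intermediate gamma without changing the rest of the pipeline.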

Most of the underlying DB schema design follows closely the specifications put forth by the Microarray Gene Expression Database group (http://www.ebi.ac.uk/microarray/MGED), especially when it comes to the XML-based MAML exchange format.

Functionality: The functionality of the NYUMAD system is summarized hereafter:

Architecture: The NYUMAD has a three-tier architecture as shown in the diagram below. We have built the NYUMAD database as part of our integrated system for Bioinformatics centered on the VALIS tool. We took great care to make the system "distributable" from the start, by clearly defining a three-tiered architecture that allows us to concentrate on different aspects of the design. We also closely followed the MAML standard put forth in the spring of 2001 by the MGED group. It is our intention to follow up on this design and to augment the NYUMAD DB with a module capable of communicating with other systems on the basis of the new MAGE-ML OMG Object model standard.