A Versatile MicroArray Data Base: Goals, Design, and Implementation

Marc Rejali, Marco Antoniotti, Vera Cherpinsky, and Bud Mishra

Abstract

Many problems in functional genomics are being tackled using Microarray Technology. While this approach holds much promise for answering open questions in Biology, it poses significant problems from the "Data Management" point of view.

Our Bioinformatics group at NYU has been involved in several projects that use Microarray technology, for instance:

Genome Mapping and Probe Placement (in cooperation with CSHL)
Nitrogen Pathway analysis in Arabidopsis (in cooperation with NYU Biology Dept.)
Hallucinogen effects on brain functions (in collaboration with Mount Sinai School of Medicine)
Cancer-related cell signalling using different cell lines (in cooperation with CSHL)

To address the needs of these collaborative research groups and others, we have developed the NYU Microarray Database (NYUMAD). Its functionality ranges from the storage of the data in relational database management systems to front-end capabilities for the presentation and maintenance of the data.

Collaborating groups can share data and analysis results immediately. The system ensures the security and integrity of the data while supporting both simple and complex queries through a powerful and dynamic front-end.

The database is a unified platform for understanding microarray-based gene expression data. The data can be output to a wide class of clustering algorithms, based on various "similarity measures" and various approaches to grouping. In particular, we have developed a new statistically robust similarity measure based on James-Stein shrinkage estimators and provided a Bayesian explanation for its superior performance. Additional research is focused on incorporating statistical tests for validation and for measuring significance (e.g., jackknife and bootstrap tests). Finally, we plan to add an experiment design module that suggests how future array experiments should be organized, given an understanding of how past experiments have performed.
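
The abstract does not give the formula for the shrinkage-based similarity measure, so the following is only a minimal sketch of the general idea: a correlation-style similarity in which each expression profile is centered on a shrunken version of its mean. The class name `ShrunkCorrelation`, the parameter `gamma`, and the decision to pass `gamma` in by hand are assumptions made for illustration; the estimator that would choose the shrinkage factor from the data is not shown. With `gamma = 1` the function reduces to the Pearson correlation coefficient, and with `gamma = 0` to the uncentered correlation.

```java
final class ShrunkCorrelation {
    private ShrunkCorrelation() {}

    /** Arithmetic mean of a non-empty profile. */
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /**
     * Correlation-style similarity of two expression profiles.
     * Each profile is centered on gamma * mean instead of the full mean;
     * gamma is a hypothetical, hand-supplied shrinkage factor in [0, 1].
     */
    static double similarity(double[] x, double[] y, double gamma) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("profiles must be non-empty and of equal length");
        }
        double offsetX = gamma * mean(x);
        double offsetY = gamma * mean(y);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - offsetX;
            double dy = y[i] - offsetY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            throw new IllegalArgumentException("a profile has no variation around its offset");
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {3.0, 2.0, 1.0};
        System.out.println("gamma = 1 (Pearson):    " + similarity(a, b, 1.0));
        System.out.println("gamma = 0 (uncentered): " + similarity(a, b, 0.0));
    }
}
```

Treating the shrinkage factor as a plain argument keeps the sketch independent of any particular estimator; a clustering routine would call `similarity` on every pair of profiles with one shared `gamma`.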

Most of the underlying DB schema design closely follows the specifications put forth by the Microarray Gene Expression Database group (http://www.ebi.ac.uk/microarray/MGED), especially when it comes to the XML-based MAML exchange format.

Functionality: The functionality of the NYUMAD system is summarized hereafter:

Microarray data is stored in relational data base management systems (RDBMS) using a database schema based on the MAML (Microarray Mark-up Language) specification.
Data is served to "clients" via the World Wide Web (WWW). Clients can be the NYUMAD Java applet that is part of the system described below, or custom-built user programs; data can also be retrieved as MAML XML files using a simple text-based HTTP request format. In the case of the NYUMAD Java applet, data retrieval is generally transparent to the user and is carried out as a natural part of using the GUI front-end (see below). For text-based requests, the returned data is in the MAML XML format.
Data submissions for updating existing data or inserting new data can be made using the NYUMAD Java applet client, custom-built user programs, or HTML forms that directly access the middle-tier server. As with data retrieval, the GUI front-end capabilities of the NYUMAD Java applet make data submission transparent to the user.
The NYUMAD applet presents data in a logical manner and allows easy navigation through the data. As the user navigates through the data, the required information is retrieved. It also allows straightforward updating of existing data and the insertion of new data. The NYUMAD applet can also retrieve data in the MAML XML format, which can then be cut and pasted to other applications.
The NYUMAD system integrates several clustering algorithms and libraries in order to provide a complete service to the user. The integration accesses the database automatically, which avoids tedious data reformatting and translation tasks.
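
A custom-built client of the kind described above could be sketched as follows. The text does not specify the request format or the server address, so the base URL, the parameter names (`type`, `name`), and the element names in the sample XML are all hypothetical placeholders rather than the actual NYUMAD request syntax or MAML vocabulary. The sketch only shows the two steps such a client needs: building a text-based HTTP request and reading the XML that comes back. The network call itself is left out.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class MamlClientSketch {
    private MamlClientSketch() {}

    /** Builds a GET URL from a base address and URL-encoded query parameters. */
    static String buildRequestUrl(String baseUrl, Map<String, String> params) {
        if (params.isEmpty()) {
            return baseUrl;
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "?" + query;
    }

    /** Parses an XML response and counts the elements with the given tag name. */
    static int countElements(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("response is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical server address and parameter names.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("type", "experiment");
        params.put("name", "array 1");
        System.out.println(buildRequestUrl("http://example.org/nyumad", params));

        // Placeholder element names, not taken from the MAML specification.
        String response = "<root><item/><item/></root>";
        System.out.println(countElements(response, "item"));
    }
}
```

In a real client the URL would be opened with a standard HTTP class such as `java.net.HttpURLConnection`, and the response body would be passed to the parser in place of the sample string.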

Architecture: The NYUMAD has a three-tier architecture as shown in the diagram below.

Front tier: The front tier comprises the NYUMAD applet and/or the user's custom-built programs and HTML forms. The applet is written in Java and interacts with the middle tier using HTTP, requesting or submitting data using either a text-based format or (Java) object serialization. Custom applications interact using HTTP and a text-based format. The Microarray data in text-based interactions is in MAML XML format.
Middle tier: The middle tier is provided by NYUMAD servlets written in Java that handle requests and submissions from the front tier. The middle tier is invisible to the end user. Requested data is retrieved from the RDBMS in the back tier using JDBC and then sent to the front tier either in MAML XML format or in the form of serialized objects for the NYUMAD applet or applications capable of interpreting the Java Object Serialization protocol. The middle tier servlets provide all the application logic necessary to ensure the integrity of the data and adherence to the necessary rules, constraints, and security restrictions. In addition, the middle tier caches data, allowing for faster data retrieval and better scalability. The middle tier can access multiple back tier databases, allowing for data distribution and scalability. The middle tier uses the server's file management system to store and retrieve large files such as image files. Finally, the functional abstraction provided by the middle tier shields the front tier from changes in the back-end structure, which keeps the system extensible and flexible as it is developed further.
Back tier: The back tier comprises the relational database management systems (RDBMS, currently PostgreSQL running on a 6-node Linux cluster) that store the Microarray and related data. It also includes the file management system used to store large files such as image files. The database schema is based on the MAML specification adapted to relational systems. Since the database access code in the middle tier uses JDBC, databases from different vendors can be used with relatively little additional code.
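
The interplay of middle-tier caching and JDBC access to the back tier can be illustrated with a minimal sketch. Nothing here is taken from the NYUMAD code base: the table `experiment`, its columns `name` and `organism`, the class `MiddleTierSketch`, and the least-recently-used eviction policy are all assumptions chosen for illustration. The query method uses only the standard `java.sql` interfaces, so the single vendor-specific piece is the `Connection` passed in. The cache sits in front of any slow loader, such as that query.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class MiddleTierSketch {
    private MiddleTierSketch() {}

    // Hypothetical table and column names; the real schema is not shown in the text.
    static final String EXPERIMENTS_BY_ORGANISM =
            "SELECT name FROM experiment WHERE organism = ? ORDER BY name";

    /** Runs the query through plain JDBC; works with any driver that supplies the Connection. */
    static List<String> experimentNames(Connection conn, String organism) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(EXPERIMENTS_BY_ORGANISM)) {
            ps.setString(1, organism);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }

    /** Small least-recently-used cache placed in front of a slow loader. */
    static final class LruCache<K, V> {
        private final Function<K, V> loader;
        private final Map<K, V> entries;
        private int loads = 0;

        LruCache(int capacity, Function<K, V> loader) {
            this.loader = loader;
            // Access-ordered map: the eldest entry is the least recently used one.
            this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > capacity;
                }
            };
        }

        /** Returns the cached value, calling the loader only on a miss. */
        synchronized V get(K key) {
            V value = entries.get(key);
            if (value == null) { // a null result is never cached and is reloaded next time
                value = loader.apply(key);
                loads++;
                entries.put(key, value);
            }
            return value;
        }

        /** Number of times the loader has been called so far. */
        synchronized int loadCount() {
            return loads;
        }
    }
}
```

In a servlet, the loader handed to `LruCache` would wrap a call such as `experimentNames` with a connection to the appropriate back-tier database, so that repeated requests for the same data do not reach the RDBMS.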

We have built the NYUMAD database as part of our integrated system for Bioinformatics centered on the VALIS tool. We took great care to make the system "distributable" from the start, by clearly defining a three-tier architecture that allows us to concentrate on different aspects of the design. We also closely followed the MAML standard put forth in the spring of 2001 by the MGED group. It is our intention to follow up on this design and to augment the NYUMAD DB with a module capable of communicating with other systems on the basis of the new MAGE-ML OMG Object model standard.