Goals, Design, and Implementation of a Versatile MicroArray Data Base
Marc Rejali, Marco Antoniotti, Vera Cherpinsky, Caroline Leventhal,
Salvatore Paxia, Archisman Rudra, Joe West and Bud Mishra
Abstract
Many problems in functional genomics are being tackled using
Microarray Technology. While this approach holds much promise for
answering open questions in Biology, it poses significant problems
from the "Data Management" point of view.
Our Bioinformatics group at NYU has been involved in several projects
that use Microarray technology, for instance:
- Genome Mapping and Probe Placement (in cooperation with CSHL)
- Nitrogen Pathway analysis in Arabidopsis (in cooperation with NYU
Biology Dept.)
- Hallucinogen effects on brain functions (in collaboration with Mount
Sinai School of Medicine)
- Cancer-related cell signalling using different cell lines (in
cooperation with CSHL)
To address the needs of these collaborative research groups and
others, we have developed the NYU Microarray Database
(NYUMAD). Its functionality ranges from the storage of the data in
relational data base management systems to front-end capabilities for
the presentation and maintenance of the data.
The database is a unified platform for understanding microarray-based
gene expression data. The data can be output to a wide class of
clustering algorithms, based on various "similarity measures" and
various approaches to grouping. In particular, we have developed a new
statistically-robust similarity measure based on James-Stein Shrinkage
estimators and provided a Bayesian explanation for its superior
performance. Additional research is focussed on incorporating
statistical tests for validation and for measuring significance (e.g.,
jackknife and bootstrap tests). Finally, we plan to add an experiment
design module that suggests how future array experiments should
be organized, given our understanding of how past experiments have
performed.
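As a minimal sketch of what a shrinkage-based similarity measure can
look like, the Java fragment below computes a correlation-style score
in which each expression profile is centered on gamma times its mean
rather than on the mean itself. With gamma = 1 this reduces to the
ordinary Pearson correlation, and with gamma = 0 to an uncentered
correlation. The class name, the method names, and the free parameter
gamma are assumptions made for illustration only: this abstract does
not state the formula used in NYUMAD, and a James-Stein estimator would
derive the shrinkage factor from the data instead of taking it as an
argument.

```java
/** Illustrative only: a correlation-style similarity with a shrunken mean offset. */
final class ShrinkageSimilarity {
    private ShrinkageSimilarity() {}

    /** Arithmetic mean of a non-empty profile. */
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /**
     * Similarity of two expression profiles, each centered on gamma * mean.
     * gamma is a hypothetical shrinkage factor in [0, 1] supplied by the caller.
     */
    static double similarity(double[] x, double[] y, double gamma) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("profiles must be non-empty and of equal length");
        }
        double offsetX = gamma * mean(x);
        double offsetY = gamma * mean(y);
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - offsetX;
            double dy = y[i] - offsetY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            throw new IllegalArgumentException("profile has no variation around its offset");
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {3.0, 2.0, 1.0};
        // Sweep the shrinkage factor from uncentered (0) to Pearson (1).
        for (double gamma : new double[] {0.0, 0.5, 1.0}) {
            System.out.printf("gamma=%.1f similarity=%.4f%n", gamma, similarity(a, b, gamma));
        }
    }
}
```

Keeping gamma as an explicit argument makes it easy to compare the two
limiting cases on the same pair of profiles before plugging in a
data-driven estimate.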
Most of the underlying DB schema design follows closely the
specifications put forth by the Microarray Gene Expression Database
group (http://www.ebi.ac.uk/microarray/MGED), especially when it comes
to the XML-based MAML exchange format.
Functionality:
The functionality of the NYUMAD system is summarized below:
- Microarray data is stored in relational data base management systems (RDBMS)
using a database schema based on the MAML (Microarray Mark-up
Language) specification.
- Data is served to "clients" via the World Wide Web (WWW).
Clients can be the NYUMAD Java applet that is part of the system
described below, custom-built user programs, or any tool that
retrieves MAML XML files using a simple text-based HTTP request
format. In the case of the NYUMAD Java applet, data retrieval is
generally transparent to the user and is carried out as a natural
part of using the GUI front-end (see below). For text-based requests,
the returned data is in the MAML XML format.
- Data submissions for updating existing data or inserting new data
can be made using the NYUMAD Java applet client, custom-built
user programs, or HTML forms that directly access the middle
tier server. As with data retrieval, the GUI front-end
capabilities of the NYUMAD Java applet make data submission
transparent to the user.
- The NYUMAD applet presents data in a logical manner and allows easy
navigation through the data. As the user navigates through the data,
the required information is retrieved. It also allows
straightforward updating of existing data and the insertion of new
data. The NYUMAD applet can also retrieve data in the MAML XML
format, which can then be cut and pasted into other applications.
- The NYUMAD system integrates several clustering algorithms and
libraries in order to provide a complete service to the user. The
integrated algorithms access the database automatically, which
avoids tedious data reformatting and translation tasks.
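
The text-based retrieval path described in the list above can be
sketched from the point of view of a custom-built client. The Java
fragment below builds a query URL and pulls attribute values out of an
XML reply using only the standard library. Everything specific in it is
hypothetical: the server address, the parameter names "type" and "id",
and the element and attribute names in the sample reply are placeholders
and are not taken from the NYUMAD request format or from the MAML
specification. No network call is made; a real client would fetch the
reply over HTTP before parsing it.

```java
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: a text-based client for an XML-over-HTTP service. */
final class MamlTextClientSketch {
    private MamlTextClientSketch() {}

    /** Appends URL-encoded query parameters to a base URL, in map order. */
    static String buildRequestUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                url.append(separator)
                   .append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
        return url.toString();
    }

    /** Returns the given attribute of every element with the given tag name. */
    static List<String> attributeValues(String xml, String tagName, String attribute) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(((Element) nodes.item(i)).getAttribute(attribute));
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("reply is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("type", "experiment");   // hypothetical parameter name
        params.put("id", "array 7");        // hypothetical identifier
        System.out.println(buildRequestUrl("http://example.org/nyumad", params));

        // Placeholder reply; the element names are NOT real MAML elements.
        String reply = "<experiments><experiment name=\"e1\"/>"
                     + "<experiment name=\"e2\"/></experiments>";
        System.out.println(attributeValues(reply, "experiment", "name"));
    }
}
```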
Architecture:
The NYUMAD has a three-tier architecture as shown in the diagram below.
- Front tier
The Front tier comprises the NYUMAD applet and/or user's custom-built
programs and HTML forms. The applet is written in Java and interacts
with the middle tier using HTTP, requesting or submitting data using
either a text based format or (Java) object serialization. Custom
applications interact using HTTP and a text based format. The
Microarray data in text based interactions is in MAML XML format.
- Middle tier
The middle tier is provided by NYUMAD servlets written in Java that
handle requests and submissions from the front tier. The middle tier
is invisible to the end user. Requested data is retrieved from the
RDBMS in the back tier using JDBC and then sent to the front tier
either in MAML XML format or in the form of serialized objects for the
NYUMAD applet or applications capable of interpreting the Java Object
Serialization protocol. The middle tier servlets provide all the
application logic necessary to ensure the integrity of the data and
adherence to necessary rules, constraints and security
restrictions. In addition the middle tier caches data, allowing for
faster data retrieval and better scalability. The middle tier can
access multiple back tier databases, allowing for data distribution
and scalability. The middle tier uses the server's file management
system to store and retrieve large files such as image files. In
addition, the functional abstraction provided by the middle tier
shields the front tier from changes in the back end structure, thus
ensuring development extensibility and flexibility for the system.
- Back tier
The back tier comprises the relational database management systems
(RDBMS, currently PostgreSQL running on a 6-node Linux cluster) that
store the Microarray and related data. It also includes the file
management system used to store large files such as image files. The
database schema is based on the MAML specification adapted to
relational systems. Since the database access code in the middle tier
uses JDBC, databases from different vendors can be used with
relatively little additional code.
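
Two of the middle tier duties described above, caching and delivery of
serialized Java objects, can be illustrated with a small stand-alone
sketch. In the fragment below, the back tier is replaced by a plain
lookup function so that the example runs without a database; in the
real system that lookup would be a JDBC query. The class names
ExpressionRecord and CachingMiddleTier, their fields, and the sample
values are all hypothetical and are not taken from the NYUMAD code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical record type that could be sent to the applet as a serialized object. */
final class ExpressionRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final String geneId;
    final double logRatio;

    ExpressionRecord(String geneId, double logRatio) {
        this.geneId = geneId;
        this.logRatio = logRatio;
    }
}

/** Illustrative only: a cache in front of a back tier lookup, plus serialization helpers. */
final class CachingMiddleTier {
    private final Function<String, ExpressionRecord> backTierLookup;
    private final Map<String, ExpressionRecord> cache = new ConcurrentHashMap<>();
    private final AtomicInteger backTierHits = new AtomicInteger();

    CachingMiddleTier(Function<String, ExpressionRecord> backTierLookup) {
        this.backTierLookup = backTierLookup;
    }

    /** Returns the cached record, querying the back tier only on a cache miss. */
    ExpressionRecord fetch(String geneId) {
        return cache.computeIfAbsent(geneId, id -> {
            backTierHits.incrementAndGet();
            return backTierLookup.apply(id);
        });
    }

    /** Number of times the back tier lookup was actually invoked. */
    int backTierHits() {
        return backTierHits.get();
    }

    /** Serializes a record with the standard Java object serialization protocol. */
    static byte[] toBytes(ExpressionRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reconstructs a record on the receiving side (for example, in the applet). */
    static ExpressionRecord fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ExpressionRecord) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a JDBC query against the back tier.
        CachingMiddleTier tier = new CachingMiddleTier(id -> new ExpressionRecord(id, 1.5));
        tier.fetch("gene-42");
        tier.fetch("gene-42"); // second request is served from the cache
        System.out.println("back tier lookups: " + tier.backTierHits());
        ExpressionRecord copy = fromBytes(toBytes(tier.fetch("gene-42")));
        System.out.println(copy.geneId + " " + copy.logRatio);
    }
}
```

Because the lookup is passed in as a function, the same cache could sit
in front of several back tier databases, in the spirit of the multiple
database access mentioned above.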
We have built the NYUMAD database as part of our integrated system for
Bioinformatics centered on the VALIS tool. We took great care
to make the system "distributable" from the start, by clearly
defining a three-tier architecture that allows us to concentrate on
different aspects of the design.
We also closely followed the MAML standard put forth in the Spring of
2001 by the MGED group. It is our intention to follow up on this
design and to augment the NYUMAD DB with a module capable of
communicating with other systems on the basis of the new MAGE-ML OMG
object model standard.