Sungear: a visualization tool for genomics experiments and related data sets

Genomics researchers are often faced with many large collections of genes. A given collection may, for example, consist of the genes that are induced more than two-fold in some experiment. Alternatively, a collection may consist of genes that are induced or repressed by more than five-fold, genes whose expression does not change, or genes from different species. The criteria of interest vary with the researcher's basic question. The researcher wants to ask questions such as:

(1) Which gene collections are most similar to one another?
(2) For some functionality of interest (represented, say, by a GO term), in which collections is it most highly represented?
(3) Given a subset of those collections, are any functional terms highly represented?

In addressing such questions, the researcher might like to interact with the data visually, in an exploratory, real-time fashion. For example, the researcher might like to select a few collections and GO terms to see whether an interesting interrelationship arises. Sometimes the researcher would also like computer support for finding interesting patterns, all within a single interactive framework. Sungear is such a framework.

Figure 1 shows a Sungear representation of induced genes from X experiments. [Colleagues, I suggest we start with a four-experiment Sungear. Initially we illustrate only the Gear Polygon window itself.]

The experiment names are listed around the X-sided polygon. A "gear" is a circle with arrows pointing toward some of the vertices of the polygon. The location of the gear is determined by the locations of those vertices. The size of the gear is proportional to the number of genes that are induced in all the experiments to which the gear points and in none of the other experiments (that is, in no proper superset of those experiments). There is one potential gear for each subset of the experiments except the empty subset.
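The gear definition above can be illustrated with a minimal sketch. It assumes each experiment is available as a plain set of gene identifiers; the function name build_gears, the experiment names, and the gene labels are hypothetical and are not taken from Sungear itself.

```python
from collections import defaultdict


def build_gears(experiments):
    """Group genes by the exact set of experiments in which they appear.

    `experiments` maps an experiment name to the set of genes induced in it.
    The result maps a frozenset of experiment names (one gear) to the genes
    found in exactly those experiments and in no others.
    """
    membership = defaultdict(set)  # gene -> names of experiments containing it
    for name, genes in experiments.items():
        for gene in genes:
            membership[gene].add(name)

    gears = defaultdict(set)  # exact experiment subset -> genes
    for gene, names in membership.items():
        gears[frozenset(names)].add(gene)
    return dict(gears)  # subsets with no genes never appear as keys


# Hypothetical toy data: three experiments with made-up gene identifiers.
toy = {
    "exp1": {"g1", "g2", "g3"},
    "exp2": {"g2", "g3", "g4"},
    "exp3": {"g3", "g5"},
}
gears = build_gears(toy)
for subset, genes in sorted(gears.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(subset), "size", len(genes), sorted(genes))
```

Because every gene is assigned to exactly one subset of experiments, the gears partition the genes, and a subset that no gene matches simply never shows up in the result.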
With X experiments (polygon vertices), there can be up to 2^X - 1 gears, one for each non-empty subset. (We say "up to" because a gear having no members is not drawn.) When X = 3, these subsets are exactly the regions of a three-set Venn diagram. Whereas the Venn diagram representation does not extend readily beyond three sets, Sungear can represent, in principle, an arbitrary number of collections, limited only by the researcher's willingness to interpret a visual display having many gears. In our lab, we have constructed useful displays consisting of [...] collections.

The display by itself offers information about the relationships among the different experiments, thus answering question (1) above. Beyond that, Sungear provides multiple interactive views of the data, as shown in Figure 2. [Figure 2: Please display the four experiments in its entirety, including the gene lists and the GO terms]

In Sungear, one or more gears, genes, and/or GO terms may be selected. Figure 3 shows the result of selecting a single GO term [...], thus addressing question (2) above. The Gear Polygon shows gears whose outlines keep their pre-selection sizes but whose interiors contain only a smaller highlighted ball. The empty annulus represents genes that do not meet the criteria of the selection; in this case, genes that do not relate directly or indirectly to the selected GO term. In the gene list on the left, the genes that relate to the GO term are highlighted. For readers unfamiliar with GO, the Gene Ontology is organized as a directed acyclic graph, so each gene may belong to many terms.

After one or more selections, some gears have often become mostly dark, and the selected genes may be widely scattered through the original gene list. For this reason, it is useful to click the Narrow button, giving Figure 4 [result of Narrow]. In the left frame of Figure 4, the gene list contains only the genes that were previously selected.
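The selection and Narrow behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: gears are held as a mapping from experiment subsets to gene sets, the ontology is a small parent-to-children mapping, and Narrow is assumed to drop gears left with no genes. The names genes_for_term, highlight, and narrow, and all term and gene labels, are hypothetical.

```python
def genes_for_term(term, children, direct):
    """Collect genes annotated to `term` directly or via any descendant term."""
    seen, stack, genes = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:  # a DAG node can be reached along several paths
            continue
        seen.add(current)
        genes |= direct.get(current, set())
        stack.extend(children.get(current, ()))
    return genes


def highlight(gears, selected):
    """Per gear: (outline size, highlighted size) for the current selection."""
    return {key: (len(genes), len(genes & selected)) for key, genes in gears.items()}


def narrow(gears, selected):
    """Restrict every gear to the selected genes; emptied gears are dropped."""
    narrowed = {}
    for key, genes in gears.items():
        kept = genes & selected
        if kept:
            narrowed[key] = kept
    return narrowed


# Hypothetical toy data.
gears = {
    frozenset({"exp1"}): {"g1", "g6"},
    frozenset({"exp1", "exp2"}): {"g2"},
    frozenset({"exp1", "exp2", "exp3"}): {"g3"},
    frozenset({"exp2"}): {"g4"},
}
children = {"termA": ["termB"], "termB": []}  # termB is a child of termA
direct = {"termA": {"g1"}, "termB": {"g3", "g4"}}  # direct annotations only

selected = genes_for_term("termA", children, direct)
sizes = highlight(gears, selected)
narrowed = narrow(gears, selected)
print(sorted(selected))
print(len(narrowed), "gears remain after Narrow")
```

In this sketch the difference between the two numbers returned by highlight plays the role of the empty annulus, and after narrow every remaining gear consists only of selected genes.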
In the middle frame, the Gear Polygon, no gear has an annulus, because the relevant universe of genes is now the previously selected set. Each gear's size is therefore proportional to the number of genes that correspond to the experiments associated with the gear and that belong to the previously selected set. The bottom right frame lists GO terms in descending order of Z-score. That score measures the degree of overrepresentation relative to a random sample of the genome under analysis. A large positive Z-score, as associated with [...], indicates a GO term that is likely to be overrepresented in a hypergeometric sense even after correcting for multiple testing [Rodrigo: reference to multitesting]. In this case, it is unsurprising that [...] is overrepresented, because we selected that term initially. However, Figure 5 shows the selection of a gear (one involving two of the four experiments) from the original data, followed by a Narrow. The lower right frame in that case shows that [...] is overrepresented.

In database systems, it is useful to distinguish "and-functionality" from "or-functionality". And-functionality arises, for example, when a researcher wants to select all genes that are in gear [...] and also satisfy GO term [...]. Figure 6 shows such a situation. Figure 7, by contrast, shows genes that are either in gear [...] or satisfy GO term [...]. The "or" set always contains the "and" set, so it is normally larger. Simple combinations of mouse clicks with the shift and alt keys support these and other selection semantics, including exclusion (not-functionality) and range selection.

Whereas Sungear supports several forms of visual exploration, researchers sometimes would like the data to speak for itself. A typical question in practice is: "Tell me which combination of experiments gives an interesting pattern of overrepresentation."
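The Z-score ranking described above can be sketched as follows. The text says only that overrepresentation is judged "in a hypergeometric sense", so this sketch assumes the Z-score is the observed count minus the hypergeometric mean, divided by the hypergeometric standard deviation; it omits any multiple-testing correction. The names z_score, rank_terms, and scan_gears, and all data, are hypothetical. The last function scores each gear in turn as one plausible way to approach the question just posed; it is not a description of Sungear's own procedure.

```python
import math


def z_score(k, n, K, N):
    """Z-score for seeing k annotated genes in a selection of n genes when
    K of the N genes in the genome carry the annotation (assumed formula)."""
    p = K / N
    mean = n * p
    var = n * p * (1 - p) * (N - n) / (N - 1)  # hypergeometric variance
    if var == 0:
        return 0.0  # degenerate case: no variability to standardize against
    return (k - mean) / math.sqrt(var)


def rank_terms(selection, term_genes, genome):
    """Return (z, term) pairs for a selection, highest Z-score first."""
    scores = []
    for term, genes in term_genes.items():
        z = z_score(len(selection & genes), len(selection),
                    len(genes & genome), len(genome))
        scores.append((z, term))
    return sorted(scores, reverse=True)


def scan_gears(gears, term_genes, genome):
    """Score each gear as if it were selected; report its top-ranked term."""
    results = []
    for subset, genes in gears.items():
        z, term = rank_terms(genes, term_genes, genome)[0]
        results.append((z, sorted(subset), term))
    return sorted(results, reverse=True)


# Hypothetical toy data: a ten-gene genome, two terms, two gears.
genome = {f"g{i}" for i in range(10)}
term_genes = {
    "termA": {"g0", "g1", "g2"},
    "termB": {"g3", "g4", "g5", "g6", "g7"},
}
gears = {
    frozenset({"exp1", "exp2"}): {"g0", "g1"},
    frozenset({"exp1"}): {"g3", "g8"},
}
ranked = scan_gears(gears, term_genes, genome)
for z, subset, term in ranked:
    print(subset, term, round(z, 3))
```

A real analysis would need a correction for testing many terms at once, and with small selections the normal approximation behind a Z-score is rough, so exact hypergeometric tail probabilities may be preferable.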
Because Sungear computes Z-scores that measure overrepresentation, the system can answer such questions by simulating several selections and reporting those that indicate significant overrepresentation. Sungear embodies this in a button immodestly called Cool! Figure 8 shows the result of pressing the Cool! button and following one of its recommendations.

Case Studies

Whereas the running example above consisted of only X collections of experiments on a single species, Sungear also gives insight into inter-species studies. [Manny and Rodrigo, here is an outline of this part: Explain the multi-species data set and how it was obtained. Show a few selections that illustrate interesting science. Try to use the Cool! button if possible.]

Conclusion

Sungear generalizes Venn diagrams to multiple collections of genes, relates those collections to functional categories, and permits visual data exploration. After two minutes of training, even senior researchers can learn to be comfortable navigating Sungear. For the moderately sophisticated user, Sungear offers several data selection capabilities, including and-functionality, or-functionality, and range selection. Sungear supports the discovery of overrepresentation using any functional directed acyclic graph (such as GO). Finally, Sungear integrates easily with other tools such as Cytoscape through an overarching framework we call Virtual Plant. Sungear is indifferent to the species under consideration and can be extended beyond genomics to proteomic and other kinds of data. For this reason, we consider Sungear to be a useful tool for a wide variety of Systems Biology tasks.

References

GO
multitesting in GO hierarchies
some ref maybe to Venn diagrams
some ref to Cytoscape
refs to data sources