\section{gLabTrie structure} \subsection{Preliminaries} We define the terms: \emph{input network} (denoted by $\mathbb{G}$), \emph{topologies} (denoted by $G$), i.e. unlabeled graphs that represent motif structures and \emph{topology instances} (denoted by $g$), i.e. subgraphs of $\mathbb{G}$ that accommodate certain topologies. We also introduce the notation that we use in the rest of the paper (\emph{isomorphism}, \emph{orbit} and \emph{canonical label}). For simplicity, our discussion will center on undirected graphs, although our method works with directed graphs as well. Given a graph $G$, we denote by $V_G$ its set of vertices, by $E_G$ its set of edges, by $L_G$ its alphabet of labels and by $l_G$ a function that assigns a label to each vertex. We also write $G=(V_G,E_G,L_G,l_G)$. A subgraph $G'$ of a graph $G$ (denoted by $G'\subseteq G$) is a graph that contain a subset of vertices and a subset of edges of $G$ (i.e. $V_{G'}\subseteq V_{G}$ and $E_{G'}\subseteq E_{G}$). An \emph{isomorphism} between two graphs $G_1$ and $G_2$ is a one-to-one mapping $\varphi : V_{G_1} \rightarrow V_{G_2}$ between vertices, which preserves the structure, i.e. $(u,v)\in G_1 \Leftrightarrow (\varphi(u),\varphi(v)) \in G_2$, and the labels, i.e. $l_G(u)=l_G(\varphi(u))$. If there is at least an isomorphism between $G_1$ and $G_2$ we say that they are \emph{isomorphic} and write $G_1 \sim G_2$. An automorphism in $G$ is an isomorphism between $G$ and itself. Every graph admits at least one automorphism (where each node corresponds to itself). Typically, a graph can have many automorphisms. We abuse the notation and write $\varphi(G)$, with $G \subseteq G_1$ to denote the subgraph of $G_2$ that corresponds to $G$ according to $\varphi$ (i.e. the subgraph composed by nodes $\varphi(v)$ with $v\in G$ and edges $(\varphi(v_1),\varphi(v_2))$ with $(v_1,v_2)\in G$). A \emph{labeled topology} is an undirected (node-) labeled connected graph $G$. An \emph{unlabeled topology} is a labeled topology stripped of its labels. A labeled topology that occurs frequently in $G$ is also called \emph{motif}. An \emph{occurrence} $g$ of a topology $G$ is a connected subgraph of $\mathbb{G}$ that is isomorphic to $G$. So, a given motif may have zero, one, or more occurrences in a graph. %\reminder{Say that we call $u$ both the node and its id} %\reminder{Describe canonical form and canonical order} %\reminder{define orbits} \subsection{Problem definition} We aim to support \emph{motif-based queries} in which the user specifies a set of constraints and the system returns all motifs that satisfy the constraints. In our framework, a user specifies a frequency threshold, a p-value threshold and a bag (multiset) of labels that the motifs must contain. An example query would be: ``Give me all labeled topologies that have at least two labels A and one label B, occur at least $f$ times and have a p-value smaller than $p$''. We also want the query processing to be really fast, so when a user is not satisfied with the answer, he or she can change the constraints and quickly get a new answer. We accept a slow (but still reasonable) off-line \emph{preprocessing} step in exchange for fast \emph{query processing}. Formally, we define a \emph{motif-based query} (more simply a query) as a quadruple $Q=(C,k,f,p)$, where $C$ is a bag of labels (a bag, also called multiset, is similar to a set, but an element may occur more than once), $k$ is the requested size of motifs, $f$ is a frequency threshold and $p$ is a p-value threshold. \begin{definition} \textbf{Motif-based query processing}. Given a network $\mathbb{G}$ and a query $Q=(C,k,f,p)$, find all labeled topologies $T$ with number of nodes (size) $k$, whose number of occurrences in $\mathbb{G}$ is at least $f$ and whose p-value is no more than $p$. \end{definition} We solve the defined problem in two steps. During an offline \emph{preprocessing} phase we census the input network to find all labeled motifs up to a certain size $K$, and organize them in a suitable data structure (that we call the \emph{gLabTrie}). Later, during the online \emph{query processing} phase, we probe the gLabTrie to efficiently retrieve motifs that satisfy the query constraints. In the remaining part of this section we describe how we extend existing approaches to support labeled motifs and the data structure used for quickly processing queries. \subsection{gLabTrie Data Structure} We propose an index and an algorithm for labeled motif discovery. The main data structure of a network-centric method is a key-value map (hash table or binary tree) that associates each unlabeled topology (up to a certain size) to a counter. Unlabeled topologies can be represented by their canonical form, so that equivalence check is efficient. The overall algorithm is shown in Alg.~\ref{alg:overall}. \begin{algorithm} \caption{Overall network-centric algorithm for unlabeled motif discovery}\label{alg:overall} \begin{algorithmic} \REQUIRE network, frequency threshold $f$, p-value threshold $p$, number of randomizations $r$ \COMMENT {returns the set of motifs with frequency $\geq f$ and p-value $\leq p$} \STATE initialize $map$ \STATE call $census(network,map)$ \STATE initialize $map\_count$ \FOR{$i=0\ldots r$} \STATE $rand\_net = randomize(network)$ \STATE initialize $map\_rand$ \STATE call $census(rand\_net, map\_rand)$ \FOR{all $t \in keys(map\_rand)$} \IF {$map\_rand[t]\geq map[t]$} \STATE $map\_count[t] = map\_count[t] + 1$ \ENDIF \ENDFOR \ENDFOR \FOR{all $t \in keys(map\_count)$} \STATE $pval = map\_count[t]/r$ \IF {$map[t] \geq f$ and $pval\leq p$} \STATE output $t$, $map[t]$, $pval$ \ENDIF \ENDFOR \end{algorithmic} \end{algorithm} The core procedure is $census()$, which takes as input a network and fills a map. This procedure enumerates all subgraphs of the network one by one and calls a method $notify\_subgraph(map,topology)$ (not shown) for every subgraph of the input network. {\em Misael: we either have to show the method or describe carefully what it does.} To perform a correct counting, every subgraph has to be processed exactly once. Graph census algorithms use a subgraph expanding procedure that takes one node at a time and expands it in all possible ways. Since this procedure is prone to find the same subgraph multiple times (there are many ways of producing the same subgraph), the census algorithms consider a carefully designed set of breaking conditions that guarantee that a subgraph is enumerated exactly once. The nodes of each topology must be given in a canonical order in order to quickly compute the canonical form. {\em Misael: notion of canonical form must be defined} {\em Misael: this first sentence needs to be made clear. What does the slash mean, but more globally what does the whole thing mean?} A straightforward way to handle labeled topologies consists in considering as keys for the map topology/counter the canonical form of labeled topologies as opposed to unlabeled topologies. This would require computing a canonical form for every subgraph of the network and would frustrate the effort of avoiding this expensive step, made by recent methods (such as gTrie). To avoid computing the canonical form for each subgraph, we resort to a different approach that consists in combining the canonical form of the unlabeled topology with the sequence of labels. This approach introduces the problem of determining the order of labels. Indeed, the canonical order of unlabeled topologies is not sufficient. To clarify this point let us consider the two subgraphs in Fig.~\ref{fig:example}. Numbers represent the canonical order of nodes, while letters represent labels. Note that the order between 2 and 3 is ambiguous (1-3-2 would be a valid order as well) since by exchanging them we obtain the same unlabeled canonical form. The two labeled topologies are clearly isomorphic. However the label sequences in the canonical orders are different (ABC vs. ACB). \begin{figure}[ht] \centering \includegraphics[width=.30\textwidth]{FIG/example.png} \caption{Example of two unlabeled canonical orders that produce different sequence of labels on isomorphic graphs. The order is given by numbers. The two corresponding sequences of labels are ABC and ACB.}\label{fig:example} \end{figure} To guarantee that isomorphic labeled topologies have the same label sequence, we solve the ambiguity in the canonical ordering by ordering labels (e.g. in alphabetic order) and using this order to break the ties. This is equivalent to choosing, among all possible canonical orders, the one that corresponds to the lexicographically minimum sequence of labels. Note that network-centric tools (e.g. gTrie) solve the ambiguity by considering the order of node ids to break the ties. Therefore we just need to ensure that the order of node ids is consistent with the order of labels. This can be done by reassigning node ids of the input network so that nodes with smaller labels are assigned with smaller node {\em Misael: I changed label to node} ids (i.e. $v \leq u$ if $l_\mathbb{G}(v) \leq l_\mathbb{G}(u)$). The procedure described above solve the problem in Fig.~\ref{fig:example}. The order of the second topology is forced to be as in Fig.~\ref{fig:example} and hence both label sequences would be ABC. Now we prove that this procedure always gives the correct result. Specifically, we prove that: \begin{figure}[ht] \centering \includegraphics[width=.30\textwidth]{FIG/example2.png} \caption{By considering unambiguous canonical orders we can guarantee that isomorphic graphs are associated with the same sequence of labels. In this example both sequences of labels are ABC.}\label{fig:example} \end{figure} \begin{itemize} \item if two labeled topologies are isomorphic then their associated canonical forms coincide; \item given two labeled topologies, if their corresponding canonical forms coincide, then they are isomorphic. \end{itemize} The second condition is trivial. Indeed the canonical order of two topologies defines an association between nodes that preserves both the structure and the label sequence. In order to prove the first condition, we need to prove that if two labeled topologies are isomorphic then their corresponding sequence of labels coincide. In fact, the labeled canonical form is computed by combining the unlabeled canonical form with the sequence of labels. Since by stripping off the labels two isomorphic topologies remain isomorphic, the two unlabeled canonical forms coincide. Therefore we need to be concerned only about the label sequences. \begin{lemma} Let $G_1$, $G_2$ be two isomorphic labeled topologies and $S_1$, $S_2$ be the sequence of labels given by their corresponding unambiguous canonical order (i.e. there is no other canonical order that produces a sequence of labels lexicographicallysmaller than $S_1$, nor other canonical order that produces a sequence of labels smaller than $S_1$, as assured by the previously described ordering of node ids). Then $S_1=S_2$. {\em Misael: the as assured part in the parentheses of a lemma is bad form. Define a graph that has this node ordering a "lexically numbered graph" and then start the lemma with something like, In a lexcially numbered graph} \end{lemma} Proof: by contradiction. Suppose $S_1 \neq S_2$. Without lost of generality consider $S_1 \leq S_2$. Since $G_1$ and $G_2$ are isomorphic, there is at least an isomorphism between $G_1$ and $G_2$ (that is a one-to-one association between nodes of $G_1$ and nodes of $G_2$). We can use this node correspondence to construct an order of nodes for $G_2$ that is equivalent to the canonical order of $G_1$. This is a valid canonical order for $G_2$ since it produces the same unlabeled canonical form. Moreover this order corresponds to the same sequence of labels as $S_1$. We found a valid canonical order for $G_2$ that produces a sequence of labels smaller than $S_2$. This contradicts the hypothesis that there is no canonical order that produces a sequence of label smaller than $S_2$. {\em Misael: This should follow from the special kind of graph -- lexically nubmered graph} QED We change the overall algorithm (Alg.~\ref{alg:overall}) to support labels. The main change concerns the data structure that maintains the topology counters. Specifically, we substitute the maps topology/counter ($map$, $map\_rand$, $map\_count$) with key-value maps where keys are unlabeled topologies and values are themselves maps (as opposed to counters) that associate label sequences to counters. To retrieve the counter of a labeled topology we first look up the entry corresponding to its unlabeled topology, then we look up the counter associated with the label sequence in the corresponding map. Specifically, we apply the following changes: \begin{enumerate} \item we introduce a first step that reassigns ids to nodes of the input network so that nodes with smaller labels are assigned with smaller node ids; \item we change the maps topology/counter into maps topology/label map, where label map is a key-value map that associates label sequences to counters; \item we change the method $notify\_subgraph(map,topology)$ into $notify\_subgraph(map,topology,label\_seq)$ and change the method $census()$ accordingly. {\em Misael: we really need to describe these routines then} \end{enumerate} The final overall algorithm for labeled motif discovery is shown in Alg.~\ref{alg:overall_labeled}. \begin{algorithm} \caption{Overall network-centric algorithm for labeled motif discovery}\label{alg:overall_labeled} \begin{algorithmic} \REQUIRE network, frequency threshold $f$, p-value threshold $p$, number of randomizations $r$ \COMMENT {returns the set of motifs with frequency $\geq f$ and p-value $\leq p$} \STATE initialize $map$ \STATE call $census(network,map)$ \STATE initialize $map\_count$ \FOR{$i=0\ldots r$} \STATE $rand\_net = randomize(network)$ \STATE initialize $map\_rand$ \STATE call $census(rand\_net, map\_rand)$ \FOR{all $t \in keys(map\_rand)$} \FOR{all $ls \in keys(map\_rand[t])$} \IF {$map\_rand[t][ls]\geq map[t][ls]$} \STATE $map\_count[t][ls] = map\_count[t][ls] + 1$ \ENDIF \ENDFOR \ENDFOR \ENDFOR \FOR{all $t \in keys(map\_count)$} \FOR{all $ls \in keys(map\_rand[t])$} \STATE $pval = map\_count[t][ls]/r$ \IF {$map[t][ls] \geq f$ and $pval\leq p$} \STATE output $t$, $map[t][ls]$, $pval$ \ENDIF \ENDFOR \ENDFOR \end{algorithmic} \end{algorithm} \subsection{An index for querying motifs} During the preprocessing phase we find all motifs up to size $K$ (a pre-defined parameter)in the input network as described previously. We not set neither a frequency threshold nor a p-value threshold, so all labeled topologies occurring in the input network are considered. We limit the search to motifs having size $K$ or less. For simplicity of exposition, in the following, we consider only motifs of size exactly $K$, although our method can encompass also motifs with size smaller than $K$. We put all extracted labeled topologies in a data structure, which we call the \emph{gLabtrie}, that facilitates later retrieval. An example of a gLabTrie for $K=3$ and two labels (A and B) is depicted in Fig.~\ref{fig:index}. \begin{figure}[ht] \centering \includegraphics[width=.50\textwidth]{FIG/index.png} \caption{The gLabTrie. Our data structure for processing motif-based queries}\label{fig:index} \end{figure} The gLabTrie consists of a DAG, which embodies the superset relation between sets, and a collection of lists of topologies contained in the leaves of the DAG. Specifically, nodes of the DAG represent bags of labels (label constraints) and an edge is drawn between two nodes $u$ and $v$ if $v$ is superset of $u$ (i.e. it contains all labels in $u$ with multiplicity below or attained to the one in $v$), and $v$ has exactly one label more than $u$. The edge is associated with the label that is different between $u$ and $v$. Each leaf (node that does not have any outgoing edges) contains a list of all labeled topologies that satisfy the label constraints associated with the leaf, with the topologies' frequencies and p-values. The described gLabTrie enables fast lookup of a bag of labels and then fast retrieval of associated topologies (by exploring the part of the DAG reachable from the corresponding node). The DAG shown in the example in Fig.~\ref{fig:index} is complete {\em Misael: define this, e.g. contains all possible ...}, but in general it may be incomplete. For instance, if there are no topologies with labels ABB and BBB, the nodes ABB, BBB and BB are not included in the DAG, thus making its exploration more efficient. \subsubsection{Building the gLabTrie} The building procedure is given in Alg.~\ref{alg:building}. First we group the topologies by their label bags. Then, for each label bag we create a leaf and store it in a hash table that associates label bags with the corresponding nodes. We create the other nodes of the DAG by calling $create\_dag()$ (Alg.~\ref{alg:create_dag}), which recursively removes one label a time from nodes and creates nodes up to the root. \begin{algorithm} \caption{Building the gLabTrie}\label{alg:building} \begin{algorithmic} \REQUIRE set $T$ of labeled topologies of size $K$ with associated frequency and p-value \COMMENT {returns the root of the gLabTrie data structure} \STATE group $T$ by label bags \FOR {each label bags $lb$ and its corresponding set of topologies $T_{lb}$} \STATE initialize $node$ \COMMENT {create a leaf node} \STATE $node.label\_bag = lb$ \STATE $node.topologies = T_{lb}$ \STATE $hash\_table[lb] = node$ \STATE call $create\_dag(node, hash\_table)$ \ENDFOR \STATE return $hash\_table[\{\emptyset\}]$ \end{algorithmic} \end{algorithm} \begin{algorithm} \caption{Recursive procedure $create\_dag$ for building the gLabTrie}\label{alg:create_dag} \begin{algorithmic} \REQUIRE a node $node$ and the hash table of nodes $hash\_table$ \IF {$node.label\_bag == \emptyset$} \STATE return \ENDIF \FOR {all label $l$ in $node.label\_bag$} \STATE $lb\_parent = node.label\_bag - \{l\}$ \IF {$lb\_parent \in keys(hash\_table)$} \STATE $parent = hash\_table[lb\_parent]$ \ELSE \STATE initialize $parent$ \COMMENT {create a new node} \STATE $parent.label\_bag = lb\_parent$ \STATE $hash\_table[lb\_parent] = parent$ \STATE call $create\_dag(parent, hash\_table)$ \ENDIF \STATE $parent.children[l] = node$ \ENDFOR \end{algorithmic} \end{algorithm} \subsubsection{Query processing} Given the gLabTrie described above, and a query $Q=(C,k,f,p)$ with $k=K$, query processing is quite straightforward. To perform a query $Q=(C,k,f,p)$ with $k=K$, first look up the node $n$ of the DAG associated with the set of labels in $C$, then explore all nodes of the DAG reachable from $n$. Finally, retrieve all topologies associated with reachable leaves and return the ones whose frequencies are greater than or equal to $f$ and whose p-values are less than or equal to $p$. Note that with small changes the gLabTrie can support queries of size $k\leq K$. {\em Misael: Please say how} Answering queries with $k\geq K$ is not trivial and is the subject of our current work.