Title: Expert-Driven Validation of Set-Based Data Mining Results
Candidate: Adomavicius, Gediminas
Advisor(s): Tuzhilin, Alexander; Davis, Ernest
This dissertation addresses the problem of dealing with large numbers of set-based patterns, such as association rules and itemsets, discovered by data mining algorithms. Since many discovered patterns may be spurious, irrelevant, or trivial, one of the main problems is how to validate them, e.g., how to separate the ``good'' rules from the ``bad.'' Many researchers have advocated the explicit involvement of a human expert in the validation process. However, scalability becomes an issue when large numbers of patterns are discovered, since the expert cannot perform the validation on a pattern-by-pattern basis in a reasonable period of time. To address this problem, this dissertation describes a new expert-driven approach to set-based pattern validation.
The proposed validation approach is based on validation sequences: we rely on the expert's ability to iteratively apply validation operators that can validate multiple patterns at a time, thus making expert-based validation feasible. We identified the class of scalable set predicates called cardinality predicates and demonstrated how these predicates can be used effectively in the validation process, i.e., as a basis for validation operators. We examined various properties of cardinality predicates, including their expressiveness. We have also developed and implemented the set validation language (SVL), which a domain expert can use to specify cardinality predicates manually. In addition, we have proposed and developed a scalable algorithm for set and rule grouping that can be used to generate cardinality predicates automatically.
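To illustrate how a single cardinality predicate can validate many patterns at once, consider the following minimal Python sketch. The rule representation (a rule as a set of items), the item class, and the operator shape are hypothetical illustrations, not the dissertation's SVL syntax or its actual operators:

```python
# Hypothetical sketch: a cardinality predicate constrains how many items a
# rule's itemset draws from a given class of items, so one predicate can
# accept or reject a large batch of discovered rules in a single step.

def cardinality_predicate(item_class, lo, hi):
    """Return a predicate that is true when the number of items a rule
    shares with item_class lies in the interval [lo, hi]."""
    def pred(rule_items):
        return lo <= len(rule_items & item_class) <= hi
    return pred

def validate(rules, predicate):
    """One validation step: split rules into those matching the predicate
    and those left for further validation."""
    matched = [r for r in rules if predicate(r)]
    remaining = [r for r in rules if not predicate(r)]
    return matched, remaining

# Example: flag rules that mention two or more demographic attributes.
demographic = {"age", "income", "zipcode"}
rules = [{"age", "income", "buys_pc"}, {"buys_pc", "buys_printer"}]
too_demographic = cardinality_predicate(demographic, 2, 3)
rejected, rest = validate(rules, too_demographic)
```

Applying a sequence of such operators, each disposing of many rules at once, is what makes the expert's workload scale sublinearly in the number of discovered patterns.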
The dissertation also explores various theoretical properties of sequences of validation operators, facilitating a better understanding of the validation process. We have also addressed the problem of finding optimal validation sequences and have shown that certain formulations of this problem are NP-complete. In addition, we provide heuristics for addressing this problem.
Finally, we have tested our rule validation approach on several real-life applications, including personalization and bioinformatics applications.
Title: Responsive Thinwire Visualization of Large Geographic Datasets
Candidate: Been, Kenneth
Advisor(s): Yap, Chee
This thesis describes a web-based, responsive, zooming and panning visualization system for a full-featured geographic description of the United States. Current web-based map servers provide, from a visualization standpoint, little more than one static image per page, with hyperlinks for navigation; continuous zooming and panning requires locally stored data. Our primary contribution is a multi-threaded, scalable and responsive client-server architecture that responds to user requests as naturally and quickly as possible, regardless of network bandwidth and reliability. This architecture can be generalized for use in other applications, including non-geographic ones. To this we add a scalable and flexible user interface for navigation of multi-scale geographic data, with intuitive zooming and panning, pop-up feature labels, and a user-controlled tree-hierarchy of windows. We build software tools and algorithms for translating the U.S. Census Bureau's TIGER data into a format designed for speedy database retrieval and network delivery, and for generalizing the data into multiple levels of detail. Because of anomalies in the TIGER data, this processing requires some human intervention.
Title: Representing and Modifying Complex Surfaces
Candidate: Biermann, Henning
Advisor(s): Zorin, Denis
The increasing demand for highly detailed geometric models poses new and important problems in computer graphics and geometric modeling. Applications for complex models range from geometric design and scientific simulations to feature movies and video games.
We focus on the fundamental problem of creating and manipulating complex surface models. We address the problem by designing an efficient and general surface representation, and develop algorithms for efficient modification of surfaces represented in this form. Our surface representation extends existing subdivision-based representations with explicit representation of sharp features and boundaries, which is crucial in many computer-aided design applications.
We consider two types of surface modifications: boolean operations on solids bounded by surfaces, and surface pasting. For boolean operations, our technique rapidly and robustly computes an approximate result rather than aiming for the precise solution. At the same time, our approach allows one to trade speed for accuracy and, in most cases, compute the result with any desired accuracy. The second type of editing operation we consider addresses the problem of transferring geometric features between different objects. Our technique makes it easy to combine geometric data from various sources (e.g., 3D scanning, CAGD models) into a single model.
Title: An Embedded Boundary Integral Solver for the Unsteady Incompressible Navier-Stokes Equations
Author(s): Biros, George; Ying, Lexing; Zorin, Denis
We present a new method for the solution of the unsteady incompressible Navier-Stokes equations. Our goal is to achieve a robust and scalable methodology for two- and three-dimensional incompressible laminar flows. The Navier-Stokes operator discretization is done using boundary integrals and structured-grid finite elements. We use a two-step second-order accurate scheme to advance the equations in time. The convective term is discretized by an explicit, but unconditionally stable, semi-Lagrangian formulation; at each time step we invert a spatial constant-coefficient (modified) Stokes operator. The Dirichlet problem for the modified Stokes operator is formulated as a double-layer boundary integral equation. Domain integrals are computed via finite elements with appropriate forcing singularities to account for the irregular geometry. We use a velocity-pressure formulation which we discretize with bilinear elements (Q1-Q1), giving equal-order interpolation for the velocities and pressures. Stabilization is used to circumvent the div-stability condition for the pressure space. The integral equations are discretized by Nystrom's method. For the specific approximation choices the method is second-order accurate. We present numerical results and discuss the performance and scalability of the method in two dimensions.
Title: On computing the Pareto-optimal solution set in a large scale dynamic network
Candidate: Daruwala, Raoul-Sam
Advisor(s): Mishra, Bud
Let G=(V,E) be a graph with time-dependent edges, where the cost of a path p through the graph is determined by a vector function F(p)=[f_1(p), f_2(p), \dots, f_n(p)], where f_1, f_2, ..., f_n are independent objective functions. When n>1 there is no clear notion of a single ``best'' solution; instead we turn to the idea of Pareto-optimality to define the efficiency of a path. Given the set of paths P through the network, a path p' is Pareto-optimal if no path in P dominates it, i.e., there is no p in P with f_i(p) <= f_i(p') for every objective function f_i and f_j(p) < f_j(p') for some f_j.
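The dominance-based definition of Pareto-optimality translates directly into a filter over cost vectors. The sketch below is only an illustration: it uses hypothetical cost vectors and a naive quadratic-time scan, not the thesis's algorithms, and assumes all objectives are to be minimized:

```python
# Illustrative Pareto filter (minimization): p dominates q if p is no worse
# in every objective and strictly better in at least one. The Pareto-optimal
# set keeps exactly the non-dominated cost vectors.

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_optimal(costs):
    """Naive O(n^2) filter over cost vectors F(p) = [f_1(p), ..., f_n(p)]."""
    return [p for p in costs if not any(dominates(q, p) for q in costs)]

# Example: (travel time, fare) for four hypothetical itineraries.
costs = [(30, 5), (25, 8), (40, 4), (35, 9)]
frontier = pareto_optimal(costs)  # (35, 9) is dominated by both (30, 5) and (25, 8)
```

No single itinerary on the frontier is ``best'': (25, 8) is fastest, (40, 4) is cheapest, and (30, 5) trades between them, which is precisely why the full Pareto set is the right object to compute.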
The problem of planning itineraries on a transportation system involves computing the set of optimal paths through a time-dependent network where the cost of a path is determined by more than one, possibly non-linear and non-additive, cost function. This thesis introduces an algorithmic toolkit for finding the set of Pareto-optimal paths in time-dependent networks in the presence of multiple objective functions.
Multi-criteria path optimization problems are known to be NP-hard; however, by exploiting geometric and periodic properties of the dynamic graphs that model transit networks, we show that it is possible to compute the Pareto-optimal solution sets rapidly without using heuristics. We show that we can solve the itinerary problem in the presence of response-time constraints for a large-scale graph.
Title: Adaptive Service Access in Shared Wireless Environments
Author(s): Fu, Xiaodong; Karamcheti, Vijay
Adaptation to network changes is important to provide applications with seamless service access in a shared wireless environment. Path-based mechanisms, which augment data paths with application-specific ``bridging'' components guided by minimal application input, are promising approaches for providing such support. Although shown to be successful in static network situations, their utility under dynamically changing network conditions has not been well-studied.
In this paper, we investigate this question by studying the performance of a path-based approach, CANS (Composable Adaptive Network Services), in a dynamic environment. We find that the suitability of CANS-like approaches is hampered by inaccurate component models and expensive planning and reconfiguration. We address these problems by extending CANS to support (1) generalized path creation strategies to match different application performance preferences; (2) refined component models that enable adjustment at a finer granularity and more accurately represent the behavior of component compositions; and (3) local planning and reconfiguration mechanisms that improve responsiveness. We present the problems and evaluate our solutions using an image streaming application. The experimental results show that our solutions are effective.
Title: Balancing Neumann-Neumann Preconditioners for Mixed Approximations of Heterogeneous Problems in Linear Elasticity
Author(s): Goldfeld, Paulo; Pavarino, Luca F.; Widlund, Olof B.
Balancing Neumann-Neumann methods are extended to mixed formulations of the linear elasticity system with discontinuous coefficients, discretized with mixed finite or spectral elements with discontinuous pressures.
These domain decomposition methods implicitly eliminate the degrees of freedom associated with the interior of each subdomain and solve iteratively the resulting saddle point Schur complement using a hybrid preconditioner based on a coarse mixed elasticity problem and local mixed elasticity problems with natural and essential boundary conditions. A polylogarithmic bound in the local number of degrees of freedom is proven for the condition number of the preconditioned operator in the constant coefficient case.
Parallel and serial numerical experiments confirm the theoretical results, indicate that they still hold for systems with discontinuous coefficients, and show that our algorithm is scalable, parallel, and robust with respect to material heterogeneities. The results for general heterogeneous problems are also supported in part by our theory.
Title: Overlapping Schwarz Preconditioners for Spectral Nedelec Elements for a Model Problem in H(curl)
Author(s): Hientzsch, Bernhard
A two-level overlapping domain decomposition method is analyzed for a Nedelec spectral element approximation of a model problem appearing in the solution of Maxwell's equations. The overlap between subdomains can consist of entire spectral elements or rectangular subsets of spectral elements. For fixed relative overlap and overlap made from entire elements, the condition number of the method is bounded, independently of the mesh size, the number of subregions, the coefficients and the degree of the spectral elements. In the case of overlap including just parts of spectral elements, a bound linear in the degree of the elements is proven. It is assumed that the coarse and fine mesh are quasi-uniform and shape-regular and that the domain is convex. Arguments that would not require quasi-uniformity of the coarse mesh and convexity of the domain are mentioned. Our work generalizes results obtained for lower-order Nedelec elements in Toselli [Numerische Mathematik (2000) 86:733-752]. Numerical results for the two-level algorithm in two dimensions are also presented, supporting our analysis.
Title: Dual-Primal FETI Methods for Incompressible Stokes and Linearized Navier-Stokes Equations
Author(s): Li, Jing
In this paper, a dual-primal FETI method is developed for solving incompressible Stokes equations approximated by mixed finite elements with discontinuous pressures in three dimensions. The domain of the problem is decomposed into non-overlapping subdomains, and the continuity of the velocity across the subdomain interface is enforced by introducing Lagrange multipliers. By a Schur complement procedure, the indefinite Stokes problem is reduced to a symmetric positive definite problem for the dual variables, i.e., the Lagrange multipliers. This dual problem is solved by a Krylov space method with a Dirichlet preconditioner. At each step of the iteration, both subdomain problems and a coarse problem on a coarse subdomain mesh are solved by a direct method. It is proved that the condition number of this preconditioned dual problem is independent of the number of subdomains and bounded from above by the product of the inverse of the inf-sup constant of the discrete problem and the square of the logarithm of the number of unknowns in the individual subdomain problems. Illustrative numerical results are presented by solving lid-driven cavity problems. This algorithm is also extended to solving the linearized nonsymmetric Navier-Stokes equations.
Title: Dual-Primal FETI Methods for Stationary Stokes and Navier-Stokes Equations
Author(s): Li, Jing
Finite element tearing and interconnecting (FETI) type domain decomposition methods are first extended to solving incompressible Stokes equations. One-level, two-level, and dual-primal FETI algorithms are proposed. Numerical experiments show that these FETI type algorithms are scalable, i.e., the number of iterations is independent of the number of subregions into which the given domain is subdivided. A convergence analysis is then given for dual-primal FETI algorithms both in two and three dimensions.
Extension to solving the linearized nonsymmetric stationary Navier-Stokes equations is also discussed. The resulting linear system is no longer symmetric, and a GMRES method is used to solve the preconditioned linear system. Eigenvalue estimates show that, for small Reynolds number, the nonsymmetric preconditioned linear system is a small perturbation of that in the symmetric case. Numerical experiments also show that, for small Reynolds number, the convergence of the GMRES method is similar to that of solving the symmetric Stokes equations with the conjugate gradient method. The convergence of the GMRES method depends on the Reynolds number: the larger the Reynolds number, the slower the convergence.
Dual-primal FETI algorithms are further extended to nonlinear stationary Navier-Stokes equations, which are solved by using a Picard iteration. In each iteration step, a linearized Navier-Stokes equation is solved by using a dual-primal FETI algorithm. Numerical experiments indicate that convergence of the Picard iteration depends on the Reynolds number, but is independent of both the number of subdomains and the subdomain problem size.
Title: Efficiently Distributing Component-based Applications Across Wide-Area Environments
Author(s): Llambiri, Deni; Totok, Alexander; Karamcheti, Vijay
Distribution and replication of network-accessible applications has been shown to be an effective approach for delivering improved Quality of Service (QoS) to end users. An orthogonal trend seen in current-day network services is the use of component-based frameworks. Even though such component-based applications are natural candidates for distributed deployment, it is unclear if the design patterns underlying component frameworks also enable efficient service distribution in wide-area environments. In this paper, we investigate application design rules and their accompanying system-level support essential to a beneficial and efficient service distribution process. Our study targets the widely used Java 2 Enterprise Edition (J2EE) component platform and two sample component-based applications: Java Pet Store and RUBiS. Our results present strong experimental evidence that component-based applications can be efficiently distributed in wide-area environments, significantly improving QoS delivered to end users as compared to a centralized solution. Although current design patterns underlying component frameworks are not always suitable, we identify a small set of design rules for orchestrating interactions and managing component state that together enable efficient distribution. Furthermore, we show how enforcement of the identified design rules and automation of pattern implementation can be supported by container frameworks.
Title: Online Codes
Author(s): Maymounkov, Petar
We introduce online codes - a class of near-optimal codes for a very general loss channel which we call the free channel. Online codes are codes with linear-time encoding and decoding, based on sparse bipartite graphs, similar to Tornado codes, with two novel properties: local encodability and ratelessness. Local encodability is the property that each block of the encoding of a message can be computed independently from the others in constant time. This also implies that each encoding block depends on only a constant-sized part of the message and a few preprocessed bits. Ratelessness is the property that each message has an encoding of practically infinite size.
We argue that rateless codes are more appropriate than fixed-rate codes for most situations where erasure codes have been considered a solution. Furthermore, rateless codes open up new areas of application where fixed-rate codes cannot replace them. One such area is information dispersal over peer-to-peer networks.
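The two properties can be made concrete with a toy sketch in the general spirit of sparse-graph codes. This is emphatically not Maymounkov's construction (in particular, the degree distribution and preprocessing are omitted); it only shows why deriving each check block from its id alone yields both local encodability and an unbounded encoding stream:

```python
# Toy rateless encoder: check block i XORs a small pseudo-random subset of
# message blocks, chosen deterministically from the block id alone. Any
# block can therefore be produced independently of the others (local
# encodability), and block ids never run out (ratelessness).

import random

def check_block(message, block_id, max_degree=3):
    rng = random.Random(block_id)          # the id alone seeds the choice
    degree = rng.randint(1, max_degree)
    neighbors = rng.sample(range(len(message)), degree)
    out = 0
    for j in neighbors:
        out ^= message[j]                  # XOR of the chosen message blocks
    return neighbors, out

message = [0x12, 0x34, 0x56, 0x78]
# Blocks 0, 1, 2, ... can be generated in any order, on demand:
stream = [check_block(message, i) for i in range(6)]
```

A receiver that collects slightly more than `len(message)` such blocks can recover the message by iteratively peeling degree-one blocks, which is the decoding style this family of codes relies on.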
Title: Building Secure File Systems Out of Byzantine Storage
Author(s): Mazieres, David; Shasha, Dennis
This paper shows how to implement a trusted network file system on an untrusted server. While cryptographic storage techniques exist that allow users to keep data secret from untrusted servers, this work concentrates on the detection of tampering attacks and stale data. Ideally, users of an untrusted storage server would immediately and unconditionally notice any misbehavior on the part of the server. This ideal is unfortunately not achievable. However, we define a notion of data integrity called fork consistency in which, if the server delays just one user from seeing even a single change by another, the two users will never again see one another's changes - a failure easily detectable with on-line communication. We give a practical protocol for a multi-user network file system called SUNDR, and prove that SUNDR offers fork consistency whether or not the server obeys the protocol.
Title: Image Denoising using a Gaussian Scale Mixture in the Wavelet Domain
Author(s): Portilla, Javier; Strela, Vasily; Wainwright, Martin J.; Simoncelli, Eero P.
We describe a method for removing noise from digital images, based on a statistical model of the coefficients of an overcomplete multi-scale oriented basis. Neighborhoods of coefficients at adjacent positions and scales are modeled as the product of two independent random variables: a Gaussian vector and a hidden positive scalar multiplier. The latter modulates the local variance of the coefficients in the neighborhood, and is thus able to account for the empirically observed correlation between the amplitudes of pyramid coefficients. Under this model, the Bayesian least squares estimate of each coefficient reduces to a weighted average of the local linear (Wiener) estimate over all possible values of the hidden multiplier variable. We demonstrate through simulations with images contaminated by additive Gaussian noise of known covariance that the performance of this method substantially surpasses that of previously published methods, both visually and in terms of mean squared error. In addition, we demonstrate the performance of the algorithm in removing sensor noise from high-ISO digital camera images.
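In symbols (a standard way of writing such a Gaussian scale mixture estimator, stated here for concreteness rather than quoted from the paper): if a neighborhood of noisy coefficients is y = x + w, with the signal modeled as x = \sqrt{z} u for a Gaussian vector u with covariance C_u, a hidden positive multiplier z, and noise covariance C_w, then the Bayesian least squares estimate described above is

E[x|y] = \int p(z|y) E[x|y,z] dz,  where  E[x|y,z] = z C_u (z C_u + C_w)^{-1} y.

Each E[x|y,z] is a local linear (Wiener) estimate for a fixed multiplier value, and the posterior p(z|y) supplies the weights of the average.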
Title: Informative Features in Vision and Learning
Candidate: Rudra, Archisman
Advisor(s): Geiger, Davi
We explore the role of features in solving problems in computer vision and learning. Features capture important domain-dependent knowledge and are fundamental in simplifying problems. Our goal is to consider the universal features of the problem concerned, not just the particular algorithms used in its solution. Such an approach reveals only the fundamental difficulties of a problem, whereas for most real problems we would also face a host of specialized concerns. Therefore, we consider simplified problems which capture the essence of our approach.
This thesis consists of two parts. First, we explore means of discovering features. We arrive at an information-theoretic criterion for identifying features, one with deep connections to statistical estimation theory. We consider features to be ``nice'' representations of objects. We find that, ideally, a feature-space representation of an image is the most concise representation that captures all available information in the image. In practice, however, we are satisfied with an approximation. We therefore explore a few such approximations and explain their connection to the information-theoretic approach. We examine the algorithms which implement these approximations and their generalizations in the related field of stereo vision.
Using features, whether they come from a feature-discovery algorithm or are hand-crafted, is usually an ad hoc process which depends on the actual problem and the exact representation of the features. This diversity mostly arises from the multitude of ways features capture information. In the second part of this thesis, we develop an architecture which lets us use features in a very flexible way, in the context of content-addressable memories. We apply this approach to two radically different domains: face images and English words. We also look at human performance in reconstructing words from fragments, which gives us some information about the memory subsystem in human beings.
Title: Workload Characterization of a Personalized Web Site - And Its Implications for Dynamic Content Caching
Author(s): Shi, Weisong; Wright, Randy; Collins, Eli; Karamcheti, Vijay
Requests for dynamic and personalized content increasingly dominate current-day Internet traffic; however, traditional caching architectures are not well suited to cache such content. Several recently proposed techniques, which exploit reuse at the sub-document level, promise to address this shortcoming, but require a better understanding of the workloads seen on web sites that serve such content. In this paper, we study the characteristics of a medium-sized personalized web site, NYUHOME, which is a customizable portal used by approximately 44,000 users from the New York University community. Our study characterizes in detail both server-side overheads and client-perceived request latencies. We then use these statistics to derive general implications for efficient caching and edge generation of dynamic content in the context of our ongoing CONCA project. Our study verifies both the need for and the likely benefit from caching content at sub-document granularity, and points to additional opportunities for reducing client-perceived latency using prefetching, access prediction, and content transcoding.
Title: StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Author(s): Zhu, Yunyue; Shasha, Dennis
Consider the problem of monitoring tens of thousands of time series data streams in an online fashion and making decisions based on them. In addition to single stream statistics such as average and standard deviation, we also want to find high correlations among all pairs of streams. A stock market trader might use such a tool to spot arbitrage opportunities. This paper proposes efficient methods for solving this problem based on Discrete Fourier Transforms and a three level time interval hierarchy. Extensive experiments on synthetic data and real world financial trading data show that our algorithm beats the direct computation approach by several orders of magnitude. It also improves on previous Fourier Transform approaches by allowing the efficient computation of time-delayed correlation over any size sliding window and any time delay. Correlation also lends itself to an efficient grid-based data structure. The result is the first algorithm that we know of to compute correlations over thousands of data streams in real time. The algorithm is incremental, has fixed response time, and can monitor the pairwise correlations of 10,000 streams on a single PC. The algorithm is embarrassingly parallelizable.
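The core Fourier idea behind this style of stream monitoring can be sketched briefly. This is a simplified illustration, not the paper's full three-level hierarchy or its incremental updates: after normalizing each window to zero mean and unit norm, corr(x, y) = 1 - ||x - y||^2 / 2, and since the DFT preserves Euclidean distances (Parseval's theorem), keeping only the first few Fourier coefficients yields a cheap correlation approximation:

```python
# Approximate correlation from a truncated DFT of normalized windows.
import numpy as np

def normalize(x):
    """Zero-mean, unit-norm version of a window, so that
    corr(x, y) = 1 - ||x - y||^2 / 2 for normalized x, y."""
    x = x - x.mean()
    return x / np.linalg.norm(x)

def approx_corr(x, y, k=8):
    """Approximate corr(x, y) from the first k DFT coefficients."""
    fx = np.fft.fft(normalize(x))[:k] / np.sqrt(len(x))
    fy = np.fft.fft(normalize(y))[:k] / np.sqrt(len(y))
    # Real input: double the non-DC terms to account for the
    # conjugate-symmetric half of the spectrum we discarded.
    d2 = np.abs(fx[0] - fy[0]) ** 2 + 2 * np.sum(np.abs(fx[1:] - fy[1:]) ** 2)
    return 1 - d2 / 2
```

Because most of the energy of smooth series concentrates in the low frequencies, comparing k coefficients instead of full windows is what makes screening all pairs of thousands of streams tractable.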