Instructions for submitting a technical report or thesis.
Title: Fast and Cheap Genome wide Haplotype Construction via Optical Mapping
Author(s): Anantharaman, Thomas; Mysore, Venkatesh; Mishra, Bud
Abstract:
We describe an efficient algorithm to construct genome wide haplotype restriction maps of an individual by aligning single molecule DNA fragments collected with Optical Mapping technology. Using this algorithm and small amount of genomic material, we can construct the parental haplotypes for each diploid chromosome for any individual, one from the father and the other from the mother. Since such haplotype maps reveal the polymorphisms due to single nucleotide differences (SNPs) and small insertions and deletions (RFLPs), they are useful in association studies, studies involving genomic instabilities in cancer, and genetics. For instance, such haplotype restriction maps of individuals in a population can be used in association studies to locate genes responsible for genetics diseases with relatively low cost and high throughput. If the underlying problem is formulated as a combinatorial optimization problem, it can be shown to be NP-complete (a special case of K-population problem). But by effectively exploiting the structure of the underlying error processes and using a novel analog of the Baum-Welch algorithm for HMM models, we devise a probabilistic algorithm with a time complexity that is linear in the number of markers. The algorithms were tested by constructing the first genome wide haplotype restriction map of the microbe T. Pseudoana, as well as constructing a haplotype restriction map of a 120 Megabase region of Human chromosome 4. The frequency of false positives and false negatives was estimated using simulated data. The empirical results were found very promising.
Title: Naturally Speaking: A Systems Biology Tool with Natural Language Interfaces
Author(s): Antoniotti, Marco; Lau, Ian T.; Mishra, Bud
Abstract:
This short paper describes a systems biology software tool that can engage in a dialogue with a biologist by responding to questions posed to it in English (or another natural language) regarding the behavior of a complex biological system, and by suggesting a set of facts about the biological system based on a timetested generate and test approach. Thus, this bioinformatics system improves the quality of the interaction that a biologist can have with a system built on rigorous mathematical modeling, but without being aware of the underlying mathematically sophisticated concepts or notations. Given the nature of the mathematical semantics of our Simpathica/XSSYS tool, it was possible to construct a well-founded natural language interface on top of the computational kernel. We discuss our tool and illustrate its use with a few examples. The natural language subsystem is available as an integrated subsystem of the Simpathica/XSSYS tool and through a simple Web-based interface; we describe both systems in the paper. More details about the system can be found at: http://bioinformatics.nyu.edu, and its sub-pages.
Title: Practical Packrat Parsing
Author(s): Grimm, Robert
Abstract:
A considerable number of research projects are exploring how to extend object-oriented programming languages such as Java with, for example, support for generics, multiple dispatch, or pattern matching. To keep up with these changes, language implementors need appropriate tools. In this context, easily extensible parser generators are especially important because parsing program sources is a necessary first step for any language processor, be it a compiler, syntax-highlighting editor, or API documentation generator. Unfortunately, context-free grammars and the corresponding LR or LL parsers, while well understood and widely used, are also unnecessarily hard to extend. To address this lack of appropriate tools, we introduce Rats!, a parser generator for Java that supports easily modifiable grammars and avoids the complexities associated with altering LR or LL grammars. Our work builds on recent research on packrat parsers, which are recursive descent parsers that perform backtracking but also memoize all intermediate results (hence their name), thus ensuring linear-time performance. Our work makes this parsing technique, which has been developed in the context of functional programming languages, practical for object-oriented languages. Furthermore, our parser generator supports simpler grammar specifications and more convenient error reporting, while also producing better performing parsers through aggressive optimizations. In this paper, we motivate the need for more easily extensible parsers, describe our parser generator and its optimizations in detail, and present the results of our experimental evaluation.
Title: Partitionable Services Framework: Seamless Access to Distributed Applications
Candidate: Ivan, Anca
Advisor(s): Karamcheti, Vijay
Abstract:
A key problem in contemporary distributed systems is how to satisfy user quality of service (QoS) requirements for distributed applications deployed in heterogeneous, dynamically changing environments spanning multiple administrative domains.
An attractive solution is to create an infrastructure which satisfies user QoS requirements by automatically and transparently adapting distributed applications to any environment changes with minimum user input. However, successful use of this approach requires overcoming three challenges: (1) Capturing the application behavior and its relationship with the environment as a set of compact local specifications, using both general, quantitative (e.g., CPU usage) and qualitative (e.g., security) properties. Such information should be sufficient to reason about the global behavior of the application deployment. (2) Finding the ``best'' application deployment that satisfies both application and user requirements, and the various domain policies. The search algorithm should be complete, efficient, scalable with regard to application and network sizes, and guarantee optimality (e.g., resources consumed by applications). (3) Ensuring that the found deployments are practical and efficient, i.e., that the efficiency of automatic deployments is comparable with the efficiency of hand-tuned solutions.
This dissertation describes three techniques that address these challenges in the context of component-based applications. The modularity and reusability of the latter enable automatic deployments while supporting reasoning about the global connectivity based on the local information exposed by each component. The first technique extends the basic component-based application model with information about conditions and effects of component deployments and linkages, together with interactions between components and the network. The second technique uses AI planning to build an efficient and scalable algorithm which exploits the expressivity of the application model to find an application deployment that satisfies user QoS and application requirements. The last technique ensures that application deployments are both practical and efficient, by leveraging language and run-time system support to automatically customize components, as appropriate for the desired security and data consistency guarantees. These techniques are implemented as integral parts of the Partitionable Services Framework (PSF), a Java-based framework which flexibly assembles component-based applications to suit the properties of their environment. PSF facilitates on-demand, transparent migration and replication of application components at locations closer to clients, while retaining the illusion of a monolithic application.
The benefits of PSF are evaluated by deploying representative component-based applications in an environment simulating fast and secure domains connected by slow and insecure links. Analysis of the programming and the deployment processes shows that: (1) the code modifications required by PSF are minimal,(2) PSF appropriately adapts the deployments based on the network state and user QoS requirements, (3) the run-time deployment overheads incurred by PSF are negligible compared to the application lifetime, and (4) the efficiency of PSF-deployed applications matches that of hand-crafted solutions.
Title: Sekitei: An AI planner for Constrained Component Deployment in Wide-Area Networks
Author(s): Kichkaylo, Tatiana; Ivan, Anca; Karamcheti, Vijay
Abstract:
Wide-area network applications are increasingly being built using component-based models, enabling integration of diverse functionality in modules distributed across the network. In such models, dynamic component selection and deployment enables an application to flexibly adapt to changing client and network characteristics, achieve loadbalancing, and satisfy QoS requirements. Unfortunately, the problem of finding a valid component deployment is hard because one needs to decide on the set of components while satisfying various constraints resulting from application semantic requirements, network resource limitations, and interactions between the two. In this paper, we describe a general model for the component placement problem and present an algorithm for it, which is based on AI planning algorithms. We validate the effectiveness of our algorithm by demonstrating its scalability with respect to network size and number of components in the context of deployments generated for two example applications a security-sensitive mail service, and a webcast service in a variety of network environments.
Title: Dual-Primal FETI Methods for Linear Elasticity
Author(s): Klawonn, Axel; Widlund, Olof B.
Abstract:
Dual-Primal FETI methods are nonoverlapping domain decomposition methods where some of the continuity constraints across subdomain boundaries are required to hold throughout the iterations, as in primal iterative substructuring methods, while most of the constraints are enforced by Lagrange multipliers, as in one-level FETI methods. The purpose of this article is to develop strategies for selecting these constraints, which are enforced throughout the iterations, such that good convergence bounds are obtained, which are independent of even large changes in the stiffnesses of the subdomains across the interface between them. A theoretical analysis is provided and condition number bounds are established which are uniform with respect to arbitrarily large jumps in the Young's modulus of the material and otherwise only depend polylogarithmically on the number of unknowns of a single subdomain.
Title: VALIS: A Multi-language System for Rapid Prototyping in Computational Biology
Candidate: Paxia, Salvatore
Advisor(s): Mishra, Bud
Abstract:
Bioinformatics is a challenging area for computer science, since the underlying computational formalisms span database systems, numerical methods, geometric modeling and visualization, imaging and image analysis, combinatorial algorithms, data analysis and mining, statistical approaches, and reasoning under uncertainty.
This thesis describes the Valis environment for rapid application prototyping in bioinformatics. The core components of the Valis system are the underlying database structure and the algorithmic development platform.
This thesis presents a novel set of data structures that has marked advantages when dealing with unstructured and unbounded data that are common in scientific fields and bioinformatics.
Bioinformatics problems rarely have a one-language, one-platform solution. The Valis environment allows seamless integration between scripts written in different programming languages and includes tools to rapidly prototype graphical user interfaces.
To date the speed of computation of most whole genome analysis tools have stood in the way of developing fast interactive programs that may be used as exploratory tools. This thesis presents the basic algorithms and widgets that permit rapid prototyping of whole genomic scale real-time applications within Valis.
Title: Thick Surfaces: Interactive Modeling of Topologically Complex Geometric Details
Candidate: Peng, Jianbo
Advisor(s): Zorin, Denis
Abstract:
Lots of objects in computer graphics applications are represented by surfaces. It works very well for objects of simple topology, but can get prohibitively expensive for objects with complex small-scale geometrical details.
Volumetric textures aligned with a surface can be used to add topologically complex geometric details to an object, while retaining an underlying simple surface structure. The simple surface structure provides great controllability on the overall shape of the model, and volumetric textures handle geometric details and topological changes efficiently.
Adding a volumetric texture to a surface requires more than a conventional twodimensional parameterization: a part of the space surrounding the surface has to be parameterized. Another problem with using volumetric textures for adding geometric detail is the difficulty of the rendering of implicitly represented surfaces, especially when they are changed interactively.
We introduce thick surfaces to represent objects with topologically complex geometric details. A thick surface consists of three components. First, a base mesh of simple structure is used to approximate the overall shape of the object. Second, a layer of space along the base mesh is parameterized. We define the layer of space as a shell, which covers the geometric details of the object. Third, volumetric textures of geometric details are mapped into the shell. The object is represented as the implicit surface encoded by the volumetric textures. Places without volumetric textures are filled with patches of the base mesh.
We present algorithms for constructing a shell around a surface and rendering a volumetric-textured surface. Mipmap technique for volumetric textures is explored as well. The gradient field of a generalized distance function is used to construct a non-self-intersecting shell, which has other properties desirable for volumetric texture mapping. The rendering algorithm is designed and implemented on NVIDIA GeForceFX video chips. Finally we demonstrate a number of interactive operations that these algorithms enable.
Title: TM-LPSAT: Encoding Temporal Metric Planning in Continuous Time
Candidate: Shin, Ji-Ae
Advisor(s): Davis, Ernest
Abstract:
In any domain with change, the dimension of time is inherently involved. Whether the domain should be modeled in discrete time or continuous time depends on aspects of the domain to be modeled. Many complex real-world domains involve continuous time, resources, metric quantities and concurrent actions. Planning in such domains must necessarily go beyond simple discrete models of time and change.
In this thesis, we show how the SAT-based planning framework can be extended to generate plans of concurrent asynchronous actions that may depend on or make change piecewise linear metric constraints in continuous time.
In the SAT-based planning framework, a planning problem is formulated as a satisfiability problem of a set of propositional constraints (axioms) such that any model of the axioms corresponds to a valid plan. There are two parameters to a SAT-based planning system: an encoding scheme for representing plans of bounded length and a propositional SAT solver to search for a model. The LPSAT architecture is composed of a SAT solver integrated with a linear arithmetic constraint solver in order to deal with metric aspects of domains.
We present encoding schemes for temporal models of continuous time defined in PDDL+: ( i ) Durative actions with discrete and/or continuous changes; (ii) Real-time temporal model with exogenous events and autonomous processes capturing continuous changes. The encoding represents, in a CNF formula over arithmetic constraints and propositional fluents, time-stamped parallel plans possibly with concurrent continuous and/or discrete changes. In addition, we present encoding schemes for multi-capacity resources, partitioned interval resources, and metric quantities which are represented as intervals. An interval type can be used as a parameter to action as well as a fluent type.
Based on the LPSAT engine, the TM-LPSAT temporal metric planner has been implemented: Given a PDDL+ representation of a planning problem, the compiler of TM-LPSAT translates it in a CNF formula, which is fed into the LPSAT engine to find a solution corresponding to a plan for the planning problem. We also have experimented on our temporal metric encodings with other decision procedure, MathSAT, which deals with propositional combinations of linear constraints and Boolean variables. The results show that in terms of searching time the SAT-based approach to temporal metric planning can be comparable to other planning approaches and there is plenty of room to push further the limits of the SAT-based approach.
Title: Optical flow estimation as distributed optimization problem - an aVLSI implementation
Author(s): Stocker, Alan
Abstract:
I present a new focal-plane analog VLSI sensor that estimates optical flow in two visual dimensions. The chip significantly improves previous approaches both with respect to the applied model of optical flow estimation as well as the actual hardware implementation. Its distributed computational architecture consists of an array of locally connected motion units that collectively solve for the unique optimal optical flow estimate. The novel gradient-based motion model assumes visual motion to be translational, smooth and biased. The model guarantees that the estimation problem is computationally well-posed regardless of the visual input. Model parameters can be globally adjusted, leading to a rich output behavior. Varying the smoothness strength, for example, can provide a continuous spectrum of motion estimates, ranging from normal to global optical flow. Unlike approaches that rely on the explicit matching of brightness edges in space or time, the applied gradient-based model assures spatiotemporal continuity on visual information. The non-linear coupling of the individual motion units improves the resulting optical flow estimate because it reduces spatial smoothing across large velocity differences. Extended measures of a 30x30 array prototype sensor under real-world conditions demonstrate the validity of the model and the robustness and functionality of the implementation.
Title: Unsupervised Discovery of Extraction Patterns for InformationExtraction
Candidate: Sudo, Kiyoshi
Advisor(s): Grishman, Ralph; Sekine, Satoshi
Abstract:
The task of Information Extraction (IE) is to find specific types of information in natural language text. In particular, *event extraction* identifies instances of a particular type of event or fact (a particular "scenario"), including the entities involved, and fills a database which has been pre-defined for the scenario. As the number of documents available on-line has multiplied, entity extraction has grown in importance for various applications, including tracking terrorist activities from newswire sources and building a database of job postings from the Web, to name a few.
Linguistic contexts, such as predicate-argument relationships, have been widely used as *extraction patterns* to identify the items to be extracted from the text. The cost of creating extraction patterns for each scenario has been a bottleneck limiting the portability of information extraction systems to different scenarios, although there has been some research on semi-supervised pattern discovery procedures to reduce this cost. The challenge is to develop a fully automatic method for identifying extraction patterns for a scenario specified by the user.
This dissertation presents a novel approach for the unsupervised discovery of extraction patterns for event extraction from raw text. First, we present a framework that allows the user to have a self-customizing information extraction system for his/her query: the Query-Driven Information Extraction (QDIE) framework. The input to the QDIE framework is the user's query: either a set of keywords or a narrative description of the event extraction task.
Second, we assess the improvement in extraction pattern models. By considering the shortcomings of the prior work based on predicate-argument models and their extensions, we propose a novel extraction pattern model that is based on arbitrary subtrees of dependency trees.
Third, we address the issue of portability across languages. As a case study of the QDIE framework, we implemented a pre-CODIE system, a Cross-Lingual On-Demand Information Extraction system requiring minimal human intervention, which incorporates the QDIE framework as a component for pattern discovery. In addition, we assess the role of machine translation in cross-lingual information extraction by comparing translation-based implementations.
Title: Three-level BDDC in Two Dimensions
Author(s): Tu, Xuemin
Abstract:
BDDC methods are nonoverlapping iterative substructuring domain decomposition methods for the solutions of large sparse linear algebraic systems arising from discretization of elliptic boundary value problems. They are similar to the balancing Neumann-Neumann algorithm. However, in BDDC methods, a small number of continuity constraints are enforced across the interface, and these constraints form a new coarse, global component. An important advantage of using such constraints is that the Schur complements that arise in the computation willa ll be strictly positive definite. The coarse problem is generated and factored by a direct solver at the beginning of the computation. However, this problem can ultimately become a bottleneck, if the number of subdomains is very large. In this paper, two three-level BDDC methods are introduced for solving the coarse problem approximately in two dimensional space, while still maintaining a good convergence rate. Estimates of the condition numbers are provided for the two three-level BDDC methods and numerical experiments are also discussed.
Title: An Efficient and High-Order Accurate Boundary Integral Solver for the Stokes Equations in Three Dimensional Complex Geometries
Candidate: Ying, Lexing
Advisor(s): Zorin, Denis
Abstract:
This dissertation presents an efficient and high-order boundary integral solver for the Stokes equations in complex 3D geometries. The targeted applications of this solver are the flow problems in domains involving moving boundaries. In such problems, traditional finite element methods involving 3D unstructured mesh generation expe- rience difficulties. Our solver uses the indirect boundary integral formulation and discretizes the equation using the Nyström method.
Although our solver is designed for the Stokes equations, we show that it can be generalized to other constant coefficient elliptic partial differential equations (PDEs) with non-oscillatory kernels.
First, we present a new geometric representation of the domain boundary. This scheme takes quadrilateral control meshes with arbitrary geometry and topology as input, and produces smooth surfaces approximating the control meshes. Our surfaces are parameterized over several overlapping charts through explicit nonsingular C ^{ ∞ } parameterizations, depend linearly on the control points, have fixed-size local support for basis functions, and have good visual quality.
Second, we describe a kernel independent fast multipole method (FMM) and its parallel implementation. The main feature of our algorithm is that it is based only on kernel evaluation and does not require the multipole expansions of the underlying kernel. We have tested our method on kernels from a wide range of elliptic PDEs. Our numerical results indicate that our method is efficient and accurate. Other ad- vantages include the simplicity of the implementation and its immediate extension to other elliptic PDE kernels. We also present an MPI based parallel implementation which scales well up to thousands of processors.
Third, we present a framework to evaluate the singular integrals in our solver. A singular integral is decomposed into a smooth far field part and a local part that contains the singularity. The smooth part of the integral is integrated using the trape- zoidal rule over overlapping charts, and the singular part is integrated in the polar coordinates which removes or decreases the order of singularity. We also describe a novel algorithm to integrate the nearly singular integrals coming from the evaluation at points close to the boundary.
Title: High Performance Data Mining in Time Series: Techniques and Case Studies
Candidate: Zhu, Yunyue
Advisor(s): Shasha, Dennis
Abstract:
Note: A significantly improved and expanded description of this material is available in the book High Performance Discovery in Time Series Springer Verlag 2004 ISBN 0-387-00857-8.
As extremely large time series data sets grow more prevalent in a wide variety of settings, we face the significant challenge of developing efficient analysis methods. This dissertation addresses the problem in designing fast, scalable algorithms for the analysis of time series.
The first part of this dissertation describes the framework for high performance time series data mining based on important primitives. Data reduction trasform such as the Discrete Fourier Transform, the Discrete Wavelet Transform, Singular Value Decomposition and Random Projection, can reduce the size of the data without substantial loss of information, therefore provides a synopsis of the data. Indexing methods organize data so that the time series data can be retrieved efficiently. Transformation on time series, such as shifting, scaling, time shifting, time scaling and dynamic time warping, facilitates the discovery of flexible patterns from time series.
The second part of this dissertation integrates the above primitives into useful applications ranging from music to physics to finance to medicine.
StatStream
StatStream is a system based on fast algorithms for finding the most highly correlated pairs of time series from among thousands of time series streams and doing so in a moving window fashion. It can be used to find correlations in time series in finance and in scientific applications.
HumFinder
Most people hum rather poorly. Nevertheless, somehow people have some idea what we are humming when we hum. The goal of the query by humming program, HumFinder, is to make a computer do what a person can do. Using pitch translation, time dilation, and dynamic time warping, one can match an inaccurate hum to a melody remarkably accurately.
OmniBurst
Burst detection is the activity of finding abnormal aggregates in data streams. Our software, OmniBurst, can detect bursts of varying durations. Our example applications are monitoring gamma rays and stock market price volatility. The software makes use of a shifted wavelet structure to create a linear time filter that can guarantee that no bursts will be missed at the same time that it guarantees (under a reasonable statistical model) that the filter eliminates nearly all false positives.