Title: Learning Representations of Text through Language and Discourse Modeling: From Characters to Sentences
Candidate: Jernite, Yacine
Advisor(s): Sontag, David
Abstract:
In this thesis, we consider the problem of obtaining a representation of the meaning expressed in a text. How to do so correctly remains a largely open problem, combining a number of inter-related questions (e.g., what is the role of context in interpreting text? how should language understanding models handle compositionality?). In this work, after reflecting on some of these questions and describing the most common sequence modeling paradigms used in recent work, we focus on two in particular: the level of granularity at which text should be read, and the training objectives that can lead models to learn useful representations of a text’s meaning.
In the first part, we argue for the use of sub-word information for that purpose, and present new neural network architectures that can either process words in a way that takes advantage of morphological information, or do away with word separations altogether while still identifying relevant units of meaning.
The second part starts by arguing for the use of language modeling as a learning objective, and provides algorithms that address its scalability issues and yield a globally rather than locally normalized probability distribution. It then explores the question of what makes a good language learning objective, and introduces discriminative objectives, inspired by the notion of discourse coherence, which help learn a representation of the meaning of sentences.
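The discourse-coherence intuition behind these discriminative objectives can be illustrated with a toy setup: a scorer should rank the true next sentence above a randomly drawn one. The sketch below is purely illustrative; a bag-of-words cosine stands in for the learned neural sentence encoders the thesis develops.

```python
# Toy discriminative coherence objective: the true next sentence should
# score higher than a randomly drawn "negative" sentence. A bag-of-words
# cosine stands in for a learned neural sentence encoder (illustration only).
import math
from collections import Counter

def encode(sentence):
    return Counter(sentence.lower().split())

def score(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def margin_loss(context, true_next, negative, margin=0.1):
    # Hinge loss: zero once the true continuation outscores the
    # negative sample by at least the margin.
    return max(0.0, margin - score(context, true_next) + score(context, negative))

ctx = encode("the cat sat on the mat")
pos = encode("the cat then chased a mouse")
neg = encode("stock prices rose sharply today")
```

Training a real encoder would minimize this loss over many (context, next, negative) triples; here the coherent continuation already scores higher.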
Title: Deep Learning for Information Extraction
Candidate: Nguyen, Thien Huu
Advisor(s): Grishman, Ralph
Abstract:
The explosion of data has made it crucial to analyze the data and distill important information effectively and efficiently. A significant part of such data is presented in unstructured, free-text documents. This has prompted the development of information extraction techniques that allow computers to automatically extract structured information from natural free-text data. Information extraction is a branch of natural language processing in artificial intelligence with a wide range of applications, including question answering, knowledge base population, and information retrieval. The traditional approach to information extraction has mainly involved hand-designing large feature sets (feature engineering) for different information extraction problems, such as entity mention detection, relation extraction, coreference resolution, event extraction, and entity linking. This approach is limited by the laborious and expensive effort required to engineer features for each new domain, and it suffers from the unseen word/feature problem of natural languages.
This dissertation explores a different approach for information extraction that uses deep learning to automate the representation learning process and generate more effective features. Deep learning is a subfield of machine learning that uses multiple layers of connections to reveal the underlying representations of data. I develop the fundamental deep learning models for information extraction problems and demonstrate their benefits through systematic experiments.
First, I examine word embeddings, a general word representation that is produced by training a deep learning model on a large unlabelled dataset. I introduce methods to use word embeddings to obtain new features that generalize well across domains for relation extraction. This is done for both the feature-based method and the kernel-based method of relation extraction.
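The generalization benefit of using embeddings as features can be sketched as follows; the two-dimensional vectors below are hand-made stand-ins for embeddings trained on a large unlabelled corpus (all names and values hypothetical).

```python
# Toy illustration: dense word embeddings as features for relation
# extraction. The vectors are hand-made stand-ins for embeddings trained
# on a large unlabelled corpus.
import math

EMB = {
    "acquired": [0.90, 0.10], "bought": [0.85, 0.15],   # ownership-like verbs
    "visited":  [0.10, 0.90], "toured": [0.15, 0.85],   # travel-like verbs
}

def features(word):
    # A sparse indicator feature like "word=bought" gives no signal for a
    # related but unseen word; a dense embedding shares dimensions instead.
    return EMB.get(word, [0.0, 0.0])

def similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))
```

A classifier trained with "bought" as a feature can thus transfer to "acquired", which a one-hot feature scheme would treat as entirely unseen.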
Second, I investigate deep learning models for different problems, including entity mention detection, relation extraction and event detection. I develop new mechanisms and network architectures that allow deep learning to model the structures of information extraction problems more effectively. Extensive experiments are conducted in domain adaptation and transfer learning settings to highlight the generalization advantage of the deep learning models for information extraction.
Finally, I investigate joint frameworks that simultaneously solve several information extraction problems and benefit from the inter-dependencies among them. I design a novel memory-augmented network for deep learning to properly exploit such inter-dependencies. I demonstrate the effectiveness of this network on two important information extraction problems: event extraction and entity linking.
Title: Accelerating Approximate Simulation with Deep Learning
Candidate: Schlachter, Kristofer
Advisor(s): Perlin, Ken
Abstract:
Once a simulation resorts to an approximate numerical solution, one faces various tradeoffs between accuracy and computation time. We propose that, for two chosen simulations, an alternative approximate solution can be learned that is just as useful but faster to compute. The two problems addressed in this thesis are fluid simulation and the simulation of diffuse inter-reflection in computer graphics.
Real-time simulation of fluid and smoke is a long-standing problem in computer graphics, where state-of-the-art approaches require large compute resources, often making real-time applications impractical. In this work, we propose a data-driven approach that leverages the approximation power of deep-learning methods with the precision of standard fluid solvers to obtain both fast and highly realistic simulations. The proposed method solves the incompressible Euler equations following the standard operator splitting method, in which a large, often ill-conditioned linear system must be solved. We propose replacing this solver with a Convolutional Network (ConvNet) learned from a training set of simulations, using a semi-supervised learning method to minimize long-term velocity divergence.
ConvNets are amenable to efficient GPU implementations and, unlike exact iterative solvers, have fixed computational complexity and latency. The proposed hybrid approach restricts the learning task to a linear projection without modeling the well understood advection and body forces. We present real-time 2D and 3D simulations of fluids and smoke; the obtained results are realistic and show good generalization properties to unseen geometry.
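The projection step being replaced can be sketched on a one-dimensional staggered grid. Here a short Jacobi pressure solve plays the role that the trained ConvNet plays in the hybrid solver; this is an illustration of the operator-splitting step, not the thesis's implementation.

```python
# Sketch of the pressure-projection step in operator splitting: given face
# velocities u, solve a Poisson equation for pressure and subtract its
# gradient so the velocity field becomes divergence-free. In the hybrid
# solver, a trained ConvNet replaces this (expensive) linear solve.
def divergence(u):
    return [u[i + 1] - u[i] for i in range(len(u) - 1)]

def project(u, iters=500):
    d = divergence(u)
    n = len(d)
    p = [0.0] * n
    for _ in range(iters):                  # Jacobi iterations on the Poisson system
        p = [((p[i - 1] if i > 0 else 0.0) +
              (p[i + 1] if i < n - 1 else 0.0) - d[i]) / 2.0
             for i in range(n)]
    g = [0.0] + p + [0.0]                   # zero ghost pressures at the boundary
    return [u[i] - (g[i + 1] - g[i]) for i in range(len(u))]

u = [0.0, 1.0, 0.0, 2.0, 0.0]               # toy velocity field with divergence
u_div_free = project(u)
```

The appeal of a learned projection is that, unlike this iterative solve, a ConvNet has fixed cost per step regardless of how ill-conditioned the system is.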
The next simulation that we address is the synthesis of images for training ConvNets. A challenge with training deep learning models is that they commonly require a large corpus of training data, and collecting sufficient real-world data may be infeasible. A solution to this problem can be found in the use of synthetic or simulated training data. However, for simulated photographs or renderings, there has been no systematic approach to comparing the relative benefits of different image synthesis techniques.
We compare multiple synthesis techniques to one another, as well as to the real data they seek to replicate. We also introduce learned synthesis techniques that either train models better than the most realistic graphical methods used by standard rendering packages, or approach their fidelity using far less computation. We accomplish this by learning the shading of geometry and by denoising the results of low-sample Monte Carlo image synthesis. Our major contributions are (i) a dataset that allows comparison of real and synthetic versions of the same scene, (ii) an augmented data representation that boosts the stability of learning, and (iii) three different partially differentiable rendering techniques where lighting, denoising and shading are learned. Finally, we are able to generate datasets that outperform full global illumination rendering and approach the performance of training on real data.
Title: Elements of Intelligence: Memory, Communication and Intrinsic Motivation
Candidate: Sukhbaatar, Sainbayar
Advisor(s): Fergus, Rob
Abstract:
Building an intelligent agent that can learn and adapt to its environment has always been a challenging task. This is because intelligence consists of many different elements such as recognition, memory, and planning. In recent years, deep learning has shown impressive results in recognition tasks. The aim of this thesis is to extend deep learning techniques to other elements of intelligence.
We start our investigation with memory, an integral part of intelligence that bridges past experience with current decision making. In particular, we focus on episodic memory, which is responsible for storing our past experiences and recalling them. An agent without such memory will struggle at many tasks, such as holding a coherent conversation. We show that a neural network with an external memory is better at such tasks, outperforming traditional recurrent networks with an internal memory.
Another crucial ingredient of intelligence is the capability to communicate with others. In particular, communication is essential for cooperative tasks, enabling agents to better collaborate and improve their division of labor. We investigate whether agents can learn to communicate from scratch without any external supervision. Our finding is that communication through a continuous vector facilitates faster learning by allowing gradients to flow between agents.
Lastly, an intelligent agent must have an intrinsic motivation to learn about its environment on its own without any external supervision or rewards. Our investigation led to one such learning strategy where an agent plays a two-role game with itself. The first role proposes a task, and the second role tries to execute it. Since their goal is to make the other fail, their adversarial interplay pushes them to explore increasingly complex tasks, which results in a better understanding of the environment.
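The two-role self-play dynamic can be caricatured with scalar "skills": Alice gains when Bob fails, so the proposed tasks grow harder, and Bob's competence is pulled along with them. This is a minimal illustrative sketch, not the thesis's actual training setup.

```python
# Minimal sketch of the two-role self-play loop (illustrative only):
# "Alice" proposes a task, "Bob" tries to execute it. Their adversarial
# interplay ratchets both the task difficulty and Bob's competence upward.
import random

def train(episodes=1000, seed=0):
    rng = random.Random(seed)
    alice_skill, bob_skill = 1.0, 0.5                    # toy scalar "skills"
    for _ in range(episodes):
        task_difficulty = rng.uniform(0, alice_skill)    # Alice proposes a task
        if bob_skill >= task_difficulty:                 # Bob executes it
            alice_skill += 0.01   # Bob succeeded: Alice must propose harder tasks
        else:
            bob_skill += 0.01     # Bob failed: this difficulty trains him
    return alice_skill, bob_skill

alice, bob = train()
```

Every episode improves exactly one of the two roles, so exploration difficulty and competence grow together without any external reward signal.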
Title: Predictive Analytics from Noisy, Diverse, Incomplete and Heterogeneous Data
Candidate: Venkataraman, Ashwin
Advisor(s): Jagabathula, Srikanth; Subramanian, Lakshminarayanan
Abstract:
The unique characteristics of big data complicate the extraction of critical information needed for better decision-making. For instance, data comes in diverse types, which vary in the levels of noise and their respective measurement scales (e.g., clicks, ratings, purchase signals, etc.). The data also originates from heterogeneous sources or is incomplete, with several missing observations.
The above data characteristics require us to revisit and rethink many traditional problems because existing techniques fall short. This dissertation revisits two classical problems: (1) segmenting (or clustering) heterogeneous data sources, and (2) learning mixture models. It provides novel, principled, and scalable solutions to deal with the challenges described above.
The first part considers the problem of segmenting customers based on their preferences when the preference signals (e.g., clicks, ratings, etc.) come from a multitude of "big" and "complicated" data sources. We propose a model-based embedding technique which takes the customer preference signals and a probabilistic model class generating the signals as inputs, and outputs an embedding---a low-dimensional representation in Euclidean space---for each customer. We then cluster the embeddings to obtain the segments. We derive precise necessary and sufficient conditions that guarantee asymptotic recovery of the true segments. Using two case studies, including a real-world implementation on eBay data, we show that our method outperforms standard latent class, empirical Bayesian, and demographic-based techniques.
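The model-based embedding pipeline can be sketched end to end on toy data (all data and names hypothetical): the probabilistic model class here is independent Bernoulli click rates per category, whose maximum-likelihood parameters serve as the embedding, which is then clustered.

```python
# Sketch of model-based embedding + clustering: preference signals -> MLE of
# a simple probabilistic model (per-category click rates) -> low-dimensional
# embedding -> k-means segments. Toy illustration, not the thesis's method.
import random

def embed(clicks, impressions):
    # MLE of the Bernoulli model class = per-category click-through rates
    return [c / n for c, n in zip(clicks, impressions)]

def kmeans(points, k=2, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    def nearest(p):
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, [nearest(p) for p in points]

customers = [([9, 1], [10, 10]), ([8, 2], [10, 10]),   # clicks mostly category 1
             ([1, 9], [10, 10]), ([2, 8], [10, 10])]   # clicks mostly category 2
points = [embed(c, n) for c, n in customers]
centers, labels = kmeans(points)
```

The embedding collapses heterogeneous raw signals into a common Euclidean space, so any standard clustering algorithm can produce the segments.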
We also apply similar ideas to propose algorithms for clustering workers based on their "quality" or "reputation" in crowdsourced labeling tasks, where the (noisy) collected labels are aggregated for inferring the (unknown) true labels. Theoretically, we show that our algorithms successfully identify unreliable workers, workers adopting deterministic strategies, and worst-case sophisticated adversaries who can adopt arbitrary labeling strategies to degrade the accuracy of the inferred task labels. Empirically, we show that filtering out low quality workers identified by our algorithms can significantly improve the accuracy of several state-of-the-art label aggregation algorithms in real-world crowdsourcing datasets.
The second part considers the problem of learning mixture models and proposes a new methodology for nonparametric estimation of the mixing distribution. Our main contributions are two-fold: (a) formulating the likelihood-based estimation problem as a constrained convex program and (b) applying the conditional gradient (aka Frank-Wolfe) algorithm to solve this convex program. We show that our method iteratively generates the support of the mixing distribution, so that the algorithm may be terminated at the desired number of mixture components. For mixtures of logit models, we establish that our method has a sublinear rate of convergence to the optimal solution of the convex program. We also characterize the structure of the estimated mixing distribution and show that it bears close resemblance to the notion of consideration sets in existing literature. We test our approach using two case studies on real data, and show that it significantly outperforms the standard expectation-maximization (EM) benchmark on in-sample fit, predictive, and decision accuracy, while being an order of magnitude faster.
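The conditional gradient approach to mixture estimation can be illustrated on a toy Bernoulli mixture: each Frank-Wolfe iteration adds at most one support point to the mixing distribution, so stopping early caps the number of components. This is a sketch under simplifying assumptions (a fixed candidate grid), not the thesis's estimator.

```python
# Sketch of nonparametric mixture estimation via the conditional gradient
# (Frank-Wolfe) algorithm: maximize the mixture log-likelihood over the
# simplex of mixing weights; each iteration moves toward the best vertex,
# adding at most one support point. Toy Bernoulli components on a grid.
import math

def frank_wolfe_mixture(data, n, grid, iters=50):
    """data: observed success counts out of n trials each."""
    lik = [[math.comb(n, x) * q ** x * (1 - q) ** (n - x) for q in grid]
           for x in data]
    w = [0.0] * len(grid)
    w[0] = 1.0                                  # start at a vertex of the simplex
    for t in range(1, iters + 1):
        p = [sum(wk * row[k] for k, wk in enumerate(w)) for row in lik]
        # gradient of the log-likelihood w.r.t. the mixing weights
        grad = [sum(row[k] / pi for row, pi in zip(lik, p))
                for k in range(len(grid))]
        k_star = max(range(len(grid)), key=grad.__getitem__)
        step = 2.0 / (t + 2)                    # standard Frank-Wolfe step size
        w = [(1 - step) * wk for wk in w]
        w[k_star] += step                       # adds at most one support point
    return w

data = [1, 2, 3, 7, 8, 9]                       # counts from two "types" of coins
grid = [i / 10 for i in range(1, 10)]
w = frank_wolfe_mixture(data, n=10, grid=grid)
```

Because the iterate stays a convex combination of visited vertices, the support of the estimated mixing distribution is generated incrementally, which is what allows termination at a desired number of components.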
Title: On Quadtrees, Voronoi Diagrams, and Lattices: Results in Geometric Algorithms
Candidate: Bennett, Huxley
Advisor(s): Yap, Chee
Abstract:
We present several results on geometric algorithms, and somewhat more specifically on algorithmic aspects of geometric structures including quadtrees, Voronoi diagrams, and lattices. Our work contains two parts, the first of which is on subdivision algorithms, and the second of which is on lattice algorithms.
Subdivision algorithms amount to recursively splitting an ambient space into smaller pieces until certain conditions hold. Often the underlying space is a square in the plane (or a box in higher dimensions), whose subdivision is represented by a quadtree (or its higher-dimensional analogs). A quadtree is smooth if any two adjacent leaf boxes differ by at most one in depth. We first study the cost of the smooth split operation in quadtrees, showing that it has constant amortized cost in quadtrees of any fixed dimension.
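The smoothness invariant and the cascading nature of a smooth split can be sketched directly: splitting a leaf first forces any strictly coarser edge-adjacent leaf to split. This toy quadtree (boxes as `(x, y, size)` tuples in the unit square) is illustrative only; the thesis's contribution is the amortized-cost analysis of this cascade.

```python
# Sketch of the smoothness invariant: after a smooth split, edge-adjacent
# leaf boxes differ by at most one in depth. A 4-way split halves the size.
import math

def children(box):
    x, y, s = box
    h = s / 2
    return [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]

def depth(box):
    return round(-math.log2(box[2]))

def adjacent(a, b):
    (ax, ay, asz), (bx, by, bsz) = a, b
    overlap_x = ax < bx + bsz and bx < ax + asz
    overlap_y = ay < by + bsz and by < ay + asz
    touch_x = ax == bx + bsz or bx == ax + asz
    touch_y = ay == by + bsz or by == ay + asz
    return (touch_x and overlap_y) or (touch_y and overlap_x)

def smooth_split(leaves, box):
    # First split any adjacent leaf that is coarser than `box`, so the new
    # children (one level deeper) never face a neighbor two levels up.
    for other in list(leaves):
        if other in leaves and adjacent(box, other) and depth(other) < depth(box):
            smooth_split(leaves, other)
    leaves.remove(box)
    leaves.update(children(box))

leaves = {(0.0, 0.0, 1.0)}
smooth_split(leaves, (0.0, 0.0, 1.0))
smooth_split(leaves, (0.0, 0.0, 0.5))
smooth_split(leaves, (0.0, 0.0, 0.25))
smooth_split(leaves, (0.125, 0.0, 0.125))   # this one triggers a cascade
```

The final call forces two coarser neighbors to split before the requested box does, yet a single such cascade is cheap on average, which is what the constant amortized bound makes precise.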
We then present a subdivision-based algorithm for computing isotopic epsilon-approximations of planar minimization diagrams. Given a family of continuous functions, its minimization diagram partitions the plane into regions on which each function is minimal. Minimization diagrams generalize many natural Voronoi diagrams, and we show how to use our framework to compute an anisotropic Voronoi diagram on polygonal sites. We have implemented a prototype of our algorithm for anisotropic Voronoi diagrams, and we provide experimental results.
We then turn to studying lattice algorithms. A lattice is a regular ordering of points in Euclidean space, which is represented as the set of all integer combinations of some linearly independent vectors (which we call a basis of the lattice). In our first work on lattices, we introduce and study the Lattice Distortion Problem (LDP). LDP asks how "similar" two lattices are, i.e., what the minimum distortion of a linear bijection between two lattices is. We show how to compute low-distortion mappings with a tradeoff between approximation quality and running time based on a notion of basis reduction introduced by Seysen (Combinatorica 1993). We also show that LDP is NP-hard to approximate to within any constant factor (under randomized reductions).
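The basic objects here are easy to make concrete: a lattice is the set of all integer combinations of its basis vectors, and very different-looking bases can generate the same lattice (any two such bases differ by a unimodular transform). A small 2D sketch:

```python
# Toy illustration: enumerating lattice points as integer combinations of a
# basis. B2 has determinant 1, so it generates the same lattice Z^2 as B1.
from itertools import product

def lattice_points(basis, bound):
    (a1, a2), (b1, b2) = basis
    return {(i * a1 + j * b1, i * a2 + j * b2)
            for i, j in product(range(-bound, bound + 1), repeat=2)}

B1 = [(1, 0), (0, 1)]     # standard basis of the integer lattice Z^2
B2 = [(1, 1), (1, 2)]     # determinant 1, hence another basis of Z^2
```

The Lattice Distortion Problem asks, given two such lattices, for the linear bijection between them of minimum distortion; when the lattices are identical, the identity map trivially achieves distortion 1.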
Finally, we study the problem of finding lattice bases that are optimal with respect to two basis quality measures: namely, bases with minimal orthogonality defect, and bases with nearly minimal Seysen condition number. We give algorithms that solve both problems in time that depends only on the rank of the lattice, times a polynomial in the input length.
Title: Improving Event Extraction: Casting a Wider Net
Candidate: Cao, Kai
Advisor(s): Grishman, Ralph
Abstract:
Information extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. One facet of information extraction is event extraction (EE): identifying instances of selected types of events appearing in natural language text. For each instance, EE should identify the type of the event, the event trigger (the word or phrase which evokes the event), the participants in the event, and (where possible) the time and place of the event.
One EE task was defined and intensively studied as part of the ACE (Automatic Content Extraction) research program. The 2005 ACE EE task involved 8 types and 33 subtypes of events. For instance, given the sentence "She was killed by an automobile yesterday.", an EE system should be able to recognize the word "killed" as a trigger for an event of subtype DIE, and discover "an automobile" and "yesterday" as the Agent and Time arguments. This task is quite challenging, as the same event might appear in the form of various trigger expressions and an expression might represent different types of events in different contexts.
To support the development and evaluation of ACE EE systems, the Linguistic Data Consortium annotated a text corpus (consisting primarily of news articles) with information on the events mentioned. This corpus was widely used to train ACE EE systems. However, the event instances in the ACE corpus are not evenly distributed, and so some frequent expressions involving ACE events do not appear in the training data, adversely affecting performance.
The thesis presents several strategies for improving the performance of EE. We first demonstrate the effectiveness of two types of linguistic analysis -- dependency regularization and Abstract Meaning Representation -- in boosting EE performance. Next, we show the benefit of an active learning strategy in which a person is asked to judge a limited number of phrases that may be event triggers. Finally, we report the impact of combining our baseline system with event patterns from a system developed for a different EE task (the TABARI program); these expert-level patterns were generated by other research groups. Because this information is complicated and quite different from the original (ACE) corpus, integrating it requires more complex processing.
Title: Random Growth Models
Candidate: Florescu, Laura
Advisor(s): Spencer, Joel
Abstract:
This work explores variations of randomness in networks, and more specifically, how drastically the dynamics and structure of a network change when a little bit of information is added to "chaos". On one hand, I investigate how much determinism in diffusions de-randomizes the process, and on the other hand, I look at how superposing "planted" information on a random network changes its structure in such a way that the "planted" structure can be recovered.
The first part of the dissertation is concerned with rotor-router walks, a deterministic counterpart to random walk, the mathematical model of a path consisting of a succession of random steps. I study and show results on the volume (the "range") of the territory explored by the random rotor-router model, confirming an old prediction of physicists.
The second major part of the dissertation consists of two constrained diffusion problems. The questions in this model are to understand the long-term behavior of the models, as well as how the boundary of the processes evolves in time.
The third part concerns detecting communities in networks or, more generally, clustering networks. This is a fundamental problem in mathematics, machine learning, biology and economics, both for its theoretical foundations as well as for its practical implications. This problem can be viewed as "planting" some structure in a random network; for example, in cryptography, a code can be viewed as hiding some integers in a random sequence. For such a model with two communities, I show both information-theoretic thresholds for when it is impossible to recover the communities based on the density of the edges "planted" between the communities, as well as thresholds for when it is computationally possible to recover the communities.
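The "planted" two-community model can be sampled in a few lines (a stochastic block model sketch, illustrative only): edges appear with probability p inside a community and q < p across communities, and the recovery thresholds concern how small the gap p − q can get before the hidden labels are lost.

```python
# Toy planted-partition (stochastic block model) sampler: two hidden
# communities, dense within (prob p) and sparse across (prob q < p).
import random

def planted_partition(n, p, q, seed=0):
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]           # the hidden communities
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < (p if labels[i] == labels[j] else q)]
    return labels, edges

labels, edges = planted_partition(200, p=0.5, q=0.05)
within = sum(labels[i] == labels[j] for i, j in edges)
across = len(edges) - within
```

With this large a gap between p and q, within-community edges dominate and recovery is easy; the thesis's thresholds characterize the much harder regimes where the densities nearly coincide.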
Title: Zero-knowledge Proofs: Efficient Techniques for Combination Statements and their Applications
Candidate: Ganesh, Chaya
Advisor(s): Dodis, Yevgeniy
Abstract:
Zero-knowledge proofs provide a powerful tool, which allows a prover to convince a verifier that a statement is true without revealing any further information. It is known that every language in NP has a zero-knowledge proof system, thus opening up several cryptographic applications. While true in theory, designing proof systems that are efficient enough to be used in practice remains challenging. The most common and most efficient systems implemented are based on sigma protocols or on SNARKs (Succinct Non-interactive ARguments of Knowledge). Each approach has its own advantages and shortcomings, and each is suited to certain kinds of statements.
While sigma protocols are efficient for algebraic statements, they are expensive for non-algebraic statements. SNARKs, on the other hand, result in short proofs and efficient verification, and are better suited for proving statements about hash functions. But proving an algebraic statement, for instance, knowledge of discrete logarithm, is expensive as the prover needs to perform public-key operations proportional to the size of the circuit.
Recent work achieves zero-knowledge proofs, based on garbled circuits (GC), that are efficient for statements phrased as Boolean circuits. This approach, again, is expensive for large circuits, in addition to being inherently interactive. Thus, SNARKs and GC-based approaches are better suited for non-algebraic statements, while sigma protocols are efficient for algebraic statements.
But in some applications, one is interested in proving combination statements, that is, statements that have both algebraic and non-algebraic components. The state of the art fails to take advantage of the best of both worlds and has to forgo the efficiency of one approach to obtain that of the other. In this work, we ask how to efficiently prove a statement that is a combination of algebraic and non-algebraic statements.
We first show how to combine the GC-based approach with sigma protocols. Then, we study how to combine sigma protocol proofs with SNARKs to obtain non-interactive arguments for combination statements. We show applications of our techniques to anonymous credentials, and privacy-preserving protocols on the blockchain. Finally, we study garbled circuits as a primitive and present an efficient way of hashing garbled circuits. We show applications of our hashing technique, including application to GC-based zero-knowledge.
Title: Circuit Complexity: New Techniques and Their Limitations
Candidate: Golovnev, Aleksandr
Advisor(s): Dodis, Yevgeniy; Regev, Oded
Abstract:
We study the problem of proving circuit lower bounds. The strongest known lower bound of 3n-o(n) for an explicit function was proven by Blum in 1984. We prove a lower bound of (3+1/86)n-o(n) for affine dispersers for sublinear dimensions.
We introduce the weighted gate elimination method to give an elementary proof of a 3.11n lower bound for quadratic dispersers (although there are currently no explicit constructions of such functions). We also develop a general framework that allows us to turn lower bound proofs into upper bounds for Circuit SAT algorithms.
Finally, we prove strong limitations of the developed techniques.
Title: Unsupervised Learning Under Uncertainty
Candidate: Mathieu, Michael
Advisor(s): LeCun, Yann
Abstract:
Deep learning, in particular neural networks, has achieved remarkable success in recent years. However, most of this success is based on supervised learning, which relies on ever larger datasets and immense computing power. One step towards general artificial intelligence is to build a model of the world, with enough knowledge to acquire a kind of “common sense”. Representations learned by such a model could be reused in a number of other tasks. They would reduce the need for labeled samples and possibly provide a deeper understanding of the problem. The vast quantities of knowledge required to build common sense preclude the use of supervised learning, and suggest relying on unsupervised learning instead.
The concept of uncertainty is central to unsupervised learning. The task is usually to learn a complex, multimodal distribution. Density estimation and generative models aim at representing the whole distribution of the data, while predictive learning consists of predicting the state of the world given the context; more often than not, the prediction is not unique. That may be because the model lacks the capacity or the computing power to make a certain prediction, or because the future depends on parameters that are not part of the observation. Finally, the world can be chaotic or truly stochastic. Representing complex, multimodal continuous distributions with deep neural networks is still an open problem.
In this thesis, we first assess the difficulties of representing probabilities in high dimensional spaces, and review the related work in this domain. We then introduce two methods to address the problem of video prediction, first using a novel form of linearizing auto-encoder and latent variables, and secondly using Generative Adversarial Networks (GANs). We show how GANs can be seen as trainable loss functions to represent uncertainty, then how they can be used to disentangle factors of variation. Finally, we explore a new non-probabilistic framework for GANs.
Title: Fine-scale Structure Design for 3D Printing
Candidate: Panetta, Francis Julian
Advisor(s): Zorin, Denis
Abstract:
Modern additive fabrication technologies can manufacture shapes whose geometric complexities far exceed what existing computational design tools can analyze or optimize. At the same time, falling costs have placed these fabrication technologies within the average consumer's reach. Especially for inexpert designers, new software tools are needed to take full advantage of 3D printing technology.
My thesis develops such tools and demonstrates the exciting possibilities enabled by fine-tuning objects at the small scales achievable by 3D printing. The thesis applies two high-level ideas to invent these tools: two-scale design and worst-case analysis.
The two-scale design approach addresses the problem that accurately simulating---let alone optimizing---geometry at the full resolution one can print requires orders of magnitude more computational power than currently available. However, we can use periodic homogenization to decompose the design problem into a small-scale problem (designing tileable structures achieving a particular deformation behavior) and a macro-scale problem (deciding where to place these structures in the larger object). We can then design structures for every possible deformation behavior and store them in a database, so that they can be re-used for many different macro-scale design problems.
Worst-case analysis refers to determining how likely an object is to fracture by studying the worst possible scenario: the forces most efficiently breaking it. This analysis is needed when the designer has insufficient knowledge or experience to predict what forces an object will undergo, or when the design is intended for use in many different scenarios unknown a priori.
Title: On the Gaussian Measure Over Lattices
Candidate: Stephens-Davidowitz, Noah
Advisor(s): Dodis, Yevgeniy; Regev, Oded
Abstract:
We study the Gaussian mass of a lattice coset \[ \rho_s(\mathcal{L} - \vec{t}) := \sum_{\vec{y} \in \mathcal{L}} \exp(-\pi \|\vec{y} - \vec{t}\|^2/s^2) \; , \] where \(\mathcal{L} \subset \mathbb{R}^n\) is a lattice and \(\vec{t} \in \mathbb{R}^n\) is a vector describing a shift of the lattice. In particular, we use bounds on this Gaussian mass to obtain a partial converse to Minkowski's celebrated theorem bounding the number of lattice points in a ball.
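For intuition, this quantity can be evaluated numerically for the one-dimensional lattice \(\mathcal{L} = \mathbb{Z}\) by truncating the sum, since the terms decay like \(\exp(-\pi y^2/s^2)\); the sketch also illustrates a basic monotonicity property, namely that the mass is largest when the shift \(t\) lands on a lattice point.

```python
# Truncated numerical evaluation of the Gaussian mass rho_s(Z - t); the
# terms decay so fast that a modest radius gives near machine precision.
import math

def gaussian_mass(s, t, radius=40):
    return sum(math.exp(-math.pi * (y - t) ** 2 / s ** 2)
               for y in range(-radius, radius + 1))
```

For s = 1 and t = 0 the sum is the classical theta-function value, approximately 1.086435, and shifting t toward the midpoint between lattice points can only decrease the mass.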
We also consider the discrete Gaussian distribution \(D_{\mathcal{L} - \vec{t}, s}\) induced by the Gaussian measure over \(\mathcal{L} - \vec{t}\), and we use procedures for sampling from this distribution to construct the current fastest known algorithms for the two most important computational problems over lattices, the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP).
Finally, we study \(\rho_s(\mathcal{L} - \vec{t})\) and \(D_{\mathcal{L} - \vec{t}, s}\) as interesting computational and mathematical objects in their own right. In particular, we show that the computational problem of sampling from \(D_{\mathcal{L} - \vec{t}, s}\) is equivalent to CVP in a very strong sense (and that sampling from \(D_{\mathcal{L}, s}\) is no harder than SVP). We also prove a number of bounds on the moments of \(D_{\mathcal{L} - \vec{t}, s}\) and various monotonicity properties of \(\rho_s(\mathcal{L} - \vec{t})\).
Title: Decision Procedures for Finite Sets with Cardinality, and Local Theories Extensions
Candidate: Bansal, Kshitij
Advisor(s): Barrett, Clark; Wies, Thomas
Abstract:
Many tasks in design, verification, and testing of hardware and computer systems can be reduced to checking satisfiability of logical formulas. Certain fragments of first-order logic that model the semantics of prevalent data types and hardware and software constructs, such as integers, bit-vectors, and arrays, are thus of particular interest. The appeal of satisfiability modulo theories (SMT) solvers is that they implement decision procedures for efficiently reasoning about formulas in these fragments. Thus, they can often be used off-the-shelf as automated back-end solvers in verification tools. In this thesis, we expand the scope of SMT solvers by developing decision procedures for new theories of interest in reasoning about hardware and software.
First, we consider the theory of finite sets with cardinality. Sets are a common high-level data structure used in programming; thus, such a theory is useful for modeling program constructs directly. More importantly, sets are a basic construct of mathematics and thus natural to use when mathematically defining the properties of a computer system. We extend a calculus for finite sets to reason about cardinality constraints. The reasoning for cardinality involves tracking how different sets overlap. For an efficient procedure in an SMT solver, we would like to avoid considering Venn regions directly, which has been the approach in earlier work. We develop a novel technique wherein potentially overlapping regions are considered incrementally. We use a graph to track the interaction of the different regions. Additionally, our technique leverages the procedure for reasoning about the other set operations (besides cardinality) in a modular fashion.
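The blow-up that incremental reasoning avoids is easy to quantify: n set variables induce up to 2^n − 1 nonempty Venn regions, each of which would need its own cardinality variable under an eager encoding. A quick illustration:

```python
# Each Venn region of n sets is a nonzero membership pattern (inside or
# outside each set), so their number grows as 2**n - 1. Eagerly introducing
# a cardinality variable per region is therefore exponential in n, which is
# what motivates considering overlap regions incrementally.
from itertools import product

def venn_regions(n):
    return [pattern for pattern in product([0, 1], repeat=n) if any(pattern)]
```

Already at ten set variables there are over a thousand regions, while typical constraints mention only a handful of overlaps.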
Second, a limitation frequently encountered is that verification problems are often not fully expressible in the theories supported natively by the solvers. Many solvers allow the specification of application-specific theories as quantified axioms, but their handling is incomplete outside of narrow special cases. We show how SMT solvers can be used to obtain complete decision procedures for local theory extensions, an important class of theories that are decidable using finite instantiation of axioms. We present an algorithm that uses E-matching to generate instances incrementally during the search, significantly reducing the number of generated instances compared to eager instantiation strategies.
Title: Analyzing Source Code Across Static Conditionals
Candidate: Gazzillo, Paul
Advisor(s): Wies, Thomas
Abstract:
We need better tools for C, such as source browsers, bug finders, and automated refactorings. The problem is that large C systems such as Linux are software product lines, containing thousands of configuration variables controlling every aspect of the software from architecture features to file systems and drivers. The challenge such configurability poses is how software tools can accurately analyze all configurations of the source without the exponential explosion of trying them all separately. To this end, we focus on two key subproblems, parsing and the build system. The contributions of this thesis are the following: (1) a configuration-preserving preprocessor and parser called SuperC that preserves configurations in its output syntax tree; (2) a configuration-preserving Makefile evaluator called Kmax that collects Linux's compilation units and their configurations; and (3) a framework for configuration-aware analyses of source code using these tools.
C tools need to process two languages: C itself and the preprocessor. The latter improves expressivity through file includes, macros, and static conditionals. But it operates only on tokens, making it hard to even parse both languages. SuperC is a complete, performant solution to parsing all of C. First, a configuration-preserving preprocessor resolves includes and macros yet leaves static conditionals intact, thus preserving a program's variability. To ensure completeness, we analyze all interactions between preprocessor features and identify techniques for correctly handling them. Second, a configuration-preserving parser generates a well-formed AST with static choice nodes for conditionals. It forks new subparsers when encountering static conditionals and merges them again after the conditionals. To ensure performance, we present a simple algorithm for table-driven Fork-Merge LR parsing and four novel optimizations. We demonstrate SuperC's effectiveness on the x86 Linux kernel.
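The fork-merge idea can be illustrated with a toy recursive transformation (a hypothetical sketch, not SuperC's table-driven LR machinery): a static conditional in the token stream becomes an explicit choice node rather than forcing the parser to commit to a single configuration.

```python
def fork_merge(stream):
    """Toy fork-merge parse: a static conditional, modeled here as a
    tuple (condition, then_tokens, else_tokens), is handled by forking
    into each branch and merging the results under a choice node."""
    ast = []
    for tok in stream:
        if isinstance(tok, tuple):
            cond, then_branch, else_branch = tok
            ast.append(("choice", cond,
                        fork_merge(then_branch),
                        fork_merge(else_branch)))
        else:
            ast.append(tok)
    return ast

# both branches of CONFIG_X survive in the resulting tree
tree = fork_merge(["f", ("CONFIG_X", ["g"], ["h"]), ";"])
```

The real parser must additionally fork mid-production and merge subparsers that reach the same state, which is where the optimizations mentioned above come in; this sketch only shows the preserved variability in the output tree.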
Large-scale C codebases like Linux are software product families, with complex build systems that tailor the software with myriad features. Such variability management is a challenge for tools, because they need awareness of variability to process all software product lines within the family. With over 14,000 features, processing all of Linux's product lines is infeasible by brute force, and current solutions employ incomplete heuristics. But having the complete set of compilation units with precise variability information is key to static tools such as bug-finders, which could otherwise miss critical bugs, and refactoring tools, since behavior preservation requires a complete view of the software project. Kmax is a new tool for the Linux build system that extracts all compilation units with precise variability information. It processes build system files with a variability-aware make evaluator that stores variables in a conditional symbol table and hoists conditionals around complete statements, while tracking variability information as presence conditions. Kmax is evaluated empirically for correctness and completeness on the Linux kernel. Kmax is compared to previous work for correctness and running time, demonstrating that a complete solution's added complexity incurs only minor latency compared to the incomplete heuristic solutions.
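A conditional symbol table can be sketched as follows (a hypothetical illustration of the idea; the class and method names are invented, not Kmax's API): each make variable maps to a list of (presence condition, value) pairs instead of a single value.

```python
class ConditionalSymbolTable:
    """Toy variability-aware symbol table: every assignment is recorded
    together with the presence condition under which it holds."""
    def __init__(self):
        self.table = {}

    def assign(self, var, value, condition="True"):
        self.table.setdefault(var, []).append((condition, value))

    def lookup(self, var):
        # all possible values of var, each guarded by its condition
        return self.table.get(var, [])

syms = ConditionalSymbolTable()
syms.assign("obj-y", "fork.o")                          # unconditional
syms.assign("obj-y", "exec.o", condition="CONFIG_MMU")  # feature-guarded
```

Looking up `obj-y` then yields every compilation unit it may contain, with the configuration conditions that select each one, which is exactly the information a brute-force per-configuration evaluation would have to rediscover repeatedly.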
SuperC's configuration-preserving parsing of compilation units and Kmax's project-wide capabilities are in a unique position to process source code across all configurations. Bug-finding is one area where such capability is useful. Bugs may appear in untested combinations of configurations, and testing each configuration one-at-a-time is infeasible. For example, a compilation unit that defines a global function called by other compilation units may not be linked into the final program due to configuration variable selection. Such a bug can be found with Kmax and SuperC's cross-configuration capability. Cilantro is a framework for creating variability-aware bug-checkers. Kmax is used to determine the complete set of compilation units and the combinations of features that activate them, while SuperC's parsing framework is extended with semantic actions in order to implement the checkers. A checker for linker errors across all compilation units in the Linux kernel demonstrates each part of the Cilantro framework and is evaluated on the Linux source code.
Title: Semi-Supervised Learning for Electronic Phenotyping in Support of Precision Medicine
Candidate: Halpern, Yonatan
Advisor(s): Sontag, David
Abstract:
Medical informatics plays an important role in precision medicine, delivering the right information to the right person, at the right time. With the introduction and widespread adoption of electronic medical records, in the United States and world-wide, there is now a tremendous amount of health data available for analysis. Electronic record phenotyping refers to the task of determining, from an electronic medical record entry, a concise descriptor of the patient, comprising their medical history, current problems, presentation, etc. In inferring such a phenotype descriptor from the record, a computer, in a sense, "understands" the relevant parts of the record. These phenotypes can then be used in downstream applications such as cohort selection for retrospective studies, real-time clinical decision support, contextual displays, intelligent search, and precise alerting mechanisms.
To handle the incomplete data present in medical records, we use the formal framework of probabilistic graphical models with latent or unobserved variables. The first part of the thesis presents two different structural conditions under which learning with latent variables is computationally tractable. The first is the "anchored" condition, where every latent variable has at least one child that is not shared by any other parent. The second is the "singly-coupled" condition, where every latent variable is connected to at least three children that satisfy conditional independence (possibly after a transformation of the data). Variables that satisfy these conditions can be specified by an expert without requiring that the entire structure or its parameters be specified, allowing for effective use of human expertise and making room for statistical learning to do some of the heavy lifting in model learning. For both the anchored and singly-coupled conditions, practical algorithms are presented.
The second part of the thesis describes real-life applications using the anchored condition for electronic phenotyping. A human-in-the-loop learning system and a functioning emergency informatics system for real-time extraction of important clinical variables are described and evaluated.
The algorithms and discussion presented here were developed for the purpose of improving healthcare, but are much more widely applicable, dealing with the very basic questions of identifiability and learning models with latent variables - a problem that lies at the very heart of the natural and social sciences.
Title: Improving Knowledge Base Population with Information Extraction
Candidate: Li, Xiang
Advisor(s): Grishman, Ralph
Abstract:
Knowledge Bases (KBs) are data resources that encode world knowledge in machine-readable formats. Knowledge Base Population (KBP) aims at understanding this knowledge and extending KBs with more semantic information, which is a fundamental problem in Artificial Intelligence. It can benefit a wide range of tasks, such as semantic search and question answering. Information Extraction (IE), the task of discovering important types of facts (entities, relations and events) in unstructured text, is necessary and crucial for successfully populating knowledge bases. This dissertation focuses on four essential aspects of knowledge base population by leveraging IE techniques: extracting facts from unstructured data, validating the extracted information, accelerating and enhancing systems with less annotation effort, and utilizing knowledge bases to improve real-world applications.
First, we investigate the Slot Filling task, which is a key component for knowledge base population. Slot filling aims to collect information from a large collection of news, web, or other sources of documents to determine a set of predefined attributes ("slots") for given person and organization entities. We introduce a statistical language understanding approach to automatically construct personal (user-centric) knowledge bases from conversational dialogs.
Second, we consider how to probabilistically estimate the correctness of the extracted slot values. Despite the significant progress of KBP research and systems in recent years, slot filling approaches are still far from completely reliable. Using the NIST KBP Slot Filling task as a case study, we propose a confidence estimation model based on the Maximum Entropy framework, and demonstrate the effectiveness of this model in both precision and the capability to improve the slot filling aggregation through a weighted voting strategy.
Third, we study rich annotation guided learning to fill the gap between an expert annotator and a feature engineer. We develop an algorithm to enrich features with the guidance of all levels of rich annotations from human annotators. We also evaluate the comparative efficacy, generality and scalability of this framework by conducting case studies on three distinct applications in various domains, including facilitating KBP slot filling systems. Empirical studies demonstrate that with little additional annotation time, we can significantly improve the performance for all tasks.
Finally, we explore utilizing knowledge bases in a real-world application - personalized content recommendation. Traditional systems infer user interests from surface-level features derived from online activity logs and user demographic profiles, rather than deeply understanding the context semantics. We conduct a systematic study to show the effectiveness of incorporating deep semantic knowledge encoded in the entities on modeling user interests, by utilizing the abundance of entity information from knowledge bases.
Title: Improving SAT Solvers by Exploiting Empirical Characteristics of CDCL
Candidate: Oh, Chanseok
Advisor(s): Wies, Thomas
Abstract:
The Boolean Satisfiability Problem (SAT) is a canonical decision problem originally shown to be NP-complete in Cook’s seminal work on the theory of computational complexity. The SAT problem is one of several computational tasks identified by researchers as core problems in computer science. The existence of an efficient decision procedure for SAT would imply P = NP. However, numerous algorithms and techniques for solving the SAT problem have been proposed in various forms in practical settings. Highly efficient solvers are now actively being used, either directly or as a core engine of a larger system, to solve real-world problems that arise from many application domains. These state-of-the-art solvers use the Davis-Putnam-Logemann-Loveland (DPLL) algorithm extended with Conflict-Driven Clause Learning (CDCL). Due to the practical importance of SAT, building a fast SAT solver can have a huge impact on current and prospective applications. The ultimate contribution of this thesis is improving the state of the art of CDCL by understanding and exploiting the empirical characteristics of how CDCL works on real-world problems. The first part of the thesis shows empirically that most of the unsatisfiable real-world problems solvable by CDCL have a refutation proof with near-constant width for the greater portion of the proof. Based on this observation, the thesis provides an unconventional perspective that CDCL solvers can solve real-world problems very efficiently, and often more efficiently, just by maintaining a small set of certain classes of learned clauses. The next part of the thesis focuses on understanding the inherently different natures of satisfiable and unsatisfiable problems and their implications for the empirical workings of CDCL. We examine the varying roles and effects of crucial elements of CDCL based on the satisfiability status of a problem.
Ultimately, we propose effective techniques to exploit the new insights about the different natures of proving satisfiability and unsatisfiability to improve the state of the art of CDCL. In the last part of the thesis, we present a reference solver that incorporates all the techniques described in the thesis. The design of the presented solver emphasizes minimality in implementation while guaranteeing state-of-the-art performance. Several versions of the reference solver have demonstrated top-notch performance, earning several medals in the annual SAT competitive events. The minimal spirit of the reference solver shows that a simple CDCL framework alone can still be made competitive with state-of-the-art solvers that implement sophisticated techniques outside the CDCL framework.
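The observation about near-constant proof width suggests a very simple clause-database policy, sketched here as a toy (not the solver's actual heuristic, which the thesis develops in detail): retain only learned clauses below a small width bound.

```python
def reduce_learned(learned, width_bound=8):
    """Toy clause-database reduction: keep only learned clauses whose
    width (number of literals) stays within a small bound, reflecting
    the near-constant-width refutations observed on real-world instances."""
    return [clause for clause in learned if len(clause) <= width_bound]

# two short clauses survive; the wide clause is discarded
db = [[1, -2], [3, -4, 5], list(range(1, 20))]
kept = reduce_learned(db)
```

Production solvers score clauses with measures such as literal block distance rather than raw width, but the sketch captures the thesis's point that a small, carefully chosen set of learned clauses suffices for many real-world refutations.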
Title: Graph-based Approaches to Resolve Entity Ambiguity
Candidate: Pershina, Maria
Advisor(s): Grishman, Ralph
Abstract:
Information Extraction is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. One of the challenges of Information Extraction is to resolve ambiguity between entities either in a knowledge base or in text documents. There are many variations of this problem and it is known under different names, such as coreference resolution, entity disambiguation, entity linking, entity matching, etc. For example, the task of coreference resolution decides whether two expressions refer to the same entity; entity disambiguation determines how to map an entity mention to an appropriate entity in a knowledge base (KB); the main focus of entity linking is to infer that two entity mentions in a document(s) refer to the same real world entity even if they do not appear in a KB; entity matching (also record deduplication, entity resolution, reference reconciliation) is to merge records from databases if they refer to the same object.
Resolving ambiguity and finding proper matches between entities is an important step for many downstream applications, such as data integration, question answering, relation extraction, etc. The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains, posing a scalability challenge for Information Extraction systems. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and to answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge.
Various aspects and different settings to resolve ambiguity between entities are studied in this dissertation. A new scalable domain-independent graph-based approach utilizing Personalized PageRank is developed for entity matching across large-scale knowledge bases and evaluated on datasets of 110 million and 203 million entities. A new model for entity disambiguation between a document and a knowledge base utilizing a document graph and effectively filtering out noise is proposed. A new technique based on a paraphrase detection model is proposed to recognize name variations for an entity in a document. A new approach integrating a graph-based entity disambiguation model and this technique is presented for an entity linking task and is evaluated on a dataset for the Text Analysis Conference Entity Discovery and Linking 2014 task.
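Personalized PageRank itself is computed by power iteration with restarts at a seed node. The following toy sketch (assuming an adjacency-list graph; not the dissertation's scalable implementation) shows the core recurrence:

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank on a directed graph
    given as {node: [successors]}. Probability mass teleports back to
    `seed` with probability alpha; dangling mass also returns to the seed."""
    rank = {v: (1.0 if v == seed else 0.0) for v in adj}
    for _ in range(iters):
        nxt = {v: (alpha if v == seed else 0.0) for v in adj}
        for u, out in adj.items():
            if not out:                       # dangling node
                nxt[seed] += (1 - alpha) * rank[u]
            for v in out:
                nxt[v] += (1 - alpha) * rank[u] / len(out)
        rank = nxt
    return rank

ranks = personalized_pagerank({"a": ["b"], "b": ["a", "c"], "c": []}, "a")
```

Nodes close to the seed accumulate more mass, which is what makes the scores usable as a similarity signal for matching entities across graphs.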
Title: Partition Memory Models for Program Analysis
Candidate: Wang, Wei
Advisor(s): Barrett, Clark; Wies, Thomas
Abstract:
Scalability is a key challenge in static program analyses based on solvers for Satisfiability Modulo Theories (SMT). For imperative languages like C, the approach taken for modeling memory can play a significant role in scalability. The main theme of this thesis is using partitioned memory models to divide up memory based on the alias information derived from a points-to analysis.
First, a general analysis framework based on memory partitioning is presented. It incorporates a points-to analysis as a preprocessing step to determine a conservative approximation of which areas of memory may alias or overlap and splits the memory into distinct arrays for each of these areas.
Then we propose a new cell-based field-sensitive points-to analysis, which is an extension of Steensgaard’s unification-based algorithm. A cell is a unit of access with scalar or record type. Arrays and dynamic memory allocations are viewed as collections of cells. We show how our points-to analysis yields more precise alias information for programs with complex heap data structures.
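The unification flavor of the underlying analysis can be sketched with a union-find structure (a toy illustration of Steensgaard-style unification, not the cell-based analysis itself): assignments force the abstract locations involved into one equivalence class, and each resulting class becomes its own memory partition.

```python
class UnionFind:
    """Minimal union-find with path halving, standing in for the
    unification step of a Steensgaard-style points-to analysis."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
uf.union("p", "q")   # e.g. from 'p = q': p and q share a points-to class
uf.union("r", "s")   # an unrelated assignment stays in its own class
```

In the partitioned memory model, each final equivalence class is then modeled by a separate SMT array rather than one flat array of bytes.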
Our work is implemented in Cascade, a static analysis framework for C programs. It replaces the former flat memory model that models the memory as a single array of bytes. We show that the partitioned memory models achieve better scalability within Cascade, and the cell-based memory model, in particular, improves the performance significantly, making Cascade a state-of-the-art C analyzer.
Title: Learning Algorithms from Data
Candidate: Zaremba, Wojciech
Advisor(s): Fergus, Rob; LeCun, Yann
Abstract:
Statistical machine learning is concerned with learning models that describe observations. We train our models from data on tasks like machine translation or object recognition because we cannot explicitly write down programs to solve such problems. A statistical model is only useful when it generalizes to unseen data. Solomonoff has proved that one should choose the model that agrees with the observed data, while preferring the model that can be compressed the most, because such a choice guarantees the best possible generalization. The size of the best possible compression of the model is called the Kolmogorov complexity of the model. We define an algorithm as a function with small Kolmogorov complexity.
This Ph.D. thesis outlines the problem of learning algorithms from data and shows several partial solutions to it. Our models are mainly neural networks, as they have proven to be successful in various domains like object recognition, language modeling, speech recognition and others. First, we examine empirical trainability limits for classical neural networks. Then, we extend them by providing interfaces, which provide a way to read memory, access the input, and postpone predictions. The model learns how to use them with reinforcement learning techniques like Reinforce and Q-learning. Next, we examine whether contemporary building blocks such as the convolution layer can be automatically rediscovered. We show that it is indeed possible to learn convolution as a special case within a broader class of models. Finally, we investigate whether it is possible to directly enumerate short programs and find a solution to a given problem. This follows the original line of thought behind Solomonoff induction. Our approach is to learn a prior over programs such that we can explore them efficiently.
Title: Distributed Stochastic Optimization for Deep Learning
Candidate: Zhang, Sixin
Advisor(s): LeCun, Yann
Abstract:
We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze the convergence rate of the EASGD method in the synchronous scenario and compare its stability condition with the existing ADMM method in the round-robin scheme. An asynchronous and momentum variant of the EASGD method is applied to train deep convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Our approach accelerates the training and furthermore achieves better test accuracy. It also requires a much smaller amount of communication than other common baseline approaches such as the DOWNPOUR method.
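The core elastic-averaging update can be written in a few lines (a toy synchronous round on scalar parameters, with invented names; the thesis analyzes convergence and the asynchronous and momentum variants formally): each worker takes a gradient step plus an elastic pull toward a center variable, while the center moves toward the workers.

```python
def easgd_step(workers, center, grads, eta=0.01, alpha=0.1):
    """One synchronous EASGD round (toy). workers: per-worker parameters;
    grads: their local gradients; eta: learning rate; alpha: strength
    of the elastic link between each worker and the center variable."""
    new_workers = [w - eta * g - alpha * (w - center)
                   for w, g in zip(workers, grads)]
    new_center = center + alpha * sum(w - center for w in workers)
    return new_workers, new_center

# with zero gradients, workers and center simply pull toward each other
ws, c = easgd_step([1.0, 3.0], 0.0, [0.0, 0.0])
```

Because each worker only exchanges its elastic term with the center, the communication per round is a single parameter vector per worker, which is where the savings over DOWNPOUR-style gradient shipping come from.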
We then investigate the limit in speedup of the initial and the asymptotic phase of the mini-batch SGD, the momentum SGD, and the EASGD methods. We find that the spread of the input data distribution has a big impact on their initial convergence rate and stability region. We also find a surprising connection between the momentum SGD and the EASGD method with a negative moving average rate. A non-convex case is also studied to understand when EASGD can get trapped by a saddle point.
Finally, we scale up the EASGD method by using a tree structured network topology. We show empirically its advantage and challenge. We also establish a connection between the EASGD and the DOWNPOUR method with the classical Jacobi and the Gauss-Seidel method, thus unifying a class of distributed stochastic optimization methods.
Title: Pushing the Limits of Additive Fabrication Technologies
Candidate: Zhou, Qingnan (James)
Advisor(s): Zorin, Denis
Abstract:
A rough symmetry can be observed in the stock price of 3D Systems (NYSE:DDD), the leading and largest 3D printer manufacturer, from its IPO on June 3, 2011 to the beginning of 2016. The price skyrocketed nearly 600% from 2011 to the end of 2013, then took a free fall back to its original value by 2016. Coincidentally, this is also the period during which I got my hands dirty and investigated some of the toughest challenges as well as exciting new possibilities associated with different types of 3D printing technologies. In this thesis, I document my attempts from three different angles to push the limits of 3D printing: printability, microstructure design, and robust geometry processing with mesh arrangements.
Printability checking has long been a bottleneck that prevents 3D printing from scaling up. Oftentimes, designers of 3D models lack the expertise or tools to ensure 3D printability. 3D printing service providers typically rely on human inspection to filter out unprintable designs. This process is manual and error-prone. As designs become ever more complex, manual printability checking becomes increasingly difficult. To tackle this problem, my colleagues and I proposed an algorithm to automatically determine structurally weak regions and the worst-case usage scenario to break a given model. We validated the algorithm by physically breaking a number of real 3D printed designs.
A key distinctive feature of 3D printing technologies is that the cost and time of fabrication is uncorrelated with geometric complexity. This opens up many exciting new possibilities. In particular, by pushing geometric complexity to the extreme, 3D printing has the potential of fabricating soft, deformable shapes with microscopic structures using a single raw material. In our recent SIGGRAPH publication, my colleagues and I have not only demonstrated fabricating microscopic frame structures is possible but also proposed an entire pipeline for designing spatially varying microstructures to satisfy target material properties or deformation goals.
With the rise of 3D printing technologies, 3D models have become more abundant and easily accessible than ever before. These models are sometimes known as "wild" models because they differ significantly in complexity and quality from traditional models in graphics research. This poses a serious challenge in robustly analyzing 3D designs. Many state-of-the-art geometry processing algorithms and libraries are ill-prepared for dealing with "wild" models that are non-manifold, self-intersecting, locally degenerate, and/or contain multiple and possibly nested components. In our most recent SIGGRAPH submission, we proposed a systematic recipe based on mesh arrangements for conducting a family of exact constructive solid geometry operations. We exhaustively tested our algorithm on 10,000 "wild" models crawled from Thingiverse, a popular online shape repository. Both the code and the dataset are freely available to the public.
Title: Big Data Analytics for Development: Events, Knowledge Graphs and Predictive Models
Candidate: Chakraborty, Sunandan
Advisor(s): Subramanian, Lakshminarayanan; Nyarko, Yaw
Abstract:
Volatility in critical socio-economic indices can have a significant negative impact on global development. This thesis presents a suite of novel big data analytics algorithms that operate on unstructured Web data streams to automatically infer events, knowledge graphs and predictive models to understand, characterize and predict the volatility of socioeconomic indices.
This thesis makes four important research contributions. First, given a large volume of diverse unstructured news streams, we present new models for capturing events and learning spatio-temporal characteristics of events from news streams. We specifically explore two types of event models in this thesis: one centered around the concept of event triggers and a probabilistic meta-event model that explicitly delineates named entities from text streams to learn a generic class of meta-events. The second contribution focuses on learning several different types of knowledge graphs from news streams and events: a) spatio-temporal article graphs that capture intrinsic relationships between different news articles; b) event graphs that characterize relationships between events and, given a news query, provide a succinct summary of a timeline of events relating to the query; c) event-phenomenon graphs that provide a condensed representation of classes of events that relate to a given phenomenon at a given location and time; d) word-word graphs with causality testing, which capture strong spatio-temporal relationships between word occurrences in news streams; e) concept graphs that capture relationships between different word concepts that occur in a given text stream.
The third contribution focuses on connecting the different knowledge graph representations and structured time series data corresponding to a socio-economic index to automatically learn event-driven predictive models for the given socio-economic index to predict future volatility. We propose several types of predictive models centered around our two event models: event triggers and probabilistic meta-events. The final contribution focuses on a broad spectrum of inference case studies for different types of socio-economic indices including food prices, stock prices, disease outbreaks and interest rates. Across all these indices, we show that event-driven predictive models provide significant improvements in prediction accuracy over state-of-the-art techniques.
Title: SMT-Based and Disjunctive Relational Abstract Domains for Static Analysis
Candidate: Chen, Junjie
Advisor(s): Cousot, Patrick
Abstract:
Abstract Interpretation is a theory of sound approximation of program semantics. In recent decades, it has been widely and successfully applied to the static analysis of computer programs. In this thesis, we work on abstract domains, one of the key concepts in abstract interpretation, which aim at automatically collecting information about the set of all possible values of the program variables. We focus, in particular, on two aspects: the combination with theorem provers and the refinement of existing abstract domains.
Satisfiability modulo theories (SMT) solvers are popular theorem provers, which have proven to be very powerful tools for checking the satisfiability of first-order logical formulas with respect to some background theories. In the first part of this thesis, we introduce two abstract domains whose elements are logical formulas: finite conjunctions of affine equalities and finite conjunctions of linear inequalities, respectively. These two abstract domains rely on SMT solvers for the computation of transformations and other logical operations.
In the second part of this thesis, we present an abstract domain functor whose elements are binary decision trees. It is parameterized by decision nodes which are a set of boolean tests appearing in the programs and by a numerical or symbolic abstract domain whose elements are the leaves. This new binary decision tree abstract domain functor provides a flexible way of adjusting the cost/precision ratio in path-dependent static analysis.
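The functor's structure can be illustrated with a toy join on such trees (a hypothetical sketch with intervals at the leaves; not the thesis's implementation): internal nodes carry boolean program tests, and the underlying interval join is applied leafwise.

```python
def join(t1, t2):
    """Pointwise join of two decision trees with the same test structure.
    Internal node: (test, left_subtree, right_subtree), test is a string;
    leaf: an interval abstract value (lo, hi)."""
    if len(t1) == 3 and isinstance(t1[0], str):        # decision node
        test, l1, r1 = t1
        _, l2, r2 = t2
        return (test, join(l1, l2), join(r1, r2))
    return (min(t1[0], t2[0]), max(t1[1], t2[1]))      # interval least upper bound

# joining two analyses of the same program keeps the two branches of
# 'x > 0' separate instead of merging them into one coarse interval
merged = join(("x > 0", (1, 5), (-5, -1)),
              ("x > 0", (2, 10), (-3, 0)))
```

Keeping the branch information in the tree, rather than joining everything at control-flow merges, is what lets the domain trade precision for cost in path-dependent analyses.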
Title: Iris: Mitigating Phase Noise in Millimeter Wave OFDM Systems
Candidate: Dhananjay, Aditya
Advisor(s): Li, Jinyang
Abstract:
Next-generation wireless networks are widely expected to operate over millimeter-wave (mmW) frequencies of over 28GHz. These bands mitigate the acute spectrum shortage in the conventional microwave bands of less than 6GHz. The shorter wavelengths in these bands also allow for building dense antenna arrays on a single chip, thereby enabling various MIMO configurations and highly directional links that can increase the spatial reuse of spectrum.
While attempting to build a practical over-the-air (OTA) link over mmW, we realized that the traditional baseband processing techniques used in the microwave bands simply could not cope with the exacerbated frequency offsets (or phase noise) observed in the RF oscillators at these bands. While the frequency offsets are large, the real difficulty arises from the fact that they vary significantly over very short time-scales. Traditional feedback loop techniques still leave significant residual offsets, which in turn lead to inter-carrier interference (ICI). The result is very high symbol error rates (SER).
This thesis presents Iris, a baseband processing block that enables clean mmW links, even in the presence of previously fatal amounts of phase noise. Over real mmW hardware, Iris reduces the SER by one to two orders of magnitude, as compared to competing techniques.
Title: Predicting Images using Convolutional Networks: Visual Scene Understanding with Pixel Maps
Candidate: Eigen, David
Advisor(s): Fergus, Rob
Abstract:
In the greater part of this thesis, we develop a set of convolutional networks that infer predictions at each pixel of an input image. This is a common problem that arises in many computer vision applications: for example, predicting a semantic label at each pixel describes not only the image content, but also fine-grained locations and segmentations; at the same time, finding depth or surface normals provides 3D geometric relations between points. The second part of this thesis also investigates convolutional models in the contexts of classification and unsupervised learning.
To address our main objective, we develop a versatile Multi-Scale Convolutional Network that can be applied to diverse vision problems using simple adaptations, and apply it to predict depth at each pixel, surface normals and semantic labels. Our model uses a series of convolutional network stacks applied at progressively finer scales. The first uses the entire image field of view to predict a spatially coarse set of feature maps based on global relations; subsequent scales correct and refine the output, yielding a high resolution prediction. We look exclusively at depth prediction first, then generalize our method to multiple tasks. Our system achieves state-of-the-art results on all tasks we investigate, and can match many image details without the need for superpixelation.
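The coarse-to-fine structure can be sketched in a few lines of numpy (a toy with stand-in networks; the actual model uses learned convolutional stacks at each scale): a coarse prediction over the whole image is upsampled and then corrected by a residual computed at the finer scale.

```python
import numpy as np

def coarse_to_fine(image, coarse_net, refine_net, scale=2):
    """Toy multi-scale prediction: predict a low-resolution map from the
    whole image, upsample it, and add a finer-scale residual refinement."""
    coarse = coarse_net(image)                      # coarse, global prediction
    up = np.kron(coarse, np.ones((scale, scale)))   # nearest-neighbor upsample
    return up + refine_net(image, up)               # local residual correction

img = np.arange(16.0).reshape(4, 4)
out = coarse_to_fine(img,
                     lambda im: im[::2, ::2],           # stand-in coarse net
                     lambda im, up: np.zeros_like(up))  # stand-in refiner
```

The key design choice the sketch reflects is that the coarse stage sees the full field of view (global relations) while later stages only need to predict corrections at higher resolution.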
Leading up to our multi-scale network, we also design a purely local convolutional network to remove dirt and raindrops present on a window surface, which learns to identify and inpaint compact corruptions. We also investigate a weighted nearest-neighbors labeling system applied to superpixels, in which we learn weights for each example and use local context to find rare class instances.
In addition, we investigate the relative importance of sizing parameters using a recursive convolutional network, finding that network depth is most critical. We also develop a Convolutional LISTA Autoencoder, which learns features similar to stacked sparse coding at a fraction of the cost, combine it with a local entropy objective, and describe a convolutional adaptation of ZCA whitening.
Title: Unsupervised Feature Learning in Computer Vision
Candidate: Goroshin, Ross
Advisor(s): LeCun, Yann
Abstract:
Much of computer vision has been devoted to the question of representation through feature extraction. Ideal features transform raw pixel intensity values to a representation in which common problems such as object identification, tracking, and segmentation are easier to solve. Recently, deep feature hierarchies have proven to be immensely successful at solving many problems in computer vision. In the supervised setting, these hierarchies are trained to solve specific problems by minimizing an objective function of the data and problem specific label information. Recent findings suggest that despite being trained on a specific task, the learned features can be transferred across multiple visual tasks. These findings suggest that there exists a generically useful feature representation for natural visual data.
This work aims to uncover the principles that lead to these generic feature representations in the unsupervised setting, which does not require problem specific label information. We begin by reviewing relevant prior work, particularly the literature on autoencoder networks and energy based learning. We introduce a new regularizer for autoencoders that plays an analogous role to the partition function in probabilistic graphical models. Next we explore the role of specialized encoder architectures for sparse inference. The remainder of the thesis explores visual feature learning from video. We establish a connection between slow-feature learning and metric learning, and experimentally demonstrate that semantically coherent metrics can be learned from natural videos. Finally, we posit that useful features linearize natural image transformations in video. To this end, we introduce a new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences by learning to predict future frames in the presence of uncertainty.
Title: Efficient and Trustworthy Theory Solver for Bit-vectors in Satisfiability Modulo Theories
Candidate: Hadarean, Liana
Advisor(s): Barrett, Clark
Abstract:
As software and hardware systems grow in complexity, automated techniques for ensuring their correctness are becoming increasingly important. Many modern formal verification tools rely on back-end satisfiability modulo theories (SMT) solvers to discharge complex verification goals. These goals are usually formalized in one or more fixed first-order logic theories, such as the theory of fixed-width bit-vectors. The theory of bit-vectors offers a natural way of encoding the precise semantics of typical machine operations on binary data. The predominant approach to deciding the bit-vector theory is via eager reduction to propositional logic. While this often works well in practice, it does not scale well as the bit-width and number of operations increase. The first part of this thesis seeks to fill this gap, by exploring efficient techniques of solving bit-vector constraints that leverage the word-level structure. We propose two complementary approaches: an eager approach that takes full advantage of the solving power of off-the-shelf propositional logic solvers, and a lazy approach that combines on-the-fly algebraic reasoning with efficient propositional logic solvers. In the second part of the thesis, we propose a proof system for encoding automatically checkable refutation proofs in the theory of bit-vectors. These proofs can be automatically generated by the SMT solver, and act as a certificate for the correctness of the result.
Title: Predicting the Market Value of Single-Family Residences
Candidate: Lowrance, Roy
Advisor(s): LeCun, Yann; Shasha, Dennis
Abstract:
This work develops the best linear model of residential real estate prices for 2003 through 2009 in Los Angeles County. It differs from other studies comparing models for predicting house prices by covering a larger geographic area than most, more houses than most, a longer time period than most, and the time period both before and after the real estate price boom in the United States.
In addition, it open-sources all of the software. We test designs for linear models to determine the best form for the model as well as the training period, features, and regularizer that produce the lowest errors. We compare the best of our linear models to random forests and point to directions for further research.
Title: Building Fast, CPU-Efficient Distributed Systems on Ultra-Low Latency, RDMA-Capable Networks
Candidate: Mitchell, Christopher
Advisor(s): Li, Jinyang
Abstract:
Modern datacenters utilize traditional Ethernet interconnects to connect hundreds or thousands of machines. Although inexpensive and ubiquitous, Ethernet imposes design constraints on datacenter-scale distributed storage systems that use traditional client-server architectures. Recent technological trends indicate that future datacenters will embrace interconnects with ultra-low latency, high bandwidth, and the ability to offload work from servers to clients. Future datacenter-scale distributed storage systems will need to be designed specifically to exploit these features. This thesis explores what these features mean for large-scale in-memory storage systems, and derives two key insights for building RDMA-aware distributed systems.
First, relaxing locality between data and computation is now practical: data can be copied from servers to clients for computation. Second, selectively relaxing data-computation locality makes it possible to optimally balance load between server and client CPUs to maintain low application latency. This thesis presents two in-memory distributed storage systems built around these two insights, Pilaf and Cell, that demonstrate effective use of ultra-low-latency, RDMA-capable interconnects. Through Pilaf and Cell, this thesis demonstrates that by combining RDMA and message passing to selectively relax locality, systems can achieve ultra-low latency and optimal load balancing with modest CPU resources.
Title: Instance Segmentation of RGBD Scenes
Candidate: Silberman, Nathan
Advisor(s): Fergus, Rob
Abstract:
The vast majority of literature in scene parsing can be described as semantic pixel labeling or semantic segmentation: predicting the semantic class of the object represented by each pixel in the scene. Our familiar perception of the world, however, provides a far richer representation. Firstly, rather than just being able to predict the semantic class of a location in a scene, humans are able to reason about object instances. Discriminating between a region that might represent a single object versus ten objects is a crucial and basic faculty. Secondly, rather than reasoning about objects as merely occupying the space visible from a single vantage point, we are able to quickly and easily reason about an object's true extent in 3D. Thirdly, rather than viewing a scene as a collection of objects independently existing in space, humans exhibit a representation of scenes that is highly grounded through an intuitive model of physics. Such models allow us to reason about how objects relate physically: via physical support relationships.
Instance segmentation is the task of segmenting a scene into regions which correspond to individual object instances. We argue that this task is not only closer to our own perception of the world than semantic segmentation, but also directly allows for subsequent reasoning about a scene's constituent elements. We explore various strategies for instance segmentation in indoor RGBD scenes.
Firstly, we explore tree-based instance segmentation algorithms. The utility of trees for semantic segmentation has been thoroughly demonstrated and we adapt them to instance segmentation and analyze both greedy and global approaches to inference.
Next, we investigate exemplar-based instance segmentation algorithms, in which a set of representative exemplars are chosen from a large pool of regions and pixels are assigned to exemplars. Inference can either be performed in two stages, exemplar selection followed by pixel-to-exemplar assignment, or in a single joint reasoning stage. We consider the advantages and disadvantages of each approach.
We introduce the task of support-relation prediction, in which we predict which objects are physically supporting other objects. We propose an algorithm and a new set of features for performing discriminative support prediction, demonstrate the effectiveness of our method, and compare training mechanisms.
Finally, we introduce an algorithm for inferring scene and object extent. We demonstrate how reasoning about 3D extent can be done by extending known 2D methods and highlight the strengths and limitations of this approach.
Title: Localization of Humans in Images Using Convolutional Networks
Candidate: Tompson, Jonathan
Advisor(s): Bregler, Christopher
Abstract:
Tracking of humans in images is a long-standing problem in computer vision research for which, despite significant research effort, an adequate solution has not yet emerged. This is largely due to the fact that human body localization is complicated and difficult; potential solutions must find the location of body joints in images with invariance to shape, lighting and texture variation, and must do so in the presence of occlusion and incomplete data. However, despite these significant challenges, this work presents a framework for human body pose localization that not only offers a significant improvement over existing traditional architectures, but has sufficient localization performance and computational efficiency for use in real-world applications.
At its core, this framework makes use of Convolutional Networks to infer the location of body joints efficiently and accurately. We describe solutions to two applications: 1) hand-tracking from a depth image source and 2) human body-tracking from an RGB image source. For both of these applications we show that Convolutional Networks are able to significantly outperform the existing state of the art.
We propose a new hybrid architecture that consists of a deep Convolutional Network and a Probabilistic Graphical Model which can exploit structural domain constraints such as geometric relationships between body joint locations to improve tracking performance. We then explore the use of both color and motion features to improve tracking performance. Finally we introduce a novel architecture which includes an efficient ‘position refinement’ model that is trained to estimate the joint offset location within a small region of the image. This refinement model allows our network to improve spatial localization accuracy even with large amounts of spatial pooling.
Title: Joint Training of a Neural Network and a Structured Model for Computer Vision
Candidate: Wan, Li
Advisor(s): Fergus, Rob
Abstract:
Identifying objects and telling where they are in real-world images is one of the most important problems in Artificial Intelligence. The problem is challenging due to occluded objects, varying object viewpoints and object deformations. These factors make the vision problem extremely difficult; it cannot be efficiently solved without learning.
This thesis explores hybrid systems that combine a neural network as a trainable feature extractor and structured models that capture high level information such as object parts. The resulting models combine the strengths of the two approaches: a deep neural network which provides a powerful non-linear feature transformation and a high level structured model which integrates domain-specific knowledge. We develop discriminative training algorithms to jointly optimize these entire models end-to-end.
First, we propose a unified model which combines a deep neural network with a latent topic model for image classification. The hybrid model is shown to outperform models based solely on a neural network or a topic model alone. Next, we investigate techniques for training a neural network system, introducing an effective way of regularizing the network called DropConnect. DropConnect allows us to train large models while avoiding over-fitting. This yields state-of-the-art results on a variety of standard benchmarks for image classification. Third, we work on object detection for the PASCAL challenge. We improve the deformable parts model and propose a new non-maximal suppression algorithm. This system was the joint winner of the 2011 challenge. Finally, we develop a new hybrid model which integrates a deep network, a deformable parts model and non-maximal suppression. Joint training of our hybrid model shows a clear advantage over training each component individually, achieving competitive results on standard benchmarks.
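The DropConnect idea, masking weights rather than activations, can be sketched in a few lines of NumPy. This is a minimal illustration of the technique, not the thesis's implementation; the layer shapes, keep-probability `p`, and all-ones inputs are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_forward(x, W, b, p=0.5):
    """One fully connected layer with DropConnect (training mode).

    Unlike Dropout, which zeroes activations, DropConnect zeroes
    individual weights: each connection is kept independently with
    probability p, so a fresh random subnetwork is used per forward pass.
    """
    mask = rng.random(W.shape) < p      # independent keep-mask per weight
    return x @ (W * mask) + b

x = np.ones((2, 4))                     # batch of 2 inputs
W = np.ones((4, 3))                     # weights of a 4 -> 3 layer
b = np.zeros(3)
y = dropconnect_forward(x, W, b)        # each entry counts the kept weights
```

With unit inputs and weights, each output entry simply counts how many of the four incoming connections survived the mask, which makes the weight-level (rather than activation-level) masking easy to see.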
Title: Partition Memory Models in Program Analysis
Candidate: Wang, Wei
Advisor(s): Barrett, Clark
Abstract:
Scalability is a key challenge in static program analyses based on solvers for Satisfiability Modulo Theories (SMT). For imperative languages like C, the approach taken for modeling memory can play a significant role in scalability. The main theme of this thesis is using partitioned memory models to divide up memory based on the alias information derived from a points-to analysis.
First, a general analysis framework based on memory partitioning is presented. It incorporates a points-to analysis as a preprocessing step to determine a conservative approximation of which areas of memory may alias or overlap and splits the memory into distinct arrays for each of these areas.
Then we propose a new cell-based field-sensitive points-to analysis, which is an extension of Steensgaard's unification-based algorithm. A cell is a unit of access with scalar or record type. Arrays and dynamic memory allocations are viewed as collections of cells. We show how our points-to analysis yields more precise alias information for programs with complex heap data structures.
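The flavor of a Steensgaard-style unification-based analysis can be conveyed with a small union-find sketch. This is a toy illustration, not Cascade's cell-based implementation: it omits fields, record types, and the cell abstraction, and the constraint names (`assign_addr`, `assign`) are invented for the example.

```python
class PointsTo:
    """Toy Steensgaard-style unification-based points-to analysis."""

    def __init__(self):
        self.parent = {}   # union-find forest over abstract locations
        self.pts = {}      # representative -> representative it points to

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        self.parent[rb] = ra
        # Steensgaard's rule: merging two locations also merges their pointees.
        pa, pb = self.pts.pop(ra, None), self.pts.pop(rb, None)
        if pa and pb:
            self.pts[ra] = self.union(pa, pb)
        elif pa or pb:
            self.pts[ra] = pa or pb
        return ra

    def assign_addr(self, p, x):
        """Constraint for 'p = &x'."""
        rp = self.find(p)
        if rp in self.pts:
            self.union(self.pts[rp], x)
        else:
            self.pts[rp] = self.find(x)

    def assign(self, p, q):
        """Constraint for 'p = q': whatever p and q point to is unified."""
        rp, rq = self.find(p), self.find(q)
        pp, pq = self.pts.get(rp), self.pts.get(rq)
        if pp and pq:
            self.pts[rp] = self.union(pp, pq)
        elif pq:
            self.pts[rp] = pq
        elif pp:
            self.pts[rq] = pp

analysis = PointsTo()
analysis.assign_addr('p', 'x')   # p = &x
analysis.assign_addr('q', 'y')   # q = &y
analysis.assign('p', 'q')        # p = q  =>  x and y land in one alias class
```

Because unification is one-directional merging rather than subset propagation, the analysis is fast but conservative, which is exactly the trade-off the cell-based extension refines.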
Our work is implemented in Cascade, a static analysis framework for C programs. It replaces the former flat memory model that models the memory as a single array of bytes. We show that the partitioned memory models achieve better scalability within Cascade, and the cell-based memory model, in particular, improves the performance significantly, making Cascade a state-of-the-art C analyzer.
Title: On the Human Form: Efficient acquisition, modeling and manipulation of the human body
Candidate: Braga, Otavio
Advisor(s): Geiger, Davi
Abstract:
This thesis concerns the acquisition, modeling and manipulation of the human form.
First, we acquire body models. We introduce an efficient bootstrapped algorithm that we employed to register over 2,000 high-resolution body scans of male and female adult subjects. Our algorithm outputs not only the traditional vertex correspondences, but also directly produces a high-quality model which can be immediately deformed. We then employ the result to fit noisy depth maps coming from commercially available 3D sensors such as Microsoft's Kinect and PrimeSense's Carmine.
We then switch focus to the topic of body manipulation. We first revisit the more traditional way of specifying bodies from a set of measurements, such as those coming from clothing sizing charts, showing how the statistics of the population learned during registration can aid us in accurately defining the body shape. We then introduce a new manipulation metaphor, where we navigate through the space of body shapes and poses by directly dragging the body mesh surface.
We conclude by describing a new real-time system for image-based body manipulation called BodyJam, that lets you change your outfit with a finger snap. BodyJam is inspired by a technique invented by the surrealists a century ago: "Exquisite Corpse", a method by which a collection of images (of body parts) is collectively assembled. BodyJam does it on a video display that mirrors the pose in real-time of a real-person standing in front of the camera/display mirror, and allows the user to change clothes and other appearance attributes. Using Microsoft's Kinect, poses are matched to a video database of different torsos and legs, and "pages" showing different clothes are turned by hand gestures.
Title: Analyzing Tatonnement Dynamics in Economic Markets
Candidate: Cheung, Yun Kuen
Advisor(s): Cole, Richard
Abstract:
The impetus for this dissertation is to explain why well-functioning markets might be able to stay at or near a market equilibrium. We argue that tatonnement, a natural, simple and distributed price update dynamic in economic markets, is a plausible candidate to explain how markets might reach their equilibria.
Tatonnement is broadly defined as follows: if the demand for a good is more than the supply, increase the price of the good, and conversely, decrease the price when the demand is less than the supply. Prior works show that tatonnement converges to market equilibrium in some markets while it fails to converge in other markets. Our goal is to extend the classes of markets in which tatonnement is shown to converge. The prior positive results largely concerned markets with substitute goods. We seek market constraints which enable tatonnement to converge in markets with complementary goods, or with a mixture of substitutes and complementary goods. We also show fast convergence rates for some of these markets.
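The price-update rule defined above can be written as a short loop. This is an illustrative toy, not one of the market models analyzed in the dissertation: the multiplicative update form, the two-good excess-demand function, and the step size are all invented for the example.

```python
def tatonnement(excess_demand, prices, step=0.05, iters=2000):
    """Multiplicative tatonnement: raise a good's price when demand
    exceeds supply, lower it when supply exceeds demand."""
    p = list(prices)
    for _ in range(iters):
        z = excess_demand(p)  # demand minus supply for each good
        p = [pi * (1 + step * zi) for pi, zi in zip(p, z)]
    return p

# Toy two-good economy whose excess demand vanishes when the prices agree:
# good 0 is over-demanded exactly when it is cheaper than good 1.
excess = lambda p: [p[1] - p[0], p[0] - p[1]]
eq = tatonnement(excess, [1.0, 2.0])
```

In this toy economy the price gap shrinks by a constant factor each round, so the loop converges to the equilibrium where both prices agree; the dissertation's contribution is characterizing which real market classes admit such convergence.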
We introduce an amortized analysis technique to handle asynchronous events - in our case asynchronous price updates. On the other hand, for some markets we show that tatonnement is equivalent to generalized gradient descent (GGD). The amortized analysis and our analysis of GGD may be of independent interest.
Title: Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services
Candidate: Huang, Fu Jie
Advisor(s): LeCun, Yann
Abstract:
In this work, we describe an application of convolutional networks to object classification and detection in images. The task of image-based object recognition is surveyed in the first chapter. Its application in Internet advertisement is one of the main motivations for this work.
The architecture of the convolutional networks is described in detail in the following chapter. Stochastic gradient descent is used to train the networks.
We then describe the data collection and labelling process. The set of training data labelled largely determines what kind of recognizer is built. Four binary classifiers are trained for the object types sailboat, car, motorbike, and dog.
A GPU-based massively parallel implementation of the convolutional networks is built. This enables us to run the convolution operations close to 40 times faster than on a traditional CPU. Details about how to implement the convolution operation on NVIDIA GPUs using CUDA are discussed.
In order to apply the object recognizer in a production environment where millions of images are processed daily, we have built a cloud computing platform. We describe how large-scale and low-latency image processing can be achieved with such a system.
Title: Effective Algorithms for the Satisfiability of Quantifier-Free Formulas Over Linear Real and Integer Arithmetic
Candidate: King, Tim
Advisor(s): Barrett, Clark
Abstract:
A core technique of modern tools for formally reasoning about computing systems is generating and dispatching queries to automated theorem provers, including Satisfiability Modulo Theories (SMT) provers. SMT provers aim at the tight integration of decision procedures for propositional satisfiability and decision procedures for fixed first-order theories ‒ known as theory solvers. This thesis presents several advancements in the design and implementation of theory solvers for quantifier-free linear real, integer, and mixed integer and real arithmetic. These are implemented within the SMT system CVC4. We begin by formally describing the Satisfiability Modulo Theories problem and the role of theory solvers within CVC4. We discuss known techniques for building solvers for quantifier-free linear real, integer, and mixed integer and real arithmetic around the Simplex for DPLL(T) algorithm. We give several small improvements to theory solvers using this algorithm and describe the implementation and theory of this algorithm in detail. To extend the class of problems that the theory solver can robustly support, we borrow and adapt several techniques from linear programming (LP) and mixed integer programming (MIP) solvers which come from the tradition of optimization. We propose a new decision procedure for quantifier-free linear real arithmetic that replaces the Simplex for DPLL(T) algorithm with a variant of the Simplex algorithm that performs a form of optimization ‒ minimizing the sum of infeasibilities. In this thesis, we additionally describe techniques for leveraging LP and MIP solvers to improve the performance of SMT solvers without compromising correctness. Previous efforts to leverage such solvers in the context of SMT have concluded that, in addition to being potentially unsound, such solvers are too heavyweight to compete in the context of SMT.
We present an empirical comparison against other state-of-the-art SMT tools to demonstrate the effectiveness of the proposed solutions.
Title: Cryptographic Algorithms for the Secure Delegation of Multiparty Computation
Candidate: Lopez-Alt, Adriana
Advisor(s): Dodis, Yevgeniy
Abstract:
In today’s world, we store our data and perform expensive computations remotely on powerful servers (a.k.a. “the cloud”) rather than on our local devices. In this dissertation we study the question of achieving cryptographic security in the setting where multiple (mutually distrusting) clients wish to delegate the computation of a joint function on their inputs to an untrusted cloud, while keeping these inputs private. We introduce two frameworks for modeling such protocols.
We construct cloud-assisted and on-the-fly MPC protocols using fully homomorphic encryption (FHE). However, FHE requires inputs to be encrypted under the same key; we extend it to the multiparty setting in two ways:
Title: Robust and Efficient Methods for Approximation and Optimization of Stability Measures
Candidate: Mitchell, Tim
Advisor(s): Overton, Michael
Abstract:
We consider two new algorithms with practical application to the problem of designing controllers for linear dynamical systems with input and output: a new spectral value set based algorithm called hybrid expansion-contraction intended for approximating the H-infinity norm, or equivalently, the complex stability radius, of large-scale systems, and a new BFGS SQP based optimization method for nonsmooth, nonconvex constrained optimization motivated by multi-objective controller design. In comprehensive numerical experiments, we show that both algorithms in their respective domains are significantly faster and more robust compared to other available alternatives. Moreover, we present convergence guarantees for hybrid expansion-contraction, proving that it converges at least superlinearly, and observe that it converges quadratically in practice, and typically to good approximations to the H-infinity norm, for problems where we can verify this. We also extend the hybrid expansion-contraction algorithm to the real stability radius, a measure which is known to be more difficult to compute than the complex stability radius. Finally, for the purposes of comparing multiple optimization methods, we present a new visualization tool called relative minimization profiles that allows for simultaneously assessing the relative performance of algorithms with respect to three important performance characteristics, highlighting how these measures interrelate and how each algorithm compares to its competitors on heterogeneous test sets. We employ relative minimization profiles to empirically validate our proposed BFGS SQP method in terms of quality of minimization, attaining feasibility, and speed of progress compared to other available methods on challenging test sets comprised of nonsmooth, nonconvex constrained optimization problems arising in controller design.
Title: Building Efficient Distributed In-memory Systems
Candidate: Power, Russell
Advisor(s): Li, Jinyang
Abstract:
The recent cloud computing revolution has changed the distributed computing landscape, making the resources of entire datacenters available to ordinary users. This process has been greatly aided by dataflow-style frameworks such as MapReduce, which expose a simple programming model, allowing for efficient, fault-tolerant execution across many machines. While the MapReduce model has proved to be effective for many applications, there is a wide class of applications which are difficult to write or inefficient in such a model. This includes many familiar and important applications such as PageRank, matrix factorization and a number of machine learning algorithms. In the absence of a good framework for building these applications, users resort to writing them using MPI or RPC, a difficult and error-prone construction.
This thesis presents two complementary frameworks, Piccolo and Spartan, which help programmers to write in-memory distributed applications not served well by existing approaches.
Piccolo presents a new data-centric programming model for in-memory applications. Unlike data-flow models, Piccolo allows programs running on different machines to share distributed, mutable state via a key-value table interface. This design allows for both high performance and additional flexibility. Piccolo makes novel use of commutative updates to efficiently resolve write-write conflicts. We find Piccolo provides an efficient backend for a wide range of applications: from PageRank and matrix multiplication to web-crawling.
While Piccolo provides an efficient backend for distributed computation, it can still be somewhat cumbersome to write programs using it directly. To address this, we created Spartan. Spartan is a distributed implementation of the NumPy array language, and fully supports important array language features such as spatial indexing (slicing), fancy indexing and broadcasting. A key feature of Spartan is its use of a small number of simple, powerful high-level operators to provide most functionality. Not only do these operators dramatically simplify the design and implementation of Spartan, they also allow users to implement new functionality with ease.
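For reference, the three array-language features named above look like this in plain single-machine NumPy; Spartan's contribution is executing such expressions on distributed arrays, which this snippet does not attempt to show.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

sliced = a[1:, :2]               # spatial indexing (slicing): rows 1-2, cols 0-1
fancy = a[[0, 2], :]             # fancy indexing: gather rows 0 and 2 by index list
centered = a - a.mean(axis=0)    # broadcasting: a (4,) vector stretched over (3, 4)
```

Supporting exactly these idioms is what lets a distributed backend stay transparent to existing array-oriented code.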
We evaluate Piccolo and Spartan on a wide range of applications and find that they both perform significantly better than existing approaches.
Title: Runtime Compilation of Array-Oriented Python Programs
Candidate: Rubinsteyn, Alex
Advisor(s): Shasha, Dennis
Abstract:
The Python programming language has become a popular platform for data analysis and scientific computing. To mitigate the poor performance of Python's standard interpreter, numerically intensive computations are typically offloaded to library functions written in languages such as Fortran or C. If, however, some algorithm does not have an existing low-level implementation, then the scientific programmer must either accept sub-standard performance (sometimes orders of magnitude slower than native code) or implement the desired functionality themselves in a less productive but more efficient language.
To alleviate this problem, this thesis presents Parakeet, a runtime compiler for an array-oriented subset of Python. Parakeet does not replace the Python interpreter, but rather selectively augments it by compiling and executing functions explicitly marked by the programmer. Parakeet uses runtime type specialization to eliminate the performance-defeating dynamism of untyped Python code. Parakeet's pervasive use of data-parallel operators as a means for implementing array operations enables high-level restructuring optimization and compilation to parallel hardware such as multi-core CPUs and graphics processors. We evaluate Parakeet on a collection of numerical benchmarks and demonstrate its dramatic capacity for accelerating array-oriented Python programs.
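Runtime type specialization of the kind described above can be caricatured in pure Python. This is a toy sketch, not Parakeet's API: the hypothetical `specialize` decorator merely caches one variant per argument-type signature, where a real compiler would emit typed native code at that point.

```python
def specialize(fn):
    """Toy model of runtime type specialization: dispatch on the tuple of
    argument types, creating one 'compiled' variant (here: just a cache
    entry) the first time each signature is seen."""
    cache = {}

    def wrapper(*args):
        sig = tuple(type(a) for a in args)
        if sig not in cache:
            # A real runtime compiler would generate typed, optimized
            # machine code for this signature here.
            cache[sig] = fn
        return cache[sig](*args)

    wrapper.signatures = cache
    return wrapper

@specialize
def add(x, y):
    return x + y

add(1, 2)        # first call with (int, int): triggers a specialization
add(1.0, 2.0)    # first call with (float, float): triggers a second one
```

The payoff of this dispatch structure is that each cached variant can assume fixed argument types, which is what eliminates the dynamism of the untyped source function.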
Title: A Deep Learning Pipeline for Image Understanding and Acoustic Modeling
Candidate: Sermanet, Pierre
Advisor(s): LeCun, Yann
Abstract:
One of the biggest challenges artificial intelligence faces is making sense of the real world through sensory signals such as audio or video. Noisy inputs, varying object viewpoints, deformations and lighting conditions turn it into a high-dimensional problem which cannot be efficiently solved without learning from data.
This thesis explores a general way of learning from high-dimensional data (video, images, audio, text, financial data, etc.) called deep learning. It thrives on the increasingly large amounts of data available to learn robust and invariant internal features in a hierarchical manner directly from the raw signals.
We propose a unified pipeline for feature learning, recognition, localization and detection using Convolutional Networks (ConvNets) that can obtain state-of-the-art accuracy on a number of pattern recognition tasks, including acoustic modeling for speech recognition and object recognition in computer vision. ConvNets are particularly well suited for learning from continuous signals in terms of both accuracy and efficiency.
Additionally, a novel and general deep learning approach to detection is proposed and successfully demonstrated on the most challenging vision datasets. We then generalize it to other modalities such as speech data. This approach allows accurate localization and detection of objects in images or phones in voice signals by learning to predict boundaries from internal representations. We extend the reach of deep learning from classification to detection tasks in an integrated fashion by learning multiple tasks using a single deep model. This work is among the first to outperform human vision and establishes a new state of the art on some computer vision and speech recognition benchmarks.
Title: Towards New Interfaces For Pedagogy
Candidate: Stein, Murphy
Advisor(s): Perlin, Ken
Abstract:
Developing technology to help people teach and learn is an important topic in Human Computer Interaction (HCI).
In this thesis we present three studies on this topic. In the first study, we demonstrate new games for learning mathematics and discuss the evidence for key design decisions from user studies. In the second study, we develop a real-time video compositing system for distance education and share evidence for its potential value compared to standard techniques from two user studies. In the third study, we demonstrate our markerless hand tracking interface for real-time 3D manipulation and explain its advantages compared to other state-of-the-art methods.
A data-driven methodology is applied intensively throughout the course of this study. Several paraphrase corpora are constructed using automatic techniques, experts and crowdsourcing platforms. Paraphrase systems are trained and evaluated by using these data as a cornerstone. We show that even with a very noisy or a relatively small amount of parallel training data, it is possible to learn paraphrase models which capture linguistic phenomena. This work expands the scope of paraphrase studies to targeting different language variations, and more potential applications, such as text normalization and domain adaptation.
Title: Computational Complexity Implications of Secure Coin-Flipping
Candidate: Tentes, Aristeidis
Advisor(s): Dodis, Yevgeniy
Abstract:
Modern cryptography is based on computational intractability assumptions, e.g., Factoring, Discrete Logarithm, or Diffie-Hellman. However, since an assumption might be proven incorrect, much effort has gone into constructing cryptographic primitives from the weakest possible assumption. The most popular minimal assumption, which is implied by the existence of almost all cryptographic primitives, is the existence of One-Way Functions. Coin-flipping protocols are known to be implied by One-Way Functions; however, a complete characterization of the converse direction is not known. There was even speculation that weak notions of coin-flipping protocols might be strictly weaker than One-Way Functions. In this thesis we show that even very weak notions of coin-flipping protocols do imply One-Way Functions. In particular, we show that the existence of a coin-flipping protocol secure against any non-trivial constant bias (e.g., 0.499) implies the existence of One-Way Functions. This improves upon a recent result of Haitner and Omri [FOCS '11], who proved this implication for protocols with bias 0.207. Unlike the former result, our result also holds for weak coin-flipping protocols.
Title: Data-driven Approaches for Paraphrasing across Language Variations
Candidate: Xu, Wei
Advisor(s): Grishman, Ralph
Abstract:
Our language changes very rapidly, accompanying political, social and cultural trends, as well as the evolution of science and technology. The Internet, especially the social media, has accelerated this process of change. This poses a severe challenge for both human beings and natural language processing (NLP) systems, which usually only model a snapshot of language presented in the form of text corpora within a certain domain and time frame.
While much previous effort has investigated monolingual paraphrase and bilingual translation, we focus on modeling meaning-preserving transformations between variants of a single language. We use Shakespearean and Internet language as examples to investigate various aspects of this new paraphrase problem, including acquisition, generation, detection and evaluation.
A data-driven methodology is applied intensively throughout the course of this study. Several paraphrase corpora are constructed using automatic techniques, experts and crowdsourcing platforms. Paraphrase systems are trained and evaluated by using these data as a cornerstone. We show that even with a very noisy or a relatively small amount of parallel training data, it is possible to learn paraphrase models which capture linguistic phenomena. This work expands the scope of paraphrase studies to targeting different language variations, and more potential applications, such as text normalization and domain adaptation.
Title: Positive-Unlabeled Learning in the Context of Protein Function Prediction
Candidate: Youngs, Noah
Advisor(s): Shasha, Dennis
Abstract:
With the recent proliferation of large, unlabeled data sets, a particular subclass of semi-supervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire dataset, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems.
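As an aside for readers unfamiliar with this setting, one classic PU-learning recipe (due to Elkan and Noto, and not an algorithm from this thesis) first trains an ordinary classifier to separate the labeled positives from the unlabeled pool, then calibrates its scores into true class posteriors. A minimal sketch, with toy scores standing in for a real classifier's outputs:

```python
def estimate_c(scores_on_labeled_positives):
    # c estimates P(labeled | positive): the mean classifier score
    # on the known positive examples.
    return sum(scores_on_labeled_positives) / len(scores_on_labeled_positives)

def positive_posterior(score_unlabeled, c):
    # Under the "labeled positives selected completely at random" assumption,
    # P(positive | x) = P(labeled | x) / c, clipped to [0, 1].
    return min(score_unlabeled / c, 1.0)

c = estimate_c([0.8, 0.7, 0.9])   # hypothetical scores on labeled positives
p = positive_posterior(0.4, c)    # adjusted posterior for an unlabeled example
```

The key point is that no labeled negatives are needed; the calibration constant is recovered from the positives alone.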
A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents, and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples. Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem).
While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. Machine learning methods have already been successfully applied to this problem, but with many organisms having a small number of positive annotated training examples, and the lack of availability of almost any labeled negative examples, PU learning algorithms can make large gains in predictive performance.
The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species.
The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios).
The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically.
Title: Hierarchical Convolutional Deep Learning in Computer Vision
Candidate: Zeiler, Matthew
Advisor(s): Fergus, Rob
Abstract:
It has long been the goal in computer vision to learn a hierarchy of features useful for object recognition. Spanning the two traditional paradigms of machine learning, unsupervised and supervised learning, we investigate the application of deep learning methods to tackle this challenging task and to learn robust representations of images.
We begin our investigation with the introduction of a novel unsupervised learning technique called deconvolutional networks. Based on convolutional sparse coding, we show this model learns interesting decompositions of images into parts without object label information. This method, which easily scales to large images, becomes increasingly invariant by learning multiple layers of feature extraction coupled with pooling layers. We introduce a novel pooling method called Gaussian pooling to enable these layers to store continuous location information while being differentiable, creating a unified objective function to optimize.
In the supervised learning domain, a well-established model for recognition of objects is the convolutional network. We introduce a new regularization method for convolutional networks called stochastic pooling which relies on sampling noise to prevent these powerful models from overfitting. Additionally, we show novel visualizations of these complex models to better understand what they learn and to provide insight on how to develop state-of-the-art architectures for large-scale classification of 1,000 different object categories.
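The sampling idea behind stochastic pooling can be sketched for a single pooling region (a simplified illustration following the published formulation, not the thesis's implementation): an activation is drawn with probability proportional to its magnitude, and at test time the probability-weighted average is used instead.

```python
import random

def stochastic_pool(activations, rng=random):
    # Training-time pooling: sample one activation with probability
    # proportional to its (non-negative, e.g. post-ReLU) magnitude.
    total = sum(activations)
    if total == 0:                       # all-zero region: output zero
        return 0.0
    r = rng.random() * total
    acc = 0.0
    for a in activations:
        acc += a
        if r < acc:
            return a
    return activations[-1]

def prob_weighted_pool(activations):
    # Test-time variant: the expectation of the sampled activation.
    total = sum(activations)
    return sum(a * a for a in activations) / total if total else 0.0
```

The sampling noise is what regularizes training; deterministic max or average pooling has no such effect.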
We also investigate some other related problems in deep learning. First, we introduce a model for the task of mapping one high-dimensional time series sequence onto another. Second, we address the choice of nonlinearity in neural networks, showing evidence that rectified linear units outperform other types in automatic speech recognition. Finally, we introduce a novel optimization method called ADADELTA which shows promising convergence speeds in practice while being robust to hyper-parameter selection.
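For reference, the ADADELTA update can be written down compactly for a single scalar parameter (a sketch following the published rule, with the paper's default rho and eps; details may differ from the thesis):

```python
import math

class Adadelta:
    # Decaying averages of squared gradients and squared updates determine
    # the effective step size, so no learning rate is hand-tuned.
    def __init__(self, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.eg2 = 0.0    # running average of squared gradients
        self.edx2 = 0.0   # running average of squared updates

    def step(self, grad):
        self.eg2 = self.rho * self.eg2 + (1 - self.rho) * grad * grad
        dx = -(math.sqrt(self.edx2 + self.eps) /
               math.sqrt(self.eg2 + self.eps)) * grad
        self.edx2 = self.rho * self.edx2 + (1 - self.rho) * dx * dx
        return dx

# A few descent steps on f(x) = x^2 from x = 1.0, with nothing to tune:
opt, x = Adadelta(), 1.0
for _ in range(10):
    x += opt.step(2.0 * x)   # f'(x) = 2x; x moves toward the minimum at 0
```

The ratio of the two running RMS values gives the update the same units as the parameter, which is what makes the method insensitive to hyper-parameter choice.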
Title: Incentive-Centered Design of Money-Free Mechanisms
Candidate: Gkatzelis, Vasilis
Advisor(s): Cole, Richard
Abstract:
This thesis serves as a step toward a better understanding of how to design fair and efficient multiagent resource allocation systems by bringing the incentives of the participating agents to the center of the design process. As the quality of these systems critically depends on the ways in which the participants interact with each other and with the system, an ill-designed set of incentives can lead to severe inefficiencies. The special focus of this work is on the problems that arise when the use of monetary exchanges between the system and the participants is prohibited. This is a common restriction that substantially complicates the designer's task; we nevertheless provide a sequence of positive results in the form of mechanisms that maximize efficiency or fairness despite the possibly self-interested behavior of the participating agents.
The first part of this work is a contribution to the literature on approximate mechanism design without money. Given a set of divisible resources, our goal is to design a mechanism that allocates them among the agents. The main complication here is due to the fact that the agents' preferences over different allocations of these resources may not be known to the system. Therefore, the mechanism needs to be designed in such a way that it is in the best interest of every agent to report the truth about her preferences; since monetary rewards and penalties cannot be used in order to elicit the truth, a much more delicate regulation of the resource allocation is necessary. Our contribution mostly revolves around a new truthful mechanism that we propose, which we call the /Partial Allocation/ mechanism. We first show how to use the two-agent version of this mechanism to create a system with the best currently known worst-case efficiency guarantees for problem instances involving two agents. We then consider fairness measures and prove that the general version of this elegant mechanism yields surprisingly good approximation guarantees for the classic problem of fair division. More specifically, we use the well established solution of /Proportional Fairness/ as a benchmark and we show that for an arbitrary number of agents and resources, and for a very large class of agent preferences, our mechanism provides /every agent/ with a value close to her proportionally fair value. We complement these results by also studying the limits of truthful money-free mechanisms, and by providing other mechanisms for special classes of problem instances. Finally, we uncover interesting connections between our mechanism and the Vickrey-Clarke-Groves mechanism from the literature on mechanism design with money.
The second part of this work concerns the design of money-free resource allocation mechanisms for /decentralized/ multiagent systems. As the world has become increasingly interconnected, such systems are using more and more resources that are geographically dispersed; in order to provide scalability in these systems, the mechanisms need to be decentralized. That is, the allocation decisions for any given resource should not assume global information regarding the system's resources or participants. We approach this restriction by using /coordination mechanisms/: a collection of simple resource allocation policies, each of which controls only one of the resources and uses only local information regarding the state of the system. The system's participants, facing these policies, have the option of choosing which resources they will access. We study a variety of coordination mechanisms and we prove that the social welfare of any equilibrium of the games that these mechanisms induce is a good approximation of the optimal welfare. Once again, we complement our positive results by studying the limits of coordination mechanisms. We also provide a detailed explanation of the seemingly counter-intuitive incentives that some of these mechanisms yield. Finally, we use this understanding in order to design a combinatorial constant-factor approximation algorithm for maximizing the social welfare, thus providing evidence that a game-theoretic mindset can lead to novel optimization algorithms.
Title: Locality Optimization for Data Parallel Programs
Candidate: Hielscher, Eric
Advisor(s): Shasha, Dennis
Abstract:
Productivity languages such as NumPy and Matlab make it much easier to implement data-intensive numerical algorithms than efficiency languages such as C++ do. This is important, as many programmers (1) aren't expert programmers, or (2) don't have time to tune their software for performance, since their main job focus is not programming per se. The tradeoff is typically one of execution time versus programming time: unless there are specialized library functions or precompiled primitives for a particular task, a productivity language is likely to be orders of magnitude slower than an efficiency language.
In this thesis, we present Parakeet, an array-oriented language embedded within Python, a widely used productivity language. The Parakeet just-in-time compiler dynamically translates whole user functions to high-performance multi-threaded native code. This thesis focuses in particular on our use of data parallel operators as a basis for locality-enhancing program optimizations. We transform Parakeet programs written with the classic data parallel operators (Map, Reduce, and Scan; in Parakeet these are called adverbs) to process small local pieces (called tiles) of data at a time. To express this locality we introduce three new adverbs: TiledMap, TiledReduce, and TiledScan. These tiled adverbs are not exposed to the programmer but rather are automatically generated by a tiling transformation.
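The shape of the tiling transformation can be illustrated with a toy, pure-Python analogue (tile size and function names are ours, not Parakeet's): a flat Reduce becomes an outer reduce over per-tile partial results, which is valid whenever the operator is associative and `init` is its identity.

```python
def reduce_flat(op, xs, init):
    # The original, untiled Reduce adverb.
    acc = init
    for x in xs:
        acc = op(acc, x)
    return acc

def tiled_reduce(op, xs, init, tile=4):
    # TiledReduce analogue: each tile is small enough to stay in cache;
    # per-tile partials are then combined by an outer reduce.
    acc = init
    for start in range(0, len(xs), tile):
        partial = reduce_flat(op, xs[start:start + tile], init)
        acc = op(acc, partial)
    return acc
```

Both forms compute the same value; the tiled form simply reorders the work so that each inner pass touches a cache-sized chunk of data.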
We use this tiling algorithm to bring two classic locality optimizations to a data parallel setting: cache tiling, and register tiling. We set register tile sizes statically at compile time, but use an online autotuning search to find good cache tile sizes at runtime. We evaluate Parakeet and these optimizations on various benchmark programs, and exhibit excellent performance even compared to typical C implementations.
Title: Piecewise Smooth Surfaces with Features
Candidate: Kovacs, Denis
Advisor(s): Zorin, Denis
Abstract:
The creation, manipulation and display of piecewise smooth surfaces has been a fundamental topic in computer graphics since its inception. The applications range from highest-quality surfaces for manufacturing in CAD, to believable animations of virtual creatures in Special Effects, to virtual worlds rendered in real-time in computer games.
Our focus is on improving the a) mathematical representation and b) automatic construction of such surfaces from finely sampled meshes in the presence of features. Features can be areas of higher geometric detail in an otherwise smooth area of the mesh, or sharp creases that contrast the overall smooth appearance of an object.
In the first part, we build on techniques that define piecewise smooth surfaces to improve their quality in the presence of features. We present a crease technique suitable for real-time applications that helps increase the perceived visual detail of objects that must be represented very compactly and evaluated efficiently.
We then introduce a new subdivision scheme that allows the use of T-junctions for better local refinement. It thus reduces the need for extraordinary vertices, which can cause surface artifacts especially on animated objects.
In the second part, we consider the problem of how to build the control meshes of piecewise smooth surfaces, in a way that the resulting surface closely approximates an existing data set (such as a 3D range scan), particularly in the presence of features. To this end, we introduce a simple modification that can be applied to a wide range of parameterization techniques to obtain an anisotropic parameterization. We show that a resulting quadrangulation can indeed better approximate the original surface. Finally, we present a quadrangulation scheme that turns a data set into a quad mesh with T-junctions, which we then use as a T-Spline control mesh to obtain a smooth surface.
Title: Low-level Image Priors and Laplacian Preconditioners for Applications in Computer Graphics and Computational Photography
Candidate: Krishnan, Dilip
Advisor(s): Fergus, Rob
Abstract:
In the first part of this thesis, we develop novel image priors and efficient algorithms for image denoising and deconvolution applications. Our priors and algorithms enable fast, high-quality restoration of images corrupted by noise or blur. In the second part, we develop effective preconditioners for Laplacian matrices. Such matrices arise in a number of computer graphics and computational photography problems such as image colorization, tone mapping and geodesic distance computation on 3D meshes.
The first prior we develop is a spectral prior which models correlations between different spectral bands. We introduce a prototype camera and flash system, used in conjunction with the spectral prior, to enable taking photographs at very low light levels. Our second prior is a sparsity-based measure for blind image deconvolution. This prior assigns lower costs to sharp images than to blurred ones, enabling the use of simple and efficient Maximum a-Posteriori algorithms.
We develop a new algorithm for the non-blind deconvolution problem. This enables extremely fast deconvolution of images blurred by a known blur kernel. Our algorithm uses Fast Fourier Transforms and Lookup Tables to achieve real-time deconvolution performance with non convex gradient-based priors. Finally, for certain image restoration problems with no clear formation model, we demonstrate how learning a direct mapping between original/corrupted patch pairs enables effective restoration.
We develop multi-level preconditioners to solve discrete Poisson equations. Existing multilevel preconditioners have two major drawbacks: excessive bandwidth growth at coarse levels; and the inability to adapt to problems with highly varying coefficients. Our approach tackles both these problems by introducing sparsification and compensation steps at each level. We interleave the selection of fine and coarse-level variables with the removal of weak connections between potential fine-level variables (sparsification) and compensate for these changes by strengthening nearby connections. By applying these operations before each elimination step and repeating the procedure recursively on the resulting smaller systems, we obtain highly efficient schemes. The construction is linear in time and memory. Numerical experiments demonstrate that our new schemes outperform state of the art methods, both in terms of operation count and wall-clock time, over a range of 2D and 3D problems.
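For readers unfamiliar with the role a preconditioner plays here, a deliberately simple illustration (our own, far weaker than the multilevel schemes above): Jacobi-preconditioned conjugate gradients on a small 1-D discrete Poisson system. The thesis's schemes replace the trivial diagonal `M` below with a powerful multilevel hierarchy.

```python
def pcg_poisson_1d(b, iters=50):
    # Solve A x = b for the 1-D Poisson matrix A = tridiag(-1, 2, -1),
    # using conjugate gradients with the Jacobi preconditioner M = diag(A) = 2I.
    n = len(b)

    def apply_A(v):
        return [2 * v[i]
                - (v[i - 1] if i > 0 else 0)
                - (v[i + 1] if i < n - 1 else 0)
                for i in range(n)]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    x = [0.0] * n
    r = b[:]                        # residual b - A*0
    z = [ri / 2.0 for ri in r]      # z = M^{-1} r
    p = z[:]
    rz = dot(r, z)
    for _ in range(iters):
        Ap = apply_A(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) < 1e-20:       # converged
            break
        z = [ri / 2.0 for ri in r]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

x = pcg_poisson_1d([1.0] * 5)
```

A better preconditioner shrinks the iteration count; the multilevel constructions in this thesis aim for iteration counts that stay essentially flat as the problem grows.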
Title: Relation Extraction with Weak Supervision and Distributional Semantics
Candidate: Min, Bonan
Advisor(s): Grishman, Ralph
Abstract:
Relation Extraction aims at detecting and categorizing semantic relations between pairs of entities in unstructured text. It benefits an enormous number of applications such as Web search and Question Answering. Traditional approaches for relation extraction either rely on learning from a large number of accurate human-labeled examples or pattern matching with hand-crafted rules. These resources are very laborious to obtain and can only be applied to a narrow set of target types of interest.
This talk focuses on learning relations with little or no human supervision. First, we examine the approach that treats relation extraction as a supervised learning problem. We develop an algorithm that is able to train a model with approximately 1/3 of the human-annotation cost and that matches the performance of models trained with high-quality annotation. Second, we investigate distant supervision, a weakly supervised algorithm that automatically generates its own labeled training data. We develop a latent Bayesian framework for this purpose. By using a model which provides a better approximation of the weak source of supervision, it outperforms the state-of-the-art methods. Finally, we investigate the possibility of building all relational tables beforehand with an unsupervised relation extraction algorithm. We develop an effective yet efficient algorithm that combines the power of various semantic resources that are automatically mined from a corpus based on distributional semantics. The algorithm is able to extract a very large set of relations from the web at high precision.
Title: Usable Security Mechanisms in the Developing World
Candidate: Paik, Michael
Advisor(s): Subramanian, Lakshminarayanan
Abstract:
Security and privacy are increasingly important in our interconnected world. Cybercrimes, including identity theft, phishing, and other attacks, are on the rise, and computer-assisted crimes such as theft and stalking are becoming commonplace.
Contemporary with this trend is the uptake of technology in the developing world, proceeding at a pace often outstripping that of the developed world. Penetration of mobile phones and services such as healthcare delivery, mobile money, and social networking is higher than that of even amenities like electricity. Connectivity is empowering disenfranchised people, providing information and services to the heretofore disconnected poor.
There are efforts to use technology to enhance physical security and well-being in the developing world, including citizen journalism, education, improving drug security, attendance tracking, etc.
However, there are significant challenges to security, both in the digital and the physical domains, that are particular to these contexts. Infrastructure is constrained; literacy, numeracy, and familiarity with basic technologies cannot be assumed; and environments are harsh on hardware. These circumstances often prevent security best practices from being transplanted directly to these regions; in many ways, the adoption of technology has overtaken users' ability to use it safely, and their trust in it is oftentimes greater than it should be.
This dissertation describes several systems and methodologies designed to operate in the developing world, using technologies and metaphors that are familiar to users and that are robust against the operating environments.
It begins with an overview of the state of affairs, and several threat models. It continues with a description of Signet, a method to use SIM cards as trusted computing hardware to provide secure signed receipts. Next, Epothecary describes a low-infrastructure system for tracking pharmaceuticals that also significantly and asymmetrically increases costs for counterfeiters. The balance consists of a description of a low-cost Biometric Terminal currently in use by NGOs in India performing DOTS-based tuberculosis treatment, Blacknoise, an investigation into the use of low-cost cameraphones with noisy imaging sensors for image-based steganography, and finally Innoculous, a low-cost, crowdsourcing system for combating the spread of computer viruses, particularly among non-networked computers, while also collecting valuable "epidemiological" data.
Title: Inapproximability Reductions and Integrality Gaps
Candidate: Popat, Preyas
Advisor(s): Khot, Subhash
Abstract:
In this thesis we prove intractability results for several well studied problems in combinatorial optimization.
Closest Vector Problem with Preprocessing (CVPP): We show that the preprocessing version of the well known Closest Vector Problem is hard to approximate to an almost polynomial factor unless NP is in quasi polynomial time. The approximability of CVPP is closely related to the security of lattice based cryptosystems.
Pricing Loss Leaders: We show hardness of approximation results for the problem of maximizing profit from buyers with single-minded valuations, where each buyer is interested in bundles of at most k items and the items are allowed to have negative prices ("Loss Leaders"). For k = 2, we show that assuming the Unique Games Conjecture, it is hard to approximate the profit to any constant factor. For k > 2, we show the same result assuming P != NP.
Integrality gaps: We show Semidefinite Programming (SDP) integrality gaps for Unique Games and 2-to-1 Games. Inapproximability results for these problems imply inapproximability results for many fundamental optimization problems. For the first problem, we show "approximate" integrality gaps for super-constant rounds of the powerful Lasserre hierarchy. For the second problem, we show integrality gaps for the basic SDP relaxation with perfect completeness.
Title: Natural Interaction with a Virtual World
Candidate: Rosenberg, Ilya
Advisor(s): Perlin, Ken
Abstract:
A large portion of computer graphics and human/computer interaction is concerned with the creation, manipulation and use of two and three dimensional objects existing in a virtual world. By creating more natural physical interfaces and virtual worlds which behave in physically plausible ways, it is possible to empower nonexpert users to create, work and play in virtual environments. This thesis is concerned with the design, creation, and optimization of user-input devices which break down the barriers between the real and the virtual as well as the development of software algorithms which allow for the creation of physically realistic virtual worlds.
Title: Security Mechanisms for Physical Authentication
Candidate: Sharma, Ashlesh
Advisor(s): Subramanian, Lakshminarayanan
Abstract:
Counterfeiting of goods is a worldwide problem, with losses in the billions of dollars. It is estimated that 10% of all world trade is counterfeit. To alleviate counterfeiting, a number of techniques are used, from barcodes to holograms. But these technologies are easily reproducible, and hence they are ineffective against counterfeiters.
In this thesis, we introduce PaperSpeckle, a novel way to fingerprint any piece of paper based on its unique microscopic properties. Next, we extend and generalize this work to introduce TextureSpeckle, a novel way to fingerprint and characterize the uniqueness of the surface of a material based on the interaction of light with the natural randomness present in the rough structure at the microscopic level of the surface. We show the existence and uniqueness of these fingerprints by analyzing a large number of surfaces (over 20,000 microscopic surfaces and 200 million pairwise comparisons) of different materials. We also define the entropy of the fingerprints and show how each surface can be uniquely identified in a robust manner even in case of damage.
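The robust-matching idea can be illustrated abstractly (our simplification, not the thesis's actual comparison method): treat each fingerprint as a bit string and accept a probe whose normalized Hamming distance from the enrolled fingerprint falls below a threshold, so that a few bits corrupted by surface damage do not break identification.

```python
def hamming_fraction(a, b):
    # Fraction of positions where two equal-length bit vectors disagree.
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def authenticate(enrolled, probe, threshold=0.25):
    # Accept if the probe fingerprint is close enough to the enrolled one.
    return hamming_fraction(enrolled, probe) <= threshold

enrolled = [1, 0, 1, 1, 0, 0, 1, 0]
damaged  = [1, 0, 1, 0, 0, 0, 1, 0]   # one flipped bit out of eight
```

The threshold trades off tolerance to damage against the false-accept rate; with high-entropy fingerprints, the two distributions of distances are well separated.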
From a theoretical perspective, we consider a discrete approximation model from light scattering theory which allows us to compute the speckle pattern for a given surface. Under this computational model, we show that given a speckle pattern, it is computationally hard to reconstruct the physical surface characteristics by simulating the multiple scattering of light. Using TextureSpeckle as a security primitive, we design secure protocols to enable a variety of scenarios such as: i) supply chain security, where applications range from drug tracking to inventory management, ii) mobile based secure transfer of money (mobile money), where any paper can be changed to an on-demand currency, and iii) fingerprint ecosystem, a cloud based system, where any physical object can be identified and authenticated on-demand.
We discuss the construction of the prototype device, ranging from optical lens design to usability aspects, and show how our technique can be applied in the real world to alleviate counterfeiting and forgery. In addition, we introduce Pattern Matching Puzzles (PMPs), a usable security mechanism that provides a 'human computable' one-time MAC (message authentication code) for every transaction, making each transaction information-theoretically secure against various adversarial attacks. The puzzles are easy to solve even for semi-literate users with simple pattern recognition skills.
Title: Augmenting Information Flow for Visual Privacy
Candidate: Spiro, Ian
Advisor(s): Bregler, Christopher
Abstract:
In the Information Age, visual media take on powerful new forms. Photographs once printed on paper and stored in physical albums now exist as digital files. With the rise of social media, photo data has moved to the cloud for rapid dissemination. The upside can be measured in terms of increased efficiency, greater reach, or reduced printing costs. But there is a downside that is harder to quantify: the risk of private photos or videos leaking inappropriately. Human imagery is potentially sensitive, revealing private details of a person's body, lifestyle, activities, and more. Images create visceral responses and have the potential to permanently damage a person's reputation.
We employed the theory of contextual integrity to explore privacy aspects of transmitting the human form. In response to privacy threats from new sociotechnical systems, we developed practical solutions that have the potential to restore balance. The main work is a set of client-side, technical interventions that can be used to alter information flows and provide features to support visual privacy. In the first approach, we use crowdsourcing to extract specific, useful human signal from video to decouple it from bundled identity information. The second approach is an attempt to achieve similar ends with pure software. Instead of using information workers, we developed a series of filters that alter video to hide identity information while still revealing motion signal. The final approach is an attempt to control the recipients of photos by encoding them in the visual channel. The software completely protects data from third-parties who lack proper credentials and maintains data integrity by exploiting the visual coherence of uploaded images, even in the face of JPEG compression. The software offers end-to-end encryption that is compatible with existing social media applications.
Title: Toward a computational solution to the inverse problem of how hypoxia arises in metabolically heterogeneous cancer cell populations
Candidate: Sundstrom, Andrew
Advisor(s): Mishra, Bud; Bar-Sagi, Dafna
Abstract:
As a tumor grows, it rapidly outstrips its blood supply, leaving portions of the tumor to undergo hypoxia. Hypoxia is strongly correlated with poor prognosis, as it renders tumors less responsive to chemotherapy and radiotherapy. During hypoxia, HIFs (hypoxia-inducible factors) upregulate production of glycolysis enzymes and VEGF, thereby promoting metabolic heterogeneity and angiogenesis, and proving to be directly instrumental in tumor progression. Prolonged hypoxia leads to necrosis, which in turn activates inflammatory responses that produce cytokines that stimulate tumor growth. Hypoxic tumor cells interact with macrophages and fibroblasts, both involved with inflammatory processes tied to tumor progression. It is therefore of clinical and theoretical significance to understand: under what conditions does hypoxia arise in a heterogeneous cell population? Our aim is to transform this biological origins problem into a computational inverse problem, and then attack it using approaches from computer science. First, we develop a minimal, stochastic, spatiotemporal simulation of large heterogeneous cell populations interacting in three dimensions. The simulation can manifest stable localized regions of hypoxia. Second, we employ and develop a variety of algorithms to analyze histological images of hypoxia in xenografted colorectal tumors, and extract features to construct a spatiotemporal logical characterization of hypoxia. We also consider characterizing hypoxia by a linear regression functional learning mechanism that yields a similarity score. Third, we employ a Bayesian statistical model checking algorithm that can determine, over some bounded number of simulation executions, whether hypoxia is likely to emerge under some fixed set of simulation parameters and some fixed logical or functional description of hypoxia.
Driving the model checking process is one of three adaptive Monte Carlo sampling algorithms we developed to explore the high dimensional space of simulation initial conditions and operational parameters. Taken together, these three system components formulate a novel approach to the inverse problem above, and constitute a design for a tool that can be placed into the hands of experimentalists, for testing hypotheses based upon known parameter values or ones the tool might discover. In principle, this design can be generalized to other biological phenomena involving large heterogeneous populations of interacting cells.
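As a rough illustration of the statistical model checking idea described above, the sketch below runs a stochastic simulation repeatedly and uses a Beta posterior to decide whether a property (here standing in for "hypoxia emerges") holds with probability above a threshold, stopping early once the posterior is confident. The simulation stub, the uniform prior, and the stopping rule are illustrative assumptions, not the thesis's actual algorithm.

```python
import random

def beta_posterior_tail(successes, failures, theta, steps=10_000):
    """P(p > theta) under a Beta(1 + successes, 1 + failures) posterior,
    computed as a simple Riemann-sum ratio of the unnormalized density."""
    a, b = 1 + successes, 1 + failures
    xs = [i / steps for i in range(1, steps)]
    dens = [x ** (a - 1) * (1 - x) ** (b - 1) for x in xs]
    tail = sum(d for x, d in zip(xs, dens) if x > theta)
    return tail / sum(dens)

def bayesian_model_check(simulate, theta=0.5, confidence=0.99, max_runs=500):
    """Decide whether the property checked by simulate() holds with
    probability above theta, within a bounded number of executions."""
    succ = fail = 0
    for _ in range(max_runs):
        if simulate():
            succ += 1
        else:
            fail += 1
        tail = beta_posterior_tail(succ, fail, theta)
        if tail > confidence:
            return True, succ + fail     # property likely to emerge
        if tail < 1 - confidence:
            return False, succ + fail    # property unlikely to emerge
    return None, max_runs  # undecided within the bounded number of runs

# Stand-in "simulation": the property emerges in roughly 80% of runs.
random.seed(0)
verdict, runs = bayesian_model_check(lambda: random.random() < 0.8)
```

Because the decision is sequential, easy parameter settings are resolved after only a handful of simulation executions, which is what makes such a checker practical to drive from an outer sampling loop over initial conditions.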
Title: Rethinking Information Privacy for the Web
Candidate: Tierney, Matthew
Advisor(s): Subramanian, Lakshminarayanan
Abstract:
In response to Supreme Court Justice Samuel Alito's opinion that society should accept a decline in personal privacy with modern technology, Hanni M. Fakhoury, staff attorney with the Electronic Frontier Foundation, argued "Technology doesn't involve an 'inevitable' tradeoff [of increased convenience] with privacy. The only inevitability must be the demand that privacy be a value built into our technology" [42]. Our position resonates with Mr. Fakhoury's. In this thesis, we present three artifacts that address the balance between usability, efficiency, and privacy as we rethink information privacy for the web.
In the first part of this thesis, we present the design, implementation and evaluation of Cryptagram, a system designed to enhance online photo privacy. Cryptagram enables users to convert photos into encrypted images, which the users upload to Online Social Networks (OSNs). Users directly manage access control to those photos via shared keys that are independent of OSNs or other third parties. OSNs apply standard image transformations (JPEG compression) to all uploaded images, so Cryptagram provides image encoding and encryption protocols that are tolerant to these transformations. Cryptagram guarantees that the recipient with the right credentials can completely retrieve the original image from the transformed version of the uploaded encrypted image while the OSN cannot infer the original image. Cryptagram's browser extension integrates seamlessly with preexisting OSNs, including Facebook and Google+, and currently has over 400 active users.
In the second part of this thesis, we present the design and implementation of Lockbox, a system designed to provide end-to-end private file-sharing with the convenience of Google Drive or Dropbox. Lockbox uniquely combines two important design points: (1) a federated system for detecting and recovering from server equivocation and (2) a hybrid cryptosystem over delta encoded data to balance storage and bandwidth costs with efficiency for syncing end-user data. To facilitate appropriate use of public keys in the hybrid cryptosystem, we integrate a service that we call KeyNet, which is a web service designed to leverage existing authentication media (e.g., OAuth, verified email addresses) to improve the usability of public key cryptography.
In the third part of this thesis, we present the design of Compass, which realizes the philosophical privacy framework of contextual integrity (CI), which we believe better captures users' privacy expectations in OSNs, as a full OSN design. In Compass, three properties hold: (a) users are associated with roles in specific contexts; (b) every piece of information posted by a user is associated with a specific context; (c) norms defined on roles and attributes of posts in a context govern how information is shared across users within that context. Given the definition of a context and its corresponding norm set, we describe the design of a compiler that compiles the human-readable norm definitions into appropriate information flow verification logic, including: (a) a compact binary decision diagram for the norm set; and (b) access control code that evaluates how a new post to a context will flow. We have implemented a prototype that shows how the philosophical framework of contextual integrity can be realized in practice to achieve strong privacy guarantees with limited additional verification overhead.
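A miniature of the norm-governed information flow described above might look as follows. The context, roles, and norm tuples are invented for illustration; they stand in for the compiled verification logic (the actual design compiles norms into binary decision diagrams rather than iterating over rules):

```python
# Hypothetical miniature of contextual-integrity checking: norms are
# tuples over (sender role, receiver role, required post tag) within a
# single context. All names and structure here are illustrative.
CONTEXT = {
    "name": "health-support-group",
    "roles": {"alice": "patient", "bob": "doctor", "carol": "advertiser"},
    "norms": [
        ("patient", "doctor", "medical"),   # medical posts flow patient -> doctor
        ("patient", "patient", "support"),  # peer-support posts flow among patients
    ],
}

def may_flow(context, sender, receiver, post_tags):
    """Return True iff some norm of the context licenses this flow."""
    roles = context["roles"]
    if sender not in roles or receiver not in roles:
        return False  # outsiders never receive in-context information
    for s_role, r_role, tag in context["norms"]:
        if roles[sender] == s_role and roles[receiver] == r_role and tag in post_tags:
            return True
    return False
```

The point of compiling such norm sets into a decision diagram, as the abstract describes, is that the per-post check then costs a single walk of a compact structure rather than a scan over all norms.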
Title: Learning Hierarchical Feature Extractors For Image Recognition
Candidate: Boureau, Y-Lan
Advisor(s): LeCun, Yann
Abstract:
Telling cow from sheep is effortless for most animals, but requires much engineering for computers. In this thesis, we seek to tease out basic principles that underlie many recent advances in image recognition. First, we recast many methods into a common unsupervised feature extraction framework based on an alternation of coding steps, which encode the input by comparing it with a collection of reference patterns, and pooling steps, which compute an aggregation statistic summarizing the codes within some region of interest of the image.
Within that framework, we conduct extensive comparative evaluations of many coding or pooling operators proposed in the literature. Our results demonstrate a robust superiority of sparse coding (which decomposes an input as a linear combination of a few visual words) and max pooling (which summarizes a set of inputs by their maximum value). We also propose macrofeatures, which import into the popular spatial pyramid framework the joint encoding of nearby features commonly practiced in neural networks, and obtain significantly improved image recognition performance. Next, we analyze the statistical properties of max pooling that underlie its better performance, through a simple theoretical model of feature activation. We then present results of experiments that confirm many predictions of the model. Beyond the pooling operator itself, an important parameter is the set of pools over which the summary statistic is computed. We propose locality in feature configuration space as a natural criterion for devising better pools. Finally, we propose ways to make coding faster and more powerful through fast convolutional feedforward architectures, and examine how to incorporate supervision into feature extraction schemes. Overall, our experiments offer insights into what makes current systems work so well, and state-of-the-art results on several image recognition benchmarks.
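The coding-then-pooling framework described above can be caricatured in a few lines. True sparse coding solves an l1-regularized reconstruction problem, so the one-step soft-thresholded projection below is a deliberately simplified stand-in, and the tiny "dictionary" of visual words is made up:

```python
# Toy coding-then-pooling pipeline: encode each local feature against a
# set of reference patterns (zeroing small responses, so the code is
# sparse), then summarize a region of codes by component-wise max.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def code(x, dictionary, threshold=0.5):
    """Soft-threshold the projections of x onto each dictionary atom."""
    out = []
    for atom in dictionary:
        z = dot(x, atom)
        out.append(z - threshold if z > threshold
                   else z + threshold if z < -threshold else 0.0)
    return out

def max_pool(codes):
    """Summarize a set of codes by their component-wise maximum."""
    return [max(col) for col in zip(*codes)]

dictionary = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]      # toy "visual words"
patch_features = [[0.9, 0.1], [0.2, 1.0], [0.0, 0.0]]  # features in one pool
codes = [code(x, dictionary) for x in patch_features]
pooled = max_pool(codes)
```

Even in this caricature, the two properties the thesis studies are visible: thresholding makes each code sparse, and max pooling keeps only the strongest activation of each visual word within the pool, discarding how many features activated it.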
Title: On populations, haplotypes and genome sequencing
Candidate: Franquin, Pierre
Advisor(s): Mishra, Bud
Abstract:
Population genetics has seen a renewed interest since the completion of the human genome project. With the availability of rapidly growing volumes of genomic data, the scientific and medical communities have been optimistic that a better understanding of human diseases, as well as their treatment, was imminent. Many population genomic models and association studies have been designed (or redesigned) to address these problems. For instance, genome-wide association studies (GWAS) raised hopes for finding disease markers, personalized medicine and rational drug design. Yet, as of today, they have not yielded results that live up to their promise, leading to considerable disappointment.
Intrigued, but not deterred, by these challenges, this dissertation visits the different aspects of these problems. In the first part, we will review the different models and theories of population genetics that are now challenged. We will propose our own implementation of a model to test different hypotheses. This effort will hopefully help us in understanding whether our expectations were unreasonably high or if we had ignored a crucial piece of information. When discussing association studies, we must not forget that we rely on data produced by the sequencing technologies available so far. We have to ensure that the quality of this data is reasonably good for GWAS. Unfortunately, as we will see in the second part, despite the existence of a diverse set of sequencing technologies, none of them can produce haplotypes with phasing, which appears to be the most important type of sequence data needed for association studies. To address this challenge, we propose a novel approach for a sequencing technology, called SMASH, that allows us to create the quality and type of haplotypic genome sequences necessary for efficient population genetics.
Title: Optimizing Machine Translation by Learning to Search
Candidate: Galron, Daniel
Advisor(s): Melamed, Dan
Abstract:
We present a novel approach to training discriminative tree-structured machine translation systems by learning to search. We describe three primary innovations in this work: a new parsing coordinator architecture and algorithms to synthesize the required training examples for the learning algorithm; a new semiring that provides an unbiased way to compare translations; and a new training objective that measures whether a translation inference improves the quality of a translation. We also apply the reinforcement learning concept of exploration to SMT. Finally, we empirically evaluate the effects of our innovations on the quality of translations output by our system.
Title: Flexible-Cost SLAM
Candidate: Grimes, Matthew
Advisor(s): LeCun, Yann
Abstract:
The ability of a robot to track its position and its surroundings is critical in mobile robotics applications, such as autonomous transport, farming, search-and-rescue, and planetary exploration.
As a foundational building block to such tasks, localization must remain reliable and unobtrusive. For example, it must not provide an unneeded level of precision, when the cost of doing so displaces higher-level tasks from a busy CPU. Nor should it produce noisy estimates on the cheap, when there are CPU cycles to spare.
This thesis explores localization solutions that provide exactly the amount of accuracy needed to a given task. We begin with a real-world system used in the DARPA Learning Applied to Ground Robotics (LAGR) competition. Using a novel hybrid of wheel and visual odometry, we cut the cost of visual odometry from 100% of a CPU to 5%, clearing room for other critical visual processes, such as long-range terrain classification. We present our hybrid odometer in chapter 2.
Next, we describe a novel SLAM algorithm that provides a means to choose the desired balance between cost and accuracy. At its fastest setting, our algorithm converges faster than previous stochastic SLAM solvers, while maintaining significantly better accuracy. At its most accurate, it provides the same solution as exact SLAM solvers. Its main feature, however, is the ability to flexibly choose any point between these two extremes of speed and precision, as circumstances demand. As a result, we are able to guarantee real-time performance at each timestep on city-scale maps with large loops. We present this solver in chapter 3, along with results from both commonly available datasets and Google Street View data.
Taken as a whole, this thesis recognizes that precision and efficiency can be competing values, whose proper balance depends on the application and its fluctuating circumstances. It demonstrates how a localizer can and should fit its cost to the task at hand, rather than the other way around. In enabling this flexibility, we demonstrate a new direction for SLAM research, as well as provide a new convenience for end-users, who may wish to map the world without stopping it.
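The cost/accuracy dial described above can be illustrated on a toy one-dimensional pose graph: poses connected by noisy relative-motion constraints plus a single loop closure, relaxed by gradient descent, where the iteration budget plays the role of the flexible cost knob. The graph, step size, and measurements below are invented for illustration and bear no relation to the thesis's actual solver.

```python
# Toy 1-D pose-graph relaxation: minimize the squared residuals of
# relative-position constraints, spending only as many iterations as
# the application can afford.
def cost(x, constraints):
    return sum(((x[j] - x[i]) - d) ** 2 for i, j, d in constraints)

def relax(num_poses, constraints, iters, step=0.2):
    """constraints: list of (i, j, measured offset x_j - x_i)."""
    x = [0.0] * num_poses          # initial guess: everything at the origin
    for _ in range(iters):
        grad = [0.0] * num_poses
        for i, j, d in constraints:
            err = (x[j] - x[i]) - d  # residual of this constraint
            grad[j] += err
            grad[i] -= err
        grad[0] = 0.0                # anchor the first pose
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x

# Odometry says each step moves +1.0; a loop closure says pose 3 sits
# 2.7 from pose 0, so the chain must be slightly compressed to agree.
constraints = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 2.7)]
cheap = relax(4, constraints, iters=5)       # fast, rough estimate
accurate = relax(4, constraints, iters=500)  # slower, near the optimum
```

The same map yields a rough answer cheaply or a near-exact one for more compute, which is the contract a flexible-cost localizer offers the tasks sharing its CPU.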
Title: SMT Beyond DPLL(T): A New Approach to Theory Solvers and Theory Combination
Candidate: Jovanovic, Dejan
Advisor(s): Barrett, Clark
Abstract:
Satisfiability modulo theories (SMT) is the problem of deciding whether a given logical formula can be satisfied with respect to a combination of background theories. The past few decades have seen many significant developments in the field, including fast Boolean satisfiability (SAT) solvers, efficient decision procedures for a growing number of expressive theories, and frameworks for modular combination of decision procedures. All these improvements, together with the arrival of robust SMT solver implementations, culminated in the acceptance of SMT as a standard tool in the fields of automated reasoning and computer-aided verification. In this thesis we develop new decision procedures for the theory of linear integer arithmetic and the theory of non-linear real arithmetic, and develop a new general framework for combination of decision procedures. The new decision procedures integrate theory-specific reasoning and the Boolean search to provide more powerful and efficient procedures, and allow a more expressive language for explaining problematic states. The new framework for combination of decision procedures overcomes the complexity limitations and restrictions on the theories imposed by the standard Nelson-Oppen approach.
Title: An Adaptive Fast Multipole Method-Based PDE Solver in Three Dimensions
Candidate: Langston, Matthew Harper
Advisor(s): Zorin, Denis
Abstract:
Many problems in scientific computing require the accurate and fast solution to a variety of elliptic PDEs. These problems become increasingly difficult in three dimensions when forces become non-homogeneously distributed and geometries are complex.
We present an adaptive fast volume solver using a new version of the fast multipole method, incorporated with a pre-existing boundary integral formulation for the development of an adaptive embedded boundary solver.
For the fast volume solver portion of the algorithm, we present a kernel-independent, adaptive fast multipole method of arbitrary order accuracy for solving elliptic PDEs in three dimensions with radiation boundary conditions. The algorithm requires only a Green's function evaluation routine for the governing equation and a representation of the source distribution (the right-hand side) that can be evaluated at arbitrary points.
The performance of the method is accelerated in two ways. First, we construct a piecewise polynomial approximation of the right-hand side and compute far-field expansions in the FMM from the coefficients of this approximation. Second, we precompute tables of quadratures to handle the near-field interactions on adaptive octree data structures, keeping the total storage requirements in check through the exploitation of symmetries. We additionally show how we extend the free-space volume solver to solvers with periodic as well as Dirichlet boundary conditions.
For incorporation with the boundary integral solver, we develop interpolation methods to maintain the accuracy of the volume solver. These methods use the existing FMM-based octree structure to locate appropriate interpolation points, building polynomial approximations to this larger set of forces and evaluating these polynomials on the locally under-refined grid in the area of interest.
We present numerical examples for the Laplace, modified Helmholtz and Stokes equations for a variety of boundary conditions and geometries as well as studies of the interpolation procedures and stability of far-field and polynomial constructions.
Title: Acquiring information from wider scope to improve event extraction
Candidate: Liao, Shasha
Advisor(s): Grishman, Ralph
Abstract:
Event extraction is a particularly challenging type of information extraction (IE). Most current event extraction systems rely on local information at the phrase or sentence level. However, this local context may be insufficient to resolve ambiguities in identifying particular types of events; information from a wider scope can serve to resolve some of these ambiguities.
In this thesis, we first investigate how to extract supervised and unsupervised features to improve a supervised baseline system. Then, we present two additional tasks to show the benefit of wider scope features in semi-supervised learning (self-training) and active learning (co-testing). Experiments show that using features from wider scope can not only aid a supervised local event extraction baseline system, but also help the semi-supervised or active learning approach.
Title: Mobile Accessibility Tools for the Visually Impaired
Candidate: Paisios, Nektarios
Advisor(s): Subramanian, Lakshminarayanan
Abstract:
Visually impaired users are in dire need of better accessibility tools. The past few years have witnessed an exponential growth in the computing capabilities and onboard sensing capabilities of mobile phones, making them an ideal candidate for building next-generation applications. We believe that the mobile device can play a significant role in the future for aiding visually impaired users in day-to-day activities with simple and usable mobile accessibility tools. This thesis describes the design, implementation, evaluation and user-study based analysis of four different mobile accessibility applications.
Our first system is the design of a highly accurate and usable mobile navigational guide that uses Wi-Fi and accelerometer sensors to navigate unfamiliar environments. A visually impaired user can use the system to construct a virtual topological map across points of interest within a building based on correlating the user's walking patterns (with turn signals) with the Wi-Fi and accelerometer readings. The user can subsequently use the map to navigate previously traveled routes. Our second system, Mobile Brailler, presents several prototype methods of text entry on a modern touch screen mobile phone that are based on the Braille alphabet and thus are convenient for visually impaired users. Our third system enables visually impaired users to leverage the camera of a mobile device to accurately recognize currency bills even if the images are partially or highly distorted. The final system enables visually impaired users to determine whether a pair of clothes, in this case of a tie and a shirt, can be worn together or not, based on the current social norms of color-matching.
We believe that these applications together, provide a suite of important mobile accessibility tools to enhance four critical aspects of a day-to-day routine of a visually impaired user: to navigate easily, to type easily, to recognize currency bills (for payments) and to identify matching clothes.
Title: Reusable Software Infrastructure for Stream Processing
Candidate: Soule, Robert
Advisor(s): Grimm, Robert
Abstract:
Developers increasingly use streaming languages to write their data processing applications. While a variety of streaming languages exist, each targeting a particular application domain, they are all similar in that they represent a program as a graph of streams (i.e. sequences of data items) and operators (i.e. data transformers). They are also similar in that they must process large volumes of data with high throughput. To meet this requirement, compilers of streaming languages must provide a variety of streaming-specific optimizations, including automatic parallelization. Traditionally, when many languages share a set of optimizations, language implementors translate the source languages into a common representation called an intermediate language (IL). Because optimizations can modify the IL directly, they can be re-used by all of the source languages, reducing the overall engineering effort. However, traditional ILs and their associated optimizations target single-machine, single-process programs. In contrast, the kinds of optimizations that compilers must perform in the streaming domain are quite different, and often involve reasoning across multiple machines. Consequently, existing ILs are not suited to streaming languages.
This thesis addresses the problem of how to provide a reusable infrastructure for stream processing languages. Central to the approach is the design of an intermediate language specifically for streaming languages and optimizations. The hypothesis is that an intermediate language designed to meet the requirements of stream processing can assure implementation correctness; reduce overall implementation effort; and serve as a common substrate for critical optimizations. In evidence, this thesis provides the following contributions: (1) a catalog of common streaming optimizations that helps define the requirements of a streaming IL; (2) a calculus that enables reasoning about the correctness of source language translation and streaming optimizations; and (3) an intermediate language that preserves the semantics of the calculus, while addressing the implementation issues omitted from the calculus. This work significantly reduces the effort it takes to develop stream processing languages, and jump-starts innovation in language and optimization design.
Title: Building scalable geo-replicated storage backends for web applications
Candidate: Sovran, Yair
Advisor(s): Li, Jinyang
Abstract:
Web applications increasingly require a storage system that is both scalable and can replicate data across many distant data centers or sites. Most existing storage solutions fall into one of two categories: Traditional databases offer strict consistency guarantees and programming ease, but are difficult to scale in a geo-replicated setting. NoSQL stores are scalable and efficient, but have weak consistency guarantees, placing the burden of ensuring consistency on programmers. In this dissertation, we describe two systems that help bridge the two extremes, providing scalable, geo-replicated storage for web applications while also being easy to program for. Walter is a key-value store that supports transactions and replicates data across distant sites. A key feature underlying Walter is a new isolation property: Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI does not allow write-write conflicts, alleviating the burden of writing conflict resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. Lynx is a distributed database backend for scaling latency-sensitive web applications. Lynx supports optimizing queries via data denormalization, distributed secondary indexes, and materialized join views. To preserve data constraints across denormalized tables and secondary indexes, Lynx relies on a novel primitive: Distributed Transaction Chain (DTC). A DTC groups a sequence of transactions to be executed on different nodes while providing two guarantees. First, all transactions in a DTC execute exactly once despite failures. Second, transactions from concurrent DTCs are interleaved consistently on common nodes. We built several web applications on top of Walter and Lynx: an auction service, a microblogging service, and a social networking website.
We have found that building web applications using Walter and Lynx is quick and easy. Our experiments show that the resulting applications are capable of providing scalable, low latency operation across multiple geo-replicated sites.
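The counting-set idea mentioned above can be sketched as a tiny replicated set in which each element carries a count, additions and removals commute, and membership means a positive count. This is an illustrative toy, not Walter's implementation (which couples counting sets with preferred sites and PSI):

```python
# Minimal counting-set sketch: because increments and decrements
# commute, two sites that apply the same updates in different orders
# still converge to the same state, with no conflict-resolution logic.
class CountingSet:
    def __init__(self):
        self.counts = {}

    def add(self, elem):
        self.counts[elem] = self.counts.get(elem, 0) + 1

    def remove(self, elem):
        self.counts[elem] = self.counts.get(elem, 0) - 1

    def __contains__(self, elem):
        return self.counts.get(elem, 0) > 0

    def merge(self, ops):
        """Apply a batch of (op, elem) updates received from another site."""
        for op, elem in ops:
            (self.add if op == "add" else self.remove)(elem)

site_a = CountingSet()
site_b = CountingSet()
site_a.add("photo1")                  # local add at site A
site_b.remove("photo1")               # concurrent remove at site B
site_a.merge([("remove", "photo1")])  # sites exchange their update logs
site_b.merge([("add", "photo1")])     # ...in opposite orders
```

After the exchange both sites agree on the element's count, which is why such sets sidestep write-write conflicts under asynchronous replication.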
Title: Rapid Training of Information Extraction with Local and Global Data Views
Candidate: Sun, Ang
Advisor(s): Grishman, Ralph
Abstract:
This dissertation focuses on fast system development for Information Extraction (IE). State-of-the-art systems heavily rely on extensively annotated corpora, which are slow to build for a new domain or task. Moreover, previous systems are mostly built with local evidence such as words in a short context window or features that are extracted at the sentence level. They usually generalize poorly on new domains.
This dissertation presents novel approaches for rapidly training an IE system for a new domain or task based on both local and global evidence. Specifically, we present three systems: a relation type extension system based on active learning, a relation type extension system based on semi-supervised learning, and a cross-domain bootstrapping system for domain adaptive named entity extraction.
The active learning procedure adopts features extracted at the sentence level as the local view and distributional similarities between relational phrases as the global view. It builds two classifiers based on these two views to find the most informative contention data points to request human labels so as to reduce annotation cost.
The semi-supervised system aims to learn a large set of accurate patterns for extracting relations between names from only a few seed patterns. It estimates the confidence of a name pair both locally and globally: locally by looking at the patterns that connect the pair in isolation; globally by incorporating the evidence from the clusters of patterns that connect the pair. The use of pattern clusters can prevent semantic drift and contribute to a natural stopping criterion for semi-supervised relation pattern discovery.
For adapting a named entity recognition system to a new domain, we propose a cross-domain bootstrapping algorithm, which iteratively learns a model for the new domain with labeled data from the original domain and unlabeled data from the new domain. We first use word clusters as global evidence to generalize features that are extracted from a local context window. We then select self-learned instances as additional training examples using multiple criteria, including some based on global evidence.
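The self-training loop described above (minus the word-cluster features and multi-criteria selection) can be caricatured with a toy nearest-centroid learner: train on source-domain labels, then repeatedly promote the target-domain instances predicted with the largest margin into the training set. All data points, labels, and the classifier itself are invented for illustration.

```python
# Schematic self-training: each round, fit class centroids to the
# current labeled set, score unlabeled target points by their
# nearest-vs-second-nearest distance margin, and promote the most
# confident predictions as new training examples.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def self_train(labeled, unlabeled, rounds=3, per_round=1):
    labeled = dict(labeled)  # point -> label
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        cents = {lab: centroid([p for p, l in labeled.items() if l == lab])
                 for lab in set(labeled.values())}
        def conf(p):
            ds = sorted((sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, lab)
                        for lab, c in cents.items())
            return ds[1][0] - ds[0][0], ds[0][1]  # (margin, predicted label)
        pool.sort(key=lambda p: conf(p)[0], reverse=True)
        for p in pool[:per_round]:        # promote the most confident instances
            labeled[p] = conf(p)[1]
        pool = pool[per_round:]
    return labeled

source = {(0.0, 0.0): "PER", (0.2, 0.1): "PER",
          (3.0, 3.0): "ORG", (3.1, 2.9): "ORG"}   # labeled source domain
target = [(0.3, 0.2), (2.8, 3.2), (1.5, 1.5)]     # unlabeled target domain
final = self_train(source, target)
```

The promotion criterion is the fragile part: promoting a confidently wrong instance reinforces the error, which is why the thesis selects self-learned instances with multiple criteria rather than margin alone.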
Title: Combating Sybil attacks in cooperative systems
Candidate: Tran, Nguyen
Advisor(s): Li, Jinyang
Abstract:
Cooperative systems are ubiquitous nowadays. In a cooperative system, end users contribute resources to run the service instead of only receiving the service passively from the system. For example, users upload and comment on pictures and videos on Flickr and YouTube, and users submit and vote on news articles on Digg. As another example, users in BitTorrent contribute bandwidth and storage to help each other download content. As long as users behave as expected, these systems benefit immensely from user contribution. In fact, five of the ten most popular websites operate in this cooperative fashion (Facebook, YouTube, Blogger, Twitter, Wikipedia), and BitTorrent dominates global Internet traffic.
A robust cooperative system cannot blindly trust that its users will truthfully participate in the system. Malicious users seek to exploit the systems for profit, and selfish users consume but avoid contributing resources. For example, adversaries have manipulated the voting system of Digg to promote their articles of dubious quality, and selfish users in public BitTorrent communities leave the system to avoid uploading files to others, resulting in drastic performance degradation for these content distribution systems. The ultimate way to disrupt the security and incentive mechanisms of cooperative systems is the Sybil attack, in which the adversary creates many Sybil identities (fake identities) and uses them to disrupt the systems' normal operation. No security or incentive mechanism works correctly if the system does not have robust identity management that can defend against Sybil attacks.
This thesis provides robust identity management schemes which are resilient to the Sybil attack, and uses them to secure and incentivize user contribution in several example cooperative systems. The main theme of this work is to leverage the social network among users in designing secure and incentive-compatible cooperative systems. First, we develop a distributed admission control protocol, called Gatekeeper, that leverages the social network to admit most honest user identities and only a few Sybil identities into the system. Gatekeeper can be used as a robust identity management layer for both centralized and decentralized cooperative systems. Second, we provide a vote aggregation system for content voting systems, called SumUp, that can prevent an adversary from casting many bogus votes for a piece of content using the Sybil attack. SumUp leverages unique properties of content voting systems to provide significantly better Sybil defense compared with applying a general admission control protocol such as Gatekeeper. Finally, we provide a robust reputation system, called Credo, that can be used to incentivize bandwidth contribution in peer-to-peer content distribution networks. Credo reputation can capture user contribution, and is resilient to both Sybil and collusion attacks.
Title: Multi-species biclustering: An integrative method to identify functional gene conservation between multiple species
Candidate: Waltman, Peter
Advisor(s): Bonneau, Richard
Abstract:
Background: Several recent comparative functional genomics projects have indicated that the co-regulation of many genes is conserved across species, at least in part. This suggests that comparative analysis of functional genomics data-sets could prove powerful in identifying co-regulated groups that are conserved across multiple species.
Results: We present recent work to extend our cMonkey algorithm to simultaneously bicluster heterogeneous data from multiple species to identify conserved modules of orthologous genes, which can yield evolutionary insights into the formation of regulatory modules. We also present results from applying the multi-species analysis to two triplets of bacteria. The first of these is a triplet of Gram-positive bacteria consisting of Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes, while the second is a triplet of Gram-negative bacteria that includes Escherichia coli, Salmonella typhimurium and Vibrio cholerae. Finally, we present initial results from the multi-species biclustering analysis of human and mouse hematopoietic differentiation data.
Conclusion: Analysis of the biclusters obtained revealed a surprising number of gene groups with conserved modularity and high biological significance, as judged by several measures of cluster quality. We also highlight cases of interest from the Gram-positive triplet, including one that suggests a temporal difference in the expression of genes governing sporulation in the two Bacillus species. While analysis of the mouse and human hematopoietic differentiation is preliminary, it indicates the applicability of this analysis to eukaryotic systems, including comparison of cancer model systems. Finally, we suggest ways in which this analysis could be extended to identify divergent modules that may exist between normal and disease tissue.
Title: Collusion Preserving Computation
Candidate: Alwen, Joel
Advisor(s): Dodis, Yevgeniy
Abstract:
In collusion-free protocols, subliminal communication is impossible and parties are thus unable to communicate any information beyond what the protocol allows. Collusion-free protocols are interesting for several reasons, but have specifically attracted attention because they can be used to reduce trust in game-theoretic mechanisms. Collusion-free protocols are impossible to achieve (in general) when all parties are connected by point-to-point channels, but exist under certain physical assumptions (Lepinski et al., STOC 2005) or in specific network topologies (Alwen et al., Crypto 2008).
In this thesis, we put forward a definition of collusion-preserving computation. In addition to proposing the definition, we explore necessary properties of the underlying communication resource. Next we provide a general feasibility result for collusion-preserving computation of arbitrary functionalities. We show that the resulting protocols enjoy an elegant (and surprisingly strong) fallback security even in the case when the underlying communication resource acts in a Byzantine manner. Finally, we investigate the implications of these results in the context of mechanism design.
Title: Re-architecting Web and Mobile Information Access for Emerging Regions
Candidate: Chen, Jay
Advisor(s): Subramanian, Lakshminarayanan
Abstract:
Providing access to information for people in emerging regions is an important problem. Over the past decade there have been many proposed and increasingly numerous deployed systems to enable information access, but successes are few and modest at best. Internet in emerging regions is still generally unusable or intolerably slow. Mobile phone applications are either not designed for the phones that poor people own, or they lack functionality, are difficult to use, or are expensive to operate. In this work we focus on enabling digital information access for people in emerging regions.
To advance the state of the art, we contribute numerous observations about how people access information in emerging regions, why the current models for web access and SMS platforms are broken, and techniques to enable applications over constrained Internet or SMS. The mechanisms presented here were designed after extensive field work in several different regions including rural, peri-urban, and urban areas in India, Kenya, Ghana, and Mexico. Multiple user studies were conducted throughout the course of system design and prototyping. We present a novel set of context-appropriate platforms and tools, some spanning several layers of the networking stack. Five complete systems were implemented and deployed in the field. First, Event Logger for Firefox (ELF) is an easily deployable Firefox extension which functions as both a web browsing analysis tool and an in-browser web optimization platform. Second, RuralCafe provides a platform for web search and browsing over extremely slow or intermittent networks. Third, Contextual Information Portals (CIP) provide cached repositories of web pages tailored to the particular context in which they are to be used. Fourth, UjU is a mobile application platform that simplifies the design of new SMS-based mobile applications. Finally, SMSFind is an SMS-based search service that runs on mobile phones without setup or subscription to a data plan.
Taken as a whole, the systems here are a comprehensive solution for addressing the problem of enabling digital information access in emerging regions.
Title: Automatic Deduction for Theories of Algebraic Data Types
Candidate: Chikanian, Igor
Advisor(s): Barrett, Clark
Abstract:
In this thesis we present formal logical systems concerned with reasoning about algebraic data types.
The first formal system is a quantifier-free calculus (formulas are outermost universally quantified). This calculus is comprised of state change rules, and computations are performed by successive applications of these rules. Thereby, our calculus gives rise to an abstract decision procedure, which determines whether a given formula involving algebraic type members is valid. It is shown that this calculus is sound and complete. We also examine how this system performs in practice and give experimental results. Our main contribution, as compared to previous work on this subject, is a new and more efficient decision procedure for checking satisfiability of the universal fragment of the theory of algebraic data types.
The second formal system, called Term Builder, is a deductive system based on higher-order type theory, which subsumes second-order and higher-order logics. The main purpose of this calculus is to formulate and prove theorems about algebraic or other arbitrary user-defined types. Term Builder supports proof objects and is both an interactive theorem prover and a verifier. We describe the built-in deductive capabilities of Term Builder and show its consistency. The logic represented by our prover is intuitionistic. Naturally, it is also incomplete and undecidable, but its expressive power is much higher than that of the first formal system.
Among our achievements in building this theorem prover is an elegant and intuitive GUI for building proofs. A new feature from the foundational viewpoint is that, in contrast with other approaches, we have the uniqueness-of-types property, which holds exactly rather than only modulo beta-conversion.
Title: Efficient Cryptographic Primitives for Non-Interactive Zero-Knowledge Proofs and Applications
Candidate: Haralambiev, Kristiyan
Advisor(s): Shoup, Victor
Abstract:
Non-interactive zero-knowledge (NIZK) proofs have enjoyed much interest in cryptography since they were introduced more than twenty years ago by Blum et al. [BFM88]. While quite useful when designing modular cryptographic schemes, until recently NIZK proofs could be realized efficiently only using certain heuristics. However, such heuristic schemes have been widely criticized. In this work we focus on designing schemes which avoid them. In [GS08], Groth and Sahai presented the first efficient (and currently the only) NIZK proof system in the standard model. The construction is based on bilinear maps and is limited to languages defined by certain satisfiable systems of equations. Given this expressibility limitation, we are interested in cryptographic primitives that are "compatible" with it. Equipped with such primitives and the Groth-Sahai proof system, we show how to construct cryptographic schemes efficiently in a modular fashion.
In this work, we describe properties required by any cryptographic scheme to mesh well with Groth-Sahai proofs. Towards this, we introduce the notion of "structure-preserving" cryptographic scheme. We present the first constant-size structure-preserving signature scheme for messages consisting of general bilinear group elements. This allows us (for the first time) to instantiate efficiently a modular construction of round-optimal blind signature based on the framework of Fischlin [Fis06].
Our structure-preserving homomorphic trapdoor commitment schemes yield efficient leakage-resilient signatures (in the bounded leakage model) which satisfy the standard security requirements and additionally tolerate any amount of leakage; all previous works satisfied at most two of those three properties.
Next, we build a structure-preserving encryption scheme which satisfies the standard CCA security requirements. While somewhat similar to the notion of verifiable encryption, it provides better properties and yields the first efficient two-party protocol for joint ciphertext computation. Note that the efficient realization of such a protocol was not previously possible even using the heuristics mentioned above.
Lastly, in this line of work, we revisit the notion of simulation extractability and define "true-simulation extractable" NIZK proofs. Although quite similar to the notion of simulation-sound extractable NIZK proofs, there is a subtle but rather important difference which makes it weaker and easier to instantiate efficiently. As it turns out, in many scenarios, this new notion is sufficient, and using it, we can construct efficient leakage resilient signatures and CCA encryption scheme.
Title: Learning Feature Hierarchies for Object Recognition
Candidate: Kavukcuoglu, Koray
Advisor(s): LeCun, Yann
Abstract:
In this thesis we study unsupervised learning algorithms for training feature extractors and building deep learning models. We propose sparse-modeling algorithms as the foundation for unsupervised feature extraction systems. To reduce the cost of the inference process required to obtain the optimal sparse code, we model a feed-forward function that is trained to predict this optimal sparse code. Using an efficient predictor function enables the use of sparse coding in hierarchical models for object recognition. We demonstrate the performance of the developed system on several recognition tasks, including object recognition, handwritten digit classification and pedestrian detection. Robustness to noise or small variations in the input is a very desirable property for a feature extraction algorithm. In order to train locally-invariant feature extractors in an unsupervised manner, we use group sparsity criteria that promote similarity between the dictionary elements within a group. This model produces locally-invariant representations under small perturbations of the input, thus improving the robustness of the features. Many sparse modeling algorithms are trained on small image patches that are the same size as the dictionary elements. This forces the system to learn multiple shifted versions of each dictionary element. However, when used convolutionally over large images to extract features, these models produce very redundant representations. To avoid this problem, we propose convolutional sparse coding algorithms that yield a richer set of dictionary elements, reduce the redundancy of the representation and improve recognition performance.
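The inference cost mentioned above can be made concrete with a minimal ISTA-style sparse coding loop. This is a generic toy sketch (a two-atom identity dictionary with hand-picked step size and sparsity weight), not the thesis's architecture; the feed-forward predictor described in the abstract would be trained to output codes like `z` directly, skipping this iterative loop at test time.

```python
# Minimal ISTA sketch: find a sparse code z minimizing
#   ||x - D z||^2 + lam * ||z||_1
# for a tiny 2-atom dictionary.

def soft_threshold(v, t):
    # Proximal operator of the L1 norm.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista(x, D, lam=0.1, step=0.1, iters=500):
    m = len(D[0])            # number of dictionary atoms
    z = [0.0] * m
    for _ in range(iters):
        # residual r = D z - x
        r = [sum(D[i][j] * z[j] for j in range(m)) - x[i]
             for i in range(len(x))]
        # gradient of the quadratic term: D^T r
        g = [sum(D[i][j] * r[i] for i in range(len(x))) for j in range(m)]
        z = [soft_threshold(z[j] - step * g[j], step * lam)
             for j in range(m)]
    return z

# x is (almost) the first atom, so the code should load on atom 0 only.
D = [[1.0, 0.0],
     [0.0, 1.0]]
x = [1.0, 0.05]
z = ista(x, D)
print(z)  # roughly [0.9, 0.0]: only the first atom is used
```

The expense is visible even here: hundreds of matrix-vector iterations per input, which is exactly what a trained predictor function amortizes away.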
Title: Topics in Formal Synthesis and Modeling
Candidate: Klein, Uri
Advisor(s): Pnueli, Amir; Zuck, Lenore
Abstract:
The work presented focuses on two problems: synthesizing systems from formal specifications, and formalizing REST, a popular development pattern for web applications.
For the synthesis problem, we distinguish between the synchronous and the asynchronous case. For the former, we solve a problem concerning a fundamental flaw in specification construction in previous work. We continue with exploring effective synthesis of asynchronous systems (programs on multi-threaded systems). Two alternative models of asynchrony are presented, and shown to be equally expressive for the purpose of synthesis.
REST is a software architectural style used for the design of highly scalable web applications. Interest in REST has grown rapidly over the past decade. However, there is also considerable confusion surrounding REST: many examples of supposedly RESTful APIs violate key REST constraints. We show that the constraints of REST and of RESTful HTTP can be precisely formulated within temporal logic. This leads to methods for model checking and run-time verification of RESTful behavior. We formulate several relevant verification questions and analyze their complexity.
Title: Adaptive Isotopic Approximation of Nonsingular Curves and Surfaces
Candidate: Lin, Long
Advisor(s): Yap, Chee
Abstract:
Consider the problem of computing isotopic approximations of nonsingular curves and surfaces that are implicitly represented by equations of the form f(X, Y) = 0 and f(X, Y, Z) = 0. This fundamental problem has seen much progress along several fronts, but we will focus on domain subdivision algorithms. Two algorithms in this area are from Snyder (1992) and Plantinga and Vegter (2004). We introduce a family of new algorithms that combines the advantages of these two algorithms: like Snyder, we use the parameterizability criterion for subdivision, and like Plantinga and Vegter, we exploit nonlocal isotopy.
We first apply our approach to curves, resulting in a more efficient algorithm. We then extend our approach to surfaces. The extension is by no means routine, as the correctness arguments and case analysis are more subtle. Also, a new phenomenon arises in which local rules for constructing surfaces are no longer sufficient.
We further extend our algorithms in two important and practical directions. First, we allow subdivision cells to be non-squares or non-cubes, with arbitrary but bounded aspect ratios: in 2D, we allow boxes to be split into 2 or 4 children; and in 3D, into 2, 4 or 8 children. Second, we allow the input region of interest (ROI) to have arbitrary geometry represented by a quadtree or octree, as long as the curves or surfaces have no singularities in the ROI and intersect the boundary of the ROI transversally.
Our algorithm is numerical because our primitives are based on interval arithmetic and exact BigFloat numbers. It is practical, easy to implement exactly (compared to algebraic approaches) and does not suffer from implementation gaps (compared to geometric approaches). We report some very encouraging experimental results, showing that our algorithms can be much more efficient than the algorithms of Plantinga and Vegter (2D and 3D) and Snyder (2D only).
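The interval-arithmetic primitives mentioned above can be illustrated with a toy exclusion test: if the interval evaluation of f over a box does not contain zero, the curve f = 0 cannot pass through that box, so the box can be discarded without further subdivision. This is a minimal sketch (plain floats rather than BigFloats, and only the exclusion predicate, not the parameterizability test):

```python
# Interval-arithmetic exclusion test for a subdivision algorithm.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, o):
        return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        ps = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(ps), max(ps))
    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi

def f(x, y):
    # f(X, Y) = X^2 + Y^2 - 1  (the unit circle)
    one = Interval(1.0, 1.0)
    return x * x + y * y - one

# Box [2, 3] x [2, 3] is far from the circle: safely excluded.
print(f(Interval(2, 3), Interval(2, 3)).contains_zero())  # False
# Box [0, 1] x [0, 1] straddles the circle: must be kept/subdivided.
print(f(Interval(0, 1), Interval(0, 1)).contains_zero())  # True
```

Because interval evaluation over-approximates the true range, a `True` answer only means the box cannot be rejected yet; that conservatism is what makes the subdivision sound.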
Title: Real-Space Localization Methods for Minimizing the Kohn-Sham Energy
Candidate: Millstone, Marc
Advisor(s): Overton, Michael
Abstract:
The combination of ever-increasing computational power and new mathematical models has fundamentally changed the field of computational chemistry. One example of this is the use of new algorithms for computing the charge density of a molecular system, from which one can predict many physical properties of the system.
This thesis presents two new algorithms for minimizing the Kohn-Sham energy, which is used to describe a system of non-interacting electrons through a set of single-particle wavefunctions. By exploiting a known localization region of the wavefunctions, each algorithm evaluates the Kohn-Sham energy function and gradient at a set of iterates that have a special sparsity structure. We have chosen to represent the problem in real-space using finite-differences, allowing us to efficiently evaluate the energy function and gradient using sparse linear algebra. Detailed numerical experiments are provided on a set of representative molecules demonstrating the performance and robustness of these methods.
Title: Scoring-and-Unfolding Trimmed Tree Assembler: Algorithms for Assembling Genome Sequences Accurately and Efficiently
Candidate: Narzisi, Giuseppe
Advisor(s): Mishra, Bud
Abstract:
The recent advances in DNA sequencing technology and their many potential applications to Biology and Medicine have rekindled enormous interest in several classical algorithmic problems at the core of Genomics and Computational Biology: primarily, the whole-genome sequence assembly problem (WGSA). Two decades back, in the context of the Human Genome Project, the problem had received unprecedented scientific prominence: its computational complexity and intractability were thought to have been well understood; various competitive heuristics, thoroughly explored; and the necessary software, properly implemented and validated. However, several recent studies, focusing on the experimental validation of de novo assemblies, have highlighted a number of limitations of current assemblers.
Intrigued by these negative results, this dissertation reinvestigates the algorithmic techniques required to correctly and efficiently assemble genomes. Owing to its connection to a well-known NP-complete combinatorial optimization problem, WGSA has historically been assumed to be amenable only to greedy and heuristic methods. By placing efficiency as their priority, these methods opted to rely on local searches, and are thus inherently approximate, ambiguous or error-prone. This dissertation presents a novel sequence assembler, SUTTA, that dispenses with the idea of limiting the solutions to just the approximate ones, and instead favors an approach that could potentially lead to an exhaustive (exponential-time) search of all possible layouts, but tames the complexity through constrained search (Branch-and-Bound) and quick identification and pruning of implausible solutions.
Complementary to this problem is the task of validating the generated assemblies. Unfortunately, no commonly accepted method exists yet, and the widely used metrics for comparing assembled sequences emphasize only size, poorly capturing quality and accuracy. This dissertation also addresses these concerns by developing a more comprehensive metric, the Feature-Response Curve, that, using ideas from the classical ROC (receiver operating characteristic) curve, more faithfully captures the trade-off between contiguity and quality.
Finally, this dissertation demonstrates the advantages of a complete pipeline integrating base-calling (TotalReCaller) with assembly (SUTTA) in a Bayesian manner.
Title: Cryptographic Resilience to Continual Information Leakage
Candidate: Wichs, Daniel
Advisor(s): Dodis, Yevgeniy
Abstract:
We study the question of achieving cryptographic security on devices that leak information about their internal secret state to an external attacker. This study is motivated by the prevalence of side-channel attacks, where the physical characteristics of a computation (e.g. timing, power-consumption, temperature, radiation, acoustics, etc.) can be measured, and may reveal useful information about the internal state of a device. Since some such leakage is inevitably present in almost any physical implementation, we believe that this problem cannot be addressed by physical countermeasures alone. Instead, it should already be taken into account when designing the mathematical specification of cryptographic primitives and included in the formal study of their security.
In this thesis, we propose a new formal framework for modeling the leakage available to an attacker. This framework, called the continual leakage model, assumes that an attacker can continually learn arbitrary information about the internal secret state of a cryptographic scheme at any point in time, subject only to the constraint that the rate of leakage is bounded. More precisely, our model assumes some abstract notion of time periods. In each such period, the attacker can choose to learn arbitrary functions of the current secret state of the scheme, as long as the number of output bits leaked is not too large. In our solutions, cryptographic schemes will continually update their internal secret state at the end of each time period. This will ensure that leakage observed in different time periods cannot be meaningfully combined to break the security of the cryptosystem. Although these updates modify the secret state of the cryptosystem, the desired functionality of the scheme is preserved, and the users can remain oblivious to these updates. We construct signatures, encryption, and secret sharing/storage schemes in this model.
Title: Surface Representation of Particle Based Fluids
Candidate: Yu, Jihun
Advisor(s): Yap, Chee
Abstract:
In this thesis, we focus on surface representation for particle-based fluid simulators such as Smoothed Particle Hydrodynamics (SPH). We first present a new surface reconstruction algorithm which formulates the implicit function as a sum of anisotropic smoothing kernels. The direction of anisotropy at a particle is determined by performing Weighted Principal Component Analysis (WPCA) over the neighboring particles. In addition, we perform a smoothing step that re-positions the centers of these smoothing kernels. Since these anisotropic smoothing kernels capture the local particle distributions more accurately, our method has advantages over existing methods in representing smooth surfaces, thin streams and sharp features of fluids. This method is fast, easy to implement, and the results demonstrate a significant improvement in the quality of reconstructed surfaces as compared to existing methods. Next, we introduce the idea of using an explicit triangle mesh to track the air/liquid interface in an SPH simulator.
Once an initial surface mesh is created, this mesh is carried forward in time using nearby particle velocities to advect the mesh vertices. The mesh connectivity remains mostly unchanged across time-steps; it is only modified locally for topology change events or for the improvement of triangle quality. In order to ensure that the surface mesh does not diverge from the underlying particle simulation, we periodically project the mesh surface onto an implicit surface defined by the physics simulation. The mesh surface presents several advantages over previous SPH surface tracking techniques: A new method for surface tension calculations clearly outperforms the state of the art in SPH surface tension for computer graphics. A new method for tracking detailed surface information (like colors) is less susceptible to numerical diffusion than competing techniques. Finally, a temporally-coherent surface mesh allows us to simulate high-resolution surface wave dynamics without being limited by the particle resolution of the SPH simulation.
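The WPCA step described in the first part can be sketched in two dimensions: compute a weighted covariance of neighboring particle positions and read the kernel's stretch directions off its eigenvalues. The cubic falloff weight below is an assumed placeholder, not the paper's kernel:

```python
# Sketch of anisotropy estimation via weighted PCA over particle
# neighborhoods (2-D toy version; real SPH reconstruction works in 3-D).
import math

def weighted_pca_2d(center, neighbors, h):
    # Weighted mean and 2x2 weighted covariance of neighbor positions.
    ws, mx, my = 0.0, 0.0, 0.0
    pts = []
    for (x, y) in neighbors:
        d = math.hypot(x - center[0], y - center[1])
        if d < h:
            w = (1 - d / h) ** 3   # assumed falloff weight
            pts.append((x, y, w))
            ws += w; mx += w * x; my += w * y
    mx /= ws; my /= ws
    cxx = cxy = cyy = 0.0
    for (x, y, w) in pts:
        dx, dy = x - mx, y - my
        cxx += w * dx * dx; cxy += w * dx * dy; cyy += w * dy * dy
    cxx /= ws; cxy /= ws; cyy /= ws
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return tr / 2 + disc, tr / 2 - disc   # (major, minor) variances

# Particles spread along the x-axis (a thin stream): strong anisotropy.
nbrs = [(x * 0.1, 0.01 * (-1) ** i) for i, x in enumerate(range(-5, 6))]
major, minor = weighted_pca_2d((0.0, 0.0), nbrs, h=2.0)
print(major > 10 * minor)  # True: the kernel should stretch along x
```

A highly eccentric covariance, as here, signals a thin feature: the smoothing kernel is then elongated along the major axis instead of staying an isotropic blob.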
Title: On the Randomness Requirements for Privacy
Candidate: Bosley, Carleton
Advisor(s): Dodis, Yevgeniy
Abstract:
Most cryptographic primitives require randomness (for example, to generate secret keys). Usually, one assumes that perfect randomness is available, but, conceivably, such primitives might be built under weaker, more realistic assumptions. This is known to be achievable for many authentication applications, where entropy alone is typically sufficient. In contrast, all known techniques for achieving privacy seem to fundamentally require (nearly) perfect randomness. We ask whether this is just a coincidence or whether privacy inherently requires true randomness.
We completely resolve this question for information-theoretic private-key encryption, where parties wish to encrypt a b-bit value using a shared secret key sampled from some imperfect source of randomness S. Our technique also extends to related primitives which are sufficiently binding and hiding, including computationally secure commitments and public-key encryption.
Our main result shows that if such an n-bit source S allows for a secure encryption of b bits, where b > log n, then one can deterministically extract nearly b almost perfect random bits from S. Further, the restriction that b > log n is nearly tight: there exist sources S allowing one to perfectly encrypt (log n - log log n) bits, but not to deterministically extract even a single slightly unbiased bit.
Hence, to a large extent, true randomness is inherent for encryption: either the key length must be exponential in the message length b, or one can deterministically extract nearly b almost unbiased random bits from the key. In particular, the one-time pad scheme is essentially "universal".
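The one-time pad invoked in this conclusion is simple enough to state in a few lines: with a fresh uniform key as long as the message, XOR encryption achieves perfect secrecy (and key reuse destroys every guarantee). A minimal sketch:

```python
# One-time pad: XOR the message with a uniformly random key of at
# least the same length. Each key must be used exactly once.
import secrets

def otp_encrypt(message: bytes, key: bytes) -> bytes:
    assert len(key) >= len(message), "key must be at least message-length"
    return bytes(m ^ k for m, k in zip(message, key))

msg = b"attack at dawn"
key = secrets.token_bytes(len(msg))   # fresh uniform key
ct = otp_encrypt(msg, key)
# XOR is an involution, so decryption is the same operation.
print(otp_encrypt(ct, key) == msg)  # True
```

The key-length requirement visible in the assert is exactly the cost the abstract's result shows to be unavoidable when the key source is imperfect.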
Title: Machine Learning Approaches to Gene Duplication and Transcription Regulation
Candidate: Chen, Huang-Wen
Advisor(s): Shasha, Dennis
Abstract:
Gene duplication can lead to genetic redundancy or functional divergence, when duplicated genes evolve independently or partition the original function. In this dissertation, we employed machine learning approaches to study two different views of this problem: 1) Redundome, which explored the redundancy of gene pairs in the genome of Arabidopsis thaliana, and 2) ContactBind, which focused on functional divergence of transcription factors by mutating contact residues to change binding affinity.
In the Redundome project, we used machine learning techniques to classify gene family members into redundant and non-redundant gene pairs in Arabidopsis thaliana, where sufficient genetic and genomic data is available. We showed that Support Vector Machines were two-fold more precise than single-attribute classifiers, and performed among the best of the machine learning algorithms we evaluated. Machine learning methods predict that about half of all genes in Arabidopsis show the signature of predicted redundancy with at least one, but typically fewer than three, other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks>1), suggesting that redundancy is stable over long evolutionary periods. The genome-wide predictions were plotted alongside similarity trees based on ClustalW alignment scores, and can be accessed at http://redundome.bio.nyu.edu .
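As a rough illustration of the classification setup, here is a bare-bones linear classifier trained on the SVM hinge loss by sub-gradient updates (regularization and kernels omitted for brevity). The two features and labels below are synthetic stand-ins, not the Arabidopsis gene-pair attributes used in the project:

```python
# Toy hinge-loss linear classifier: sub-gradient descent on
#   sum_i max(0, 1 - y_i * (w . x_i))
# A full SVM would add L2 regularization and, typically, a kernel.

def train_hinge(data, lr=1.0, epochs=100):
    # Each x ends with a constant 1.0 acting as the bias feature.
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:                  # y in {-1, +1}
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:                 # hinge sub-gradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

# Two synthetic "gene pair" features plus a bias feature.
data = [([0.9, 0.8, 1.0], +1), ([0.8, 0.9, 1.0], +1), ([0.7, 0.9, 1.0], +1),
        ([0.1, 0.2, 1.0], -1), ([0.2, 0.1, 1.0], -1), ([0.3, 0.2, 1.0], -1)]
w = train_hinge(data)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
print(score([0.85, 0.85, 1.0]) > 0, score([0.15, 0.15, 1.0]) < 0)  # True True
```

On separable data like this the loop stops updating once every training point has margin at least 1, which is the max-margin intuition behind the SVMs the abstract compares against single-attribute rules.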
In the ContactBind project, we used Bayesian networks to model dependencies between contact residues in transcription factors and binding site sequences. Based on the models learned from various binding experiments, we predicted binding motifs and their locations on promoters for three families of transcription factors in three species. The predictions are publicly available at http://contactbind.bio.nyu.edu . The website also provides tools to predict binding motifs and their locations for novel protein sequences of transcription factors. Users can construct their own Bayesian networks for new families once such familial binding data are available.
Title: New Privacy-Preserving Architectures for Identity-/Attribute-based Encryption
Candidate: Chow, Sze Ming
Advisor(s): Dodis, Yevgeniy; Shoup, Victor
Abstract:
The notion of identity-based encryption (IBE) was proposed as an economical alternative to public-key infrastructures. IBE is also a useful building block in various cryptographic primitives such as searchable encryption. A generalization of IBE is attribute-based encryption (ABE). A major application of ABE is fine-grained cryptographic access control of data. Research on these topics is still actively continuing.
However, the security and privacy of IBE and ABE hinge on the assumption that the authority which sets up the system is honest. Our study aims to reduce this trust assumption.
The inherent key escrow of IBE has sparked numerous debates in the cryptography/security community. A curious key generation center (KGC) can simply generate the user's private key to decrypt a ciphertext. However, can a KGC still decrypt if it does not know the intended recipient of the ciphertext? This question is answered by formalizing KGC anonymous ciphertext indistinguishability (ACI-KGC). All existing practical pairing-based IBE schemes without random oracles do not achieve this notion. In this thesis, we propose an IBE scheme with ACI-KGC, and a new system architecture with an anonymous secret key generation protocol such that the KGC can issue keys to authenticated users without knowing the list of users' identities. This also matches the practice that authentication should be done with the local registration authorities. Our proposal can be viewed as mitigating the key escrow problem in a new dimension.
For ABE, it is not realistic to trust a single authority to monitor all attributes and hence distributing control over many attribute-authorities is desirable. A multi-authority ABE scheme can be realized with a trusted central authority (CA) which issues part of the decryption key according to a user's global identifier (GID). However, this CA may have the power to decrypt every ciphertext, and the use of a consistent GID allows the attribute-authorities to collectively build a full profile of all of a user's attributes. This thesis proposes a solution without the trusted CA and without compromising users' privacy, thus making ABE more usable in practice.
Underlying both contributions are our new privacy-preserving architectures, enabled by borrowing techniques from anonymous credentials.
Title: Tools and Techniques for the Sound Verification of Low Level Code
Candidate: Conway, Christopher L.
Advisor(s): Barrett, Clark
Abstract:
Software plays an increasingly crucial role in nearly every facet of modern life, from communications infrastructure to the control systems in automobiles, airplanes, and power plants. To achieve the highest degree of reliability for the most critical pieces of software, it is necessary to move beyond ad hoc testing and review processes towards verification---to prove using formal methods that a piece of code exhibits exactly those behaviors allowed by its specification and no others.
A significant portion of the existing software infrastructure is written in low-level languages like C and C++. Features of these languages present significant verification challenges. For example, unrestricted pointer manipulation means that we cannot prove even the simplest properties of programs without first collecting precise information about potential aliasing relationships between variables.
In this thesis, I present several contributions. The first is a general framework for combining program analyses that are only conditionally sound. Using this framework, I show it is possible to design a sound verification tool that relies on a separate, previously-computed pointer analysis.
The second contribution of this thesis is Cascade, a multi-platform, multi-paradigm framework for verification. Cascade includes support for precise analysis of low-level C code, as well as for higher-level languages such as SPL.
Finally, I describe a novel technique for the verification of datatype invariants in low-level systems code. The programmer provides a high-level specification for a low-level implementation in the form of inductive datatype declarations and code assertions. The connection between the high-level semantics and the implementation code is then checked using bit-precise reasoning. An implementation of this datatype verification technique is available as a Cascade module.
Title: Probabilistic and Topological Methods in Computational Geometry
Candidate: Dhandapani, Raghavan
Advisor(s): Pach, Janos
Abstract:
We consider four problems connected by the common thread of geometry. The first three involve problems and algorithms that arise in applications that a priori do not involve geometry, but for which geometry turns out to be the right language for visualization and analysis. In the fourth, we generalize some well-known results in geometry to the topological plane. The techniques we use come from probability and topology.
First, we consider two algorithms that work well in practice but whose success is not well understood theoretically.
Greedy routing is a routing mechanism that is commonly used in wireless sensor networks. While routing on the Internet uses standard established protocols, routing in ad-hoc networks with little structure (like sensor networks) is more difficult. Practitioners have devised algorithms that work well in practice; however, there were no known theoretical guarantees. We provide the first such result in this area by showing that greedy routing can be made to work on planar triangulations.
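The greedy routing rule in question is simple to state: forward the packet to the neighbor geometrically closest to the destination, and fail if no neighbor makes progress (the local-minimum failure that suitable planar triangulations provably avoid). A toy sketch on a hand-made graph:

```python
# Greedy geometric routing: at each hop, move to the neighbor closest
# (in Euclidean distance) to the destination; report failure when stuck.
import math

def greedy_route(adj, pos, src, dst):
    path = [src]
    cur = src
    while cur != dst:
        nxt = min(adj[cur], key=lambda v: math.dist(pos[v], pos[dst]))
        if math.dist(pos[nxt], pos[dst]) >= math.dist(pos[cur], pos[dst]):
            return None   # stuck at a local minimum: no progress possible
        path.append(nxt)
        cur = nxt
    return path

# A small hand-made graph with planar coordinates.
pos = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (2, 1)}
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(greedy_route(adj, pos, 0, 3))  # [0, 2, 3]
```

On general graphs the `None` branch can fire; the abstract's result is that on appropriate planar triangulations every packet makes progress at every hop, so greedy routing always succeeds.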
Linear Programming is a technique for optimizing a linear function subject to linear constraints. Simplex Algorithms are a family of algorithms that have proven quite successful in solving Linear Programs in practice. However, examples of Linear Programs on which these algorithms are very inefficient have been obtained by researchers. In order to explain this discrepancy between theory and practice, many authors have shown that Simplex Algorithms are efficient in expectation on randomized Linear Programs. We strengthen these results by proving a partial concentration bound for the Shadow Vertex Simplex Algorithm.
Next, we point out a limitation in an algorithm that is used commonly by practitioners and suggest a way of overcoming this.
Recommendation Systems are algorithms that are used to recommend goods (books, movies, etc.) to users based on the similarities between their past preferences and those of other users. Low Rank Approximation is a common method used for this. We point out a limitation of this method in the presence of ill-conditioning: the existence of multiple local minima. We also suggest a simple averaging-based technique to overcome this limitation.
Finally, we consider some basic results in convexity like Radon's, Helly's and Caratheodory's theorems and generalize them to the topological plane, i.e., a plane which has the concept of a linear path which is analogous to a straight line but no notion of metric or distances.
Title: Semi-Supervised Learning via Generalized Maximum Entropy
Candidate: Erkan, Ayse Naz
Advisor(s): LeCun, Yann
Abstract:
The maximum entropy (MaxEnt) framework has been studied extensively in the supervised setting. Here, the goal is to find a distribution p that maximizes an entropy function while enforcing data constraints, so that the expected values of some (pre-defined) features with respect to p approximately match their empirical counterparts. Using different entropy measures, different model spaces for p, and different approximation criteria for the data constraints yields a family of discriminative supervised learning methods (e.g., logistic regression, conditional random fields, least squares and boosting). This framework is known as the generalized maximum entropy framework.
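The correspondence described above can be seen in its simplest instance: with Shannon entropy and linear feature constraints, the dual problem is logistic regression, whose gradient (y - p) is exactly the gap between empirical and model feature expectations. A tiny binary fit on hand-made 1-D data (a sketch of the correspondence, not one of the thesis's SSL algorithms):

```python
# Binary logistic regression by gradient ascent on the log-likelihood.
# The per-sample gradient term (y - p) * x is the difference between
# the empirical feature value and its model expectation: the MaxEnt
# constraint being driven to zero.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(xs, ys, lr=0.5, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)
            gw += err * x
            gb += err
        w += lr * gw / len(xs)
        b += lr * gb / len(xs)
    return w, b

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logreg(xs, ys)
print(sigmoid(w * 2.0 + b) > 0.9, sigmoid(w * -2.0 + b) < 0.1)  # True True
```

The SSL extensions the thesis develops modify exactly this objective, adding penalty terms over unlabeled points rather than changing the labeled-data fit.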
Semi-supervised learning (SSL) has emerged in the last decade as a promising field that enables utilizing unlabeled data along with labeled data so as to increase the accuracy and robustness of inference algorithms. However, most SSL algorithms to date have had trade-offs, for instance in terms of scalability or applicability to multi-categorical data.
In this thesis, we extend the generalized MaxEnt framework to develop a family of novel SSL algorithms using two different approaches: (i) introducing similarity constraints, and (ii) augmenting constraints on model features. In the first approach, we incorporate unlabeled data via modifications to the primal MaxEnt objective in terms of additional potential functions. A potential function is a closed proper convex function that can take the form of a constraint and/or a penalty representing our structural assumptions on the data geometry. Specifically, we impose similarity constraints as additional penalties based on the semi-supervised smoothness assumption; i.e., we restrict the generalized MaxEnt problem so that similar samples have similar model outputs. In the second approach, we incorporate unlabeled data to enhance the estimates of the model and empirical expectations based on our assumptions on the data geometry.
In particular, we derive the semi-supervised formulations for three specific instances of the generalized MaxEnt on conditional distributions, namely logistic regression and kernel logistic regression for multi-class problems, and conditional random fields for structured output prediction problems. A thorough empirical evaluation on standard data sets that are widely used in the literature demonstrates the validity and competitiveness of the proposed algorithms. In addition to these benchmark data sets, we apply our approach to two real-life problems: i. vision based robot grasping, and ii. remote sensing image classification, where the scarcity of the labeled training samples is the main bottleneck in the learning process. For the particular case of grasp learning, we propose a combination of semi-supervised learning and active learning, another sub-field of machine learning that is focused on the scarcity of labeled samples, when the problem setup is suitable for incremental labeling.
The novel SSL algorithms proposed in this thesis have numerous advantages over the existing semi-supervised algorithms as they yield convex, scalable, inherently multi-class loss functions that can be kernelized naturally.
Title: Solving Quantified First Order Formulas in Satisfiability Modulo Theories
Candidate: Ge, Yeting
Advisor(s): Barrett, Clark
Abstract:
Design errors in computer systems, i.e. bugs, can cause inconvenience, loss of data and time, and in some cases catastrophic damages. One approach for improving design correctness is formal methods: techniques aiming at mathematically establishing that a piece of hardware or software satisfies certain properties. For some industrial cases in which formal methods are utilized, quantified first order formulas in satisfiability modulo theories (SMT) are useful. This dissertation presents several novel techniques for solving quantified formulas in SMT.
In general, the satisfiability of quantified formulas in SMT is undecidable. The practical approach for general quantifier reasoning in SMT is heuristics-based instantiation. This dissertation proposes a number of new heuristics that address several challenges. Experimental results show that with the new heuristics, significantly more benchmarks can be solved than before.
When we restrict attention to formulas within certain fragments of first order logic, complete algorithms based on instantiation become possible. We propose several new fragments, and we prove that formulas in these fragments can be solved by a complete instantiation-based algorithm. For satisfiable quantified formulas in these fragments, we show how to construct models.
As SMT solvers grow in complexity, their correctness becomes harder to establish. A practical method to improve confidence in their correctness is to check the proofs they produce. We propose a proof translator that translates proofs from the SMT solver CVC3 into the trusted theorem prover HOL Light, which actually checks the proofs. Experiments with the proof translator uncovered a faulty proof rule in CVC3 and problems in two MIT-labeled quantified benchmarks in the SMT benchmark library SMT-LIB.
Title: An Algorithmic Enquiry Concerning Causality
Candidate: Kleinberg, Samantha
Advisor(s): Mishra, Bhubaneswar
Abstract:
In many domains we face the problem of determining the underlying causal structure from time-course observations of a system. Whether we have neural spike trains in neuroscience, gene expression levels in systems biology, or stock price movements in finance, we want to determine why these systems behave the way they do. For this purpose we must assess which of the myriad possible causes are significant while aiming to do so with a feasible computational complexity. At the same time, there has been much work in philosophy on what it means for something to be a cause, but comparatively little attention has been paid to how we can identify these causes. Algorithmic approaches from computer science have provided the first steps in this direction, but fail to capture the complex, probabilistic and temporal nature of the relationships we seek.
This dissertation presents a novel approach to the inference of general (type-level) and singular (token-level) causes. The approach combines philosophical notions of causality with algorithmic approaches built on model checking and statistical techniques for false discovery rate control. By using a probabilistic computation tree logic to describe both cause and effect, we allow for complex relationships and explicit description of the time between cause and effect as well as the probability of this relationship being observed (e.g. "a and b until c, causing d in 10-20 time units"). Using these causal formulas and their associated probabilities, we develop a novel measure for the significance of a cause for its effect, thus allowing discovery of those that are statistically interesting, determined using the concepts of multiple hypothesis testing and false discovery control. We develop algorithms for testing these properties in time-series observations and for relating the inferred general relationships to token-level events (described as sequences of observations). Finally, we illustrate these ideas with example data from both neuroscience and finance, comparing the results to those found with other inference methods. The results demonstrate that our approach achieves superior control of false discovery rates, due to its ability to appropriately represent and infer temporal information.
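The average-significance style of measure described above can be sketched as follows. This is a hedged simplification: it fixes a single time lag, treats variables as boolean event series, and averages P(e | c, x) - P(e | not-c, x) over the other candidate causes x; the thesis's measure over probabilistic computation tree logic formulas with time windows is richer than this.

```python
import numpy as np

def avg_causal_significance(events, cause, effect, lag=1):
    """Mean, over the other candidate causes x, of
    P(effect at t+lag | cause and x at t) - P(effect at t+lag | not-cause and x at t).

    `events` maps variable names to boolean arrays indexed by time step.
    """
    e = events[effect][lag:]          # effect occurrences, shifted back by lag
    c = events[cause][:-lag]          # candidate cause occurrences
    eps = []
    for name, series in events.items():
        if name in (cause, effect):
            continue
        x = series[:-lag]
        with_c, without_c = e[c & x], e[~c & x]
        if len(with_c) and len(without_c):
            eps.append(with_c.mean() - without_c.mean())
    return np.mean(eps) if eps else 0.0
```

A cause that deterministically produces the effect one step later, while the background variable x is uninformative, receives significance 1.0.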
Title: Time Series Modeling with Hidden Variables and Gradient-Based Algorithms
Candidate: Mirowski, Piotr
Advisor(s): LeCun, Yann
Abstract:
We collect time series from real-world phenomena, such as gene interactions in biology or word frequencies in consecutive news articles. However, these data present us with an incomplete picture, as they result from complex dynamical processes involving unobserved state variables. Research on state-space models is motivated by the need to simultaneously infer hidden state variables from observations and learn the associated dynamic and generative models.
I have developed a tractable, gradient-based method for training Dynamic Factor Graphs (DFG) with continuous latent variables. A DFG consists of (potentially nonlinear) factors modeling joint probabilities between hidden and observed variables. The DFG assigns a scalar energy to each configuration of variables, and a gradient-based inference procedure finds the minimum-energy state sequence for a given observation sequence. We approximate maximum likelihood learning by minimizing the expected energy over training sequences with respect to the factors' parameters. These alternated inference and parameter updates constitute a deterministic EM-like procedure.
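A minimal sketch of this alternating inference/learning loop, assuming linear dynamics and observation factors for brevity (the thesis uses potentially nonlinear, deep factors); the energy is sum_t ||z_{t+1} - A z_t||^2 + ||x_t - C z_t||^2, and we alternate gradient steps on the latent states Z (inference) and the parameters A, C (learning):

```python
import numpy as np

def dfg_fit(X, dim_z=2, iters=300, lr=0.005):
    """Deterministic EM-like training of a toy linear Dynamic Factor Graph.

    Returns the latent state sequence, the learned factors, and the energy
    at each iteration (which should decrease).
    """
    T, dim_x = X.shape
    rng = np.random.default_rng(0)
    Z = rng.normal(scale=0.1, size=(T, dim_z))
    A = 0.5 * np.eye(dim_z)
    C = rng.normal(scale=0.1, size=(dim_x, dim_z))
    energies = []
    for _ in range(iters):
        r_dyn = Z[1:] - Z[:-1] @ A.T       # dynamics residuals: z_{t+1} - A z_t
        r_obs = X - Z @ C.T                # observation residuals: x_t - C z_t
        energies.append((r_dyn**2).sum() + (r_obs**2).sum())
        # inference: gradient step on the latent state sequence Z
        gZ = np.zeros_like(Z)
        gZ[1:] += 2 * r_dyn
        gZ[:-1] -= 2 * r_dyn @ A
        gZ -= 2 * r_obs @ C
        Z = Z - lr * gZ
        # learning: gradient step on the factor parameters A, C
        r_dyn = Z[1:] - Z[:-1] @ A.T
        r_obs = X - Z @ C.T
        A += lr * 2 * r_dyn.T @ Z[:-1]
        C += lr * 2 * r_obs.T @ Z
    return Z, A, C, energies
```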
Using nonlinear factors such as deep, convolutional networks, DFGs were shown to reconstruct chaotic attractors, to outperform a time series prediction benchmark, and to successfully impute motion capture data where a large number of markers were missing. In joint work with the NYU Plant Systems Biology Lab, DFGs were subsequently applied to the discovery of gene regulation networks by learning the dynamics of mRNA expression levels.
DFGs have also been extended into a deep auto-encoder architecture, and used on time-stamped text documents, with word frequencies as inputs. We focused on collections of documents that exhibit a structure over time. Working as dynamic topic models, DFGs could extract a latent trajectory from consecutive political speeches; applied to news articles, they achieved state-of-the-art text categorization and retrieval performance.
Finally, I used an embodiment of DFGs to evaluate the likelihood of discrete sequences of words in text corpora, relying on dynamics over word embeddings. Collaborating with AT&T Labs Research on a project in speech recognition, we improved on existing continuous statistical language models by enriching them with word features and long-range topic dependencies.
Title: Structure Prediction and Visualization in Molecular Biology
Candidate: Poultney, Christopher
Advisor(s): Shasha, Dennis
Abstract:
The tools of computer science can be a tremendous help to the working biologist. Two broad areas where this is particularly true are visualization and prediction. In visualization, the size of the data involved often makes meaningful exploration of the data and discovery of salient features difficult and time-consuming. Similarly, intelligent prediction algorithms can greatly reduce the lab time required to achieve significant results, or can reduce an intractable space of potential experiments to a tractable size.
While the thesis discusses both a visualization technique and a machine learning problem, the thesis presentation will focus exclusively on the machine learning problem: prediction of temperature-sensitive mutations from protein structure. Temperature-sensitive mutations are a tremendously valuable research tool, particularly for studying essential yeast genes. To date, most methods for generating temperature-sensitive mutations involve large-scale random mutation followed by an intensive screening and characterization process. While there have been successful efforts to improve this process by rational design of temperature-sensitive proteins, surprisingly little work has been done on predicting which mutations will exhibit a temperature-sensitive phenotype. We describe a system that, given the structure of a protein of interest, uses a combination of protein structure prediction and machine learning to provide a ranked "top 5" list of likely candidates for temperature-sensitive mutations.
Title: Theoretical Foundations and Algorithms for Learning with Multiple Kernels
Candidate: Rostamizadeh, Afshin
Advisor(s): Mohri, Mehryar
Abstract:
Kernel-based algorithms have been used with great success in a variety of machine learning applications. These include algorithms such as support vector machines for classification, kernel ridge regression, ranking algorithms, clustering algorithms, and virtually all popular dimensionality reduction algorithms, since they are special instances of kernel principal component analysis.
However, the choice of the kernel, which is crucial to the success of these algorithms, has traditionally been left entirely to the user. Rather than requiring the user to commit to a specific kernel, multiple kernel algorithms require the user only to specify a family of kernels. This family of kernels can be used by a learning algorithm to form a combined kernel and derive an accurate predictor. This problem has attracted a lot of attention recently, both from the theoretical point of view and from the algorithmic, optimization, and application points of view.
This thesis presents a number of novel theoretical and algorithmic results for learning with multiple kernels.
It gives the first tight margin-based generalization bounds for learning kernels with Lp regularization. In particular, our margin bounds for L1 regularization are shown to have only a logarithmic dependency on the number of kernels, which is a significant improvement over all previous analyses. Our results also include stability-based guarantees for a class of regression algorithms. In all cases, these guarantees indicate the benefits of learning with a large number of kernels.
We also present a family of new two-stage algorithms for learning kernels based on a notion of alignment and give an extensive analysis of the properties of these algorithms. We show the existence of good predictors for the notion of alignment we define and give efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP.
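A hedged sketch of an alignment computation of the kind described: the centered alignment between a kernel matrix K and the ideal target kernel y y^T built from the labels. The centering and normalization shown here are one standard formulation and may differ in detail from the thesis's definition.

```python
import numpy as np

def center(K):
    """Center a kernel matrix in feature space: K_c = H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment(K, y):
    """Centered alignment <K_c, y y^T>_F / (||K_c||_F * ||y y^T||_F), in [-1, 1]."""
    Kc = center(K)
    Y = np.outer(y, y)
    return (Kc * Y).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Y))
```

A kernel that is already (a centered copy of) the target kernel achieves alignment 1; the two-stage approach first combines base kernels to maximize this quantity, then trains a standard predictor on the combined kernel.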
Finally, we also report the results of extensive experiments with our two-stage algorithms in classification and regression tasks, which show an improvement both over the uniform combination of kernels and over other state-of-the-art learning kernel methods for L1 and L2 regularization. These might constitute the first series of results for learning with multiple kernels that demonstrate a consistent improvement over a uniform combination of kernels.
Title: Creating collections and evaluating viewpoints: Selection techniques for interface design
Candidate: Secord, Adrian
Advisor(s): Zorin, Denis
Abstract:
In computer graphics and user interface design, selection problems are those that require the user to select a collection consisting of a small number of items from a much larger library. This dissertation explores selection problems in two diverse domains: large personal multimedia collections, containing items such as personal photographs or songs, and camera positions for 3D objects, where each item is a different viewpoint observing an object. Multimedia collections consist of discrete items with strong associated metadata, while camera positions form a continuous space but are weak in metadata. In either domain, the items to be selected have rich interconnections and dependencies, making it difficult to successfully apply simple techniques (such as ranking) to aid the user. Accordingly, we develop separate approaches for the two domains.
For personal multimedia collections, we leverage the semantic metadata associated with each item (such as song title, artist name, etc.) and provide the user with a simple query language to describe their desired collection. Our system automatically suggests a collection of items that conform to the user's query. Since any query language has limited expressive power, and since users often create collections via exploration, we provide various refinement techniques that allow the user to expand, refine and explore their collection directly through examples.
For camera positioning, we do not have the advantage of having semantic metadata for each item, unlike in media collections. We instead create a proxy viewpoint goodness function which can be used to guide the solution of various selection problems involving camera viewpoints. This function is constructed from several different attributes of the viewpoint, such as how much surface area is visible, or how "curvy" the silhouette is. Since there are many possible viewpoint goodness functions, we conducted a large user study of viewpoint preference and use the results to evaluate thousands of different functions and find the best ones. While we suggest several goodness functions to the practitioner, our user study data and methodology can be used to evaluate any proposed goodness function; we hope it will be a useful tool for other researchers.
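A proxy goodness function of the kind described can be sketched as a weighted combination of per-viewpoint attributes; the attribute names and weights below are purely illustrative assumptions, since the dissertation evaluates many candidate functions against user-study data:

```python
def best_viewpoint(candidates, weights):
    """Select the best camera viewpoint under a linear proxy goodness function:
    score(v) = sum_k weights[k] * attrs[v][k].

    `candidates` is a list of dicts with an "attrs" mapping of attribute
    name to value (e.g. visible surface area, silhouette curviness).
    """
    def score(attrs):
        return sum(weights[k] * attrs[k] for k in weights)
    return max(candidates, key=lambda v: score(v["attrs"]))
```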
Title: Analysis of Mass Spectrometry Data for Protein Identification In Complex Biological Mixtures
Candidate: Spivak, Marina
Advisor(s): Greengard, Leslie
Abstract:
Mass spectrometry is a powerful technique in analytical chemistry that was originally designed to determine the composition of small molecules in terms of their constituent elements. In the last several decades, it has begun to be used for much more complex tasks, including the detailed analysis of the amino acid sequence that makes up an unknown protein and even the identification of multiple proteins present in a complex mixture. The latter problem is largely unsolved and the principal subject of this dissertation.
The fundamental difficulty in the analysis of mass spectrometry data is that of ill-posedness. There are multiple solutions consistent with the experimental data and the data is subject to significant amounts of noise. In this work, we have developed application-specific machine learning algorithms that (partially) overcome this ill-posedness. We make use of labeled examples of a single class of peptide fragments and of the unlabeled fragments detected by the instrument. This places the approach within the broader framework of semi-supervised learning.
Recently, there has been considerable interest in classification problems of this type, where the learning algorithm only has access to labeled examples of a single class and unlabeled data. The motivation for such problems is that in many applications, examples of one of the two classes are easy and inexpensive to obtain, whereas the acquisition of examples of the second class is difficult and labor-intensive. For example, in document classification, positive examples are documents that address a specific subject, while unlabeled documents are abundant. In movie rating, the positive data are the movies chosen by clients, while the unlabeled data are all remaining movies in a collection. In medical imaging, positive (labeled) data correspond to images of tissue affected by a disease, while the remaining available images of the same tissue comprise the unlabeled data. Protein identification using mass spectrometry is another variant of this general problem.
In this work, we propose application-specific machine learning algorithms to address this problem. The reliable identification of proteins from mixtures using mass spectrometry would provide an important tool in both biomedical research and clinical practice.
Title: Matrix Approximation for Large-scale Learning
Candidate: Talwalkar, Ameet
Advisor(s): Mohri, Mehryar
Abstract:
Modern learning problems in computer vision, natural language processing, computational biology, and other areas are often based on large data sets of thousands to millions of training instances. However, several standard learning algorithms, such as kernel-based algorithms, e.g., Support Vector Machines, Kernel Ridge Regression, Kernel PCA, do not easily scale to such orders of magnitude. This thesis focuses on sampling-based matrix approximation techniques that help scale kernel-based algorithms to large-scale datasets. We address several fundamental theoretical and empirical questions including:
What approximation should be used? We discuss two common sampling-based methods, providing novel theoretical insights regarding their suitability for various applications and experimental results motivated by this theory. Our results show that one of these methods, the Nystrom method, is superior in the context of large-scale learning.
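The Nystrom method approximates a positive semidefinite kernel matrix from a subset of its columns: with C = K[:, idx] and W the corresponding submatrix, K is approximated by C W^+ C^T. A minimal sketch:

```python
import numpy as np

def nystrom(K, idx):
    """Nystrom low-rank approximation of a PSD kernel matrix K from the
    sampled columns in `idx`: K ~ C W^+ C^T, where C = K[:, idx] and
    W = K[idx][:, idx]."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```

When the sampled columns span the column space of K (e.g., K has rank r and r linearly independent columns are sampled), the reconstruction is exact; in general the quality depends on the sampling distribution, which is exactly the question studied in the thesis.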
Do these approximations work in practice? We show the effectiveness of approximation techniques on a variety of problems. In the largest study to date for manifold learning, we use the Nystrom method to extract low-dimensional structure from high-dimensional data to effectively cluster face images. We also report good empirical results for kernel ridge regression and kernel logistic regression.
How should we sample columns? A key aspect of sampling-based algorithms is the distribution according to which columns are sampled. We study both fixed and adaptive sampling schemes as well as a promising ensemble technique that can be easily parallelized and generates superior approximations, both in theory and in practice.
How well do these approximations work in theory? We provide theoretical analyses of the Nystrom method to understand when this technique should be used. We present guarantees on approximation accuracy based on various matrix properties and analyze the effect of matrix approximation on actual kernel-based algorithms.
This work has important consequences for the machine learning community since it extends to large-scale applications the benefits of kernel-based algorithms. The crucial aspect of this research, involving low-rank matrix approximation, is of independent interest within the field of numerical linear algebra.
Title: Factor Graphs for Relational Regression
Candidate: Chopra, Sumit
Advisor(s): LeCun, Yann
Abstract:
Inherent in many interesting regression problems is a rich underlying inter-sample "relational structure". In these problems, the samples may be related to each other in such a way that the unknown variables associated with any sample depend not only on its individual attributes, but also on the variables associated with related samples. One such problem, whose importance is further emphasized by the present economic crisis, is understanding real estate prices. The price of a house clearly depends on its individual attributes, such as the number of bedrooms. However, the price also depends on the neighborhood in which the house lies and on the time period in which it was sold. This effect of neighborhood and time on the price is not directly measurable; it is merely reflected in the prices of other houses in the vicinity that were sold around the same time. Uncovering these spatio-temporal dependencies can certainly help us better understand house prices, while at the same time improving prediction accuracy.
Problems of this nature fall in the domain of "Statistical Relational Learning". However, the drawback of most models proposed so far is that they cater only to classification problems. To this end, we propose "relational factor graph" models for regression on relational data. A single factor graph is used to capture, first, dependencies among the individual variables of a sample, and second, dependencies among variables associated with multiple samples. The proposed models are capable of capturing hidden inter-sample dependencies via latent variables, and also permit log-likelihood functions that are non-linear in the parameters, thereby allowing considerably more complex architectures. Efficient inference and learning algorithms for relational factor graphs are proposed. The models are applied to predicting the prices of real estate properties and to constructing house price indices. The relational aspect of the model accounts for the hidden spatio-temporal influences on the price of every house. Experiments show that one can achieve considerably superior performance by identifying and using the underlying spatio-temporal structure associated with the problem. To the best of our knowledge, this is the first work on relational regression, and also the first to construct house price indices by simultaneously accounting for the spatio-temporal effects on house prices using a large-scale industry-standard data set.
Title: Numerical Estimation of the Second Largest Eigenvalue of a Reversible Markov Transition Matrix
Candidate: Gade, Kranthi
Advisor(s): Goodman, Jonathan
Abstract:
We discuss the problem of finding the second largest eigenvalue of an operator that defines a reversible Markov chain. The second largest eigenvalue governs the rate at which the statistics of the Markov chain converge to equilibrium. Scientific applications include understanding the very slow dynamics of some models of dynamic glass. Applications in computing include estimating the rate of convergence of Markov chain Monte Carlo algorithms.
Most practical Markov chains have state spaces so large that direct or even iterative methods from linear algebra are inapplicable. The size of the state space, which is the dimension of the eigenvalue problem, grows exponentially with the system size. This makes it impossible to store a vector (for sparse methods), let alone a matrix (for dense methods). Instead, we seek a method that uses only time correlation from samples produced from the Markov chain itself.
In this thesis, we propose a novel Krylov subspace type method to estimate the second eigenvalue from the simulation data of the Markov chain using test functions which are known to have good overlap with the slowest mode. This method starts with the naive Rayleigh quotient estimate of the test function and refines it to obtain an improved estimate of the second eigenvalue. We apply the method to a few model problems, and the estimate compares very favorably with the known answer. We also apply the estimator to some Markov chains occurring in practice, most notably in the study of glasses. We show experimentally that our estimator is more accurate and stable on these problems than existing methods.
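The naive Rayleigh quotient starting point mentioned above can be sketched as follows: for a reversible chain and a test function f, the lag-t autocovariance C(t) = Cov(f(X_s), f(X_{s+t})) decays like lambda_2^t when f overlaps the slowest mode, so C(1)/C(0) gives a first estimate of lambda_2. The Krylov-subspace refinement is omitted, and the two-state test chain below is an illustrative assumption.

```python
import numpy as np

def rayleigh_estimate(samples, f):
    """Naive Rayleigh-quotient estimate of the second eigenvalue of a
    reversible Markov chain from a simulated trajectory `samples`:
    lambda_2 ~ C(1)/C(0), where C(t) is the lag-t autocovariance of f.
    """
    v = f(samples) - f(samples).mean()
    c0 = (v * v).mean()                 # C(0): variance of f along the chain
    c1 = (v[:-1] * v[1:]).mean()        # C(1): lag-1 autocovariance
    return c1 / c0
```

For a symmetric two-state chain that flips state with probability p each step, the exact second eigenvalue is 1 - 2p, which the estimator recovers from a long enough trajectory.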
Title: 2D-Centric Interfaces and Algorithms for 3D Modeling
Candidate: Gingold, Yotam
Advisor(s): Zorin, Denis
Abstract:
The creation of 3D models is a fundamental task in computer graphics. The task is required by professional artists working on movies, television, and games, and desired by casual users who wish to make their own models for use in virtual worlds or as a hobby.
In this thesis, we consider approaches to creating and editing 3D models that minimize the user's thinking in 3D. In particular, our approaches do not require the user to manipulate 3D positions in space or mentally invert complex 3D-to-2D mappings. We present interfaces and algorithms for the creation of 3D surfaces, for texturing, and for adding small-to-medium scale geometric detail.
First, we present a novel approach for texture placement and editing based on direct manipulation of textures on the surface. Compared to conventional tools for surface texturing, our system combines UV-coordinate specification and texture editing into one seamless process, reducing the need for careful initial design of parameterization and providing a natural interface for working with textures directly on 3D surfaces.
Second, we present a system for free-form surface modeling that allows a user to modify a shape by changing its rendered, shaded image using stroke-based drawing tools. A new shape, whose rendered image closely approximates user input, is computed using an efficient and stable surface optimization procedure. We demonstrate how several types of free-form surface edits which may be difficult to cast in terms of standard deformation approaches can be easily performed using our system.
Third, we present a single-view 2D interface for 3D modeling based on the idea of placing 2D primitives and annotations on an existing, pre-made sketch or image. Our interface frees users to create 2D sketches from arbitrary angles using their preferred tool---including pencil and paper---which they then "describe" using our tool to create a 3D model. Our primitives are manipulated with persistent, dynamic handles, and our annotations take the form of markings commonly used in geometry textbooks.
Title: Proximity problems for point sets residing in spaces with low doubling dimension
Candidate: Gottlieb, Lee-Ad
Advisor(s): Cole, Richard
Abstract:
In this thesis we consider proximity problems on point sets. Proximity problems arise in all fields of computer science, with broad application to computational geometry, machine learning, computational biology, data mining and the like. In particular, we will consider the problems of approximate nearest neighbor search, and dynamic maintenance of a spanner for a point set.
It has been conjectured that all algorithms for these two problems suffer from the "curse of dimensionality," meaning that their run times grow exponentially with the dimension of the point set. To avoid this undesirable growth, we consider point sets of doubling dimension lambda. We first present a dynamic data structure that uses linear space and supports a (1+e)-approximate nearest neighbor search of the point set. We then extend this algorithm to allow the dynamic maintenance of a low-degree (1+e)-spanner for the point set. The query and update times of these structures are exponential in lambda (as opposed to exponential in the dimension); when lambda is small, this provides a significant speed-up over known algorithms, and when lambda is constant these run times are optimal up to a constant. Even when no assumptions are made on lambda, the query and update times of the nearest neighbor search structure match the best known run times for approximate nearest neighbor search (up to a constant multiple in lambda). Further, the stretch of the spanner is optimal, and its update times improve on those of all previously known algorithms.
Title: Creativity Support for Computational Literature
Candidate: Howe, Daniel
Advisor(s): Perlin, Ken
Abstract:
The creativity support community has a long history of providing valuable tools to artists and designers. Similarly, creative digital media practice has proven a valuable pedagogical strategy for teaching core computational ideas. Neither strain of research has focused on the domain of literary art, however, instead targeting visual and aural media almost exclusively.
To address this situation, this thesis presents a software toolkit created specifically to support creativity in computational literature. Two primary hypotheses direct the bulk of the research presented: first, that it is possible to implement effective creativity support tools for literary art given current resource constraints; and second, that such tools, in addition to facilitating new forms of literary creativity, will provide unique opportunities for computer science education.
Designed both for practicing artists and for pedagogy, the research presented directly addresses impediments to participation in the field for a diverse range of users. It provides an end-to-end solution for courses attempting to engage the creative faculties of computer science students, and introduces a wider demographic -- from writers to digital artists to media and literary theorists -- to procedural literacy and computational thinking.
The tools and strategies presented have been implemented, deployed, and iteratively refined in real-world contexts over the past three years. In addition to their use in large-scale projects by contemporary artists, they have provided effective support for multiple iterations of 'Programming for Digital Art & Literature', a successful interdisciplinary computer science course taught by the author.
Taken together, this thesis provides a novel set of tools for a new domain, and demonstrates their real-world efficacy in providing both creativity and pedagogical support for a diverse and emerging population of users.
Title: Efficient Systems Biology Algorithms for Biological Networks over Multiple Time-Scales: From Evolutionary to Regulatory Time
Candidate: Mitrofanova, Antonina
Advisor(s): Mishra, Bud
Abstract:
Recently, Computational Biology has emerged as one of the most exciting areas of computer science research, not only because of its immediate impact on many biomedical applications (e.g., personalized medicine, drug and vaccine discovery, tools for diagnostics and therapeutic interventions, etc.), but also because it raises many new and interesting combinatorial and algorithmic questions in the process. In this thesis, we focus on robust and efficient algorithms to analyze biological networks, primarily targeting protein networks, possibly the most fascinating networks in computational biology in terms of their structure, evolution and complexity, as well as because of their role in various genetic and metabolic diseases.
Classically, protein networks have been studied statically, i.e., without taking into account time-dependent metamorphic changes in network topology and functionality. In this work, we introduce new analysis techniques that view protein networks as being dynamic in nature, evolving over time, and diverse in regulatory patterns at various stages of the system development. Our analysis is capable of dealing with multiple time-scales: ranging from the slowest time-scale corresponding to evolutionary time between species, speeding up to inter-species pathway evolution time, and finally, moving to the other extreme at the cellular developmental time-scale.
We also provide a new method to overcome limitations imposed by the corrupting effects of experimental noise (e.g., high false positive and false negative rates) in Yeast Two-Hybrid (Y2H) networks, which often provide primary data for protein complexes. Our new combinatorial algorithm measures connectivity between proteins in a Y2H network not by edges but by edge-disjoint paths, which better reflects pathway evolution within a single-species network. This algorithm has been shown to be robust against increasing false positives and false negatives, as estimated using variation of information and separation measures.
In addition, we have devised a new way to incorporate evolutionary information in order to significantly improve classification of proteins, especially those isolated in their own networks or surrounded by poorly characterized neighbors. In our method, the networks of two (or more) species are joined by edges of high sequence similarity so that protein-homologs of different species can exchange information and acquire new and improved functional associations.
Finally, we have integrated many of these techniques into one tool to create a novel analysis of the malaria parasite P. falciparum's life-cycle at the reaction-time scale and single-cell level, encompassing its entire intraerythrocytic developmental cycle (IDC). Our approach allows connecting time-course gene expression profiles of consecutive IDC stages in order to assign functions to unannotated malaria proteins and predict potential targets for vaccine and drug development.
Title: Detecting, modeling and rendering complex configurations of curvilinear features
Candidate: Parilov, Evgueni
Advisor(s): Zorin, Denis
Abstract:
Curvilinear features allow one to represent a variety of real-world regular patterns, like honeycomb tiling, as well as very complicated random patterns, like networks of furrows on the surface of the human skin. We have developed a set of methods and new data representations for solving key problems related to curvilinear features, which include robust detection of intricate networks of curvilinear features from digital images, GPU-based sharp rendering of fields with curvilinear features, and a parametric synthesis approach to generate systems of curvilinear features with desirable local configurations and global control.
Existing edge-detection techniques may underperform in the presence of noise, usually do not link the detected edge points into chains, often fail on complex structures, depend heavily on an initial guess, and require a significant manual phase. We have developed a technique based on active contours, or snakes, which avoids manual initial positioning of the snakes and can detect large networks of curves with complex junctions without user guidance.
The standard bilinear interpolation of piecewise continuous fields results in unwanted smoothing along curvilinear discontinuities. Spatially varying features are best represented as a function of the distance to the discontinuity curves and its gradient. We have developed a real-time, GPU-based method for interpolating an unsigned distance function field and its gradient field which preserves discontinuity feature curves, represented by quadratic Bezier curves, with minimal restrictions on their topology.
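To make the quantities concrete, the sketch below approximates the unsigned distance from a point to a quadratic Bezier curve, and the distance gradient, by brute-force dense sampling on the CPU. This is only an illustrative stand-in: the thesis's method evaluates these fields in real time on the GPU while preserving the discontinuity curves.

```python
import math

def bezier2(p0, p1, p2, t):
    """Evaluate a quadratic Bezier curve with control points p0, p1, p2 at t."""
    u = 1.0 - t
    return (u*u*p0[0] + 2*u*t*p1[0] + t*t*p2[0],
            u*u*p0[1] + 2*u*t*p1[1] + t*t*p2[1])

def unsigned_distance_and_gradient(q, p0, p1, p2, samples=256):
    """Approximate the unsigned distance from point q to the curve, and its
    gradient (unit vector pointing away from the nearest curve point), by
    dense sampling -- a CPU stand-in for a closed-form/GPU evaluation."""
    best_d2, best_pt = float("inf"), None
    for i in range(samples + 1):
        c = bezier2(p0, p1, p2, i / samples)
        d2 = (q[0] - c[0])**2 + (q[1] - c[1])**2
        if d2 < best_d2:
            best_d2, best_pt = d2, c
    d = math.sqrt(best_d2)
    if d == 0.0:
        return 0.0, (0.0, 0.0)   # gradient is undefined on the curve itself
    return d, ((q[0] - best_pt[0]) / d, (q[1] - best_pt[1]) / d)
```

For a degenerate (straight-line) Bezier along the x-axis, the distance from (0, 1) comes out as 1 with gradient (0, 1), as expected.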
Detail features are important visual cues which make computer-generated imagery look less artificial. Instead of using sample-based synthesis techniques, which lack user control over features and often produce gaps in features or break feature coherency, we have explored an alternative approach of generating features using random fibre processes. We have developed a Gibbs-type random process of linear fibres based on local fibre interactions. It allows generating non-stationary curvilinear networks with some degree of regularity, and provides an intuitive set of parameters which directly define local fibre configurations and the global pattern of fibres.
For random systems of linear fibres which approximately form two orthogonal dominant orientation fields, we have adapted a streamline placement algorithm which converts such systems into overlapping random sets of coherent smooth curves.
Title: Unsupervised Learning of Feature Hierarchies
Candidate: Ranzato, Marc'Aurelio
Advisor(s): LeCun, Yann
Abstract:
The applicability of machine learning methods is often limited by the amount of available labeled data, and by the ability (or inability) of the designer to produce good internal representations and good similarity measures for the input data vectors.
The aim of this thesis is to alleviate these two limitations by proposing algorithms to learn good internal representations and invariant feature hierarchies from unlabeled data. These methods go beyond traditional supervised learning algorithms and rely on unsupervised and semi-supervised learning.
In particular, this work focuses on ''deep learning'' methods, a set of techniques and principles to train hierarchical models. Hierarchical models produce feature hierarchies that can capture complex non-linear dependencies among the observed data variables in a concise and efficient manner. After training, these models can be employed in real-time systems because they compute the representation by a very fast forward propagation of the input through a sequence of non-linear transformations.
When the paucity of labeled data does not allow the use of traditional supervised algorithms, each layer of the hierarchy can be trained in sequence, starting at the bottom, using unsupervised or semi-supervised algorithms. Once each layer has been trained, the whole system can be fine-tuned in an end-to-end fashion. We propose several unsupervised algorithms that can be used as building blocks to train such feature hierarchies. We investigate algorithms that produce sparse overcomplete representations and features that are invariant to known and learned transformations. These algorithms are designed using the Energy-Based Model framework and gradient-based optimization techniques that scale well to large datasets. The principle underlying these algorithms is to learn representations that are at the same time sparse, able to reconstruct the observation, and directly predictable by some learned mapping that can be used for fast inference at test time.
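The sparse-reconstruction principle can be sketched in a toy form: infer a code z that reconstructs an input x under a fixed dictionary D while paying an L1 penalty for non-zero coefficients, i.e. minimize 0.5*||x - Dz||^2 + lam*|z|_1 by iterative shrinkage-thresholding (ISTA). This is only a minimal illustration; the thesis's algorithms additionally learn the dictionary and train a fast feed-forward predictor of the code, both of which this sketch omits.

```python
def ista_sparse_code(x, D, lam=0.1, lr=0.1, iters=200):
    """Infer a sparse code z minimizing 0.5*||x - D z||^2 + lam*|z|_1
    by iterative shrinkage-thresholding (ISTA).
    D is an m-by-k matrix given as a list of rows; its columns are atoms."""
    m, k = len(x), len(D[0])
    z = [0.0] * k
    for _ in range(iters):
        # residual r = D z - x (computed once per sweep)
        r = [sum(D[i][j] * z[j] for j in range(k)) - x[i] for i in range(m)]
        for j in range(k):
            # gradient step on the reconstruction term: g_j = (D^T r)_j
            g = sum(D[i][j] * r[i] for i in range(m))
            z[j] -= lr * g
            # soft-threshold: the proximal step for the L1 penalty
            t = lr * lam
            z[j] = max(z[j] - t, 0.0) if z[j] > 0 else min(z[j] + t, 0.0)
    return z
```

With an identity dictionary and x = (1, 0), the code converges to (0.9, 0): the first coefficient is shrunk by exactly lam, and the second stays at zero, illustrating sparsity.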
Building on these general principles, we validate these models on a variety of tasks, from visual object recognition to text document classification and retrieval.
Title: Search Problems for Speech and Audio Sequences
Candidate: Weinstein, Eugene
Advisor(s): Mohri, Mehryar
Abstract:
The modern proliferation of very large audio, video, and biological databases has created a need for the design of effective methods for indexing and searching highly variable or uncertain data. Classical search and indexing algorithms deal with clean or perfect input sequences. However, an index created from speech transcriptions is marred by errors and uncertainties stemming from the use of imperfect statistical models in the speech recognition process. Similarly, automatic transcription of music, such as assigning a sequence of notes to represent a stream of music audio, is prone to errors. How can we generalize search and indexing algorithms to deal with such uncertain inputs?
This thesis presents several novel algorithms, analyses, and general techniques and tools for effective indexing and search that not only tolerate but actually exploit this uncertainty. In particular, it develops an algorithmic foundation for music identification, or content-based music search; presents novel automata-theoretic results applicable generally to a variety of search and indexing tasks; and describes new algorithms for topic segmentation, or automatic splitting of speech streams into topic-coherent segments.
We devise a new technique for music identification in which each song is represented by a distinct sequence of music sounds, called "music phonemes." In our approach, we learn the set of music phonemes, as well as a unique sequence of music phonemes characterizing each song, from training data using an unsupervised algorithm. We also propose a novel application of factor automata to create a compact mapping of music phoneme sequences to songs. Using these techniques, we construct an efficient and robust music identification system for a large database of songs.
We further design new algorithms for compact indexing of uncertain inputs based on suffix and factor automata and give novel theoretical guarantees for their space requirements. Suffix automata and factor automata represent the set of all suffixes or substrings of a set of strings, and are used in numerous indexing and search tasks, including the music identification system just mentioned. We show that the suffix automaton or factor automaton of a set of strings U has at most 2Q-2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a significant improvement over previous work. We also describe a matching new linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|).
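For intuition about the indexing structure, the classical online construction of the suffix automaton of a single string is sketched below (the standard Blumer et al. algorithm, not the thesis's generalization to sets of strings, whose 2Q-2 bound is stated above). The analogous single-string bound is at most 2n - 1 states for a string of length n >= 2.

```python
def build_suffix_automaton(s):
    """Online construction of the suffix automaton of a single string.
    Each state is a dict with 'next' (transitions), 'link' (suffix link),
    and 'len' (length of the longest substring ending in that state)."""
    sa = [{'next': {}, 'link': -1, 'len': 0}]
    last = 0
    for ch in s:
        cur = len(sa)
        sa.append({'next': {}, 'link': -1, 'len': sa[last]['len'] + 1})
        p = last
        while p != -1 and ch not in sa[p]['next']:
            sa[p]['next'][ch] = cur
            p = sa[p]['link']
        if p == -1:
            sa[cur]['link'] = 0
        else:
            q = sa[p]['next'][ch]
            if sa[p]['len'] + 1 == sa[q]['len']:
                sa[cur]['link'] = q
            else:
                # split: clone q so suffix-link lengths stay consistent
                clone = len(sa)
                sa.append({'next': dict(sa[q]['next']),
                           'link': sa[q]['link'],
                           'len': sa[p]['len'] + 1})
                while p != -1 and sa[p]['next'].get(ch) == q:
                    sa[p]['next'][ch] = clone
                    p = sa[p]['link']
                sa[q]['link'] = clone
                sa[cur]['link'] = clone
        last = cur
    return sa

def recognizes(sa, t):
    """True iff t is a substring of the indexed string."""
    state = 0
    for ch in t:
        if ch not in sa[state]['next']:
            return False
        state = sa[state]['next'][ch]
    return True
```

Every substring of the input (and nothing else) is accepted, and the number of states respects the 2n - 1 bound.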
We also define a new quality measure for topic segmentation systems and design a discriminative topic segmentation algorithm for speech inputs, thus facilitating effective indexation of spoken audio collections. The new quality measure improves on previously used criteria and is correlated with human judgment of topic-coherence. Our segmentation algorithm uses a novel general topical similarity score based on word co-occurrence statistics. This new algorithm outperforms previous methods in experiments over speech and text streams. We further demonstrate that the performance of segmentation algorithms can be improved by using a lattice of competing hypotheses over the speech stream rather than just the one-best hypothesis as input.
Title: Using Application-Domain Knowledge in the Runtime Support of Multi-Experiment Computational Studies
Candidate: Yau, Siu-Man
Advisor(s): Karamcheti, Vijay; Zorin, Denis
Abstract:
Multi-Experiment Studies (MESs) are a type of computational study in which the same simulation software is executed multiple times, and the results of all executions need to be aggregated to obtain useful insight. As computational simulation experiments become increasingly accepted as part of the scientific process, the use of MESs is becoming more widespread among scientists and engineers.
MESs present several challenging requirements on the computing system. First, many MESs need constant user monitoring and feedback, requiring simultaneous steering of multiple executions of the simulation code. Second, MESs can comprise many executions of long-running simulations; the sheer volume of computation can make them prohibitively long to run.
Parallel architectures offer an attractive computing platform for MESs. Low-cost, small-scale desktops employing multi-core chips allow widespread dedicated local access to parallel computation power, offering more research groups an opportunity to achieve interactive MESs. Massively-parallel, high-performance computing clusters afford a level of parallelism never seen before, and present an opportunity to address the problem of computationally intensive MESs.
However, to fully leverage the benefits of parallel architectures, the traditional parallel systems' view has to be augmented. Existing parallel computing systems often treat each execution of the software as a black box, and thus cannot view an entire computational study as a single entity to be optimized as a whole.
This dissertation investigates how a parallel system can view MESs as an end-to-end system and leverage the application-specific properties of MESs to address its requirements. In particular, the system can 1) adapt its scheduling decisions to the overall goal of an MES to reduce the needed computation, 2) simultaneously aggregate results from, and disseminate user actions to, multiple executions of the software to enable simultaneous steering, 3) store reusable information across executions of the simulation software to reduce individual run-time, and 4) adapt its resource allocation policies to the MES's properties to improve resource utilization.
Using a test bed system called SimX and four example MESs across different disciplines, this dissertation shows that the application-aware MES-level approach can achieve multi-fold to multiple orders-of-magnitude improvements over the traditional simulation-level approach.
Title: Ensuring Correctness of Compiled Code
Candidate: Zaks, Ganna
Advisor(s): Pnueli, Amir
Abstract:
Traditionally, the verification effort is applied to the abstract algorithmic descriptions of the underlying software. However, even well-understood protocols such as Peterson's protocol for mutual exclusion, whose algorithmic description takes only half a page, have published implementations that are erroneous. Furthermore, the semantics of the implementations can be altered by optimizing compilers, which are very large applications and, consequently, are bound to have bugs. Thus, it is highly desirable to ensure the correctness of the compiled code, especially in safety-critical and high-assurance software. This dissertation describes two alternative approaches that bring us closer to solving the problem.
First, we present CoVaC - a deductive framework for proving program equivalence and its application to automatic verification of transformations performed by optimizing compilers. To leverage the existing program analysis techniques, we reduce the equivalence checking problem to analysis of one system - a cross-product of the two input programs. We show how the approach can be effectively used for checking equivalence of single-threaded programs that are structurally similar. Unlike the existing frameworks, our approach accommodates absence of compiler annotations and handles most of the classical intraprocedural optimizations such as constant folding, reassociation, common subexpression elimination, code motion, dead code elimination, branch optimizations, and others. In addition, we have developed rules for translation validation of interprocedural optimizations, which can be applied when compiler annotations are available.
The second contribution is the Pancam framework for verifying multi-threaded C programs. Pancam first compiles a multi-threaded C program into an optimized bytecode format. The framework relies on Spin, an existing explicit-state model checker, to orchestrate the program's state space search. However, the program transitions and states are computed by the Pancam bytecode interpreter. A feature of our approach is that Pancam not only checks the actual implementation, but can also check the code after compiler optimizations. Pancam addresses the state space explosion problem by allowing users to define data abstraction functions and to constrain the number of allowed context switches. We also describe a partial order reduction method that reduces context switches using dynamic knowledge computed on-the-fly, while being sound for both safety and liveness properties.
Title: Verification of Transactional Memories and Recursive Programs
Candidate: Cohen, Ariel
Advisor(s): Pnueli, Amir
Abstract:
Transactional memory is a programming abstraction intended to simplify the synchronization of conflicting concurrent memory accesses without the difficulties associated with locks. In the first part of this thesis we provide a framework and tools that allow one to formally verify that a transactional memory implementation satisfies its specification. First we show how to specify transactional memory in terms of admissible interchanges of transaction operations, and give proof rules for showing that an implementation satisfies its specification. We illustrate how to verify correctness, first using a model checker for bounded instantiations, and subsequently by using a theorem prover, thus eliminating all bounds. We provide a mechanical proof of the soundness of the verification method, as well as mechanical proofs for several implementations from the literature, including one that supports non-transactional memory accesses.
Procedural programs with unbounded recursion present a challenge to symbolic model-checkers since they ostensibly require the checker to model an unbounded call stack. In the second part of this thesis we present a method for model-checking safety and liveness properties over procedural programs. Our method first augments a concrete procedural program with a well-founded ranking function, and then abstracts the augmented program by a finitary state abstraction. Using procedure summarization, the procedural abstract program is then reduced to a finite-state system, which is model checked for the property.
Title: Learning Long-Range Vision for an Offroad Robot
Candidate: Hadsell, Raia
Advisor(s): LeCun, Yann
Abstract:
Teaching a robot to perceive and navigate in an unstructured natural world is a difficult task. Without learning, navigation systems are short-range and extremely limited. With learning, the robot can be taught to classify terrain at longer distances, but these classifiers can be fragile as well, leading to extremely conservative planning. A robust, high-level learning-based perception system for a mobile robot needs to continually learn and adapt as it explores new environments. To do this, a strong feature representation is necessary that can encode meaningful, discriminative patterns as well as invariance to irrelevant transformations. A simple realtime classifier can then be trained on those features to predict the traversability of the current terrain.
One such method for learning a feature representation is discussed in detail in this work. Dimensionality reduction by learning an invariant mapping (DrLIM) is a weakly supervised method for learning a similarity measure over a domain. Given a set of training samples and their pairwise relationships, which can be arbitrarily defined, DrLIM can be used to learn a function that is invariant to complex transformations of the inputs such as shape distortion and rotation.
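The pairwise objective at the heart of DrLIM can be sketched as a contrastive loss: similar pairs are pulled together quadratically, while dissimilar pairs are pushed apart only while they sit inside a margin. This is a minimal sketch of the published loss on the embedding distance; the learned mapping itself is a more elaborate trained network in the thesis.

```python
def contrastive_loss(d, similar, margin=1.0):
    """DrLIM-style contrastive loss on the embedding distance
    d = ||G(x1) - G(x2)||. Similar pairs incur 0.5 * d^2; dissimilar
    pairs incur loss only inside the margin, so pairs that are already
    well separated contribute nothing."""
    if similar:
        return 0.5 * d * d
    return 0.5 * max(0.0, margin - d) ** 2
```

Note the invariance-inducing asymmetry: a dissimilar pair at distance 2.0 (beyond the unit margin) costs nothing, while the same distance for a similar pair would be heavily penalized.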
The main contribution of this work is a self-supervised learning process for long-range vision that is able to accurately classify complex terrain, permitting improved strategic planning. As a mobile robot moves through offroad environments, it learns traversability from a stereo obstacle detector. The learning architecture is composed of a static feature extractor, trained offline for a general yet discriminative feature representation, and an adaptive online classifier. This architecture reduces the effect of concept drift by allowing the online classifier to quickly adapt to very few training samples without overtraining. After experiments with several different learned feature extractors, we conclude that unsupervised or weakly supervised learning methods are necessary for training general feature representations for natural scenes.
The process was developed and tested on the LAGR mobile robot as part of a fully autonomous vision-based navigation system.
Title: Synthesizing Executable Programs from Requirements
Candidate: Plock, Cory
Advisor(s): Goldberg, Benjamin
Abstract:
Automatic generation of correct software from requirements has long been a ``holy grail'' for system and software development. According to this vision, instead of implementing a system and then working hard to apply testing and verification methods to prove system correctness, a system is rather built correctly by construction. This problem, referred to as synthesis, is undecidable in the general case. However, by restricting the domain to decidable subsets, it is possible to bring this vision one step closer to reality.
The focus of our study is reactive systems, or non-terminating programs that continuously receive input from an external environment and produce output responses. Reactive systems are often safety critical and include applications such as anti-lock braking systems, auto-pilots, and pacemakers. One of the challenges of reactive system design is ensuring that the software meets the requirements under the assumption of unpredictable environment input. The behavior of many of these systems can be expressed as regular languages over infinite strings, a domain in which synthesis has yielded successful results.
We present a method for synthesizing executable reactive systems from formal requirements. The object-oriented requirements language of Live Sequence Charts (LSCs) is considered. We begin by establishing a mapping between various subsets of the language and finite-state formal models. We also consider LSCs which can express time constraints over a dense-time domain. From one of these models, we show how to formulate a winning strategy that is guaranteed to satisfy the requirements, provided one exists. The strategy is realized in the form of a controller which guides the system in choosing only non-violating behaviors. We describe an implementation of this work as an extension of an existing tool called the Play-Engine.
Title: Theory and Algorithms for Modern Machine Learning Problems and an Analysis of Markets
Candidate: Rastogi, Ashish
Advisor(s): Cole, Richard; Mohri, Mehryar
Abstract:
The unprecedented growth of the Internet over the past decade and of data collection, more generally, has given rise to vast quantities of digital information, ranging from web documents and images, genomic databases to a vast array of business customer information. Consequently, it is of growing importance to develop tools and models that enable us to better understand this data and to design data-driven algorithms that leverage this information. This thesis provides several fundamental theoretical and algorithmic results for tackling such problems with applications to speech recognition, image processing, natural language processing, computational biology and web-based algorithms.
Probabilistic automata provide an efficient and compact way to model sequence-oriented data such as speech or web documents. Measuring the similarity of such automata provides a way of comparing the objects they model, and is an essential first step in organizing this type of data. We present algorithmic and hardness results for computing various discrepancies (or dissimilarities) between probabilistic automata, including the relative entropy and the Lp distance; we also give an efficient algorithm to determine if two probabilistic automata are equivalent. In addition, we study the complexity of computing the norms of probabilistic automata.
Organizing and querying large amounts of digitized data such as images and videos is a challenging task because little or no label information is available. This motivates transduction, a setting in which the learning algorithm can leverage unlabeled data during training to improve performance. We present novel error bounds for a family of transductive regression algorithms and validate their usefulness through experiments.
Widespread success of search engines and information retrieval systems has led to the large-scale collection of rating information which is being used to provide personalized rankings. We examine an alternate formulation of the ranking problem for search engines, motivated by the requirement that in addition to accurately predicting pairwise ordering, ranking systems must also preserve the magnitude of the preferences or the difference between ratings. We present algorithms with sound theoretical properties, and verify their efficacy through experiments.
Finally, price discovery in a market setting can be viewed as an (ongoing) learning problem. Specifically, the problem is to find and maintain a set of prices that balance supply and demand, a core topic in economics. This appears to involve complex implicit and possibly large-scale information transfers. We show that finding equilibrium prices, even approximately, in discrete markets is NP-hard, and complement the hardness result with a matching polynomial-time approximation algorithm. We also give a new way of measuring the quality of an approximation to equilibrium prices that is based on a natural aggregation of the dissatisfaction of individual market participants.
Title: Geometric Modeling with High Order Derivatives
Candidate: Tosun, Elif
Advisor(s): Zorin, Denis
Abstract:
Modeling of high-quality surfaces is the core of geometric modeling. Such models are used in many computer-aided design and computer graphics applications. Irregular behavior of higher-order differential parameters of the surface (e.g., curvature variation) may lead to aesthetic or physical imperfections. In this work, we consider approaches to constructing surfaces with a high degree of smoothness.
One direction is based on a manifold-based surface definition which ensures well-defined high-order derivatives that can be explicitly computed at any point. We extend a previously proposed manifold-based construction to surfaces with piecewise-smooth boundary and explore trade-offs in some elements of the construction. We show that growth of derivative magnitudes with order is a general property of constructions with locally supported basis functions, derive a lower bound for this derivative growth, and numerically study the flexibility of the resulting surfaces at arbitrary points.
An alternative to using high-order surfaces is to define approximations to high-order quantities directly on meshes, leaving the high-order surface implicit. These approximations do not necessarily converge point-wise, but can nevertheless be successfully used to solve surface optimization problems. Even though fourth-order problems are commonly solved to obtain high-quality surfaces, in many cases these formulations may lead to reflection-line and curvature discontinuities. We consider two approaches to further increasing control over surface properties.
The first approach is to consider data-dependent functionals leading to fourth-order problems but with explicit control over desired surface properties. Our fourth-order functionals are based on reflection line behavior. Reflection lines are commonly used for surface interrogation and high-quality reflection line patterns are well-correlated with high-quality surface appearance. We demonstrate how these can be discretized and optimized accurately and efficiently on general meshes.
A more direct approach is to consider a poly-harmonic function on a mesh, such as the fourth-order biharmonic or the sixth-order triharmonic. The biharmonic and the triharmonic equations can be thought of as a linearization of curvature and curvature variation Euler-Lagrange equations respectively. We present a novel discretization for both problems based on the mixed finite element framework and a regularization technique for solving the resulting, highly ill-conditioned systems of equations. We show that this method, compared to more ad-hoc discretizations, has higher degree of mesh independence and yields surfaces of better quality.
Title: Scaling Data Servers via Cooperative Caching
Candidate: Annapureddy, Siddhartha
Advisor(s): Mazieres, David
Abstract:
In this thesis, we present design techniques -- and systems that illustrate and validate these techniques -- for building data-intensive applications over the Internet. We enable the use of a traditional bandwidth-limited server in these applications. A large number of cooperating users contribute resources such as disk space and network bandwidth, and form the backbone of such applications. The applications we consider fall into one of two categories. The first type provides user-perceived utility in proportion to the data download rates of the participants; bulk data distribution systems are a typical example. The second type is usable only when the participants have data download rates above a certain threshold; video streaming is a prime example.
We built Shark, a distributed file system, to address the first type of applications. It is designed for large-scale, wide-area deployment, while also providing a drop-in replacement for local-area file systems. Shark introduces a novel locality-aware cooperative-caching mechanism, in which clients exploit each other's file caches to reduce load on an origin file server. Shark also enables sharing of data even when it originates from different servers. In addition, Shark clients are mutually distrustful in order to operate in the wide-area. Performance results show that Shark greatly reduces server load and reduces client-perceived latency for read-heavy workloads both in the wide and local areas.
We built RedCarpet, a near-Video-on-Demand (nVoD) system, to address the second type of applications. nVoD allows a user to watch a video starting at any point after waiting for a small setup time. RedCarpet uses a mesh-based peer-to-peer (P2P) system to provide the nVoD service. In this context, we study the problem of scheduling the dissemination of chunks that constitute a video. We show that providing nVoD is feasible with a combination of techniques that include network coding, avoiding resource starvation for different chunks, and overlay topology management algorithms. Our evaluation, using a simulator as well as a prototype, shows that systems that do not optimize in all these dimensions could deliver significantly worse nVoD performance.
Title: Shape Analysis by Abstraction, Augmentation, and Transformation
Candidate: Balaban, Ittai
Advisor(s): Pnueli, Amir; Zuck, Lenore
Abstract:
The goal of shape analysis is to analyze properties of programs that perform destructive updates of linked structures (heaps). This thesis presents an approach for shape analysis based on program augmentation (instrumentation), predicate abstraction, and model checking, that allows for verification of safety and liveness properties (which, for sequential programs, usually correspond to program invariance and termination).
One of the difficulties in abstracting heap-manipulating programs is devising a decision procedure for a sufficiently expressive logic of graph properties. Since graph reachability (expressible by transitive closure) is not a first order property, the challenge is in showing that a decision procedure exists for a rich enough subset of first order logic with transitive closure.
Predicate abstraction is in general too weak to verify liveness properties. Thus an additional issue dealt with is how to perform abstraction while retaining enough information. The method presented here is domain-neutral, and applies to concurrent programs as well as sequential ones.
Title: Democratizing Content Distribution
Candidate: Freedman, Michael
Advisor(s): Mazieres, David
Abstract:
In order to reach their large audiences, today's Internet publishers primarily use content distribution networks (CDNs) to deliver content. Yet the architectures of the prevalent commercial systems are tightly bound to centralized control, static deployments, and trusted infrastructure, inherently limiting their scope and scale to ensure cost recovery.
To move beyond such shortcomings, this thesis contributes a number of techniques that realize cooperative content distribution. By federating large numbers of unreliable or untrusted hosts, we can satisfy the demand for content by leveraging all available resources. We propose novel algorithms and architectures for three central mechanisms of CDNs: content discovery (where are nearby copies of the client's desired resource?), server selection (which node should a client use?), and secure content transmission (how should a client download content efficiently and securely from its multiple potential sources?).
These mechanisms have been implemented, deployed, and tested in production systems that have provided open content distribution services for more than three years. Every day, these systems answer tens of millions of client requests, serving terabytes of data to more than a million people.
This thesis presents five systems related to content distribution. First, Coral provides a distributed key-value index that enables content lookups to occur efficiently and returns references to nearby cached objects whenever possible, while still preventing any load imbalances from forming. Second, CoralCDN demonstrates how to construct a self-organizing CDN for web content out of unreliable nodes, providing robust behavior in the face of failures. Third, OASIS provides a general-purpose, flexible anycast infrastructure, with which clients can locate nearby or unloaded instances of participating distributed systems. Fourth, as a more clean-slate design that can leverage untrusted participants, Shark offers a distributed file system that supports secure block-based file discovery and distribution. Finally, our authentication code protocol enables the integrity verification of large files on-the-fly when using erasure codes for efficient data dissemination.
Taken together, this thesis provides a novel set of tools for building highly-scalable, efficient, and secure content distribution systems. By enabling the automated replication of data based on its popularity, we can make desired content available and accessible to everybody and, in effect, democratize content distribution.
Title: Joint Inference for Information Extraction and Translation
Candidate: Ji, Heng
Advisor(s): Grishman, Ralph
Abstract:
The traditional natural language processing pipeline incorporates multiple stages of linguistic analysis. Although errors are typically compounded through the pipeline, it is possible to reduce the errors in one stage by harnessing the results of the other stages.
This thesis presents a new framework based on component interactions to approach this goal. The new framework applies all stages in a suitable order, with each stage generating multiple hypotheses and propagating them through the whole pipeline. Feedback from subsequent stages is then used to enhance the target stage by re-ranking these hypotheses and producing the best analysis.
The effectiveness of this framework has been demonstrated by substantially improving the performance of Chinese and English entity extraction and Chinese-to-English entity translation. The inference knowledge includes mono-lingual interactions among information extraction stages such as name tagging, coreference resolution, relation extraction and event extraction, as well as cross-lingual interaction between information extraction and machine translation.
Such symbiosis of analysis components allows us to incorporate information from a much wider context, spanning the entire document and even going across documents, and to utilize deeper semantic analysis; it will therefore be essential for the creation of a high-performance NLP pipeline.
Title: Authentication Mechanisms for Open Distributed Systems
Candidate: Nicolosi, Antonio
Advisor(s): Mazieres, David; Shoup, Victor
Abstract:
While authentication within organizations is a well-understood problem, traditional solutions are often inadequate at the scale of the Internet, where the lack of a central authority, the open nature of the systems, and issues such as privacy and anonymity create new challenges. For example, users typically establish dozens of web accounts with independently administered services under a single password, which increases the likelihood of exposure of their credentials; users wish to receive email from anyone who is not a spammer, but the openness of the email infrastructure makes it hard to authenticate legitimate senders; users may have a rightful expectation of privacy when viewing widely-accessed protected resources such as premium website content, yet they are commonly required to present identifying login credentials, which permits tracking of their access patterns.
This dissertation describes enhanced authentication mechanisms to tackle the challenges of each of the above settings. Specifically, the dissertation develops: 1) a remote authentication architecture that lets users recover easily in case of password compromise; 2) a social network-based email system in which users can authenticate themselves as trusted senders without disclosing all their social contacts; and 3) a group access-control scheme where requests can be monitored while affording a degree of anonymity to the group member performing the request.
The proposed constructions combine system designs and novel cryptographic techniques to address their respective security and privacy requirements both effectively and efficiently.
Title: New Design Criteria for Hash Functions and Block Ciphers
Candidate: Puniya, Prashant
Advisor(s): Dodis, Yevgeniy
Abstract:
Cryptographic primitives, such as hash functions and block ciphers, are integral components in several practical cryptographic schemes. In order to prove security of these schemes, a variety of security assumptions are made on the underlying hash function or block cipher, such as collision-resistance, pseudorandomness etc. In fact, such assumptions are often made without much regard for the actual constructions of these primitives. In this thesis, we address this problem and suggest new, and possibly better, design criteria for hash functions and block ciphers.
We start by analyzing the design criteria underlying hash functions. The usual design principle here involves a two-step procedure: First, come up with a heuristically-designed and "hopefully strong" fixed-length input construction (i.e. the compression function), then use a standard domain extension technique, usually the cascade construction, to get a construction that works for variable-length inputs. We investigate this design principle from two perspectives:
We next move on to discuss the Feistel network, which is used in the design of several popular block ciphers such as DES, Triple-DES, Blowfish etc. Currently, the celebrated result of Luby-Rackoff (and further extensions) is regarded as the theoretical basis for using this construction in block cipher design, where it was shown that a four-round Feistel network is a (strong) pseudorandom permutation (PRP) if the round functions are independent pseudorandom functions (PRFs). We study the Feistel network from two different perspectives:
We give a positive answer to the first question and a partial positive answer to the second question. In the process, we undertake a combinatorial study of the Feistel network, that might be useful in other scenarios as well. We provide several practical applications of our results for the Feistel network.
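To make the construction studied above concrete, the following is a minimal sketch of a Feistel network in Python. The SHA-256-based round function is a hypothetical stand-in for the independent pseudorandom functions assumed by the Luby-Rackoff result; it illustrates the structure only and is not taken from the thesis.

```python
import hashlib

def prf(key, data):
    # Illustrative stand-in round function: a hash of key || data,
    # truncated to the half-block size (8 bytes here).
    return hashlib.sha256(key + data).digest()[:8]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def feistel_encrypt(round_keys, left, right):
    # One Feistel round maps (L, R) to (R, L xor F_k(R)).
    # Four rounds with independent PRFs yield a strong PRP (Luby-Rackoff).
    for k in round_keys:
        left, right = right, xor(left, prf(k, right))
    return left, right

def feistel_decrypt(round_keys, left, right):
    # Inversion runs the rounds backwards; note the round function
    # itself never needs to be invertible.
    for k in reversed(round_keys):
        left, right = xor(right, prf(k, left)), left
    return left, right
```

A quick round trip with four hypothetical round keys confirms that decryption inverts encryption exactly.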
Title: Being Lazy and Preemptive at Learning toward Information Extraction
Candidate: Shinyama, Yusuke
Advisor(s): Sekine, Satoshi
Abstract:
This thesis proposes a novel approach for exploring Information Extraction scenarios. Information Extraction, or IE, is a task aiming at finding events and relations in natural language texts that meet a user's demand. However, it is often difficult to formulate, or even define such events that satisfy both a user's need and technical feasibility. Furthermore, most existing IE systems need to be tuned for a new scenario with proper training data in advance. So a system designer usually needs to understand what a user wants to know in order to maximize the system performance, while the user has to understand how the system will perform in order to maximize his/her satisfaction.
In this thesis, we focus on maximizing the variety of scenarios that the system can handle instead of trying to improve the accuracy of a particular scenario. In traditional IE systems, a relation is defined a priori by a user and is identified by a set of patterns that are manually crafted or acquired in advance. We propose a technique called Unrestricted Relation Discovery, which defers determining what is a relation and what is not until the very end of the processing, so that a relation can be defined a posteriori. This laziness gives huge flexibility to the types of relations the system can handle. Furthermore, we use the notion of recurrent relations to measure how useful each relation is. This way, we can discover new IE scenarios without fully specifying definitions or patterns, which leads to Preemptive Information Extraction, where a system can provide a user with a portfolio of extractable relations and let the user choose among them.
We used one year of news articles obtained from the Web as a development set. We discovered dozens of scenarios that are similar to the existing scenarios tried by many IE systems, as well as new scenarios that are relatively novel. We evaluated the existing scenarios with the Automatic Content Extraction (ACE) event corpus and obtained reasonable performance. We believe this system will shed new light on IE research by providing a variety of experimental IE scenarios.
Title: Constituent Parsing by Classification
Candidate: Turian, Joseph
Advisor(s): Melamed, I. Dan
Abstract:
We present an approach to constituent parsing driven by classifiers induced to minimize a single regularized objective. It is the first discriminatively-trained constituent parser to surpass the Collins (2003) parser without using a generative model. Our primary contribution is simplifying the human effort required for feature engineering. Our model can incorporate arbitrary features of the input and parse state. Feature selection and feature construction occur automatically, as part of learning. We define a set of fine-grained atomic features, and let the learner induce informative compound features. Our learning approach includes several novel approximations and optimizations which improve the efficiency of discriminative training. We introduce greedy completion, a new agenda-driven search strategy designed to find low-cost solutions given a limit on search effort. The inference evaluation function was learned accurately enough to guide the deterministic parsers to the optimal parse reasonably quickly without pruning, and thus without search errors. Experiments demonstrate the flexibility of our approach, which has also been applied to machine translation (Wellington et al., AMTA 2006; Turian et al., NIPS 2007).
Title: Enhanced Security Models for Network Protocols
Candidate: Walfish, Shabsi
Advisor(s): Dodis, Yevgeniy
Abstract:
Modeling security for protocols running in the complex network environment of the Internet can be a daunting task. Ideally, a security model for the Internet should provide the following guarantee: a protocol that "securely" implements a particular task specification will retain all the same security properties as the specification itself, even when an arbitrary set of protocols runs concurrently on the same network. This guarantee must hold even when other protocols are maliciously designed to interact badly with the analyzed protocol, and even when the analyzed protocol is composed with other protocols. The popular Universal Composability (UC) security framework aims to provide this guarantee.
Unfortunately, such strong security guarantees come with a price: they are impossible to achieve without the use of some trusted setup. Typically, this trusted setup is global in nature, and takes the form of a Public Key Infrastructure (PKI) and/or a Common Reference String (CRS). However, the current approach to modeling security in the presence of such setups falls short of providing expected security guarantees. A quintessential example of this phenomenon is the deniability concern: there exist natural protocols that meet the strongest known security notions (including UC) while failing to provide the same deniability guarantees that their task specifications imply they should provide.
We introduce the Generalized Universal Composability (GUC) framework to extend the UC security notion and enable the re-establishment of its original intuitive security guarantees even for protocols that use global trusted setups. In particular, GUC enables us to guarantee that secure protocols will provide the same level of deniability as the task specification they implement. To demonstrate the usefulness of the GUC framework, we first apply it to the analysis and construction of deniable authentication protocols. Building upon such deniable authentication protocols, we then prove a general feasibility result showing how to construct protocols satisfying our security notion for a large class of two-party and multi-party tasks (assuming the availability of some reasonable trusted setup). Finally, we highlight the practical applicability of GUC by constructing efficient protocols that securely instantiate two common cryptographic tasks: commitments and zero-knowledge proofs.
Title: Tree-Structured Models of Multitext: Theory, Design and Experiments
Candidate: Wellington, Benjamin
Advisor(s): Melamed, I. Dan
Abstract:
Statistical machine translation (SMT) systems use empirical models to simulate the act of human translation between language pairs. This dissertation surveys the ability of currently popular syntax-aware SMT systems to model real-world multitext, and shows different types of linguistic phenomena occurring in natural language translation that these popular systems cannot capture. It then proposes a new grammar formalism, Generalized Multitext Grammar (GMTG), and a generalization of Chomsky Normal Form, that allows us to build an efficient SMT system using previously developed parsing techniques. The dissertation addresses many software engineering issues that arise when doing syntax-based SMT using large corpora and lays out an object-oriented design for a translation toolkit. Using the toolkit, we show that a tree-transduction based SMT system, which uses modern machine learning algorithms, outperforms a generative baseline.
Title: Formal Verification Using Static and Dynamic Analyses
Candidate: Zaks, Aleksandr
Advisor(s): Pnueli, Amir
Abstract:
One of the main challenges of formal verification is handling systems of realistic size, a difficulty that is especially exacerbated in the context of software verification. In this dissertation, we suggest two related approaches that, while both relying on formal-method techniques, can still be applied to larger practical systems. The scalability is achieved mainly by restricting the types of properties considered and the guarantees given.
Our first approach is a novel run-time monitoring framework. Unlike previous work on this topic, we expect the properties to be specified in the Property Specification Language (PSL). PSL is a newly adopted IEEE P1850 standard and is an extension of Linear Temporal Logic (LTL). The new features include regular expressions and finite-trace semantics, which make the new logic very attractive for run-time monitoring of both software and hardware designs. To support the new logic, we have extended the existing algorithm for LTL tester construction to cover the PSL-specific operators. Another novelty of our approach is the ability to use partial information about the program being monitored, whereas existing tools use only the observed trace and the property under consideration. This allows us to go beyond the focus of traditional run-time monitoring tools, error detection in the execution trace, toward the focus of static analysis, bug detection in programs.
In our second approach, we employ static analysis to compute SAT-based function summaries to detect invalid pointer accesses. To compute function summaries, we propose new techniques for improving precision and performance in order to reduce false error rates. In particular, we use BDDs to represent a symbolic simulation of functions, where BDDs allow an efficient representation of path-sensitive information and high-level simplification. In addition, we use a light-weight range analysis technique for determining lower and upper bounds of program variables, which can further offload work from the SAT solver. Note that while in our current implementation the analysis happens at compile time, we can also use the function summaries as a basis for run-time monitoring.
Title: Guaranteed Precision for Transcendental and Algebraic Computation Made Easy
Candidate: Du, Zilin
Advisor(s): Yap, Chee
Abstract:
Numerical non-robustness is a well-known phenomenon when implementing geometric algorithms. A general approach to achieve geometric robustness is Exact Geometric Computation (EGC). This dissertation explores the redesign and extension of Core Library, a C++ library which embraces the EGC approach. The contributions of this thesis are organized into three parts.
In the first part, we discuss the redesign of Core Library, especially the expression "Expr" and bigfloat "BigFloat" classes. Our new design emphasizes extensibility in a clean and modular way. The three facilities in "Expr" (filter, root bound, and bigfloat) are separated into independent modules. This allows new filters, root bounds, and bigfloat substitutes to be plugged in. The key approximate-evaluation and precision-propagation algorithms have been greatly improved. A new bigfloat system based on MPFR and interval arithmetic has been incorporated. Our benchmarks show that the redesigned Core Library typically achieves a 5-10 times speedup. We also provide tools to facilitate extensions of "Expr" with new types of nodes, especially transcendental nodes.
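The filter facility mentioned above can be illustrated by a small interval-arithmetic sketch (not the Core Library's actual code): evaluate the expression in floating point with outward rounding, and certify a sign only when the enclosing interval excludes zero; otherwise escalate to root bounds and higher precision.

```python
import math

def widen(lo, hi):
    # Outward rounding by one ulp keeps the enclosure sound
    # despite rounding in the underlying floating-point ops.
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

def iadd(a, b): return widen(a[0] + b[0], a[1] + b[1])
def isub(a, b): return widen(a[0] - b[1], a[1] - b[0])
def isqrt(a):   return widen(math.sqrt(a[0]), math.sqrt(a[1]))
def const(x):   return (x, x)

def certified_sign(iv):
    lo, hi = iv
    if lo > 0: return 1
    if hi < 0: return -1
    return None  # filter fails: fall back to root bounds / more precision

# sqrt(2) + sqrt(3) - sqrt(10) is certifiably negative ...
x = isub(iadd(isqrt(const(2.0)), isqrt(const(3.0))), isqrt(const(10.0)))
# ... but sqrt(2) + sqrt(3) - sqrt(5 + 2*sqrt(6)) is exactly zero,
# so no sound floating-point filter can decide its sign.
two_sqrt6 = iadd(isqrt(const(6.0)), isqrt(const(6.0)))
y = isub(iadd(isqrt(const(2.0)), isqrt(const(3.0))),
         isqrt(iadd(const(5.0), two_sqrt6)))
```

The second expression is a classic EGC example: because its exact value is zero, the sound enclosure always contains zero and the cheap filter must report "unknown".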
Although the Core Library was originally designed for algebraic applications, transcendental functions are needed in many applications. In the second part, we present a complete algorithm for absolute approximation of the general hypergeometric functions. Its complexity is also given. The extension of this algorithm to "blackbox numbers" is provided. A general hypergeometric function package based on our algorithm has been implemented and integrated into the Core Library following our new design.
Brent has shown that many elementary functions, such as $\exp, \log, \sin$, etc., can be efficiently computed using the Arithmetic-Geometric Mean (AGM) based algorithm. However, he only gave an asymptotic error analysis. The constants in the Big $O(\cdot)$ notation required for implementation are unknown. We provide a non-asymptotic error analysis of the AGM algorithm and the related algorithms for logarithm and exponential functions. These algorithms have been implemented and incorporated into the Core Library.
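For flavor, the AGM iteration and one Brent/Salamin-style identity can be sketched in double precision as follows. This is a simple illustration only; the thesis's contribution is precisely the rigorous, non-asymptotic error analysis that this sketch does not provide.

```python
import math

def agm(a, b):
    # Arithmetic-geometric mean: replace (a, b) by their arithmetic
    # and geometric means. Convergence is quadratic, so a few dozen
    # iterations are far more than enough at double precision.
    for _ in range(40):
        a, b = (a + b) / 2.0, math.sqrt(a * b)
    return a

def agm_log(x):
    # Brent/Salamin identity: ln(x) ~ pi / (2 * AGM(1, 4/x)),
    # with error roughly O(log(x) / x^2), so x should be large.
    return math.pi / (2.0 * agm(1.0, 4.0 / x))
```

For x = 10^10 the identity already agrees with the true logarithm to near machine precision, which hints at why AGM-based methods are attractive for high-precision elementary functions.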
Title: On Cryptographic Techniques for Digital Rights Management
Candidate: Fazio, Nelly
Advisor(s): Dodis, Yevgeniy
Abstract:
With more and more content being produced, distributed, and ultimately rendered and consumed in digital form, devising effective Content Protection mechanisms and building satisfactory Digital Rights Management (DRM) systems have become top priorities for the Publishing and Entertainment Industries.
To help tackle this challenge, several cryptographic primitives and constructions have been proposed, including mechanisms to securely distribute data over a unidirectional insecure channel (Broadcast Encryption), schemes in which leakage of cryptographic keys can be traced back to the leaker (Traitor Tracing), and techniques to combine revocation and tracing capabilities (Trace-and-Revoke schemes).
In this thesis, we present several original constructions of the above primitives, which improve upon existing DRM-enabling cryptographic primitives along the following two directions:
Our results along the first line of work include the following:
As for the second direction, our contribution can be divided as follows:
Overall, the cryptographic tools developed in this thesis provide more flexibility and more security than existing solutions, and thus offer a better match for the challenges of the DRM setting.
Title: Finding Your Match: Techniques for Improving Sequence Alignment in DNA and RNA
Candidate: Gill, Ofer Hirsch
Advisor(s): Mishra, Bud
Abstract:
In Bioinformatics, finding correlations between species allows us to better understand the important biological functions of those species and to trace their evolution. This thesis considers sequence alignment, a method for obtaining these correlations. We improve upon sequence alignment tools designed for DNA with Plains, an algorithm that uses piecewise-linear gap functions and parameter optimization to obtain correlations in remotely related species pairs, such as human and fugu, using reasonable amounts of time and memory on an ordinary computer. We then discuss Planar, which is similar to Plains, but is designed for aligning RNA and accounts for secondary structure. We also explore SEPA, a tool that uses p-value estimation based on exhaustive empirical data to better emphasize key results from an alignment with a measure of reliability. Using SEPA to measure the quality of an alignment, we proceed to compare Plains and Planar against similar alignment tools, emphasizing the interesting correlations caught in the process.
Title: DataSlicer: A Hosting Platform for Data-Centric Network Services
Candidate: He, Congchun
Advisor(s): Karamcheti, Vijay
Abstract:
As the Web evolves, the number of network services deployed on the Internet has been growing at a dramatic pace. Such services usually involve a massive volume of data stored in physical or virtual back-end databases, and access the data to dynamically generate responses for client requests. These characteristics restrict the use of traditional mechanisms for improving service performance and scalability: large data volumes prevent the replication of service data at the multiple sites required by content distribution schemes, while dynamic responses do not support the reuse required by web caching schemes.
However, many deployed data-centric network services share other properties that can help alleviate this situation: (1) service usage patterns exhibit locality of various forms, and (2) services are accessed using standard protocols and publicly known message structures. When properly exploited, these characteristics enable the design of alternative caching infrastructures, which leverage distributed network intermediaries to inspect traffic flowing between clients and services, infer locality information dynamically, and potentially improve service performance by taking actions such as partial service replication, request redirection, or admission control.
This dissertation investigates the nature of locality in service usage patterns for two well-known web services, and reports on the design, implementation, and evaluation of such a network intermediary architecture, named DataSlicer. DataSlicer incorporates four main techniques: (1) Service-neutral request inspection and locality detection on distributed network intermediaries; (2) Construction of oriented overlays for clustering client requests; (3) Integrated load-balancing and service replication mechanisms that improve service performance and scalability by either redistributing the underlying traffic in the network or creating partial service replicas on demand at appropriate network locations; and (4) Robustness mechanisms to maintain system stability in a wide-area network environment.
DataSlicer has been successfully deployed on the PlanetLab network. Extensive experiments using synthetic workloads show that our approach can: (1) create appropriate oriented overlays to cluster client requests according to multiple application metrics; (2) detect locality information across multiple dimensions and granularity levels; (3) leverage the detected locality information to perform appropriate load-balancing and service replication actions with minimal cost; and (4) ensure robust behavior in the face of dynamically changing network conditions.
Title: Multimarker Genetic Analysis Methods for High Throughput Array Data
Candidate: Ionita, Iuliana
Advisor(s): Mishra, Bud
Abstract:
In this thesis, we focus on multi-marker/-locus statistical methods for analyzing high-throughput array data used for the detection of genes implicated in complex disorders. There are two main parts: the first part concerns the localization of cancer genes from copy number variation data, with an application to lung cancer; the second part concerns the localization of disease genes using an affected-sib-pair design, with an application to inflammatory bowel disease. A third part addresses an important issue involved in the design of these disease-gene-detection studies. More details follow:
1. Detection of Oncogenes and Tumor Suppressor Genes using Multipoint Statistics from Copy Number Variation Data
ArrayCGH is a microarray-based comparative genomic hybridization technique that has been used to compare a tumor genome against a normal genome, thus providing rapid genomic assays of tumor genomes in terms of copy number variations of chromosomal segments that have been gained or lost. When properly interpreted, these assays are likely to shed important light on genes and mechanisms involved in the initiation and progression of cancer. Specifically, chromosomal segments amplified or deleted in a group of cancer patients point to locations of cancer genes. We describe a statistical method to estimate the location of such genes by analyzing segmental amplifications and deletions in the genomes from cancer patients and the spatial relation of these segments to any specific genomic interval. The algorithm assigns to a genomic segment a score that parsimoniously captures the underlying biology. It computes a p-value for every putative disease gene by using results from the theory of scan statistics. We have validated our method using simulated datasets, as well as a real dataset on lung cancer.
2. Multi-locus Linkage Analysis of Affected-Sib-Pairs
The affected-sib-pair (ASP) design is a simple and popular design in the linkage analysis of complex traits. The traditional ASP methods evaluate the linkage information at a locus by considering only the marginal linkage information present at that locus. However, complex traits are influenced by multiple genes that interact to increase the risk of disease. We describe a multi-locus linkage method that uses both the marginal information and information derived from the possible interactions among several disease loci, thereby increasing the significance of loci with modest marginal effects. Our method is based on a statistic that quantifies the linkage information contained in a set of markers. By a marker selection-reduction process, we screen a set of polymorphisms and select a few that seem linked to disease. We test our approach on simulated data and on genome-scan data for inflammatory bowel disease. We show that our method is expected to be more powerful than single-locus methods in detecting disease loci responsible for complex traits.
3. A Practical Haplotype Inference Algorithm
We consider the problem of efficient inference algorithms to determine the haplotypes and their distribution from a dataset of unrelated genotypes.
With the currently available catalogue of single-nucleotide polymorphisms (SNPs) and given their abundance throughout the genome (one in about $500$ bps) and low mutation rates, scientists hope to significantly improve their ability to discover genetic variants associated with a particular complex trait. We present a solution to a key intermediate step by devising a practical algorithm that has the ability to infer the haplotype variants for a particular individual from its own genotype SNP data in relation to population data. The algorithm we present is simple to describe and implement; it makes no assumption such as perfect phylogeny or the availability of parental genomes (as in trio-studies); it exploits locality in linkages and low diversity in haplotype blocks to achieve a linear time complexity in the number of markers; it combines many of the advantageous properties and concepts of other existing statistical algorithms for this problem; and finally, it outperforms competing algorithms in computational complexity and accuracy, as demonstrated by the studies performed on real data and synthetic data.
Title: Expressive Motion
Candidate: Lees, Alyssa
Advisor(s): Bregler, Christopher; Geiger, Davi
Abstract:
Since the advent of motion capture animation, attempts have been made to extract the seemingly nebulously defined attributes of 'content' and 'style' from the motion data. Enabling quick access to highly precise data, motion capture offers abundant benefits for animation. Yet manipulating the expressive attributes of the motion data in a comprehensive manner has proved elusive. This dissertation proposes practical solutions based on insights from the dance community and on attributes learned from the motion data itself. The culminating project is a system that learns the deformations of the human body and reapplies them in exaggerated form for enhanced expressivity.
The result is a three-pronged technique to enhance the expressive qualities of motion capture animation, developed alongside efficient and usable tools for animators. The key aspect is the creation of a deformable skeleton representation of the human body using a unique machine learning approach. The deformable skeleton is modeled by replicating the actual movements of the human spine. The second step relies on exploiting the subtle aspects of motion, such as hand movement, to create an emotional effect visually. Both of these approaches involve exaggerating the movements in the same vein as the traditional 2-D animation technique of 'squash and stretch'. Finally, a novel technique for the application of style to a baseline motion capture sequence is developed.
All of these approaches are rooted in machine learning techniques. Linear discriminant analysis was initially applied to a single phrase of motion demonstrating various style characteristics in Laban notation. A variety of methods, including nonlinear PCA and LLE, were used to learn the underlying manifold of spine movements. Nonlinear dynamic models were learned in an attempt to describe motion segments rather than single phrases. In addition, the dissertation focuses on the variety of obstacles in learning with motion data, including the correct parameterization of angles, applying statistical analysis to quaternions, and appropriate distance measures between postures.
Title: Building Trustworthy Storage Services out of Untrusted Infrastructure
Candidate: Li, Jinyuan
Advisor(s): Mazieres, David
Abstract:
As the Internet has become increasingly ubiquitous, it has seen tremendous growth in the popularity of online services. These services range from online CVS repositories like SourceForge and shopping sites to online financial and administrative systems. It is critical for these services to provide correct and reliable execution for clients. However, given their attractiveness as targets and their ubiquitous accessibility, online servers also have a significant chance of being compromised, leading to Byzantine failures.
Designing and implementing a service to run on a machine that may be compromised is not an easy task, since infrastructure under malicious control may behave arbitrarily. Even worse, as any monitoring facility may also be subverted at the same time, there is no easy way for system behavior to be audited, or for malicious attacks to be detected.
We propose a solution to this problem by reducing the trust needed on the server side in the first place. In other words, our system is designed specifically to run on untrusted hosts. In this thesis, we realize this principle through two different approaches. First, we design and implement a new network file system, SUNDR. In SUNDR, malicious servers cannot forge users' operations or tamper with their data without being detected. In the worst case, attackers can only conceal users' operations from each other. Still, SUNDR is able to detect this misbehavior whenever users communicate with each other directly.
The limitation of the approach above is that the system cannot guarantee ideal consistency in the presence of even a single failure. In the second approach, we use replicated state machines to tolerate some fraction of malicious server failures, which is termed Byzantine Fault Tolerance (BFT) in the literature. Classical BFT systems assume fewer than 1/3 of the replicas are malicious in order to provide ideal consistency. In this thesis, we push the boundary from 1/3 to 2/3. With fewer than 1/3 of replicas faulty, we provide the same guarantees as classical BFT systems. Additionally, we guarantee weaker consistency, instead of arbitrary behavior, when between 1/3 and 2/3 of replicas fail.
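The classical 1/3 bound mentioned above follows from a standard quorum-intersection argument, sketched here in a few lines of textbook BFT arithmetic (this illustrates the classical setting only, not the thesis's 2/3 construction).

```python
def bft_parameters(n):
    """Classical BFT replication: n = 3f + 1 replicas tolerate f faults."""
    f = (n - 1) // 3        # maximum tolerated Byzantine replicas
    quorum = 2 * f + 1      # replicas that must acknowledge each operation
    # Any two quorums overlap in at least 2*quorum - n = f + 1 replicas,
    # so even if f of the overlapping replicas are faulty, every pair of
    # quorums still shares at least one correct replica. This shared
    # correct replica is what carries state between operations.
    assert 2 * quorum - n >= f + 1
    return f, quorum
```

For example, n = 4 replicas tolerate one fault with quorums of three, and n = 7 tolerate two faults with quorums of five.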
Title: Measures for Robust Stability and Controllability
Candidate: Mengi, Emre
Advisor(s): Overton, Michael
Abstract:
A linear time-invariant dynamical system is robustly stable if the system, as well as all nearby systems in a neighborhood of interest, is stable. An important property of robustly stable systems is that they decay asymptotically without exhibiting significant transient behavior. The first part of this thesis focuses on measures revealing the degree of robust stability of a dynamical system. We put special emphasis on pseudospectral measures, those based on the eigenvalues of nearby matrices for a first-order system or matrix polynomials for a higher-order system. We present algorithms for the computation of pseudospectral measures for continuous and discrete systems with a quadratic rate of convergence, and analyze their accuracy in the presence of rounding errors. We also provide an efficient algorithm for the numerical radius of a matrix, the modulus of the outermost point in the field of values (the set of Rayleigh quotients) of the matrix. These algorithms are inspired by algorithms of Byers, Boyd-Balakrishnan and Burke-Lewis-Overton.
The second part is devoted to indicators of robust controllability. We call a system robustly controllable if it is controllable and remains controllable under perturbations of interest. We describe efficient methods for the computation of the distance to the closest uncontrollable system. Our first algorithm for the first-order distance to uncontrollability depends on a grid and is well-suited for low precision approximation. We then discuss algorithms for high precision approximation of the first-order distance to uncontrollability. These are based on the bisection method of Gu and the trisection variant of Burke-Lewis-Overton.
These algorithms require the extraction of the real eigenvalues of matrices of size $O(n^2)$, typically at a cost of $O(n^6)$, where $n$ is the dimension of the state space. We propose a new divide-and-conquer algorithm that reduces the cost to $O(n^4)$ on average in both theory and practice and $O(n^5)$ in the worst case. The new iterative approach to the extraction of real eigenvalues may also be useful in other contexts. For higher-order systems we derive a singular value characterization and exploit this characterization for the computation of the higher-order distance to uncontrollability to low precision. The algorithms in this thesis assume arbitrary complex perturbations are applicable to the input system and usually require the extraction of the imaginary eigenvalues of Hamiltonian matrices (or even matrix polynomials) or the unit eigenvalues of symplectic pencils (or palindromic matrix polynomials).
Title: Algorithmic Algebraic Model Checking: Hybrid Automata & Systems Biology
Candidate: Mysore, Venkatesh Pranesh
Advisor(s): Mishra, Bud
Abstract:
Systems Biology strives to hasten our understanding of the fundamental principles of life by adopting a systems-level approach for the analysis of cellular function and behavior. One popular framework for capturing the chemical kinetics of interacting biochemicals is Hybrid Automata. Our goal in this thesis is to aid Systems Biology research by improving the current understanding of hybrid automata, by developing techniques for symbolic rather than numerical analysis of the dynamics of biochemical networks modeled as hybrid automata, and by honing the theory to two classes of problems: kinetic mass action based simulation in genetic regulatory & signal transduction pathways, and pseudo-equilibrium simulation in metabolic networks.
We first provide new constructions that prove that the "open" Hierarchical Piecewise Constant Derivative (HPCD) subclass is closer to the decidability and undecidability frontiers than was previously understood. After concluding that the HPCD-like classes are unsuitable for modeling chemical reactions, our quest for semi-decidable subclasses leads us to define the "semi-algebraic" subclass. This is the most expressive hybrid automaton subclass amenable to rigorous symbolic temporal reasoning. We begin with the bounded reachability problem, and then show how the dense-time temporal logic Timed Computation Tree Logic (TCTL) can be model-checked by exploiting techniques from real algebraic geometry, primarily real quantifier elimination. We also prove the undecidability of reachability in the Blum-Shub-Smale Turing Machine formalism. We then develop efficient approximation strategies by extending bisimulation partitioning, rectangular grid-based approximation, polytopal approximation and time discretization. We then develop a uniform algebraic framework for modeling biochemical and metabolic networks, also extending flux balance analysis. We present some preliminary results using a prototypical tool Tolque. It is a symbolic algebraic dense time model-checker for semi-algebraic hybrid automata, which uses Qepcad for quantifier elimination.
The "Algorithmic Algebraic Model Checking" techniques developed in this thesis present a theoretically-grounded mathematically-sound platform for powerful symbolic temporal reasoning over biochemical networks and other semi-algebraic hybrid automata. It is our hope that by building upon this thesis, along with the development of computationally efficient parallelizable quantifier elimination algorithms and the integration of different computer algebra tools, scientific software systems will emerge that fundamentally transform the way biochemical networks (and other hybrid automata) are analyzed.
Title: Building an Automatic Phenotyping System of Developing Embryos
Candidate: Ning, Feng
Advisor(s): LeCun, Yann
Abstract:
This dissertation presents a learning-based system for the detection, identification, localization, and measurement of various sub-cellular structures in microscopic images of developing embryos. The system analyzes sequences of images obtained through DIC microscopy and detects cell nuclei, cytoplasm, and cell walls automatically. The system described in this dissertation is the key initial component of a fully automated phenotype analysis system.
Our study primarily concerns the early stages of development of C. elegans nematode embryos, from fertilization to the four-cell stage. The method proposed in this dissertation consists in learning the entire processing chain {\em from end to end}, from raw pixels to ultimate object categories.
The system contains three modules: (1) a convolutional network trained to classify each pixel into five categories: cell wall, cytoplasm, nuclear membrane, nucleus, outside medium; (2) an Energy-Based Model which cleans up the output of the convolutional network by learning local consistency constraints that must be satisfied by label images; (3) A set of elastic models of the embryo at various stages of development that are matched to the label images.
When observing normal (wild type) embryos it is possible to visualize important cellular functions such as nuclear movements and fusions, cytokinesis and the setting up of crucial cell-cell contacts. These events are highly reproducible from embryo to embryo. The events will deviate from normal behaviors when the function of a specific gene is perturbed, therefore allowing the detection of correlations between genes activities and specific early embryonic events. One important goal of the system is to automatically detect whether the development is normal (and therefore, not particularly interesting), or abnormal and worth investigating. Another important goal is to automatically extract quantitative measurements such as the migration speed of the nuclei and the precise time of cell divisions.
Title: A Polymorphic Type System and Compilation Scheme for Record Concatenation
Candidate: Osinski, Edward
Advisor(s): Goldberg, Benjamin
Abstract:
Records, which organize closely related groups of data so that the group can be treated as a unit while also providing access to the data within by name, are supported in almost all programming languages. However, in virtually all statically typed languages, the operations permitted on records are extremely limited. Providing greater flexibility in dealing with records, while simultaneously retaining the benefits of static type checking, is a desirable goal.
This problem has generated considerable interest, and a number of type systems dealing with records have appeared in the literature. In this work, we present the first polymorphic type system that is expressive enough to type a number of complex operations on records, including three forms of concatenation and natural join. In addition, the precise types of the records involved are inferred, to eliminate the burden of explicit type declarations. Another aspect of this problem is an efficient implementation of records and their associated operations. We also present a compilation method which accomplishes this goal.
Title: A Probabilistic Learning Approach to Attribute Value Inconsistency Resolution
Candidate: Pevzner, Ilya
Advisor(s): Goldberg, Arthur
Abstract:
Resolving inconsistencies in data is a problem of critical practical importance. Inconsistent data arises whenever an attribute takes on multiple, inconsistent, values. This may occur when a particular entity is stored multiple times in one database, or in multiple databases that are combined.
We investigate Attribute Value Inconsistency Resolution (AVIR), the problem of semi-automatically resolving data inconsistencies among multiple database records that describe the same person or thing.
Our survey of the area shows that existing solutions are either limited in scope or impose a significant burden on their users. Either they do not cover all types of inconsistencies and attributes, or they require users to write or choose attribute resolution functions for each potentially conflicting attribute.
Our ML-based approach applies to all types of inconsistencies and attributes, and automatically selects appropriate resolution functions based on the conflicting data. We have designed and developed a system that uses a set of binary features, which detect data properties and relationships, together with resolution functions, which merge data. Many such features and resolution functions have been written. The system uses supervised learning with maximum likelihood estimation to determine which function(s) to apply, based on which feature(s) fire.
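The selection step described above can be sketched as choosing, for each pattern of fired features, the resolution function most often correct in training data. This is a minimal hypothetical illustration, not the thesis's actual system; the feature and function names are invented:

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (fired_features, correct_resolution_function).
    Count which resolution function was correct for each feature pattern."""
    counts = defaultdict(Counter)
    for features, func in examples:
        counts[frozenset(features)][func] += 1
    return counts

def select_function(counts, fired):
    """Maximum-likelihood choice of resolution function given the exact
    pattern of fired features; None if the pattern was never seen."""
    pattern = counts.get(frozenset(fired))
    return pattern.most_common(1)[0][0] if pattern else None

# Toy training data with invented feature/function names.
examples = [({"same_prefix"}, "keep_longer"),
            ({"same_prefix"}, "keep_longer"),
            ({"numeric_diff"}, "keep_newer")]
model = train(examples)
```

A production system would smooth over unseen patterns rather than return None; this sketch only shows the counting-based MLE idea.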
We have validated our system by comparing its error rate, decision rate and decision accuracy on a test data set to baseline values determined by a clairvoyant application of a standard approach where each potentially conflicting attribute is resolved by the best resolution function for the attribute.
Title: Animating Autonomous Pedestrians
Candidate: Shao, Wei
Advisor(s): Terzopoulos, Demetri
Abstract:
This thesis addresses the difficult open problem in computer graphics of autonomous human modeling and animation, specifically of emulating the rich complexity of real pedestrians in urban environments.
We pursue an artificial life approach that integrates motor, perceptual, behavioral, and cognitive components within a model of pedestrians as highly capable individuals. Our comprehensive model features innovations in these components, as well as in their combination, yielding results of unprecedented fidelity and complexity for fully autonomous multi-human simulation in large urban environments. Our pedestrian model is entirely autonomous and requires no centralized, global control whatsoever.
To animate a variety of natural interactions between numerous pedestrians and their environment, we represent the environment using hierarchical data structures, which efficiently support the perceptual queries of the autonomous pedestrians that drive their behavioral responses and sustain their ability to plan their actions on local and global scales.
The animation system that we implement using the above models enables us to run long-term simulations of pedestrians in large urban environments without manual intervention. Real-time simulation can be achieved for well over a thousand autonomous pedestrians. With each pedestrian under his/her own autonomous control, the self-animated characters imbue the virtual world with liveliness, social (dis)order, and a realistically complex dynamic.
We demonstrate the automated animation of human activity in a virtual train station, and we employ our pedestrian simulator in the context of virtual archaeology for visualizing urban social life in reconstructed archaeological sites. Our pedestrian simulator is also serving as the basis of a testbed for designing and experimenting with visual sensor networks in the field of computer vision.
Title: Complexity Analysis of Algorithms in Algebraic Computation
Candidate: Sharma, Vikram
Advisor(s): Yap, Chee
Abstract:
Numerical computations with real algebraic numbers require algorithms for approximating and isolating real roots of polynomials. A classical choice for root approximation is Newton's method. For an analytic function on a Banach space, Smale introduced the concept of approximate zeros, i.e., points from which Newton's method for the function converges quadratically. To identify these approximate zeros he gave computationally verifiable convergence criteria called point estimates. However, in developing these results Smale assumed that Newton's method is computed exactly. For a system of $n$ homogeneous polynomials in $n+1$ variables, Malajovich developed point estimates for a different definition of approximate zero, assuming that all operations in Newton's method are computed with fixed precision. In the first half of this dissertation, we develop point estimates for these two different definitions of approximate zeros of an analytic function on a Banach space, but assume the strong bigfloat computational model of Brent, i.e., where all operations involve bigfloats with varying precision. In this model, we derive a uniform complexity bound for approximating a root of a zero-dimensional system of $n$ integer polynomials in $n$ variables. We also derive a non-asymptotic bound, in terms of the condition number of the system, on the precision required to implement the robust Newton method.
The second part of the dissertation analyses the worst-case complexity of two algorithms for isolating real roots of a square-free polynomial with real coefficients: the Descartes method and Akritas' continued fractions algorithm. The analysis of both algorithms is based upon amortization bounds such as the Davenport-Mahler bound. For the Descartes method, we give a unified framework that encompasses both the power basis and the Bernstein basis variant of the method; we derive an $O(n(L+\log n))$ bound on the size of the recursion tree obtained by applying the method to a square-free polynomial of degree $n$ with integer coefficients of bit-length $L$; the bound is tight for $L=\Omega(\log n)$. Based upon this result we readily obtain the best known bit-complexity bound of $\wt{O}(n^4L^2)$ for the Descartes method, where $\wt{O}$ means we ignore logarithmic factors. Similar worst-case bounds on the bit-complexity of Akritas' algorithm were not known in the literature. We provide the first such bound, $\wt{O}(n^{12}L^3)$, for a square-free integer polynomial of degree $n$ and coefficients of bit-length $L$.
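The Descartes method repeatedly applies Descartes' rule of signs to transformed polynomials; the core primitive is counting sign variations in a coefficient sequence. A minimal sketch of that primitive (the rule itself, not the thesis's recursion or its basis-specific machinery):

```python
def sign_variations(coeffs):
    """Number of sign changes in a coefficient sequence, ignoring zeros.
    Descartes' rule of signs: the count bounds the number of positive
    real roots and agrees with it modulo 2."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# x^2 - 3x + 2 = (x - 1)(x - 2): two positive roots, two variations.
# x^2 + 1: no positive real roots, zero variations.
```

The method's recursion tree, whose size the $O(n(L+\log n))$ bound controls, arises from bisecting intervals until this count is 0 or 1.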
Title: Pairwise Comparison between Genomic Sequences and Optical-Maps
Candidate: Sun, Bing
Advisor(s): Mishra, Bud
Abstract:
With the development and improvement of high throughput experimental technologies, massive amount of biological data including genomic sequences and optical-maps have been collected for various species. Comparative techniques play a central role in investigating the adaptive significance of organismal traits and revealing evolutionary relations among organisms by comparing these biological data. This dissertation presents two efficient comparative analysis tools used in comparative genomics and comparative optical-map study, respectively.
A complete genome sequence of an organism can be viewed as its ultimate genetic map, in the sense that the heritable information is encoded within the DNA and the order of nucleotides along chromosomes is known. Comparative genomics can be applied to find functional sites by comparing genetic maps. Comparing vertebrate genomes requires efficient cross-species sequence alignment programs. The first tool introduced in this thesis is COMBAT (Clean Ordered Mer-Based Alignment Tool), a new mer-based method which can search rapidly for highly similar translated genomic sequences using the stable-marriage algorithm (SM) as an alignment filter. In experiments COMBAT is applied to comparative analysis between yeast genomes, and between the human genome and the recently published bovine genome. The homologous blocks identified by COMBAT are comparable with the alignments produced by BLASTP and BLASTZ.
When genetic maps are not available, other genomic maps, including optical-maps, can be constructed. An optical map is an ordered enumeration of the restriction sites along with the estimated lengths of the restriction fragments between consecutive restriction sites. CAPO (Comparative Analysis and Phylogeny with Optical-Maps), introduced as a second technique in this thesis, is a tool for inferring phylogeny based on pairwise optical map comparison and bipartite graph matching. CAPO combines the stable matching algorithm with either the Unweighted Pair Group Method with Arithmetic Averaging (UPGMA) or the Neighbor-Joining (NJ) method for constructing phylogenetic trees. This new algorithm is capable of constructing phylogenetic trees in logarithmic steps and performs well in practice. Using optical maps constructed in silico and in vivo, our work shows that both UPGMA-flavored trees and the NJ-flavored trees produced by CAPO share substantial overlapping tree topology and are biologically meaningful.
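Both COMBAT and CAPO build on stable matching. As a reference point, the classic Gale-Shapley algorithm can be sketched as follows; this is the textbook formulation, not the thesis's filter or its bipartite-graph adaptation:

```python
def stable_match(left_prefs, right_prefs):
    """Gale-Shapley stable matching. left_prefs[i] and right_prefs[j] are
    preference lists, most preferred first. Returns a {left: right} map."""
    # Precompute each right participant's ranking of left participants.
    rank = {j: {i: r for r, i in enumerate(p)} for j, p in right_prefs.items()}
    match = {}                           # right -> current left partner
    free = list(left_prefs)              # left participants still unmatched
    next_prop = {i: 0 for i in left_prefs}
    while free:
        i = free.pop()
        j = left_prefs[i][next_prop[i]]  # i proposes to its next choice
        next_prop[i] += 1
        if j not in match:
            match[j] = i
        elif rank[j][i] < rank[j][match[j]]:
            free.append(match[j])        # j trades up; old partner is freed
            match[j] = i
        else:
            free.append(i)               # j rejects i
    return {i: j for j, i in match.items()}
```

In an alignment-filter setting, one side would be query mers and the other candidate reference positions, with preferences derived from similarity scores.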
Title: Exploiting Service Usage Information for Optimizing Server Resource Management
Candidate: Totok, Alexander
Advisor(s): Karamcheti, Vijay
Abstract:
It is difficult to provision and manage modern component-based Internet services so that they provide stable quality-of-service (QoS) guarantees to their clients, because: (1) component middleware are complex software systems that expose several independently tuned configurable application runtime policies and server resource management mechanisms; (2) session-oriented client behavior with complex data access patterns makes it hard to predict what impact tuning these policies and mechanisms has on application behavior; (3) component-based Internet services exhibit complex structural organization with requests of different types accessing different components and data sources, which could be distributed and/or replicated for failover, performance, or business purposes.
This dissertation attempts to alleviate this situation by targeting three interconnected goals: (1) providing improved QoS guarantees to the service clients, (2) optimizing server resource utilization, and (3) providing application developers with guidelines for natural application structuring, which enable efficient use of the proposed mechanisms for improving service performance. Specifically, we explore the thesis that exposing and using detailed information about how clients use component-based Internet services enables mechanisms that achieve the range of goals listed above. To validate this thesis we show its applicability to the following four problems: (1) maximizing reward brought by Internet services, (2) optimizing utilization of server resource pools, (3) providing session data integrity guarantees, and (4) enabling service distribution in wide-area environments.
The techniques that we propose for the identified problems are applicable at both the application structuring stage and the application operation stage, and range from automatic (i.e., performed by middleware in real time) to manual (i.e., involving the programmer or the service provider). These techniques take into account service usage information exposed at different levels, ranging from the high-level structure of user sessions to low-level information about data access patterns and resource utilization by requests of different types. To show the benefits of the proposed techniques, we implement various middleware mechanisms in the JBoss application server, which utilizes the J2EE component model, and comprehensively evaluate them on several publicly-available sample J2EE applications - Java Pet Store, RUBiS, and our own implementation of the TPC-W web transactional benchmark. Our experimental results show that the proposed techniques achieve optimal utilization of server resources and improve application performance by up to two times for centralized Internet services and by up to six times for distributed ones.
Title: Time Series Matching: A Multi-Filter Approach
Candidate: Wang, Zhihua
Advisor(s): Shasha, Dennis
Abstract:
Data arriving in time order (time series) arises in disciplines ranging from music to meteorology to finance to motion capture, to name a few. In many cases, a natural way to query the data is what we call time series matching - a user enters a time series by hand, keyboard or voice and the system finds "similar" time series.
Existing time series similarity measures, such as DTW (Dynamic Time Warping), can accommodate certain timing errors in the query and perform with high accuracy on small databases. However, they all have high computational complexity, and their accuracy drops dramatically as the data set grows. More importantly, there are types of errors that cannot be captured by any single similarity measure.
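The DTW measure mentioned above has the classic quadratic-cost dynamic-programming form sketched below; this is the textbook formulation with an absolute-difference local cost, shown to make the complexity concern concrete, not the system's implementation:

```python
def dtw(a, b):
    """Dynamic Time Warping distance between sequences a and b.
    O(len(a) * len(b)) time and space, absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]
```

The quadratic cost per comparison is exactly why a filter chain that prunes candidates before any DTW computation pays off at scale.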
Here we present a general time series matching framework. This framework can easily optimize, combine and test different features to execute a fast similarity search based on the application's requirements. We use a multi-filter chain and boosting algorithms to compose a ranking algorithm. Each filter is a classifier that removes bad candidates by comparing certain features of the time series data. Some filters use a boosting algorithm to combine a few different weak classifiers into a strong classifier. The final filter gives a ranked list of candidates in the reference data that match the query data.
The framework is applied to build query algorithms for a Query-by-Humming system. Experiments show that the algorithm has a more accurate similarity measure and its response time increases much slower than the pure DTW algorithm when the number of songs in the database increases from 60 to 1400.
Title: Incremental Web Search: Tracking Changes in the Web
Candidate: Wang, Ziyang
Advisor(s): Davis, Ernest
Abstract:
A large amount of new information is posted on the Web every day. Large-scale web search engines often update their index slowly and are unable to present such information in a timely manner. Here we present our solutions for finding new information on the web by tracking changes to web documents.
First, we present the algorithms and techniques useful for solving the following problems: detecting web pages that have changed, extracting changes from different versions of a web page, and evaluating the significance of web changes. We propose a two-level change detector: MetaDetector and ContentDetector. The combined detector successfully reduces network traffic by about 67%. Our algorithm for extracting web changes consists of three steps: document tree construction, document tree encoding and tree matching. It has linear time complexity and effectively extracts the changed content from different versions of a web page. In order to evaluate web changes, we propose a unified ranking framework combining three metrics: popularity ranking, content-based ranking and evolution ranking. Our methods can identify and deliver important new information in a timely manner.
Second, we present an application using the techniques and algorithms we developed, named "Web Daily News Assistant (WebDNA): finding what's new on Your Web". It is a search tool that helps community users search new information on their community web. Currently WebDNA is deployed on the New York University web site.
Third, we model the changes of web documents using survival analysis. Modeling web changes is useful for web crawler scheduling and web caching. Currently people model changes to web pages as a Poisson Process, and use a necessarily incomplete detection history to estimate the true frequencies of changes. However, other features that can be used to predict change frequency have not previously been studied. Our analysis shows that PageRank value is a good predictor. Statistically, the change frequency is a function proportional to $\exp[0.36\cdot (\ln(PageRank)+C)]$. We further study the problem of combining the predictor and change history into a unified framework. An improved estimator of change frequency is presented, which successfully reduces the error by 27.3% when the change history is short.
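The predictor above has a simple closed form. A hedged sketch, where the constant C is a placeholder (the abstract does not give its value) and the proportionality constant is dropped:

```python
import math

def predicted_change_frequency(pagerank, c=0.0):
    """Change-frequency predictor proportional to
    exp(0.36 * (ln(PageRank) + C)), per the relation quoted above.
    c defaults to 0 purely for illustration; it is not a thesis value."""
    return math.exp(0.36 * (math.log(pagerank) + c))
```

Note that this is equivalent to a power law, frequency proportional to PageRank^0.36: pages with higher PageRank are predicted to change more often, but sub-linearly.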
Title: Fast Algorithms for Burst Detection
Candidate: Zhang, Xin
Advisor(s): Shasha, Dennis
Abstract:
Events occur in every aspect of our lives.
An unexpectedly large number of events occurring within some measurement window (e.g. within some time duration or a spatial region) is called a {\em burst}, suggesting unusual behaviors or activities. Bursts come up in many natural and social processes. It is a challenging task to monitor the occurrence of bursts whose duration is unknown in a fast data stream environment.
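For a fixed, known window width, burst detection reduces to a sliding-window count; the sketch below shows that baseline primitive. It is a deliberate simplification: the thesis's contribution is precisely handling bursts of unknown duration over many window sizes efficiently.

```python
def find_bursts(timestamps, window, threshold):
    """Slide a window of fixed width over sorted event timestamps and
    report (start_index, count) whenever the count exceeds threshold."""
    bursts = []
    lo = 0
    for hi, t in enumerate(timestamps):
        # Drop events that fell out of the window ending at time t.
        while timestamps[lo] <= t - window:
            lo += 1
        count = hi - lo + 1
        if count > threshold:
            bursts.append((lo, count))
    return bursts
```

Running this naive scan once per candidate window size is what the shifted-aggregation structures discussed below are designed to avoid.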
This work describes efficient data structures and algorithms for high performance burst detection under different settings. Our view is that bursts, as an unusual phenomenon, constitute a useful preliminary primitive in a knowledge discovery hierarchy. Our intent is to build a high performance primitive detection algorithm to support high-level data mining tasks.
The work starts with an algorithmic framework including a family of data structures and a heuristic optimization algorithm to choose an efficient data structure given the inputs. The advantage of this framework is that it is adaptive to different inputs. Experiments on both synthetic and real-world data show that the new framework significantly outperforms existing techniques over a variety of inputs.
Furthermore, we present a greedy dynamic detection algorithm that handles changing data by evolving its structure to adapt to the incoming stream. In most cases it achieves better performance than a static algorithm on both synthetic and real data streams.
We have applied this framework to different real world applications in physics, stock trading and website traffic monitoring. All the case studies show our framework has superb performance.
We extend this framework to multi-dimensional data and use it in an epidemiology simulation to detect infectious disease outbreak and spread.
Title: High Performance Algorithms for Multiple Streaming Time Series
Candidate: Zhao, Xiaojian
Advisor(s): Shasha, Dennis
Abstract:
Data arriving in time order (a data stream) arises in fields ranging from physics to finance to medicine to music, to name a few. Often the data comes from sensors (in physics and medicine for example) whose data rates continue to improve dramatically as sensor technology improves. Furthermore, the number of sensors is increasing, so analyzing data between sensors becomes ever more critical in order to distill knowledge from the data. Fast response is desirable in many applications (e.g. to aim a telescope at an activity of interest or to perform a stock trade). In applications such as finance, recent information, e.g. correlation, is of far more interest than older information, so analysis over sliding windows is a desired operation.
These three factors -- huge data size, fast response, and windowed computation -- motivated this work. Our intent is to build a foundational library of primitives to perform online or near-online statistical analysis, e.g. windowed correlation, incremental matching pursuit, burst detection, on thousands or even millions of time series. Besides the algorithms, we also propose the concept of ``uncooperative'' time series, whose power spectra are spread over all frequencies without any regularity.
Previous work showed how to do windowed correlation with Fast Fourier Transforms and Wavelet Transforms, but such techniques don't work for uncooperative time series. This thesis will show how to use sketches (random projections) in a way that combines several simple techniques -- sketches, convolution, structured random vectors, grid structures, combinatorial design, and bootstrapping -- to achieve high performance, windowed correlation over a variety of data sets. Experiments confirm the asymptotic analysis.
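The core sketching idea can be illustrated in a few lines: project each window onto shared random directions, then estimate inner products (hence correlations, for normalized windows) from the short sketches. This uses plain Gaussian random vectors; the thesis's combination of structured random vectors, convolution, and grid structures is considerably more refined:

```python
import random

def sketch(x, k, seed=7):
    """Project x onto k random Gaussian directions. All series must be
    sketched with the same seed so they share the same directions."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) * xi for xi in x) for _ in range(k)]

def dot_estimate(sx, sy):
    """Unbiased estimate of the dot product of the original vectors
    from their k-dimensional sketches (error shrinks like 1/sqrt(k))."""
    return sum(a * b for a, b in zip(sx, sy)) / len(sx)
```

For windows normalized to zero mean and unit norm, the estimated dot product is an estimated correlation, so far-apart sketches certify low correlation without touching the full windows.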
To conduct matching pursuit (MP) over time series windows, an incremental scheme is designed to reduce the computational effort. Our empirical study demonstrates a substantial improvement in speed.
In previous work, Zhu and Shasha introduced an efficient algorithm to monitor bursts within windows of multiple sizes. We implemented it in a physical system by overcoming several practical challenges. Experimental results support the authors' linear running time analysis.
Title: Distribution of Route-Impacting Control Information in a Publish/Subscribe System with Delivery Guarantees
Candidate: Zhao, Yuanyuan
Advisor(s): Kedem, Zvi
Abstract:
Event-driven middleware is a popular infrastructure for building large-scale asynchronous distributed systems. Content-based publish/subscribe systems are a type of event-driven middleware that provides service flexibility and specification expressiveness, creating opportunities for improving reliability and efficiency of the system.
The use of route-impacting control information, such as subscription filters and access control rules, has the potential to enable efficient routing for applications that require selective and regional distribution of events. Such applications range from financial information systems to sensor networks to service-oriented architectures. However, it has been a great challenge to design correct and efficient protocols for distributing control information and exploiting it to achieve efficient and highly available message routing.
In this dissertation, we study the problem of distributing and utilizing route-impacting control information. We present an abstract model of content-based routing and reliable delivery in redundant broker networks. Based on this model, we design a generic algorithm that propagates control information, performs content-based routing, and delivers events reliably. The algorithm is efficient and light-weight in that it does not require heavy-weight consensus protocols between redundant brokers. We extend this generic algorithm to support consolidation and merging of control information. Existing protocols can be viewed as particular encodings and optimizations of the generic algorithm. We show an encoding using virtual time vectors that supports reliable delivery and deterministic dynamic access control in redundant broker networks. In our system, the semantics of reliable delivery is clearly defined even if subscription information and access control policy change dynamically. That is, one or more subscribers of the same principal will receive exactly the same sequence of messages (modulo subscription filter differences) regardless of where they are connected and the network latency and failure conditions in their parts of the network.
We have implemented these protocols in a fully-functioning content-based publish/subscribe system - Gryphon. We evaluate its efficiency, scalability and high availability.
Title: Translation Validation of Optimizing Compilers
Candidate: Fang, Yi
Advisor(s): Pnueli, Amir; Zuck, Lenore
Abstract:
There is a growing awareness, both in industry and academia, of the crucial role of formally verifying the translation from high-level source-code into low-level object code that is typically performed by an optimizing compiler. Formally verifying an optimizing compiler, as one would verify any other large program, is not feasible due to its size, ongoing evolution and modification, and possibly, proprietary considerations. Translation validation is a novel approach that offers an alternative to the verification of translators in general and compilers in particular: rather than verifying the compiler itself, one constructs a validation tool which, after every run of the compiler, formally confirms that the target code produced in the run is a correct translation of the source program. This thesis work takes an important step towards ensuring an extremely high level of confidence in compilers targeted at EPIC architectures.
In this thesis, we focus on the translation validation of structure preserving optimizations, i.e. transformations that do not modify programs' structure in a major way. This category of optimizations covers most of the global optimizations performed by compilers. This thesis has two main parts. One develops a proof rule that formally establishes the correctness of structure preserving transformations based on computational induction. The other part is the development of a tool that applies the proof rule to the automatic validation of global optimizations performed by Intel's ORC compiler for the IA-64 architecture. With minimal instrumentation from the compiler, the tool constructs ''verification conditions'' -- formal theorems that, if valid, establish the correctness of a translation. The verification conditions are then transferred to an automatic theorem prover that checks their validity. Together, these components offer a fully automatic method to formally establish the correctness of each translation.
Title: Translation Validation of Loop Optimizations
Candidate: Hu, Ying
Advisor(s): Goldberg, Benjamin; Barrett, Clark
Abstract:
Formal verification is important in designing reliable computer systems. For a critical software system, it is not enough to have a proof of correctness for the source code; there must also be an assurance that the compiler produces a correct translation of the source code into the target machine code. Verifying the correctness of modern optimizing compilers is a challenging task because of their size, their complexity, and their evolution over time.
In this thesis, we focus on the translation validation of loop optimizations. In order to validate the optimizations performed by the compiler, we try to prove the equivalence of the intermediate code before and after the optimizations. Prior work established a set of proof rules for building the equivalence relation between two programs. However, those rules fail to validate some legal loop optimizations. We propose new proof rules that consider the conditions of loops and the possible elimination of some loops, so that those cases can also be handled. We then design algorithms that apply these new proof rules in an automatic validation process.
Based on the above proof rules, we implement an automatic validation tool for loop optimizations which analyzes the loops, guesses what kinds of loop optimizations occur, proves the validity of a combination of loop optimizations, and synthesizes a series of intermediate codes. We integrate this new loop tool into our translation validation tool TVOC, so that TVOC handles not only optimizations which do not significantly change the structure of the code, but also loop optimizations which do change the structure greatly. With this new part, TVOC has succeeded in validating many examples with loop optimizations.
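The equivalence-checking task above can be sketched with a minimal example (illustrative only; the thesis proves equivalence symbolically via proof rules, not by running the code):

```python
# Toy check of loop-optimization equivalence: compare the observable
# effect of a loop before and after loop reversal, which is legal here
# because the iterations are independent.

def before(a):
    # original loop: scale each element in increasing index order
    b = list(a)
    for i in range(len(b)):
        b[i] = b[i] * 3
    return b

def after(a):
    # transformed loop: same body, reversed iteration order
    b = list(a)
    for i in range(len(b) - 1, -1, -1):
        b[i] = b[i] * 3
    return b

# A validator would establish this equivalence for all inputs; we sample.
for case in ([1, 2, 3], [], [-5, 0, 7, 9]):
    assert before(case) == after(case)
```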
Speculative optimizations are aggressive optimizations that are only correct under certain conditions that cannot be known at compile time. In this thesis, we present the theory and algorithms for validating speculative optimizations and generating the runtime tests they require. We also provide several examples and the results of the algorithms for speculative optimizations.
Title: Construction of Component-Based Applications by Planning
Candidate: Kichkaylo, Tatiana
Advisor(s): Karamcheti, Vijay; Davis, Ernest
Abstract:
Many modern wide-area distributed systems are component-based. This approach provides great flexibility in adapting applications to the changing state of the environment and user requirements, but increases the complexity of configuring the applications. Because of the scale and heterogeneity of modern wide-area environments, manual configuration is hard, inefficient, suboptimal, and error-prone, making automated application configuration desirable.
Constructing distributed applications requires choosing a set of components that will constitute the application instance and assigning network resources to component executions and data transfers. Stated this way, the application configuration problem (ACP) is similar to the planning (action selection) and scheduling (resource allocation) problems studied by the Artificial Intelligence (AI) community.
This thesis investigates the problem of solving the ACP using AI planning techniques. However, the ACP poses several challenges not usually encountered and addressed by the traditional AI solutions. The problem specification for the ACP can be much larger than the solution, with the relevant portions only identified during the search. Additionally, the interactions between planning operators are numeric rather than logical. Finally, it is desirable to be able to trade off quality of the solution versus search time.
We show that the ACP is undecidable in general. Therefore, instead of a single algorithm, we propose a set of techniques that can be used to compose an algorithm for a particular variety of the ACP that can exploit natural restrictions exhibited by that variety. These techniques address the challenges above by dynamically obtaining portions of the problem specification as necessary during the search, using envelope hierarchies based on numeric information for pruning and search guidance, and discretizing continuous variables to approximate numeric parameters without restricting the form of supported numeric functions.
We illustrate these techniques by describing their use in algorithms tailored for two specific varieties of the ACP --- snapshot configurations for dynamic component-based frameworks, and scheduling of grid workflows with replica selection and explicit resource reservations. Experimental evaluation of the performance of these two algorithms shows that the techniques successfully achieve their goals, with acceptable run-time overhead.
Title: Extensible MultiModal Environment Toolkit (EMMET): A Toolkit for Prototyping and Remotely Testing Speech and Gesture Based Multimodal Interfaces
Candidate: Robbins, Christopher A.
Advisor(s): Perlin, Ken
Abstract:
Ongoing improvements to the performance and accessibility of less conventional input modalities such as speech and gesture recognition now provide new dimensions for interface designers to explore. Yet there is a scarcity of commercial applications which utilize these modalities either independently or multimodally. This scarcity partially results from a lack of development tools and design guidelines to facilitate the use of speech and gesture.
An integral aspect of the user interface design process is the ability to easily evaluate various design solutions through an iterative process of prototyping and testing. Through this process guidelines emerge that aid in the design of future interfaces. Today there is no shortage of tools supporting the development of conventional interfaces. However there do not exist resources allowing interface designers to easily prototype and quickly test, via remote distribution, interface designs utilizing speech and gesture.
The thesis work for this dissertation explores the development of an Extensible MultiModal Environment Toolkit (EMMET) for prototyping and remotely testing speech and gesture based multimodal interfaces to three-dimensional environments. The overarching goals for this toolkit are to allow its users to: explore speech and gesture based interface design without requiring an understanding of the details involved in the low-level implementation of speech or gesture recognition, quickly distribute their multimodal interface prototypes via the Web, and receive multimodal usage statistics collected remotely after each use of their application.
EMMET ultimately contributes to the field of multimodal user interface design by providing existing user interface developers with an environment in which speech and gesture recognition have been seamlessly integrated into their palette of user input options. Such seamless integration increases the use of speech and gesture modalities within applications by removing actual or perceived deterrents to using these modalities rather than conventional ones. EMMET additionally strives to improve the quality of speech and gesture based interfaces by supporting the prototype-and-test development cycle through its Web distribution and usage statistics collection capabilities. These capabilities also allow developers to derive new design guidelines specific to the use of speech and gesture.
Title: Pattern Discovery for Hypotheses Generation in Biology
Candidate: Tsirigos, Aristotelis
Advisor(s): Shasha, Dennis
Abstract:
In recent years, the increase in the amounts of available genomic as well as gene expression data has provided researchers with the necessary information to train and test various models of gene origin, evolution, function and regulation. In this thesis, we present novel solutions to key problems in computational biology that deal with nucleotide sequences (horizontal gene transfer detection), amino-acid sequences (protein sub-cellular localization prediction), and gene expression data (transcription factor–binding site pair discovery). Different pattern discovery techniques are utilized, such as maximal sequence motif discovery and maximal itemset discovery, and combined with support vector machines in order to achieve significant improvements against previously proposed methods.
Title: Automatic Verification of Parameterized Systems
Candidate: Xu, Jiazhao
Advisor(s): Pnueli, Amir
Abstract:
Verification plays an indispensable role in designing reliable computer hardware and software systems. With the fast growth in design complexity and the quick turnaround in design time, formal verification has become an increasingly important technology for establishing correctness as well as for finding difficult bugs. Since there is no ``silver bullet'' to solve all verification problems, a spectrum of powerful techniques in formal verification has been developed to tackle different verification problems and complexity issues. Depending on the nature of the problem, whose most salient components are the system implementation and the property specification, a proper methodology or a combination of different techniques is applied to solve the problem.
In this thesis, we focus on the research and development of formal methods to uniformly verify parameterized systems. A parameterized system is a class of systems obtained by instantiating the system parameters. Parameterized verification seeks a single correctness proof of a property for the entire class. Although the general parameterized verification problem is undecidable [AK86], it is possible to solve special classes by applying a repertoire of techniques and heuristics. Many methods in parameterized verification require a great deal of human interaction. This makes the application of these methods to real-world problems infeasible. Thus, the main focus of this research is to develop techniques that can be automated to deliver proofs of safety and liveness properties.
Our research combines various formal techniques such as deductive methods, abstraction and model checking. One main result in this thesis is an automatic deductive method for parameterized verification. We apply small model properties of Bounded Data Systems (a special type of parameterized system) to help prove deductive inference rules for the safety properties of BDS systems. Another methodology we developed enables us to prove liveness properties of parameterized systems via an automatic abstraction method called counter abstraction. There are several useful by-products from our research: a set of heuristics is established for the automatic generation of program invariants, which can benefit deductive verification in general, and we also propose methodologies for the automatic abstraction of fairness conditions that are crucial for proving liveness properties.
Title: Mobility, Route Caching, and TCP Performance in Mobile Ad Hoc Networks
Candidate: Yu, Xin
Advisor(s): Johnson, David B.
Abstract:
In a mobile ad hoc network, mobile nodes communicate with each other through wireless links. Mobility causes frequent topology changes. This thesis addresses the fundamental challenges mobility presents to on-demand routing protocols and to TCP.
On-demand routing protocols use route caches to make routing decisions. Due to mobility, cached routes easily become stale. To address the cache staleness issue, prior work used adaptive timeout mechanisms. However, heuristics cannot accurately estimate timeouts because topology changes are unpredictable. I propose to proactively disseminate the broken link information to the nodes that have cached the link. I define a new cache structure called a cache table to maintain the information necessary for cache updates, and design a distributed cache update algorithm. This algorithm is the first to proactively update route caches in an adaptive manner. Simulation results show that proactive cache updating is more efficient than adaptive timeout mechanisms. I conclude that proactive cache updating is key to the adaptation of on-demand routing protocols to mobility.
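The cache-table idea can be sketched as follows (a minimal, hypothetical illustration; the actual distributed algorithm and its message exchanges are more involved):

```python
# Sketch of a cache table: each node records which cached routes use
# which links, so a broken-link notification invalidates exactly the
# affected routes instead of relying on timeouts.

class CacheTable:
    def __init__(self):
        self.routes = {}      # destination -> cached route (node list)
        self.link_index = {}  # link (u, v) -> destinations using it

    def add_route(self, dest, route):
        self.routes[dest] = route
        for u, v in zip(route, route[1:]):
            self.link_index.setdefault((u, v), set()).add(dest)

    def link_broken(self, u, v):
        """Proactively remove every cached route that uses link (u, v).

        A full implementation would also purge the removed routes'
        other entries from link_index; omitted here for brevity.
        """
        for dest in self.link_index.pop((u, v), set()):
            self.routes.pop(dest, None)

table = CacheTable()
table.add_route('D', ['A', 'B', 'C', 'D'])
table.add_route('E', ['A', 'X', 'E'])
table.link_broken('B', 'C')   # only the route through link B-C is stale
assert 'D' not in table.routes and 'E' in table.routes
```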
TCP does not perform well in mobile ad hoc networks. Prior work provided link failure feedback to TCP so that it can avoid invoking congestion control mechanisms for packet losses caused by route failures. Simulation results show that my cache update algorithm significantly improves TCP throughput since it reduces the effect of mobility on TCP. TCP still suffers from frequent data and ACK losses. I propose to make routing protocols aware of lost TCP packets and help reduce TCP timeouts. I design two mechanisms that exploit cross-layer information awareness: early packet loss notification (EPLN) and best-effort ACK delivery (BEAD). EPLN notifies TCP senders about lost data. BEAD retransmits ACKs at intermediate nodes or at TCP receivers. Simulation results show that the two mechanisms significantly improve TCP throughput. I conclude that cross-layer information awareness is key to making TCP efficient in the presence of mobility.
I also study the impact of route caching strategies on the scalability of on-demand routing protocols with mobility. I show that making route caches adapt quickly and efficiently to topology changes is key to the scalability of on-demand routing protocols with mobility.
Title: Information Extraction from Multiple Syntactic Sources
Candidate: Zhao, Shubin
Advisor(s): Grishman, Ralph
Abstract:
Information Extraction is the automatic extraction of facts from text, which includes detection of named entities, entity relations and events. Conventional approaches to Information Extraction try to find syntactic patterns based on deep processing of text, such as partial or full parsing. The problem these solutions have to face is that as deeper analysis is used, the accuracy of the result decreases, and one cannot recover from the induced errors. On the other hand, lower-level processing is more accurate and it can also provide useful information. However, within the framework of conventional approaches, this kind of information cannot be efficiently incorporated.
This thesis describes a novel supervised approach based on kernel methods to address these issues. In this approach customized kernels are used to match syntactic structures produced from different preprocessing phases. Using closure properties of kernels, we combine individual kernels into composite kernels that integrate and extend all the information. The composite kernels can be used with various classifiers, such as Nearest Neighbor or Support Vector Machines (SVM). The main classifier we propose to use is SVM due to its ability to generalize in high-dimensional feature spaces. We will show that each level of syntactic information can contribute to IE tasks, and low-level information can help to recover from errors in deep processing.
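The kernel-combination idea can be illustrated with a toy sketch (the kernels below are hypothetical stand-ins, not the thesis's dependency-tree kernels): a non-negative weighted sum of valid kernels is itself a valid kernel, which is what lets evidence from different processing levels be merged.

```python
# Minimal sketch of composite kernels over tokenized sentences.

def ngram_kernel(s, t, n=2):
    # toy "surface" kernel: number of shared word n-grams
    grams = lambda x: {tuple(x[i:i + n]) for i in range(len(x) - n + 1)}
    return len(grams(s) & grams(t))

def bow_kernel(s, t):
    # toy stand-in for a deeper-analysis kernel: shared vocabulary size
    return len(set(s) & set(t))

def composite_kernel(s, t, alpha=0.5):
    # weighted sum with non-negative weights is still a kernel
    return alpha * ngram_kernel(s, t) + (1 - alpha) * bow_kernel(s, t)

a = "the ceo resigned from the board".split()
b = "the ceo joined the board".split()
assert composite_kernel(a, a) >= composite_kernel(a, b)
```

The composite value could then be plugged into any kernel classifier, such as an SVM, via its precomputed-kernel interface.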
The new approach has demonstrated state-of-the-art performance on two benchmark tasks. The first task is detecting slot fillers for management succession events (MUC-6). For this task two types of kernels were designed, a surface kernel based on word n-grams and a kernel built on sentence dependency trees; the second task is the ACE RDR evaluation, which is to recognize relations between entities in text from newswire and broadcast news transcript. For this task, five kernels were built to represent information from sentence tokenization, syntactic parsing and dependency parsing. Experimental results for the two tasks will be shown and discussed.
Title: Partitionable Services Framework: Seamless Access to Distributed Applications
Candidate: Ivan, Anca
Advisor(s): Karamcheti, Vijay
Abstract:
A key problem in contemporary distributed systems is how to satisfy user quality of service (QoS) requirements for distributed applications deployed in heterogeneous, dynamically changing environments spanning multiple administrative domains.
An attractive solution is to create an infrastructure which satisfies user QoS requirements by automatically and transparently adapting distributed applications to any environment changes with minimum user input. However, successful use of this approach requires overcoming three challenges: (1) Capturing the application behavior and its relationship with the environment as a set of compact local specifications, using both general, quantitative (e.g., CPU usage) and qualitative (e.g., security) properties. Such information should be sufficient to reason about the global behavior of the application deployment. (2) Finding the ``best'' application deployment that satisfies both application and user requirements, and the various domain policies. The search algorithm should be complete, efficient, scalable with regard to application and network sizes, and guarantee optimality (e.g., resources consumed by applications). (3) Ensuring that the found deployments are practical and efficient, i.e., that the efficiency of automatic deployments is comparable with the efficiency of hand-tuned solutions.
This dissertation describes three techniques that address these challenges in the context of component-based applications. The modularity and reusability of the latter enable automatic deployments while supporting reasoning about the global connectivity based on the local information exposed by each component. The first technique extends the basic component-based application model with information about conditions and effects of component deployments and linkages, together with interactions between components and the network. The second technique uses AI planning to build an efficient and scalable algorithm which exploits the expressivity of the application model to find an application deployment that satisfies user QoS and application requirements. The last technique ensures that application deployments are both practical and efficient, by leveraging language and run-time system support to automatically customize components, as appropriate for the desired security and data consistency guarantees. These techniques are implemented as integral parts of the Partitionable Services Framework (PSF), a Java-based framework which flexibly assembles component-based applications to suit the properties of their environment. PSF facilitates on-demand, transparent migration and replication of application components at locations closer to clients, while retaining the illusion of a monolithic application.
The benefits of PSF are evaluated by deploying representative component-based applications in an environment simulating fast and secure domains connected by slow and insecure links. Analysis of the programming and the deployment processes shows that: (1) the code modifications required by PSF are minimal, (2) PSF appropriately adapts the deployments based on the network state and user QoS requirements, (3) the run-time deployment overheads incurred by PSF are negligible compared to the application lifetime, and (4) the efficiency of PSF-deployed applications matches that of hand-crafted solutions.
Title: VALIS: A Multi-language System for Rapid Prototyping in Computational Biology
Candidate: Paxia, Salvatore
Advisor(s): Mishra, Bud
Abstract:
Bioinformatics is a challenging area for computer science, since the underlying computational formalisms span database systems, numerical methods, geometric modeling and visualization, imaging and image analysis, combinatorial algorithms, data analysis and mining, statistical approaches, and reasoning under uncertainty.
This thesis describes the Valis environment for rapid application prototyping in bioinformatics. The core components of the Valis system are the underlying database structure and the algorithmic development platform.
This thesis presents a novel set of data structures that has marked advantages when dealing with unstructured and unbounded data that are common in scientific fields and bioinformatics.
Bioinformatics problems rarely have a one-language, one-platform solution. The Valis environment allows seamless integration between scripts written in different programming languages and includes tools to rapidly prototype graphical user interfaces.
To date, the speed of computation of most whole-genome analysis tools has stood in the way of developing fast interactive programs that may be used as exploratory tools. This thesis presents the basic algorithms and widgets that permit rapid prototyping of whole-genome-scale, real-time applications within Valis.
Title: Thick Surfaces: Interactive Modeling of Topologically Complex Geometric Details
Candidate: Peng, Jianbo
Advisor(s): Zorin, Denis
Abstract:
Many objects in computer graphics applications are represented by surfaces. This representation works very well for objects of simple topology, but can become prohibitively expensive for objects with complex small-scale geometric details.
Volumetric textures aligned with a surface can be used to add topologically complex geometric details to an object, while retaining an underlying simple surface structure. The simple surface structure provides great controllability on the overall shape of the model, and volumetric textures handle geometric details and topological changes efficiently.
Adding a volumetric texture to a surface requires more than a conventional two-dimensional parameterization: a part of the space surrounding the surface has to be parameterized. Another problem with using volumetric textures to add geometric detail is the difficulty of rendering implicitly represented surfaces, especially when they are changed interactively.
We introduce thick surfaces to represent objects with topologically complex geometric details. A thick surface consists of three components. First, a base mesh of simple structure is used to approximate the overall shape of the object. Second, a layer of space along the base mesh is parameterized. We define the layer of space as a shell, which covers the geometric details of the object. Third, volumetric textures of geometric details are mapped into the shell. The object is represented as the implicit surface encoded by the volumetric textures. Places without volumetric textures are filled with patches of the base mesh.
We present algorithms for constructing a shell around a surface and for rendering a volumetric-textured surface. A mipmap technique for volumetric textures is explored as well. The gradient field of a generalized distance function is used to construct a non-self-intersecting shell, which has other properties desirable for volumetric texture mapping. The rendering algorithm is designed and implemented on NVIDIA GeForceFX video chips. Finally we demonstrate a number of interactive operations that these algorithms enable.
Title: TM-LPSAT: Encoding Temporal Metric Planning in Continuous Time
Candidate: Shin, Ji-Ae
Advisor(s): Davis, Ernest
Abstract:
In any domain with change, the dimension of time is inherently involved. Whether the domain should be modeled in discrete time or continuous time depends on aspects of the domain to be modeled. Many complex real-world domains involve continuous time, resources, metric quantities and concurrent actions. Planning in such domains must necessarily go beyond simple discrete models of time and change.
In this thesis, we show how the SAT-based planning framework can be extended to generate plans of concurrent asynchronous actions that may depend on, or cause, piecewise-linear change in metric quantities over continuous time.
In the SAT-based planning framework, a planning problem is formulated as a satisfiability problem of a set of propositional constraints (axioms) such that any model of the axioms corresponds to a valid plan. There are two parameters to a SAT-based planning system: an encoding scheme for representing plans of bounded length and a propositional SAT solver to search for a model. The LPSAT architecture is composed of a SAT solver integrated with a linear arithmetic constraint solver in order to deal with metric aspects of domains.
We present encoding schemes for temporal models of continuous time defined in PDDL+: (i) durative actions with discrete and/or continuous changes; (ii) a real-time temporal model with exogenous events and autonomous processes capturing continuous changes. The encoding represents, in a CNF formula over arithmetic constraints and propositional fluents, time-stamped parallel plans possibly with concurrent continuous and/or discrete changes. In addition, we present encoding schemes for multi-capacity resources, partitioned interval resources, and metric quantities which are represented as intervals. An interval type can be used as a parameter to an action as well as a fluent type.
Based on the LPSAT engine, the TM-LPSAT temporal metric planner has been implemented: given a PDDL+ representation of a planning problem, the compiler of TM-LPSAT translates it into a CNF formula, which is fed into the LPSAT engine to find a solution corresponding to a plan for the planning problem. We have also experimented with our temporal metric encodings on another decision procedure, MathSAT, which handles propositional combinations of linear constraints and Boolean variables. The results show that in terms of search time the SAT-based approach to temporal metric planning can be comparable to other planning approaches, and that there is plenty of room to push the limits of the SAT-based approach further.
Title: Unsupervised Discovery of Extraction Patterns for Information Extraction
Candidate: Sudo, Kiyoshi
Advisor(s): Grishman, Ralph; Sekine, Satoshi
Abstract:
The task of Information Extraction (IE) is to find specific types of information in natural language text. In particular, *event extraction* identifies instances of a particular type of event or fact (a particular "scenario"), including the entities involved, and fills a database which has been pre-defined for the scenario. As the number of documents available on-line has multiplied, event extraction has grown in importance for various applications, including tracking terrorist activities from newswire sources and building a database of job postings from the Web, to name a few.
Linguistic contexts, such as predicate-argument relationships, have been widely used as *extraction patterns* to identify the items to be extracted from the text. The cost of creating extraction patterns for each scenario has been a bottleneck limiting the portability of information extraction systems to different scenarios, although there has been some research on semi-supervised pattern discovery procedures to reduce this cost. The challenge is to develop a fully automatic method for identifying extraction patterns for a scenario specified by the user.
This dissertation presents a novel approach for the unsupervised discovery of extraction patterns for event extraction from raw text. First, we present a framework that allows the user to have a self-customizing information extraction system for his/her query: the Query-Driven Information Extraction (QDIE) framework. The input to the QDIE framework is the user's query: either a set of keywords or a narrative description of the event extraction task.
Second, we assess the improvement in extraction pattern models. By considering the shortcomings of the prior work based on predicate-argument models and their extensions, we propose a novel extraction pattern model that is based on arbitrary subtrees of dependency trees.
Third, we address the issue of portability across languages. As a case study of the QDIE framework, we implemented a pre-CODIE system, a Cross-Lingual On-Demand Information Extraction system requiring minimal human intervention, which incorporates the QDIE framework as a component for pattern discovery. In addition, we assess the role of machine translation in cross-lingual information extraction by comparing translation-based implementations.
Title: An Efficient and High-Order Accurate Boundary Integral Solver for the Stokes Equations in Three Dimensional Complex Geometries
Candidate: Ying, Lexing
Advisor(s): Zorin, Denis
Abstract:
This dissertation presents an efficient and high-order boundary integral solver for the Stokes equations in complex 3D geometries. The targeted applications of this solver are the flow problems in domains involving moving boundaries. In such problems, traditional finite element methods involving 3D unstructured mesh generation experience difficulties. Our solver uses the indirect boundary integral formulation and discretizes the equation using the Nyström method.
Although our solver is designed for the Stokes equations, we show that it can be generalized to other constant coefficient elliptic partial differential equations (PDEs) with non-oscillatory kernels.
First, we present a new geometric representation of the domain boundary. This scheme takes quadrilateral control meshes with arbitrary geometry and topology as input, and produces smooth surfaces approximating the control meshes. Our surfaces are parameterized over several overlapping charts through explicit nonsingular C^∞ parameterizations, depend linearly on the control points, have fixed-size local support for basis functions, and have good visual quality.
Second, we describe a kernel independent fast multipole method (FMM) and its parallel implementation. The main feature of our algorithm is that it is based only on kernel evaluation and does not require the multipole expansions of the underlying kernel. We have tested our method on kernels from a wide range of elliptic PDEs. Our numerical results indicate that our method is efficient and accurate. Other advantages include the simplicity of the implementation and its immediate extension to other elliptic PDE kernels. We also present an MPI based parallel implementation which scales well up to thousands of processors.
Third, we present a framework to evaluate the singular integrals in our solver. A singular integral is decomposed into a smooth far field part and a local part that contains the singularity. The smooth part of the integral is integrated using the trapezoidal rule over overlapping charts, and the singular part is integrated in polar coordinates, which removes or decreases the order of singularity. We also describe a novel algorithm to integrate the nearly singular integrals coming from the evaluation at points close to the boundary.
Title: High Performance Data Mining in Time Series: Techniques and Case Studies
Candidate: Zhu, Yunyue
Advisor(s): Shasha, Dennis
Abstract:
Note: A significantly improved and expanded description of this material is available in the book High Performance Discovery in Time Series, Springer Verlag, 2004, ISBN 0-387-00857-8.
As extremely large time series data sets grow more prevalent in a wide variety of settings, we face the significant challenge of developing efficient analysis methods. This dissertation addresses this challenge by designing fast, scalable algorithms for the analysis of time series.
The first part of this dissertation describes the framework for high performance time series data mining based on important primitives. Data reduction transforms, such as the Discrete Fourier Transform, the Discrete Wavelet Transform, Singular Value Decomposition and Random Projection, can reduce the size of the data without substantial loss of information, and therefore provide a synopsis of the data. Indexing methods organize data so that the time series data can be retrieved efficiently. Transformations on time series, such as shifting, scaling, time shifting, time scaling and dynamic time warping, facilitate the discovery of flexible patterns from time series.
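The data-reduction primitive can be sketched concretely (a minimal illustration, assuming the orthonormal DFT convention): keeping only the first few Fourier coefficients shrinks each series while, by Parseval's theorem, the distance between synopses never exceeds the true distance, so no matches are falsely dismissed.

```python
# Sketch of DFT-based data reduction with the lower-bounding property.
import cmath
import math

def dft(x):
    # orthonormal discrete Fourier transform (1/sqrt(n) scaling)
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * j * k / n)
                for j in range(n)) / math.sqrt(n) for k in range(n)]

def dist(u, v):
    # Euclidean distance, valid for real or complex sequences
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0]
y = [2.0, 2.0, 3.0, 4.0, 4.0, 2.0, 1.0, 1.0]

k = 3  # synopsis: keep only the first k Fourier coefficients
reduced = dist(dft(x)[:k], dft(y)[:k])
true = dist(x, y)
assert reduced <= true + 1e-9   # lower bound: no false dismissals
```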
The second part of this dissertation integrates the above primitives into useful applications ranging from music to physics to finance to medicine.
StatStream
StatStream is a system based on fast algorithms for finding the most highly correlated pairs of time series from among thousands of time series streams and doing so in a moving window fashion. It can be used to find correlations in time series in finance and in scientific applications.
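The streaming idea behind such a system can be sketched as follows (an illustrative fragment, not the StatStream implementation, which additionally uses DFT synopses and grid structures to compare thousands of streams): maintaining running sums lets the correlation over a moving window be updated in constant time per new data point.

```python
# Sketch of O(1)-per-update Pearson correlation over a moving window.
import math
from collections import deque

class MovingCorrelation:
    def __init__(self, window):
        self.w = window
        self.xs, self.ys = deque(), deque()
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.xs.append(x); self.ys.append(y)
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y
        if len(self.xs) > self.w:            # slide: retire oldest pair
            ox, oy = self.xs.popleft(), self.ys.popleft()
            self.sx -= ox; self.sy -= oy
            self.sxx -= ox * ox; self.syy -= oy * oy; self.sxy -= ox * oy

    def corr(self):
        n = len(self.xs)
        cov = self.sxy - self.sx * self.sy / n
        vx = self.sxx - self.sx ** 2 / n
        vy = self.syy - self.sy ** 2 / n
        return cov / math.sqrt(vx * vy)

mc = MovingCorrelation(window=4)
for x in [1, 2, 3, 4, 5, 6]:
    mc.update(x, 2 * x + 1)       # perfectly correlated streams
assert abs(mc.corr() - 1.0) < 1e-9
```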
HumFinder
Most people hum rather poorly. Nevertheless, listeners usually have some idea of what is being hummed. The goal of the query-by-humming program, HumFinder, is to make a computer do what a person can do. Using pitch translation, time dilation, and dynamic time warping, one can match an inaccurate hum to a melody remarkably accurately.
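Dynamic time warping, one of the matching tools named above, can be sketched as a small dynamic program. This is the generic textbook formulation, not HumFinder's code:

```python
def dtw(a, b):
    """Dynamic time warping distance between two pitch sequences: the cheapest
    monotone alignment cost, allowing one sequence to stretch relative to the
    other (how a slow, wobbly hum can still match a melody)."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # a stretches
                                 d[i][j - 1],      # b stretches
                                 d[i - 1][j - 1])  # step together
    return d[n][m]
```

Note that a hum holding one note twice as long as the melody does still aligns at zero cost, which is exactly the time-dilation tolerance the abstract describes.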
OmniBurst
Burst detection is the activity of finding abnormal aggregates in data streams. Our software, OmniBurst, can detect bursts of varying durations. Our example applications are monitoring gamma rays and stock market price volatility. The software makes use of a shifted wavelet structure to create a linear time filter that can guarantee that no bursts will be missed at the same time that it guarantees (under a reasonable statistical model) that the filter eliminates nearly all false positives.
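For illustration, a naive burst detector over several window sizes can be written with prefix sums; OmniBurst's contribution is a shifted-wavelet filter that prunes almost all of these windows while still guaranteeing no burst is missed. A hedged sketch of the naive baseline, with per-size thresholds:

```python
def find_bursts(stream, window_sizes, thresholds):
    """Flag every window whose aggregate (sum) exceeds the threshold for its
    size, reporting (start, window_size, total) triples."""
    prefix = [0]
    for v in stream:
        prefix.append(prefix[-1] + v)  # prefix sums make window sums O(1)
    bursts = []
    for w, thr in zip(window_sizes, thresholds):
        for start in range(len(stream) - w + 1):
            total = prefix[start + w] - prefix[start]
            if total > thr:
                bursts.append((start, w, total))
    return bursts
```

The naive loop touches every (start, size) pair; the wavelet structure instead checks a logarithmic number of precomputed aggregates per region and descends only where a burst might hide.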
Title: Comparing and Improving Centralized and Distributed Techniques for Coordinating Massively Parallel Shared-Memory Systems
Candidate: Freudenthal, Eric
Advisor(s): Gottlieb, Allan
Abstract:
Two complementary approaches have been proposed to achieve high performance inter-process coordination on highly parallel shared-memory systems. Gottlieb et al. introduced the technique of combining concurrent memory references, thereby reducing hot spot contention and enabling the bottleneck-free execution of algorithms referencing a small number of shared variables. Mellor-Crummey and Scott introduced an alternative distributed local-spin technique that minimizes hot spot contention by not polling hot spot variables and by exploiting the availability of processor-local shared memory. My principal contributions are a comparison of these two approaches, and significant improvements to the former.
The NYU Ultra3 prototype is the only system built that implements memory reference combining. My research utilizes micro-benchmark simulation studies of massively parallel Ultra3 systems executing coordination algorithms. This investigation detects problems in the Ultra3 design that result in higher-than-expected memory latency for reference patterns typical of busy-wait polling. This causes centralized coordination algorithms to perform poorly. Several architectural enhancements are described that significantly reduce the latency of these access patterns, thereby improving the performance of the centralized algorithms.
I investigated existing centralized algorithms for readers-writers and barrier coordination, all of which require fetch-and-add, and discovered variants that require fewer memory accesses (and hence have shorter latency). In addition, my evaluation includes novel algorithms that require only a restricted form of fetch-and-add.
Coordination latency of these algorithms executed on the enhanced combining architecture is compared to the latency of the distributed local-spin alternatives. These comparisons indicate that the distributed local-spin dissemination barrier, which generates no hot spot traffic, has latency slightly inferior to the best centralized algorithms investigated. However, for the less structured readers-writers problem, the centralized algorithms significantly outperform the distributed local-spin algorithm.
Title: Infrastructure Support for Accessing Network Services in Dynamic Network Environments
Candidate: Fu, Xiaodong
Advisor(s): Karamcheti, Vijay
Abstract:
Despite increases in network bandwidth, accessing network services across a wide area network still remains a challenging task. The difficulty mainly comes from the heterogeneous and constantly changing network environment, which usually causes undesirable user experience for network-oblivious applications.
A promising approach to address this is to provide network awareness in communication paths. While several such path-based infrastructures have been proposed, the network awareness provided by them is rather limited. Many challenging problems remain, in particular: (1) how to automatically create effective network paths whose performance is optimized for encountered network conditions, (2) how to dynamically reconfigure such paths when network conditions change, and (3) how to manage and distribute network resources among different paths and between different network regions. Furthermore, there is poor understanding of the benefits of using the path-based approach over other alternatives.
This dissertation describes solutions for these problems, built into a programmable network infrastructure called Composable Adaptive Network Services (CANS). The CANS infrastructure provides applications with network-aware communication paths that are automatically created and dynamically modified. CANS highlights four key mechanisms: (1) a high-level integrated type-based specification of components and network resources; (2) automatic path creation strategies; (3) system support for low-overhead path reconfiguration; and (4) distributed strategies for managing and allocating network resources.
We evaluate these mechanisms using experiments with typical applications running in the CANS infrastructure, and extensive simulation on a large scale network topology to compare with other alternatives. Experimental results validate the effectiveness of our approach, verifying that (1) the path-based approach provides the best and the most robust performance under a wide range of network configurations as compared to end-point or proxy-based alternatives; (2) automatic generation of network-aware paths is feasible and provides considerable performance advantages, requiring only minimal input from applications; (3) path reconfiguration strategies ensure continuous adaptation and provide desirable adaptation behaviors by using automatically generated paths; (4) both run-time overhead and reconfiguration time of CANS paths are negligible for most applications; (5) the resource management and allocation strategies allow shared resource pools to be set up effectively in the network and resources to be shared among paths.
Title: Enriched Content: Concept, Architecture, Implementation, and Applications
Candidate: Chang, Hung-Hsien
Advisor(s): Perlin, Ken
Abstract:
Since the debut of the World Wide Web, Web users have been facing the following problems:
[Extended Semantics]
When we read or study a digital document that we wish to explore further, typically, we interrupt our work to start a search. It costs time.
[Reverse Hyperlink]
When we visit a web page, we might be curious about what other hyperlinks point to the visited page. These links would most likely be of related interest. Can we get ``real time'' information about what other pages are pointing to this page?
[Version Control]
Many of us have been frustrated and even annoyed when the hyperlink that we follow gives us a ``404 not found'' or the retrieved webpage content is entirely different from the one we have bookmarked. Could we also have access to the past versions even if the hyperlink has been removed or the content has been changed?
[Composition Assistant]
Writing is not an easy task. We labor to structure a body of text, sort out ideas, find materials, and digest information. We would like an automated service that can associate the content we have produced with other contexts (on the Web) and bring these web contexts to us for reference.
In this thesis, we provide a unified framework and architecture, named enriched content, to resolve the above problems. We apply the architecture and show how the enriched content can be used in each application. We demonstrate that this method can be a new way of writing add-on functions for various document applications without having to write an individual plug-in for each application or re-write each application. We also briefly discuss possible future development.
Title: A framework for optimistic program optimization
Candidate: Pechtchanski, Igor
Advisor(s): Goldberg, Benjamin
Abstract:
The problem of program optimization is a non-trivial one. Compilers do a fair job, but cannot always deliver the best performance. The expressibility of general-purpose languages is limited: programmers cannot, for example, describe expected run-time behavior, so some programs are more amenable to optimization than others, depending on what the compiler expects to see. We present a generic framework that addresses this problem in two ways: through verifiable source annotations that guide compiler analyses, and through optimistic use of assumptions and analysis results for the subset of the program seen so far. Two novel applications are presented, one for each of the above approaches: a dynamic optimistic interprocedural type analysis algorithm, and a mechanism for specifying immutability assertions. Both applications result in measurable speedups, demonstrating the feasibility of each approach.
Title: Secure and Robust Censorship-Resistant Publishing Systems
Candidate: Waldman, Marc
Advisor(s): Mazieres, David
Abstract:
In many cases, censoring documents on the Internet is a fairly simple task. Almost any published document can be traced back to a specific host, and from there to an individual responsible for the material. Someone wishing to censor this material can use the courts, threats, or other means of intimidation to compel the relevant parties to delete the material or remove the host from the network. Even if these methods prove unsuccessful, various denial of service attacks can be launched against a host to make the document difficult or impossible to retrieve. Unless a host's operator has a strong interest in preserving a particular document, removing it is often the easiest course of action.
A censorship-resistant publishing system allows an individual to publish a document in such a way that it is difficult, if not impossible, for an adversary to completely remove, or convincingly alter, a published document. One useful technique for ensuring document availability is to replicate the document widely on servers located throughout the world. However, replication alone does not block censorship. Replicas need to be protected from accidental or malicious corruption. In addition, a censorship-resistant publishing system needs to address a number of other important issues, including protecting the publisher's identity while simultaneously preventing storage flooding attacks by anonymous users.
This dissertation presents the design and implementation of two very different censorship-resistant publishing systems. The first system, Publius, is a web based system that allows an individual to publish, update, delete and retrieve documents in a secure manner. Publius's main contributions include an automatic tamper checking mechanism, a method for updating or deleting anonymously published content and methods for publishing anonymously hyperlinked content. The second system, Tangler, is a peer-to-peer based system whose contributions include a unique publication mechanism and a dynamic self-policing network. The benefits of this new publication mechanism include the automatic replication of previously published content and an incentive to audit the reliability with which servers store content published by other people. In part through these incentives, the self-policing network identifies and ejects servers that exhibit faulty behavior.
Title: A Qualitative Profile-based Approach to Edge Detection
Candidate: Yen, Ting-jen
Advisor(s): Yap, Chee
Abstract:
Edge detection is a fundamental problem of computer vision and has been widely investigated. We propose a new framework for edge detection based on edge profiles.
Our model, based on one-dimensional qualitative edge profile fitting and edge consistency, will produce one continuous edge from an initial seed point. A "profile" is defined as a finite cross-section of a two-dimensional image along a line segment. "Edge consistency" means that all the profiles on the same edge should be consistent.
Appropriate evaluation functions are needed for different types of edge profiles, such as step edges, ramp edges, etc. An evaluation function must meet the requirement that it produces local minima at the positions where edges of a given type occur in the profile. Instead of subjective thresholding, image noise is measured statistically and used as a systematic way of filtering false edges. We describe our method as "qualitative edge profile fitting" because it is not based on arbitrary thresholding. Once an edge point is localized, it can be extended into an edge by matching compatible profiles. Two profiles are considered compatible as long as their average difference is within the noise measurement. Another feature of our approach is its subpixel accuracy. The utilization of profiles and noise-induced threshold selection makes tasks such as joining broken edges more objective.
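The idea of an evaluation function whose minimum localizes an edge can be illustrated with the simplest case: fitting a step model (constant left mean, constant right mean) at every split point of a 1-D profile. This toy stand-in omits the qualitative fitting and noise-based thresholds described above:

```python
def step_edge_position(profile):
    """Return the split point of a 1-D profile minimizing the residual of a
    two-piece constant (step) fit, plus that residual. The minimizing split
    plays the role of an edge location."""
    best_pos, best_err = None, float("inf")
    for p in range(1, len(profile)):
        left, right = profile[:p], profile[p:]
        ml = sum(left) / len(left)    # best constant fit on the left
        mr = sum(right) / len(right)  # best constant fit on the right
        err = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if err < best_err:
            best_pos, best_err = p, err
    return best_pos, best_err
```

A clean step yields zero residual at the true edge, and moderate noise moves the residual but not the minimizing position.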
We develop the necessary algorithms and implement them. Different evaluation functions are constructed for different edge models and experimented on different one-dimensional profiles. The edge detector, using these evaluation functions, is then examined using different images and under different noise conditions.
Title: Expert-Driven Validation of Set-Based Data Mining Results
Candidate: Adomavicius, Gediminas
Advisor(s): Tuzhilin, Alexander; Davis, Ernest
Abstract:
This dissertation addresses the problem of dealing with large numbers of set-based patterns, such as association rules and itemsets, discovered by data mining algorithms. Since many discovered patterns may be spurious, irrelevant, or trivial, one of the main problems is how to validate them, e.g., how to separate the ``good'' rules from the ``bad.'' Many researchers have advocated the explicit involvement of a human expert in the validation process. However, scalability becomes an issue when large numbers of patterns are discovered, since the expert cannot perform the validation on a pattern-by-pattern basis in a reasonable period of time. To address this problem, this dissertation describes a new expert-driven approach to set-based pattern validation.
The proposed validation approach is based on validation sequences, i.e., we rely on the expert's ability to iteratively apply various validation operators that can validate multiple patterns at a time, thus making the expert-based validation feasible. We identified the class of scalable set predicates called cardinality predicates and demonstrated how these predicates can be effectively used in the validation process, i.e., as a basis for validation operators. We examined various properties of cardinality predicates, including their expressiveness. We also have developed and implemented the set validation language (SVL) that can be used for manual specification of cardinality predicates by a domain expert. In addition, we have proposed and developed a scalable algorithm for set and rule grouping that can be used to generate cardinality predicates automatically.
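As an illustration of how a single cardinality predicate can validate many patterns at once, here is one hypothetical instance: accept exactly the itemsets whose overlap with a reference attribute set has size within given bounds. The attribute names are invented for the example:

```python
def cardinality_predicate(attrs, lo, hi):
    """Build a validation operator accepting itemsets whose intersection with
    the reference attribute set has size in [lo, hi]."""
    attrs = set(attrs)
    return lambda itemset: lo <= len(attrs & set(itemset)) <= hi

# One operator application validates a whole batch of discovered itemsets.
accept = cardinality_predicate({"age", "income", "zip"}, 2, 3)
itemsets = [{"age", "income"}, {"color"}, {"age", "zip", "sku"}, {"income"}]
validated = [s for s in itemsets if accept(s)]
```

Because the predicate is evaluated per pattern in constant time (for fixed reference sets), the expert validates thousands of patterns with one operator instead of inspecting them one by one.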
The dissertation also explores various theoretical properties of sequences of validation operators and facilitates a better understanding of the validation process. We have also addressed the problem of finding optimal validation sequences and have shown that certain formulations of this problem are NP-complete. In addition, we provided some heuristics for addressing this problem.
Finally, we have tested our rule validation approach on several real-life applications, including personalization and bioinformatics applications.
Title: Responsive Thinwire Visualization of Large Geographic Datasets
Candidate: Been, Kenneth
Advisor(s): Yap, Chee
Abstract:
This thesis describes a web-based, responsive, zooming and panning visualization system for a full-featured geographic description of the United States. Current web-based map servers provide, from a visualization standpoint, little more than one static image per page, with hyperlinks for navigation; continuous zooming and panning requires locally stored data. Our primary contribution is a multi-threaded, scalable and responsive client-server architecture that responds to user requests as naturally and quickly as possible, regardless of network bandwidth or reliability. This architecture can be generalized for use in other applications, including non-geographic ones. To this we add a scalable and flexible user interface for navigation of multi-scale geographic data, with intuitive zooming and panning, pop-up feature labels, and a user controlled tree-hierarchy of windows. We build software tools and algorithms for translating the U.S. Census Bureau's TIGER data into a format designed for speedy database retrieval and network delivery, and for generalizing the data into multiple levels of detail. Because of anomalies in the TIGER data, this processing requires some human intervention.
Title: Representing and Modifying Complex Surfaces
Candidate: Biermann, Henning
Advisor(s): Zorin, Denis
Abstract:
The increasing demand for highly detailed geometric models poses new and important problems in computer graphics and geometric modeling. Applications for complex models range from geometric design and scientific simulations to feature movies and video games.
We focus on the fundamental problem of creating and manipulating complex surface models. We address the problem by designing an efficient and general surface representation, and develop algorithms for efficient modification of surfaces represented in this form. Our surface representation extends existing subdivision-based representations with explicit representation of sharp features and boundaries, which is crucial in many computer-aided design applications.
We consider two types of surface modifications: boolean operations on solids bounded by surfaces, and surface pasting. Our technique rapidly and robustly computes an approximate result rather than aiming for the precise solution. At the same time, our approach allows one to trade speed for accuracy and, in most cases, to compute the result with any desired accuracy. The second type of editing operation we consider addresses the problem of transferring geometric features between different objects. Our technique makes it easy to combine geometric data from various sources (e.g. 3D scanning, CAGD models) into a single model.
Title: On computing the Pareto-optimal solution set in a large scale dynamic network
Candidate: Daruwala, Raoul-Sam
Advisor(s): Mishra, Bud
Abstract:
Let G=(V,E) be a graph with time-dependent edges, where the cost of a path p through the graph is determined by a vector function F(p)=[f_1(p), f_2(p), \dots, f_n(p)], where f_1, f_2, ..., f_n are independent objective functions. When n>1 there is no single notion of a ``best'' solution; instead we turn to the idea of Pareto-optimality to define the efficiency of a path. Given the set of paths P through the network, a path p' is Pareto-optimal if no other path p in P dominates it, i.e., if there is no p with f_i(p) <= f_i(p') for every objective function f_i and f_i(p) < f_i(p') for at least one.
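Pareto-optimality over cost vectors (to be minimized) can be made concrete in a few lines. This generic quadratic-time filter is illustrative only; the thesis's algorithms exploit the geometric and periodic structure of transit networks to do much better:

```python
def dominates(p, q):
    """p dominates q if p is no worse in every objective and strictly better
    in at least one (all objectives are costs to be minimized)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def pareto_front(costs):
    """Keep exactly the cost vectors not dominated by any other."""
    return [p for p in costs
            if not any(dominates(q, p) for q in costs if q != p)]
```

For example, among the cost pairs (1,5), (2,2), (5,1), (3,3), (4,4), the last two are dominated by (2,2) and drop out, leaving a three-element front.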
The problem of planning itineraries on a transportation system involves computing the set of optimal paths through a time-dependent network where the cost of a path is determined by more than one, possibly non-linear and non-additive, cost function. This thesis introduces an algorithmic toolkit for finding the set of Pareto-optimal paths in time-dependent networks in the presence of multiple objective functions.
Multi-criteria path optimization problems are known to be NP-hard; however, by exploiting geometric and periodic properties of the dynamic graphs that model transit networks, we show that it is possible to compute the Pareto-optimal solution sets rapidly without using heuristics. We show that we can solve the itinerary problem in the presence of response time constraints for a large scale graph.
Title: Informative Features in Vision and Learning
Candidate: Rudra, Archisman
Advisor(s): Geiger, Davi
Abstract:
We explore the role of features in solving problems in computer vision and learning. Features capture important domain-dependent knowledge and are fundamental in simplifying problems. Our goal is to consider the universal features of the problem concerned, and not just particular algorithms used in its solution. Such an approach reveals the fundamental difficulties of any problem. For most problems we will face a host of other specialized concerns; therefore, we consider simplified problems which capture the essence of our approach.
This thesis consists of two parts. First, we explore means of discovering features. We come up with an information-theoretic criterion to identify features, which has deep connections to statistical estimation theory. We consider features to be ``nice'' representations of objects. We find that, ideally, a feature space representation of an image is the most concise representation which captures all available information in it. In practice, however, we are satisfied with an approximation to it. Therefore, we explore a few such approximations and explain their connection to the information-theoretic approach. We look at the algorithms which implement these approximations, and at their generalizations in the related field of stereo vision.
Using features, whether they come from some feature-discovery algorithm or are hand crafted, is usually an ad hoc process which depends on the actual problem and the exact representation of features. This diversity mostly arises from the multitude of ways features capture information. In the second part of this thesis, we come up with an architecture which lets us use features in a very flexible way, in the context of content-addressable memories. We apply this approach to two radically different domains, face images and English words. We also look at human performance in reconstructing words from fragments, which gives us some information about the memory subsystem in human beings.
Title: Knowledge Discovery in Databases for Intrusion Detection, Disease Classification and Beyond
Candidate: Berger, Gideon
Advisor(s): Mishra, Bud
Abstract:
As the number of networked computers grows, and the amount of sensitive information available on them grows as well, there is an increasing need to ensure the security of these systems. The security of computer networks is not a new issue. We have dealt with the need for security for a long time with such measures as passwords and encryption. These will always provide an important initial line of defense. However, given a clever and malicious individual, these defenses can often be circumvented. Intrusion detection is therefore needed as another way to protect computer systems. This thesis describes a novel three-stage algorithm for building classification models in the presence of non-stationary, temporal, high dimensional data, in general, and for detecting network intrusions, in particular. Given a set of training data records, the algorithm begins by identifying ``interesting'' temporal patterns in this data using a modal logic. This approach is distinguished from other work in this area where frequent patterns are identified. We show that when frequency is replaced by our measure of ``interestingness'' the problem of finding temporal patterns is NP-complete. We then offer an efficient heuristic approach that has proven effective in experiments. Having identified interesting patterns, these patterns then become the predictor variables in the construction of a Multivariate Adaptive Regression Splines (MARS) model. This approach will be justified, after surveying other methods for solving the classification problem, by its ability to capture complex nonlinear relationships between the predictor and response variables, which is comparable to other heuristic approaches such as neural networks and classification trees, while offering improved computational properties such as rapid convergence and interpretability.
After considering a variety of approaches to the problems of over-fitting, which is inherent when modeling high dimensional data, and non-stationarity, we describe our approach to addressing these issues through the use of truncated Stein shrinkage. This approach is motivated by showing the inadmissibility of the maximum likelihood estimator (MLE) for high dimensional (dimension >= 3) data. We then discuss the application of our approach as participants in the 1999 DARPA Intrusion Detection Evaluation, where we were able to exhibit the benefits of our approach. Finally, we suggest another area of research where we believe that our work would meet with similar success, namely, the area of disease classification.
Title: Algorithms for Rendering in Artistic Styles
Candidate: Hertzmann, Aaron
Abstract:
We describe new algorithms and tools for generating paintings, illustrations, and animation on a computer. These algorithms are designed to produce visually appealing and expressive images that look hand-painted or hand-drawn. In many contexts, painting and illustration have many advantages over photorealistic computer graphics, in aspects such as aesthetics, expression, and computational requirements. We explore three general strategies for non-photorealistic rendering:
First, we describe explicit procedures for placing brush strokes. We begin with a painterly image processing algorithm inspired by painting with real physical media. This method produces images with a much greater subjective impression of looking hand-made than do earlier methods. By adjusting algorithm parameters, a variety of styles can be generated, such as styles inspired by the Impressionists and the Expressionists. This method is then extended to processing video, as demonstrated by painterly animations and an interactive installation. We then present a new style of line art illustration for smooth 3D surfaces. This style is designed to clearly convey surface shape, even for surfaces without predefined material properties or hatching directions.
Next, we describe a new relaxation-based algorithm, in which we search for the painting that minimizes some energy function. In contrast to the first approach, we ideally only need to specify what we want, not how to directly compute it. The system allows as fine user control as desired: the user may interactively change the painting style, specify variations of style over an image, and/or add specific strokes to the painting.
Finally, we describe a new framework for processing images by example, called ``image analogies.'' Given an example of a painting or drawing (e.g. scanned from a hand-painted source), we can process new images with some approximation to the style of the painting. In contrast to the first two approaches, this allows us to design styles without requiring an explicit technical definition of the style. The image analogies framework supports many other novel image processing operations.
Title: Region-based Register Allocation for EPIC Architectures
Candidate: Kim, Hansoo
Advisor(s): Palem, Krishna
Abstract:
Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by allowing individual machine operations to execute in parallel. Explicitly Parallel Instruction Computing (EPIC) processors evolved in an attempt to achieve high levels of ILP without the attendant hardware complexity. In EPIC processors most of the work of extracting ILP is performed by the compiler. To take advantage of the higher levels of ILP these architectures offer, the compiler must use aggressive ILP techniques. This opportunity for improved performance comes at the price of increased compilation time.
Limiting the size of the compilation unit reduces compilation time, but the limited scope of compilation may restrict the scope of optimization. As a result, the compiler may generate less efficient code. Ideally, we want a smaller compilation time and the same or better execution time than that obtained using the global approach.
In this thesis, we address the compilation time and execution performance trade-off in region-based compilation within the context of the key optimization of register allocation. We demonstrate that schemes designed for region-based allocation perform as well as, or even better than, schemes designed for global allocation while requiring less compilation time. To achieve this goal, we propose several innovative techniques which form the core of this thesis.
We show considerable compilation time savings with comparable execution time performance by synthesizing our techniques in a region-based register allocator. We also explore and quantify the relation between register allocation performance and region size. Our research shows that selecting the right region size has an important impact on the performance of register allocation. We propose restructuring regions based on register pressure, and discuss how to estimate register pressure in order to improve compilation time while maintaining execution time.
Title: Adversarial Reasoning: A Logical Approach for Computer Go
Candidate: Klinger, Tamir
Advisor(s): Davis, Ernest
Abstract:
Go is a board game with simple rules but complex strategy requiring ability in almost all aspects of human reasoning. A good Go player must be able to hypothesize moves and analyze their consequences; to judge which areas are relevant to the analysis at hand; to learn from successes and failures; to generalize that knowledge to other ``similar'' situations; and to make inferences from knowledge about a position.
Unlike computer chess, which has seen steady progress since Shannon's [23] and Turing's [24] original papers on the subject, computer Go remains in relative infancy. In computer chess, minimax search with alpha-beta pruning based on a simple evaluation function can beat a beginner handily. No such simple evaluation function is known for Go. Accurately evaluating a Go position requires knowledge of the life and death status of the points on the board. Since the player with the most live points at the end of the game wins, a small mistake in this analysis can be disastrous.
In this dissertation we describe the design, performance, and underlying logic of a knowledge-based program that solves life and death problems in the game of Go. Our algorithm applies life and death theory, coupled with knowledge about which moves are reasonable for each relevant goal in that theory, to restrict the search space to a tractable size. Our results show that simple depth-first search armed with a goal theory and heuristic move knowledge yields very positive results on standard life and death test problems, even without sophisticated move ordering heuristics.
In addition to a description of the program and its internals, we present a modal logic useful for describing strategic theories in games, and use it to give a life and death theory and to formally state the rules of Go. We also give an axiomatization for this logic using the modal mu-calculus [15] and prove some basic theorems of the system.
Title: Machine Level Optimizations for High Level Languages
Candidate: Leung, Allen
Advisor(s): Palem, Krishna
Abstract:
Two machine instruction level compiler optimization problems are considered in this work.
The first problem is time-constrained instruction scheduling, i.e., finding optimal schedules for machine code in the presence of time constraints such as release-times and deadlines. These types of time constraints appear naturally in embedded applications, and also as a side effect of many other compiler optimization problems. While the general problem is NP-hard, we have developed a new algorithm which can optimally handle many P-time solvable sub-instances. In fact, we show that almost all previous algorithms in this related area can be seen as an instance of the priority computation scheme that we have developed. Our work extends and unifies many algorithmic results in classical deterministic scheduling theory related to release-times, deadlines and pipeline latencies.
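A simplified relative of such scheduling problems is unit-time, single-issue scheduling with release times and deadlines, where an earliest-deadline-first priority rule is optimal. This sketch is a generic EDF scheduler, not the priority computation scheme developed in the thesis:

```python
import heapq

def edf_schedule(ops):
    """Unit-time single-issue scheduling with release times and deadlines:
    at each cycle, issue the released operation with the earliest deadline.
    ops is a list of (name, release, deadline) triples. Returns a dict of
    start times, or None if some deadline must be missed."""
    ops = sorted(ops, key=lambda o: o[1])  # order by release time
    ready, sched, i, t = [], {}, 0, 0
    while i < len(ops) or ready:
        if not ready and t < ops[i][1]:
            t = ops[i][1]                  # idle until the next release
        while i < len(ops) and ops[i][1] <= t:
            name, _, dl = ops[i]
            heapq.heappush(ready, (dl, name))
            i += 1
        dl, name = heapq.heappop(ready)    # earliest deadline first
        if t + 1 > dl:
            return None                    # deadline miss: infeasible
        sched[name] = t
        t += 1
    return sched
```

With pipeline latencies and multiple issue slots the problem becomes NP-hard in general, which is why the thesis develops a priority computation scheme subsuming rules like this one for the tractable sub-instances.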
The second problem that we investigate in this work is scalar optimizations in machine code. We present a new framework that utilizes static single assignment form (SSA) at the level of individual machine instructions. Complementing the framework, we have also developed new SSA construction algorithms which are faster than previous algorithms, and are very simple to implement.
Title: Exact Geometric Computation: Theory and Applications
Candidate: Li, Chen
Abstract:
This dissertation explores the theory and applications of Exact Geometric Computation (EGC), a general approach to robust geometric computing. The contributions of this thesis are organized into three parts.
A fundamental task in EGC is to support exact comparison of algebraic expressions. This leads to the problem of constructive root bounds for algebraic expressions. Such root bounds determine the worst-case complexity of exact comparisons. In the first part, we present a new constructive root bound which, compared to previous bounds, can give dramatically better performance in many common computations involving divisions and radical roots. We also improve the well-known degree-measure bound by exploiting the sharing of common sub-expressions.
In the second part, we discuss the design and implementation of the Core Library, a C++ library which embraces the EGC approach to robust numerical and geometric computation. Our design emphasizes ease of use and facilitates the rapid development of robust geometric applications. It allows non-specialist programmers to add robustness to new or existing applications with little extra effort. A number of efficiency and implementation issues are investigated. Although focused on geometric computation, the EGC techniques and software we developed can be applied to other areas where it is critical to guarantee numerical precision.
In the third part, we introduce a new randomized test for the vanishing of multivariate radical expressions. With this test, we develop a probabilistic approach to proving elementary geometry theorems about ruler-and-compass constructions. A probabilistic theorem prover based on this approach has been implemented using the Core Library. We present some empirical data.
Title: An On-Line Handwriting Recognizer with Fisher Matching, Hypotheses Propagation Network and Context Constraint Models
Candidate: Oh, Jong
Advisor(s): Geiger, Davi
Abstract:
We have developed an on-line handwriting recognition system. Our approach integrates local bottom-up constructs with a global top-down measure into a modular recognition engine. The bottom-up process uses local point features for hypothesizing character segmentations and the top-down part performs shape matching for evaluating the segmentations. The shape comparison, called Fisher segmental matching, is based on Fisher's linear discriminant analysis. The component character recognizer of the system uses two kinds of Fisher matching based on different representations and combines the information to form the multiple experts paradigm.
Along with efficient ligature modeling, the segmentations and their character recognition scores are integrated into a recognition engine termed the Hypotheses Propagation Network (HPN), which runs a variant of the topological sort algorithm of graph search. The HPN improves on the conventional Hidden Markov Model and the Viterbi search by using the more robust mean-based scores for word-level hypotheses and by keeping multiple predecessors during the search.
We have also studied and implemented a geometric context model, termed Visual Bigram Modeling, that improves the system's accuracy by taking into account the geometric constraints under which the component characters in a word are formed in relation to their neighboring characters. The result is a shape-oriented system that is robust with respect to local and temporal features, modular in construction, and rich in opportunities for further extension.
Title: Continuous Model for Salient Shape Selection and Representation
Candidate: Pao, Hsing-Kuo (Kenneth)
Advisor(s): Geiger, Davi
Abstract:
We propose a new framework for shape representation and salient shape selection, considered as a low- to middle-level vision process. The framework can be applied to various topics, including figure/ground separation, search for the shape axis, junction detection and illusory figure finding. The model construction is inspired by Gestalt studies, which suggest that proximity, convexity, similarity, good continuation, closure, symmetry, etc., are useful for figure/ground separation and the construction of visual organization. First, we quantify those attributes for (complete or partial) shapes with our distributed systems, and the shape is evaluated and represented by the results. In particular, our discussion emphasizes shape convexity rather than attributes such as the symmetry axis or size, which have been well studied before. Our problem is posed in a continuous manner. For shape convexity, unlike the conventional mathematical definition, we aim to derive a definition that can describe one shape as ``more convex'' or ``less convex'' than another. In searching for the shape axis, we obtain continuous rather than binary information about whether a point lies on an axis: we distinguish axes with ``stronger'' or ``weaker'' declarations, and an easy and natural pruning scheme can be applied to such a representation. For junction detection, we do not assume any artificial threshold; instead, the transition from low-curvature to high-curvature curves, or to curves with discontinuities, is revealed by our representation. The model is based on a variational approach, given by the minimization of the data-fitting error as well as the neighborhood discrepancy. Two models are proposed: the decay diffusion process and the orientation diffusion process.
Title: Language Support for Program Generation: Reasoning, Implementation, and Applications
Candidate: Yang, Zhe
Advisor(s): Danvy, Olivier; Goldberg, Benjamin
Abstract:
This dissertation develops programming languages and associated techniques for sound and efficient implementations of algorithms for program generation.
First, we develop a framework for practical two-level languages. In this framework, we demonstrate that two-level languages are not only a good tool for describing program-generation algorithms, but a good tool for reasoning about them and implementing them as well. We pinpoint several general properties of two-level languages that capture common proof obligations of program-generation algorithms.
In addition, to justify concrete implementations, we use a native embedding of a two-level language into a one-level language.
We present two-level languages with these properties both for a call-by-name object language and for a call-by-value object language with computational effects, and demonstrate them through two classes of non-trivial applications: one-pass transformations into continuation-passing style and type-directed partial evaluation for call-by-name and for call-by-value.
Next, to facilitate implementations, we develop several general approaches to programming with type-indexed families of values within the popular Hindley-Milner type system. Type-indexed families provide a form of type dependency, which is employed by many algorithms that generate typed programs, but is absent from mainstream languages. Our approaches are based on type encodings, so that they are type safe. We demonstrate and compare them through a host of examples, including type-directed partial evaluation and printf-style formatting.
Finally, upon the two-level framework and type-encoding techniques, we recast a joint work with Bernd Grobauer, where we formally derived a suitable self application for type-directed partial evaluation, and achieved automatic compiler generation.
Title: SETL for Internet Data Processing
Candidate: Bacon, David
Advisor(s): Schwartz, Jack
Abstract:
Although networks and coordinated processes figure prominently in the kinds of data manipulation found in everything from scientific modeling to large-scale data mining, programmers charged with setting up the requisite software systems frequently find themselves hampered by the inadequacy of available languages. The ``real'' languages such as C++ and Java tend to be low-level, requiring the specification of a great deal of often repetitive detail, whereas the higher-level ``scripting'' languages tend to lack the kinds of structuring facilities that lend themselves to the reliable construction of even modestly large systems.
The high-level language SETL meets both of these needs. Originally conceived as a language which aimed to bring programming a little closer to the idealized world of mathematics, making it extremely useful in the human-to-human communication of algorithms, SETL has proven itself over the years to be an excellent language for software prototyping, primarily because its conciseness and immediacy lend it well to rapid experimentation. These characteristics, together with its general freedom from machine-oriented restrictions, its value semantics, its comprehension-style constructors for aggregates, its skill with strings, and especially its syntactic support for mappings, also make it well suited to high-level data processing.
In order to play the role of a full-fledged modern data processing language, however, SETL had to acquire the ability to manipulate processes and communicate with them easily, and furthermore to be able to work with networks, particularly the client-server model that rules the Internet. Accordingly, I have integrated a full set of process and network management features into SETL. In my dissertation, I show how the liberal use of fullweight processes, with the high, protective walls that surround them, sustains a modular design approach which in turn provides a strong defense against the main hazards of distributed computing, namely race conditions and deadlock, while preserving the luxury and convenience of programming in a truly high-level language. To this end, I have evolved protocols and design patterns for developing multiplexing servers and clients in SETL, and in my talk, I will present examples of fairly complex systems where hierarchies of processes communicate over the network. Such systems tend to be notorious for their unreliability, but in these instances, robustness seems to follow naturally from the readability of simple programs written in an ancient and friendly language.
Title: A Rigorous Framework for Fully Supporting the IEEE Standard for Floating-Point Arithmetic in High-Level Programming Languages
Candidate: Figueroa, Sam
Advisor(s): Dewar, Robert
Abstract:
Processors conforming to the IEEE Standard for Floating-Point Arithmetic have been commonplace for some years, and now several programming languages seem to support or conform to this standard, from here on referred to as ``the IEEE Standard.'' For example, The Java Language Specification by Gosling, Joy, and Steele, which defines the Java language, frequently mentions the IEEE Standard. Indeed, Java, like other languages, supports some of the features of the IEEE Standard, including a couple of floating-point data formats, and even requires (in section 4.2.4 ``Floating-Point Operations'' of the aforementioned book) that ``operators on floating-point numbers behave exactly as specified by IEEE 754.''
Arguing that the support current languages offer is not enough, this thesis establishes clear criteria for what it means to fully support the IEEE Standard in a programming language. Each aspect of the IEEE Standard is examined in detail from the point of view of how various arithmetic engines implement that aspect of the IEEE Standard, how different languages (and implementations thereof) support it, and what the range of options is in supporting that aspect. Practical recommendations are then offered (particularly, but not exclusively, for Ada and Java), taking, for example, programmer convenience and impact on performance into consideration. A detailed model specification following these recommendations is provided for the Ada language.
In addition, a variety of issues related to the floating-point aspects of programming languages are discussed, so as to serve as a more complete guide to language designers. One such issue is floating-point expression evaluation schemes, and, more specifically, whether bit-for-bit identical results are actually achievable on a variety of platforms that conform to the IEEE Standard, as the Java language promises. Closely tied to this issue is that of double rounding, which occurs when a (possibly intermediate) result is rounded more than once before subsequent use or before being delivered to its final destination. So this thesis discusses when double rounding makes a difference, how it can be avoided, and what the performance impact is in avoiding it.
Title: A Language-Theoretic Approach to Algorithms
Candidate: Goyal, Deepak
Advisor(s): Paige, Bob
Abstract:
An effective algorithm design language should be 1) wide-spectrum in nature, i.e. capable of expressing both abstract specifications and low-level implementations, and 2) "computationally transparent", i.e. facilitate accurate estimation of time and space requirements. The conflict between these requirements is exemplified by SETL which is wide-spectrum, but lacks computational transparency because of its reliance on hash-based data structures. The first part of this thesis develops an effective algorithm design language, and the second demonstrates its usefulness for algorithm explanation and discovery.
In the first part three successively more abstract set-theoretic languages are developed and shown to be computationally transparent. These languages can collectively express both abstract specifications and low-level implementations. We formally define a data structure selection method for these languages using a novel type system. Computational transparency is obtained for the lowest-level language through the type system, and for the higher-level languages by transformation into the next lower level. We show the effectiveness of this method by using it to improve a difficult database query optimization algorithm from expected to worst-case linear time. In addition, a simpler explanation and a shorter proof of correctness are obtained.
In the second part we show how our data structure selection method can be made an effective third component of a transformational program design methodology whose first two components are finite differencing and dominated convergence. Finite differencing replaces costly repeated computations by cheaper incremental counterparts, and dominated convergence provides a generalized iteration scheme for computing fixed-points. This methodology has led us to a simpler explanation of a complex linear-time model-checking algorithm for the alternation-free modal mu-calculus, and to the discovery of an O(N^3) time algorithm for computing intra-procedural may-alias information that improves over an existing O(N^5) time algorithm.
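The idea behind finite differencing can be shown on a deliberately tiny example (invented here, not taken from the thesis): instead of recomputing an aggregate over a set on every query, the value is maintained incrementally as the set changes.

```python
# Finite differencing in miniature: the repeated computation
# sum(x*x for x in S) is replaced by an incrementally maintained
# value, updated in O(1) at each insertion or deletion.
class SumOfSquares:
    def __init__(self):
        self.items = set()
        self.value = 0          # invariant: value == sum of x*x over items

    def insert(self, x):
        if x not in self.items:
            self.items.add(x)
            self.value += x * x  # O(1) increment, no O(|S|) recompute

    def delete(self, x):
        if x in self.items:
            self.items.remove(x)
            self.value -= x * x

s = SumOfSquares()
for x in (1, 2, 3):
    s.insert(x)
s.delete(2)
print(s.value)  # → 10  (1 + 9)
```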
Title: Supporting a Flexible Parallel Programming Model on a Network of Non-Dedicated Workstations
Candidate: Huang, Shih-Chen
Advisor(s): Kedem, Zvi
Abstract:
A network of non-dedicated workstations can provide computational resources at minimal or no additional cost. If harnessed properly, the combined computational power of these otherwise ``wasted'' resources can outperform even mainframe computers. Performing demanding computations on a network of non-dedicated workstations efficiently has previously been studied, but inadequate handling of the unpredictable behavior of the environment and of possible failures resulted in only limited success.
This dissertation presents a shared memory software system for executing programs with nested parallelism and synchronization on a network of non-dedicated workstations. The programming model exhibits a very convenient and natural programming style and is especially suitable for computations whose complexity and parallelism emerge only during their execution, such as in divide and conquer problems. To both support and take advantage of the flexibility inherent in the programming model, an architecture is developed that distributes both the shared memory management and the computation. This architecture removes bottlenecks inherent in centralization, thus enhancing scalability and dependability. By adapting available resources dynamically and coping with unpredictable machine slowdowns and failures, the system also supports dynamic load balancing and fault tolerance--both transparently to the programmer.
Title: Global Optimization Using Embedded Graphs
Candidate: Ishikawa, Hiroshi
Advisor(s): Geiger, Davi
Abstract:
One of the challenges of computer vision is that the information we seek to extract from images is not even defined for most images. Because of this, we cannot hope to find a simple process that produces the information directly from a given image. Instead, we need a search, or an optimization, in the space of parameters that we are trying to estimate.
In this thesis, I introduce two new optimization methods that use graph algorithms. They are characterized by their ability to find a global optimum efficiently. Each method defines a graph that can be seen as embedded in a Euclidean space. Graph-theoretic entities such as cuts and cycles represent geometric objects that embody the information we seek.
The first method finds a hypersurface in a Euclidean space that minimizes a certain kind of energy functional. The hypersurface is approximated by a cut of an embedded graph so that the total cost of the cut corresponds to the energy. A globally optimal solution is found by using a minimum cut algorithm. In particular, it can globally solve first order Markov Random Field problems in more generality than was previously possible. I prove that the convexity of the smoothing function in the energy is essential for the applicability of the method and provide an exact criterion in terms of the MRF energy.
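The minimum-cut primitive underlying the first method can be illustrated with a bare-bones Edmonds-Karp max-flow computation on a toy graph; the function name, the dict-of-dicts encoding, and the example capacities are all invented here, and the embedded graphs built for MRF energies in the thesis are of course far larger and more structured.

```python
from collections import deque

def min_cut_value(cap, s, t):
    """Max-flow (= min s-t cut value) by Edmonds-Karp.
    cap: dict mapping node -> {neighbor: capacity}."""
    # Residual capacities, with zero-capacity reverse edges added.
    res = {u: dict(vs) for u, vs in cap.items()}
    for u in cap:
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow  # no augmenting path: flow equals min cut value
        # Find the bottleneck capacity and push flow along the path.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        flow += aug

g = {"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}, "t": {}}
print(min_cut_value(g, "s", "t"))  # → 4 (cut edges a->t and s->b)
```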
The second method proposed here efficiently finds an optimal cycle in a Euclidean space. It uses a minimum ratio cycle algorithm to find a cycle with minimum energy in an embedded graph. In the case of two dimensions, the energy can depend not only on the cycle itself but also on the region defined by the cycle. Because of this, the method unifies the two competing views of boundary and region segmentation.
I demonstrate the utility of the methods in applications, with the results of experiments in the areas of binocular stereo, image restoration, and image segmentation. The image segmentation, or contour extraction, experiments are carried out in various situations using different types of information, for example motion, stereo, and intensity.
Title: On the Use of Functionals on Boundaries in Hierarchical Models of Object Recognition
Candidate: Jermyn, Ian
Advisor(s): Geiger, Davi
Abstract:
Object recognition is a central problem in computer vision. Typically it is assumed to follow a sequential model in which successively more specific hypotheses are generated about the image. This is a rather simplistic model, allowing as it does no margin for error at any point. We follow a more general approach in which the various representations involved are allowed to influence one another from the outset. As a guide and ultimate goal, we study the problem of finding the region occupied by human beings in images, and the separation of the region into arms, legs and head. We approach the problem as that of defining a functional on the space of boundaries in images whose minimum specifies the region occupied by the human figure. Previous work that uses such functionals suffers from a number of difficulties. These include an uncontrollable dependence on scale, an inability to find the global minimum for boundaries in polynomial time, and the inability to include region as well as boundary information. We present a new form of functional on boundaries in a manifold that solves these problems, and is also the unique form of functional in a specific class that possesses a non-trivial, efficiently computable global minimum. We describe applications of the model to single images and to the extraction of boundaries from stereo pairs and motion sequences. In addition, the functionals used in previous work could not include information about the shape of the region sought. We develop a model for the part structures of boundaries that extends previous work to the case of real images, thus including shape information in the functional framework. We show that such part structures are hyperpaths in a hypergraph. An `optimal hyperpath' algorithm is developed that globally minimizes the functional under some conditions. We show how to use exemplars of a shape to construct a functional that includes specific information about the topology of the part structure sought. 
An algorithm is developed that globally minimizes such functionals in the case of a fixed boundary. The behaviour of the functional mimics an aspect of human shape comparison.
Title: Delegation Logic: A Logic-based Approach to Distributed Authorization
Candidate: Li, Ninghui
Advisor(s): Feigenbaum, Joan; Siegel, Alan
Abstract:
We address the problem of authorization in large-scale, open, distributed systems. Authorization decisions are needed in electronic commerce, mobile-code execution, remote resource sharing, content advising, privacy protection, etc. We adopt the trust-management approach, in which “authorization” is viewed as a “proof-of-compliance” problem: Does a set of credentials prove that a request complies with a policy? We develop a logic-based language, Delegation Logic (DL), to represent policies, credentials, and requests in distributed authorization. Delegation Logic extends logic programming (LP) languages with expressive delegation constructs that feature delegation depth and a wide variety of complex principals (including, but not limited to, k-out-of-n thresholds). D1LP, the monotonic version of DL, extends the LP language Datalog with delegation constructs. D2LP, the nonmonotonic version of DL, also features classical negation, negation-as-failure, and prioritized conflict handling. Our approach to defining and implementing DL is based on tractably compiling DL programs into ordinary logic programs (OLPs). This compilation approach enables DL to be implemented modularly on top of existing technologies for OLP, e.g., Prolog. As a trust-management language, Delegation Logic provides a concept of proof-of-compliance that is founded on well-understood principles of logic programming and knowledge representation. DL also provides a logical framework for studying delegation, negation of authority, conflicts between authorities, and their interplay.
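The notion of bounded delegation depth can be sketched with a small graph search; this is an informal analogue invented for illustration (it is not DL syntax, and the semantics here is a simplification): an edge (p, q, d) says q may speak for p, with d bounding how many further hops the chain through q may use.

```python
def authorized(verifier, speaker, delegations):
    """delegations: dict mapping principal -> list of (delegatee, depth).
    Returns True if `verifier` (transitively, within the declared depth
    bounds) trusts statements made by `speaker`."""
    frontier = [(verifier, float("inf"))]  # (principal, remaining hops)
    seen = {}
    while frontier:
        p, budget = frontier.pop()
        if p == speaker:
            return True
        if seen.get(p, -1) >= budget:
            continue                     # already explored with more budget
        seen[p] = budget
        for q, depth in delegations.get(p, []):
            # Taking this edge consumes one hop; the edge's own depth
            # bound also caps how far the chain may continue.
            remaining = min(budget, depth) - 1
            if remaining >= 0:
                frontier.append((q, remaining))
    return False

# A delegates to B with depth 1: B may speak for A, but B's
# re-delegation to C must not be honored.
delegs = {"A": [("B", 1)], "B": [("C", 1)]}
print(authorized("A", "B", delegs), authorized("A", "C", delegs))
# → True False
```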
Title: Queryable Expert Systems
Candidate: Tanzer, David
Abstract:
10:00 a.m., Tuesday, October 17, 2000
12th floor conference room, 719 Broadway
Interactive rule-based expert systems, which work by ``interviewing'' their users, have found applications in fields ranging from aerospace to help desks. Although they have been shown to be useful, people find them difficult to query in flexible ways. This limits the reusability of the knowledge they contain. Databases and noninteractive rule systems such as logic programs, on the other hand, are queryable but they do not offer an interview capability. This thesis is the first investigation that we know of into query-processing for interactive expert systems.
In our query paradigm, the user describes a hypothetical condition and then the system reports which of its conclusions are reachable, and which are inevitable, under that condition. For instance, if the input value for bloodSugar exceeds 100 units, is the conclusion diabetes then inevitable? Reachability problems have been studied in other settings, e.g., the halting problem, but not for interactive expert systems.
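The reachability side of this query paradigm can be mimicked in a toy setting (the rule encoding, threshold predicates, and value ranges below are invented for illustration; the thesis treats a much richer rule language): under a hypothetical constraint on an input, a conclusion is reachable if some still-possible value triggers it, and inevitable if every still-possible value does.

```python
def reachable_conclusions(rules, hypothesis):
    """rules: list of (input_var, predicate, conclusion).
    hypothesis: dict mapping input_var -> set of still-possible values.
    A conclusion is reachable if some allowed value satisfies its rule."""
    out = set()
    for var, pred, conclusion in rules:
        if any(pred(v) for v in hypothesis[var]):
            out.add(conclusion)
    return out

rules = [("bloodSugar", lambda v: v > 120, "diabetes"),
         ("bloodSugar", lambda v: v <= 120, "normal")]

# Hypothetical condition: bloodSugar exceeds 100 units.
hypo = {"bloodSugar": set(range(101, 200))}
print(sorted(reachable_conclusions(rules, hypo)))  # → ['diabetes', 'normal']

# Inevitability is the dual: every allowed value must trigger the rule.
print(all(rules[0][1](v) for v in hypo["bloodSugar"]))  # → False
```

Here diabetes is reachable but not inevitable under the condition, since values between 101 and 120 satisfy the constraint without triggering the diabetes rule.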
We first give a theoretical framework for query-processing that covers a wide class of interactive expert systems. Then we present a query algorithm for a specific language of expert systems. This language is a restriction of production systems to an acyclic form that generalizes decision trees and classical spreadsheets. The algorithm effects a reduction from the reachability and inevitability queries into datalog rules with constraints. When preconditions are conjunctive, the data complexity is tractable.
Next, we optimize for queries to production systems that contain regions which are decision trees. When general-purpose datalog methods are applied to the rules that result from our queries, the number of constraints that must be solved is O(n^2), where n is the size of the trees. We lower the complexity to O(n). Finally, we have built a query tool for a useful subset of the acyclic production systems. To our knowledge, these are the first interactive expert systems that can be queried about the reachability and inevitability of their conclusions.
Title: Scenario Customization for Information Extraction
Candidate: Yangarber, Roman
Advisor(s): Grishman, Ralph
Abstract:
Information Extraction (IE) is an emerging NLP technology, whose function is to process unstructured, natural language text, to locate specific pieces of information, or facts, in the text, and to use these facts to fill a database. IE systems today are commonly based on pattern matching. The core IE engine uses a cascade of sets of patterns of increasing linguistic complexity. Each pattern consists of a regular expression and an associated mapping from syntactic to logical form. The pattern sets are customized for each new topic, as defined by the set of facts to be extracted.
Construction of a pattern base for a new topic is recognized as a time-consuming and expensive process--a principal roadblock to wider use of IE technology in the large. An effective pattern base must be precise and must have wide coverage. This thesis addresses the portability problem in two stages.
First, we introduce a set of tools for building patterns manually from examples. To adapt the IE system to a new subject domain quickly, the user chooses a set of example sentences from a training text, and specifies how each example maps to the extracted event--its logical form. The system then applies meta-rules to transform the example automatically into a general set of patterns. This effectively shifts the portability bottleneck from building patterns to finding good examples.
Second, we propose a novel methodology for discovering good examples automatically from a large un-annotated corpus of text. The system is initially seeded with a small set of relevant patterns provided by the user. An unsupervised learning procedure then identifies new patterns and classes of related terms on successive iterations. We present experimental results, which confirm that the discovered patterns exhibit high quality, as measured in terms of precision and recall.
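The bootstrapping loop just described can be caricatured in a few lines. Everything below is a heavily simplified sketch with invented names and data: documents are treated as bags of patterns, relevance is binary, and the promotion score is a bare density ratio, whereas the thesis uses a much more careful scoring of candidate patterns.

```python
def discover_patterns(docs, seeds, rounds=1, threshold=1.4):
    """docs: list of sets of patterns; seeds: initial relevant patterns.
    Each round, documents matching any known pattern are marked
    relevant, and patterns concentrated in relevant documents are
    promoted as new seeds."""
    patterns = set(seeds)
    for _ in range(rounds):
        relevant = [d for d in docs if patterns & d]
        if not relevant:
            break
        candidates = {p for d in relevant for p in d} - patterns
        for p in candidates:
            in_rel = sum(p in d for d in relevant) / len(relevant)
            overall = sum(p in d for d in docs) / len(docs)
            if in_rel / overall >= threshold:  # denser in relevant docs
                patterns.add(p)
    return patterns

docs = [{"hire", "appoint"}, {"appoint", "resign"}, {"weather", "rain"}]
print(sorted(discover_patterns(docs, {"hire"})))  # → ['appoint', 'hire']
```

Even in this caricature the key property survives: "appoint" is promoted because it co-occurs with the seed, while "weather" is not.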
Title: Higher-Order Conditional Synchronization
Candidate: Afshartous, Niki
Advisor(s): Goldberg, Benjamin
Abstract:
Conditional synchronization - a mechanism that conditionally blocks a thread based on the value of a boolean expression - currently exists in several programming languages. We propose promoting conditional synchronization to first-class status, allowing the synchronization object representing a suspended conditional synchronization to be passed as a value.
To demonstrate our idea we extend Concurrent ML and present several examples illustrating the expressiveness of first-class conditional synchronization (FCS). FCS has broadcast semantics making it appropriate for applications such as barriers and discrete-event simulation. The semantics also guarantee that no transient store configurations are missed. The end result facilitates abstraction and adds flexibility in writing concurrent programs. To minimize re-evaluation of synchronization conditions we propose a static analysis and translation that identifies expressions for the run-time system that could affect the value of a synchronization condition. The static analysis (which is based on an effect type system) therefore precludes excessive run-time system polling of synchronization conditions.
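A rough Python analogue (not Concurrent ML, and not FCS itself) conveys the two semantic points above: updates to the store broadcast to all waiters, and each waiter re-checks its condition under the lock, so no transient store configuration is missed. The `Store` class and example condition are invented for illustration.

```python
import threading

class Store:
    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()       # broadcast: wake every waiter

    def await_condition(self, pred):
        with self._cond:
            # wait_for re-evaluates pred while holding the lock each
            # time the store changes, so no true state is missed.
            self._cond.wait_for(lambda: pred(self._data))

store = Store()
results = []

def waiter():
    store.await_condition(lambda d: d.get("x", 0) > 10)
    results.append("woke")

t = threading.Thread(target=waiter)
t.start()
store.set("x", 5)       # condition still false: waiter stays blocked
store.set("x", 42)      # condition true: waiter resumes
t.join()
print(results)  # → ['woke']
```

The static analysis described in the abstract exists precisely to avoid the naive cost visible here: without it, every store update would force re-evaluation of every waiter's condition.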
Title: Metacomputing on Commodity Computers
Candidate: Baratloo, Arash
Advisor(s): Kedem, Zvi
Abstract:
The advantages of using a set of networked commodity computers for parallel processing are well understood: such computers are cheap, widely available, and mostly underutilized. So why has the use of such environments for compute-intensive applications not proliferated? A major reason is that the inherent complexities of programming applications and coordinating their execution on networked computers outweigh the advantages.
In networked environments populated with multiuser commodity computers, both the computing speed and the number of available computers for executing parallel programs may change frequently and unpredictably. As a consequence, programs need to continuously adapt their execution to the changing environment. The execution of an application must therefore address such issues as dynamic changes in effective machine speeds, dynamic changes in the number of available machines, and sudden network and machine failures. It is not feasible for an application programmer to write programs that adapt to the behavior of a system whose critical aspects cannot be anticipated.
I will present a unified set of techniques to implement a virtual reliable parallel-processing platform on a set of unreliable computers with temporally varying execution speeds. These techniques are specifically designed for automatically adapting the execution of parallel programs to distributed environments. I will explain these techniques in the context of two software systems, Calypso and ResourceBroker, that have been built to validate them.
Calypso gives a programmer a simple tool to build and effectively execute parallel programs on a set of commodity computers. The notable properties of Calypso are: (1) a simple, intuitive programming model based on a virtual machine interface; (2) separation of logical and physical parallelism, allowing the source code to codify the algorithm rather than the execution environment; and (3) a runtime system that efficiently adapts the execution of the program to the dynamic nature of the runtime environment. ResourceBroker is a resource manager that demonstrates a novel technique to dynamically manage the assignment of computers to parallel programs. ResourceBroker can work with a variety of parallel systems, even transparently managing those that are not aware of its existence, such as PVM and MPI, and will distribute available resources fairly among multiple computations. As a result, a mix of parallel programs, written using diverse programming systems, can effectively execute concurrently on a set of computers.
Title: A Maximum Entropy Approach to Named Entity Recognition
Candidate: Borthwick, Andrew
Advisor(s): Grishman, Ralph
Abstract:
This thesis describes a novel statistical named-entity (i.e. ``proper name'') recognition system known as ``MENE'' (Maximum Entropy Named Entity). Named entity (N.E.) recognition is a form of information extraction in which we seek to classify every word in a document as being a person-name, organization, location, date, time, monetary value, percentage, or ``none of the above''. The task has particular significance for Internet search engines, machine translation, the automatic indexing of documents, and as a foundation for work on more complex information extraction tasks.
Two of the most significant problems facing the constructor of a named entity system are the questions of portability and system performance. A practical N.E. system will need to be ported frequently to new bodies of text and even to new languages. The challenge is to build a system which can be ported with minimal expense (in particular minimal programming by a computational linguist) while maintaining a high degree of accuracy in the new domains or languages.
MENE attempts to address these issues through the use of maximum entropy probabilistic modeling. It utilizes a very flexible object-based architecture which allows it to make use of a broad range of knowledge sources in making its tagging decisions. In the DARPA-sponsored MUC-7 named entity evaluation, the system displayed an accuracy rate which was well above the median, demonstrating that it can achieve the performance goal. In addition, we demonstrate that the system can be used as a post-processing tool to enhance the output of a hand-coded named entity recognizer through experiments in which MENE improved on the performance of N.E. systems from three different sites. Furthermore, when all three external recognizers are combined under MENE, we are able to achieve very strong results which, in some cases, appear to be competitive with human performance.
Finally, we demonstrate the trans-lingual portability of the system. We ported the system to two Japanese-language named entity tasks, one of which involved a new named entity category, ``artifact''. Our results on these tasks were competitive with the best systems built by native Japanese speakers despite the fact that the author speaks no Japanese.
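A maximum-entropy tagger of this kind scores each candidate tag by exponentiating a weighted sum of binary feature indicators and normalizing. The sketch below illustrates only that computation; the features, tags, and weights are invented for illustration and are not MENE's actual knowledge sources:

```python
import math

# Hypothetical binary features for one token in context; a real system
# derives many thousands of these from lexicons, capitalization, etc.
features = {"is_capitalized": 1, "in_first_name_lexicon": 0, "prev_word_is_mr": 1}

# Illustrative weights lambda[tag][feature], as learned by maximum-entropy training.
weights = {
    "person": {"is_capitalized": 1.2, "in_first_name_lexicon": 2.0, "prev_word_is_mr": 3.0},
    "none":   {"is_capitalized": -0.5, "in_first_name_lexicon": -1.0, "prev_word_is_mr": -2.0},
}

def maxent_probs(features, weights):
    # p(tag | context) = exp(sum_i lambda_i * f_i) / Z
    scores = {tag: math.exp(sum(w[f] * v for f, v in features.items()))
              for tag, w in weights.items()}
    z = sum(scores.values())
    return {tag: s / z for tag, s in scores.items()}

probs = maxent_probs(features, weights)
```

Because the capitalization and "Mr."-context features fire, the model assigns most of the probability mass to the person tag; training adjusts the weights so that such decisions match the annotated data.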
Title: Algorithms for Nonlinear Models in Computational Finance and their Object-oriented Implementation
Candidate: Buff, Robert
Advisor(s): Avellaneda, Marco
Abstract:
Individual components of financial option portfolios cannot be evaluated independently under nonlinear models in mathematical finance. This entails increased algorithmic complexity if the options under consideration are path-dependent. We describe algorithms that price portfolios of vanilla, barrier and American options under worst-case assumptions in an uncertain volatility setting. We present a generalized approach to worst-case volatility scenarios in which only the duration, but not the starting dates of periods of high volatility risk are known. Our implementation follows object-oriented principles and is modular and extensible. Combinatorial and numerical algorithms are separate and orthogonal to each other. We make our tools available to a wide audience by using standard Internet technologies.
Title: Prototyping a Prototyping Language
Candidate: Chen, Hseu-Ming
Advisor(s): Harrison, Malcolm C.
Abstract:
The development of a prototyping language should follow the usual software-engineering methodology: starting with an evolvable, easily modifiable, working prototype of the proposed language. Rather than committing to the development of a mammoth compiler at the outset, we can design a translator from the prototyping language to another high-level language as a viable alternative. From a software-engineering point of view, the advantages of the translator approach are its shorter development cycle and lessened maintenance burden.
In prototyping language design, there are often innovative cutting-edge features which may not be well understood. It is inevitable that numerous experiments and revisions will be made to the current design, and hence supporting evolvability and modifiability is critical in the translator design.
In this dissertation we present an action-semantics-based framework for high-level source-to-source language translation. Action semantics is a form of denotational semantics that is based on abstract semantic algebras rather than Scott domains and lambda notation. More specifically, this model not only provides a formal semantics definition for the source language and sets guidelines for implementations as well as migration, but also facilitates mathematical reasoning and a correctness proof of the entire translation process. The translation is geared primarily towards readability, maintainability, and type-preserving target programs, and only secondarily towards reasonable efficiency.
We have acquired a collection of techniques for the translation of certain non-trivial high-level features of prototyping languages and declarative languages into efficient procedural constructs in imperative languages like Ada95, while using the abstraction mechanism of the target languages to maximize the readability of the target programs. In particular, we translate Griffin existential types into Ada95 using its object-oriented features, based on coercion calculus. This translation is actually more general, in that one can add existential types to a language supporting the object-oriented paradigm (with a modicum of extra syntax) without augmenting its type system, through intra-language transformation. We also present a type-preserving translation of closures which allows us to drop the whole-program-transformation requirement.
Title: Distributed intelligence with bounded rationality: Applications to economies and networks
Candidate: Even, Ron
Advisor(s): Mishra, Bud
Abstract:
This dissertation examines bounded rationality as a tool in distributed systems of intelligent agents. We have implemented, in Java, a simulator for complex adaptive systems called CAF??. We use our framework to simulate a simple network and compare the effectiveness of bounded rationality at routing and admission control to that of a more traditional, source based, greedy routing approach. We find that the boundedly rational approach is particularly effective when user behavior is synchronized, such as occurs during breaking news releases on the World Wide Web, for example. We develop the key structures of our framework by first examining, through simulation, the behavior of boundedly rational speculators in a simple economy. We find them to be instrumental in bringing the economy quickly to price equilibrium as well as in maintaining the equilibrium in the face of changing conditions. We draw several interesting conclusions as to the key similarities between economy and computational systems and also, the situations where they differ drastically.
Title: Pattern Discovery in Biology: Theory and Applications
Candidate: Floratos, Aristidis
Advisor(s): Boppana, Ravi; Rigoutsos, Isidore
Abstract:
Molecular Biology studies the composition and interactions of life's agents, namely the various molecules (e.g. DNA, proteins, lipids) sustaining the living process. Traditionally, this study has been performed in wet labs using mostly physicochemical techniques. Such techniques, although precise and detailed, are often cumbersome and time consuming. On top of that, recent advances in sequencing technology have allowed the rapid accumulation of DNA and protein data. As a result a gap has been created (and is constantly being expanded): on the one side there is a rapidly growing collection of data containing all the information upon which life is built; and on the other side we are currently unable to keep up with the study of this data, impaired by the limits of existing analysis tools. It is obvious that alternative analysis techniques are badly needed. In this work we examine how computational methods can help in mining the information contained in collections of biological data. In particular, we investigate how sequence similarity among various macromolecules (e.g. proteins) can be exploited towards the extraction of biologically useful information.
Title: Matching Algorithms and Feature Match Quality Measures for Model-Based Object Recognition with Applications to Automatic Target Recognition
Candidate: Garcia-Keller, Martin
Advisor(s): Hummel, Robert
Abstract:
In the fields of computational vision and image understanding, the object recognition problem can often be formulated as a problem of matching a collection of model features to features extracted from an observed scene. This dissertation is concerned with the use of feature-based match similarity measures and feature match algorithms in object detection and classification in the context of image understanding from complex signature data. Our applications are in the domains of target vehicle recognition from radar imagery, and binocular stereopsis.
In what follows, we will consider “image understanding” to encompass the set of activities necessary to identify objects in visual imagery and to establish meaningful three-dimensional relationships between the objects themselves, or between the object and the viewer. The main goal in image understanding then involves the transformation of images to symbolic representation, effectively providing a high-level description of an image in terms of objects, object attributes, and relationships between known objects. As such, image understanding subsumes the capabilities traditionally associated with image processing, object recognition and artificial vision [Crevier and Lepage 1997].
In human and/or biological vision systems, the task of object recognition is a natural and spontaneous one. Humans can recognize immediately and without effort a huge variety of objects from diverse perceptual cues and multiple sensorial inputs. The operations involved are complex and inconspicuous psychophysical and biological processes, including the use of properties such as shape, color, texture, pattern, motion, context, as well as considerations based on contextual information, prior knowledge, expectations, functionality hypothesis, and temporal continuity. These operations and their relation to machine object recognition and artificial vision are discussed in detail elsewhere [Marr 1982], [Biederman 1985], but they are not our concern in this thesis.
In this research, we consider only the simpler problem of model-based vision, where the objects to be recognized come from a library of three-dimensional models known in advance, and the problem is constrained using context and domain-specific knowledge.
The relevance of this work resides in its potential to support state-of-the-art developments in both civilian and military applications including knowledge-based image analysis, sensors exploitation, intelligence gathering, evolving databases, interactive environments, etc. A large number of applications are reviewed below in section 1.4. Experimental results are presented in Chapters 5, 6, and
Title: Learning to Play Network Games
Candidate: Greenwald, Amy
Advisor(s): Mishra, Bud
Abstract:
This talk concerns the strategic behavior of automated agents in the framework of network game theory, with particular focus on the collective behavior that arises via learning. In particular, ideas are conveyed on both the theory and simulation of learning in network games, in terms of two sample applications. The first application is network control, presented via an abstraction known as the Santa Fe bar problem, for which it is proven that rational learning does *not* converge to Nash equilibrium, the classic game-theoretic solution concept. On the other hand, it is observed via simulations that low-rationality learning, where agents trade off between exploration and exploitation, typically converges to mixed strategy Nash equilibria in this game. The second application is the economics of shopbots - agents that automatically search the Internet for price and product information - in which learning yields behaviors ranging from price wars to tacit collusion, with sophisticated low-rationality learning algorithms converging to Nash equilibria. This work forms part of a larger research program that advocates learning and game theory as a framework in which to model the interactions of computational agents in network domains.
Title: Experiments in refining graphical interface widgets
Candidate: Hecker, Yaron Chanoch
Abstract:
This thesis investigates GUIs and their shortcomings. We demonstrate that there is room for refinement of existing graphical user interfaces, including those interfaces with which we are most familiar. A foundation for our designs is first established. It consists of known human capabilities, especially concerning hand-eye coordination, short term and long term memory, and visual perception. Accumulated experience in static and animated visual design provides additional guides for our work. On the basis of this foundation we analyze existing widgets. A series of new widgets are then proposed to address observed deficiencies in existing designs for scrolling, multiple copy and paste in text environments, text insertion and selection, and window management. Lessons learned from analyzing our new designs and observations of existing widgets are generalized into principles of widget design.
Title: Automated Software Deployment
Candidate: Jai, Benchiao
Advisor(s): Siegel, Alan
Abstract:
The work users do with an application can be divided into actual work accomplished using the application and overhead performed in order to use the application. The latter can be further partitioned based on the time at which the work is performed: before (application location and delivery), during (installation) and after (upgrade) the installation of the application. This category can be characterized as the software deployment overhead. This thesis presents a component architecture RADIUS (Rapid Application location, Delivery, Installation and Upgrade System) in which applications can be built with no software deployment overhead to the users. An application is deployed automatically by simply giving the user a document produced by the application. Furthermore, the facilities in RADIUS make the applications self-upgrading. In the end, the users perform no deployment overhead work at all.
The conventional way of using an application is to install the application first, then start using documents of the application. The object-oriented programming (OOP) paradigm suggests that this order should be reversed: the data should lead to the code. However, almost all software fails to meet this model of design at the persistence level. While modern software often uses OOP at the program level, the underlying operating systems do not support OOP at the document/file level. OOP languages use pointers to methods to indicate what operations can be performed on the objects. We extend the idea to include "pointers to applications". Each document has an attached application pointer, which is read by RADIUS when the document is opened. This application pointer is then used to locate and deliver the application module necessary for the document.
RADIUS is designed to be compatible with existing technologies and requires no extensions to either programming languages or operating systems. It is orthogonal to programming tools, is language-independent and compatible among operating systems, and consequently does not impose limitations on which environments the developers can use. We illustrate the implementations for the two most popular platforms today - C++ on Windows, and Java. RADIUS is also orthogonal to other component systems such as CORBA or COM and is easy to integrate with them.
Title: Toward Stronger User Authentication
Candidate: Monrose, Newman Fabian
Advisor(s): Kedem, Zvi
Abstract:
Password-based authentication is the dominant mechanism for verifying the identity of computer users, even though it is well known that people frequently choose passwords that are vulnerable to dictionary attacks. This talk addresses the issue of improving the security of password-based authentication, and presents authentication techniques that are more secure than traditional approaches against both on-line and off-line attacks.
We present a technique for strengthening the security of a textual password by augmenting it with biometric information such as the duration and latency of keystrokes during entry of the password. Thereby, both the password and the user's typing pattern are used to corroborate the user's identity. The technique presented adapts to gradual changes in a user's typing pattern while maintaining the same strengthened password across authenticated sessions. Moreover, our technique does not reveal which of a user's keystroke features are used to generate the corresponding strengthened password. This knowledge is hidden even from an attacker who captures all the system information used by the authentication server, and we show that our technique significantly increases the amount of work such an attacker must perform.
Additionally, we present an alternative technique for user authentication that exploits features of graphical input devices. We propose and evaluate ``graphical passwords'', which serve the same purpose as textual passwords, but consist of handwritten drawings, possibly in addition to text. Graphical passwords derive their strength from the fact that graphical input devices allow one to decouple the positions of inputs from the temporal order in which these inputs occur. We use this independence to build new password-based authentication schemes that are convincingly stronger than conventional methods.
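The combining idea behind keystroke-hardened passwords, stripped of the secret-sharing machinery that hides which features are in use, can be sketched as follows: each timing measurement is discretized against a threshold into a "fast/slow" bit, and the bits are folded into the password before hashing. The function name, thresholds, and use of SHA-256 here are illustrative assumptions, not the scheme's actual construction:

```python
import hashlib

def hardened_password(password, durations, thresholds):
    """Fold keystroke-timing bits into the password before hashing.
    durations[i] is the observed timing of feature i (e.g. a key-hold
    duration in seconds); thresholds[i] discretizes it into a
    'fast' (1) or 'slow' (0) bit."""
    bits = "".join("1" if d < t else "0" for d, t in zip(durations, thresholds))
    return hashlib.sha256((password + bits).encode()).hexdigest()
```

A consistent typist reproduces the same bits, and hence the same key, across sessions; an attacker who learns only the text password must additionally guess the typing pattern.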
Title: Optimization Over Symmetric Cones
Candidate: Nayakkankuppam, Madhu
Advisor(s): Overton, Michael
Abstract:
We consider the problem of optimizing a linear function over the intersection of an affine space and a special class of closed, convex cones, namely the symmetric cones over the reals. This problem subsumes linear programming, convex quadratically constrained quadratic programming, and semidefinite programming as special cases. First, we derive some perturbation results for this problem class. Then, we discuss two solution methods: an interior-point method capable of delivering highly accurate solutions to problems of modest size, and a first order bundle method which provides solutions of low accuracy, but can handle much larger problems. Finally, we describe an application of semidefinite programming in electronic structure calculations, and give some numerical results on sample problems.
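In symbols, the problem class described above is the standard conic program over a symmetric cone K (a common formulation consistent with the abstract, not necessarily the thesis's exact notation):

```latex
\min_{x \in \mathbb{R}^{n}} \; c^{T} x
\quad \text{subject to} \quad
A x = b , \qquad x \in K .
```

Taking K to be the nonnegative orthant recovers linear programming; taking K to be a product of second-order cones recovers convex quadratically constrained quadratic programming; taking K to be the cone of positive semidefinite matrices recovers semidefinite programming.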
Title: Efficient Computational Model for Energy Propagation in Geometrically Represented Large Environments
Candidate: Rajkumar, Ajay
Advisor(s): Perlin, Ken
Abstract:
Current radio propagation algorithms are narrowly focused on specific types of input models and do not scale well to increases in the number of receiver locations or the number of polygons in an input model. In this dissertation, we look at the problem of efficiently computing energy propagation at radio frequencies in a range of geometrically defined environments from a given transmitter location and for various transmitter and receiver characteristics. To achieve this goal, we propose a unified approach to radio propagation for different types of input models and their combinations as well, by representing the geometry as a binary space partitioning tree and broadcasting energy from the source. The approach both scales to large input models and dynamically adapts to their scale without incurring unreasonable computational cost. The proposed approach is equally effective for acoustic modeling.
We present a new adaptive ray-beam tracing algorithm which initially tessellates the surface of a transmitter into four-sided polygons. Each polygon is cast as a beam which avoids arbitrarily large gaps or overlaps between adjacent beams. For fast intersection computation each beam carries information of its medial ray as well. As the computation proceeds a ray-beam is adaptively subdivided depending on various parameters. The proposed algorithm has sublinear time complexity in terms of the number of receiver locations.
Modeling diffraction off an edge of a wedge is important to compute the radio signal that reaches the shadow region of the wedge. Storing these edges explicitly in a data structure can be very expensive for large input models, and especially for terrain-based models that have significant elevation variations. Instead of storing the edges statically, we present a new runtime edge-detection algorithm, along with its adaptation to environments represented as binary space partitioning trees.
We have developed a propagation prediction system called Propagate using these algorithms with good statistical correlation between predicted and measured results for a number of different input models. The proposed algorithms have been used to model several other important computations related to a cellular network of transmitters such as signal strength and path loss, delay spread, angular spread, carrier-to-interference ratio, and modeling of different antenna diversity schemes.
Title: Automatic Parallelization: An Incremental, Optimistic, Practical Approach
Candidate: Schwartz, Naftali
Advisor(s): Kedem, Zvi
Abstract:
The historic focus of Automatic Parallelization efforts has been limited in two ways. First, parallelization has generally been attempted only on codes which can be proven to be parallelizable. Unfortunately, the requisite dependence analysis is undecidable, and today's applications demonstrate that this restriction is more than theoretical. Second, parallel program generation has generally been geared to custom multiprocessing hardware. Although a network of commodity workstations (NOW) could theoretically be harnessed to serve as a multiprocessing platform, the NOW has characteristics which are at odds with effective utilization.
This thesis shows that by restricting our attention to the important domain of ``embarrassingly parallel'' applications, leveraging existing scalable and efficient network services, and carefully orchestrating a synergy between compile-time transformations and a small runtime system, we can achieve a parallelization that not only works in the face of inconclusive program analysis, but is indeed efficient for the NOW. We optimistically parallelize loops whose memory access behavior is unknown, relying on the runtime system to provide efficient detection and recovery in the case of an overly optimistic transformation. Unlike previous work in speculative parallelization, we provide a methodology which is not tied to the Fortran language, making it feasible as a generally useful approach. Our runtime system implements Two-Phase Idempotent Eager Scheduling (TIES) for efficient network execution, providing an Automatic Parallelization platform with performance scalability for the NOW.
Our transformation divides the original program into a server and zero or more clients. The server program is a specialization of the original application with each parallel loop replaced with a scheduling call to the client which comprises the body of that parallel loop. The scheduler remotely executes the appropriate instances of this client on available machines.
We describe the transformation and runtime system in detail, and report on the automatic transformation achieved by our implementation prototype in two case studies. In each of these cases, we were able to automatically locate the important coarse-grained loops, construct a shared-memory layout, and generate appropriate server and client code. Furthermore, we show that our generated parallel programs achieve near-linear speedups for sufficiently large problem sizes.
Title: Destructive Effect Analysis And Finite Differencing For Strict Functional Languages
Candidate: Yung, Chung
Advisor(s): Goldberg, Benjamin
Abstract:
Destructive update optimization is critical for writing scientific codes in functional languages. Pure functional languages do not allow mutations, destructive updates, or selective updates, so straightforward implementations of functional languages induce large amounts of copying to preserve program semantics. The unnecessary copying of data can increase both the execution time and the memory requirements of an application. Destructive update optimization makes an essential improvement to the implementation of functional programs with compound data structures, such as arrays, sets, and aggregates. Moreover, for many of the compiler optimization techniques that depend on side-effects, destructive update analysis provides the input for applying such techniques. Among other compiler optimization techniques, finite differencing captures common yet distinctive program constructions of costly repeated calculations and transforms them into more efficient incremental program constructions.
In this dissertation, we develop a new approach to destructive update analysis, called destructive effect analysis. We present the semantic model and the abstract interpretation of destructive effect analysis. We designed EAS, an experimental applicative language with set expressions. The implementation of the destructive effect analysis is integrated with the optimization phase of our experimental compiler of EAS. We apply finite differencing to optimize pure functional programs, and we show the performance improvement that results from applying the finite differencing optimization together with the destructive update optimization.
Title: Foveation Techniques and Scheduling Issues in Thinwire Visualization
Candidate: Chang, Ee-Chien
Advisor(s): Yap, Chee
Abstract:
We are interested in the visualization of large images across a network. Upon request, the server sends an image across the network to the client, who in turn, presents this image to the viewer. A key observation is that, at any moment, the viewer is mainly interested in a region around his gaze point in the image. To exploit this, we let the viewer interactively indicate this point, and the selected region is given higher priority in the transmission process. As a result, the displayed image is a ``space-variant'' image. A fundamental difference between this scheme and the usual progressive transmission scheme is that we place more emphasis on the visualization process. This shift in emphasis opens up new perspectives on the problem. In this thesis, we focus on this difference.
In chapter two, we formalize the operation of ``foveating an image'', study how to distribute the resolution over an image, and how to progressively refine such a space-variant image. Motivated by properties of human vision, we propose two methods for the construction of space-variant images. In chapter three, we formulate and study an abstract on-line scheduling problem which is motivated by interactions between the client and the server. In the fourth and last chapter, we describe details and issues in an implementation.
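One simple way to realize the "space-variant" idea is to let block resolution fall off with distance from the gaze point. The sketch below is a crude illustration under assumed parameters (a fixed block size and a hard falloff radius); it is not the thesis's construction, which distributes resolution according to properties of human vision:

```python
import numpy as np

def foveate(img, gaze, base_block=16):
    """Crudely foveate a grayscale image: blocks far from `gaze`
    (a (row, col) pair) are collapsed to their mean, i.e. shown at
    lower resolution, while blocks near the gaze point keep full detail."""
    out = img.astype(float).copy()
    h, w = img.shape
    for y in range(0, h, base_block):
        for x in range(0, w, base_block):
            cy, cx = y + base_block // 2, x + base_block // 2
            dist = np.hypot(cy - gaze[0], cx - gaze[1])
            # Beyond the falloff radius, transmit one value per block.
            if dist > 2 * base_block:
                block = img[y:y + base_block, x:x + base_block]
                out[y:y + base_block, x:x + base_block] = block.mean()
    return out
```

Under this rule the region around the gaze point would be transmitted first and at full resolution, while the periphery costs only one value per block until it is progressively refined.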
Title: Techniques to Improve the Performance of Software-based Distributed Shared Memory Systems
Candidate: Chu, Churngwei
Advisor(s): Kedem, Zvi
Abstract:
Software distributed shared memory systems are able to provide programmers with the illusion of global shared memory on networked workstations without special hardware support. This thesis identifies two problems in contemporary software distributed shared memory systems: (1) poor application programming interfaces for programmers who need to solve complicated synchronization problems and (2) inefficiencies in traditional multiple writer protocols. We propose a solution to both of these problems. One is the introduction of user-definable high level synchronization primitives to provide a better application programming interface. The other is the single-owner protocol to provide efficiency. In order to accommodate user-definable high level synchronization primitives, a variant of release consistency is also proposed.
User-definable high level synchronization primitives provide a paradigm for users to define their own synchronization primitives instead of relying on traditional low level synchronization primitives, such as barriers and locks. The single-owner protocol reduces the number of messages from O(n^2) (the number needed by the multiple-owner protocol) to Θ(n) when n writers first write to a page and then n readers read the page. Unlike some multiple-owner protocols, in the single-owner protocol garbage collection is performed asynchronously, and the size of a message for doing a memory update is smaller in most cases.
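The asymptotic comparison for the n-writers-then-n-readers scenario can be illustrated with a toy tally. This counts only page/diff transfers and ignores ownership and control traffic, so it is a simplification of both protocols:

```python
def multiple_owner_messages(n):
    # Each of the n readers must fetch a diff from each of the n writers.
    return n * n

def single_owner_messages(n):
    # Each writer forwards its update to the single owner (n messages),
    # then each reader fetches the merged page from the owner (n messages).
    return 2 * n
```

Already at n = 8 the tally is 64 transfers versus 16, and the gap widens linearly in n.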
We also evaluate the tradeoffs between the single-owner protocol and multiple-owner protocols. We have found that in most cases the single-owner protocol uses fewer messages than multiple-owner protocols, but there are some computations which may perform better with some multiple-owner protocols. In order to combine the advantages of both protocols, we propose a hybrid owner protocol which can be used to increase the efficiency in an adaptive way, with some pages managed by the single-owner protocol and some by a multiple-owner protocol.
Finally, five applications are evaluated using the single-owner protocol and a particular multiple-owner protocol called the lazy invalidate protocol. The performance of these two protocols is compared. We also demonstrate the use of user-definable high level synchronization primitives on one of the applications, and compare its performance against the same application constructed using only low-level synchronization primitives.
Title: Deformable Object Tabula Rasa: A Zoomable User Interface System
Candidate: Fox, David
Advisor(s): Perlin, Ken
Abstract:
This dissertation develops the concept of a zoomable user interface and identifies the design elements which are important to its viability as a successor to the desktop style of interface. The implementation of an example system named Tabula Rasa is described, along with the design and implementation of some sample applications for Tabula Rasa. We show how programming techniques such as delegation and multi-methods can be used to solve certain problems that arise in the implementation of Tabula Rasa, and in the implementation of Tabula Rasa applications.
Over the past thirty years the desktop or WIMP (Windows, Icons, Menus, Pointer) user interface has made the computer into a tool that allows non-specialists to get a variety of tasks done. In recent years, however, the applications available under this interface have become larger and more unwieldy, taking into themselves more and more marginally related functionality. Any inter-operability between applications must be explicitly designed in.
The Zoomable User Interface (ZUI) is a relatively new metaphor designed as a successor to the desktop interface. It is inspired by the Pad system, which is based on a zoomable surface of unlimited resolution. Just as the desktop interface has a set of essential elements, a ZUI has a set of elements each of which is vital to the whole. These include
These basic elements combine to produce an environment that takes advantage of the user's spatial memory to create a more expansive and dynamic working environment, as well as encouraging finer grained applications that automatically inter-operate with various types of data objects and applications.
Title: Metacomputing and Resource Allocation on the World Wide Web
Candidate: Karaul, Mehmet
Advisor(s): Kedem, Zvi
Abstract:
The World Wide Web is a challenging environment for distributed computing due to its sheer size and the heterogeneity and unreliability of machines and networks. Therefore, scalability, load balancing, and fault masking play an important role for Web-based systems. In this dissertation, I present novel mechanisms for resource allocation and parallel computing on the Web addressing these issues.
Large Web sites rely on a set of geographically dispersed replicated servers among which client requests should be appropriately allocated. I present a scalable decentralized design, which pushes the allocation functionality onto the clients. At its core lies a pricing strategy that provides incentives to clients to control the dispatching of requests while still allowing clients to take advantage of geographic proximity. An adaptive algorithm updates prices to deal with dynamic changes. A prototype system based on this architecture has been implemented and its functionality validated through a series of experiments.
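The client-side dispatching idea can be sketched as a cost rule: each client weighs a server's advertised price against its own measured proximity and picks the cheapest combination, so that raising a loaded replica's price steers new requests elsewhere. The additive cost function and weight below are illustrative assumptions, not the dissertation's actual pricing strategy:

```python
def choose_server(servers, latency_weight=1.0):
    """servers: list of (name, price, latency_ms) tuples as seen by one
    client; return the name minimizing price plus weighted latency."""
    return min(servers, key=lambda s: s[1] + latency_weight * s[2])[0]
```

With a high latency weight a client prefers the nearby replica despite its price; as the weight shrinks, price dominates and requests flow to the cheaper, more distant replica.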
Parallel computing on local area networks is based on a variety of mechanisms targeting the properties of this environment. However, these mechanisms do not effectively extend to wide area networks due to issues such as heterogeneity, security, and administrative boundaries. I present a prototype system which allows application programmers to write parallel programs in Java and allows Java-capable browsers to execute parallel tasks. It comprises a virtual machine model which isolates the program from the execution environment, and a runtime system realizing this machine on the Web. Load balancing and fault masking are transparently provided by the runtime system.
Title: Free Parallel Data Mining
Candidate: Li, Bin
Advisor(s): Shasha, Dennis
Abstract:
Data mining is the emerging field of applying statistical and artificial intelligence techniques to the problem of finding novel, useful, and non-trivial patterns in large databases. This thesis presents a framework for easily and efficiently parallelizing data mining algorithms. We propose a directed acyclic graph structure, the exploration dag (E-dag), to characterize the computation model of data mining algorithms in classification rule mining, association rule mining, and combinatorial pattern discovery. An E-dag can be constructively formed in parallel from specifications of a data mining problem, then a parallel E-dag traversal is performed on the fly to efficiently solve the problem. The effectiveness of the E-dag framework is demonstrated in biological pattern discovery applications.
We also explore data parallelism in data mining applications. The cross-validation and the windowing techniques used in classification tree algorithms facilitate easy development of efficient data partitioning programs. In this spirit, we present a new classification tree algorithm called NyuMiner that guarantees that every split in a classification tree is optimal with respect to any given impurity function and any given maximum number of branches allowed in a split. NyuMiner can be easily parallelized using the data partitioning technique.
This thesis also presents a software architecture for running parallel data mining programs on networks of workstations (NOW) in a fault-tolerant manner. The software architecture is based on Persistent Linda (PLinda), a robust distributed parallel computing system which automatically utilizes idle cycles. Templates are provided for application programmers to develop parallel data mining programs in PLinda. Together, the parallelization frameworks and the software architecture make free, efficient data mining realistic.
Title: Fast Algorithms for Discovering the Maximum Frequent Set
Candidate: Lin, Dao-I
Advisor(s): Kedem, Zvi
Abstract:
Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. Typical algorithms for solving this problem operate in a bottom-up, breadth-first search direction. The computation starts from frequent 1-itemsets (minimal length frequent itemsets) and continues until all maximal (length) frequent itemsets are found. During the execution, every frequent itemset is explicitly considered. Such algorithms perform reasonably well when all maximal frequent itemsets are short. However, performance drastically decreases when some of the maximal frequent itemsets are relatively long. We present a new algorithm which combines both the bottom-up and the top-down searches. The primary search direction is still bottom-up, but a restricted search is also conducted in the top-down direction. This search is used only for maintaining and updating a new data structure we designed, the maximum frequent candidate set. It is used to prune candidates in the bottom-up search. A very important characteristic of the algorithm is that it does not require explicit examination of every frequent itemset. Therefore the algorithm performs well even when some maximal frequent itemsets are long. As its output, the algorithm produces the maximum frequent set, i.e., the set containing all maximal frequent itemsets, thus specifying immediately all frequent itemsets. We evaluate the performance of the algorithm using well-known synthetic benchmark databases and real-life census and stock market databases. The improvement in performance can be up to several orders of magnitude, compared to the best current algorithms.
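The maximum-frequent-candidate-set pruning is the thesis' contribution and is not reproduced here, but the bottom-up skeleton it accelerates can be sketched in a few lines: enumerate frequent itemsets level by level and report those with no surviving frequent superset as maximal. The dataset and threshold below are illustrative.

```python
def maximal_frequent(transactions, min_support):
    """Toy bottom-up (Apriori-style) search for maximal frequent
    itemsets. An itemset is reported as maximal when no frequent
    superset survives to the next level. This is the baseline the
    thesis improves on; the top-down pruning is not shown."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    maximal = []
    while level:
        # join step: extend each frequent k-itemset by one item
        next_level = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1
                      and support(a | b) >= min_support}
        maximal.extend(s for s in level
                       if not any(s < t for t in next_level))
        level = list(next_level)
    return maximal
```

Note how this baseline touches every frequent itemset, which is exactly the cost the abstract says becomes prohibitive when maximal itemsets are long.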
Title: Algorithmic Techniques in Computational Genomics
Candidate: Parida, Laxmi
Advisor(s): Mishra, Bud
Abstract:
This thesis explores the application of algorithmic techniques in understanding and solving computational problems arising in Genomics (called Computational Genomics). In the first part of the thesis we focus on the problem of reconstructing physical maps from data, related to "reading" the genome of an organism, and in the second part we focus on problems related to "interpreting" (in a very limited sense) the genome. The main contributions of the thesis are understanding the computational complexity of, and designing algorithms for, some key problems in both these domains.
The primary goal of the Human Genome Project is to determine the entire three billion base pair sequence of the human genome and locate roughly 100,000 genes on the DNA. Recently, a set of single molecule methods (such as optical mapping) has been developed that allows one to create physical maps (a set of landmarks on the DNA whose locations are well defined), but can only do so by combining a population of data in the presence of errors from various sources. In the first part of the thesis, we focus on the problem of computing physical maps from data that arise in single molecule methods. We describe two combinatorial models of the problem, termed the Exclusive Binary Flip Cut (EBFC) and Weighted Consistency Graph (WCG) problems. We show that both problems are MAX SNP hard and give bounds on the approximation factors achievable. We give a polynomial-time 0.878-approximation algorithm for the EBFC problem and a 0.817-approximation algorithm for the WCG problem, using the maxcut approximation algorithm due to Goemans and Williamson. We also give a practical low-degree polynomial-time algorithm that works well on simulated and real data. Naksha is an implementation of this algorithm and a demonstration is available at http://www.cs.nyu.edu/parida/naksha.html. We also have similar results on complexity for generalizations of the problem which model various other sources of errors. We have generalized our complexity and algorithmic results to the case where there is more than one population in the data (which we call the K-populations problem). In the second part of the thesis, we focus on "interpreting" the genome. We consider the problem of discovering patterns (or motifs) in strings on a finite alphabet: we show that, by appropriately defining irredundant motifs, the number of irredundant motifs is only quadratic in the input size. We use these irredundant motifs in designing algorithms to align multiple genome or protein sequences.
Alignment of sequences aids in comparing similarities in the structure and function of proteins.
Title: Thinksheet: a Tool for Information Navigation
Candidate: Piatko, Peter
Advisor(s): Shasha, Dennis
Abstract:
Imagine that you are a ``knowledge worker'' in the coming millennium. You must synthesize information and make decisions such as ``Which benefits plan to use?'' ``What do the regulations say about this course of action?'' ``How does my job fit into the corporate business plan?'' or even ``How does this program work?'' If the dream of digital libraries is to bring you all material relevant to your task, you may find yourself drowning before long. Reading is harder than talking to people who know the relevant documents and can tell you what you're interested in. That is what many current knowledge workers do, giving rise to professions such as insurance consultant, lawyer, benefits specialist, and so on.
Imagine by contrast that the documents you retrieve could be tailored precisely to your needs. That is, imagine that the document might ask you questions and produce a document filtered and organized according to those you have answered.
We have been developing software that allows writers to tailor documents to the specific needs of large groups of readers. Thinksheet combines the technologies of expert systems, spreadsheets, and database query processing to provide tailoring capabilities for complex documents. The authoring model is only slightly more complex than a spreadsheet.
This thesis discusses the conceptual model and the implementation of Thinksheet, and applications for complex documents and metadata.
Title: Corpus-based Parsing and Sublanguage Studies
Candidate: Sekine, Satoshi
Advisor(s): Grishman, Ralph
Abstract:
There are two main topics in this thesis, a corpus-based parser and a study of sublanguage.
A novel approach to corpus-based parsing is proposed. In this framework, a probabilistic grammar is constructed whose rules are partial trees from a syntactically-bracketed corpus. The distinctive feature is that the partial trees are multi-layered. In other words, only a small number of non-terminals are used to cut the initial trees; other grammatical nodes are embedded into the partial trees, and hence into the grammar rules. Good parsing performance was obtained, even with small training corpora. Several techniques were developed to improve the parser's accuracy, including in particular two methods for incorporating lexical information. One method uses probabilities of binary lexical dependencies; the other directly lexicalizes the grammar rules. Because the grammar rules are long, the number of rules is huge - more than thirty thousand from a corpus of one million words. A parsing algorithm which can efficiently handle such a large grammar is described. A Japanese parser based on the same idea was also developed.
Corpus-based sublanguage studies were conducted to relate the notion of sublanguage to lexical and syntactic properties of a text. A statistical method based on word frequencies was developed to define sublanguages within a collection of documents; this method was evaluated by identifying the sublanguage of new documents. Relative frequencies of different syntactic structures were used to assess the domain dependency of syntactic structure in a multi-domain corpus. Cross-entropy measurements showed a clear distinction between fiction and non-fiction domains. Experiments were then performed in which grammars trained on individual domains, or sets of domains, were used to parse texts in the same or other domains. The results correlate with the measurements of syntactic variation across domains; in particular, the best performance is achieved using grammars trained on the same or similar domains.
The parsing and sublanguage techniques were applied to speech recognition. Sublanguage techniques were able to increase recognition accuracy, and some promising cases were found where the parser was able to correct recognition errors.
Title: Abstract Models of Distributed Memory Management
Candidate: Ungureanu, Cristian
Advisor(s): Goldberg, Benjamin
Abstract:
In this dissertation, we present a model suitable for reasoning about memory management in concurrent and distributed systems. The model provides a suitable level of abstraction: it is low-level enough that we can express communication, allocation, and garbage collection, but otherwise hides many of the lower-level details of an actual implementation. Using it, we can give compact, and provably correct, characterizations of garbage collection algorithms in distributed systems.
The models are rewriting systems whose terms are programs in which the ``code'' and the ``store'' are syntactically apparent. Evaluation is expressed as conditional rewriting and includes store and communication operations. Using techniques developed for communicating and concurrent systems we give a semantics suitable for proving equivalence of such programs. Garbage collection becomes a rewriting relation that removes part of the store without affecting the behavior of the program.
We introduce and prove correct a very general garbage collection rule based on reachability; any actual implementation which is capable of providing the transitions (including their atomicity constraints) specified by the strategy is therefore correct. We give examples of such specific implementations, and show how their correctness follows from the correctness of the general relation.
Title: Fault-tolerant parallel computing on networks of non-dedicated workstations
Candidate: Wyckoff, Peter
Abstract:
This thesis addresses fault tolerance issues in parallel computing on loosely-coupled networks of non-dedicated, heterogeneous workstations. The efficiency of fault tolerance mechanisms is dictated by network and failure characteristics. Traditional approaches to fault tolerance are efficient when network and failure characteristics are identical across workstations, such as in a local area network of homogeneous workstations; however, a loosely coupled network of non-dedicated workstations has non-uniform network and failure characteristics. This thesis presents the design and implementation of a flexible fault tolerance runtime system that allows each process in a parallel application to use one of three rollback recovery mechanisms. Rollback recovery is achieved using a lightweight form of transaction, which performance results show incurs almost no overhead. The system is built on top of the Linda coordination language and runs on Alpha, Linux, Solaris and SGI workstations and Java-enabled browsers. For barrier synchronous parallel applications, a new equi-distant checkpointing interval selection method, the expected maximum heuristic, is presented. The method is applicable to any rollback recovery system in which processes recover from failure independently and communicate through a reliable third party. Simulation results show that the expected maximum heuristic has near optimal performance under a variety of different failure rates and barrier lengths.
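The expected-maximum heuristic itself is developed in the thesis and is not reproduced here. As a point of comparison, the classic first-order approximation for an equi-distant checkpoint interval (Young's formula, which assumes independent failures and a fixed checkpoint cost) is simple enough to compute directly; it is a baseline, not the thesis' method.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal equi-distant
    checkpoint interval: sqrt(2 * C * MTBF), with C the time to take
    one checkpoint and MTBF the mean time between failures, in the
    same time units. A standard baseline, not the expected-maximum
    heuristic of the thesis."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

For example, a 30-second checkpoint cost and a 6-hour MTBF give an interval of roughly 19 minutes; shorter MTBFs push the interval down, as one would expect.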
Title: Multiscale Snakes: Resolution-Appropriate Shape Descriptions
Candidate: Baldwin, Bernard
Advisor(s): Geiger, Davi
Abstract:
We present a new type of "snake" in which the dimensionality of the shapes is scaled appropriately for the resolution of the images in which the shapes are embedded. We define shapes as an ordered list of control points and compute the principal components of the shapes in a prior training set. Our energy function is based upon the Mahalanobis distance of a given shape from the mean shape and on the Mahalanobis distance of the image attributes from image attribute values extracted from the training set. We show that the derivative of this energy function with respect to the modal weights is reduced as the image resolution is reduced, and that the derivative of the energy scales with the variance associated with each mode. We exploit this property to determine the subset of the modes which are relevant at a particular level of image resolution, thereby reducing the dimensionality of the shapes. We implement a coarse-to-fine search procedure in the image and shape domains simultaneously, and demonstrate this procedure on the identification of anatomic structures in Computed Tomography images and on the identification of military vehicles in range images.
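The shape term of such an energy is the Mahalanobis distance measured in the basis of the training set's principal modes, which is what makes mode-by-mode truncation natural. A minimal sketch follows; the variable names are ours, and the image-attribute term of the full energy is omitted.

```python
import numpy as np

def shape_energy(shape, mean_shape, modes, variances):
    """Mahalanobis distance of a shape from the training mean,
    measured in the PCA modal basis. `modes` holds one unit-norm
    principal mode per column; `variances` are the corresponding
    eigenvalues. Dropping low-variance columns of `modes` reduces
    the shape dimensionality, as in the coarse-to-fine search."""
    weights = modes.T @ (shape - mean_shape)  # modal coordinates
    return float(np.sum(weights ** 2 / variances))
```

Because each mode contributes weights squared over its variance, restricting the search to the high-variance modes at coarse resolution changes the energy only slightly while shrinking the search space.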
Title: Deformable Object Recognition with Articulations and Occlusions
Candidate: Liu, Tyng-Luh
Advisor(s): Geiger, Davi
Abstract:
The subject of this thesis is deformable object recognition. We concentrate on issues of articulations and of occlusions.
In order to find a target object (undergoing articulations) in an image we use the following procedures: (i) extracting key features in an image, (ii) detecting key points in the model, (iii) efficiently searching through possible image segmentations and (iv) comparing and grouping shapes. Together, they reconstruct the target object in the image. A Bayesian rationale is presented to justify this strategy.
Our main focuses in this thesis are on (iii) and (iv). More precisely, we are interested in shape representation, shape similarity and combining shape similarity with image segmentation.
We consider two possible shape representations for an object. The first is given by its shape contour (SC), or silhouette, and the other is described by the structure of its symmetry axis (SA), or skeleton, which has a unique free tree structure. For shape similarity, we review a string matching method based on the SC representation and then, we develop a tree matching scheme using the SA-tree representation. The advantage of this approach is that it becomes extremely simple to account for articulations and occlusions. As a novelty, the SA is obtained via a shape comparison between an SC and its mirror version. Finally we study how to integrate the shape module, for both shape representations (SC and SA), with an active contour tracker to yield an image segmentation.
Our efforts through all these issues have been to provide methods that are guaranteed to find optimal solutions.
We also address the topic of occluded object recognition, but from a different viewpoint. Our method treats it as a function approximation problem with an over-complete basis (a library of image templates), while also accounting for occlusions, where the basis superposition principle is no longer valid. Since the basis is over-complete, there are infinitely many ways to decompose the image. We are motivated to select a sparse/compact representation of the image and to account for occlusions and noise.
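The occlusion-aware decomposition is the thesis' contribution, but the baseline idea of selecting a sparse code from an over-complete library can be illustrated with plain matching pursuit. This sketch assumes unit-norm dictionary columns and does not model occlusion.

```python
import numpy as np

def matching_pursuit(y, D, n_iter):
    """Greedy sparse decomposition of signal y over dictionary D
    (columns assumed unit-norm). At each step, subtract the
    projection onto the best-matching atom; returns coefficients."""
    residual = y.astype(float).copy()
    coef = np.zeros(D.shape[1])
    for _ in range(n_iter):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best atom
        c = float(D[:, j] @ residual)
        coef[j] += c
        residual -= c * D[:, j]
    return coef
```

With an over-complete D, different greedy paths give different decompositions; the superposition assumption built into this residual update is precisely what breaks under occlusion, motivating the thesis' alternative.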
Title: Partial evaluation of concurrent programs
Candidate: Marinescu, Mihnea
Advisor(s): Goldberg, Benjamin
Abstract:
The goal of this dissertation is to develop partial evaluation (program specialization) techniques specific to concurrent programs.
The language chosen for this investigation is a very simple CSP-like language. A standard binding-time analysis for imperative languages is conservatively extended in order to deal with the basic concurrent constructs: synchronous communication and nondeterministic choice. Based on the resulting binding-time annotations, a specialization transformation is formally defined using a labeled transition system with actions. The correctness of the partial evaluation is stated and a proof is included. This result is closely related to (strong) bisimulation, the equivalence relation on transition systems. We name the two directions of the bisimulation equivalence soundness and completeness, respectively.
In order to maintain a clear presentation, this simple specialization algorithm addresses only the data transfer component of the communication; a post-specialization analysis for the detection and removal of redundant synchronizations (i.e. synchronizations whose removal does not increase the nondeterminism of a program) is presented separately. This redundant-synchronization analysis is based on the characterization of dependencies in a CSP-like language.
Several pragmatic issues such as improving the binding-time analysis, controlling loop unrolling and the consequences of lifting nondeterminism from run-time to specialization-time are discussed. Two additional binding-time analyses are presented. We call one of them speculative because the specialization transformation based on it is sound but not complete. We call the other one extended because it includes an on-line redundant-synchronization analysis.
The relationship between partial evaluation and different types of fairness is also studied. In order to deal with a wide range of fair run-time systems, ranging from strong to weak, and from process-fair to channel-fair and communication-fair, we use a general operational framework for specifying fairness properties as systematic means of reducing nondeterminism. We then prove the correctness (as bisimulation equivalence) or just the soundness of specialization transformations under various binding-time analyses.
Throughout the dissertation, the power of the newly developed techniques is shown in several examples.
Title: Pricing and Hedging Volatility Risk in Interest-Rate Derivatives
Candidate: Porras, Juan
Advisor(s): Avellaneda, Marco
Abstract:
This work addresses the problem of pricing interest-rate derivative securities and the use of quoted prices of traded instruments to calibrate the corresponding interest-rate dynamics. To this end, an arbitrage-free model of interest rate evolution is adopted, for which the local drift will depend on the history of volatility, thus leading to path-dependent pricing. This model is based on the Heath-Jarrow-Morton formulation but, in addition, presupposes that the volatility process is not defined a priori. This leads to a path-dependent model that can be formulated in a Markovian framework by considering additional state variables and hence increasing the dimensionality of the computation. Instead of solving the resulting 3-dimensional partial differential equation, an alternative approach, based on conditional expectations of the history of volatility, is taken. This pricing method is applied to a non-linear (adverse volatility) setting, and used as the core of a non-parametric model calibration technique. The algorithm, by performing an optimization over volatility surfaces, finds a volatility surface that matches the market prices of a given set of securities. This method also finds a hedge for volatility risk, using derivative securities as hedging instruments. In particular, we present results obtained for the problem of hedging American swaptions (options on interest-rate swaps) using European swaptions.
The conditional expectation approach is explored further, and found to be of interest in its own right for the pricing of several kinds of path-dependent instruments, providing an alternative to increasing state-space dimension in order to satisfy a Markov property. In particular, we show how this method speeds up the computation of prices for some types of exotic options, while being general enough to apply to both linear and non-linear pricing of portfolios.
Title: Performance Modeling for Realistic Storage Devices
Candidate: Shriver, Elizabeth
Advisor(s): Siegel, Alan; Wilkes, John
Abstract:
Managing large amounts of storage is difficult and becoming more so as both the complexity and the number of storage devices increase. One approach to this problem is a self-managing storage system. Since a self-managing storage system is a real-time system, it requires a model that quickly approximates the behavior of the storage device in a workload-dependent fashion. We develop such a model.
Our approach to modeling devices is to model the individual components of the device, such as queues, caches, and disk mechanisms, and then compose the components. To determine the performance of a component, each component modifies the entering workload use patterns and determines the performance from the workload use patterns and the lower-level device behavior. For example, modifying the use patterns allows us to capture the altered spatial locality that occurs when queues reorder their requests.
Our model predicts the device behavior in terms of response time within an 8% relative error for an interesting subset of the domain of devices and workloads. To demonstrate this, the model has been validated with synthetic traces of parallel scientific file system applications and traces of transaction processing applications.
This work makes several contributions to the area of performance modeling for storage devices.
Together, these mean that we have analytic methods to approximate the behavior of a set of realistic storage devices.
Title: Algorithms in Semi-Algebraic Geometry
Candidate: Basu, Saugata
Advisor(s): Pollack, Richard
Abstract:
In this thesis we present new algorithms to solve several very general problems of semi-algebraic geometry. Our algorithms are currently the best algorithms for solving these problems. In addition, we have proved new bounds on the topological complexity of real semi-algebraic sets, in terms of the parameters of the polynomial system defining them, which improve some old and widely used results in this field.
The first part of the thesis deals mainly with the decision problem for the first order theory of real closed fields, and the more general problem of quantifier elimination. We give algorithms which improve the complexity of all the previously known algorithms for these problems. Moreover, our techniques allow us to prove some purely mathematical theorems on the number of connected components and on the existence of small rational points in a given semi-algebraic set.
The second part of this work deals with connectivity questions of semi-algebraic sets. We develop new techniques in order to give an algorithm for computing roadmaps of semi-algebraic sets which improves on the complexity of the previous algorithms for this problem.
The third part of this work deals with bounding the topological complexity of semi-algebraic sets in terms of the number and the degrees of the polynomials describing them. We extend and improve a classical and widely used result of Oleinik and Petrovsky (1949), Thom (1965), and Milnor (1964), bounding the sum of the Betti numbers of semi-algebraic sets. Using the ideas behind this result, we give the first singly exponential algorithm for computing the Euler characteristic of an arbitrary semi-algebraic set.
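For reference, the classical Oleinik-Petrovsky-Thom-Milnor bound being extended here states, in one standard form, that for a real algebraic set $V \subseteq \R^n$ defined by polynomials of degree at most $d$, the sum of the Betti numbers satisfies

$$ \sum_i b_i(V) \;\le\; d(2d-1)^{n-1}. $$

Note that this bound depends only on the degree and the number of variables; separating out a combinatorial part that depends on the number of polynomials is the refinement described below.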
One common thread that links these results is that our bounds are separated into a combinatorial part (the part depending on the number of polynomials) and an algebraic part (the part depending on the degrees of the polynomials). The combinatorial part of the complexity of our algorithms is frequently tight and this marks the improvement of many of our results. This is most striking when one considers that in many applications, for instance in computational geometry, it is the number of polynomials which is the most important parameter (the degrees and the number of variables are usually small). Another important and new feature of some of our results is that when the given semi-algebraic set is contained in a lower dimensional variety, the combinatorial part of the complexity depends on the dimension of this variety rather than on the dimension of the ambient space. This is useful when one considers semi-algebraic sets which have low real dimension embedded in a higher dimensional space.
Title: Statistical Source Channel Models for Natural Language Understanding
Candidate: Epstein, Mark
Advisor(s): Grishman, Ralph
Abstract:
The problem of Natural Language Understanding (NLU) has intrigued researchers since the 1960s. Most researchers working in computational linguistics focus on linguistic solutions to their problems. They develop grammars and parsers to process the input natural language into a meaning representation. In this thesis, a new approach is taken. Borrowing from the field of communication theory, an information theoretic approach to natural language understanding is applied. This is based on the source-channel model of communication.
The source-channel model of NLU assumes that the user has a meaning in the domain of the application that he wishes to convey. This meaning is sent through a noisy channel. The observer receives the English sentence as output from the noisy channel. The observer then submits the English sentence to a decoder, which determines the meaning most likely to have generated the English. The decoder uses mathematical models of the channel and the meanings to process the English sentence. Thus, the following problems must be addressed in a source-channel model for NLU:
This dissertation focuses on the first two of these problems. Several mathematical models of the noisy channel are developed. They are trained from a corpus of context-independent sentence pairs consisting of both English and the corresponding meaning. The parameters of the models are trained to maximize the likelihood of the model's prediction of the observed training data using Dempster and Laird's Expectation-Maximization algorithm. Results are presented for the Air Travel Information Service (ATIS) domain.
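Whatever channel model is trained, the decoding step is a Bayes argmax over meanings. A minimal table-driven sketch (the dictionaries stand in for the trained prior and channel models; the meanings and sentence are made up for illustration):

```python
def decode(sentence, prior, channel):
    """Noisy-channel decoding: pick the meaning m maximizing
    P(m) * P(sentence | m). `prior` maps meaning -> P(m);
    `channel` maps meaning -> {sentence: P(sentence | m)}.
    Real systems decompose the channel term further; this sketch
    looks the whole sentence up directly."""
    return max(prior, key=lambda m: prior[m] * channel[m].get(sentence, 0.0))
```

The EM training described above estimates exactly the `channel` table's parameters from the paired English/meaning corpus.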
Title: Solving the Navier-Stokes Equations on a Distributed Parallel Computer
Candidate: Sabbagh, Hadil
Advisor(s): Peskin, Charles S.
Abstract:
Speed and space are two major issues in computational fluid dynamics today. Scalable parallel or distributed computers offer the promise of faster time to solve problems through parallelism and solutions to larger problems by adding more parallel processors with their own private memories. These systems use message passing to share data between processors. Parallel programs are difficult to write, especially for message passing systems, and there are few well-studied test cases.
In this dissertation, we solve the incompressible Navier-Stokes equations on a periodic cubic domain (3-torus). The numerical method is a finite difference method that consists of two parts: upwind differencing applied to the non-linear terms and solution of the Stokes equations. The latter are solved implicitly using a three-dimensional FFT. For the parallel implementation, the domain is divided into equally sized non-periodic cubic subdomains. Each subdomain is assigned to a processor; the processors form a process grid which is periodic. The parallel upwind differencing is preceded by an exchange of face data. The discrete Fourier transform in the Stokes solver is computed by applying one-dimensional FFTs sequentially in the three coordinate directions. In each coordinate direction, data must be exchanged only among those processors that lie on the same line of the process grid.
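The implicit Fourier-space solve can be illustrated in miniature on a scalar periodic Poisson problem; the Stokes solve in the thesis is analogous but vector-valued and distributed across processors. Grid size and forcing below are illustrative.

```python
import numpy as np

def poisson_periodic(f, length=2 * np.pi):
    """Solve lap(u) = f on a periodic cube (3-torus) by dividing
    by -|k|^2 in Fourier space. The k = 0 mode is set to zero, so
    the returned solution has zero mean (f must have zero mean)."""
    n = f.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)  # wavenumbers
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    fhat = np.fft.fftn(f)
    uhat = np.zeros_like(fhat)
    nonzero = k2 != 0
    uhat[nonzero] = -fhat[nonzero] / k2[nonzero]
    return np.real(np.fft.ifftn(uhat))
```

In the distributed version, the 3D transform is exactly the sequence of one-dimensional FFTs described above, with data exchanged along one process-grid direction at a time.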
The parallel algorithm was implemented twice: once using PVM and once using MPI. Although both implementations are described in the thesis, the performance of only the MPI version is discussed.
The Navier-Stokes solver is tested on the IBM SP-2. Three constant-problem-size experiments and three scalability experiments are used to analyze the performance of the solver. The fluid solver achieves a speedup of 48.8 when solving a 240 * 240 * 240 problem on 216 processors. Furthermore, there is evidence of scalability.
Title: Synthesis and Verification of Controllers for Robotics and Manufacturing Devices with Temporal Logic and the "Control-D" System
Candidate: Antoniotti, Marco
Advisor(s): Mishra, Bud
Abstract:
This dissertation studies the semi-automated synthesis and verification of control systems for robotics and manufacturing devices using formal methods in a discrete framework, and bears some resemblance to the theory of controlled discrete event systems (CDES) of Ramadge and Wonham. The discrete controller components of a walking machine and of a manufacturing line in the Combat Ration Advanced Manufacturing Technology Demonstration (CRAMTD) of Rutgers University are constructed automatically using the algorithms developed here.
The goal of this research has been to facilitate the integration of CDES theory with the specification and verification formalisms for finite state systems. Many of our techniques rely on the application of some flavor of temporal logic. In particular, the model-checking techniques of Clarke and Emerson, for branching time temporal logic, proved to be valuable in the implementation of a controller synthesis tool for CDES, called Control-D. The main synthesis algorithm used by the Control-D tool compares favorably with the Ramadge-Wonham algorithm in time and space complexity, while achieving improved expressiveness in its underlying specification language.
Title: Planning in an Imperfect World Using Previous Experiences
Candidate: Chiu, Jen-Lung
Advisor(s): Davis, Ernest
Abstract:
This thesis studies the problem of planning and problem solving in an unpredictable environment by adapting previous experiences. We construct a single-agent planning system, CADDY, and operate it in a simple golf-world testbed. The study of CADDY combines probabilistic, spatial, and temporal reasoning, the adaptation and reuse of plans, and the tradeoff between gains and costs under various considerations.
The CADDY planning system operates in an uncertain and unpredictable environment. Despite limited perception, incomplete knowledge, and imperfect motion control, CADDY achieves its goal efficiently by finding a plan that is already known to work well in a similar situation and applying repair heuristics to improve it. The capability of adapting experiences makes CADDY a planning system with learning capability.
In this thesis, we discuss the structure of the CADDY planning system and the results of experimental tests of CADDY applied to a simulated golf world. We compare CADDY with several other research projects on probabilistic planners and planners that utilize previous experiences. We also discuss how CADDY can be characterized in terms of theoretical work on plan feasibility. Finally, we point out possible directions for extending the system and for generalizing the ideas learned from CADDY to other problem domains. CADDY is not yet applied directly to real-world problems, but it shows an interesting and promising direction of study. By combining the techniques of probabilistic reasoning, planning, and learning, the performance of planning in real-world domains can be improved dramatically.
Title: Geodesic Problems in High Dimensions
Candidate: Choi, Joonsoo
Advisor(s): Yap, Chee
Abstract:
The geometric shortest path (geodesic) problem can be formulated as follows: given a collection of obstacles in $\R^d$ and source and target points $s, t \in \R^d$, find a shortest obstacle-avoiding path between s and t. This thesis studies the Euclidean geodesic problem in $\R^3$ with polyhedral obstacles and the rectilinear geodesic problem in $\R^d$ with pairwise-disjoint, axes-parallel boxes.
Computing Euclidean geodesics in $\R^3$ with polyhedral obstacles is known to be NP-hard. In contrast, Papadimitriou gave a polynomial-time approximation algorithm for this problem. Unfortunately, his complexity analysis involves an unusual mixture of both the algebraic computing model and the bit computing model. In the first part of the thesis, we present a true bit complexity analysis: there is an approximation algorithm that computes a geodesic with relative error $\epsilon > 0$ in $O((n^3M\log{M} + (nM)^2) \cdot \mu(W))$ time, where $M=O(nL/\epsilon)$, $W = O(\log(n/\epsilon)+L)$, and $\mu(W)$ is the time complexity of multiplying two W-bit integers. Our algorithm is a variant of Papadimitriou's algorithm.
The second part of the thesis addresses the rectilinear geodesic problem in $\R^3$ with a set of pairwise-disjoint, axes-parallel boxes. A monotonicity property of rectilinear geodesics is shown: every obstacle-avoiding geodesic between two points is monotone along at least one of the coordinate directions. Using this monotonicity property, an algorithm computing a geodesic from a query point to a fixed point is presented. The preprocessing time of the algorithm is $O(n^2 \log n)$, and each query takes $O(\log n + k)$ time, where k is the number of edges in the geodesic.
The last part of the thesis generalizes the above monotonicity property to all dimensions: given a set of pairwise-disjoint, axes-parallel boxes in $\R^d$, every obstacle-avoiding geodesic between two points is monotone along at least one of the coordinate directions.
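The monotonicity property lends itself to a direct check on concrete paths. The sketch below is only an illustration with hypothetical helper names, not code from the thesis; it reports the coordinate axes along which a given rectilinear polyline is monotone:

```python
def monotone_axes(path):
    """Return the indices of the coordinate axes along which the given
    path (a list of d-dimensional points) is monotone, i.e. its
    coordinates are entirely non-decreasing or entirely non-increasing."""
    if len(path) < 2:
        return set(range(len(path[0]))) if path else set()
    axes = set()
    for i in range(len(path[0])):
        diffs = [q[i] - p[i] for p, q in zip(path, path[1:])]
        if all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs):
            axes.add(i)
    return axes

# A staircase path in R^3: monotone along x and z, but not along y.
path = [(0, 0, 0), (1, 0, 0), (1, 2, 0), (1, 1, 1), (2, 1, 2)]
print(monotone_axes(path))  # {0, 2}
```

For the staircase path above, only the x and z coordinates never reverse direction, so axes 0 and 2 are reported.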
Title: Practical Structures for Parallel Operating Systems
Candidate: Edler, Jan
Advisor(s): Gottlieb, Allan
Abstract:
Large shared memory MIMD computers, with hundreds or thousands of processors, pose special problems for general purpose operating system design. Because of these difficulties, the algorithms, data structures, and abstractions of conventional operating systems are not well suited to highly parallel machines.
We describe the Symunix operating system for the NYU Ultracomputer, a machine with hardware support for Fetch&Phi operations and combining of memory references. Avoidance of serial bottlenecks, through careful design of interfaces and use of highly parallel algorithms and data structures, is the primary goal of the system. Support for flexible parallel programming models relies on user-mode code to implement common abstractions such as threads and shared address spaces. Close interaction between the kernel and user-mode runtime layer reduces the cost of synchronization and avoids many scheduling anomalies.
Symunix first became operational in 1985 and has been extensively used for research and instruction on two generations of Ultracomputer prototypes at NYU. We present data gathered from actual multiprocessor execution to support our claim of practicality for future large systems.
Title: Dreme: for Life in the Net
Candidate: Fuchs, Matthew
Advisor(s): Perlin, Ken
Abstract:
This dissertation makes four contributions towards supporting distributed, multi-user applications over open networks.
Dreme, a distributed dialect of the Scheme language in which all first-class language objects are mobile in the network. In particular, various distributed topologies, such as client/server and peer-to-peer, can be created by migrating closures with overlapping scopes around the network, correct inter-process communication being assured by Scheme's lexical scoping rules and network wide addressing. Threads of control are passed around through first-class distributed continuations.
A User Interface toolkit for coordinating events in multi-threaded, multi-user applications by organizing continuation callbacks into nested lexical scopes. Each event has certain attributes, such as whether it is synchronous or asynchronous, and certain events create new scopes with new events. Continuation callbacks allow both synchronous events, which return values to their callers, and asynchronous ones. Application logic needn't be spread throughout the application, as it is in applications built around an event loop.
A distributed garbage collection algorithm that collects all cycles on an open network. The basic algorithm depends on maintaining the inverse reference graph (IRG) among network nodes (i.e., if a->b is in the regular graph, b->a is in the IRG). A single IRG traversal from any object determines the status of each object touched. Communication is decentralized (any object can choose to determine its status), garbage is touched O(1) times (in the absence of failures), it is fault-tolerant, and can handle malicious or faulty neighbors. Each operation uses messages linear in the size of the IRG. Overlapping operations perform like parallel quick sort.
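The role of the IRG can be illustrated with a small sequential sketch. This omits the distribution, fault tolerance, and message accounting that are the substance of the algorithm, and all names are hypothetical: an object is garbage exactly when an IRG traversal starting from it fails to reach any root.

```python
from collections import defaultdict, deque

def build_irg(refs):
    """Invert the reference graph: if a references b, the IRG has b -> a."""
    irg = defaultdict(set)
    for a, targets in refs.items():
        for b in targets:
            irg[b].add(a)
    return irg

def is_garbage(obj, irg, roots):
    """Traverse the IRG from obj: obj is live exactly when the traversal
    reaches a root, i.e. some root references obj, perhaps transitively."""
    seen, frontier = {obj}, deque([obj])
    while frontier:
        x = frontier.popleft()
        if x in roots:
            return False
        for y in irg[x] - seen:
            seen.add(y)
            frontier.append(y)
    return True

# r -> a -> b -> a is a live cycle; c -> d -> c is an unreachable cycle.
refs = {"r": {"a"}, "a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}}
irg = build_irg(refs)
print(is_garbage("b", irg, roots={"r"}))  # False
print(is_garbage("d", irg, roots={"r"}))  # True
```

Note how the cycle c -> d -> c is correctly identified as garbage even though each of its members is referenced, which is exactly the case that defeats plain reference counting.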
An approach to using the Standard Generalized Markup Language (SGML) over the network to support distributed GUIs, intelligent clients, and mobile agents. SGML is a meta-grammar for creating domain specific document markup languages to which a variety of semantics (display, reading/writing databases, etc.) can be applied. The document, its grammar, and some semantics, are retrieved over the network. Applications normally create interfaces directly out of graphic objects to communicate with the user. However, if the interface has some semantics (and is parsable), a computational agent can interpret the interface and talk directly to the application on behalf of the human.
Title: Fault-tolerant Parallel Processing Combining Linda, Checkpointing, and Transactions
Candidate: Jeong, Karpjoo
Advisor(s): Shasha, Dennis
Abstract:
With the advent of high performance workstations and fast LANs, networks of workstations have recently emerged as a promising computing platform for long-running coarse grain parallel applications. Their advantages are wide availability and cost-effectiveness, as compared to massively parallel computers. Long-running computation in the workstation environment, however, requires both fault tolerance and the effective utilization of idle workstations.
In this dissertation, we present a variant of Linda, called Persistent Linda (PLinda), that treats these two issues uniformly: specifically, PLinda treats non-idleness as failure.
PLinda provides a combination of checkpointing and transaction support on both data and program state (an encoding of continuations). The traditional transaction model is simplified and then extended to support robust parallel computation. Treatable failures include hard failures and slowdowns of processors and main memory, and network omission and corruption failures.
The programmer can customize fault tolerance when constructing an application, trading failure-free performance against recovery time. When creating a PLinda program, the programmer can decide on the frequency of transactions and the encoding of continuations to be saved upon transaction commit. At runtime, the programmer can decide to suppress certain continuations for better failure-free performance.
PLinda has been applied to corporate bond index statistics computation and biological pattern recognition.
Title: A Model-Based 3-D Object Recognition System Using Geometric Hashing with Attributed Features
Candidate: Liu, Jyhjong
Advisor(s): Hummel, Robert
Abstract:
We build an object recognition system that is able to recognize 3-D objects, such as vehicles, embedded in highly complicated backgrounds. We use the geometric hashing method, augmenting the approach through the use of attributed features, k-d trees for access to features, and the use of bounds in order to limit the search.
We make use of expressive features to improve the performance of a geometric hashing object recognition system. Various kinds of attributed features, such as the midpoint of a line segment with its orientation, the endpoints of a line segment with its orientation, and circle features with their centers, are extracted and used in our system.
The number of features, as well as the type of features, in each model can vary. We make use of weighted voting, which has a Bayesian interpretation. The distribution of the invariants for various features, as well as the bounds of the weighted voting formula, are analyzed. In order to improve the performance of the system, we use a k-d tree to search entries in high-dimensional hash tables. The method is generalized to treat variables taking values from a non-interval domain, such as data measuring angles. To make use of available computer resources, we distribute the computation, assigning evidence accumulation for a single hypothesis to one processor in a multiple-processor, multiple-workstation environment. The implementation reduces the communication overhead to a minimum. The system is implemented using the Khoros software development system.
The results of target recognition are reported in numerous experiments. The experiments show that the use of more expressive features improves the performance of the recognition system.
Title: Grasping and Fixturing: a Geometric Study and an Implementation
Candidate: Teichmann, Marek
Advisor(s): Mishra, Bud
Abstract:
The problem of immobilizing an object by placing ``fingers'' (or points) on its boundary occurs in the fields of dexterous manipulation, manufacturing, and geometry. In this dissertation, we consider the purely static problems of good grasp and fixture set synthesis, and explore their connection to problems in computational and combinatorial geometry. Two efficient randomized approximation algorithms are proposed: one finds the smallest cover for a given convex set, and the other finds the largest magnitude by which a convex set can be scaled and still be covered by a cover of a given size. They generalize an algorithm by Clarkson. The cover points are selected from a set of n points. The following bounds are valid for both types of problems; for the former, c is the size of the optimal cover, and for the latter, c is the desired cover size. In both cases, a cover of size $4cd \lg c$ is returned.
The running time depends on the set to be covered. Covering an n-vertex polytope in $R^d$ takes $O(c^2 n \log n \log c)$ expected time, and covering a ball takes $O(nc^{1+\delta}+c^{\lfloor{d/2}\rfloor+1}\log n\log^{\lfloor{d/2}\rfloor} c)$ expected time. These algorithms have applications to finding a good grasp or fixture set. An $O(n^2 \log n)$ algorithm for finding optimal 3-finger grasps for n-sided polygons is also given.
We also introduce a new grasp efficiency measure based on a certain class of ellipsoids, invariant under rigid motions of the object coordinate system. To our knowledge, this is the first measure having this property. We also introduce a new reactive grasping paradigm which does not require a priori knowledge of the object. This paradigm leads to several reactive algorithms for finding a grasp for parallel jaw grippers and three finger robot hands equipped with simple sensors. We show their correctness and discuss our implementation of one such algorithm: a parallel jaw gripper with light-beam sensors which we have built. A short video demonstration will also be shown.
Title: Systolic Combining Switch Designs
Candidate: Dickey, Susan
Advisor(s): Gottlieb, Allan
Abstract:
High-performance VLSI switches are needed in the interconnection network of massively parallel shared memory multiprocessors. The switch designs we consider alleviate the ``hot spot'' problem by adding extra logic to the switches to combine conventional loads and stores, as well as fetch-and-$\phi$ operations, destined for the same memory location. The performance of three buffered switch architectures was investigated through probabilistic analysis and simulation: Type A switches, with k queues, one at each output, each accepting k inputs per cycle; and two one-input-queue designs, Type B switches, with $k^2$ output queues, and Type C switches, with k input queues. While the Type C switch is less expensive, Types A and B have considerably better performance. An efficient CMOS implementation for systolic queue designs was devised. A non-combining switch containing these systolic queues was fabricated through MOSIS in 3 micron CMOS; it employed the NORA clocking methodology, using qualified clocks to distribute global control.
A combining switch was fabricated in 2 micron CMOS for use in the 16 by 16 processor/memory interconnection network of the NYU Ultracomputer prototype. Details are given about the internal logic of the two component types used in the network. A design usable in networks of size up to 256 * 256 has been prepared for fabrication by NCR at a smaller feature size in a higher pincount package. Differences in the logic partitioning of the two designs are described. We describe the performance of these designs for systems of up to 1024 PEs obtained through simulation. Our experience in implementing a combining switch indicates that the cost of hardware combining is much less than is widely believed. We compare the cost of a combining switch to that of a non-combining switch and discuss the scalability of the implemented design to large numbers of processors. Differences in the capabilities of combining switch architectures are studied. We describe the implementation of ``two-and-a-half-way'' combining, which promises to avoid network saturation in large networks at only slightly greater cost than two-way combining. We also discuss implementation alternatives and performance for a 4 by 4 combining switch.
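The combining idea itself is simple to state in software terms: fetch-and-add requests to the same location that meet in a switch are merged into a single request, and the fetched value is ``decombined'' on the way back so that each requester sees the value it would have seen under some serial order. The sketch below is only an illustration of that bookkeeping, with hypothetical names, not the switch logic:

```python
def combine(requests):
    """Merge fetch-and-add requests to the same address into one request
    per address, recording each request's offset for later decombining."""
    merged, offsets = {}, []
    for addr, inc in requests:
        offsets.append((addr, merged.get(addr, 0)))
        merged[addr] = merged.get(addr, 0) + inc
    return merged, offsets

def apply_and_decombine(memory, merged, offsets):
    """Apply one combined fetch-and-add per address, then reconstruct the
    value each original request would have fetched under a serial order."""
    fetched = {}
    for addr, total in merged.items():
        fetched[addr] = memory.get(addr, 0)
        memory[addr] = fetched[addr] + total
    return [fetched[addr] + off for addr, off in offsets]

memory = {0x10: 5}
merged, offsets = combine([(0x10, 1), (0x10, 2), (0x10, 4)])
print(apply_and_decombine(memory, merged, offsets))  # [5, 6, 8]
print(memory[0x10])                                  # 12
```

The three requests touch memory only once, yet receive 5, 6, and 8, exactly the values a serial execution of the three fetch-and-adds would return; this is why combining relieves hot-spot contention without changing program semantics.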
Title: Gedanken: A tool for Pondering the Tractability of Correct Program Technology
Candidate: Ericson, Lars
Advisor(s): Mishra, Bud
Abstract:
We examine the feasibility of the Correct Program Technology (CPT) approach to program verification using available technology, with pessimistic results.
We compare CPT with RAPTS and the Calculus of Constructions. We specify the Correct Programmer's Workbench (CPW), and review six programming environments as platforms. We define a Correct Program Editor and prototype it in Mathematica.
CPT applies decision procedures for specification sublanguages to make shorter proofs, hoping these shorter proofs will have faster verifications, but these sublanguages are NP-Complete or worse. We review some heuristics for improving their average case. CPT relies on a sublanguage of set theory, MLS. We prove that MLS is NP-average complete in the sense of the Levin-Gurevich theory of average case complexity. We conjecture that shorter proofs of random theorems cost more to verify.
EMLS is an elementary relational language (ERL). We define syntactic simplification rule sets (SSRs) for ERLs. The average case effect of an SSR is determined by the number of matches of the SSR with ERL sentences of n individual variables. EMLS sentences over n variables can be constructed from sentences in $L_{4,2,n}$ and $L_{2,3,n}$, where $L_{k,m,n}$ is the language of k relations of m arguments over n variables. We recursively define a match-counting algorithm for $L_{k,m,n}$ SSRs and extend it to EMLS. If an SSR has p patterns in w pattern variables over n individual variables, match counting costs $O(pn^w 2^{pn^{w-1}}(2+kn^m))$. Match counting for $L_{k,0,0}$ is in #P, and we conjecture that it is #P-complete. We conjecture that generating functions do not yield a method of approximating the number of matches, and that the problem of approximating matches is also #P-complete. We count the matches for low n for some EMLS SSRs, with discouraging results, and note that the matches of an effective rule set must grow as the size of the language for n variables.
We conclude that the remaining hope for verification is to build a large library of specification language constructs which occur frequently and can be verified in polynomial time.
Title: Designing Pattern Matching Algorithms by Exploiting Structural Pattern Properties
Candidate: Hariharan, Ramesh
Advisor(s): Cole, Richard
Abstract:
Exact Complexity of String Matching: We consider the question of how many character comparisons are needed to find all occurrences of a pattern string of length m in a text string of length n. We show an almost tight upper bound of the form $n + O(n/m)$ character comparisons, following preprocessing. Specifically, we show an upper bound of $n + \frac{8}{3(m+1)}(n-m)$ character comparisons. The following lower bounds are also shown: for on-line algorithms, a bound of $n + \frac{9}{4(m+1)}(n-m)$ character comparisons for $m = 35+36k$, for any integer $k \geq 1$; and for general algorithms, a bound of $n + \frac{2(n-m)}{m+3}$ character comparisons, for $m = 2k+1$, for any integer $k \geq 1$.
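For a concrete point of reference, the classical Knuth-Morris-Pratt scan (not the algorithm of the thesis) can be instrumented to count character comparisons; its scan phase uses at most 2n comparisons, noticeably weaker than the bound above. A sketch:

```python
def kmp_search(text, pattern):
    """Classical KMP scan, instrumented to count character comparisons
    (preprocessing of the pattern is excluded, as in the stated bounds)."""
    m = len(pattern)
    fail = [0] * m                       # failure function of the pattern
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    occurrences, comparisons, k = [], 0, 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:     # each mismatch costs one comparison
            comparisons += 1
            k = fail[k - 1]
        comparisons += 1                 # the final comparison at this position
        if c == pattern[k]:
            k += 1
        if k == m:                       # full match ending at position i
            occurrences.append(i - m + 1)
            k = fail[k - 1]
    return occurrences, comparisons

print(kmp_search("ababcababcab", "ababc"))  # ([0, 5], 12)
```

Here the scan spends roughly one comparison per text character; the thesis's point is how close to exactly n that count can be driven in the worst case.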
Parallel Two-Dimensional Pattern Matching: We give the first time-, space- and work-optimal common CRCW-PRAM algorithm for finding all occurrences of a two-dimensional pattern of size $m_1 \times m_2$ in a two-dimensional text of size $n_1 \times n_2$. Our algorithm runs in $O(1)$ time performing $O(n_1 \times n_2)$ work, following preprocessing of the pattern. A major portion of the preprocessing step is the computation of witnesses for the pattern. We show how to compute witnesses for the pattern in $O(\log\log m_2)$ time and $O(m_1 \times m_2)$ work when $m_2 \geq m_1$. In the process of designing the above algorithm, we also obtain some new periodicity properties of two-dimensional patterns.
Parallel Suffix Tree Construction: We consider the problem of constructing the suffix tree of a given string s of length m in parallel. An $O(m)$-work, $O(m)$-space, $O(\log^4 m)$-time CREW-PRAM algorithm for constructing the suffix tree of s is obtained when s is drawn from any fixed alphabet set. This is the first work- and space-optimal parallel algorithm known for this problem. It can be generalized to construct the suffix tree of a string s drawn from any general alphabet set in $O(\log^4 m)$ time, $O(m\log |\Sigma|)$ work, and $O(m\log |\Sigma|)$ space, after the characters in s have been sorted alphabetically; here $|\Sigma|$ is the number of distinct characters in s. In this case too, the algorithm is work optimal.
Title: Compilation of Array-Style Programs for Distributed Memory MIMD Machines: a Geometric Approach
Candidate: Katz, Alex
Advisor(s): Schonberg, Edmond
Abstract:
Distributed memory MIMD (Multiple Instruction Multiple Data) machines are emerging as a cost-effective means of speeding up numerically intensive programs. They scale more easily than other parallel machines. But writing explicitly parallel programs for these machines is both difficult and error prone. Compilers for languages like HPF make the task easier by generating the necessary inter-processor communication from the data distribution directives supplied by the programmer. This dissertation shows that for a large class of array-style programs automatic data distribution can produce a significant speedup on a distributed memory MIMD machine. Array-style programs use array primitives to manipulate entire arrays, rather than looping explicitly over the array elements. APL programs are typically array-style.
We show how to apply automated data distribution to APL programs, which treat arrays and operations on them as atomic. Automated data distribution determines the necessary inter-processor communication from the way APL primitives manipulate entire arrays, rather than by complex algebraic analysis of the patterns of array subscripts, as would be done in more conventional compilers. A simple distribution and alignment scheme automatically distributes arrays across the available processors. Arrays can be dynamic, with sizes varying during program execution. Data distribution is guided by array size estimates. Distribution trade-off analysis attempts to optimize the initial distribution by comparing the estimated communication and computation times, and replicating arrays whose partitioning results in excessive communication.
Building on the APL to C compiler developed by W.-M. Ching, we produce explicitly parallelized C, from APL source programs. We describe the parallel implementation of most of the APL primitives. The implementation of several APL primitives uses the monotonic data movement algorithm. The ideas developed are demonstrated with eight APL programs of varying complexity. We show the speedup and efficiency obtained when running these programs on 2 to 32 processors. The speedup achieved on 32 processors, ranging from 7 to 30, shows the technique to be applicable to a wide range of programs.
Title: Lazy SETL Debugging with Persistent Data Structures
Candidate: Liu, Zhiqing
Advisor(s): Schwartz, Jack
Abstract:
Debugging tools have traditionally been difficult to use, particularly in accumulating and exploring program runtime information. This dissertation addresses these issues by proposing a lazy debugging approach, which postpones the investigation of debugging hypotheses until a complete runtime history is available. This approach encourages a systematic way of debugging and supports many high-level debugging facilities. Recent advances in persistent data structures drastically reduce the time and memory overhead incurred in recording and storing execution events, and also make the overhead easily manageable.
To demonstrate this approach, a visual SETL debugger prototype has been designed and implemented based on D. Bacon's SETL translator. This debugger has a persistent runtime system designed using the persistent data structures of the node-splitting type developed by Driscoll et al. It can efficiently record changes in program execution state under different recording granularities, while also supporting normal SETL execution. Users of this debugger are provided with a graphical interface, which supports many powerful tools, such as forward/backward control/data breakpoints, interactive variable printing, program animation, and re-execution from a recorded execution moment.
A strong set of conclusions can be drawn from an evaluation of the debugger's performance and usability issues, as well as the limitations and open questions of this debugging approach.
Title: Searching for Strings and Searching in Presence of Errors
Candidate: Muthukrishnan, S.
Advisor(s): Spencer, Joel
Abstract:
This dissertation deals with two classes of searching problems. The first class consists of pattern matching problems, and the second comprises combinatorial searching problems in the presence of errors in the responses to queries. Our results are as follows.
Standard Stringology. Standard Stringology is the study of pattern matching problems in which a text location matches a pattern location provided the associated symbols are identical. The basic problem here is the string matching problem of detecting all occurrences of a pattern string in a text string. This naturally generalizes to the dictionary matching problem of finding all occurrences of a set of patterns, rather than a single pattern, in a given text. Very fast optimal parallel algorithms exist for string matching in the PRAM model. These algorithms rely on structural properties of the strings. Unfortunately, these structural properties are not useful for solving the dictionary matching problem. We have obtained the fastest and most work-efficient algorithms known for this problem and a number of its variants by introducing and using a new technique called shrink-and-spawn.
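As a sequential baseline (the thesis's contribution is the parallel technique, not this), dictionary matching is classically solved with the Aho-Corasick automaton: a trie of the patterns with failure links, scanned once over the text. A compact sketch with helper names of my own choosing:

```python
from collections import deque

def build_automaton(patterns):
    """Trie of the patterns plus failure links and output sets."""
    trie, fail, out = [{}], [0], [[]]
    for p in patterns:
        node = 0
        for c in p:
            if c not in trie[node]:
                trie.append({}); fail.append(0); out.append([])
                trie[node][c] = len(trie) - 1
            node = trie[node][c]
        out[node].append(p)
    queue = deque(trie[0].values())        # breadth-first over the trie
    while queue:
        node = queue.popleft()
        for c, child in trie[node].items():
            queue.append(child)
            f = fail[node]
            while f and c not in trie[f]:  # follow failure links
                f = fail[f]
            fail[child] = trie[f].get(c, 0)
            out[child] += out[fail[child]] # inherit shorter matches
    return trie, fail, out

def find_all(text, patterns):
    """Report (position, pattern) for every occurrence of every pattern."""
    trie, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, c in enumerate(text):
        while node and c not in trie[node]:
            node = fail[node]
        node = trie[node].get(c, 0)
        for p in out[node]:
            hits.append((i - len(p) + 1, p))
    return hits

print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The scan is linear in the text length regardless of how many patterns are in the dictionary, which is exactly the property one wants to preserve, and parallelize, in the PRAM setting.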
Non-Standard Stringology. In problems from Non-Standard Stringology, an arbitrary many-to-many matching relation holds between the text and pattern locations. An example is string matching with ``don't cares'' where the position in the text that has a ``don't care'' symbol matches every pattern position. The inherent complexity and structure of such non-standard string matching problems is not well understood. Our main results are inherent complexity bounds for these problems, characterized in terms of algebraic convolutions. Traditionally structure in pattern matching has meant repetitions in patterns, but this work exposes a novel graph-theoretic structure in these problems.
Searching in the presence of errors. Given a set of items containing one or more distinguished items, the generic combinatorial search problem is to determine the distinguished item(s) using detection tests on groups of items. Motivated by fault-tolerance issues, we consider the scenario in which some tests get incorrect responses. We have developed a strategy that solves the generic problem above using at most one test more than necessary, even under adversarial placement of the incorrect responses.
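A naive strategy makes the role of redundant tests concrete: if at most e responses can be wrong, repeating every test 2e+1 times and taking the majority guarantees correctness, at the cost of many extra tests (whereas the thesis's strategy needs only one extra test). A sketch with hypothetical names:

```python
class LyingOracle:
    """Answers 'is the target <= mid?', but may answer incorrectly up to
    `lies` times in total (here: it lies on the first `lies` calls)."""
    def __init__(self, target, lies):
        self.target, self.lies = target, lies
    def __call__(self, mid):
        truth = self.target <= mid
        if self.lies:
            self.lies -= 1
            return not truth
        return truth

def search_with_lies(n, oracle, e):
    """Binary search over 0..n-1 when at most e oracle responses may be
    wrong: ask each test 2e+1 times and trust the majority, which must
    contain at least e+1 truthful answers."""
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        votes = sum(oracle(mid) for _ in range(2 * e + 1))
        if votes > e:        # majority says target <= mid
            hi = mid
        else:
            lo = mid + 1
    return lo

print(search_with_lies(32, LyingOracle(13, lies=1), e=1))  # 13
```

This uses (2e+1) log n tests where roughly log n + e would suffice, which is precisely the gap that careful strategies for this problem close.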
Title: Visual Programming
Candidate: Nickerson, Jeffrey
Advisor(s): Schonberg, Edmond
Abstract:
While computer science textbooks and classroom lectures are filled with diagrams, and much of our design activity as programmers takes place on whiteboards, we write our programs as text. Proponents of visual programming suggest that we should take advantage of graphic user interface technology and draw rather than write our programs. This dissertation examines the extent to which this is possible, addressing the question of how graphic representation can best be used in the process of programming.
The use of diagrams in the field of computer science is thoroughly surveyed, and some underlying principles identified. The visual conventions of Adjoinment, Linking, and Enclosure are defined and illustrated. Three languages are developed: a simple programming language that encompasses shell commands, a visual version of APL, and a visual front end for Mathematica. The visual version of APL is notable in that it presents both a program and instances of data undergoing transformation as part of one unified diagram.
Building on the work of R. J. A. Buhr, new visual system-design conventions are created to handle the intricacies of facilities in the Ada9X language. Asynchronous transfers of control, requeueing, and generic formal parameters are addressed. The asynchronous transfer of control convention is suitable for CASE representations of the language construct, and can be easily animated.
Some existing software metrics are modified for use in analyzing diagrams, and two new metrics are proposed: graphic token count and diagram class complexity. A graphic design measure, data density, is transformed into a computer science measure, token density. Using these metrics, graphic representations can be compared to each other and to textual representations. From this, a strong set of conclusions is drawn about the relative strengths of graphic and textual representation, as well as the limits and possibilities of graphic representation in programming.
Title: Representing Control in Parallel Applicative Programming
Candidate: Yao, Chi
Advisor(s): Goldberg, Benjamin
Abstract:
This research is an attempt to reason about the control of parallel computation in the world of applicative programming languages.
Applicative languages, in which computation is performed through function application and in which functions are treated as first-class objects, have the benefits of elegance, expressiveness, and clean semantics. Parallel computation and real-world concurrent activities are much harder to reason about than their sequential counterparts. Many parallel applicative languages have thus hidden most control details behind their declarative programming styles, but they are not expressive enough to characterize many real-world concurrent activities that can easily be explained with concepts such as message passing, pipelining, and so on. Ease of programming should not come at the expense of expressiveness. Therefore, we design a parallel applicative language, Pscheme, in which programmers can express explicitly the control of parallel computation while maintaining the clean semantics and ease of programming of applicative languages. In Pscheme, we propose the concept of ports to model general control in parallel computation. Through program examples, we show how Pscheme and ports support various parallel programming paradigms. We have also built libraries of higher-level control facilities with ports, so that programming in Pscheme becomes easier.
We provide an operational semantics for Pscheme, and develop a compiler and a run time system on NYU's Ultracomputer. Our experiments with parallel programs have shown satisfactory speedup. We claim that ports are the natural parallel extensions of continuations in sequential computation, and thus conclude that representing general control in parallel applicative programming is feasible.
Title: Cell-based Computer Models in Developmental Biology
Candidate: Agarwal, Pankaj
Advisor(s): Schwartz, Jacob T.
Abstract:
In developmental biology, modeling and simulation play an important role in understanding cellular behavior. We suggest a simple language, the Cell Programming Language (CPL), to write computer programs to describe this behavior. Using these programs, it is possible to simulate and visualize cell behavior.
A genome is the program for the development of an organism. The genome, in conjunction with the environment, determines the behavior of each cell of the organism. The program for each cell (written in CPL) plays the role of its genome. The program for an individual cell consists of a set of states. In each state, rules are specified that determine the cell's properties (e.g., shape, motility, and the concentrations of various molecular species). Different states of the same cell signify different phases in the cell's life. Each cell has a tissue type associated with it; cells of the same tissue type execute the same CPL program.
We use the discrete time simulation model. At every time step, each cell executes all the instructions in its present state sequentially. All cells are assumed to be executing in parallel, with synchronization performed after every time step.
The cells are two-dimensional. Each cell has a physical location comprising a collection of discrete connected points. This physical presence imparts to the cells the attributes of area, perimeter, and neighbors (other cells). The neighbor attribute forms the basis for all intercellular communication.
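The lockstep, discrete-time execution model described above can be sketched in a few lines of Python. This is a hypothetical rendering, not actual CPL syntax: the tissue name, the states, and the decay rule are invented for illustration.

```python
# Sketch of CPL's execution model: each cell runs the rules of its current
# state once per time step; all cells update in lockstep (synchronization
# after every step). The "program" for a tissue plays the role of a genome.

class Cell:
    def __init__(self, tissue, state, **props):
        self.tissue = tissue      # cells of the same tissue share a program
        self.state = state        # current phase of the cell's life
        self.props = props        # e.g. concentrations of molecular species

# Rules map old properties to (new_state, new_props).
def diffuse(props):
    # toy rule: a chemical decays by 10% per step; switch state when depleted
    c = props["conc"] * 0.9
    return ("signaling" if c < 0.5 else "resting"), {"conc": c}

PROGRAMS = {("slime_mold", "resting"): diffuse,
            ("slime_mold", "signaling"): lambda p: ("signaling", p)}

def step(cells):
    # compute all updates first, then commit: this models the synchronization
    # performed after every time step
    updates = [PROGRAMS[(c.tissue, c.state)](c.props) for c in cells]
    for c, (s, p) in zip(cells, updates):
        c.state, c.props = s, p

cells = [Cell("slime_mold", "resting", conc=1.0) for _ in range(3)]
for _ in range(10):
    step(cells)
```

After ten steps every cell has decayed its chemical below the threshold and moved to the "signaling" state.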
The language contains features for specifying:
We have employed CPL to model the following: aggregation in cellular slime mold in response to a chemotactic agent; the formation of skeletal elements in the vertebrate limb; and cellular segregation due to differential adhesion.
Title: Applications of Convexity in Computational Geometry
Candidate: Capoyleas, Vasilis
Advisor(s): Pach, Janos; Pollack, Richard
Abstract:
We present seven results in computational geometry. The concept of convexity plays a vital role in all seven of the results; either as a tool in the proof method or as a means of giving a formal definition.
The topics considered are:
Weak $\epsilon$-nets: We provide strong upper bounds for the size of the smallest weak $\epsilon$-net of a set of points, in two basic cases.
Geometric Clusterings: We provide the first polynomial algorithm to find an optimal clustering of a set of points in the plane. The optimality criteria are based on the diameter and radius of the clusters.
The Hadwiger-Kneser-Thue Poulsen conjecture: This famous 40-year-old conjecture states that the area of the union of a set of disks is diminished if the disks are pushed together. We provide two partial results toward this conjecture.
Grasping: We consider grasping of polygonal objects by a pair of parallel jaws. We define a model and prove that a fairly large class of polygons can be grasped under this model.
Graph drawing and crossing numbers: We consider the problem of estimating the maximum number of edges for graphs that satisfy some sort of a relaxed planarity condition. We provide exact bounds for an important special case.
Title: New Techniques for the Analysis and Implementation of Functional Programs
Candidate: Chuang, Tyng-Ruey
Advisor(s): Goldberg, Benjamin
Abstract:
Functional programming languages provide programmers with clean semantics and side-effect-free computation, which make the tasks of designing programs and reasoning about them easier. Efficient implementations of purely functional programs, however, can pose certain challenges. Our purpose in this dissertation is to develop new techniques for the efficient analysis and implementation of functional programs.
Our first goal is to investigate a syntactic approach, in contrast to the usual semantic approaches, to finding the least fixed points of higher-order functions over finite domains. The second objective is to develop implementation techniques for aggregate data structures in functional programs such that accesses to aggregates are both efficient and side-effect free.
Finding the least fixed point of a monotonic function over a finite domain is an essential task when analyzing a functional program in the framework of abstract interpretation. Previous methods for least fixed point finding have primarily used semantic approaches, which often traverse large portions of the semantic domain and may be very inefficient even for simple programs. We propose a syntactic method based on an augmented simply typed lambda calculus. It is shown that, for finite domains, the syntactic method is both sound and complete with respect to the semantics. Moreover, we demonstrate that the proposed syntactic method can be quite effective in cases where the usual semantic method is very inefficient.
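The semantic approach that the thesis contrasts with is Kleene iteration from bottom: for a monotone function on a finite domain it always terminates, though possibly only after traversing a large part of the domain. A minimal sketch (the transition system in the example is our own):

```python
# Kleene iteration: iterate a monotone function f from the bottom element
# of a finite domain until the ascending chain stabilizes. Finiteness of
# the domain guarantees termination; monotonicity guarantees the result
# is the least fixed point.

def least_fixed_point(f, bottom):
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x
        x = y

# Example: reachable states of a small transition system, computed as the
# least fixed point of F(S) = {0} ∪ succ(S) on the powerset lattice.
SUCC = {0: {1}, 1: {2}, 2: {1}, 3: {0}}
F = lambda s: frozenset({0}) | frozenset(t for u in s for t in SUCC[u])
reach = least_fixed_point(F, frozenset())
# reach == frozenset({0, 1, 2})
```

The semantic method's cost shows up when the lattice is large (e.g. abstract function spaces in higher-order strictness analysis), which is what motivates the syntactic alternative.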
Efficient implementation of aggregate data structures for functional programs has been an active research topic. The problem arises because once an aggregate is updated, both the old version and the newly updated copy must be preserved to maintain the side-effect-free semantics of functional languages. We modify the shallow binding scheme of Baker to implement functional arrays with efficient incremental updates and voluminous reads. The scheme, however, uses side-effects and cannot be implemented in purely functional languages themselves. We then investigate the possibility of implementing efficient aggregates without using side-effects, and show that real-time deques can be implemented in a purely functional way. We describe several interesting applications of this technique.
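The flavor of side-effect-free aggregates can be conveyed with a simplified persistent deque built from two lists. Note this toy version is only amortized-efficient, not the worst-case real-time structure developed in the thesis.

```python
# Simplified persistent deque as a pair of tuples (front, back). Every
# operation returns a NEW deque and leaves old versions intact, so updates
# are side-effect free: any previously held version remains usable.
# (Amortized O(1) per operation; the real-time bound needs more machinery.)

def empty():
    return ((), ())

def push_front(d, x):
    f, b = d
    return ((x,) + f, b)

def push_back(d, x):
    f, b = d
    return (f, (x,) + b)

def pop_front(d):
    f, b = d
    if not f:                          # rebalance: reverse the back list
        f, b = tuple(reversed(b)), ()
    return f[0], (f[1:], b)

d0 = empty()
d1 = push_back(push_back(d0, 1), 2)    # deque holding 1, 2
x, d2 = pop_front(d1)                  # x == 1, and d1 is still usable
```

Because nothing is mutated, popping from `d1` again still yields 1: both the old and the updated version are preserved, exactly the property the thesis requires of functional aggregates.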
Title: Nonholonomic Motion Planning : Algorithms and Software
Candidate: Fernandes, Christopher
Advisor(s): Mishra, Bud
Abstract:
Robot motion planning with nonholonomic constraints has recently engaged the attention of roboticists, as its applications in dexterous manipulation, mobile robots and space robotics have begun to be understood. Such constraints arise from two different sources: rolling constraints and non-integrable conservation laws. For instance, the kinematics of dexterous manipulation using hard fingers making contact on a hard object requires nonholonomic motion planning (NMP) in order to satisfy the rolling constraint. On the other hand, the control of the attitude of space platform-based manipulators using only the internal motion of their manipulator joints requires NMP as a result of the law of conservation of angular momentum.
Recently some algorithms and their implementation in software have been created in order to understand, simulate and control nonholonomic systems. Currently most of the algorithms have been demonstrated in somewhat specialized applications. There is a great need for software that enables the researcher to quickly test algorithms on these simple systems and then experiment with potential generalizations.
In this thesis, we describe a software system that we have developed at NYU and the underlying principles and algorithms (the ``Basis algorithm''). The system runs on the SGI Iris and is written in C with auxiliary tools from Unix, Mathematica, DASSL, etc. We also describe how we have designed controllers for such example nonholonomic systems as the unicycle, the space station, and a space platform-mounted robot manipulator.
It is hoped that this thesis will be useful for the control engineers engaged in designing non-linear control systems, for roboticists studying dexterous manipulations, motion planning and space robotics and finally, for software engineers interested in building tools and applications for robotics.
Title: Dynamic Impact Analysis: Analyzing Error Propagation in Program Executions
Candidate: Goradia, Tarak
Advisor(s): Weyuker, Elaine
Abstract:
Test adequacy criteria serve as rules to determine whether a test set adequately tests a program. The effectiveness of a test adequacy criterion is determined by its ability to detect faults. For a test case to detect a specific fault, it should execute the fault, cause the fault to generate an erroneous state and propagate the error to the output. Analysis of previously proposed code-based testing strategies suggests that satisfying the error propagation condition is both important and expensive. The technique of dynamic impact analysis is proposed for analyzing a program execution and estimating the error propagation behavior of various potential sources of errors in the execution. Impact graphs are introduced to provide an infrastructure supporting the analysis. A program impact graph modifies the notion of a program dependence graph proposed in the literature in order to capture some of the subtle impact relationships that exist in a program. An execution impact graph represents the dynamic impact relationships that are demonstrated during a program execution. The notion of impact strength is defined as a quantitative measure of the error sensitivity of an impact. A cost-effective algorithm for analyzing impact relationships in an execution and computing the impact strengths is presented. A research prototype implemented to demonstrate the feasibility of dynamic impact analysis is briefly described. The time complexity of dynamic impact analysis is shown to be linear with respect to the original execution time, and experimental measurements indicate that the constant of proportionality is a small number. The experiments undertaken to validate the computation of impact strengths are presented. An empirical study relating impact strengths to error propagation in faulty programs is also presented. The empirical results provide evidence indicating a strong positive correlation between impact strength and error propagation.
The results also emphasize the need for better heuristics to improve the accuracy of the error propagation estimates. Potential applications of dynamic impact analysis to mutation testing, syntactic coverage-based testing and dynamic program slicing are discussed.
Title: Singularity Detection, Noise Reduction and Multifractal Characterization
Candidate: Hwang, Wen-Liang
Advisor(s): Mallat, Stephane
Abstract:
Most of a signal's information is often carried by its singularities. We study the characterization of singularities with the wavelet transform and its modulus maxima. We introduce numerical algorithms to detect and characterize pointwise singularities from the behavior of the wavelet transform maxima across scales. As an application, we develop a denoising algorithm which discriminates the signal information from noise through an analysis of local singularities. In one dimension, we recover a piecewise smooth signal in which the sharp transitions are preserved. In two dimensions, the wavelet maxima algorithm detects and characterizes the edges. The geometrical properties of edges are used to discriminate the noise from the image information, and the denoising algorithm restores sharp images even at low SNR.
Multifractals are singular signals having some self-similarity properties. We develop a robust algorithm to extract the fractal parameters of fractional Brownian motion embedded in white noise. Fractal parameters are estimated from the evolution of the variance of the wavelet coefficients across scales with a modified penalty method. Self-similar multifractals have a wavelet transform whose maxima define self-similar curves in the scale-space plane. We introduce an algorithm to recover the affine self-similarity parameters with a voting procedure. This voting strategy is robust with respect to renormalization noise. We describe numerical applications to Cantor measures, dyadic multifractals and the study of diffusion-limited aggregates.
Title: Competitive On-line Scheduling for Overloaded Real-Time Systems
Candidate: Koren, Gilad
Advisor(s): Shasha, Dennis; Mishra, Bud
Abstract:
We study competitive on-line scheduling in uniprocessor and multiprocessor real-time environments. In our model, tasks are sporadic and preemptable. Every task has a deadline and a value that the system obtains only if the task completes its execution by its deadline. The aim of a scheduler is to maximize the total value obtained from all the tasks that complete before their deadline.
An on-line scheduler has no knowledge of a task until it is released. The problem is to design an on-line scheduler with worst-case guarantees even in the presence of overloaded periods. The guarantee is given in terms of a positive competitive factor. We say that an on-line algorithm has a competitive factor of r, 0 < r <= 1, when under all possible circumstances (i.e., task sets) the scheduler will obtain at least r times the best possible value. The best value is the value obtained by a clairvoyant algorithm. In contrast to an on-line scheduler, the clairvoyant algorithm knows the entire task set a priori at time zero.
When a uniprocessor system is underloaded there exist several optimal on-line algorithms that will schedule all tasks to completion (e.g., the Earliest Deadline First algorithm). However, under overload, these algorithms perform poorly. Heuristics have been proposed to deal with overloaded situations but these give no worst case guarantees.
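A tiny simulation illustrates the overload problem. This is a naive unit-time EDF scheduler, and the two-task instance is our own invented example, not one from the thesis.

```python
# Naive preemptive EDF on unit time steps. Under overload it can earn zero
# value: below, it abandons task A for an earlier-deadline task B that can
# never finish, and ends up completing neither, while a clairvoyant
# scheduler would ignore B and collect A's value of 4.

def edf(tasks, horizon):
    # tasks: dicts with keys release, comp, deadline, value
    rem = [t["comp"] for t in tasks]
    value = 0
    for now in range(horizon):
        ready = [i for i, t in enumerate(tasks)
                 if t["release"] <= now < t["deadline"] and rem[i] > 0]
        if not ready:
            continue
        i = min(ready, key=lambda j: tasks[j]["deadline"])  # earliest deadline
        rem[i] -= 1
        if rem[i] == 0 and now + 1 <= tasks[i]["deadline"]:
            value += tasks[i]["value"]                      # completed in time
    return value

tasks = [dict(release=0, comp=4, deadline=8, value=4),   # A: feasible alone
         dict(release=2, comp=6, deadline=7, value=6)]   # B: can never finish
print(edf(tasks, horizon=12))   # 0 — EDF earns nothing here
```

On this instance the clairvoyant value is 4 (run A to completion), so naive EDF's competitive factor on overloaded sets is 0; D-over's contribution is to guarantee a positive factor in all such cases.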
We present an optimal on-line scheduling algorithm for uniprocessor overloaded systems called D-over. D-over is optimal in the sense that it has the best competitive factor possible. Moreover, while the system is underloaded, D-over will obtain 100% of the possible value.
In the multiprocessor case, we study systems with two or more processors. We present an inherent limit (lower bound) on the best competitive guarantee that any on-line parallel real-time scheduler can give. Then we present a competitive algorithm that achieves a worst case guarantee which is within a small factor from the best possible guarantee in many cases.
These are the most general results yet known for competitive scheduling of multiprocessor real-time systems.
Title: Probabilistic Methods in Computer Science and Combinatorics
Candidate: Narayanan, Babu
Advisor(s): Boppana, Ravi
Abstract:
Over the last few years, the Probabilistic method has become an important tool in Computer Science and Combinatorics. This thesis deals with three applications of the Probabilistic method.
The first problem concerns a model of imperfect randomness: the slightly-random source of Santha and Vazirani. In a slightly-random source with bias $\epsilon$, the conditional probability that the next bit output is 1, given complete knowledge of the previous bits output, is between $1/2 - \epsilon$ and $1/2 + \epsilon$. We show that, for every fixed $\epsilon < 1/2$, and for most sets, the probability of hitting that set using a slightly-random source is bounded away from 0.
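For small n, the adversary's power over a slightly-random source can be computed exactly by recursing over bit prefixes. This is an illustrative calculation; the target set and the bias below are our own choices.

```python
from itertools import product

# The adversary sets each bit's probability of being 1 anywhere in
# [1/2 - eps, 1/2 + eps] after seeing the prefix, trying to AVOID the
# target set S of n-bit strings. The minimum achievable hit probability
# follows by putting maximal allowed weight on the cheaper branch.

def min_hit_prob(S, n, eps):
    def rec(prefix):
        if len(prefix) == n:
            return 1.0 if prefix in S else 0.0
        p0, p1 = rec(prefix + "0"), rec(prefix + "1")
        lo, hi = 0.5 - eps, 0.5 + eps
        return min(hi * p0 + lo * p1, lo * p0 + hi * p1)
    return rec("")

# S = "majority of 3 bits is 1" (half of all strings), bias eps = 1/4:
S = {"".join(b) for b in product("01", repeat=3) if b.count("1") >= 2}
p = min_hit_prob(S, 3, 0.25)
# p == 0.15625: reduced below 1/2, but bounded away from 0
```

For any fixed $\epsilon < 1/2$ the per-bit weights stay bounded away from 0, which is the intuition behind the theorem that most sets remain hittable.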
The second problem arises in parallel and distributed computing. A set of n processors is trying to collectively flip a coin, using a protocol that tolerates a large number of faulty processors. We demonstrate the existence of perfect-information protocols that are immune to sets of $\epsilon n$ faulty processors, for every fixed $\epsilon < 1/2$.
Finally, we consider a problem in Ramsey theory. Let an adversary color the edges of the binomial random graph with r colors, the edge probability being $c/\sqrt{n}$, where c is a large enough constant. We show that, almost surely, a constant fraction of the triangles in the graph will be monochromatic.
Title: Dataflow Analysis of Logic Programs Using Typed Domains
Candidate: Papadopoulos, Georgios
Advisor(s): Harrison, Malcolm C.
Abstract:
Dataflow analysis for logic programming languages has traditionally collected information about properties of whole terms. As a result, pessimistic assumptions had to be made about the subterms of a term, missing information that could be used for better compiler optimization and partial evaluation.
We use type information to divide each term into sets of subterms (t-terms) and then collect dataflow information for these sets. We identify and solve several problems as we develop a new sharing analysis for logic programs using our t-terms in a denotational abstract interpretation framework.
Title: Statistical Recognition of Textured Patterns From Local Spectral Decomposition
Candidate: Perry, Adi
Advisor(s): Lowe, David
Abstract:
Unsupervised segmentation of an image into homogeneously textured regions and the recognition of known texture patterns have been important tasks in computer vision. This thesis presents a new set of algorithms and describes an implemented system which performs these tasks. Initial features are computed from a local multi-channel spectral decomposition of the image that is implemented with Gabor filters. Textures are not assumed to have a band-limited frequency spectrum and there is no supposition regarding the image contents: it may contain some unknown texture patterns or regions with no textures at all. Stability of features is enhanced by employing a method for smoothing reliable measurements. Both recognition and segmentation procedures use robust statistical algorithms and are performed locally for small image patches. In particular, statistical classification with principal components is used for recognition. Further accuracy is achieved by employing spatial consistency constraints. When a slanted texture is projected on the image plane, the patterns undergo systematic changes in the density, area, and directionality of the texture elements. Recognition is made invariant to such transformations by representing texture classes with multiple descriptors. These descriptors are computed from carefully selected 3-D views of the patterns. Simulated projections of textures from arbitrary viewpoints are obtained by using a new texture-mapping algorithm. The segmentation algorithm overcomes the non-stationarity of the features by employing a new, robust similarity measure. The performance of these methods is demonstrated by applying them to real images.
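One channel of such a Gabor filter bank can be sketched as follows. This uses the standard real Gabor formula; the parameter names and values are ours, not the thesis's.

```python
import math

# One tap of a real Gabor filter: a Gaussian envelope times an oriented
# cosine carrier. A filter bank samples several orientations theta and
# wavelengths lam to obtain the local multi-channel spectral decomposition.
def gabor(x, y, theta, lam, sigma, gamma=0.5, psi=0.0):
    xr = x * math.cos(theta) + y * math.sin(theta)     # rotate into the
    yr = -x * math.sin(theta) + y * math.cos(theta)    # filter's frame
    envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr)
                        / (2 * sigma * sigma))
    carrier = math.cos(2 * math.pi * xr / lam + psi)
    return envelope * carrier

# a small 7x7 kernel for one channel of the bank (theta = 0, horizontal)
kernel = [[gabor(x, y, theta=0.0, lam=4.0, sigma=2.0)
           for x in range(-3, 4)] for y in range(-3, 4)]
```

Convolving the image with each channel's kernel yields the initial feature vector at every pixel; the recognition and segmentation stages then operate on these responses.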
Title: Automating Physical Database Design: An Extensible Approach
Candidate: Rozen, Steven
Advisor(s): Shasha, Dennis
Abstract:
In a high-level query language such as SQL, queries yield the same result no matter how the logical schema is physically implemented. Nevertheless, a query's cost can vary by orders of magnitude among different physical implementations of the same logical schema, even with the most modern query optimizers. Therefore, designing a low-cost physical implementation is an important pragmatic problem: one that requires a sophisticated understanding of physical design options and query strategies, and that involves estimating query costs, a tedious and error-prone process when done manually.
We have devised a simple framework for automating physical design in relational or post-relational DBMSs and in database programming languages. Within this framework, design options are uniformly represented as ``features'', and designs are represented by ``conflict''-free sets of features. (Mutually exclusive features conflict. An example would be two primary indexes on the same table.) The uniform representation of design options as features accommodates a greater variety of design options than previous approaches; adding a new design option (e.g. a new index type) merely entails characterizing it as a feature with appropriate parameters. We propose an approximation algorithm, based on this framework, that finds low-cost physical designs. In an initial phase, the algorithm examines the logical schema, data statistics, and queries, and generates ``useful features'': features that might reduce query costs. In a subsequent phase, the algorithm uses the DBMS's cost estimates to find ``best features'': features that belong to the lowest-cost designs for each individual query. Finally, the algorithm searches among conflict-free subsets of the best features of all the queries to find organizations with low global cost estimates. We have implemented a prototype physical design assistant for the INGRES relational DBMS, and we evaluate its designs for several benchmarks, including AS3AP. Our experiments with the prototype show that it can produce good designs, and that the critical factor limiting their quality is the accuracy of query cost estimates. The prototype implementation isolates dependencies on INGRES, permitting our framework to produce design assistants for a wide range of relational, nested-relational, and object-oriented DBMSs.
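The final search phase can be illustrated with a toy exhaustive search over conflict-free feature sets. The feature names, the conflict pair, and the cost numbers below are all invented for illustration; a real design assistant would query the DBMS's optimizer for cost estimates.

```python
from itertools import combinations

# A design is a set of features; two primary indexes on the same column
# conflict; pick the conflict-free design with the lowest total cost.
FEATURES = ["btree(emp.id)", "hash(emp.id)", "btree(emp.dept)"]
CONFLICTS = {("btree(emp.id)", "hash(emp.id)")}

def conflict_free(design):
    return not any((a, b) in CONFLICTS or (b, a) in CONFLICTS
                   for a, b in combinations(design, 2))

def cost(design):
    # stand-in for the DBMS's per-query cost estimates
    c = 100 + 80                         # base cost of queries q1, q2
    if "btree(emp.id)" in design:   c -= 60
    if "hash(emp.id)" in design:    c -= 70
    if "btree(emp.dept)" in design: c -= 30
    return c + 5 * len(design)           # small maintenance cost per feature

best = min((d for r in range(len(FEATURES) + 1)
            for d in combinations(FEATURES, r) if conflict_free(d)),
           key=cost)
# best == ('hash(emp.id)', 'btree(emp.dept)') with cost 90
```

With many queries the conflict-free subsets are too numerous to enumerate, which is why the framework first narrows the search to each query's "best features".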
Title: A Probabilistic Approach to Geometric Hashing using Line Features
Candidate: Tsai, Frank
Advisor(s): Schwartz, Jacob T.
Abstract:
One of the most important goals of computer vision research is object recognition. Most current object recognition algorithms assume reliable image segmentation, which in practice is often not available. This research exploits the combination of the Hough method with the geometric hashing technique for model-based object recognition in seriously degraded intensity images.
We describe the analysis, design and implementation of a recognition system that can recognize, in a seriously degraded intensity image, multiple objects modeled by a collection of lines.
We first examine the factors affecting line extraction by the Hough transform and propose various techniques to cope with them. Line features are then used as primitive features from which we compute the geometric invariants used by the geometric hashing technique. Various geometric transformations, including rigid, similarity, affine and projective transformations, are examined. We then derive the ``spread'' of a computed invariant over the hash space caused by ``perturbation'' of the lines giving rise to it. This is the first such noise analysis on line features for geometric hashing. The result of the noise analysis is then used in a weighted voting scheme for the geometric hashing technique. We have implemented the system described and carried out a series of experiments on polygonal objects modeled by lines, assuming affine approximations to perspective viewing transformations. Our experimental results show that the technique described is noise resistant and suitable in an environment containing many occlusions.
Title: A Miniature Space-Variant Active Vision System: Cortex-I
Candidate: Bederson, Benjamin
Advisor(s): Schwartz, Eric
Abstract:
We have developed a prototype miniaturized active vision system whose sensor architecture is based on a logarithmically structured space-variant pixel geometry. A space-variant image's resolution changes across the image. Typically, the central part of the image has a very high resolution, and the resolution falls off gradually in the periphery. Our system integrates a miniature CCD-based camera, pan-tilt actuator, controller, general purpose processors and display. Due to the ability of space-variant sensors to cover large work-spaces, yet provide high acuity with an extremely small number of pixels, space-variant active vision system architectures provide the potential for radical reductions in system size and cost. We have realized this by creating an entire system that takes up less than a third of a cubic foot. In this thesis, we describe a prototype space-variant active vision system (Cortex-I) which performs such tasks as tracking moving objects and license plate reading, and functions as a video telephone.
We report on the design and construction of the camera (which is 8x8x8mm), its readout, and a fast mapping algorithm to convert the uniform image to a space-variant image. We introduce a new miniature pan-tilt actuator, the Spherical Pointing Motor (SPM), which is 4x5x6cm. The basic idea behind the SPM is to orient a permanent magnet to the magnetic field induced by three orthogonal coils by applying the appropriate ratio of currents to the coils. Finally, we present results of integrating the system with several applications. Potential application domains for systems of this type include vision systems for mobile robots and robot manipulators, traffic monitoring systems, security and surveillance, telerobotics, and consumer video communications.
The long-range goal of this project is to demonstrate that major new applications of robotics will become feasible when small low-cost machine vision systems can be mass-produced. This notion of `commodity robotics' is expected to parallel the impact of the personal computer, in the sense of opening up new application niches for what has until now been expensive and therefore limited technology.
Title: Regular Expressions to DFA's using Compressed NFA's
Candidate: Chang, Chia-Hsiang
Advisor(s): Paige, Robert
Abstract:
We show how to turn a regular expression R of length r into an O(s) space representation of McNaughton and Yamada's NFA, where s is the number of occurrences of alphabet symbols in R, and s + 1 is the number of NFA states. The standard adjacency list representation of McNaughton and Yamada's NFA takes up $1 + 2s + s^2$ space in the worst case. The adjacency list representation of the NFA produced by Thompson takes up between 2r and 6r space, where r can be arbitrarily larger than s. Given any subset V of states in McNaughton and Yamada's NFA, our representation can be used to compute the set U of states one transition away from the states in V in optimal time O(|V| + |U|). McNaughton and Yamada's NFA requires $\Theta(|V| \times |U|)$ time in the worst case. Using Thompson's NFA, the equivalent calculation requires $\Theta(r)$ time in the worst case.
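The position construction underlying McNaughton and Yamada's NFA can be sketched as follows. This is a standard textbook rendering (nullable/first/last/follow over the syntax tree), not the thesis's compressed representation, and the AST encoding is ours.

```python
# McNaughton-Yamada ("position") NFA: each occurrence of an alphabet symbol
# becomes one state, so an expression with s symbol occurrences yields s + 1
# states (the extra state is the start). Regexes are tuples:
#   ('sym', a, pos), ('cat', l, r), ('alt', l, r), ('star', e)

def analyze(e):
    # returns (nullable, first, last); fills the FOLLOW table as it goes
    kind = e[0]
    if kind == 'sym':
        return False, {e[2]}, {e[2]}
    if kind == 'alt':
        n1, f1, l1 = analyze(e[1]); n2, f2, l2 = analyze(e[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == 'cat':
        n1, f1, l1 = analyze(e[1]); n2, f2, l2 = analyze(e[2])
        for p in l1:                 # a last position of the left operand
            FOLLOW[p] |= f2          # can step into a first position of the right
        return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())
    if kind == 'star':
        n, f, l = analyze(e[1])
        for p in l:
            FOLLOW[p] |= f           # loop back from last to first
        return True, f, l

# (a|b)* a  with symbol positions 1, 2, 3 — four NFA states in all
R = ('cat', ('star', ('alt', ('sym', 'a', 1), ('sym', 'b', 2))), ('sym', 'a', 3))
FOLLOW = {1: set(), 2: set(), 3: set()}
nullable, first, last = analyze(R)
# first == {1, 2, 3}; FOLLOW[1] == {1, 2, 3}; last == {3}
```

The FOLLOW table is exactly the NFA's transition relation, and its worst-case quadratic size is what the thesis's O(s) representation compresses.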
An implementation of our NFA representation confirms that it takes up an order of magnitude less space than McNaughton and Yamada's machine. An implementation to produce a DFA from our NFA representation by subset construction shows linear and quadratic speedups over subset construction starting from both Thompson's and McNaughton and Yamada's NFA's.
It also shows that the DFA produced from our NFA is as much as one order of magnitude smaller than DFA's constructed from the two other NFA's.
A UNIX egrep-compatible tool called cgrep, based on our NFA representation, has been implemented. A benchmark shows that cgrep is dramatically faster than both UNIX egrep and GNU e?grep.
Throughout this thesis the importance of syntax is stressed in the design of our algorithms. In particular, we exploit a method of program improvement in which costly repeated calculations can be avoided by establishing and maintaining program invariants. This method of symbolic finite differencing has been used previously by Douglas Smith to derive efficient functional programs.
Title: Complexity Issues in Computational Algebra
Candidate: Gallo, Giovanni
Advisor(s): Mishra, Bud
Abstract:
The ideal membership problem for the ring of multivariate polynomials is a central problem in Computational Algebra. Relatively tight computational complexity bounds for this problem are known in the case of polynomials with coefficients in a field. After reviewing these results we give an algorithm, together with an upper bound on its complexity, for the solution of the membership problem in the case of polynomials with integer coefficients. This result is obtained by adapting Buchberger's algorithm to the integer case. As an application, we also provide a more general upper bound on the length of strictly ascending chains of ideals in the ring $Z[x_1,\ldots,x_n]$.
Many applications of Computational Algebra, however, do not require the complete solution of the membership problems and alternative approaches are available. In this thesis we survey the method of the characteristic sets, originally introduced by Ritt in the forties and now widely applied, particularly in Elementary Geometry Theorem Proving. We present optimal algorithms for computing a characteristic set with simple-exponential sequential and polynomial parallel time complexities.
We finally present an attempt to generalize some of the constructive methods of Commutative Algebra to the algebra of differential polynomials. Rings of differential polynomials do not resemble their purely algebraic counterparts: we prove that there exist non-recursive differential ideals in $Z\{x\}$ and hence that, in general, no effective method can be devised to solve the membership problem in this case. However, special classes of ideals can be algorithmically approached, and toward this goal we generalize the concept of H-basis, first introduced by Macaulay for algebraic ideals, to differential rings.
Title: Typing Higher-Order Functions with Dynamic Dispatching
Candidate: Hsieh, Chih-Hung
Advisor(s): Harrison, Malcolm C.
Abstract:
We design new type expressions and algorithms to classify and check object types in higher-order programming. Our computation model is imperative and strongly typed. It has dynamically dispatched functions, higher-order bounded polymorphic functions, record and function subtyping, parameterized types, both named and structural types, free-union types, existential union types, poly-typed variables, poly-typed expressions, and heterogeneous collections.
A prototype of a mini-language with the above features is implemented in Prolog with a type checking system. A small but powerful set of typing structures and operations is identified. The type checking rules are formally defined. A new technique is developed for translating recursive type relations into cyclic AND/OR graphs. Efficient algorithms are designed for resolving generalized AND/OR graphs with constraints on valid cycles.
Using elegant syntax, the new type system describes more general and precise type relations than any other type system we know of. The new technique for translating type relations into AND/OR graphs provides a new direction for implementing a higher-order polymorphic type system, one that is not available in unification-based type systems. The AND/OR graph models are general enough to represent recursive relations, and their applications are not limited to type checking. Our AND/OR graph resolution algorithms find the optimal solutions. They are proved efficient in theory and shown practical in our implementation.
Title: Computer Simulation of Cortical Polymaps
Candidate: Landau, Pierre
Advisor(s): Schwartz, Eric
Abstract:
Neo-cortical sensory areas of the vertebrate brain are organized in terms of topographic maps of peripheral sense-organs. Cortical topography has been generally modeled in terms of a continuous map of a peripheral sensory surface onto a cortical surface. However, the details of cortical architecture do not conform to this concept. Most, if not all, cortical areas consist of an interlaced structure containing multiple topographic maps of distinct classes of neural input. The term ``polymap'' is used to refer to a cortical area which consists of more than one system, interlaced in a globally topographic, but locally columnar fashion. The best known example of a cortical polymap is provided by the ocular dominance column system in layer IV of primate striate cortex, but the puff/extra-puff and orientation systems of surrounding layers also illustrate this concept, as do the thick-thin-interstripe columns of V-2, and the direction columns of MT. Since polymap architecture seems to be a common architectural pattern in the neo-cortex, this work addresses the computational modeling of polymap systems, with the expectation that such modeling will lead to a better understanding of the underlying biology. An algorithm is presented, based on the computational geometry constructs of Generalized Voronoi Polygon and Medial Axis, which provides a general method for simulating polymap systems. It also adds a powerful technique to the repertoire of Digital Image Warping. The algorithm is illustrated using the ocular dominance column and orientation column systems of V-1. In addition, a mechanism is proposed and demonstrated to account for the spatial registration of the ocular dominance and orientation column systems. Computer simulations of the activity evoked by binocular stimuli, as they would appear at the level of layers III and IV in V-1, are shown, and compared to results from recent experiments. 
Methods of generalizing these techniques to other common polymap cortical areas are outlined.
Title: Polymorphic Type Inference and Abstract Data Types
Candidate: Laufer, Konstantin
Advisor(s): Goldberg, Benjamin; Odersky, Martin (Yale)
Abstract:
Many statically-typed programming languages provide an abstract data type construct, such as the package in Ada, the cluster in CLU, and the module in Modula-2. However, in most of these languages, instances of abstract data types are not first-class values. Thus they cannot be assigned to a variable, passed as a function parameter, or returned as a function result.
The higher-order functional language ML has a strong and static type system with parametric polymorphism. In addition, ML provides type reconstruction and consequently does not require type declarations for identifiers. Although the ML module system supports abstract data types, their instances cannot be used as first-class values for type-theoretic reasons.
In this dissertation, we describe a family of extensions of ML. While retaining ML's static type discipline, type reconstruction, and most of its syntax, we add significant expressive power to the language by incorporating first-class abstract types as an extension of ML's free algebraic datatypes. In particular, we are now able to express
Following Mitchell and Plotkin, we formalize abstract types in terms of existentially quantified types. We prove that our type system is semantically sound with respect to a standard denotational semantics.
We then present an extension of Haskell, a non-strict functional language that uses type classes to capture systematic overloading. This language results from incorporating existentially quantified types into Haskell and gives us first-class abstract types with type classes as their interfaces. We can now express heterogeneous structures over type classes. The language is statically typed and offers comparable flexibility to object-oriented languages. Its semantics is defined through a type-preserving translation to a modified version of our ML extension.
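The heterogeneous structures that existential quantification enables can be illustrated, loosely, outside ML and Haskell. The Python sketch below uses a `Protocol` as the "type class" interface: a list typed by the protocol plays the role of a list of existentially quantified values, each element carrying its own hidden representation together with the interface's operations. All names here are invented for illustration and are not from the thesis.

```python
from dataclasses import dataclass
from typing import Protocol

# A "type class" interface: any shape exposes area(); its representation stays hidden.
class HasArea(Protocol):
    def area(self) -> float: ...

@dataclass
class Circle:
    r: float
    def area(self) -> float:
        return 3.141592653589793 * self.r ** 2

@dataclass
class Square:
    s: float
    def area(self) -> float:
        return self.s ** 2

# A heterogeneous structure: each element packs a different hidden
# representation with the operations of the interface -- informally,
# a list of values of type "exists a. HasArea a => a".
shapes: list[HasArea] = [Circle(1.0), Square(2.0)]
total = sum(shape.area() for shape in shapes)
```

The static guarantee is weaker than in the typed extensions described above (Python checks the protocol only under a type checker such as mypy), but the data-structure shape is the same: consumers see only the interface, never the concrete type.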
We have implemented a prototype of an interpreter for our language, including the type reconstruction algorithm, in Standard ML.
Title: A sublanguage-based medical language processing system for German
Candidate: Oliver, Neil
Advisor(s): Sager, Naomi
Abstract:
The major accomplishments reported in this thesis are:
- The development of a computer grammar for a nontrivial sublanguage of German. This grammar, using the LSP (Linguistic String Processor) grammar formalism, solves a number of parsing problems arising in free word order languages such as German.
- The development of an LSP-based information formatting system that obtains semantic representations of texts in a medical sublanguage of German.
- The confirmation of the sublanguage hypothesis (explained below).
In LSP grammar theory, sentences in a language are derived from a collection of basic sentence types. The basic sentence types are described in terms of the major syntactic classes (e.g., noun, verb, adjective) of the language. Sentences are derived from these basic sentences by the insertion of optional structures called adjuncts, by conjoining, and by substituting words in the major classes. Insertion, conjoining, and substitution are constrained by co-occurrence restrictions between elements in the derived syntactic structures. The restrictions subcategorize the major word classes into subclasses that may co-occur in sentences according to the co-occurrence restrictions.
The sublanguage hypothesis elaborates LSP grammar theory in the following way. In a particular domain of discourse, the subcategorization of the major word classes reflects the underlying semantics of the domain. The basic sentence types of the language, represented by sublanguage subclasses instead of major word classes, can function as data structures (called information formats ) representing the information of the domain.
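As a toy illustration of how subcategorized word classes can populate an information format, the sketch below maps words to hypothetical sublanguage subclasses and fills a flat format record. The lexicon entries and slot names are invented for illustration; they are not the LSP subclass inventory.

```python
# Hypothetical sublanguage subclass lexicon (not the LSP inventory).
LEXICON = {
    "patient": "PT",         # patient noun subclass
    "reported": "V-REPORT",  # reporting-verb subclass
    "fever": "SYMPTOM",
    "penicillin": "MED",
    "received": "V-TREAT",
}

def format_sentence(tokens):
    """Fill a flat information-format record from word subclasses (toy version)."""
    record = {"PT": None, "SYMPTOM": None, "MED": None, "VERB": None}
    for tok in tokens:
        cls = LEXICON.get(tok.lower())
        if cls in record:
            record[cls] = tok
        elif cls and cls.startswith("V-"):
            record["VERB"] = cls
    return record

row = format_sentence("Patient reported fever".split())
# row -> {"PT": "Patient", "SYMPTOM": "fever", "MED": None, "VERB": "V-REPORT"}
```

In the actual system the formats are derived from co-occurrence restrictions over parsed syntactic structures, not from a flat token scan; the sketch only shows the format-as-data-structure idea.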
The LSP Medical Language Processor (LSP/MLP) is an information retrieval/information extraction system based on sublanguage and information formatting. It processes sentences in the English sublanguage of clinical reporting into information formats, which in turn are converted into database update records for a relational database. The information formats are derived from sublanguage co-occurrence information obtained from a corpus of discharge summaries.
The German information formatting system implemented in this work processes German Arztbriefe (doctor letters) of cancer surgery patients into information formats. It confirms the sublanguage hypothesis because it re-uses the sublanguage information (co-occurrence information and formats) of the English LSP/MLP system in an equivalent sublanguage, showing that the sublanguage information reflects the semantics of the domain.
Title: Image Processing, Pattern Recognition and Attentional Algorithms in a Space-Variant Active Vision System
Candidate: Ong, Ping-Wen
Advisor(s): Schwartz, Eric
Abstract:
A space-variant sensor motivated by the human visual system has the highest resolution at the center, with resolution decreasing rapidly toward the periphery. It offers a wide visual field and, at the same time, high central resolution. The dramatic reduction in pixel count in this kind of sensor makes it possible to build a real-time vision system using only moderate computational resources. On the other hand, a space-variant image has a different layout from a raster image: the neighbor relationships change from pixel to pixel. We need to devise special methods to solve this neighborhood problem.
We use a connectivity graph to represent the neighbor relations between pixels in a space-variant image, and with it define operators for edge detection, smoothing, etc. We use a two-level pyramid based on the connectivity graph to perform local thresholding for segmentation. The translation, rotation, and scaling graphs are three extensions of the connectivity graph, used to translate, rotate, and scale space-variant images; with these graphs we can perform scale- and rotation-independent template matching.
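The core idea of the connectivity graph, that adjacency is stored per pixel rather than implied by a raster grid, can be sketched as follows. The five-pixel layout and values are invented for illustration; a real sensor has thousands of pixels with degrees determined by the sensor geometry.

```python
# A toy connectivity graph for a space-variant image: pixel adjacency is
# stored explicitly, since neighbor relations vary from pixel to pixel.
neighbors = {
    0: [1, 2],     # central, high-resolution pixel
    1: [0, 2, 3],
    2: [0, 1, 4],
    3: [1, 4],     # peripheral pixels can have different degrees
    4: [2, 3],
}
values = {0: 10.0, 1: 12.0, 2: 8.0, 3: 20.0, 4: 16.0}

def smooth(values, neighbors):
    """Graph-based smoothing: replace each pixel by the mean of itself
    and its connectivity-graph neighbors."""
    out = {}
    for p, nbrs in neighbors.items():
        vals = [values[p]] + [values[q] for q in nbrs]
        out[p] = sum(vals) / len(vals)
    return out

smoothed = smooth(values, neighbors)
```

Edge detection and the other local operators mentioned above follow the same pattern: iterate over each pixel's explicit neighbor list instead of a fixed 3x3 window.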
We successfully apply several feature designs for OCR in the space-variant domain: Characteristic-Loci, Partition, Heat-Signature, and Projection. All of them are translation and scale invariant. We also have two rotation-invariant methods based on the Partition and Projection methods.
Since the space-variant sensor has higher resolution at the center, the recognition result is more reliable when the sensor points close to the candidate object. Therefore, to recognize a single character, the center of that character is the best place to point the sensor. For recognizing adjacent characters, however, unless they are well separated, we need to point the sensor to a place from which the characters can be separated.
Based on this reliability analysis, we devised four attentional rules and an algorithm for moving sensors to recognize character strings in static natural scenes.
Finally, we describe algorithms for reading the characters on the license plate of a moving vehicle. The procedure includes stages for traffic-zone finding, moving-car finding, license-plate finding, license-plate tracking, and character reading.
Title: On Compiling Regular Loops for Efficient Parallel Execution
Candidate: Ouyang, Pei
Advisor(s): Kedem, Zvi; Palem, Krishna
Abstract:
In this thesis, we study the problem of mapping regular loops onto multiprocessors. We develop mapping schemes that yield very efficient executions of regular loops on shared and distributed memory architectures, along with novel analysis techniques with which we argue about the efficiency of the resulting executions. The quality of the execution of these regular loops in the distributed memory setting relies heavily on implementing cyclic shifts efficiently: cyclic shifts are used to communicate results between the individual processors to which different interdependent iterations are assigned. Therefore, in order to achieve efficient executions of regular loops on distributed memory architectures, we also develop and analyze algorithms for solving the cyclic shift problem. In order to analyze the executions of regular loops that result from any specific mapping, we need to characterize the important parameters that determine their efficiency. We formally characterize a basic set of such parameters, which facilitate the analysis of the memory and processor requirements of a given execution, as well as its running time. Using these parameters, we analyze a greedy execution scheme in the shared memory model. For example, we can determine the limit on the number of processors beyond which no speedup can be attained by the greedy method for regular loops. The greedy scheme is of interest because it exploits the maximal possible parallelism in a natural way.
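The greedy (as-soon-as-possible) scheme can be sketched for a doubly nested regular loop: every iteration runs one step after all iterations it depends on. The dependence distance vectors and bounds below are an invented example, not taken from the thesis.

```python
from itertools import product

# Greedy execution of a regular loop nest: each iteration runs as soon
# as all iterations it depends on have finished.
N = 6
deps = [(1, 0), (0, 1)]  # iteration (i,j) depends on (i-1,j) and (i,j-1)

time = {}
for i, j in product(range(N), repeat=2):
    preds = [(i - di, j - dj) for di, dj in deps
             if i - di >= 0 and j - dj >= 0]
    time[(i, j)] = 1 + max((time[p] for p in preds), default=0)

makespan = max(time.values())          # length of the longest dependence chain
width = max(sum(1 for t in time.values() if t == step)
            for step in range(1, makespan + 1))
# `width` bounds the useful processor count: adding more processors than
# this yields no further speedup under the greedy scheme.
```

For this example the schedule runs iterations along anti-diagonals, so the makespan is 2N-1 steps and the widest step has N concurrent iterations, which is the processor limit the abstract alludes to.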
We then address the problem of mapping regular loops onto distributed memory machines. Unfortunately, we show that the problem of finding an optimal mapping is computationally intractable in this case. In order to provide schemes that can actually be applied to regular loops at compile-time, we relax the requirement that the resulting executions be optimum. Instead, we design a heuristic mapping algorithm and validate it through experiments. This heuristic mapping scheme relies heavily on the use of efficient algorithms for realizing cyclic shifts. Therefore, we also study the problem of realizing cyclic shifts on hypercube architectures.
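One standard way to realize a cyclic shift among p = 2^d processors is to decompose the shift amount into at most d power-of-two sub-shifts, one per set bit. The sketch below simulates that decomposition on a list standing in for the processors; it illustrates the decomposition idea only, not the thesis's hypercube routing algorithms.

```python
# Cyclic shift by s among p = 2**d "processors", decomposed into at most
# d communication phases, one per set bit of s.
def cyclic_shift(data, s):
    p = len(data)
    bit = 0
    while (1 << bit) < p:
        if s & (1 << bit):
            step = 1 << bit
            # one communication phase: node i sends its value to (i + step) mod p
            data = [data[(i - step) % p] for i in range(p)]
        bit += 1
    return data

shifted = cyclic_shift(list(range(8)), 3)
# shifted -> [5, 6, 7, 0, 1, 2, 3, 4]
```

Each phase moves every value the same distance, which is the regular, all-to-neighbor communication pattern that makes cyclic shifts a natural primitive for the interdependent loop iterations described above.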
Title: Japanese/English Machine Translation Using Sublanguage Patterns and Reversible Grammars
Candidate: Peng, Ping
Abstract:
For this thesis, a Japanese/English machine translation system with reversible components has been designed and implemented in PROLOG. Sublanguage co-occurrence patterns have been used to address the problems of lexical and structural selection in the transfer between the internal representations of a pair of natural languages. The system has been tested translating Japanese into English in the domain of programming language manuals. The evaluation of the test outputs provides some assessment of the utility of the sublanguage approach as a method for the development and refinement of a machine translation system. The thesis also explores the roles that a reversible grammar would play in sharing linguistic knowledge between parsing and generation.
The system has been developed with the goal of using sublanguage word co-occurrence patterns to simplify the description of syntactic/semantic knowledge needed in both the transfer rules and the analysis of the source language. In particular, sublanguage co-occurrence patterns are introduced to provide semantic constraints and ellipsis recovery in parsing Japanese.
This thesis introduces a right-to-left parsing scheme for Japanese. The idea for the right-to-left parsing algorithm evolved from the desire to produce partial syntactic analyses of Japanese in a more deterministic manner than was achieved by conventional left-to-right parsing schemes. The algorithm makes efficient use of sublanguage co-occurrence patterns as semantic knowledge to help disambiguate Japanese parses. The enforcement of syntactic and semantic constraints is tightly interwoven during the course of parsing. The performance in parsing Japanese has thereby been significantly enhanced.
A procedure has been implemented for translating a Definite Clause Grammar dually into a PROLOG parser and PROLOG generator, so that one grammar can be used for parsing and generation. In current natural language processing systems, separate grammars are used for parsing and generation. However, there has long been an interest in designing a single grammar for both parsing and synthesis for reasons of efficiency and integrity, as well as linguistic elegance and perspicuity. As part of the current implementation, a strategy has been developed for creating efficient grammars for both parsing and generation using a goal reordering technique within the logic programming framework.
Title: The Analysis and Generation of Tests for Programming Language Translators
Candidate: Rennels, Deborah
Advisor(s): Schonberg, Edmond
Abstract:
This thesis addresses the automation of two aspects of compiler validation testing: semantic analysis of existing test programs, and construction of new test programs. Semantic analysis is required during test modification and maintenance, and also when evaluating the language coverage attained by the test suite. In the current state of practice, both the semantic analysis and the construction of new test programs are extremely labor-intensive tasks; both, however, are amenable to automation. We describe two systems: one, which we have implemented, involves test case analysis and feature identification; the other is a proposed system for automatic generation of tests from test specifications. We tested our methods on the largest and most comprehensive compiler validation project to date: the Ada Compiler Validation Capability (ACVC), a large collection of Ada test programs used to verify that compilers conform to the Ada language standard.
We first describe the Ada Features Identification System (AFIS), a system which automates test program analysis. AFIS provides three different methods for identifying Ada language features in test programs, ranging from elementary syntactic items to complex context-sensitive combinations of semantic features. Semantic feature combinations are specified by writing program templates in a pattern language which is an extension of Ada, and pattern-matching these templates against test programs.
In the second part of this thesis we define a language to facilitate the specification of Ada compiler test objectives, and the design of a system that uses these specifications to automatically generate valid Ada test programs. The language allows a test developer to write a specification that embodies the testing goal of a given objective, without including all type and expression information required in a complete test program. These details are supplied automatically by the generator system. We show, by numerous examples taken from the Ada Implementors Guide (the design document for the Ada validation suite), how Ada test objectives can be specified in this language. The focus of our examples is constraint violation checking, which is an important component of Ada's strong typing system, and also a basic organizing principle of the ACVC tests.
Title: Massively Parallel Bayesian Object Recognition
Candidate: Rigoutsos, Isidore
Advisor(s): Hummel, Robert
Abstract:
The problem of model-based object recognition is a fundamental one in the field of computer vision, and represents a promising direction for practical applications. In this talk we will describe the design, analysis, implementation and testing of a model-based object recognition system.
In the first part of the talk, we will discuss two parallel algorithms for performing geometric hashing. The first algorithm regards geometric hashing as a connectionist algorithm with information flowing via patterns of communication, and is designed for an SIMD hypercube-based machine. The second algorithm is more general, and treats the parallel architecture as a source of ``intelligent memory;'' the algorithm achieves parallelism through broadcast facilities from the parallel machine's host. A number of enhancements to the geometric hashing method, such as hash table equalization, the use of hash table symmetries, and hash table foldings will also be presented. These enhancements were developed specifically for the parallel algorithms, and lead to substantial performance improvements.
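The underlying geometric hashing mechanics (a preprocessing pass that hashes model points expressed in model-based coordinate frames, then a voting pass at recognition time) can be sketched sequentially. The version below is translation-invariant with a single reference point per basis, purely for brevity; the work described here uses similarity and affine bases, runs in parallel, and replaces the simple vote counting with the Bayesian weighting discussed later. The models and scene are invented.

```python
from collections import defaultdict

def quantize(x, y, cell=0.5):
    """Map invariant coordinates to a discrete hash-table bin."""
    return (round(x / cell), round(y / cell))

def build_table(models):
    """Preprocessing: for each model and each choice of reference point,
    hash the coordinates of the remaining points."""
    table = defaultdict(list)
    for name, pts in models.items():
        for ox, oy in pts:                 # candidate basis (origin)
            for x, y in pts:
                if (x, y) != (ox, oy):
                    table[quantize(x - ox, y - oy)].append((name, (ox, oy)))
    return table

def recognize(table, scene):
    """Recognition: pick a trial basis in the scene and let every other
    scene point vote for (model, basis) pairs."""
    votes = defaultdict(int)
    ox, oy = scene[0]                      # trial basis from the scene
    for x, y in scene[1:]:
        for entry in table.get(quantize(x - ox, y - oy), []):
            votes[entry] += 1
    return max(votes, key=votes.get) if votes else None

models = {"wedge": [(0, 0), (2, 0), (0, 1)],
          "bar":   [(0, 0), (4, 0)]}
table = build_table(models)
best = recognize(table, [(5, 5), (7, 5), (5, 6)])   # the "wedge", translated
```

Because the table is indexed only by invariant coordinates, recognition cost is independent of the number of models, which is what makes the method attractive for the large databases and parallel implementations described here.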
In the second part of the talk, we will examine the performance of geometric hashing methods in the presence of noise. The quantization of the invariants can result in a non-graceful degradation of the performance. We will present precise formulas as well as first-order approximations describing the dependency of the computed invariants on Gaussian positional error, for the similarity and affine transformation cases. Knowledge of this dependency allows the incorporation of an error model in the geometric hashing framework and subsequently leads to improved performance. A counter-intuitive result regarding the solutions of certain linear systems will also be derived as a corollary of this analysis.
In the final part of the talk, we will present an interpretation of geometric hashing that allows the algorithm to be viewed as a Bayesian approach to model-based object recognition. This interpretation, which is a new form of Bayesian-based model matching, leads to well-justified formulas, and gives a precise weighted-voting method for the evidence-gathering phase of geometric hashing. These formulas replace traditional heuristically-derived methods for performing weighted voting, and also provide a precise method for evaluating uncertainty.
A prototype object recognition system using these ideas has been implemented on a CM-2 Connection Machine. The system is scalable and can recognize aircraft and automobile models subjected to 2D rotation, translation, and scale changes in real-world digital imagery. This is the first system of its kind that is scalable, uses large databases, can handle noisy input data, works rapidly on an existing parallel architecture, and exhibits excellent performance with real-world, natural scenes.
Title: Control of a Dexterous Robot Hand: Theory, Implementation, and Experiments
Candidate: Silver, Naomi
Advisor(s): Mishra, Bud
Abstract:
Advanced robotic systems, such as multi-fingered hands are becoming more complex, and, as yet, many o