Title: A Note on the Complexity of Model-Checking Bounded Multi-Pushdown Systems
Author(s): Bansal, Kshitij; Demri, Stephane
In this note, we provide complexity characterizations of model checking multi-pushdown systems. Multi-pushdown systems model recursive concurrent programs in which any sequential process has a finite control. We consider three standard notions for boundedness: context boundedness, phase boundedness and stack ordering. The logical formalism is a linear-time temporal logic extending well-known logic CaRet but dedicated to multi-pushdown systems in which abstract operators (related to calls and returns) such as those for next-time and until are parameterized by stacks. We show that the problem is EXPTIME-complete for context-bounded runs and unary encoding of the number of context switches; we also prove that the problem is 2EXPTIME-complete for phase-bounded runs and unary encoding of the number of phase switches. In both cases, the value k is given as an input (whence it is not a constant of the model-checking problem), which makes a substantial difference in the complexity. In certain cases, our results improve previous complexity results.
Title: Learning Hierarchical Feature Extractors For ImageRecognition
Candidate: Boureau, Y-Lan
Advisor(s): LeCun, Yann
Telling cow from sheep is effortless for most animals, but requires much engineering for computers. In this thesis, we seek to tease out basic principles that underlie many recent advances in image recognition. First, we recast many methods into a common unsupervised feature extraction framework based on an alternation of coding steps, which encode the input by comparing it with a collection of reference patterns, and pooling steps, which compute an aggregation statistic summarizing the codes within some region of interest of the image.
Within that framework, we conduct extensive comparative evaluations of many coding or pooling operators proposed in the literature. Our results demonstrate a robust superiority of sparse coding (which decomposes an input as a linear combination of a few visual words) and max pooling (which summarizes a set of inputs by their maximum value). We also propose macrofeatures, which import into the popular spatial pyramid framework the joint encoding of nearby features commonly practiced in neural networks, and obtain significantly improved image recognition performance. Next, we analyze the statistical properties of max pooling that underlie its better performance, through a simple theoretical model of feature activation. We then present results of experiments that confirm many predictions of the model. Beyond the pooling operator itself, an important parameter is the set of pools over which the summary statistic is computed. We propose locality in feature configuration space as a natural criterion for devising better pools. Finally, we propose ways to make coding faster and more powerful through fast convolutional feedforward architectures, and examine how to incorporate supervision into feature extraction schemes. Overall, our experiments offer insights into what makes current systems work so well, and state-of-the-art results on several image recognition benchmarks.
Title: On populations, haplotypes and genome sequencing
Candidate: Franquin, Pierre
Advisor(s): Mishra, Bud
Population genetics has seen a renewed interest since the completion of the human genome project. With the availability of rapidly growing volumes of genomic data, the scientific and medical communities have been optimistic that better understanding of human diseases as well as their treatment were imminent. Many population genomic models and association studies have been designed (or redesigned) to address these problems. For instance, the genome-wide association studies (GWAS) had raised hopes for finding disease markers, personalized medicine and rational drug design. Yet, as of today, they have not yielded results that live up to their promise and have only led to a frustrating disappointment.
Intrigued, but not deterred by these challenges, this dissertation visits the different aspects of these problems. In the first part, we will review the different models and theories of population genetics that are now challenged. We will propose our own implementation of a model to test different hypotheses. This effort will hopefully help us in understanding whether our expectations were unreasonably too high or if we had ignored a crucial piece of information. When discussing association studies, we must not forget that we rely on data that are produced by sequencing technologies, so far available. We have to ensure that the quality of this data is reasonably good for GWAS. Unfortunately, as we will see in the second part, despite the existence of a diverse set of sequencing technologies, none of them can produce haplotypes with phasing, which appears to be the most important type of sequence data needed for association studies. To address this challenge, we propose a novel approach for a sequencing technology, called SMASH that allows us to create the quality and type of haplotypic genome sequences necessary for efficient population genetics.
Title: Optimizing Machine Translation by Learning to Search
Candidate: Galron, Daniel
Advisor(s): Melamed, Dan
We present a novel approach to training discriminative tree-structured machine translation systems by learning to search. We describe three primary innovations in this work: a new parsing coordinator architecture and algorithms to synthesize the required training examples for the learning algorithm; a new semiring that provides an unbiased way to compare translations; and a new training objective that measures whether a translation inference improves the quality of a translation. We also apply the reinforcement learning concept of exploration to SMT. Finally, we empirically evaluate the effects of our innovations on the quality of translations output by our system.
Title: Flexible-Cost SLAM
Candidate: Grimes, Matthew
Advisor(s): LeCun, Yann
The ability of a robot to track its position and its surroundings is critical in mobile robotics applications, such as autonomous transport, farming, search-and-rescue, and planetary exploration.
As a foundational building block to such tasks, localization must remain reliable and unobtrusive. For example, it must not provide an unneeded level of precision, when the cost of doing so displaces higher-level tasks from a busy CPU. Nor should it produce noisy estimates on the cheap, when there are CPU cycles to spare.
This thesis explores localization solutions that provide exactly the amount of accuracy needed to a given task. We begin with a real-world system used in the DARPA Learning Applied to Ground Robotics (LAGR) competition. Using a novel hybrid of wheel and visual odometry, we cut the cost of visual odometry from 100% of a CPU to 5%, clearing room for other critical visual processes, such as long-range terrain classification. We present our hybrid odometer in chapter 2.
Next, we describe a novel SLAM algorithm that provides a means to choose the desired balance between cost and accuracy. At its fastest setting, our algorithm converges faster than previous stochastic SLAM solvers, while maintaining significantly better accuracy. At its most accurate, it provides the same solution as exact SLAM solvers. Its main feature, however, is the ability to flexibly choose any point between these two extremes of speed and precision, as circumstances demand. As a result, we are able to guarantee real-time performance at each timestep on city-scale maps with large loops. We present this solver in chapter 3, along with results from both commonly available datasets and Google Street View data.
Taken as a whole, this thesis recognizes that precision and efficiency can be competing values, whose proper balance depends on the application and its fluctuating circumstances. It demonstrates how a localizer can and should fit its cost to the task at hand, rather than the other way around. In enabling this flexibility, we demonstrate a new direction for SLAM research, as well as provide a new convenience for end-users, who may wish to map the world without stopping it.
Title: SMT Beyond DPLL(T): A New Approach to Theory Solvers and Theory Combination
Candidate: Jovanovic, Dejan
Advisor(s): Barrett, Clark
Satisifiability modulo theories (SMT) is the problem of deciding whether a given logical formula can be satisifed with respect to a combination of background theories. The past few decades have seen many significant developments in the field, including fast Boolean satisfiability solvers (SAT), efficient decision procedures for a growing number of expressive theories, and frameworks for modular combination of decision procedures. All these improvements, with addition of robust SMT solver implementations, culminated with the acceptance of SMT as a standard tool in the fields of automated reasoning and computer aided verification. In this thesis we develop new decision procedures for the theory of linear integer arithmetic and the theory of non-linear real arithmetic, and develop a new general framework fro combination of decision procedures. The new decision procedures integrate theory specific reasoning and the Boolean search to provide more powerful and efficient procedures, and allow a more expressive language for explaining problematic states. The new framework for combination of decision procedures overcomes the complexity limitations and restrictions on the theories imposed by the standard Nelson-Oppen approach.
Title: An Adaptive Fast Multipole Method-Based PDE Solver in Three Dimensions
Candidate: Langston, Matthew Harper
Advisor(s): Zorin, Denis
Many problems in scientific computing require the accurate and fast solution to a variety of elliptic PDEs. These problems become increasingly dif.cult in three dimensions when forces become non-homogeneously distributed and geometries are complex.
We present an adaptive fast volume solver using a new version of the fast multipole method, incorporated with a pre-existing boundary integral formulation for the development of an adaptive embedded boundary solver.
For the fast volume solver portion of the algorithm, we present a kernel-independent, adaptive fast multipole method of arbitrary order accuracy for solving elliptic PDEs in three dimensions with radiation boundary conditions. The algorithm requires only a Greenâs function evaluation routine for the governing equation and a representation of the source distribution (the right-hand side) that can be evaluated at arbiÂtrary points.
The performance of the method is accelerated in two ways. First, we construct a piecewise polynomial approximation of the right-hand side and compute far-.eld expansions in the FMM from the coef.cients of this approximation. Second, we precompute tables of quadratures to handle the near-.eld interactions on adaptive octree data structures, keeping the total storage requirements in check through the exploitation of symmetries. We additionally show how we extend the free-space volume solver to solvers with periodic and well as Dirichlet boundary conditions.
For incorporation with the boundary integral solver, we develop interpolation methods to maintain the accuracy of the volume solver. These methods use the existing FMM-based octree structure to locate apÂpropriate interpolation points, building polynomial approximations to this larger set of forces and evaluating these polynomials to the locally under-re.ned grid in the area of interest.
We present numerical examples for the Laplace, modi.ed Helmholtz and Stokes equations for a variety of boundary conditions and geometries as well as studies of the interpolation procedures and stability of far-.eld and polynomial constructions.
Title: Acquiring information from wider scope to improve event extraction
Candidate: Liao, Shasha
Advisor(s): Grishman, Ralph
Event extraction is a particularly challenging type of information extraction (IE). Most current event extraction systems rely on local information at the phrase or sentence level. However, this local context may be insufficient to resolve ambiguities in identifying particular types of events; information from a wider scope can serve to resolve some of these ambiguities.
In this thesis, we first investigate how to extract supervised and unsupervised features to improve a supervised baseline system. Then, we present two additional tasks to show the benefit of wider scope features in semi-supervised learning (self-training) and active learning (co-testing). Experiments show that using features from wider scope can not only aid a supervised local event extraction baseline system, but also help the semi-supervised or active learning approach.
Title: A tool for extracting and indexing spatio-temporal information from biographical articles in Wikipedia
Candidate: Morton-Owens, Emily
Advisor(s): Davis, Ernest
The Kivrin program, consisting of a crawler, a data collection, and a front-end interface, attempts to extract biographical information from Wikipedia, specifically, spatio-temporal information--who was where when--and make it easily searchable. Some of the considerations standard to moving object databases do not apply in this context, because the texts by their nature discuss a discontinuous series of notable moments. The paper discusses different methods of arranging the crawler queue priority to find more important figures and of disambiguating locations when the same place name (toponym) is shared among several places. When lifespan information is not available, it is estimated to exclude sightings outside the person's plausible lifetime.
The results are grouped by the number of sightings in the user's search range to minimize the visibility of false drops when they occur. Erroneous results are more visible in times and places where fewer legitimate sightings are recorded; the data is skewed, like Wikipedia itself, towards the U.S. and Western Europe and relatively recent history. The system could be most improved by using statistical methods to predict which terms are more likely personal names than place names and to identify verbs that precede location information rather than personal names. It could also be improved by incorporating the times as a third dimension in the geospatial index, which would allow "near" queries to include that dimension rather than a strict range.
The program can be used at http://linserv1.cims.nyu.edu:48866/cgi-bin/index.cgi
Title: Mobile Accessibility Tools for the Visually Impaired
Candidate: Paisios, Nektarios
Advisor(s): Subramanian; Lakshminarayanan
Visually impaired users are in dire need of better accessibility tools. The past few years have witnessed an exponential growth in the computing capabilities and onboard sensing capabilities of mobile phones making them an ideal candidate for building next-generation applications. We believe that the mobile device can play a significant role in the future for aiding visually impaired users in day-to-day activities with simple and usable mobile accessibility tools. This thesis describes the design, implementation, evaluation and user-study based analysis of four different mobile accessibility applications.
Our first system is the design of a highly accurate and usable mobile navigational guide that uses Wi-Fi and accelerometer sensors to navigate unfamiliar environments. A visually impaired user can use the system to construct a virtual topological map across points of interest within a building based on correlating the user' walking patterns (with turn signals) with the Wi-Fi and accelerometer readings. The user can subsequently use the map to navigate previously traveled routes. Our second system, Mobile Brailler, presents several prototype methods of text entry on a modern touch screen mobile phone that are based on the Braille alphabet and thus are convenient for visually impaired users. Our third system enables visually impaired users to leverage the camera of a mobile device to accurately recognize currency bills even if the images are partially or highly distorted. The final system enables visually impaired users to determine whether a pair of clothes, in this case of a tie and a shirt, can be worn together or not, based on the current social norms of color-matching.
We believe that these applications together, provide a suite of important mobile accessibility tools to enhance four critical aspects of a day-to-day routine of a visually impaired user: to navigate easily, to type easily, to recognize currency bills (for payments) and to identify matching clothes.
Title: Reusable Software Infrastructure for Stream Processing
Candidate: Soule, Robert
Advisor(s): Grimm, Robert
Developers increasingly use streaming languages to write their data processing applications. While a variety of streaming languages exist, each targeting a particular application domain, they are all similar in that they represent a program as a graph of streams (i.e. sequences of data items) and operators (i.e. data transformers). They are also similar in that they must process large volumes of data with high throughput. To meet this requirement, compilers of streaming languages must provide a variety of streaming-specific optimizations, including automatic parallelization. Traditionally, when many languages share a set of optimizations, language implementors translate the source languages into a common representation called an intermediate language (IL). Because optimizations can modify the IL directly, they can be re-used by all of the source languages, reducing the overall engineering effort. However, traditional ILs and their associated optimizations target single-machine, single-process programs. In contrast, the kinds of optimizations that compilers must perform in the streaming domain are quite different, and often involve reasoning across multiple machines. Consequently, existing ILs are not suited to streaming languages.
This thesis addresses the problem of how to provide a reusable infrastructure for stream processing languages. Central to the approach is the design of an intermediate language specifically for streaming languages and optimizations. The hypothesis is that an intermediate language designed to meet the requirements of stream processing can assure implementation correctness; reduce overall implementation effort; and serve as a common substrate for critical optimizations. In evidence, this thesis provides the following contributions: (1) a catalog of common streaming optimizations that helps define the requirements of a streaming IL; (2) a calculus that enables reasoning about the correctness of source language translation and streaming optimizations; and (3) an intermediate language that preserves the semantics of the calculus, while addressing the implementation issues omitted from the calculus This work significantly reduces the effort it takes to develop stream processing languages, and jump-starts innovation in language and optimization design.
Title: Hitting the Sweet Spot for Streaming Languages: Dynamic Expressivity with Static Optimization
Author(s): Soulé, Robert; Gordon, Michael I.; Amarasinghe, Saman; Grimm, Robert; Hirzel, Martin
Developers increasingly use stream processing languages to write applications that process large volumes of data with high throughput. Unfortunately, when choosing which stream processing language to use, they face a difficult choice. On the one hand, dynamically scheduled languages allow developers to write a wider range of applications, but cannot take advantage of many crucial optimizations. On the other hand, statically scheduled languages are extremely performant, but cannot express many important streaming applications.
This paper presents the design of a hybrid scheduler for stream processing languages. The compiler partitions the streaming application into coarse-grained subgraphs separated by dynamic rate boundaries. It then applies static optimizations to those subgraphs. We have implemented this scheduler as an extension to the StreamIt compiler, and evaluated its performance against three scheduling techniques used by dynamic systems: OS thread, demand, and no-op. Our scheduler not only allows the previously static version of StreamIt to run dynamic rate applications, but it outperforms the three dynamic alternatives. This demonstrates that our scheduler strikes the right balance between expressivity and performance for stream processing languages.
Title: Building scalable geo-replicated storage backends for web applications
Candidate: Sovran, Yair
Advisor(s): Li, Jinyang
Web applications increasingly require a storage system that is both scalable and can replicate data across many distant data centers or sites. Most existing storage solutions fall into one of two categories: Traditional databases offer strict consistency guarantees and programming ease, but are difficult to scale in a geo-replicated setting. NoSQL stores are scalable and efficient, but have weak consistency guarantees, placing the burden of ensuring consistency on programmers. In this dissertation, we describe two systems that help bridge the two extremes, providing scalable, geo-replicated storage for web applications, while also easy to program for. Walter is a key-value store that supports transactions and replicating data across distant sites. A key feature underlying Walter is a new isolation property: Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI does not allow write-write conflicts, alleviating the burden of writing conflict resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. Lynx is a distributed database backend for scaling latency-sensitive web applications. Lynx supports optimizing queries via data denormalization, distributed secondary indexes, and materialized join views. To preserve data constraints across denormalized tables and secondary indexes, Lynx relies on the a novel primitive: Distributed Transaction Chain (DTC). A DTC groups a sequence of transactions to be executed on different nodes while providing two guarantees. First, all transactions in a DTC execute exactly once despite failures. Second, transactions from concurrent DTCs are interleaved consistently on common nodes. We built several web applications on top of Walter and Lynx: an auction service, a microblogging service, and a social networking website. We have found that building web applications using Walter and Lynx is quick and easy. Our experiments show that the resulting applications are capable of providing scalable, low latency operation across multiple geo-replicated sites.
Title: Rapid Training of Information Extraction with Local and Global Data Views
Candidate: Sun, Ang
Advisor(s): Grishman, Ralph
This dissertation focuses on fast system development for Information Extraction (IE). State-of-the-art systems heavily rely on extensively annotated corpora, which are slow to build for a new domain or task. Moreover, previous systems are mostly built with local evidence such as words in a short context window or features that are extracted at the sentence level. They usually generalize poorly on new domains.
This dissertation presents novel approaches for rapidly training an IE system for a new domain or task based on both local and global evidence. Specifically, we present three systems: a relation type extension system based on active learning, a relation type extension system based on semi-supervised learning, and a cross-domain bootstrapping system for domain adaptive named entity extraction.
The active learning procedure adopts features extracted at the sentence level as the local view and distributional similarities between relational phrases as the global view. It builds two classifiers based on these two views to find the most informative contention data points to request human labels so as to reduce annotation cost.
The semi-supervised system aims to learn a large set of accurate patterns for extracting relations between names from only a few seed patterns. It estimates the confidence of a name pair both locally and globally: locally by looking at the patterns that connect the pair in isolation; globally by incorporating the evidence from the clusters of patterns that connect the pair. The use of pattern clusters can prevent semantic drift and contribute to a natural stopping criterion for semi-supervised relation pattern discovery.
For adapting a named entity recognition system to a new domain, we propose a cross-domain bootstrapping algorithm, which iteratively learns a model for the new domain with labeled data from the original domain and unlabeled data from the new domain. We first use word clusters as global evidence to generalize features that are extracted from a local context window. We then select self-learned instances as additional training examples using multiple criteria, including some based on global evidence.
Title: Combating Sybil attacks in cooperative systems
Candidate: Tran, Nguyen
Advisor(s): Li, Jinyang
Cooperative systems are ubiquitous nowadays. In a cooperative system, end users contribute resource to run the service instead of only receiving the service passively from the system. For example, users upload and comment pictures and videos on Flicker and YouTube, users submit and vote on news articles on Digg. As another example, users in BitTorrent contribute bandwidth and storage to help each other download content. As long as users behave as expected, these systems benefit immensely from user contribution. In fact, five out of ten most popular websites are operating in this cooperative fashion (Facebook, YouTube, Blogger, Twitter, Wikipedia). BitTorrent is dominating the global Internet traffic.
A robust cooperative system cannot blindly trust that its users will truthfully participate in the system. Malicious users seek to exploit the systems for profit. Selfish users consume but avoid to contribute resource. For example, adversaries have manipulated the voting system of Digg to promote their articles of dubious quality. Selfish users in public BitTorrent communities leave the system to avoid uploading files to others, resulting in drastic performance degradation for these content distribution systems. The ultimate way to disrupt security and incentive mechanisms of cooperative systems is using Sybil attacks, in which the adversary creates many Sybil identities (fake identities) and use them to disrupt the systems' normal operation. No security and incentive mechanism works correctly if the systems do not have a robust identity management that can defend against Sybil attacks.
This thesis provides robust identity management schemes which are resilient to the Sybil attack, and use them to secure and incentivize user contribution in several example cooperative systems. The main theme of this work is to leverage the social network among users in designing secure and incentive-compatible cooperative systems. First, we develop a distributed admission control protocol, called Gatekeeper, that leverages social network to admit most honest user identities and only few Sybil identities into the systems. Gatekeeper can be used as a robust identity management for both centralized and decentralized cooperative systems. Second, we provide a vote aggregation system for content voting systems, called SumUp, that can prevent an adversary from casting many bogus votes for a piece of content using the Sybil attack. SumUp leverages unique properties of content voting systems to provide significantly better Sybil defense compared with applying a general admission control protocol such as \gatekeeper. Finally, we provide a robust reputation system, called Credo, that can be used to incentivize bandwidth contribution in peer-to-peer content distribution networks. Credo reputation can capture user contribution, and is resilient to both Sybil and collusion attacks.
Title: Multi-species biclustering: An integrative method to identify functional gene conservation between multiple species
Candidate: Waltman, Peter
Advisor(s): Bonneau, Richard
Background : Several recent comparative functional genomics projects have indicated that the co-regulation of many genes is conserved across species, at least in part. This suggests that comparative analysis of functional genomics data-sets could prove powerful in identifying co-regulated groups that are conserved across multiple species.
Results : We present recent work to extend our cMonkey algorithm to simultaneously bicluster heterogeneous data from multiple species to identify conserved modules of orthologous genes, which can yield evolutionary insights into the formation of regulatory modules. We also present results from the multi-species analysis to two triplets of bacteria. The first of these is a triplet of Gram-positive bacteria consisting of Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes, while the second is a triplet of Gram-negative bacteria that includes Escherichia coli, Salmonella typhimurium and Vibrio cholerae. Finally, we will present initial results from the multi-species biclustering analysis of human and mouse hematopoietic differentiation data.
Conclusion : Analysis of biclusters obtained revealed a surprising number of gene groups with conserved modularity and high biological significance as judged by several measures of cluster quality. We also highlight cases of interest from the Gram-positive triplet, including one that suggests a temporal difference in the expression of genes governing sporulation in the two Bacillus species. While analysis of the mouse and human hematopoietic differentiation is preliminary, it indicates the applicability of this analysis to eukaryotic systems, including comparison of cancer model systems. Finally, we suggest ways in which this analysis could be extended to identify divergent modules that may exist between normal and disease tissue.