Instructions for submitting a technical report or thesis.
Title: Decision Procedures for Finite Sets with Cardinality, and Local Theories Extensions
Candidate: Bansal, Kshitij
Advisor(s): Barrett, Clark; Wies, Thomas
Abstract:
Many tasks in design, verification, and testing of hardware and computer systems can be reduced to checking satisfiability of logical formulas. Certain fragments of first-order logic that model the semantics of prevalent data types, and hardware and software constructs, such as integers, bit-vectors, and arrays are thus of most interest. The appeal of satisfiability modulo theories (SMT) solvers is that they implement decision procedures for efficiently reasoning about formulas in these fragments. Thus, they can often be used off-the-shelf as automated back-end solvers in verification tools. In this thesis, we expand the scope of SMT solvers by developing decision procedures for new theories of interest in reasoning about hardware and software.
First, we consider the theory of finite sets with cardinality. Sets are a common high-level data structure used in programming; thus, such a theory is useful for modeling program constructs directly. More importantly, sets are a basic construct of mathematics and thus natural to use when mathematically defining the properties of a computer system. We extend a calculus for finite sets to reason about cardinality constraints. The reasoning for cardinality involves tracking how different sets overlap. For an efficient procedure in an SMT solver, we'd like to avoid considering Venn regions directly, which has been the approach in earlier work. We develop a novel technique wherein potentially overlapping regions are considered incrementally. We use a graph to track the interaction of the different regions. Additionally, our technique leverages the procedure for reasoning about the other set operations (besides cardinality) in a modular fashion.
Second, a limitation frequently encountered is that verification problems are often not fully expressible in the theories supported natively by the solvers. Many solvers allow the specification of application-specific theories as quantified axioms, but their handling is incomplete outside of narrow special cases. We show how SMT solvers can be used to obtain complete decision procedures for local theory extensions, an important class of theories that are decidable using finite instantiation of axioms. We present an algorithm that uses E-matching to generate instances incrementally during the search, significantly reducing the number of generated instances compared to eager instantiation strategies.
Title: Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing
Author(s): Coolidge, Sam; Simon, Dan; Shasha, Dennis
Abstract:
Certain pairs of drugs can cause death from their interaction. Knowledge of such interactions is held in drug interaction networks. The problem is that such networks may miss interactions that should be present and may include interactions that should be absent. Clearly, such information is valuable. Drug interaction networks are not unique in this regard. The same holds for protein-protein interaction networks, ecological networks, and many others. Improving the quality of such networks often requires a ground truth analysis (e.g. more experiments) but Roger Guimerá, Marta Sales-Prado, and their colleagues have shown in several papers that a structural analysis of networks can lead to predictions of missing and spurious edges that can improve those networks. Our contribution in this paper and the accompanying software is to create a program implementing their algorithmic ideas that is parallelizable and easy to modify for researchers who wish to try out new ideas. Our software can be found at https://github.com/samcoolidge/network.
Title: Analyzing Source Code Across Static Conditionals
Candidate: Gazzillo, Paul
Advisor(s): Wies, Thomas
Abstract:
We need better tools for C, such as source browsers, bug finders, and automated refactorings. The problem is that large C systems such as Linux are software product lines, containing thousands of configuration variables controlling every aspect of the software from architecture features to file systems and drivers. The challenge of such configurability is how do software tools accurately analyze all configurations of the source without the exponential explosion of trying them all separately. To this end, we focus on two key subproblems, parsing and the build system. The contributions of this thesis are the following: (1) a configuration-preserving preprocessor and parser called SuperC that preserves configurations in its output syntax tree; (2) a configuration-preserving Makefile evaluator called Kmax that collections Linux's compilation units and their configurations; and (3) a framework for configuration-aware analyses of source code using these tools.
C tools need to process two languages: C itself and the preprocessor. The latter improves expressivity through file includes, macros, and static conditionals. But it operates only on tokens, making it hard to even parse both languages. SuperC is a complete, performant solution to parsing all of C. First, a configuration-preserving preprocessor resolves includes and macros yet leaves static conditionals intact, thus preserving a program's variability. To ensure completeness, we analyze all interactions between preprocessor features and identify techniques for correctly handling them. Second, a configuration-preserving parser generates a well-formed AST with static choice nodes for conditionals. It forks new subparsers when encountering static conditionals and merges them again after the conditionals. To ensure performance, we present a simple algorithm for table-driven Fork-Merge LR parsing and four novel optimizations. We demonstrate SuperC's effectiveness on the x86 Linux kernel.
Large-scale C codebases like Linux are software product families, with complex build systems that tailor the software with myriad features. Such variability management is a challenge for tools, because they need awareness of variability to process all software product lines within the family. With over 14,000 features, processing all of Linux's product lines is infeasible by brute force, and current solutions employ incomplete heuristics. But having the complete set of compilation units with precise variability information is key to static tools such a bug-finders, which could miss critical bugs, and refactoring tools, since behavior-preservation requires a complete view of the software project. Kmax is a new tool for the Linux build system that extracts all compilation units with precise variability information. It processes build system files with a variability-aware \texttt{make} evaluator that stores variables in a conditional symbol table and hoists conditionals around complete statements, while tracking variability information as presence conditions. Kmax is evaluated empirically for correctness and completeness on the Linux kernel. Kmax is compared to previous work for correctness and running time, demonstrating that a complete solution's added complexity incurs only minor latency compared to the incomplete heuristic solutions.
SuperC's configuration-preserving parsing of compilation units and Kmax's project-wide capabilities are in a unique position to process source code across all configurations. Bug-finding is one area where such capability is useful. Bugs may appear in untested combinations of configurations and testing each configuration one-at-a-time is infeasible. For example, one compilation units that defines a global function called by other compilation units may not be linked into the final program due to configuration variable selection. Such a bug can be found with Kmax and SuperC's cross-configuration capability. Cilantro is a framework for creating variability-aware bug-checkers. Kmax is used to determine the complete set of compilation units and the combinations of features that activate them, while SuperC's parsing framework is extended with semantic actions in order implement the checkers. A checker for linker errors across all compilation units in the Linux kernel demonstrates each part of the Cilantro framework and is evaluated on the Linux source code.
Title: Semi-Supervised Learning for Electronic Phenotyping in Support of Precision Medicine
Candidate: Halpern, Yonatan
Advisor(s): Sontag, David
Abstract:
Medical informatics plays an important role in precision medicine, delivering the right information to the right person, at the right time. With the introduction and widespread adoption of electronic medical records, in the United States and world-wide, there is now a tremendous amount of health data available for analysis. Electronic record phenotyping refers to the task of determining, from an electronic medical record entry, a concise descriptor of the patient, comprising of their medical history, current problems, presentation, etc. In inferring such a phenotype descriptor from the record, a computer, in a sense, "understands", the relevant parts of the record. These phenotypes can then be used in downstream applications such as cohort selection for retrospective studies, real-time clinical decision support, contextual displays, intelligent search, and precise alerting mechanisms.
To handle the incomplete data present in medical records, we use the formal framework of probabilistic graphical models with latent or unobserved variables. The first part of the thesis presents two different structural conditions under which learning with latent variables is computationally tractable. The first is the "anchored" condition, where every latent variable has at least one child that is not shared by any other parent. The second is the "singly-coupled" condition, where every latent variable is connected to at least three children that satisfy conditional independence (possibly after a transformation of the data). Variables that satisfy these conditions can be specified by an expert without requiring that the entire structure or its parameters be specified, allowing for effective use of human expertise and making room for statistical learning to do some of the heavy lifting in model learning. For both the anchored and singly-coupled conditions, practical algorithms are presented.
The second part of the thesis describes real-life applications using the anchored condition for electronic phenotyping. A human-in-the-loop learning system and a functioning emergency informatics system for real-time extraction of important clinical variables are described and evaluated.
The algorithms and discussion presented here were developed for the purpose of improving healthcare, but are much more widely applicable, dealing with the very basic questions of identifiability and learning models with latent variables - a problem that lies at the very heart of the natural and social sciences.
Title: Finding Prospects for Shopping Centers: a machine learning approach
Author(s): Kogan, Jonathan; Jain, Rishabh; Jean, Joe; Lowrance, Roy; Shasha, Dennis
Abstract:
We have developed an algorithm that predicts which store types are the best prospects to fill vacancies in shopping centers given the combinations of stores already there. The model is able to make predictions with accuracies up to 81.62% for the first prediction, 90.05% for the first two predictions, 93.34% for the first three predictions, 95.52% for the first four predictions, and 96.48% for the first five predictions. The p-values with respect to a naïve strategy of choosing the store types that are simply most frequent are all below 0.0001%. This paper explains how the system was built and some user tests, not all of which were positive. The system can be found at http://linserv2.cims.nyu.edu:54321. The code for the project can be found at https://github.com/jgk99/Store-Prospector.
Title: Improving Knowledge Base Population with Information Extraction
Candidate: Li, Xiang
Advisor(s): Grishman, Ralph
Abstract:
Knowledge Bases (KBs) are data resources that encode world knowledge in machine-readable formats. Knowledge Base Population (KBP) aims at understanding this knowledge and extending KBs with more semantic information, which is a fundamental problem in Artificial Intelligence. It can benefit a wide range of tasks, such as semantic search and question answering. Information Extraction (IE), the task of discovering important types of facts (entities, relations and events) in unstructured text, is necessary and crucial for successfully populating knowledge bases. This dissertation focuses on four essential aspects of knowledge base population by leveraging IE techniques: extracting facts from unstructured data, validating the extracted information, accelerating and enhancing systems with less annotation effort, and utilizing knowledge bases to improve real-world applications.
First, we investigate the Slot Filling task, which is a key component for knowledge base population. Slot filling aims to collect information from a large collection of news, web, or other sources of documents to determine a set of predefined attributes ("slots") for given person and organization entities. We introduce a statistical language understanding approach to automatically construct personal (user-centric) knowledge bases from conversational dialogs.
Second, we consider how to probabilistically estimate the correctness of the extracted slot values. Despite the significant progress of KBP research and systems in recent years, slot filling approaches are still far from completely reliable. Using the NIST KBP Slot Filling task as a case study, we propose a confidence estimation model based on the Maximum Entropy framework, and demonstrate the effectiveness of this model in both precision and the capability to improve the slot filling aggregation through a weighted voting strategy.
Third, we study rich annotation guided learning to fill the gap between an expert annotator and a feature engineer. We develop an algorithm to enrich features with the guidance of all levels of rich annotations from human annotators. We also evaluate the comparative efficacy, generality and scalability of this framework by conducting case studies on three distinct applications in various domains, including facilitating KBP slot filling systems. Empirical studies demonstrate that with little additional annotation time, we can significantly improve the performance for all tasks.
Finally, we explore utilizing knowledge bases in a real-world application - personalized content recommendation. Traditional systems infer user interests from surface-level features derived from online activity logs and user demographic profiles, rather than deeply understanding the context semantics. We conduct a systematic study to show the effectiveness of incorporating deep semantic knowledge encoded in the entities on modeling user interests, by utilizing the abundance of entity information from knowledge bases.
Title: Improving SAT Solvers by Exploiting Empirical Characteristics of CDCL
Candidate: Oh, Chanseok
Advisor(s): Wies, Thomas
Abstract:
The Boolean Satisfiability Problem (SAT) is a canonical decision problem originally shown to be NP-complete in Cook’s seminal work on the theory of computational complexity. The SAT problem is one of several computational tasks identified by researchers as core problems in computer science. The existence of an efficient decision procedure for SAT would imply P = NP. However, numerous algorithms and techniques for solving the SAT problem have been proposed in various forms in practical settings. Highly efficient solvers are now actively being used, either directly or as a core engine of a larger system, to solve real-world problems that arise from many application domains. These state-of-the-art solvers use the Davis-Putnam-Logemann-Loveland (DPLL) algorithm extended with ConflictDriven Clause Learning (CDCL). Due to the practical importance of SAT, building a fast SAT solver can have a huge impact on current and prospective applications. The ultimate contribution of this thesis is improving the state of the art of CDCL by understanding and exploiting the empirical characteristics of how CDCL works on real-world problems. The first part of the thesis shows empirically that most of the unsatisfiable real-world problems solvable by CDCL have a refutation proof with near-constant width for the great portion of the proof. Based on this observation, the thesis provides an unconventional perspective that CDCL solvers can solve real-world problems very efficiently and often more efficiently just by maintaining a small set of certain classes of learned clauses. The next part of the thesis focuses on understanding the inherently different natures of satisfiable and unsatisfiable problems and their implications on the empirical workings of CDCL. We examine the varying degree of roles and effects of crucial elements of CDCL based on the satisfiability status of a problem. Ultimately, we propose effective techniques to exploit the new insights about the different natures of proving satisfi- ability and unsatisfiability to improve the state of the art of CDCL. In the last part of the thesis, we present a reference solver that incorporates all the techniques described in the thesis. The design of the presented solver emphasizes minimality in implementation while guaranteeing state-of-the-art performance. Several versions of the reference solver have demonstrated top-notch performance, earning several medals in the annual SAT competitive events. The minimal spirit of the reference solver shows that a simple CDCL framework alone can still be made competitive with state-of-the-art solvers that implement sophisticated techniques outside the CDCL framework.
Title: Graph-based Approaches to Resolve Entity Ambiguity
Candidate: Pershina, Maria
Advisor(s): Grishman, Ralph
Abstract:
Information Extraction is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. One of the challenges of Information Extraction is to resolve ambiguity between entities either in a knowledge base or in text documents. There are many variations of this problem and it is known under different names, such as coreference resolution, entity disambiguation, entity linking, entity matching, etc. For example, the task of coreference resolution decides whether two expressions refer to the same entity; entity disambiguation determines how to map an entity mention to an appropriate entity in a knowledge base (KB); the main focus of entity linking is to infer that two entity mentions in a document(s) refer to the same real world entity even if they do not appear in a KB; entity matching (also record deduplication, entity resolution, reference reconciliation) is to merge records from databases if they refer to the same object.
Resolving ambiguity and finding proper matches between entities is an important step for many downstream applications, such as data integration, question answering, relation extraction, etc. The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains, posing a scalability challenge for Information Extraction systems. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and to answer complex queries. However the efficient alignment of large-scale knowledge bases still poses a considerable challenge.
Various aspects and different settings to resolve ambiguity between entities are studied in this dissertation. A new scalable domain-independent graph-based approach utilizing Personalized Page Rank is developed for entity matching across large-scale knowledge bases and evaluated on datasets of 110 million and 203 million entities. A new model for entity disambiguation between a document and a knowledge base utilizing a document graph and effectively filtering out noise is proposed. A new technique based on a paraphrase detection model is proposed to recognize name variations for an entity in a document. A new approach integrating a graph-based entity disambiguation model and this technique is presented for an entity linking task and is evaluated on a dataset for Â the Text Analysis Conference Entity Discovery and Linking 2014 task.
Title: On the Solution of Elliptic Partial Differential Equations on Regions with Corners II: Detailed Analysis
Author(s): Serkh, Kirill
Abstract:
In this report we investigate the solution of boundary value problems on polygonal domains for elliptic partial differential equations.
Title: Alphacodes: Usable, Secure Transactions with Untrusted Providers using Human Computable Puzzles
Author(s): Sharma, Ashlesh; Chandrasekaran, Varun; Amjad, Fareeha; Shasha, Dennis; Subramanian, Lakshminarayanan
Abstract:
Many banking and commerce payment systems, especially in developing regions, continue to require users to share private or sensitive information in clear-text with untrusted providers exposing them to different forms of man-in-the-middle attacks. In this paper, we introduce Alphacodes, a new paradigm that provides a usable security solution that enables users to perform secure transactions with untrusted parties using the notion of visual puzzles. Alphacodes are designed as verification codes for short message transactions and provide easy authentication of critical portions of a transaction. We describe how Alphacodes can be applied in different use cases and also show two simple applications that we have built using the Alphacodes framework. We show security vulnerabilities in existing systems and show how our protocol overcomes them. We also demonstrate the ease of use of Alphacodes with minimal training using two simple mechanical turk studies. Using another simple real world user study involving 10 users who speak Kannada (local Indian language), we show that the Alphacodes concept can be easily extended to other languages beyond English.
Title: Partition Memory Models for Program Analysis
Candidate: Wang, Wei
Advisor(s): Barrett, Clark; Wies, Thomas
Abstract:
Scalability is a key challenge in static program analyses based on solvers for Satisfiability Modulo Theories (SMT). For imperative languages like C, the approach taken for modeling memory can play a significant role in scalability. The main theme of this thesis is using partitioned memory models to divide up memory based on the alias information derived from a points-to analysis.
First, a general analysis framework based on memory partitioning is presented. It incorporates a points-to analysis as a preprocessing step to determine a conservative approximation of which areas of memory may alias or overlap and splits the memory into distinct arrays for each of these areas.
Then we propose a new cell-based field-sensitive points-to analysis, which is an extension of Steensgaard’s unification-based algorithms. A cell is a unit of access with scalar or record type. Arrays and dynamically memory allocations are viewed as a collection of cells. We show how our points-to analysis yields more precise alias information for programs with complex heap data structures.
Our work is implemented in Cascade, a static analysis framework for C programs. It replaces the former flat memory model that models the memory as a single array of bytes. We show that the partitioned memory models achieve better scalability within Cascade, and the cell-based memory model, in particular, improves the performance significantly, making Cascade a state-of-the-art C analyzer.
Title: Scaling Multicore Databases via Constrained Parallel Execution
Author(s): Wang, Zhaoguo; Mu, Shuai; Cui, Yang; Yi, Han; Chen, Haibo; Li, Jinyang
Abstract:
Multicore in-memory databases often rely on traditional concurrency control schemes such as two-phase-locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution opportunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency control (IC3), which provides serializability while allowing for parallel execution of certain conflicting transactions. IC3 combines the static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3’s runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC-C benchmark show that IC3 outperforms traditional concurrency control schemes under contention. It achieves the throughput of 434K transactions/sec on the TPC-C benchmark configured with only one warehouse. It also scales better than several recent concurrent control schemes that also target contended workloads.
Title: A New Strongly Polynomial Algorithm for Computing Fisher Market Equilibria with Spending Constraint Utilities
Candidate: Wang, Zi
Advisor(s): Cole, Richard
Abstract:
This thesis develops and analyzes an algorithm to compute equilibrium prices for a Fisher market in which the buyer utilities are given by spending constraint functions, utility functions originally defined by Devanur and Vazirani.
Vazirani gave a weakly polynomial time algorithm to compute the equilibrium prices. More recently Vegh gave a strongly polynomial algorithm. Here we provide another strongly polynomial algorithm, which arguably is conceptually simpler, although the running time is not always better.
Title: Learning Algorithms from Data
Candidate: Zaremba, Wojciech
Advisor(s): Fergus, Rob; LeCun, Yann
Abstract:
Statistical machine learning is concerned with learning models that describe observations. We train our models from data on tasks like machine translation or object recognition because we cannot explicitly write down programs to solve such problems. A statistical model is only useful when it generalizes to unseen data. Solomonoff has proved that one should choose the model that agrees with the observed data, while preferring the model that can be compressed the most, because such a choice guarantees the best possible generalization. The size of the best possible compression of the model is called the Kolmogorov complexity of the model. We define an algorithm as a function with small Kolmogorov complexity.
This Ph.D. thesis outlines the problem of learning algorithms from data and shows several partial solutions to it. Our data model is mainly neural networks as they have proven to be successful in various domains like object recognition, language modeling, speech recognition and others. First, we examine empirical trainability limits for classical neural networks. Then, we extend them by providing interfaces, which provide a way to read memory, access the input, and postpone predictions. The model learns how to use them with reinforcement learning techniques like Reinforce and Q-learning. Next, we examine whether contemporary algorithms such as convolution layer can be automatically rediscovered. We show that it is possible indeed to learn convolution as a special case in a broader range of models. Finally, we investigate whether it is directly possible to enumerate short programs and find a solution to a given problem. This follows the original line of thought behind the Solomonoff induction. Our approach is to learn a prior over programs such that we can explore them efficiently.
Title: Distributed Stochastic Optimization for Deep Learning
Candidate: Zhang, Sixin
Advisor(s): LeCun, Yann
Abstract:
We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze the convergence rate of the EASGD method in the synchronous scenario and compare its stability condition with the existing ADMM method in the round-robin scheme. An asynchronous and momentum variant of the EASGD method is applied to train deep convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Our approach accelerates the training and furthermore achieves better test accuracy. It also requires a much smaller amount of communication than other common baseline approaches such as the DOWNPOUR method.
We then investigate the limit in speedup of the initial and the asymptotic phase of the mini-batch SGD, the momentum SGD, and the EASGD methods. We find that the spread of the input data distribution has a big impact on their initial convergence rate and stability region. We also find a surprising connection between the momentum SGD and the EASGD method with a negative moving average rate. A non-convex case is also studied to understand when EASGD can get trapped by a saddle point.
Finally, we scale up the EASGD method by using a tree structured network topology. We show empirically its advantage and challenge. We also establish a connection between the EASGD and the DOWNPOUR method with the classical Jacobi and the Gauss-Seidel method, thus unifying a class of distributed stochastic optimization methods.
Title: Pushing the Limits of Additive Fabrication Technologies
Candidate: Zhou, Qingnan (James)
Advisor(s): Zorin, Denis
Abstract:
A rough symmetry can be observed in the stock price of 3D Systems (NYSE:DDD), the leading and largest 3D printer manufacturer, from its IPO on June 3, 2011 to the beginning of 2016. The price sky rocketed nearly 600% from 2011 to the end of 2013, and took a free fall back to its original value by 2016. Coincidentally, it is also the period during which I got my hands dirty and investigated some of the toughest challenges as well as exciting new possibilities associated with different types of 3D printing technologies. In this thesis, I documented my attempts from 3 different angles to push the limits of 3D printing: printability, microstructure design and robust geometry processing with mesh arrangements.
Printability check has long been the bottleneck that prevents 3D printing from scaling up. Oftentimes, designers of 3D models lack the expertise or tools to ensure 3D printability. 3D printing service providers typically rely human inspections to filter out unprintable designs. This process is manual and error-prone. As designs become ever more complex, manual printability check becomes increasingly difficult. To tackle this problem, my colleagues and I proposed an algorithm to automatically determine structurally weak regions and the worst-case usage scenario to break a given model. We validate the algorithm by physically break a number of real 3D printed designs.
A key distinctive feature of 3D printing technologies is that the cost and time of fabrication is uncorrelated with geometric complexity. This opens up many exciting new possibilities. In particular, by pushing geometric complexity to the extreme, 3D printing has the potential of fabricating soft, deformable shapes with microscopic structures using a single raw material. In our recent SIGGRAPH publication, my colleagues and I have not only demonstrated fabricating microscopic frame structures is possible but also proposed an entire pipeline for designing spatially varying microstructures to satisfy target material properties or deformation goals.
With the boost of 3D printing technologies, 3D models have become more abundant and easily accessible than ever before. These models are sometimes known as "wild" models because they differ significantly in complexity and quality from traditional models in graphics researches. This poses a serious challenge in robustly analyzing 3D designs. Many state-of-the-art geometry processing algorithms/libraries are ill-prepared for dealing with "wild" models that are non-manifold, self-intersecting, locally degenerate and/or containing multiple and possibly nested components. In our most recent SIGGRAPH submission, we proposed a systematic recipe based on mesh arrangements for conducting a family of exact constructive solid geometry operations. We exhaustively tested our algorithm on 10,000 "wild" models crawled from Thingiverse, a popular online shape repository. Both the code and the dataset are freely available to the public.