Theses & Reports
Instructions for submitting a technical report or thesis.
You can find technical reports published prior to 1990 archived here.
-
Ph.D. Thesis
2021
Advances in computer bridge: techniques for a partial-information, communication-based game.
Bethe, Paul
Abstract | PDF
Title: Advances in computer bridge: techniques for a partial-information, communication-based game.
Candidate: Bethe, Paul
Advisor(s): Ernest Davis
Abstract:
Bridge is an imperfect information game with elements of competition
against opponents as well as cooperation with a partner. Despite the
application of many tenets of artificial intelligence, humans have yet
to be consistently bested by the computer. This thesis explores AI
shortcomings in both the play and bidding phases of the game. In the
play, we examine weaknesses in cutting-edge Monte Carlo techniques
and explore both inference- and learning-based solutions. In the bidding,
we go beyond existing rule-based systems and investigate deep
reinforcement learning as a method to learn how to bid.
-
Ph.D. Thesis
2021
Learning Causality in Molecular Biology
Cirrone, Jacopo
Abstract | PDF
Title: Learning Causality in Molecular Biology
Candidate: Cirrone, Jacopo
Advisor(s): Dennis Shasha
Abstract:
The Systems Biology community has invested a great deal of effort in
modeling gene regulatory networks that should be able to (i) accurately
predict future states and (ii) identify regulatory hubs that can be
manipulated to achieve desired phenotypes. Most computational tools for
the problem embody linear models (e.g., 5*TF1 + 2*TF2 - 0.4*TF3 + ...).
However, it is well known that biological interactions are highly
synergistic and non-linear. Further, those tools mostly try to directly
predict networks even when the discovered edges (which usually come from
some assay such as ChIP-seq) may have little physiological significance
(e.g., may not influence gene expression).
This thesis considers an alternative approach to inferring gene
causality. Specifically, we consider the problem of predicting the
expression of genes at a future time point in a genomic time series. In
this, we follow the philosophy that accurate prediction often
corresponds to a good understanding of causality.
The prediction may rest on several sources of data: the time point
immediately preceding t, the entire target time series preceding t, and
ancillary data. In biology, for example, the ancillary data may consist
of a network based on binding data, data from different time series,
steady state data, a community-blessed gold standard network, or some
combination of those. We introduce OutPredict, which is a machine
learning method for time series that incorporates ancillary steady state
and network data to achieve a low error in gene expression prediction.
We show that OutPredict outperforms several of the best state-of-the-art
prediction methods. The predictive models that OutPredict generates in turn
yield a causal network.
Thus, this thesis presents an approach to the inference of causality
based on predictions of out-of-sample time-points based on both steady
state and time series data. Because the model for each gene identifies
those transcription factors that have the most importance in prediction,
those important transcription factors are the most likely causal
elements for that gene. We validate those predictions for a set of
well-documented transcription factors in Arabidopsis.
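The philosophy that good prediction points at causality can be sketched with a toy experiment: fit a nonlinear predictor of a gene's next-time-point expression from transcription-factor levels, then rank the TFs by permutation importance. Everything below, from the synergistic data to the k-nearest-neighbor predictor, is an invented stand-in for OutPredict's actual machinery.

```python
import random

random.seed(0)

# Toy time series: the target gene's expression at time t+1 is a
# synergistic (multiplicative) function of TF1 and TF2 at time t,
# while TF3 is irrelevant. All names and values are illustrative.
X = [[random.random() for _ in range(3)] for _ in range(300)]
y = [x[0] * x[1] for x in X]
train_X, train_y = X[:200], y[:200]
test_X, test_y = X[200:], y[200:]

def knn_predict(x, k=5):
    # A simple nonlinear predictor: average target of the k nearest
    # training points (standing in for OutPredict's ensemble model).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, tx)), ty)
        for tx, ty in zip(train_X, train_y)
    )
    return sum(ty for _, ty in dists[:k]) / k

def test_mse(permute=None):
    # Permutation importance: shuffling a truly causal feature should
    # raise the test error; shuffling an irrelevant one should not.
    order = random.sample(range(len(test_X)), len(test_X))
    err = 0.0
    for i, (x, t) in enumerate(zip(test_X, test_y)):
        x = list(x)
        if permute is not None:
            x[permute] = test_X[order[i]][permute]
        err += (knn_predict(x) - t) ** 2
    return err / len(test_X)

base = test_mse()
importance = [test_mse(j) - base for j in range(3)]
# TF1 and TF2 come out far more "important" than TF3, marking them as
# the likely causal regulators of the target gene.
```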
Because our methods apply to any situation in which there is time series
data, ancillary data, and the need for non-linear causal models, we
believe that this work will have a broad appeal to the scientific
community, specifically those studying causality networks in any
biological system.
-
Ph.D. Thesis
2021
Responsibility Analysis by Abstract Interpretation
Deng, Chaoqiang
Abstract | PDF
Title: Responsibility Analysis by Abstract Interpretation
Candidate: Deng, Chaoqiang
Advisor(s): Patrick Cousot
Abstract:
Given a behavior of interest, automatically determining the corresponding responsible entity (or, say, the root cause) is a task of critical importance in various scientific fields, especially in program static analysis. Classical static analysis techniques (e.g., dependency analysis, taint analysis, slicing) help programmers narrow down the scope of responsibility, but none of them can explicitly identify the responsible entity. Meanwhile, causality analysis is generally not pertinent for analyzing programs, and the structural equations model (SEM) of actual causality misses information inherent in programs (e.g., temporal ordering, and whether an entity is free to make choices or not), making the corresponding program analysis imprecise.
In this dissertation, inspired by a classic forest fire example used in defining causality, a novel definition of responsibility based on the abstraction of trace semantics is proposed, which is expressive and generic enough to cope with both program analyses and tasks in other scientific fields. Briefly speaking, an action aR is responsible for a behavior B in a certain trace if and only if aR is free to make choices, and that choice is the first one to ensure the occurrence of B in the trace. This definition makes use of the temporal ordering of actions, as well as of whether an action has free choices or not. In addition, our definition of responsibility takes into account the cognizance of the observer, which, to the best of our knowledge, is a novel idea in program analysis. Compared to current dependency and causality analysis methods, responsibility analysis is demonstrated to be more precise on many examples.
Furthermore, this dissertation proposes a sound framework of abstract responsibility analysis, which allows a balance between cost and precision to address the undecidability of responsibility. Essentially, the abstract analysis builds a trace partitioning automaton by an iteration of over-approximating forward reachability analysis with trace partitioning and under-approximating/over-approximating backward impossible failure accessibility analysis, and determines the bounds of potentially responsible entities along paths in the automaton. Unlike the concrete responsibility analysis, which identifies exactly one action as the responsible entity along every concrete trace, the abstract analysis may lose some precision and find multiple actions potentially responsible along each automaton path. However, soundness is preserved: every responsible entity in the concrete is guaranteed to also be found responsible in the abstract.
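The definition of responsibility can be made concrete in a small sketch. Here a trace is a list of (action, is-free-choice) pairs and `guarantees` is an oracle telling whether a prefix already ensures the behavior B; all of this is an illustrative simplification of the trace-semantics formulation, not the dissertation's actual machinery.

```python
# Responsibility, toy version: the responsible action is the first free
# choice whose execution turns "B may happen" into "B is guaranteed".
def responsible_action(trace, guarantees):
    """trace: list of (name, is_free_choice); guarantees(prefix) -> bool."""
    for i, (name, is_free) in enumerate(trace):
        ensured_now = guarantees(trace[:i + 1])
        ensured_before = guarantees(trace[:i])
        if is_free and ensured_now and not ensured_before:
            return name
    return None

# Forest-fire flavored example: the wind is not a choice; dropping a lit
# match is, and it is what first guarantees the fire.
trace = [("wind_blows", False), ("drop_match", True), ("fire_starts", False)]
guarantees = lambda prefix: any(name == "drop_match" for name, _ in prefix)
assert responsible_action(trace, guarantees) == "drop_match"

# With no free choice anywhere (pure lightning), nothing is responsible.
trace2 = [("lightning", False), ("fire_starts", False)]
assert responsible_action(trace2, lambda prefix: len(prefix) > 0) is None
```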
- TR2021-996 2021 Quantum Information Physics: 1 Geiger, Davi; Zvi M. Kedem Abstract | PDF
- TR2021-997 2021 Quantum Information Physics: 2 Geiger, Davi; Zvi M. Kedem Abstract | PDF
-
Ph.D. Thesis
2021
Enhancing Collaboration and Productivity for Virtual and Augmented Reality
He, Zhenyi
Abstract | PDF
Title: Enhancing Collaboration and Productivity for Virtual and Augmented Reality
Candidate: He, Zhenyi
Advisor(s): Ken Perlin
Abstract:
Immersive environments such as Virtual Reality (VR) and Augmented Reality (AR) are now receiving more and more attention. Although VR and AR have largely been used for individual entertainment experiences, they also possess huge potential as a platform for supporting collaboration and productivity. My thesis work is concerned with enabling VR/AR to be flexibly adapted for collaborative and productive uses. I approach this goal from several directions: a new haptic user interface based on actuated robots to bridge the virtual and physical worlds, a reconfigurable framework for both co-located and geographically dispersed multi-user communication, and a text entry system in which users type by tapping their fingers, without needing to look at their hands or be aware of their hand positions. Further, I extend these ideas to a daily video conferencing experience that requires minimal hardware.
-
TR2021-998
2021
DietVision: An App for Image-based Food Identification, Volume, and Nutrition Estimation
Hofmann, Michael;
Leopold Maillard; Jessica Ramaux; Dennis Shasha
Abstract | PDF
Title: DietVision: An App for Image-based Food Identification, Volume, and Nutrition Estimation
Author(s): Hofmann, Michael; Leopold Maillard; Jessica Ramaux; Dennis Shasha
Abstract:
DietVision is a mobile app that provides an estimate of the nutritional content of a meal
from images.
The software provides three functions: (i) food detection, which classifies each
detected item and assigns it to a major food group; (ii) volume estimation, which
uses two images of the plate taken at different angles along with a coin as a
fiducial marker; and (iii) user feedback to correct errors in steps (i) and (ii).
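The fiducial marker in step (ii) reduces to a simple scale recovery. A back-of-envelope sketch follows; the coin size, pixel counts, and helper function are invented for illustration and are not taken from DietVision.

```python
# Convert pixel measurements to millimeters using a coin of known size.
def mm_per_pixel(coin_diameter_mm, coin_diameter_px):
    return coin_diameter_mm / coin_diameter_px

# A US quarter is 24.26 mm wide; suppose it spans 97 px in the photo.
scale = mm_per_pixel(24.26, 97)

# A food region of 400 x 300 px then covers roughly this area in mm^2;
# a second view at a different angle would supply the height estimate
# needed to turn area into volume.
area_mm2 = (400 * scale) * (300 * scale)
```
-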
Ph.D. Thesis
2021
Larger-Context Neural Machine Translation
Jean, Sébastien
Abstract | PDF
Title: Larger-Context Neural Machine Translation
Candidate: Jean, Sébastien
Advisor(s): Kyunghyun Cho
Abstract:
Translation helps connect people by bridging language barriers. It can make travel more enjoyable, allow our minds to explore imaginary worlds, let us talk to others, and so on. Given the need for translation, but the limited availability of human translators, machine translation has flourished. Most systems translate sentences one by one, ignoring their context, which isn't always sufficient: the missing information can lead to incorrect or inconsistent translations. We believe that neural machine translation (NMT) is particularly well suited to incorporating the surrounding context. Indeed, NMT systems can attend to arbitrarily distant words, while the use of continuous representations improves generalization on unseen examples.
As such, in this thesis, we extend neural machine translation to leverage information from the surrounding context. To do so, we first highlight the potential of the then-nascent NMT paradigm. We subsequently introduce architectural changes to integrate information from the surrounding document, initially starting from the preceding sentence. We further encourage models to use context from either a learning or data augmentation perspective. We also consider the efficient use of document-level neural language models for this task. While some challenges still remain, our work has helped establish larger-context translation on a solid footing, and we are optimistic about future progress.
-
Ph.D. Thesis
2021
Improving Sample Efficiency of Imitation and Reinforcement Learning
Kostrikov, Ilya
Abstract | PDF
Title: Improving Sample Efficiency of Imitation and Reinforcement Learning
Candidate: Kostrikov, Ilya
Advisor(s): Rob Fergus
Abstract:
Reinforcement Learning (RL) is an area of machine learning focused on learning to make a sequence of actions in an environment that maximizes cumulative rewards. Combined with Deep Learning, Reinforcement Learning has made significant progress over the last decade across various domains. Notable successes include achieving superhuman performance on Atari games, Go, StarCraft II, Dota 2, and various continuous control tasks.
However, RL's success stories are often limited to games and simulations where it is possible to generate a large amount of training data. This thesis describes several methods focused on improving sample efficiency to enable a wider variety of RL applications. For the first half of the thesis, we focus on Imitation Learning, where ground truth rewards are usually unknown, and expert demonstrations define optimality. First, we introduce a method for robust and sample efficient imitation learning. We adapt an imitation learning approach where an agent tries to mimic a domain expert using a GAN-like framework called GAIL. We identify two primary sources of sample inefficiency associated with this approach: on-policy RL and GAN discriminator training. We show that sample inefficiency can be mitigated by performing off-policy RL training combined with off-policy training of the discriminator. We also identify and resolve some task-specific biases associated with the family of adversarial imitation learning algorithms based on GAIL. Then, we derive a principled off-policy formulation of robust imitation learning that is entirely offline and allows us to learn a policy that imitates the expert relying only on the previously collected data. This work concludes the part of the thesis focused on imitation learning, and for the rest of the thesis, we focus on online and offline RL where we have access to environment rewards. We observe that off-policy RL from pixels suffers from overfitting and propose a simple solution inspired by image augmentation techniques from Computer Vision. Finally, we introduce a method for offline RL that utilizes a pre-trained behavioral policy to improve the robustness of behavior regularization widely used in the context of offline RL. In contrast to prior work on Offline RL, this method utilizes the behavior policy to regularize the critic instead of constraining the training policy.
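The "task-specific biases" of adversarial imitation can be illustrated by the shape of discriminator-derived rewards. With D(s, a) in (0, 1) the discriminator's belief that a transition came from the expert, common GAIL-style reward choices have a fixed sign, which implicitly rewards surviving longer or terminating sooner. The formulas below are a generic illustration of this point, not the thesis's exact analysis.

```python
import math

def reward_survival_biased(d):
    # -log(1 - D): strictly positive, so longer episodes always earn more.
    return -math.log(1.0 - d)

def reward_termination_biased(d):
    # log D: strictly negative, so ending the episode early looks good.
    return math.log(d)

def reward_unbiased(d):
    # log D - log(1 - D): can take either sign depending on D.
    return math.log(d) - math.log(1.0 - d)

for d in (0.2, 0.5, 0.8):
    assert reward_survival_biased(d) > 0
    assert reward_termination_biased(d) < 0

# The sign-free form is zero exactly when the discriminator is undecided.
assert abs(reward_unbiased(0.5)) < 1e-12
assert reward_unbiased(0.8) > 0 > reward_unbiased(0.2)
```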
-
Ph.D. Thesis
2021
Latent Variable Models and Iterative Refinement for Non-Autoregressive Neural Machine Translation
Lee, Jason
Abstract | PDF
Title: Latent Variable Models and Iterative Refinement for Non-Autoregressive Neural Machine Translation
Candidate: Lee, Jason
Advisor(s): Kyunghyun Cho
Abstract:
Deep neural networks have fundamentally transformed the field of machine translation, and have replaced statistical phrase-based approaches to serve translations to millions of users in production systems every day. Despite impressive progress in translation accuracy, improving decoding speed remains a key challenge, as most systems are autoregressive and generate a sentence word by word. As neural machine translation (NMT) models become increasingly deep and complex, there is a growing need for more efficient translation systems with sub-linear or constant inference latency with respect to sentence length. The main challenge in non-autoregressive machine translation is capturing the dependencies between tokens in a target sentence without autoregression. Motivated by a rich history of probabilistic graphical models in sequence generation, this thesis proposes to use latent variables to model intra-sentence dependencies, such that the output distribution can be factorized given the latent variables. We also present several inference algorithms for non-autoregressive machine translation based on iterative refinement, which revises a sentence over multiple iterations. Our non-autoregressive models based on latent variables and iterative refinement can deliver significant decoding speedups with comparable translation accuracy relative to a strong autoregressive baseline. Finally, we investigate the correlation between the training objective (log-likelihood) and the test objective (BLEU) of several model families. We observe that the two metrics are not correlated when comparing models from different families (e.g. between autoregressive and latent variable models).
-
Ph.D. Thesis
2021
Neural Structured Prediction using Iterative Refinement with Applications to Text and Molecule Generation
Mansimov, Elman
Abstract | PDF
Title: Neural Structured Prediction using Iterative Refinement with Applications to Text and Molecule Generation
Candidate: Mansimov, Elman
Advisor(s): Kyunghyun Cho
Abstract:
Humans excel at generating structured data in the form of images, text, speech, molecules, computer code, and others. Researchers have spent several decades proposing various solutions for the effective generation of these structured objects in a data-driven way, known as structured prediction. With the revival of deep neural networks, autoregressive models that process structured objects in a fixed left-to-right monotonic ordering became a de facto solution for this problem. Notable successes of autoregressive models include neural machine translation [Sutskever et al., 2014, Bahdanau et al., 2014, Vaswani et al., 2017], open-ended text generation [Radford et al., 2019, Brown et al., 2020], and text-to-speech synthesis [van den Oord et al., 2016], among many others.
Despite the considerable success of autoregressive models on many applications, a natural question arises whether alternative approaches are possible for structured prediction. This thesis describes a novel method for structured prediction based on the principle of iterative refinement with a particular focus on applications to text and molecule generation. We first introduce the iterative refinement framework for text generation. Starting from the blank sentence, the iterative refinement approach gradually refines text over multiple steps. Using this approach, we show that we can flexibly generate the text in various ways, such as generate all or some words in parallel and generate text according to the ordering learned from the data. We show that iterative refinement achieves competitive performance compared to autoregressive models while delivering a speedup in decoding. We conclude this thesis by showing how we can adapt the iterative refinement framework originally introduced for text generation for molecule generation. In particular, we demonstrate two iterative refinement approaches for molecular graph generation and molecular geometry prediction. We anticipate that models based on the iterative refinement will be broadly applicable to other domains of interest.
-
Ph.D. Thesis
2021
Scalable Particulate Flow Simulations with Boundary Integral Equations
Morse, Matthew
Abstract | PDF
Title: Scalable Particulate Flow Simulations with Boundary Integral Equations
Candidate: Morse, Matthew
Advisor(s): Denis Zorin
Abstract:
Numerical simulation of complex particulate flows, and of red blood cell flows through capillaries in particular, is an important investigational tool in the biological sciences. The ability to rapidly evaluate the impact of vessel and cell geometries, plasma viscosity, and particulate densities on macroscopic physiology is crucial to pursuing further biological understanding. Experimental techniques are costly and time-consuming, while analytical approaches are often of limited practical use in realistic scenarios, ultimately underscoring the importance of a computational approach.
In this work, we construct such a simulation, capable of simulating microliters of blood flowing through realistic vasculature, along with more general particulate suspensions. Due to the micrometer length scales of typical capillaries, we can model the blood plasma as a Stokesian fluid and red blood cells as inextensible, deformable membranes. By reformulating the viscous flow as a set of boundary integral equations, we are able to produce a method that has optimal complexity with high-order accuracy that is capable of handling dense particulate suspensions in complex geometries.
This approach relies on a novel, robust solver for elliptic partial differential equations, applied to Stokes flow. A core component of the solver is a novel fast algorithm to compute the value of the solution near and on the domain boundary, which we have named QBKIX. We provide a set of algorithms to guarantee the accuracy of QBKIX on piecewise smooth surfaces, discuss the error behavior and complexity of QBKIX, and evaluate its performance.
Leveraging this solver in a confined blood flow simulation involves advecting deformable particulates along the flow trajectory. Large timesteps are required for an efficient simulation, but can cause collisions among cells and with the vessel wall if performed naively. We present collision detection and resolution algorithms for the red blood cells and the blood vessel. We parallelize QBKIX and the collision algorithms and scale the final simulation to nearly 35,000 cores.
-
Ph.D. Thesis
2021
Towards More General and Adaptive Deep Reinforcement Learning Agents
Raileanu, Roberta
Abstract | PDF
Title: Towards More General and Adaptive Deep Reinforcement Learning Agents
Candidate: Raileanu, Roberta
Advisor(s): Rob Fergus
Abstract:
Building agents with general skills that can be applied in a wide
range of settings has been a long-standing problem in machine
learning. The most popular framework for training agents to make
sequential decisions in order to maximize reward in a given
environment is Reinforcement Learning (RL). Over the last decade, deep
reinforcement learning, where RL agents are parameterized by neural
networks, has achieved impressive results on a number of tasks, from
games such as Atari, Go, StarCraft, or Dota, to continuous control
tasks with applications in robotics.
However, current RL agents are prone to overfitting and struggle to
generalize when even minor perturbations are applied to the training
environment. This hinders progress on real-world applications such as
autonomous vehicles or home robots, where agents need to deal with a
large variety of scenarios. In this thesis, we introduce several
methods for improving the versatility of deep reinforcement learning
agents. We start by studying the problem of zero-shot generalization
to new instances of a task after training on a limited number of
environments. We first propose an approach for regularizing the policy
and value function of a RL agent and automatically finding an
effective type of data augmentation for a given task. We also identify
that there is an asymmetry between the information needed to represent
the optimal policy and the true value function, which leads to
overfitting when using standard deep RL algorithms. As a step towards
solving this problem, we propose a method which decouples the
optimization of the policy and value, and constrains the
representation to be invariant to the task instance. Next, we focus on
the problem of learning general exploration strategies for
procedurally generated environments with sparse rewards. We formulate
a new type of intrinsic reward which encourages agents to impact their
environments and show that it outperforms other popular exploration
methods. Then, we discuss a novel approach for fast adaptation to new
dynamics. We show that our method, which leverages self-supervised
techniques to learn policy and environment embeddings, enables
adaptation within a single episode on a number of continuous control
tasks. Finally, we investigate how agents can learn more flexible
strategies for interacting with different opponents and collaborators.
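The impact-driven intrinsic reward described above can be sketched as the distance between consecutive state embeddings: actions that change the (embedded) state earn reward, while no-ops earn nothing. The grid-world state and hand-rolled embedding below are invented stand-ins for the learned representations an agent would actually use.

```python
import math

def embed(state):
    # Illustrative embedding: the agent's coordinates in a grid world; a
    # real agent would use a learned state representation instead.
    return (float(state["x"]), float(state["y"]))

def impact_reward(state, next_state):
    # Intrinsic reward proportional to how much the action changed the
    # environment, encouraging the agent to seek impactful behavior.
    return math.dist(embed(state), embed(next_state))

s0 = {"x": 0, "y": 0}
s1 = {"x": 1, "y": 0}                      # this action moved the agent
assert impact_reward(s0, s1) > impact_reward(s0, s0) == 0.0
```
-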
Ph.D. Thesis
2021
Theory and Algorithms for Several Central Problems in Large-Scale Machine Learning
Storcheus, Dmitry
Abstract | PDF
Title: Theory and Algorithms for Several Central Problems in Large-Scale Machine Learning
Candidate: Storcheus, Dmitry
Advisor(s): Mehryar Mohri
Abstract:
This Ph.D. dissertation presents a fundamental analysis of several central problems in large-scale machine learning. We derive novel, scalable algorithms supported by strong theoretical guarantees for the most practically important large-scale learning scenarios. These scenarios include extensions of standard supervised learning to multiple base hypothesis spaces, multiple objective functions, multiple distributions, multiple classes, and high-dimensional feature spaces.
A standard supervised learning scenario consists of fitting a predictor from a fixed hypothesis space that minimizes a certain empirical loss on a sample drawn i.i.d. from a particular distribution. The richness of modern machine learning applications requires the learning scenario to be large-scale, with the ability to learn from many training examples. While scalability in terms of the number of examples is widely studied, the current state of research in the field overlooks other scenarios and directions for scalability that may be even more important than many training examples: for instance, allowing the learner to select predictors from multiple hypothesis spaces of varying complexity, or to fit multiple objective functions.
While the problems mentioned above may seem to relate to separate aspects of large-scale learning, this thesis provides a unified theoretical analysis framework that brings these central problems together. This framework is based on the Rademacher complexity analysis as well as on the Empirical and Structural Risk Minimization principles.
-
Ph.D. Thesis
2021
The Evolutionary Maps of Data
Tamaskar, Abhinav
Abstract | PDF
Title: The Evolutionary Maps of Data
Candidate: Tamaskar, Abhinav
Advisor(s): Bud Mishra
Abstract:
We present a geometric view of analyzing temporal causal models from the perspective of topology and limit graphs. We briefly cover an intuitive overview of the topological techniques used and the theory of limit graphs. We then briefly describe the Suppes-Bayes causal networks used as the temporal causal models. We describe evolutionary models used in the scientific literature and show an efficient model for performing simulations on generalized large-scale evolutionary networks. We then present techniques for analyzing large-scale evolutionary populations, and showcase their generality through two real-world examples: (1) linguistic data from Reddit over the course of five years, where we show the existence of echo chambers and give a metric to analyze the similarity of populations over time; and (2) the TCGA and COSMIC datasets of cancer mutations across over 11,000 genes, where we use an approximation metric on the space of causal models to find similar cancer types and perform transfer learning to boost survival forecasting through black-box learning models.
-
TR2021-999
2021
A Microservice Redesign of Search and Inference for the Linguistic Website Terraling
Vasandani, Shailesh;
Hannan Butt; Dennis Shasha
Abstract | PDF
Title: A Microservice Redesign of Search and Inference for the Linguistic Website Terraling
Author(s): Vasandani, Shailesh; Hannan Butt; Dennis Shasha
Abstract:
The linguistics web application Terraling serves many useful
functions for linguists. By extracting the critical path for linguistic
analysis into microservices, we are able to improve user experience,
optimize performance, and increase maintainability.
By using a modern stack with a React frontend and a Golang backend,
performance was improved by a factor of 700. In addition, new features can
be added with high velocity. The site can be accessed from any device.
-
Ph.D. Thesis
2021
Order and Learning in Sequential Neural Structured Prediction
Welleck, Sean
Abstract | PDF
Title: Order and Learning in Sequential Neural Structured Prediction
Candidate: Welleck, Sean
Advisor(s): Kyunghyun Cho
Abstract:
Structured objects such as sets, trees, and sequences appear in a variety of scientific and industrial domains. Developing machine learning methods that generate these objects is of interest for both scientific understanding and practical applications. One approach, sequential neural structured prediction, decomposes generation into a sequence of predictions, with each prediction made by a deep neural network. Choosing an appropriate sequential representation of each structured object and selecting an effective learning objective are key to adopting this approach. The standard method for learning specifies a canonical ordering of elements in the sequential representation and maximizes the likelihood of the resulting sequences. We develop two streams of research that explore alternatives to this fixed-order, maximum likelihood approach for sequentially generating sets, trees, and sequences, with a focus on natural language processing applications.
First, we focus on text generation and study degenerate properties of fixed-order maximum-likelihood learning, motivating new learning methods. We characterize the degeneracy using three properties observed in generated text: non-termination, logical incoherence, and repetition. To study non-termination, we develop theory that allows us to prove that conventional text generation methods can generate infinite-length sequences with high probability. To study logical incoherence, we create a dataset for investigating the degree to which a model logically contradicts its preceding statements. For reducing degeneration, we develop unlikelihood training, a learning method which penalizes task-specific textual properties. In the second part of the thesis, we remove the requirement of a fixed generation order with a learning framework called non-monotonic generation, which yields models that select input-dependent generation orders. We use non-monotonic generation to generate multisets, parse trees, and text. The investigations and techniques presented in this thesis lead to promising directions for future work.
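The unlikelihood objective mentioned above pairs the usual likelihood term with a penalty that pushes down the probability of undesirable tokens (for instance, recent repeats). Below is a sketch with toy probabilities in place of model outputs, following the general form of such an objective rather than the thesis's exact loss.

```python
import math

def unlikelihood_loss(probs, target, negatives, alpha=1.0):
    # Likelihood term for the gold token...
    loss = -math.log(probs[target])
    # ...plus an unlikelihood penalty on each undesirable candidate.
    for token in negatives:
        loss += -alpha * math.log(1.0 - probs[token])
    return loss

probs = {"cat": 0.6, "the": 0.3, "sat": 0.1}
base = unlikelihood_loss(probs, "cat", [])
penalized = unlikelihood_loss(probs, "cat", ["the"])   # "the" would repeat
assert penalized > base > 0   # the penalty is active while p("the") > 0
```
-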
Ph.D. Thesis
2021
Techniques for Sample-Efficient Reinforcement Learning
Whitney, William
Abstract | PDF
Title: Techniques for Sample-Efficient Reinforcement Learning
Candidate: Whitney, William
Advisor(s): Kyunghyun Cho
Abstract:
By leveraging advances in deep learning, reinforcement learning (RL) has recently advanced to the point that, for any task with a simulator, and thus the ability to collect nearly unlimited data, it might now be expected to yield superhuman performance. However, many practically relevant tasks take place in the physical world. Constructing physical simulators of sufficient fidelity and correspondence for transfer is a non-trivial challenge, so for the majority of physical tasks at least some amount of training on real data is required. Collecting data in the real world is sufficiently expensive that it makes up much of the cost of training a reinforcement learning agent.
This thesis focuses on improving the sample efficiency of reinforcement learning in order to make it more practical to use on physical systems. It includes three approaches to this goal. The first part studies the data collection process, and in particular the opportunity for exploration to improve the sample efficiency of RL. The second part considers the use of representation learning to improve generalization, and thus sample efficiency, in reinforcement learning. The third part examines the offline RL setting, which consists of pure policy optimization using a fixed dataset and therefore does not require additional data collection.
Taken together, this work studies techniques for improving the sample efficiency of reinforcement learning by collecting data which is more useful and diverse, then learning more from every sample.
It represents an early step on the path to RL as an everyday tool for control of physical systems.
-
Ph.D. Thesis
2021
Methods to Improve Knowledge Transfer Efficiency for Data-limited Problems in Genomics
Yi, Ren
Abstract | PDF
Title: Methods to Improve Knowledge Transfer Efficiency for Data-limited Problems in Genomics
Candidate: Yi, Ren
Advisor(s): Richard Bonneau
Abstract:
The recent advancement in computational genomics has greatly benefited from the explosion of high-throughput genomic data and similar growth in biological databases. However, as more sequencing technologies become available and large genomic consortiums start to crowdsource data from larger cohorts of research groups, data heterogeneity has become an increasingly prominent issue. Data integration across multiple data sources and data modalities becomes particularly important for a greater number of biological systems. High-throughput omics data are typically highly skewed towards a small number of model organisms, factors, and conditions with which wet-lab experiments have higher success rates. This skew introduces further technical challenges when building machine learning models for problems with limited data. This thesis describes methods that improve knowledge transfer efficiency for learning data-limited problems through effective task-specific feature representation in the multitask learning setting. We demonstrate the performance of our methods on two genomic problems: genetic variant calling and cell type-specific transcription factor binding prediction.