Theses & Reports
Instructions for submitting a technical report or thesis.
You can find technical reports published prior to 1990 archived here.
Ph.D. Thesis
2025
Title: On The Applications of Coarse Network Geometry to Personalized Immuno-Oncology
Candidate: Bannon, James
Advisor(s): Bud Mishra
Abstract:
Immune checkpoint inhibitors (ICIs), also called immune checkpoint blockers, are a promising category of targeted therapy for solid tumors. Predicting which patients will respond to ICI therapy remains an open problem under active investigation. This thesis aims to improve the precision with which immune checkpoint inhibitors are prescribed. By focusing on one type of biological measurement, whole-tumor shotgun RNA sequencing data, which we call bulk RNA-seq, we are able to deeply explore the potential and limits of predictors built from this kind of measurement. Two of the algorithms presented here are based on a notion of graph curvature which we believe has extensive promise in bioinformatic inquiry.
The first part of this thesis performs a rigorous permutation-testing evaluation of machine learning models for the task of predicting therapy response, which we cast as a binary classification problem. We show that bulk RNA-seq data contains predictive signal, but that there is an upper limit to ML model efficacy which could potentially be remedied by curating larger data sets or by augmenting RNA-seq data with other biological measurements.
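To make the statistical setup concrete, the sketch below runs the kind of permutation test described above using scikit-learn's permutation_test_score: the cross-validated AUC of a classifier is compared against a null distribution obtained by refitting on label-shuffled data. The random matrix stands in for a bulk RNA-seq cohort; this is only an illustration of the evaluation idea, not the thesis's models or protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))     # placeholder: 120 patients x 500 expression features
y = rng.integers(0, 2, size=120)    # placeholder: 1 = responder, 0 = non-responder

# Observed cross-validated AUC versus AUCs of models refit on permuted labels.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000),
    X, y,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_permutations=200,
)
print(f"AUC = {score:.3f}, permutation-null mean = {perm_scores.mean():.3f}, p = {p_value:.3f}")
```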
The next part presents a modular pipeline for the discovery of biomarkers from bulk RNA-seq data. We contextualize gene expression measurements using a protein-protein interaction (PPI) network and then use a notion of graph curvature to find (pairs of) genes in the PPI network that could serve as potential biomarkers. Our candidate biomarkers are evaluated using an extensive literature search and transfer learning experiments. We also provide a harmonized collection of drug-specific candidate markers found through rank aggregation that we believe merit further study.
Lastly, we cluster patients in an unsupervised manner using discrete Ollivier-Ricci Flow (ORF). Our method surfaces populations with distinct survival curves, which in turn allows us to find many potential biomarkers, including gene expression modules. We believe the algorithm may be of independent interest for clustering other datasets in a diverse set of research areas.
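For readers unfamiliar with the curvature notion used in the two chapters above, the sketch below computes the Ollivier-Ricci curvature of a single edge of a toy graph: kappa(x, y) = 1 - W1(mu_x, mu_y) / d(x, y), where mu_v is the uniform distribution over the neighbors of v and W1 is the Wasserstein-1 distance under shortest-path costs. It assumes networkx and scipy and only illustrates the definition, not the thesis pipelines built on PPI networks and patient data.

```python
import networkx as nx
import numpy as np
from scipy.optimize import linprog

def wasserstein1(a, b, cost):
    """Solve the discrete optimal-transport LP between histograms a and b."""
    n, m = len(a), len(b)
    c = cost.reshape(-1)                      # flattened transport plan, objective <cost, P>
    A_rows = np.zeros((n, n * m))             # row sums of P must equal a
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    A_cols = np.zeros((m, n * m))             # column sums of P must equal b
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.fun

def ollivier_ricci_curvature(G, x, y):
    """kappa(x, y) = 1 - W1(mu_x, mu_y) / d(x, y) with uniform neighbor distributions."""
    Nx, Ny = list(G.neighbors(x)), list(G.neighbors(y))
    mu_x = np.full(len(Nx), 1.0 / len(Nx))
    mu_y = np.full(len(Ny), 1.0 / len(Ny))
    dist = dict(nx.all_pairs_shortest_path_length(G))
    cost = np.array([[dist[u][v] for v in Ny] for u in Nx], dtype=float)
    return 1.0 - wasserstein1(mu_x, mu_y, cost) / dist[x][y]

G = nx.karate_club_graph()                    # toy graph standing in for a PPI network
print(ollivier_ricci_curvature(G, 0, 1))
```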
As a result of the work presented here, we have provided novel algorithmic techniques for analyzing (biological) data and advanced the state of the art in finding biomarkers for ICI therapy.
Ph.D. Thesis
2025
Title: Fair and Explainable Machine Learning: Estimating Bias, Detecting Disparities, and Designing for Algorithmic Recourse
Candidate: Boxer, Kate
Advisor(s): Daniel Neill
Abstract:
This dissertation investigates algorithmic bias and explainability from the perspective of an individual's interactions with computational models that have an impact on their circumstances, including those influencing their environmental conditions and those used during institutional decision-making. Accordingly, this dissertation focuses on three subtopics within this broad field: estimating data bias in datasets that inform policy decisions, auditing for predictive bias, and multi-objective formulations for systems that provide algorithmic recourse.
In relation to estimating data bias in datasets utilized to inform governmental resource allocation, we introduce two methods—a novel grouping algorithm for statistical significance testing and a custom latent variable model—to detect under-reporting in citizen-generated data. This introduces a domain-specific framework that is instrumental for practitioners interested in making data-informed policy decisions using self-reported data collected from populations located in urban settings. To audit for predictive bias, we introduce a domain- and model-agnostic framework for detecting statistically significant predictive biases in model outputs affecting both marginal and intersectional subpopulations of a target population through novel pattern detection methods for subgroup scanning, where predictive biases take the form of group-fairness violations.
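As a concrete, deliberately simplified illustration of auditing predictions for group-level bias, the sketch below scans the marginal subgroups defined by each discrete attribute and scores how far observed outcomes drift from the model's predicted risks under a calibration null. The column names are hypothetical, and this exhaustive marginal scan only stands in for, and is not, the subset-scan methodology developed in the dissertation.

```python
import numpy as np
import pandas as pd

def marginal_bias_scan(df, attrs, y_col="outcome", p_col="predicted_risk"):
    """Z-score each attribute value's subgroup: observed positives vs. calibrated expectation."""
    rows = []
    for attr in attrs:
        for value, grp in df.groupby(attr):
            observed = grp[y_col].sum()
            expected = grp[p_col].sum()                         # mean under calibration null
            var = (grp[p_col] * (1.0 - grp[p_col])).sum()       # Bernoulli variance under null
            z = (observed - expected) / np.sqrt(var) if var > 0 else 0.0
            rows.append({"attribute": attr, "value": value, "n": len(grp),
                         "observed": observed, "expected": expected, "z": z})
    return pd.DataFrame(rows).sort_values("z", key=np.abs, ascending=False)

# Hypothetical usage: scores = marginal_bias_scan(audit_df, attrs=["race", "sex", "age_band"])
# In practice the significance of the top-scoring subgroup would be assessed by
# randomization testing rather than by reading the z-score directly.
```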
Lastly, we propose a set of principles aimed at ensuring that systems that provide algorithmic recourse materially increase individual agency. Based on these principles, we endorse specific design choices to ensure the reliability of recommendations, develop burden-based measurements to assess the accessibility and fairness of these systems, and train algorithmic decision-makers that uphold these principles when used in systems that provide algorithmic recourse.
Collectively, these works represent key methodologies to detect data bias and predictive bias, spanning both context-specific and domain-agnostic settings, and also contribute to an effort to fundamentally shift institutional decision-making to ensure that algorithmic decision-makers are designed in such a way that individuals have means to achieve favorable outcomes.
Ph.D. Thesis
2025
Title: Simple Structures in Neural Networks: On Expressiveness, Optimization and Data Distribution
Candidate: Chen, Lei
Advisor(s): Joan Bruna
Abstract:
In this era of Large Language Models (LLMs) and other giant neural networks, we aim to analyze simplified settings from scratch, as foundational steps towards understanding the functionality of these giant models. We present our understanding from three aspects. On expressive power, we investigate the function class of simplified graph networks, i.e., Graph-Augmented Multi-layer Perceptrons (GA-MLPs), against the classic Graph Neural Networks (GNNs), using graph isomorphism testing and the counting of attributed walks as measures of expressiveness. On optimization, we theoretically study the instabilities that arise from large learning rates in training neural networks, i.e., the Edge of Stability. We investigate the conditions under which the loss landscape contains such unstable training trajectories, especially trajectories oscillating in a low-dimensional subspace. We then leverage this property in simple, yet representative, learning problems in a teacher-student style. On the data distribution of reasoning tasks, we propose a decomposition of next-token prediction into two parts: in-context reasoning and distributional association. We study this decomposition empirically and theoretically in a controlled synthetic setting, and find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Finally, we discuss how this understanding of next-token prediction and feed-forward layers could be applied to some recent developments in LLMs.
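The sketch below illustrates the measurement behind the Edge-of-Stability phenomenon mentioned above: during full-batch gradient descent, the sharpness (top Hessian eigenvalue of the training loss, estimated by power iteration on Hessian-vector products) is tracked against the classical stability threshold 2 / learning rate. The toy network and random data are placeholders, not the settings analyzed in the thesis.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)                         # placeholder regression data
y = torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
lr = 0.05

def sharpness(loss, params, iters=30):
    """Estimate the top Hessian eigenvalue via power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat)
    v /= v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = torch.autograd.grad(flat @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv]).detach()
        eig = float(v @ hv)                      # Rayleigh quotient with unit-norm v
        v = hv / (hv.norm() + 1e-12)
    return eig

params = list(model.parameters())
for step in range(501):
    if step % 100 == 0:
        probe_loss = loss_fn(model(X), y)
        # In the Edge-of-Stability regime, the sharpness rises toward and hovers near 2/lr.
        print(f"step {step:3d}  loss {probe_loss.item():.4f}  "
              f"sharpness {sharpness(probe_loss, params):.2f}  threshold 2/lr = {2 / lr:.2f}")
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():                        # full-batch gradient descent update
        for p, g in zip(params, grads):
            p -= lr * g
```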
Ph.D. Thesis
2025
Title: Understanding Inductive Bias in the Era of Large-Scale Pretraining with Scientific Data
Candidate: Gruver, Nathaniel
Advisor(s): Andrew Wilson
Abstract:
Inductive biases are crucial for machine learning in data-scarce settings, but their optimal role in data-rich regimes remains poorly understood. This thesis challenges the conventional wisdom that strict architectural constraints are necessary for modeling numerical data, particularly in physics and chemistry. Through systematic empirical studies, I demonstrate that data-driven approaches can effectively learn both physical symmetries and broader numerical patterns without explicit architectural constraints. First, I show that transformer models trained with data augmentation can acquire stronger equivariance properties than convolutional neural networks, despite lacking built-in symmetry constraints. Building on this insight, I investigate whether pretrained language models can learn generalizable numerical capabilities from text alone. By studying the behavior of language models in many settings, I demonstrate that text pretraining induces a preference for simple functions that serves as a powerful inductive bias across numerical domains. This emergent bias enables large language models to outperform specialized architectures on benchmark tasks in time series forecasting and 3D structure prediction, achieving state-of-the-art results with minimal task-specific adaptation. However, these benefits do not extend universally: I identify molecular property prediction as a key limitation and trace this failure to fundamental constraints in discrete token representations. This work provides a comprehensive framework for understanding when learned biases can replace architectural constraints in numerical domains, with important implications for model design in scientific machine learning.
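A small sketch of how numerical forecasting can be posed as next-token prediction for a pretrained language model: the history is serialized into plain text, the model's sampled completion is decoded back into numbers, and repeated samples yield a predictive distribution. The fixed-precision formatting below is an illustrative assumption, not the exact serialization scheme studied in the thesis.

```python
# Serialize a numeric series so a pretrained LLM can simply continue the sequence.
def encode_series(values, precision=2, sep=", "):
    """Render each value with fixed precision, e.g. [0.123, 4.5] -> '0.12, 4.50'."""
    return sep.join(f"{v:.{precision}f}" for v in values)

def decode_series(text, sep=", "):
    """Parse a model completion back into floats, stopping at the first malformed token."""
    out = []
    for tok in text.split(sep):
        try:
            out.append(float(tok))
        except ValueError:
            break
    return out

history = [0.31, 0.35, 0.40, 0.44, 0.49]
prompt = encode_series(history) + ", "
# The prompt would be passed to a pretrained language model; its completion is then
# decoded back into numbers, and repeated samples give a forecast distribution.
print(prompt)                        # "0.31, 0.35, 0.40, 0.44, 0.49, "
print(decode_series("0.53, 0.58"))   # [0.53, 0.58]
```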
Ph.D. Thesis
2025
Title: Computational Shape Design through Robust Physics Simulations
Candidate: Huang, Zizhou
Advisor(s): Denis Zorin, Daniele Panozzo
Abstract:
Additive manufacturing enables the fabrication of complex geometric structures tailored to specific material properties, with diverse applications ranging from lightweight yet strong aerospace components to customized shoe soles, prosthetic devices, and flexible robotic parts. However, the geometric complexity of these structures calls for novel techniques for engineering analysis and optimization. Our research seeks to address these problems by developing robust and accurate physics simulation methods that can enhance the design process for complex structures.
This thesis introduces a physics-based simulation method for elastodynamics, incorporating collisions and friction, that resolves the artifacts in the state-of-the-art method and provides better robustness and efficiency. Further, the simulator is extended to support differentiability with respect to input physics parameters, enabling gradient-based inverse optimization applications such as optimal shape design and material inference. Specifically, we investigate the desired force response of shock-absorbing materials and leverage our differentiable simulator for shape optimization to achieve the desired behavior. The resulting microstructures are fabricated and validated through real-world experiments, demonstrating the accuracy and practical applicability of the proposed simulation framework.
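As a toy stand-in for the differentiable-simulation workflow described above, the sketch below integrates a damped spring-mass system with semi-implicit Euler steps in PyTorch and optimizes the spring stiffness by gradient descent so that the peak force matches a target. The physics, parameters, and target are placeholders; the thesis works with full elastodynamic simulation including contact and friction.

```python
import torch

def simulate_peak_force(stiffness, steps=2000, dt=1e-3, mass=1.0, damping=0.4, v0=-2.0):
    """Drop a unit mass onto a spring-damper and return the peak force magnitude."""
    x = torch.tensor(0.0)
    v = torch.tensor(v0)
    peak = torch.tensor(0.0)
    for _ in range(steps):
        force = -stiffness * x - damping * v          # spring + damper force on the mass
        v = v + dt * force / mass                     # semi-implicit Euler update
        x = x + dt * v
        peak = torch.maximum(peak, force.abs())
    return peak

target_force = torch.tensor(3.0)                      # desired peak force response
log_k = torch.tensor(3.0, requires_grad=True)         # optimize log-stiffness for positivity
opt = torch.optim.Adam([log_k], lr=0.05)
for it in range(150):
    opt.zero_grad()
    peak = simulate_peak_force(log_k.exp())
    loss = (peak - target_force) ** 2
    loss.backward()                                   # gradients flow through the time stepping
    opt.step()
    if it % 50 == 0:
        print(f"iter {it:3d}  stiffness {log_k.exp().item():6.2f}  peak force {peak.item():.2f}")
print(f"optimized stiffness: {log_k.exp().item():.2f} (target peak {target_force.item():.1f})")
```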
Ph.D. Thesis
2025
Title: Understanding and Mitigating Goal Misgeneralization in Language Models
Candidate: Joshi, Nitish
Abstract:
As Large Language Models (LLMs) are being widely used in various applications, it is critical that they are robust and generalize well. One reason LLMs might perform poorly after deployment is goal misgeneralization: the LLM performs well on the training distribution (e.g., achieving high accuracy or reward) but performs poorly on the test distribution. Specifically, misgeneralization implies that the model fails systematically on the test distribution because it has learned unintended functions, as opposed to behaving randomly or lacking the capability to do well on the test distribution. This encapsulates various problems that the machine learning community has worked on, including spurious correlations, underspecification, and reward hacking.
This dissertation focuses on goal misgeneralization in language models and consists of the following components. (1) When finetuning language models, if explicit knowledge of the spurious correlation that the model relies on is available, mitigating it is not too hard. We propose a new method to mitigate spurious correlations when such knowledge is not available; our method relies on complementary knowledge obtained through semantic corruptions. We empirically demonstrate that our method outperforms standard training methods. (2) For methods that do rely on knowledge of semantics to mitigate spurious correlations, robust semantic features can be discovered at scale through crowdsourcing, as in counterfactual data augmentation. We critically analyze the discrepancy between theory and practice for this training method, which in practice seems to give marginal to no benefit. We show that this occurs due to the difficulty of obtaining diverse counterfactuals, and that this lack of diversity can even exacerbate spurious correlations. (3) We take a step back and ask: can a single mitigation method handle any spurious correlation encountered in language data? We argue that there are two main sources of spurious correlations in language data, and that methods to mitigate and evaluate spurious correlations might not work well for both. In one, the feature is irrelevant to the label (e.g., extra spaces); in the other, the feature's effect on the label depends on the context (e.g., negation). We formalize this distinction using causal models and demonstrate empirically why the distinction is necessary. (4) We discuss other goal misgeneralization issues beyond spurious correlations in finetuning. First, we demonstrate how goal misgeneralization can occur during pretraining: focusing on causal reasoning, we show that language models learn an unintended position bias and the post hoc fallacy from the pretraining data, and that scaling language models alone does not address this misgeneralization. Next, we show that underspecification in in-context learning is also an instance of goal misgeneralization, and we characterize the feature preferences of language models in this setting.
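The toy experiment below illustrates the kind of failure discussed above: a synthetic sentiment task in which a marker token ("!!") tracks the label perfectly during training but is flipped at test time, while the content words are only weakly predictive. A bag-of-words classifier that leans on the marker typically degrades sharply under the shift. Synthetic data and a linear model, purely for illustration; this is not the dissertation's setup or method.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

random.seed(0)
POS = ["great movie", "loved the acting", "wonderful plot"]
NEG = ["terrible movie", "hated the acting", "boring plot"]

def sample(n, marker_follows_label):
    texts, labels = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        # Content words agree with the label only 75% of the time ...
        correct = random.random() < 0.75
        source = (POS if label else NEG) if correct else (NEG if label else POS)
        text = random.choice(source)
        # ... while the "!!" marker tracks the label perfectly in training
        # and is attached to the opposite class at test time.
        if (label == 1) == marker_follows_label:
            text += " !!"
        texts.append(text)
        labels.append(label)
    return texts, labels

train_X, train_y = sample(2000, marker_follows_label=True)
test_X, test_y = sample(500, marker_follows_label=False)
clf = make_pipeline(CountVectorizer(token_pattern=r"[^\s]+"), LogisticRegression())
clf.fit(train_X, train_y)
print("train accuracy:", clf.score(train_X, train_y))
print("test accuracy under the shifted correlation:", clf.score(test_X, test_y))
```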
Finally, we discuss future directions focusing on other goal misgeneralization issues in language models, including goal misgeneralization in the context of safety for LLM agents and reward hacking during reinforcement learning in language models.
Ph.D. Thesis
2025
Title: Governing the Scientific Journals: What Big Data and Computational Modeling Tell Us about the Policies That Shape Editorial Boards
Candidate: Liu, Fengyuan "Michael"
Advisor(s): Talal Rahwan
Abstract:
Academic journal editors are the gatekeepers of science, collectively shaping the content of scientific publications and setting standards for their fields of research. Yet, most editors take on this role as a form of community service while maintaining their primary careers as research-active scientists. This dual role raises two key questions at the heart of this thesis: (1) To what extent are editors representative of scientists at large in terms of their demographic composition? (2) How prevalent are conflicts of interest among academic editors? To address these questions, I construct two large, novel longitudinal datasets of academic editors and provide quantitative evidence on both fronts. Furthermore, these datasets enable me to evaluate the impact of policy interventions designed to (1) increase editorial board diversity and (2) mitigate conflicts of interest. By leveraging natural experiments identified in historical archives of journal policy documents, I analyze cases where such policies have been implemented and evaluate their effectiveness. Finally, I discuss the broader implications of big data and computational modeling for quantitative policy research.
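One standard way to exploit a natural experiment such as a journal adopting a new editorial policy is a difference-in-differences comparison of adopting and non-adopting journals before and after the change. The sketch below simulates a small journal-year panel and recovers a known effect with statsmodels; the variable names and the simulated effect are hypothetical, and this shows only the generic estimation strategy, not the thesis's data or models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
journals = np.arange(40)
adopter = journals < 20                          # first 20 journals adopt the policy in 2010
rows = []
for j in journals:
    for year in range(2000, 2020):
        post = year >= 2010
        effect = 0.08 if (adopter[j] and post) else 0.0   # simulated policy effect
        rows.append({"journal_id": j, "year": year,
                     "adopter": int(adopter[j]), "post_policy": int(post),
                     "board_diversity": 0.15 + 0.005 * (year - 2000) + effect
                                        + rng.normal(0, 0.03)})
panel = pd.DataFrame(rows)

# Difference-in-differences: the interaction coefficient estimates the policy effect,
# with standard errors clustered by journal.
model = smf.ols("board_diversity ~ adopter * post_policy", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["journal_id"]})
print(f"estimated effect: {model.params['adopter:post_policy']:.3f} (simulated truth 0.08)")
```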
Ph.D. Thesis
2025
Title: Machine Learning for Simulations
Candidate: Otness, Karl
Advisor(s): Joan Bruna, Benjamin Peherstorfer
Abstract:
Computational modeling of physical systems is a core task of scientific computing. Machine learning methods can extend traditional approaches to modeling partial differential equations and hold the potential to simplify the modeling process and improve simulation accuracy and performance. In this thesis we explore the use of neural networks to learn the behavior of systems from data. We evaluate the performance-accuracy tradeoffs involved in their use as emulators, and use the insights gained to explore a specific application: learning subgrid parameterizations for climate models. For this task we propose two novel techniques to improve the accuracy and stability of the learned parameterizations, by tailoring architectures to incorporate favorable inductive biases and by augmenting training data to encourage stability.
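Schematically, a learned subgrid parameterization is a map from the resolved coarse-grid state to the unresolved forcing that the coarse model cannot represent. The sketch below trains a small CNN on placeholder tensors, with random rotations and flips as one simple example of building a symmetry-style inductive bias into training; it is a generic illustration and not the architectures or stability-oriented data augmentation proposed in the thesis.

```python
import torch
import torch.nn as nn

coarse_state = torch.randn(512, 2, 32, 32)      # placeholder coarse-grid input fields
subgrid_forcing = torch.randn(512, 2, 32, 32)   # placeholder diagnosed subgrid forcing

net = nn.Sequential(
    nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def augment(x, y):
    """Apply the same random 90-degree rotation and horizontal flip to inputs and targets."""
    k = int(torch.randint(0, 4, ()))
    x, y = torch.rot90(x, k, dims=(2, 3)), torch.rot90(y, k, dims=(2, 3))
    if torch.rand(()) < 0.5:
        x, y = torch.flip(x, dims=(3,)), torch.flip(y, dims=(3,))
    return x, y

for epoch in range(5):
    perm = torch.randperm(coarse_state.shape[0])
    for i in range(0, len(perm), 64):
        xb, yb = augment(coarse_state[perm[i:i + 64]], subgrid_forcing[perm[i:i + 64]])
        loss = nn.functional.mse_loss(net(xb), yb)
        opt.zero_grad(); loss.backward(); opt.step()
```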
Ph.D. Thesis
2025
Title: Language Models at the Scale of Evolution
Candidate: Rives, Alexander
Advisor(s): Rob Fergus, Yann LeCun
Abstract:
I will describe the development of the evolutionary scale modeling (ESM) program, which proposes to solve an inverse problem across evolution to learn the biology of proteins from their sequences at the scale of life. Beginning from the idea that the sequences of proteins contain an image of biology in their patterns, this thesis shows that language models trained on protein sequences spanning the natural diversity of the Earth, by learning to predict which amino acids evolution chooses, develop feature spaces that reflect the immense scope and complexity of protein biology, containing both known and unknown biology. Biological structure and function emerge in the representations of the models. This emergence is shown to occur in direct linkage with improvements in the language modeling of sequences. The representation space has an ordered structure in which proteins are organized according to their underlying biology, and directions correspond to meaningful biological variations. Attention patterns materialize in the neural network that correspond to the folded three-dimensional structure of proteins. The probabilities assigned to amino acids within a given sequence context reflect protein function and predict the effects of mutations. The representations learned by protein language models constitute a general and transferable feature space which supports the discovery and generation of new biology. This has enabled an effort to reveal the structures of hundreds of millions of metagenomic proteins for the first time. The thesis concludes with experimental characterizations of proteins created by language models, which demonstrate that the feature space learned from natural proteins supports generating proteins beyond those in nature.
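The sketch below is a self-contained miniature of the masked-language-modeling setup behind this work: a tiny transformer encoder over the 20 amino-acid tokens predicts a masked position from its sequence context, and the resulting conditional probabilities can score substitutions as a log-odds of mutant versus wild type. Toy scale, untrained weights, and an illustrative sequence only; the actual ESM models are trained on hundreds of millions of natural protein sequences.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = 21
VOCAB = 22          # 20 amino acids + padding + mask

class TinyProteinLM(nn.Module):
    def __init__(self, d=64, layers=2, heads=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):
        pos = torch.arange(tokens.shape[1], device=tokens.device)
        h = self.encoder(self.embed(tokens) + self.pos(pos))
        return self.head(h)                       # per-position logits over the vocabulary

def mutation_score(model, seq, site, mutant):
    """log P(mutant | context) - log P(wild type | context) at a masked site."""
    tokens = torch.tensor([[AMINO_ACIDS.index(a) for a in seq]])
    wild = tokens[0, site].item()
    tokens[0, site] = MASK
    logp = model(tokens)[0, site].log_softmax(-1)
    return (logp[AMINO_ACIDS.index(mutant)] - logp[wild]).item()

model = TinyProteinLM()
# (Masked-token training over a large sequence corpus is omitted in this sketch.)
print(mutation_score(model, "MKTAYIAKQR", site=3, mutant="W"))
```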
Ph.D. Thesis
2025
Title: Static Analysis Tools for Network Device Software
Candidate: Ruffy, Fabian
Advisor(s): Anirudh Sivaraman
Abstract:
Networking devices are becoming more programmable. With this trend, network-device software, which is dedicated to forwarding packets and interpreting instructions from the network control plane, now covers more functionality and is also increasing in complexity. Faults in network-device software can have an outsized impact on a network. Hence, network operators and device manufacturers are reaching for static analysis to ensure that this code is both functionally correct and well optimized. Network-device software is extensive and often written in general-purpose languages such as Python or C++. These languages permit loops, aliasing, and indirection, which can make developing effective static analysis techniques challenging.
In this dissertation, we explore an opportunity to build better static analysis tools for network-device software. We use P4, a domain-specific language for network programming, as our foundation. We develop an execution model for P4 that describes the behavior of a network device, and we reify this execution model using satisfiability modulo theories (SMT), expressed in the theory of quantifier-free bit vectors. We refine this execution model through three distinct projects and show its utility by adopting techniques from software engineering research that are theoretically powerful but were considered practically limited for general-purpose languages. Applying our specialized techniques, we were able to find over 50 bugs in network-device software that cause incorrect packet processing. Furthermore, we reuse our model to optimize network programs based on their control-plane configuration, which can reduce resource usage and increase packet-processing performance.
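In the same spirit as the quantifier-free bit-vector encoding described above, the sketch below uses the Z3 Python bindings to compare two made-up packet-processing snippets that decrement an 8-bit TTL, asking the solver for an input on which they disagree. It is a minimal illustration of SMT-based equivalence checking, not the execution model developed in the dissertation.

```python
from z3 import BitVec, BitVecVal, If, Solver, ULE, sat

ttl = BitVec("ttl", 8)

# Reference behavior: packets with TTL 0 or 1 are dropped (encoded here as TTL 0),
# all others have their TTL decremented. ULE is the unsigned bit-vector comparison.
reference = If(ULE(ttl, 1), BitVecVal(0, 8), ttl - 1)

# Buggy "optimized" rewrite: decrement unconditionally, so TTL 0 wraps around to 255.
candidate = ttl - 1

s = Solver()
s.add(reference != candidate)            # search for an input where the two disagree
if s.check() == sat:
    print("counterexample: ttl =", s.model()[ttl])   # ttl = 0 exposes the wraparound
else:
    print("equivalent for all 8-bit TTL values")
```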
Our SMT-based execution model for packet processing is protocol-independent, device-agnostic, and precise enough for bug-finding and program optimization. We attribute these successes to tailoring our model to a DSL specialized in packet processing while also appropriately exploiting the restrictions of this DSL.
Ph.D. Thesis
2025
Title: Mechanisms to Advance the Adoption of Programmable High-speed Packet-Processing Pipelines
Candidate: Wang, Tao
Advisor(s): Anirudh Sivaraman, Aurojit Panda
Abstract:
Today's programmable high-speed packet-processing pipelines have enabled a wide range of network offloads, e.g., in-network telemetry and parameter aggregation in machine learning. However, these pipelines are not yet ready for the larger number of people and applications that could benefit from them.
This dissertation examines this problem from two specific angles, multitenancy and general L7 processing, and argues that new hardware primitives, together with software toolchains, are necessary to bring high-speed packet-processing pipelines to wider adoption among application developers. Specifically, we propose two systems: (1) Menshen designs isolation mechanisms to support multiple programs running atop a single pipeline without interfering with each other; (2) QingNiao targets L7 dispatch, a form of L7 processing that is pervasive in the networking infrastructure layer, and presents a holistic solution based on new hardware primitives and a programming model that supports running such L7 processing on programmable pipelines.
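For readers unfamiliar with the term, "L7 dispatch" means steering a request using application-layer fields (for example the HTTP host and path) rather than packet headers alone. The sketch below shows the software equivalent of that logic on a toy routing table; QingNiao's contribution is making this kind of processing run on high-speed hardware pipelines, which this illustration does not attempt to model.

```python
def l7_dispatch(raw_request: bytes, routes: dict) -> str:
    """Pick a backend pool from the request line and Host header of an HTTP request."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    method, path, _ = lines[0].split(" ", 2)
    headers = dict(line.split(": ", 1) for line in lines[1:] if ": " in line)
    host, path = headers.get("Host", ""), path.split("?")[0]
    # Longest-prefix match on the path within the host's routing table.
    table = routes.get(host, {})
    best = max((p for p in table if path.startswith(p)), key=len, default=None)
    return table.get(best, "default-pool")

routes = {"api.example.com": {"/v1/search": "search-pool", "/v1": "api-pool"}}
req = b"GET /v1/search?q=p4 HTTP/1.1\r\nHost: api.example.com\r\n\r\n"
print(l7_dispatch(req, routes))   # -> "search-pool"
```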
Ph.D. Thesis
2025
Title: An Explicit Certified Method for Path Planning Problem of an SE(3) Robot
Candidate: Zhang, Zhaoqi
Advisor(s): Chee Yap
Abstract:
The design and implementation of theoretically sound robot motion planning algorithms is challenging, especially for robots with high degrees of freedom (DOF). This thesis presents an explicit, practical and certified path planner for a rigid spatial robot with 6 DOFs. The robot is a spatial triangle moving amidst polyhedral obstacles. Correct, complete and practical path planners for such a robot have never been achieved; this is widely recognized as a key challenge in robotics. We design such a planner using the Soft Subdivision Search (SSS) framework, based on the twin foundations of ε-exactness and soft predicates. This SSS planner is a theoretical alternative to the standard exact algorithms, and provides much stronger guarantees than probabilistic or sampling algorithms.
In this thesis, we address the technical challenges posed by the SE(3) robot. First, we establish the foundational theory of the SSS framework by proving a general form of the Fundamental Theorem of SSS. Second, we introduce a topologically correct data structure for non-Euclidean path planning in the SE(3) space. Third, we analyze the distortion bound of the SE(3) representation. Fourth, we design an approximate footprint and combine it with the highly efficient feature-set technique, which leads to its soft predicate. Finally, we explicitly design the geometric primitives so as to avoid using a general solver for polynomial systems, which allows a direct implementation. These contributions represent a robust, practical, and adaptable solution to robot motion planning.
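A drastically simplified 2D analogue of the soft-predicate idea: for a disc robot among point obstacles, a box of configurations is classified FREE if even its farthest point clears every obstacle, STUCK if even its closest point collides, and MIXED otherwise, in which case it is subdivided. The thesis constructs the corresponding, far harder, predicates for a triangle robot in SE(3); this sketch, with made-up obstacles, only illustrates the subdivision logic.

```python
import math

ROBOT_RADIUS = 0.5
OBSTACLES = [(2.0, 2.0), (5.0, 1.0), (3.5, 4.0)]    # toy point obstacles

def classify(box):
    """Soft predicate on a box ((xmin, xmax), (ymin, ymax)) of robot center positions."""
    (x0, x1), (y0, y1) = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    radius = math.hypot(x1 - x0, y1 - y0) / 2        # half-diagonal bounds the box
    clearance = min(math.hypot(cx - ox, cy - oy) for ox, oy in OBSTACLES)
    if clearance - radius > ROBOT_RADIUS:
        return "FREE"                                 # every center in the box is collision-free
    if clearance + radius < ROBOT_RADIUS:
        return "STUCK"                                # every center in the box collides
    return "MIXED"

def subdivide(box, min_width=0.1):
    """Recursively collect labeled boxes, splitting MIXED boxes until they are tiny."""
    label = classify(box)
    (x0, x1), (y0, y1) = box
    if label != "MIXED" or x1 - x0 < min_width:
        return [(box, label)]
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    children = [((x0, xm), (y0, ym)), ((xm, x1), (y0, ym)),
                ((x0, xm), (ym, y1)), ((xm, x1), (ym, y1))]
    return [leaf for child in children for leaf in subdivide(child, min_width)]

leaves = subdivide(((0.0, 8.0), (0.0, 8.0)))
print(sum(1 for _, lab in leaves if lab == "FREE"), "free boxes found")
```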