Theses & Reports
Instructions for submitting a technical report or thesis.
You can find technical reports published prior to 1990 archived here.
-
Ph.D. Thesis
2025
On The Applications of Coarse Network Geometry to Personalized Immuno-Oncology
Bannon, James
Abstract
|
PDF
Title: On The Applications of Coarse Network Geometry to Personalized Immuno-Oncology
Candidate: Bannon, James
Advisor(s): Bud Mishra
Abstract:
Immune checkpoint inhibitors (ICIs), also called immune checkpoint blockers, are a promising category of targeted therapy for solid tumors. Predicting which patients will respond to ICI therapy remains an open problem under active investigation. This thesis aims to improve the precision with which immune checkpoint inhibitors are prescribed. By focusing on one type of biological measurement --- whole-tumor shotgun RNA sequencing data, which we call \textit{bulk RNA-seq} --- we are able to deeply explore the potential and limits of predictors built from this kind of measurement. Two of the algorithms presented here are based on a notion of graph curvature which we believe has extensive promise in bioinformatic inquiry.
The first part of this thesis performs a rigorous permutation testing evaluation of machine learning models for the task of predicting therapy response which we cast as a binary classification problem. We show that bulk RNA-seq data contains predictive signal but that there is an upper limit to ML model efficacy that can potentially be remedied by the curation of larger data sets or augmenting RNA-seq data with other biological measurements.
The next part presents a modular pipeline for the discovery of biomarkers from bulk RNA-seq data. We contextualize gene expression measurements using a protein-protein interaction (PPI) network and then use a notion of graph curvature to find (pairs of) genes in the PPI network that could serve as potential biomarkers. Our candidate biomarkers are evaluated using an extensive literature search and transfer learning experiments. We also provide a harmonized collection of drug-specific candidate markers found through rank aggregation that we believe merit further study.
Lastly, we cluster patients in an unsupervised manner using discrete Ollivier-Ricci Flow (ORF). Our method surfaces populations with distinct survival curves which in turn allows us to find many potential biomarkers, including gene expression modules. We believe the algorithm may be of independent interest for clustering other datasets in a diverse set of research areas.
As a result of the work here we have provided novel algorithmic techniques for analyzing (biological) data and advanced the state of the art in finding biomarkers for ICI therapy. -
Ph.D. Thesis
2025
Language Models at the Scale of Evolution
Rives, Alexander
Abstract
|
PDF
Title: Language Models at the Scale of Evolution
Candidate: Rives, Alexander
Advisor(s): Rob Fergus, Yann LeCun
Abstract:
I will describe the development of the evolutionary scale modeling (ESM) program, which proposes to solve an inverse problem across evolution to learn the biology of proteins from their sequences at the scale of life. Beginning from the idea that the sequences of proteins contain an image of biology in their patterns, this thesis shows that language models trained on protein sequences spanning the natural diversity of the Earth, by learning to predict which amino acids evolution chooses, develop feature spaces that reflect the immense scope and complexity of protein biology containing known and unknown biology. Biological structure and function emerge in the representations of the models. This emergence is shown to occur in a direct linkage with improvements in the language modeling of sequences. The representation space has an ordered structure in which proteins are organized according to their underlying biology, and directions correspond to meaningful biological variations. Attention patterns materialize in the neural network that correspond to the folded three-dimensional structure of proteins. The probabilities assigned to amino acids within a given sequence context, reflect protein function and predict the effects of mutations. The representations learned by protein language models constitute a general and transferable feature space which supports the discovery and generation of new biology. This has enabled an effort to reveal the structures of hundreds of millions of metagenomic proteins for the first time. The thesis concludes with experimental characterizations of proteins created by language models, which demonstrate that the feature space learned from natural proteins supports generating proteins beyond those in nature.
-
Ph.D. Thesis
2025
An Explicit Certified Method for Path Planning Problem of an SE(3) Robot
Zhang, Zhaoqi
Abstract
|
PDF
Title: An Explicit Certified Method for Path Planning Problem of an SE(3) Robot
Candidate: Zhang, Zhaoqi
Advisor(s): Chee Yap
Abstract:
The design and implementation of theoretically-sound robot motion planning algorithms is challenging, especially for robots with high degrees of freedom (DOF). This thesis presents an explicit, practical and certified path planner for a rigid spatial robot with 6 DOFs. The robot is a spatial triangle moving amidst polyhedral obstacles. Correct, complete and practical path planners for such a robot has never been achieved. It is widely recognized as a key challenge in robotics. We design such a planner by using the Soft Subdivision Search (SSS) framework, based on the twin foundations of ε-exactness and soft predicates. This SSS planner is a theoretical alternative to the standard exact algorithms, and provides much stronger guarantees than probabilistic or sampling algorithms.
In this thesis, we address technical challenges for the SE(3) robot. First, we establish the foundational theory of SSS framework by proving a general form of the Fundamental Theorem of SSS. Second, we introduce a topologically correct data structure for non-Euclidean path planning in the SE(3) space. Third, we analyze the distortion bound of the SE(3) representation. Fourth, we design an approximate footprint and combine it with the highly efficient feature set technique which leads to its soft predicate. Finally, we explicitly design the geometric primitives to avoid using a general solver of a polynomial system. This allows a direct implementation. These contributions represent a robust, practical, and adaptable solution to robot motion planning.