Structural Foundations of Efficient Reinforcement Learning
Speaker: Alekh Agarwal, Microsoft Research Redmond
Date: March 23, 2021, 9 a.m.
Host: Andrew Wilson
The design of learning agents which observe, interact with, and manipulate their environment to optimize desirable behaviors is a long-standing goal in machine learning, with roots in artificial intelligence, adaptive experimental design and adaptive feedback control. In machine learning, these questions are typically studied in the area of reinforcement learning (RL), which has seen a recent surge of interest both due to potential applications, from robotics and autonomous systems to healthcare and recommendations; as well as popular successes such as superhuman performance in games and dexterous manipulation of a Rubik’s cube.
On the theoretical front, foundations for sample efficient learning were laid in the early 2000’s, for problems where the agent perceives the environment through relatively simple observations. However, these settings fail to capture the complex sensorimotor observation streams which most application domains naturally contain. In this talk, I will describe a research program aimed at addressing this critical gap in our theoretical understanding.
I will begin by highlighting some key challenges faced by an RL agent, and the importance of understanding the structure of real-world applications to address these challenges. With this aim, I will then introduce a complexity measure, called Bellman rank, for general RL problems. Crucially, many application domains naturally exhibit a small Bellman rank, and I will describe how low Bellman rank enables sample efficient RL. Bellman rank remains one of the most general ways of measuring the complexity of RL problems, however, computationally practical algorithms for all problems with a small Bellman rank still elude us.
The second part of the talk will focus on algorithmic questions, designing optimization-based methods for solving a subclass of problems with a small Bellman rank. I will present an algorithm Policy Cover Policy Gradient (PC-PG), which comes with strong practical guarantees when the problem dynamics obey a certain linear structure. The algorithm is highly practical, and easily composes with modern deep learning libraries for an efficient implementation. I will confirm its effectiveness beyond the confines of the theoretical assumptions in empirical evaluation against popular baselines.
Finally, I will conclude with a brief synopsis of work I have done on Contextual Bandits, a much smaller, yet practically useful subclass of RL. The research here has led to the design and creation of a general purpose cloud service at Microsoft, which powers many applications of Contextual Bandits both inside and outside the company.
The first two parts of the talk are based on the papers https://arxiv.org/abs/1610.09512 and https://arxiv.org/pdf/2007.08459.pdf respectively.
Alekh Agarwal is a Principal Research Manager at Microsoft Research Redmond, where he leads the group on Reinforcement Learning. He has been at Microsoft Research since completing his PhD from UC Berkeley in 2012, spending six years in the New York City lab, before moving to Redmond. Alekh's research has spanned many areas of machine learning including large-scale and distributed optimization, high-dimensional statistics, online learning, and most recently reinforcement learning. He focuses on designing theoretically sound methods which lend themselves to practice, and his work at Microsoft has resulted in the creation of a new Azure service (https://aka.ms/personalizer) that operationalizes some of his reinforcement learning research. His work has been recognized with a NeurIPS best paper award and an ACM SIGAI Industry Impact award for his work on the Azure Personalization Service. His dissertation research was supported by Microsoft and Google PhD fellowships.