Speaker: Professor Haryadi Gunawi, University of Chicago
Location: 60 Fifth Avenue C15
Date: Sept. 13, 2017, 3 p.m.
As more data and computation move from local to cloud environments, datacenter distributed systems have become a dominant backbone for many modern applications. However, the complexity of cloud-scale hardware and software ecosystems has outpaced existing testing, debugging, and verification tools.
I will describe three new classes of bugs that often appear in large-scale datacenter distributed systems: (1) distributed concurrency bugs, caused by non-deterministic timings of distributed events such as message arrivals as well as multiple crashes and reboots; (2) limpware-induced performance bugs, design bugs that surface in the presence of "limping" hardware and cause cascades of performance failures; and (3) scalability bugs, latent bugs that are scale dependent, typically only surface in large-scale deployments (100+ nodes) but not necessarily in small/medium-scale deployments.
The findings above are based on our long, large-scale cloud bug study (3000+ bugs) and cloud outage study (500+ outages). I will present some of our work in understanding and combating distributed concurrency bugs, mainly focusing on our semantic-aware implementation-level model checking (SAMC) and taxonomy of distributed concurrency bugs (TaxDC). If time permits, I will also briefly discuss limpware and scalability bugs.
Haryadi Gunawi is a Neubauer Family Assistant Professor in the Department of Computer Science at the University of Chicago where he leads the UCARE research group (UChicago systems research on Availability, Reliability, and Efficiency). He received his Ph.D. in Computer Science from the University of Wisconsin, Madison in 2009. He was a postdoctoral fellow at the University of California, Berkeley from 2010 to 2012. His current research focuses on cloud computing reliability and new storage technology. He has won numerous awards including NSF CAREER award, NSF Computing Innovation Fellowship, Google Faculty Research Award, NetApp Faculty Fellowships, and Honorable Mention for the 2009 ACM Doctoral Dissertation Award.
His research focus is in improving dependability of storage and cloud computing systems in the context of (1) reliability, wherein he is interested in combating non-deterministic concurrency bugs in cloud-scale distributed systems, (2) performance stability, wherein he is interested in building storage and distributed systems that are robust to "limping" hardware, and (3) scalability, wherein he is interested in developing approaches to find latent scalability bugs that only appear in large-scale deployments.