Computer Science Colloquium

A Reboot-based Approach to High Availability

George Candea
Stanford

Friday, April 29, 2005 11:30 A.M.
Room 1302 Warren Weaver Hall
251 Mercer Street
New York, NY 10012-1185

Directions: http://cs.nyu.edu/csweb/Location/directions.html
Colloquium Information: http://cs.nyu.edu/csweb/Calendar/colloquium/index.html

Hosts:

Richard Cole cole@cs.nyu.edu, (212) 998-3119

Abstract

Application-level software failures are a dominant cause of outages in large-scale software systems, such as e-commerce, banking, or Internet services. The exact root cause of these failures is often unknown and the only cure is to reboot. Unfortunately, rebooting can be expensive, leading to nontrivial service disruption or downtime even when clusters and failover are employed.

In this talk I will describe the "crash-only design," a way to build reboot-friendly systems. I will also present the "microreboot," a technique for surgically recovering faulty application components without disturbing the rest. I will argue quantitatively that recovery-oriented techniques complement bug-reduction efforts and provide significant improvements in software dependability. We applied the crash-only design and microreboot technique to a satellite ground station and an Internet auction system. Without fixing any bugs, microrebooting recovered from most of the same failures as process restarts, but did so more than an order of magnitude faster and with an order of magnitude savings in lost work.

Simple, cheap recovery engenders a new way of thinking about failure management. First, we can prophylactically microreboot to rejuvenate a software system by parts; this averts failures induced by software aging, without ever having to bring the system down. Second, we can mask failure and recovery from end users through transparent call-level retries, turning failures into human-tolerable sub-second latency blips. Finally, having made recovery very cheap, it makes sense to microreboot at the slightest hint of failure -- if the microreboot is indeed necessary, we speed up recovery; if not, the impact is negligible. As a result, we productively employed failure detection based on statistical learning, which reduces false negatives at the cost of more frequent false positives. We also closed the monitor-diagnose-recover loop and built an autonomously recovering Internet service, exhibiting orders of magnitude higher availability than previously possible.
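As a rough illustration of the call-level retry idea described above, the following Python sketch microreboots a single faulty component and retries the failed call. It is a toy model, not the speaker's implementation: the names Component, ComponentFailure, and call_with_retry, the healthy flag, and the retry parameters are all hypothetical and chosen only for this example.

```python
import time


class ComponentFailure(Exception):
    """Hypothetical signal that a component could not serve a call."""


class Component:
    """A restartable component; assumed to keep no critical state in memory."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.healthy = True
        self.reboots = 0

    def call(self, request):
        if not self.healthy:
            raise ComponentFailure(self.name)
        return self.handler(request)

    def microreboot(self):
        # Reinitialize only this component; other components are untouched.
        self.healthy = True
        self.reboots += 1


def call_with_retry(component, request, max_retries=2, delay_s=0.0):
    """Microreboot the component and retry when a call fails."""
    for attempt in range(max_retries + 1):
        try:
            return component.call(request)
        except ComponentFailure:
            if attempt == max_retries:
                raise  # give up; a coarser recovery action would go here
            component.microreboot()
            time.sleep(delay_s)


# Usage: the caller sees a slightly slower call rather than an error.
bids = Component("bids", lambda request: request.upper())
bids.healthy = False  # simulate a fault
result = call_with_retry(bids, "place bid")
```

In this sketch the reboot is triggered by the first failed call, which mirrors the idea that a cheap recovery action can be attempted at the slightest hint of failure.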

Bio

George Candea is a graduating Ph.D. student in the CS department at Stanford University. He left the software industry to pursue a research agenda focused on improving dependability of large-scale software systems, driven by the belief that software failures are unavoidable. He finds it more productive to minimize recovery time than to try to make software bug-free. George received his B.S. and M.Eng. degrees from MIT, where he added protected OS abstractions to the Xok exokernel and wrote a filesystem for mobile computers. He spent over five years as a senior developer at Oracle Corp, improving scalability of database servers. Previously, he was affiliated with IBM Research and Microsoft Research. George holds a black belt in Shotokan karate and is captain of Stanford's JKA karate team.

