Practical Dependable Systems with OS/Hypervisor Support
UCLA Electronic Theses and Dissertations

Abstract

Critical applications require dependability mechanisms to protect them from failures due to faults. Dependable systems for mainstream deployment are typically built on commodity hardware, with the mechanisms that enhance resilience implemented in software. Such systems aim to provide commercially viable, best-effort dependability cost-effectively.

This thesis proposes several practical, low-overhead dependability mechanisms for critical components in the system: hypervisors, containers, and parallel applications.

For hypervisors, the latency of rebooting a new instance to recover from transient faults is unacceptably high. NiLiHype instead recovers the hypervisor by resetting it to a quiescent state that is highly likely to be valid. Compared to prior work based on reboot, NiLiHype reduces the service interruption time during recovery from 713ms to 22ms, a factor of over 30, while achieving nearly the same recovery success rate.
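The microreset idea can be illustrated with a minimal sketch. Everything here is a conceptual simulation, not NiLiHype's actual code: the class and field names (`Component`, `persistent`, `transient`, `microreset`) are hypothetical. On a detected transient fault, the component discards in-flight work and rolls back to a pre-captured quiescent snapshot instead of rebooting from scratch.

```python
import copy

class Component:
    """Toy component that recovers via microreset instead of reboot.

    On a transient fault, in-flight (transient) state is discarded and
    the component rolls back to a quiescent snapshot that is highly
    likely to be consistent, avoiding a full reboot.
    """

    def __init__(self):
        self.persistent = {"vm_table": {}}   # state that must survive recovery
        self.transient = []                  # in-flight requests, safe to drop
        self._quiescent = copy.deepcopy(self.persistent)

    def checkpoint_quiescent(self):
        # Capture a snapshot at a quiescent point (no request in flight).
        self._quiescent = copy.deepcopy(self.persistent)

    def handle(self, req):
        self.transient.append(req)
        self.persistent["vm_table"][req] = "running"
        self.transient.pop()
        self.checkpoint_quiescent()

    def microreset(self):
        # Recovery: drop in-flight work, restore the quiescent state.
        self.transient.clear()
        self.persistent = copy.deepcopy(self._quiescent)

c = Component()
c.handle("vm1")
c.transient.append("half-processed")   # simulate a fault mid-request
c.microreset()
assert c.persistent["vm_table"] == {"vm1": "running"} and c.transient == []
```

The key property the sketch mirrors is that recovery cost is proportional to restoring a small amount of state rather than re-initializing the whole component.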

NiLiCon is, to the best of our knowledge, the first replication mechanism for commercial off-the-shelf containers. NiLiCon is based on high-frequency incremental checkpointing to a warm spare, a technique previously used for VMs. A key implementation challenge is that, compared to a VM, there is a much tighter coupling between the container state and the state of the underlying platform. NiLiCon meets this challenge with various enhancements and achieves performance that is competitive with VM replication.
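High-frequency incremental checkpointing can be sketched at a conceptual level as follows. This is a simulation under stated assumptions, not NiLiCon's implementation: page-granularity dirty tracking is modeled with an explicit `dirty` set (in a real system the hardware or kernel tracks dirtied pages), and the epoch loop is collapsed to explicit `checkpoint()` calls.

```python
# Sketch of incremental checkpointing to a warm spare: each epoch,
# only pages dirtied since the last checkpoint are copied over.
PAGE = 4096

class Primary:
    def __init__(self, npages):
        self.pages = [bytes(PAGE) for _ in range(npages)]
        self.dirty = set()

    def write(self, page_no, data):
        self.pages[page_no] = data.ljust(PAGE, b"\0")[:PAGE]
        self.dirty.add(page_no)          # hardware dirty bit, in reality

    def checkpoint(self):
        # Ship only the dirty pages; clear the set for the next epoch.
        delta = {n: self.pages[n] for n in self.dirty}
        self.dirty.clear()
        return delta

class WarmSpare:
    def __init__(self, npages):
        self.pages = [bytes(PAGE) for _ in range(npages)]

    def apply(self, delta):
        for n, data in delta.items():
            self.pages[n] = data

primary, spare = Primary(8), WarmSpare(8)
primary.write(3, b"container state")
spare.apply(primary.checkpoint())        # epoch 1: one page transferred
primary.write(3, b"updated")
delta = primary.checkpoint()             # epoch 2: still only page 3
spare.apply(delta)
assert spare.pages[3] == primary.pages[3] and list(delta) == [3]
```

Because only the delta crosses the wire each epoch, checkpointing can run at high frequency; the container-specific difficulty the abstract mentions is that the state to capture extends into the platform (kernel) state, not just the pages above.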

HyCoR enhances NiLiCon with deterministic replay to address a fundamental drawback of high-frequency replication techniques: unacceptably long delays of outputs to clients. With deterministic replay, HyCoR decouples the latency overhead from the checkpointing interval. For a set of eight benchmarks, HyCoR reduces the latency overhead from tens of milliseconds to less than 600µs. For data-race-free applications, the throughput overhead of HyCoR is only 2%-58%.
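The role deterministic replay plays here can be shown with a small simulation (hypothetical names, not HyCoR's code): the primary records its nondeterministic inputs, modeled below as random draws, in a log; the backup replays that log to reproduce the primary's state deterministically. An output then only has to wait until the short log reaches the backup, not until the next full checkpoint, which is why latency decouples from the checkpoint interval.

```python
import random

def run(handler_inputs, rng_log=None):
    """Execute handlers; record nondeterminism when primary, replay when backup."""
    log = [] if rng_log is None else None
    state = 0
    for x in handler_inputs:
        if rng_log is None:              # primary: record nondeterminism
            r = random.randint(0, 9)
            log.append(r)
        else:                            # backup: replay recorded values
            r = rng_log.pop(0)
        state = state * 10 + (x + r) % 10
    return state, log

inputs = [1, 2, 3]
primary_state, log = run(inputs)
backup_state, _ = run(inputs, rng_log=list(log))
assert backup_state == primary_state     # replay reproduces the primary's state
```

Real systems must also record thread-interleaving and system-call outcomes, which is why the abstract qualifies the result for data-race-free applications: an unrecorded racy interleaving would make replay diverge.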

PUSh is a dynamic data race detector based on detecting violations of the intended sharing of objects, specified by the programmer. PUSh leverages existing memory protection hardware to detect such violations. Specifically, a key optimization in PUSh exploits memory protection keys, a hardware feature recently added to the x86 ISA. Several other key optimizations are achieved by enhancing the Linux kernel. For a set of eleven benchmarks, PUSh's memory overhead is less than 5.8% and performance overhead is less than 54%.
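The policy PUSh enforces can be mirrored in a conceptual simulation. PUSh itself enforces intended sharing with memory protection keys in hardware; the pure-Python sketch below (hypothetical `SharedObject` and policy names) only reproduces the policy logic: each object carries a programmer-declared sharing annotation, and an access that violates it is flagged.

```python
import threading

class SharingViolation(Exception):
    pass

class SharedObject:
    """Object annotated with its intended sharing; accesses are checked."""

    def __init__(self, value, policy="private", owner=None):
        self.value = value
        self.policy = policy             # "private" or "read-shared"
        self.owner = owner or threading.get_ident()

    def read(self):
        if self.policy == "private" and threading.get_ident() != self.owner:
            raise SharingViolation("read of another thread's private object")
        return self.value

    def write(self, value):
        if self.policy != "private" or threading.get_ident() != self.owner:
            raise SharingViolation("write outside declared sharing policy")
        self.value = value

obj = SharedObject(0, policy="private")
obj.write(42)                            # owner access: allowed

violation = []
def racy_reader():
    try:
        obj.read()                       # cross-thread access to a private object
    except SharingViolation:
        violation.append(True)

t = threading.Thread(target=racy_reader)
t.start(); t.join()
assert violation == [True]               # the violation is detected
```

In the actual system, the check is free on the common path: the hardware protection keys fault only on a disallowed access, which is what makes the reported overheads low.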
