eScholarship
Open Access Publications from the University of California

Resilient Virtualized Systems

  • Author(s): Le, Michael Vu
  • Advisor(s): Tamir, Yuval
  • et al.
Abstract

System virtualization allows for the consolidation of many physical servers on a single physical host by running the workload of each physical server inside a Virtual Machine (VM). This is facilitated by a set of software components, which we call the Virtualization Infrastructure (VI), responsible for managing and multiplexing physical resources among VMs. While server consolidation using system virtualization can greatly improve the utilization of resources, reliability becomes a major concern, as failure of the VI due to hardware or software faults can result in the failure of all VMs running on the system.

The focus of this dissertation is on the design and implementation of mechanisms that enhance the resiliency of the virtualized system by enhancing the resiliency of the VI to transient hardware and software faults. Given that the use of hardware redundancy can be costly, one of the main goals of this work is to achieve high reliability using purely software-based techniques.

The main approach used in this work for providing resiliency to VI failures is to partition the VI into subcomponents and provide mechanisms to detect and recover each failed VI component transparently to the running VMs. These resiliency mechanisms are developed incrementally, using results from fault injection to identify dangerous state corruptions and inconsistencies between the recovered and existing components in the system.

A prototype containing the mechanisms proposed in this dissertation is implemented on top of the widely used Xen virtualized system. In this prototype, three different recovery mechanisms are developed, one for each Xen VI component: the virtual machine monitor (VMM), the driver VM (DVM), and the privileged VM (PrivVM). With the proposed resiliency mechanisms, applications can continue to correctly provide services in over 86% of detected VMM failures and in over 96% of detected DVM and PrivVM failures.

The proposed mechanisms require no modifications to applications running in the VMs and only minimal modifications to the VI. These mechanisms are lightweight and can operate with minimal CPU and memory overhead during normal system operations. The mechanisms in this work do not rely on redundant hardware but can make use of redundant resources to achieve, in many instances, sub-millisecond recovery latency.
