Fault tolerance in super-scalar and VLIW processors
In this paper, we present a method for utilizing the spare capacity in super-scalar and very long instruction word (VLIW) processors to tolerate functional unit failures. Unlike previous work that was primarily interested in detection of transient faults, we are concerned with more permanent and/or intermittent faults which necessitate processor reconfiguration. Our method utilizes the VLIW compiler or the superscalar scheduler to insert redundant operations whenever idle functional units exist. The results of these redundant operations are used to detect and diagnose functional unit failures. For super-scalar processors, the scheduler can then utilize this information to ensure that operations are performed only on non-faulty units. In VLIW processors, this is equivalent to recompiling the code to run on the remaining non-faulty functional units. Since in certain applications, recompilation may not be possible, we consider two alternative reconfiguration strategies for VLIW processors. These strategies sacrifice storage space and execution time, respectively, in order to reconfigure without recompiling. We present Markov models that describe the behavior of processors using these different approaches and we evaluate their reliabilities. The results show that, while super-scalar and VLIW with recompilation provide the highest reliability, all proposed strategies significantly increase reliability over that of an unprotected processor.