This thesis addresses the issue of fault tolerance in the Internet of Things (IoT). The goal of fault tolerance in IoT is to adapt better to changing environments and to build up trustworthy redundancy. However, in real IoT deployment scenarios such as smart homes or offices, the heterogeneity and constant evolution of IoT systems make it challenging to build up redundancy and adapt to a changing environment. First, heterogeneous devices are deployed in the environment with limited duplication, which makes it difficult to find redundant devices in the first place. Second, as the environment changes, devices deployed with different purposes and capabilities need to collaborate with one another and to be shareable among different applications with quality-of-service (QoS) requirements, which complicates the management of IoT applications and devices. Third, to achieve failure resilience on heterogeneous devices, an evolving yet lightweight dynamic binding mechanism should be designed; this mechanism is the basis for supporting the previous two points.
In this dissertation, we propose to address these issues from a service-oriented point of view. Service-Oriented Architecture (SOA) provides IoT with an abstraction of services that can be integrated and managed. We have designed an IoT middleware that facilitates cooperation among different devices to achieve cross-modality fault tolerance. When a device fails, the middleware can reconfigure the system by using devices of other modalities to cover the fault. The three problems above are addressed in three stages of service management: service discovery, service mapping, and service execution.
For service discovery, this thesis presents a sensing device adaptation scheme that makes more services available for composition. In IoT, sensors of different modalities may be used to enhance system fault tolerance. We propose the concept of virtual services, which use data from other sensor devices to replace an actual service on a faulty device. We use regression analysis to identify and generate virtual services from the available sensors. Depending on the type of correlation between sensors, we use either recursive least squares (RLS) or multivariate adaptive regression splines (MARS) for virtual service generation. These virtual services provide more choices of backup services without deploying duplicate backup sensors.
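To make the idea of an RLS-based virtual service concrete, the following is a minimal sketch rather than the implementation used in this thesis. It assumes a linear relation between a target sensor and two neighboring sensors; the class name, the forgetting factor, the initial covariance scale `delta`, and the synthetic readings are all hypothetical choices made for illustration.

```python
class RecursiveLeastSquares:
    """Textbook RLS for a linear model y ~ w . x (illustrative sketch only)."""

    def __init__(self, n_features, forgetting=1.0, delta=1e4):
        self.n = n_features
        self.lam = forgetting            # 1.0 means no forgetting of old samples
        self.w = [0.0] * n_features      # current weight estimate
        # Large initial covariance: little confidence in the initial weights.
        self.P = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        n = self.n
        # Px = P x (P stays symmetric, so x^T P is the transpose of P x).
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)        # a priori prediction error
        self.w = [wi + gi * err for wi, gi in zip(self.w, gain)]
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam
                   for j in range(n)] for i in range(n)]
        return err


def train_virtual_service(samples, n_features):
    """Fit a virtual service from (neighbor_readings, target_reading) pairs."""
    model = RecursiveLeastSquares(n_features)
    for x, y in samples:
        model.update(x, y)
    return model


# Hypothetical data: the target sensor follows 2*a + 0.5*b + 1, where a and b
# are readings of two other sensors expressed as deviations from a baseline.
samples = []
for t in range(50):
    a = float(t % 7 - 3)
    b = float((3 * t) % 11 - 5)
    samples.append(([a, b, 1.0], 2.0 * a + 0.5 * b + 1.0))

virtual_service = train_virtual_service(samples, n_features=3)
```

Once trained, `virtual_service.predict([a, b, 1.0])` can stand in for the target sensor when that device fails. Because each update touches only the weight vector and a small covariance matrix, the model can keep adapting as new readings arrive.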
For service mapping, we separate the process into two phases: phase 1, pre-runtime mapping for the functionality of the application, and phase 2, runtime mapping for fault tolerance. We model pre-runtime mapping as a quadratic integer programming problem. Location policies are used to specify user preferences during this mapping and to limit the size of the programming problem. For phase 2 mapping, given the abundant supply of virtual services, we model the problem as a multiobjective optimization problem and solve it with a multiobjective genetic algorithm, NSGA-II. As more sensor data arrive from the network, virtual services are updated, and phase 2 mapping is triggered periodically in order to adapt to the changed environment.
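The core notion behind a multiobjective formulation is Pareto dominance, which NSGA-II uses to rank candidate solutions. The sketch below shows only that dominance filter, not NSGA-II itself or the formulation used in this thesis. The two objectives (estimation error and energy cost, both minimized) and the candidate mappings are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and strictly
    better in at least one objective (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [name for name, obj in candidates.items()
            if not any(dominates(other, obj)
                       for other_name, other in candidates.items()
                       if other_name != name)]


# Hypothetical candidate mappings: (estimation error, energy cost).
candidates = {
    "mapping_A": (0.1, 5.0),
    "mapping_B": (0.2, 3.0),
    "mapping_C": (0.3, 4.0),   # dominated by mapping_B
    "mapping_D": (0.1, 6.0),   # dominated by mapping_A
}
front = pareto_front(candidates)
```

Here `front` contains the mappings that represent genuine trade-offs. A genetic algorithm such as NSGA-II evolves a population toward this front instead of collapsing the objectives into a single weighted score.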
For service execution, we set up hierarchical monitoring of service status. We investigate the issue of device clustering for fault monitoring in IoT systems. To detect device faults quickly, fault monitoring must be conducted regularly and frequently, so it is desirable to reduce its communication cost. We model the monitoring clustering problem as a multiple traveling salesman problem (mTSP) without a depot, and define it by extending the mTSP in an integer programming (IP) formulation. We also present heuristic algorithms for constructing both the monitoring clusters and the monitoring route within each cluster. Simulation results show that our heuristic algorithms deliver near-optimal solutions for reducing the communication cost, with low complexity.
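As an illustration of how a monitoring route within one cluster might be built, the sketch below applies a generic nearest-neighbor heuristic. It is not the heuristic proposed in this thesis. It assumes that the route is a closed cycle with no depot, that communication cost is proportional to Euclidean distance, and that device positions are known; the device names and coordinates are hypothetical.

```python
import math


def route_cost(route, pos):
    """Total length of the closed monitoring cycle that visits the devices
    in the given order and returns to the first one."""
    return sum(math.dist(pos[route[i]], pos[route[(i + 1) % len(route)]])
               for i in range(len(route)))


def nearest_neighbor_route(cluster, pos):
    """Greedy route: start at the first device, then repeatedly move to the
    closest device that has not been visited yet."""
    remaining = list(cluster)
    route = [remaining.pop(0)]
    while remaining:
        last = pos[route[-1]]
        nxt = min(remaining, key=lambda d: math.dist(last, pos[d]))
        remaining.remove(nxt)
        route.append(nxt)
    return route


# Hypothetical cluster of four devices placed on a unit square.
positions = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 1.0), "d": (1.0, 0.0)}
cluster = ["a", "c", "b", "d"]
route = nearest_neighbor_route(cluster, positions)
```

Comparing `route_cost(route, positions)` with `route_cost(cluster, positions)` shows how reordering the visits shortens the cycle, which is the quantity a clustering heuristic would try to keep small across all clusters.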
Finally, we provide a detailed design of the fault tolerance framework, which incorporates the above stages together with support from our fault recovery mechanism.