Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep neural networks (DNNs) to execute complex inference tasks such as image classification and speech recognition, among others. However, continuously executing the entire DNN on the mobile device can quickly deplete its battery. Although offloading tasks to cloud/edge servers may decrease the mobile device's computational burden, erratic patterns in channel quality, network load, and edge server load can lead to significant delays in task execution. Recently, splitting the DNN has been proposed to address such problems, where the DNN is divided into two sections to be executed on the mobile device and on the edge server, respectively. However, the gain of naively splitting DNN models is limited, since such approaches collapse into either local computing or full offloading unless the DNN models have natural "bottlenecks", i.e., intermediate representations that are significantly smaller than the input data to the models.
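To make the head/tail split concrete, the following is a minimal sketch in PyTorch, assuming a torchvision ResNet-50 and a purely illustrative split point; it shows why naive splitting rarely helps, since the intermediate tensor sent to the server can be larger than the input image itself.

```python
import torch
from torchvision.models import resnet50

# Minimal sketch of naive DNN splitting, assuming a torchvision ResNet-50;
# the split point chosen here is purely illustrative.
model = resnet50(weights=None).eval()
layers = list(model.children())

head = torch.nn.Sequential(*layers[:5])                                     # executed on the mobile device
tail = torch.nn.Sequential(*layers[5:-1], torch.nn.Flatten(), layers[-1])   # executed on the edge server

x = torch.randn(1, 3, 224, 224)            # captured input image
with torch.no_grad():
    z = head(x)                            # intermediate tensor sent over the wireless link
    y = tail(z)

# Without a natural bottleneck, z (here 1x256x56x56 = 802,816 floats) is far
# larger than the 3x224x224 input, so naive splitting offers little benefit.
print(x.numel(), z.numel())
```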
Firstly, we explore popular DNN models for image classification tasks and show that such natural bottlenecks do not appear in the early layers of most of these models; naive splitting approaches therefore degenerate into either local computing or full offloading. We propose a framework to split DNNs and minimize the capture-to-output delay across a wide range of network conditions and computing parameters. Different from prior work on DNN splitting frameworks, we distill the architecture of the head DNN to reduce its computational complexity and introduce a bottleneck, thus minimizing both the processing load at the mobile device and the amount of wirelessly transferred data.
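The following is a hedged sketch of this idea: a compact student head ending in a narrow (bottleneck) layer is trained to reproduce the intermediate output of the original, frozen head (the teacher). The channel counts, layer choices, and the plain MSE distillation loss are illustrative assumptions, not the exact configuration used in our framework.

```python
import torch
import torch.nn as nn

class BottleneckHead(nn.Module):
    """Compact student head with an injected bottleneck (illustrative sizes)."""
    def __init__(self, bottleneck_channels=12, out_channels=256):
        super().__init__()
        self.encoder = nn.Sequential(                    # runs on the mobile device
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, bottleneck_channels, 3, stride=2, padding=1),  # bottleneck: few channels => small transfer
        )
        self.decoder = nn.Sequential(                    # runs on the edge server
            nn.Conv2d(bottleneck_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def distillation_step(student_head, teacher_head, images, optimizer):
    """One step of head distillation: match the frozen teacher's intermediate features."""
    with torch.no_grad():
        target = teacher_head(images)
    loss = nn.functional.mse_loss(student_head(images), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only intermediate features are matched, the tail of the original DNN can be reused unchanged on the edge server.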
Secondly, since most prior work focuses on classification tasks and leaves the DNN structure unaltered, we turn our attention to three different object detection tasks, whose goals are more complex than image classification, and discuss split DNNs for these challenging tasks. We propose techniques to (i) achieve in-network compression by introducing a bottleneck layer in the early layers of the head model, and (ii) prefilter pictures that do not contain objects of interest using a lightweight neural network. The experimental results show that the proposed techniques represent an effective intermediate option between local and edge computing in a parameter region where these extreme-point solutions fail to provide satisfactory performance.
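A minimal sketch of the prefiltering component follows: a tiny binary classifier runs on the mobile device and discards frames unlikely to contain objects of interest, so that only promising frames are pushed through the split detection pipeline. The architecture, the `process_frame` helper, and the decision threshold are hypothetical placeholders rather than the exact model used.

```python
import torch
import torch.nn as nn

class PreFilter(nn.Module):
    """Tiny on-device classifier predicting whether a frame contains objects of interest."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, 1)

    def forward(self, x):
        return torch.sigmoid(self.classifier(self.features(x).flatten(1)))

def process_frame(frame, prefilter, head, send_to_edge, threshold=0.5):
    """Run the split detection pipeline only on frames that pass the prefilter."""
    with torch.no_grad():
        if prefilter(frame).item() < threshold:
            return None                      # drop the frame locally, no transmission
        return send_to_edge(head(frame))     # compressed bottleneck tensor goes over the air
```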
Lastly, we introduce the concept of supervised compression for split computing and adopt ideas from knowledge distillation and neural image compression to compress intermediate feature representations more efficiently. Our supervised compression approach uses a teacher model and a student model with a stochastic bottleneck and a learnable prior for entropy coding. We compare our approach to various compression baselines on three vision tasks and find that it achieves better supervised rate-distortion performance while also maintaining smaller end-to-end latency. We furthermore show that the learned feature representations can be tuned to serve multiple downstream tasks. To facilitate studies of supervised compression for split computing, we also propose a new tradeoff metric that considers not only data size and model accuracy but also encoder size, which should be minimized for resource-limited mobile devices.
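As a rough illustration of the training objective, the sketch below combines a stochastic bottleneck (additive uniform noise as a quantization surrogate), a learnable per-channel Gaussian prior for rate estimation, and a knowledge-distillation distortion term computed against frozen teacher features. The Gaussian prior, channel counts, and the weighting factor beta are assumptions for illustration, not the exact formulation used in our approach.

```python
import torch
import torch.nn as nn

class StochasticBottleneck(nn.Module):
    """Quantized latent with a learnable per-channel Gaussian prior (illustrative)."""
    def __init__(self, channels=24):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(channels))
        self.log_scale = nn.Parameter(torch.zeros(channels))

    def forward(self, z):
        # Training-time quantization surrogate: additive uniform noise; hard rounding at test time.
        z_hat = z + torch.empty_like(z).uniform_(-0.5, 0.5) if self.training else torch.round(z)
        prior = torch.distributions.Normal(
            self.mu.view(1, -1, 1, 1), self.log_scale.exp().view(1, -1, 1, 1))
        # Bits needed under the learned prior: -log2 P(z_hat falls in its quantization bin).
        p = prior.cdf(z_hat + 0.5) - prior.cdf(z_hat - 0.5)
        bits = -torch.log2(p.clamp_min(1e-9)).sum()
        return z_hat, bits

def supervised_rd_loss(student_encoder, bottleneck, server_decoder, teacher_features, images, beta=0.1):
    """Rate-distortion objective whose 'distortion' is a knowledge-distillation term."""
    z_hat, bits = bottleneck(student_encoder(images))
    with torch.no_grad():
        target = teacher_features(images)                 # frozen teacher representation
    distortion = nn.functional.mse_loss(server_decoder(z_hat), target)
    rate = bits / images.shape[0]                         # bits per image
    return distortion + beta * rate
```

In this view, the encoder and bottleneck run on the mobile device, the decoder and the remaining task-specific layers run on the edge server, and beta trades transferred bits against downstream task accuracy.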