The use of machine learning to analyze video data is growing explosively, yet modern vision models remain difficult and expensive to deploy in practice: as models become more accurate and robust, they also grow more complex and thus more resource-intensive. At the same time, the environments in which they are deployed, such as self-driving cars, demand results that are both fast and accurate.
Traditionally, all video data was sent to cloud servers, where models were run over the frames on GPU machines. Recently, though, edge computing has shown promise in addressing this tension between performance and resource usage. Resources available at the edge are highly heterogeneous in computational power and memory, and while most prior work assumes a well-equipped edge, we find that the devices deployed in practice are often inexpensive commodity hardware. This limits how much computation can practically happen at the edge.
In this thesis, we aim to make the most of these resource-constrained edge devices. We present two systems that improve the tradeoff between performance and resource usage in live video analysis. Our first system, Reducto, uses the limited compute available on smart cameras to run cheap computer vision techniques that filter out frames so similar to the previous frame that the previously computed result can be reused as an approximation. This lowers GPU usage by over 50% and doubles processing speed. Our second system, GEMEL, addresses the memory bottleneck of running many models on an edge server by finding and merging common layers across a diverse set of models. This lowers the memory footprint by up to 60% and improves accuracy by up to 39%.
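To make the filtering idea concrete, the following is a minimal sketch of on-camera frame filtering. The fixed threshold, the mean-pixel-difference feature, and the run_full_inference callback are illustrative assumptions only; Reducto itself selects among several cheap low-level features and tunes its filtering thresholds dynamically per query and per video.

```python
import numpy as np

def frame_difference(prev: np.ndarray, curr: np.ndarray) -> float:
    """Cheap low-level feature: mean absolute pixel difference, normalized to [0, 1]."""
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    return float(diff.mean() / 255.0)

def filter_and_infer(frames, run_full_inference, threshold=0.05):
    """Send a frame to the expensive model only when it differs enough from the
    last processed frame; otherwise reuse the cached result as an approximation."""
    prev_frame, cached_result = None, None
    results = []
    for frame in frames:
        if prev_frame is None or frame_difference(prev_frame, frame) > threshold:
            cached_result = run_full_inference(frame)  # expensive (e.g., GPU) query
            prev_frame = frame
        results.append(cached_result)  # filtered frames reuse the cached result
    return results
```

Frames that fall below the difference threshold never reach the server, which is where the GPU and bandwidth savings come from.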
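Similarly, the sketch below illustrates the memory saving behind layer merging: two task-specific models hold a single shared instance of their common layers, so those parameters occupy GPU memory only once. The toy architecture and model names are hypothetical; GEMEL identifies mergeable layers across real, heterogeneous vision models and jointly retrains the merged layers to preserve per-model accuracy.

```python
import torch.nn as nn

# One shared backbone instance: its parameters live in memory once,
# even though multiple task-specific models use it.
shared_backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

class TaskModel(nn.Module):
    """A per-task model that reuses the shared layers and adds its own head."""
    def __init__(self, backbone: nn.Module, num_classes: int):
        super().__init__()
        self.backbone = backbone  # same object, not a copy
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

vehicle_counter = TaskModel(shared_backbone, num_classes=10)
pedestrian_detector = TaskModel(shared_backbone, num_classes=2)

# Shared parameters are counted once, so total memory is well below
# the sum of two independently loaded models.
unique_params = {id(p): p.numel()
                 for m in (vehicle_counter, pedestrian_detector)
                 for p in m.parameters()}
print(f"unique parameters held in memory: {sum(unique_params.values()):,}")
```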