With the rapid development of the Internet of Things (IoT), networks, software, and computing platforms, the volume of generated data is increasing dramatically, ushering in the era of big data. These ever-increasing data volumes and their complexity require new algorithms and hardware platforms to deliver sufficient performance. Data from sensors, such as images, video, and text, contributed to the 2.5 quintillion bytes of data generated every day in 2020. The rate of data generation is outpacing the computational capabilities of conventional computing platforms and algorithms. CPU performance improvement has stagnated in recent years, which is one of the drivers behind the rise of application-specific accelerators for big data applications. FPGAs are also increasingly used to accelerate big data algorithms such as machine learning.

In this work, we develop and optimize both the hardware implementation and the algorithms for FPGA-based accelerators to increase the performance of machine learning applications. We leverage the Residue Number System (RNS) to optimize deep neural network (DNN) execution and develop an FPGA-based accelerator, called Residue-Net, to execute DNNs entirely in RNS on FPGAs. Residue-Net improves DNN throughput by 2.8x compared to the FPGA-based baseline. Even though running DNNs on FPGAs provides higher performance than running them on general-purpose processors, their intrinsic computational complexity makes it challenging to deliver high performance and low energy consumption on FPGAs, especially for edge devices. Less complex and more hardware-friendly machine learning algorithms are needed to revolutionize performance at and beyond the edge.
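To make the RNS idea above concrete, the sketch below shows how a multiply-accumulate decomposes into small, independent modular operations per residue channel, which is what makes RNS attractive for parallel DNN hardware. It is a simplified illustration, not Residue-Net's actual implementation; the moduli set and helper names are hypothetical.

```python
# Minimal sketch of Residue Number System (RNS) arithmetic; the moduli set and
# function names are illustrative, not Residue-Net's actual configuration.
from math import prod

MODULI = (7, 11, 13, 15)            # pairwise co-prime moduli
DYNAMIC_RANGE = prod(MODULI)        # integers in [0, 15015) are representable

def to_rns(x):
    """Encode an integer as its residues with respect to each modulus."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Multiply-accumulate performed independently (and in parallel) per
    residue channel: each channel only needs a small modular multiplier and
    adder, with no carry propagation between channels."""
    return tuple((r + ra * rb) % m for r, ra, rb, m in zip(acc, a, b, MODULI))

def from_rns(residues):
    """Decode back to a conventional integer via the Chinese Remainder Theorem."""
    x = 0
    for r, m in zip(residues, MODULI):
        M = DYNAMIC_RANGE // m
        x += r * M * pow(M, -1, m)   # pow(M, -1, m): modular inverse (Python 3.8+)
    return x % DYNAMIC_RANGE

# Example: 23 * 45 + 10 computed channel-wise in RNS
acc = rns_mac(to_rns(10), to_rns(23), to_rns(45))
assert from_rns(acc) == 23 * 45 + 10
```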
Hyperdimensional computing (HD) is one such efficient paradigm for machine learning. HD is intrinsically parallelizable and requires significantly fewer operations than DNNs, so it can easily be accelerated in hardware. We develop an automated tool to generate an FPGA-based accelerator, called HD2FPGA, for classification and clustering applications, with accuracy comparable to state-of-the-art machine learning algorithms at orders of magnitude higher efficiency. HD2FPGA achieves 578x speedup and 1500x energy reduction in end-to-end execution of HD classification compared to the CPU baseline. Compared to a state-of-the-art DNN running on an FPGA, HD2FPGA delivers 277x speedup and 172x energy reduction.
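The following is a minimal sketch of HD classification under common assumptions (bipolar hypervectors, a record-based encoder, cosine similarity); HD2FPGA's actual encoding and training pipeline may differ, and all names and sizes here are illustrative.

```python
# Minimal sketch of hyperdimensional (HD) classification with bipolar
# hypervectors; the encoder and names are illustrative, not HD2FPGA's pipeline.
import numpy as np

D = 10_000                                      # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar {-1, +1} hypervector."""
    return rng.choice((-1, 1), size=D)

# Hypothetical codebooks: one ID hypervector per feature position and one
# level hypervector per quantized feature value.
NUM_FEATURES, NUM_LEVELS = 64, 16
ID_HVS = [random_hv() for _ in range(NUM_FEATURES)]
LEVEL_HVS = [random_hv() for _ in range(NUM_LEVELS)]

def encode(quantized_features):
    """Bind (element-wise multiply) each feature ID with its level hypervector,
    then bundle (sum) the results into a single hypervector."""
    return sum(ID_HVS[i] * LEVEL_HVS[q] for i, q in enumerate(quantized_features))

def train(encoded_samples, labels, num_classes):
    """A class hypervector is simply the sum of its training encodings."""
    encoded_samples, labels = np.asarray(encoded_samples), np.asarray(labels)
    return np.stack([encoded_samples[labels == c].sum(axis=0)
                     for c in range(num_classes)])

def classify(query_hv, class_hvs):
    """Predict the class whose hypervector has the highest cosine similarity."""
    sims = class_hvs @ query_hv / (np.linalg.norm(class_hvs, axis=1) *
                                   np.linalg.norm(query_hv) + 1e-9)
    return int(np.argmax(sims))
```

Because encoding, training, and inference reduce to element-wise multiplies, additions, and a similarity search, each stage maps naturally onto wide, shallow FPGA pipelines.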
As the volume of data increases, a single FPGA is no longer sufficient to deliver the desired performance, so many cloud service providers offer multi-FPGA platforms. Data center workloads vary dramatically in size over time, leading to significant underutilization of computing resources such as FPGAs, which nevertheless continue to consume a large amount of power; this is a critical contributor to data center inefficiency. We propose an efficient framework to throttle the power consumption of multi-FPGA platforms by dynamically scaling the voltage, and hence the frequency, at run time based on predicting and adapting to the workload level while maintaining the desired Quality of Service (QoS). Our evaluations, implemented with state-of-the-art deep neural network accelerators, show that the proposed framework provides an average power reduction of 4.0x, surpassing prior work by 33.6% (up to 83%).
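As an illustration only, the sketch below captures the control loop implied above: predict the upcoming workload, then choose the lowest-power voltage/frequency operating point that still meets the QoS target. The operating points, predictor, and latency model are hypothetical placeholders, not the proposed framework's actual interface.

```python
# Minimal sketch of workload-aware voltage/frequency throttling under a QoS
# constraint; operating points, predictor, and latency model are hypothetical.
from collections import deque

# Hypothetical (voltage in V, frequency in MHz) points, ordered low to high power.
OPERATING_POINTS = [(0.70, 100), (0.80, 150), (0.90, 200), (1.00, 250)]
QOS_LATENCY_MS = 10.0                        # per-request latency target (QoS)
CYCLES_PER_REQUEST = 1_000_000               # assumed work per inference request

class DVFSController:
    def __init__(self, window=8):
        self.history = deque(maxlen=window)  # recent request rates (req/s)

    def predict_load(self, current_rate):
        """Predict the next interval's request rate as a moving average."""
        self.history.append(current_rate)
        return sum(self.history) / len(self.history)

    def select_point(self, predicted_rate):
        """Pick the lowest-power point whose throughput covers the predicted
        load and whose per-request latency still meets the QoS target."""
        for volt, freq_mhz in OPERATING_POINTS:
            capacity = freq_mhz * 1e6 / CYCLES_PER_REQUEST   # requests per second
            latency_ms = CYCLES_PER_REQUEST / (freq_mhz * 1e3)
            if capacity >= predicted_rate and latency_ms <= QOS_LATENCY_MS:
                return volt, freq_mhz
        return OPERATING_POINTS[-1]          # fall back to maximum performance

# Example: under a light predicted load the controller settles on the lowest
# voltage/frequency point, saving power while still meeting the QoS target.
ctrl = DVFSController()
print(ctrl.select_point(ctrl.predict_load(current_rate=50)))   # -> (0.7, 100)
```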