Characterizing Dynamic Frequency and Thread Blocking Scaling in GPUs: Challenges and Opportunities
Modern data centers increasingly employ GPUs to accelerate services. These GPUs are commonly used to process neural network-based requests such as image classification, speech recognition, and natural language processing. However, current GPUs have poor built-in power management and are not optimized for the varying request rates typical of data centers and cloud computing. In this work, we first characterize dynamic power management on real GPUs and show a non-linear relationship with diminishing returns between frequency and power. To overcome this constraint, we explore the effects of Thread Block scaling on throughput. Our Thread Block scaling characterization shows that the number of thread blocks per request can be limited with minimal overhead, freeing resources for other concurrent requests and increasing throughput. We also propose a novel power management policy based on dynamic frequency scaling that reduces total energy consumption while meeting tail latency requirements.