Scholarly Works (4 results)

Thesis
Peer Reviewed

Tuning Hardware and Software for Multiprocessors

Mohiyuddin, Marghoob
Advisor(s): Wawrzynek, John

UC Berkeley Electronic Theses and Dissertations (2012)

Technology scaling trends have enabled the exponential growth of computing power. However, the performance of communication subsystems scales less aggressively. This means that an application constrained by memory/interconnect performance will not be able to use the available computing power efficiently; in fact, technology scaling will make this efficiency even worse. This problem can be alleviated if algorithms minimize communication. To this end, we describe communication-avoiding algorithms and highly optimized implementations of a sparse linear algebra kernel called "matrix powers". Results show up to 2.3x improvement in performance over the naive algorithms on modern architectures. Our multi-core implementation of matrix powers enables us to develop a communication-avoiding iterative solver for sparse linear systems, which is up to 2.1x faster than a conventional Generalized Minimal Residual method (GMRES) implementation.
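For orientation, the naive baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes the matrix powers kernel computes the vectors x, Ax, A^2 x, ..., A^k x for a sparse matrix A stored in compressed sparse row (CSR) form, and the function names `csr_matvec` and `matrix_powers_naive` are hypothetical.

```python
def csr_matvec(indptr, indices, data, x):
    """Multiply a CSR matrix by a dense vector x and return the result."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def matrix_powers_naive(indptr, indices, data, x, k):
    """Return [x, A x, A^2 x, ..., A^k x] using k separate sparse mat-vecs.

    Each step streams the whole matrix again, so A is read k times.
    """
    basis = [list(x)]
    for _ in range(k):
        basis.append(csr_matvec(indptr, indices, data, basis[-1]))
    return basis


# 3x3 tridiagonal example: [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
basis = matrix_powers_naive(indptr, indices, data, [1.0, 1.0, 1.0], 2)
```

The repeated full pass over the matrix is the memory traffic that a communication-avoiding variant aims to reduce; how the thesis restructures the computation is not shown here.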

Another problem plaguing the supercomputer industry is the power bottleneck: power has, in fact, become the pre-eminent design constraint for future high-performance computing systems, which is why computational efficiency is being emphasized over simply peak performance. Static benchmark codes have traditionally been used to find architectures optimal with respect to specific metrics. Unfortunately, because compilers generate sub-optimal code, benchmark performance can be a poor indicator of the performance potential of architecture design points. Therefore, we present hardware/software co-tuning as a novel approach for system design. In co-tuning, traditional architecture space exploration is tightly coupled with software auto-tuning for delivering substantial improvements in area and power efficiency. We demonstrate co-tuning by exploring the parameter space of a Tensilica Xtensa-based multi-processor running three of the most heavily used kernels in scientific computing, each with widely varying micro-architectural requirements: sparse matrix vector multiplication, stencil-based computations, and general matrix-matrix multiplication. Results demonstrate that co-tuning improves hardware area and power efficiency by up to 3x and 2.4x respectively.
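The coupling of architecture exploration with software auto-tuning can be pictured as a nested search. The sketch below is only a toy under stated assumptions: the performance and power models, the parameters (cache size, block size), and every function name are hypothetical and are not taken from the thesis.

```python
def toy_performance(cache_kb, block):
    """Hypothetical model: larger blocks run faster until the working set
    (block * block * 8 bytes) no longer fits in the cache."""
    working_set_kb = block * block * 8 / 1024
    return float(block) if working_set_kb <= cache_kb else block / 4.0


def toy_power(cache_kb):
    """Hypothetical model: power grows linearly with cache size."""
    return 1.0 + 0.02 * cache_kb


def co_tune(cache_sizes, block_sizes):
    """Auto-tune the software for every hardware point, then pick the
    hardware point with the best performance per unit power."""
    best = None
    for cache_kb in cache_sizes:
        # Inner loop: software auto-tuning for this hardware configuration.
        block = max(block_sizes, key=lambda b: toy_performance(cache_kb, b))
        efficiency = toy_performance(cache_kb, block) / toy_power(cache_kb)
        if best is None or efficiency > best[2]:
            best = (cache_kb, block, efficiency)
    return best


def fixed_code_choice(cache_sizes, block):
    """Baseline: rank hardware points using one fixed, untuned code version."""
    return max(cache_sizes,
               key=lambda c: toy_performance(c, block) / toy_power(c))


caches = [8, 32, 128]
blocks = [16, 32, 64, 128]
tuned = co_tune(caches, blocks)
untuned = fixed_code_choice(caches, block=16)
```

In this toy model the fixed benchmark favors the smallest cache, while the tuned search selects a different design point, which illustrates why untuned benchmark performance can misrank architectures.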


Article
Peer Reviewed

A design methodology for domain-optimized power-efficient supercomputing

LBL Publications (2009)

As power has become the pre-eminent design constraint for future HPC systems, computational efficiency is being emphasized over simply peak performance. Recently, static benchmark codes have been used to find a power-efficient architecture. Unfortunately, because compilers generate sub-optimal code, benchmark performance can be a poor indicator of the performance potential of architecture design points. Therefore, we present hardware/software co-tuning as a novel approach for system design, in which traditional architecture space exploration is tightly coupled with software auto-tuning for delivering substantial improvements in area and power efficiency. We demonstrate the proposed methodology by exploring the parameter space of a Tensilica-based multi-processor running three of the most heavily used kernels in scientific computing, each with widely varying micro-architectural requirements: sparse matrix vector multiplication, stencil-based computations, and general matrix-matrix multiplication. Results demonstrate that co-tuning significantly improves hardware area and energy efficiency, a key driver for the next generation of HPC system design. Copyright 2009 ACM.


Article
Peer Reviewed

Hardware/software co‐design of global cloud system resolving models

LBL Publications (2011)

Article
Peer Reviewed

Hardware/software co-design for energy-efficient seismic modeling

LBL Publications (2011)

Reverse Time Migration (RTM) has become the standard for high-quality imaging in the seismic industry. RTM relies on PDE solutions using stencils that are 8th order or larger, which require large-scale HPC clusters to meet the computational demands. However, the rising power consumption of conventional cluster technology has prompted investigation of architectural alternatives that offer higher computational efficiency. In this work, we compare the performance and energy efficiency of three architectural alternatives: the Intel Nehalem X5530 multicore processor, the NVIDIA Tesla C2050 GPU, and a general-purpose manycore chip design optimized for high-order wave equations called "Green Wave". We have developed an FPGA-accelerated architectural simulation platform to accurately model the power and performance of the Green Wave design. Results show that across highly-tuned high-order RTM stencils, the Green Wave implementation can offer up to 8× and 3.5× energy efficiency improvement per node, compared with the Nehalem and GPU platforms respectively. These results point to the enormous potential energy advantages of our hardware/software co-design methodology. Copyright 2011 ACM.
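To make "8th order stencil" concrete, here is a minimal one-dimensional sketch. It uses the standard 8th-order central finite-difference coefficients for the second derivative; the function name and the 1D setting are assumptions for illustration, and the RTM kernels described in the abstract apply such stencils to multi-dimensional wavefields.

```python
# Standard 8th-order central-difference weights for the second derivative:
# the center point, then offsets 1 through 4 (used symmetrically).
COEFFS = [-205 / 72, 8 / 5, -1 / 5, 8 / 315, -1 / 560]


def second_derivative_8th(u, h):
    """Approximate u'' on a uniform grid with spacing h.

    Each output point reads 9 inputs (4 neighbors on each side plus itself).
    The 4 points at each boundary are left at 0.0 in this sketch.
    """
    radius = len(COEFFS) - 1
    out = [0.0] * len(u)
    for i in range(radius, len(u) - radius):
        acc = COEFFS[0] * u[i]
        for k in range(1, radius + 1):
            acc += COEFFS[k] * (u[i - k] + u[i + k])
        out[i] = acc / (h * h)
    return out


# u(x) = x^2 sampled at x = 0, 1, ..., 11; the exact second derivative is 2.
samples = [float(x * x) for x in range(12)]
approx = second_derivative_8th(samples, 1.0)
```

The wide neighborhood per output point is what makes high-order stencils demanding on memory bandwidth as well as on arithmetic.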
