Scholarly Works (9 results)

Article
Peer Reviewed

Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers

LBL Publications (2016)

The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it with respect to a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.

Article

Parallel conjugate gradient: effects of ordering strategies, programming paradigms, and architectural platforms

Lawrence Berkeley National Laboratory (2000)

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multithreaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
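
As a rough illustration of why SPMV dominates the work in a CG iteration, the following is a minimal serial sketch in pure Python. It is not the implementation evaluated in the paper: the CSR storage layout and the names `spmv`, `dot`, and `conjugate_gradient` are assumptions made for this example, and no ordering, partitioning, or parallelism is modeled.

```python
from typing import List, Tuple

# Assumed storage for this sketch: CSR (values, column indices, row pointers).
CSR = Tuple[List[float], List[int], List[int]]


def spmv(a: CSR, x: List[float]) -> List[float]:
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    values, col_idx, row_ptr = a
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access to x: this is where ordering affects cache reuse.
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(u: List[float], v: List[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a: CSR, b: List[float],
                       tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Solve A x = b for a symmetric positive definite A (hypothetical helper)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = spmv(a, p)    # the single SPMV per iteration
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Usage: the 2x2 system [[4, 1], [1, 3]] x = [1, 2].
a_small: CSR = ([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4])
solution = conjugate_gradient(a_small, [1.0, 2.0])
```

Each iteration performs one SPMV plus a handful of vector updates and dot products, so the irregular reads of `x[col_idx[k]]` are the part most sensitive to how rows and columns are ordered.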

Article
Peer Reviewed

Auto-Tuning the 27-point Stencil for Multicore

LBL Publications (2009)

This study focuses on the key numerical technique of stencil computations, used in many different scientific disciplines, and illustrates how auto-tuning can be used to produce very efficient implementations across a diverse set of current multicore architectures.

Article
Peer Reviewed

Optimization of Parallel Particle-to-Grid Interpolation on Leading Multicore Platforms

UC Berkeley Previously Published Works (2012)

We are now in the multicore revolution which is witnessing a rapid evolution of architectural designs due to power constraints and correspondingly limited microprocessor clock speeds. Understanding how to efficiently utilize these systems in the context of demanding numerical algorithms is an urgent challenge to meet the ever growing computational needs of high-end computing. In this work, we examine multicore parallel optimization of the particle-to-grid interpolation step in particle-mesh methods, an inherently complex optimization problem due to its low computation intensity, irregular data accesses, and potential fine-grained data hazards. Our evaluated kernels are derived from two important numerical computations: a biological simulation of the heart using the Immersed Boundary (IB) method, and a Gyrokinetic Particle-in-Cell (PIC)-based application for studying fusion plasma microturbulence. We develop several novel synchronization and grid decomposition schemes, as well as low-level optimization techniques to maximize performance on three modern multicore platforms: Intel's Xeon X5550 (Nehalem), AMD's Opteron 2356 (Barcelona), and Sun's UltraSparc T2+ (Niagara). Results show that our optimizations lead to significant performance improvements, achieving up to a 5.6× speedup compared to the reference parallel implementation. Our work also provides valuable insight into the design of future autotuning frameworks for particle-to-grid interpolation on next-generation systems. © 1990-2012 IEEE.

Article

Science Driven Supercomputing Architectures: Analyzing Architectural Bottlenecks with Applications and Benchmark Probes

LBL Publications (2005)

There is a growing gap between the peak speed of parallel computing systems and the actual delivered performance for scientific applications. In general this gap is caused by inadequate architectural support for the requirements of modern scientific applications, as commercial applications, and the much larger market they represent, have driven the evolution of computer architectures. This gap has raised the importance of developing better benchmarking methodologies to characterize and to understand the performance requirements of scientific applications, and to communicate them efficiently to influence the design of future computer architectures. This improved understanding of the performance behavior of scientific applications will allow improved performance predictions, development of adequate benchmarks for identification of hardware and application features that work well or poorly together, and a more systematic performance evaluation in procurement situations. The Berkeley Institute for Performance Studies has developed a three-level approach to evaluating the design of high end machines and the software that runs on them: 1) A suite of representative applications; 2) A set of application kernels; and 3) Benchmarks to measure key system parameters. The three levels yield different types of information, all of which are useful in evaluating systems, and enable NSF and DOE centers to select computer architectures more suited for scientific applications. The analysis will further allow the centers to engage vendors in discussion of strategies to alleviate the present architectural bottlenecks using quantitative information. These may include small hardware changes or larger ones that may be of interest to non-scientific workloads. 
Providing quantitative models to the vendors allows them to assess the benefits of technology alternatives using their own internal cost-models in the broader marketplace, ideally facilitating the development of future computer architectures more suited for scientific computations. The three levels also come with vastly different investments: the benchmarking efforts require significant rewriting to effectively use a given architecture, which is much more difficult on full applications than on smaller benchmarks.

Peer Reviewed

Auto-tuning stencil computations on multicore and accelerators

UC Berkeley Previously Published Works (2010)

The recent transformation from an environment where gains in computational performance came from increasing clock frequency and other hardware engineering innovations, to an environment where gains are realized through the deployment of ever increasing numbers of modest performance cores has profoundly changed the landscape of scientific application programming. This exponential increase in core count represents both an opportunity and a challenge: access to petascale simulation capabilities and beyond will require that this concurrency be efficiently exploited. The problem for application programmers is further compounded by the diversity of multicore architectures that are now emerging [4]. From relatively complex out-of-order CPUs with complex cache structures, to relatively simple cores that support hardware multithreading, to chips that require explicit use of software controlled memory, designing optimal code for these different platforms represents a serious impediment. An emerging solution to this problem is auto-tuning: the automatic generation of many versions of a code kernel that incorporate various tuning strategies, and the benchmarking of these to select the highest performing version. Typical tuning strategies might include: maximizing incore performance with loop unrolling and restructuring; maximizing memory bandwidth by exploiting non-uniform memory access (NUMA), engaging prefetch by directives; and minimizing memory traffic by cache blocking or array padding. Often a key parameter is associated with each tuning strategy (e.g., the amount of loop unrolling or the cache blocking factor), and these parameters must be explored in addition to the layering of the basic strategies themselves.
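
The generate-benchmark-select loop described in this abstract can be sketched as follows. This is a toy illustration, not the authors' framework: the kernel (a column-blocked sum over a 2D grid), the helper names `make_blocked_sum` and `autotune`, and the candidate block sizes are all assumptions made for this example. In pure Python the measured times say little about real cache behavior; the sketch only shows the structure of the search.

```python
import time
from typing import Callable, Dict, List, Tuple

Grid = List[List[float]]


def make_blocked_sum(block: int) -> Callable[[Grid], float]:
    """Generate one kernel variant: a sum over the grid in column blocks."""
    def kernel(grid: Grid) -> float:
        total = 0.0
        ncols = len(grid[0])
        for j0 in range(0, ncols, block):
            j1 = min(j0 + block, ncols)   # handle a partial last block
            for row in grid:
                for j in range(j0, j1):
                    total += row[j]
        return total
    return kernel


def autotune(grid: Grid, blocks: List[int],
             repeats: int = 3) -> Tuple[int, Dict[int, float]]:
    """Time every variant and return the fastest block size with all timings."""
    timings: Dict[int, float] = {}
    for block in blocks:
        kernel = make_blocked_sum(block)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(grid)
            best = min(best, time.perf_counter() - start)
        timings[block] = best             # keep the best of several runs
    return min(timings, key=timings.get), timings


# Usage: explore three hypothetical blocking factors on a small grid.
grid = [[1.0] * 64 for _ in range(32)]
best_block, timings = autotune(grid, [4, 16, 64])
```

A real auto-tuner would also layer several strategies (unrolling, padding, prefetch) and search their parameters jointly, which is why the abstract notes that the parameter space must be explored alongside the strategies themselves.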

Article
Peer Reviewed

Performance Characterization for Fusion Co-design Applications

LBL Publications (2011)

Magnetic fusion is a long-term solution for producing electrical power for the world, and the large thermonuclear international device (ITER) being constructed will produce net energy and a path to fusion energy provided the computer modeling is accurate. To effectively address the requirements of the high-end fusion simulation community, application developers, algorithm designers, and hardware architects must have reliable simulation data gathered at scale for scientifically valid configurations. This paper presents detailed benchmarking results for a set of magnetic fusion applications with a wide variety of underlying mathematical models including both particle-in-cell and Eulerian codes using both implicit and explicit numerical solvers. Our evaluation on a petascale Cray XE6 platform focuses on profiling these simulations at scale, identifying critical performance characteristics, including scalability, memory/network bandwidth limitations, and communication overhead. Overall, the results are key to improving fusion code design, and are a critical first step towards exascale hardware-software co-design, a process that tightly couples applications, algorithms, implementation, and computer architecture.

Peer Reviewed

Large-scale numerical simulations on high-end computational platforms

LBL Publications (2010)

After a decade where high-end computing was dominated by the rapid pace of improvements to CPU frequencies, the performance of next-generation supercomputers is increasingly differentiated by varying interconnect designs and levels of integration. Understanding the tradeoffs of these system designs is a key step towards making effective petascale computing a reality. In this work, we conduct an extensive performance evaluation of five key scientific application areas: plasma micro-turbulence, quantum chromodynamics, micro-finite-element solid mechanics, supernovae, and general relativistic astrophysics that use a variety of advanced computation methods, including adaptive mesh refinement, lattice topologies, particle in cell, and unstructured finite elements. Scalability results and analysis are presented on three current high-end HPC systems: the IBM Blue Gene/P at Argonne National Laboratory, the Cray XT4 at the Berkeley Laboratory’s NERSC Center, and an Intel Xeon cluster at Lawrence Livermore National Laboratory. In this chapter, we present each code as a section, where we describe the application, the parallelization strategies, and the primary results on each of the three platforms. Then we follow with a collective analysis of the codes' performance and make concluding remarks.

Article

ORNL Cray X1 evaluation status report

LBL Publications (2004)

On August 15, 2002 the Department of Energy (DOE) selected the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) to deploy a new scalable vector supercomputer architecture for solving important scientific problems in climate, fusion, biology, nanoscale materials and astrophysics. "This program is one of the first steps in an initiative designed to provide U.S. scientists with the computational power that is essential to 21st century scientific leadership," said Dr. Raymond L. Orbach, director of the department's Office of Science. In FY03, CCS procured a 256-processor Cray X1 to evaluate the processors, memory subsystem, scalability of the architecture, and software environment, and to predict the expected sustained performance on key DOE application codes. The results of the micro-benchmarks and kernel benchmarks show the architecture of the Cray X1 to be exceptionally fast for most operations. The best results are shown on large problems, where it is not possible to fit the entire problem into the cache of the processors. These large problems are exactly the types of problems that are important for the DOE and ultra-scale simulation. Application performance is found to be markedly improved by this architecture:
- Large-scale simulations of high-temperature superconductors run 25 times faster than on an IBM Power4 cluster using the same number of processors.
- Best performance of the parallel ocean program (POP v1.4.3) is 50 percent higher than on Japan's Earth Simulator and 5 times higher than on an IBM Power4 cluster.
- A fusion application, global GYRO transport, was found to be 16 times faster on the X1 than on an IBM Power3. The increased performance allowed simulations to fully resolve questions raised by a prior study.
- The transport kernel in the AGILE-BOLTZTRAN astrophysics code runs 15 times faster than on an IBM Power4 cluster using the same number of processors.
- Molecular dynamics simulations related to the phenomenon of photon echo run 8 times faster than previously achieved.
Even at 256 processors, the Cray X1 system is already outperforming other supercomputers with thousands of processors for a certain class of applications such as climate modeling and some fusion applications. This evaluation is the outcome of a number of meetings with both high-performance computing (HPC) system vendors and application experts over the past 9 months and has received broad-based support from the scientific community and other agencies.
