As the algorithms behind scientific applications become more complex, they demand more computing power and better parallel approaches to achieve satisfactory performance. As a consequence, supercomputers that deploy thousands of computing units are increasingly adopted to provide that power. However, while the computing power of CPU clusters has been increasing steadily by three to four orders of magnitude per decade, the demand from scientific simulations is growing at an even faster pace. Constrained by physical limits such as heat dissipation and power consumption, modern supercomputer systems have to rely on computationally denser accelerators for further performance gains. As a result, heterogeneous platforms are now frequently adopted in high-performance computing.
On the other hand, because of the low bandwidth and high latency of traditional PCIe, considerable effort must go into managing data transfers between hosts and devices to reach satisfactory overall performance in heterogeneous computing. These efforts, which aim at a balanced use of the platform's system resources, greatly increase the difficulty of concurrent programming, lengthen the development cycle, and prevent the timely use of accelerators in newly developed simulation algorithms.
To fully unlock the performance potential of heterogeneous platforms and simplify programming for them, the Coherent Accelerator Processor Interface (CAPI) has emerged as a solution that provides the infrastructure to treat the accelerator as a coherent peer to the host processors. With CAPI support on the OpenPOWER architecture, a heterogeneous platform combining an OpenPOWER server and an FPGA allows an application and its accelerated component to share the same virtual address space with low access latency. This feature removes the overhead and complexity of the I/O subsystem in traditional hybrid computation and enables a peer-to-peer coherent relationship between the FPGA and the OpenPOWER server. Combined with the OpenCL framework, CAPI can deliver higher system performance with a much smaller programming investment.
In this work, besides giving a high-level overview of CAPI and of the CAPI-based heterogeneous platform that recently became available, we present our CAPI-based solution for the Reverse Time Migration (RTM) algorithm. In geophysical exploration, RTM is generally the most time-consuming kernel in a complete migration algorithm for modeling the underground structure. Solving the RTM kernel efficiently means dealing with difficulties such as algorithmic complexity, which leads to low computing efficiency, and a huge dataset, which strains memory and data transfers. Combining optimization techniques from both the algorithmic and architectural perspectives, we make efficient use of both the host (OpenPOWER server) and the device (FPGA). The optimization approaches for the RTM solver, together with a performance comparison against CPU clusters, reveal basic characteristics of CAPI and yield tuning guidelines for making efficient use of CAPI-based heterogeneous platforms.
Haohuan Fu is an associate professor in the Ministry of Education Key Laboratory for Earth System Modeling and the Center for Earth System Sciences at Tsinghua University. His research interests include high-performance computing in earth and environmental sciences, computer architectures, performance optimizations, and programming tools in parallel computing. Fu has a PhD in computing from Imperial College London. He’s a member of IEEE.
Jingheng Xu is a PhD candidate in the Department of Computer Science and Technology at Tsinghua University. His research interests include accelerator-based solutions for global atmospheric modeling and exploration geophysics, focusing on algorithmic development and performance optimizations based on hybrid platforms such as CPUs, GPUs, and FPGAs.
Hong Bo Peng graduated from XiDian University with bachelor’s and master’s degrees in Applied Mathematics. He worked for 3 years at the Great Wall System and Software Incorporation. At the end of 2002, Hong Bo joined IBM China Corporation and has since worked in a variety of development roles. Since 2006, he has focused on developing high-performance mathematical libraries on the IBM Power platform. He has also helped customers tune their applications on IBM systems.
Yu Song has worked in the HPC industry for more than 4 years and has 10 years of experience with OS kernels. He’s part of the IBM HPC team, working on tuning HPC applications on OpenPOWER and on OpenPOWER system management. That means diving down a lot of rabbit holes and coming up with creative ways to enable applications on a highly distributed cluster. He’s the chair of the OpenPOWER group of IBM CSTL and a member of the system management group of the OpenPOWER Foundation.