Improving Performance of Data-Parallel Applications on CPU-GPU Heterogeneous Systems

Citation data:

Open Access Master's Theses

Publication Year:
2013
Repository URL:
https://digitalcommons.uri.edu/theses/48; https://digitalcommons.uri.edu/cgi/viewcontent.cgi?article=1050&context=theses
Author(s):
Duarte, Ronald
Publisher(s):
DigitalCommons@URI
thesis / dissertation description
Using two full applications with different characteristics, this thesis explores the performance and energy efficiency of CUDA-enabled GPUs and multi-core SIMD CPUs. Our implementations efficiently exploit both SIMD and thread-level parallelism on multi-core CPUs and the computational capabilities of CUDA-enabled GPUs. We discuss general optimization techniques and present a cost comparison of our CPU-only and CPU-GPU platforms. Finally, we present an evaluation of the implementation effort required to efficiently utilize multi-core SIMD CPUs and CUDA-enabled GPUs. The first application, seam carving, has been widely used for content-aware resizing of images and videos with little to no perceptible distortion. The gradient kernel, the portion of the seam carving operation with the largest execution time, was optimized and achieves over 102x speedup on the GPU. The overall resizing operation achieves a 32x speedup on a multi-core SIMD CPU. The time to resize one minute of 1920x1080 video with seam carving was reduced from 6 hours to 17 minutes on a heterogeneous CPU-GPU system. The second application, numerical simulation of cardiac action potential propagation (CAPPS), is a valuable tool for understanding the mechanisms promoting arrhythmias that may degenerate into spiral wave propagation. Our implementation of CAPPS reduces the simulation time from 10 days (single-core implementation) to approximately 4 hours and 8 minutes. This is 54% faster than the execution time of CAPPS on a 60-core CPU-only cluster using MPI. Moreover, our implementation is 18.4x more energy-efficient than the 60-core cluster implementation.