ExHET '25: Proceedings of the 2025 4th International Workshop on Extreme Heterogeneity Solutions

A Unified Portable and Programmable Framework for Task-Based Execution and Dynamic Resource Management on Heterogeneous Systems

Heterogeneous computing systems are essential for addressing the diverse computational needs of modern applications. However, they present a fundamental trade-off between easy programmability and performance. This paper addresses this trade-off by enabling performance and energy efficiency optimization while facilitating easy programming without delving into hardware details. It introduces CEDR-Taskflow, a comprehensive framework that automatically parallelizes user applications and dynamically schedules their tasks onto heterogeneous platforms, enabling efficient resource utilization and ease of programming. Emulation-based studies on the Xilinx ZCU102 and NVIDIA Jetson AGX Xavier SoC platforms demonstrate that this integrated framework improves application execution time by up to 1.47x compared to the state of the art, while maintaining hardware-agnostic application development. Furthermore, this integration approach enables features such as streaming-enabled execution and schedule caching that reduce the time spent on task scheduling by up to 29.6x and result in up to 6.1x lower execution time.

From OpenACC to OpenMP5 GPU Offloading: Performance Evaluation on NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) are widely used to evaluate parallel programming models, yet they lack a native OpenMP offloading implementation for GPUs. This gap is significant given OpenMP’s emergence as a versatile standard for heterogeneous systems, offering broad compatibility with both current and future GPU architectures. Existing solutions, such as those that translate OpenACC directly to a binary executable, are limited by OpenACC’s stagnation and vendor-specific constraints, and they do not expose the OpenMP code they use internally as an intermediate representation.

This work addresses this limitation by developing a source-level translation of OpenACC-based NPB benchmarks into OpenMP5 offloading code. This translation employs a combination of an automated source-to-source tool and manual optimization to ensure efficient execution across various GPU architectures. Performance evaluations indicate that the translated OpenMP versions deliver results comparable to the original OpenACC implementations, validating their reliability for GPU-based computations. Additionally, comparisons between GPU-accelerated OpenMP implementations and traditional CPU-based benchmarks reveal significant performance gains, especially in computationally intensive workloads. These findings highlight OpenMP’s potential as a unified programming model, offering superior portability and optimization capabilities across diverse hardware platforms.

The sources of this work are available at our repository.

Extending SEER for Extreme Heterogeneity

Heterogeneous and multi-device nodes are increasingly common in high-performance computing and data centers, yet existing programming models often lack simple, transparent, and portable support for these diverse architectures. The main contribution of this work is the development of novel SEER capabilities that address this challenge by providing a descriptive programming model that allows applications to seamlessly leverage heterogeneous nodes across various device types. SEER uses efficient memory management and can select the proper device(s) depending on the computational cost of the application. This is completely transparent to the programmer, thereby providing a highly productive programming environment. Integrating extreme heterogeneity into the SEER library, demonstrated here by using NVIDIA and AMD GPUs simultaneously, allows it to expand and exploit new performance possibilities. Our analysis, based on the well-known Conjugate Gradient algorithm, reports speedups above 1.5× on the computationally demanding steps of the algorithm when both architectures are used simultaneously.