FHPNC 2021: Proceedings of the 9th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing


Improving GHC Haskell NUMA profiling

As the number of cores increases, Non-Uniform Memory Access (NUMA) is becoming increasingly prevalent in general-purpose machines. Effectively exploiting NUMA can significantly reduce memory access latency and thus runtime by 10-20%, and profiling provides information on how to optimise. Language-level NUMA profilers are rare, and mostly profile conventional languages executing on virtual machines. Here we profile, and develop new NUMA profilers for, a functional language executing on a runtime system.

We start by using existing OS-level and language-level tools to systematically profile 8 benchmarks from the GHC Haskell nofib suite on a typical NUMA server (8 regions, 64 cores). We propose a new metric, NUMA access rate, that allows us to compare the load placed on the memory system by different programs, and use it to contrast the benchmarks. We demonstrate significant differences in NUMA usage between computational and data-intensive benchmarks, e.g. local memory access rates of 23% and 30% respectively. We show that small changes to coordination behaviour can significantly alter NUMA usage, and for the first time quantify the effectiveness of the GHC 8.2 NUMA adaption.

We identify information not available from existing profilers and extend both the numaprof profiler and the GHC runtime system to obtain three new NUMA profiles: OS thread allocation locality, GC count (per region and generation) and GC thread locality. The new profiles not only provide a deeper understanding of program memory usage, but also suggest ways that GHC can be adapted to better exploit NUMA architectures.

Parallelism-preserving automatic differentiation for second-order array languages

We develop automatic differentiation (AD) procedures for reductions and scans—parameterized by arbitrary differentiable monoids—in a way that preserves parallelism, by rewriting them as other reductions and scans. This is in contrast with the literature and with existing AD systems, which are either general, but force sequential execution of the derivative program, or only include hand-crafted rules for a select few monoids (usually (0, +), (1, ×), (−∞, max) and (∞, min)) and thus lack the general flexibility of second-order languages.
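As a small illustration of a derivative that is itself expressed with scans, the sketch below computes the gradient of a product reduction for the (1, ×) monoid using two exclusive scans. It is a textbook special case written for this listing, not the paper's general construction; the names exclusive_scan and product_gradient are hypothetical, and the scans run sequentially here although each could be replaced by a parallel scan.

```python
from itertools import accumulate
import operator


def exclusive_scan(op, identity, xs):
    """Exclusive prefix scan: out[i] combines xs[0..i-1], starting from identity."""
    # accumulate(..., initial=identity) yields len(xs) + 1 values; drop the last
    # one (the full reduction) to obtain the exclusive scan.
    return list(accumulate(xs, op, initial=identity))[:-1]


def product_gradient(xs):
    """Gradient of prod(xs) with respect to each xs[i].

    d(prod)/d(xs[i]) = (product of elements before i) * (product of elements after i),
    so the gradient is an elementwise product of a forward exclusive scan and a
    backward exclusive scan. No division is used, so zeros in xs are handled.
    """
    prefix = exclusive_scan(operator.mul, 1.0, xs)
    suffix = exclusive_scan(operator.mul, 1.0, xs[::-1])[::-1]
    return [p * s for p, s in zip(prefix, suffix)]


if __name__ == "__main__":
    print(product_gradient([2.0, 3.0, 4.0]))  # [12.0, 8.0, 6.0]
```

The point of the example is only that the derivative program has the same shape as the original (scans and maps), so it does not force sequential execution.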

Computing persistent homology in Futhark

We present a massively parallel algorithm for computing persistent homology, a concept within the field of topological data analysis, and we implement it in the purely functional array-based language Futhark, which has an efficient compiler targeting GPUs. Computing persistent homology consists of bringing a certain sparse matrix to a reduced form. We compare our implementation with OpenPH, an existing library for computing persistent homology on GPUs, and on large matrices we achieve speedups of 2.3× to 5×. Our work shows both that persistent homology can be computed efficiently entirely on GPU hardware, and that Futhark can be used for this kind of sparse matrix manipulation.
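To make "bringing a certain sparse matrix to a reduced form" concrete, the following sketch shows the standard sequential column reduction of a boundary matrix over Z/2. It is an assumption that this is the reduced form meant; it is not the paper's massively parallel algorithm, and the function name and the set-of-row-indices column representation are hypothetical choices for illustration.

```python
def reduce_boundary_matrix(columns):
    """Standard left-to-right column reduction over Z/2.

    columns[j] is the set of row indices holding a 1 in column j.
    Returns (reduced_columns, pairs), where each pair (i, j) means the
    lowest nonzero row of reduced column j is i.
    """
    reduced = [set(c) for c in columns]
    pivot_owner = {}  # lowest row index -> column whose pivot sits there
    for j, col in enumerate(reduced):
        # While another, earlier column shares this column's lowest entry,
        # add it (mod 2), which is a symmetric difference of index sets.
        while col and max(col) in pivot_owner:
            col ^= reduced[pivot_owner[max(col)]]
        if col:
            pivot_owner[max(col)] = j
    pairs = sorted(pivot_owner.items())
    return reduced, pairs


if __name__ == "__main__":
    # Filtered triangle: vertices 0, 1, 2; edges 3 = {0,1}, 4 = {1,2},
    # 5 = {0,2}; face 6 = {3,4,5}.
    boundary = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
    _, pairs = reduce_boundary_matrix(boundary)
    print(pairs)  # [(1, 3), (2, 4), (5, 6)]
```

In the example, edge 5 reduces to the empty column (it closes a cycle), and that cycle is then paired with the face 6.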

Generating high performance code for irregular data structures using dependent types

Parallel architectures offer high performance but are challenging to program. Data-parallel functional languages offer a solution by providing a high-level programming model to work with accelerators such as GPUs. Existing languages are designed to work with dense arrays, limiting their usefulness in expressing irregular data structures such as graphs and sparse matrices, which are important in many application domains.

This paper addresses this limitation by extending a data-parallel language with limited dependent types, including position dependent arrays and dependent pairs to model irregular data structures. The approach is demonstrated through three case studies: dense to sparse matrix conversion, sparse matrix-vector multiplication, and parallel breadth-first search.
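Two of the case studies can be sketched in plain Python to show where the irregularity comes from. The sketch assumes the compressed sparse row (CSR) layout, which the abstract does not name, and the function names are hypothetical. The number of stored entries in row i depends on i, which is the kind of shape a position-dependent array is meant to describe.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays.

    Returns (values, col_idx, row_ptr); row i occupies the half-open
    range row_ptr[i] .. row_ptr[i + 1] of values and col_idx.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # the row length varies with the row
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x for A stored in CSR."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


if __name__ == "__main__":
    dense = [[1, 0, 2],
             [0, 0, 0],
             [0, 3, 0]]
    values, col_idx, row_ptr = dense_to_csr(dense)
    print(row_ptr)                                          # [0, 2, 2, 3]
    print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3]))  # [7, 0, 6]
```

Each output row is computed independently, so the outer loop is a natural candidate for data-parallel execution, while the inner loop length differs from row to row.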

Experimental results show that this approach outperforms state-of-the-art implementations on GPUs. Compared to NVIDIA’s cuSPARSE, our automatically generated code achieves an average speedup of 1.2× for dense to sparse matrix conversion and 1.3× for sparse matrix-vector multiplication.