PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

Full Citation in the ACM Digital Library

SESSION: Data Structures

Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization

Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate concurrent B+trees performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and variance in response time increasing. Moreover, previous methods do not guarantee linearizability among concurrent requests.

In this paper, we design a combined-based concurrency control framework, called Eirene, for GPU B+tree to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of the retry reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%.

The State-of-the-Art LCRQ Concurrent Queue Algorithm Does NOT Require CAS2

Concurrent queues are, arguably, one of the most important data structures in high-load applications, which require them to be extremely fast and scalable. Achieving these properties is non-trivial. The early solutions, such as the classic queue by Michael and Scott, store elements in a concurrent linked list. Reputedly, this design is non-scalable and memory-inefficient. Modern solutions utilize the Fetch-and-Add instruction to improve the algorithm's scalability and store elements in arrays to reduce the memory pressure. One of the most famous and fast such algorithms is LCRQ. The main disadvantage of its design is that it relies on the atomic CAS2 instruction, which is unavailable in most modern programming languages, such as Java, Kotlin, or Go, let alone some architectures.

This paper presents the LPRQ algorithm, a portable modification of the original LCRQ design that eliminates all CAS2 usages. In contrast, it performs the synchronization utilizing only the standard Compare-and-Swap and Fetch-and-Add atomic instructions. Our experiments show that LPRQ provides the same performance as the classic LCRQ algorithm, outrunning the fastest of the existing solutions that do not use CAS2 by up to 1.6×.

Provably Good Randomized Strategies for Data Placement in Distributed Key-Value Stores

Distributed storage systems are used widely in clouds, databases, and file systems. These systems store a large amount of data across multiple servers. When a request to access data comes in, it is routed to the appropriate server, queued, and eventually processed. If the server's queue is full, then requests may be rejected. Thus, one important challenge when designing the algorithm for allocating data to servers is the fact that the request pattern may be unbalanced, unpredictable, and may change over time. If some servers get a large fraction of the requests, they are overloaded, leading to many rejects. In this paper, we analyze this problem theoretically under adversarial assumptions. In particular, we assume that the request sequence is generated by an adversarial process to maximize the number of rejects and analyze the performance of various algorithmic strategies in terms of the fraction of the requests rejected. We show that no deterministic strategy can perform well. On the other hand, a simple randomized strategy guarantees that at most a constant fraction of requests are rejected in expectation. We also show that moving data to load balance is essential if we want to reject a very small fraction (1/m where m is the number of servers) of requests. We design a strategy with randomization and data transfer to achieve this performance with speed augmentation. Finally, we conduct experiments and show that our algorithms perform well in practice.

SESSION: Algorithms

2PLSF: Two-Phase Locking with Starvation-Freedom

Invented more than 40 years ago, the two-phase locking concurrency control (2PL) is capable of providing opaque transactions over multiple records. However, classic 2PL can suffer from live-lock and its scalability is low when applied to workloads with a high number of non-disjoint read accesses.

In this paper we propose a new 2PL algorithm (2PLSF) which, by utilizing a novel reader-writer lock, provides highly scalable transactions with starvation-freedom guarantees. Our 2PLSF concurrency control can be applied to records, metadata and to indexing data structures used in database management systems (DBMS). In our experiments we show that 2PLSF is often superior to classical 2PL and can surpass concurrency controls with optimistic reads, simultaneously providing high throughput and low latency for disjoint and non-disjoint workloads.

Provably Fast and Space-Efficient Parallel Biconnectivity

Computing biconnected components (BCC) of a graph is a fundamental graph problem. The canonical parallel BCC algorithm is the Tarjan-Vishkin algorithm, which has O(n + m) optimal work and polylogarithmic span on a graph with n vertices and m edges. However, Tarjan-Vishkin is not widely used in practice. We believe the reason is the space-inefficiency (it uses O(m) extra space). In practice, existing parallel implementations are based on breath-first search (BFS). Since BFS has span proportional to the diameter of the graph, existing parallel BCC implementations suffer from poor performance on large-diameter graphs and can be slower than the sequential algorithm on many real-world graphs.

We propose the first p arallel b iconnectivity algorithm (FAST-BCC) that has optimal work, polylogarithmic span, and is space-efficient. Our algorithm creates a skeleton graph based on any spanning tree of the input graph. Then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. We carefully analyze the correctness of our algorithm, which is highly non-trivial.

We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We tested them on a 96-core machine on 27 graphs with varying edge distributions. FAST-BCC is the fastest on all graphs. On average (geometric means), FAST-BCC is 3.1× faster than the best existing baseline on each graph.

Practically and Theoretically Efficient Garbage Collection for Multiversioning

Multiversioning is widely used in databases, transactional memory, and concurrent data structures. It can be used to support read-only transactions that appear atomic in the presence of concurrent update operations. Any system that maintains multiple versions of each object needs a way of efficiently reclaiming them. We experimentally compare various existing reclamation techniques by applying them to a multiversion tree and a multiversion hash table.

Using insights from these experiments, we develop two new multiversion garbage collection (MVGC) techniques. These techniques use two novel concurrent version list data structures. Our experimental evaluation shows that our fastest technique is competitive with the fastest existing MVGC techniques, while using significantly less space on some workloads. Our new techniques provide strong theoretical bounds, especially on space usage. These bounds ensure that the schemes have consistent performance, avoiding the very high worst-case space usage of other techniques.

SESSION: Programming Models

A Programming Model for GPU Load Balancing

We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior to our work, the only way to unleash the GPU's potential on irregular problems has been to workload-balance through application-specific, tightly coupled load-balancing techniques.

With our open-source framework for load-balancing, we hope to improve programmers' productivity when developing irregular-parallel algorithms on the GPU, and also improve the overall performance characteristics for such applications by allowing a quick path to experimentation with a variety of existing load-balancing techniques. Consequently, we also hope that by separating the concerns of load-balancing from work processing within our abstraction, managing and extending existing code to future architectures becomes easier.

Exploring the Use of WebAssembly in HPC

Containerization approaches based on namespaces offered by the Linux kernel have seen an increasing popularity in the HPC community both as a means to isolate applications and as a format to package and distribute them. However, their adoption and usage in HPC systems faces several challenges. These include difficulties in unprivileged running and building of scientific application container images directly on HPC resources, increasing heterogeneity of HPC architectures, and access to specialized networking libraries available only on HPC systems. These challenges of container-based HPC application development closely align with the several advantages that a new universal intermediate binary format called WebAssembly (Wasm) has to offer. These include a lightweight userspace isolation mechanism and portability across operating systems and processor architectures. In this paper, we explore the usage of Wasm as a distribution format for MPI-based HPC applications. To this end, we present MPIWasm, a novel Wasm embedder for MPI-based HPC applications that enables high-performance execution of Wasm code, has low-overhead for MPI calls, and supports high-performance networking interconnects present on HPC systems. We evaluate the performance and overhead of MPIWasm on a production HPC system and AWS Graviton2 nodes using standardized HPC benchmarks. Results from our experiments demonstrate that MPIWasm delivers competitive native application performance across all scenarios. Moreover, we observe that Wasm binaries are 139.5x smaller on average as compared to the statically-linked binaries for the different standardized benchmarks.

Fast and Scalable Channels in Kotlin Coroutines

Asynchronous programming has gained significant popularity over the last decade: support for this programming pattern is available in many popular languages via libraries and native language implementations, typically in the form of coroutines or the async/await construct. Instead of programming via shared memory, this concept assumes implicit synchronization through message passing. The key data structure enabling such communication is the rendezvous channel. Roughly, a rendezvous channel is a blocking queue of size zero, so both send(e) and receive() operations wait for each other, performing a rendezvous when they meet. To optimize the message passing pattern, channels are usually equipped with a fixed-size buffer, so sends do not suspend and put elements into the buffer until its capacity is exceeded. This primitive is known as a buffered channel.

This paper presents a fast and scalable algorithm for both rendezvous and buffered channels. Similarly to modern queues, our solution is based on an infinite array with two positional counters for send(e) and receive() operations, leveraging the unconditional Fetch-And-Add instruction to update them. Yet, the algorithm requires non-trivial modifications of this classic pattern, in order to support the full channel semantics, such as buffering and cancellation of waiting requests. We compare the performance of our solution to that of the Kotlin implementation, as well as against other academic proposals, showing up to 9.8× speedup. To showcase its expressiveness and performance, we also integrated the proposed algorithm into the standard Kotlin Coroutines library, replacing the previous channel implementations.

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model.

We propose an alternative approach that automatically translates programs written in one programming model (CUDA), into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7×.

SESSION: Applications

A Scalable Hybrid Total FETI Method for Massively Parallel FEM Simulations

The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method plays an important role in solving large-scale and complex engineering problems. This method needs to handle numerous matrix-vector multiplications. Directly calling the vendor-optimized library for general matrix-vector multiplication (gemv) on GPU leads to low performance, since it does not consider optimizations for different matrix sizes in HTFETI, i.e. different row and column sizes. In addition, state-of-the-art graph partitioning methods cannot guarantee load balancing for HTFETI, since the matrix size is determined by the length of the subdomain boundary. To solve the problems above, we first port gemv to the multi-stream pipeline scheme and develop a new batched kernel function on GPU, which brings 15%~30% throughput improvement and 37% average GFLOPs improvement, respectively. We also propose a multi-grained load-balancing scheme based on graph repartitioning and work-stealing, and the load imbalance ratio is down to 1.05~1.09 from 1.5. We have successfully applied the scalable HTFETI method to simulate the whole core assembly of China Experimental Fast Reactor (CEFR) for steady-state analysis, and the efficiencies of weak scalability and strong scalability reach 78% and 72% on 12,288 GPUs, respectively. As far as we know, this is the first time that HTFETI has been used in large-scale and high-fidelity whole core assembly simulation.

Lifetime-Based Optimization for Simulating Quantum Circuits on a New Sunway Supercomputer

High-performance classical simulator for quantum circuits, in particular the tensor network contraction algorithm, has become an important tool for the validation of noisy quantum computing. In order to address the memory limitations, the slicing technique is used to reduce the tensor dimensions, but it could also lead to additional computation overhead that greatly slows down the overall performance. This paper proposes novel lifetime-based methods to reduce the slicing overhead and improve the computing efficiency, including, an interpretation method to deal with slicing overhead, an inplace slicing strategy to find the smallest slicing set and an adaptive tensor network contraction path refiner customized for Sunway architecture. Experiments show that in most cases the slicing overhead with our inplace slicing strategy would be less than the Cotengra, which is the most used graph path optimization software at present. Finally, the resulting simulation time is reduced to 96.1s for the Sycamore quantum processor RQC, with a sustainable single-precision performance of 308.6Pflops using over 41M cores to generate 1M correlated samples, which is more than 5 times performance improvement compared to 60.4 Pflops in 2021 Gordon Bell Prize work.

High-Performance Filters for GPUs

Filters approximately store a set of items while trading off accuracy for space-efficiency and can address the limited memory on accelerators, such as GPUs. However, there is a lack of high-performance and feature-rich GPU filters as most advancements in filter research has focused on CPUs.

In this paper, we explore the design space of filters with a goal to develop massively parallel, high performance, and feature rich filters for GPUs. We evaluate various filter designs in terms of performance, usability, and supported features and identify two filter designs that offer the right trade off in terms of performance, features, and usability.

We present two new GPU-based filters, the TCF and GQF, that can be employed in various high performance data analytics applications. The TCF is a set membership filter and supports faster inserts and queries, whereas the GQF supports counting which comes at an additional performance cost. Both the GQF and TCF provide point and bulk insertion API and are designed to exploit the massive parallelism in the GPU without sacrificing usability and necessary features. The TCF and GQF are up to 4.4× and 1.4× faster than the previous GPU filters in our benchmarks and at the same time overcome the fundamental constraints in performance and usability in current GPU filters.

High-Performance and Scalable Agent-Based Simulation with BioDynaMo

Agent-based modeling plays an essential role in gaining insights into biology, sociology, economics, and other fields. However, many existing agent-based simulation platforms are not suitable for large-scale studies due to the low performance of the underlying simulation engines. To overcome this limitation, we present a novel high-performance simulation engine.

We identify three key challenges for which we present the following solutions. First, to maximize parallelization, we present an optimized grid to search for neighbors and parallelize the merging of thread-local results. Second, we reduce the memory access latency with a NUMA-aware agent iterator, agent sorting with a space-filling curve, and a custom heap memory allocator. Third, we present a mechanism to omit the collision force calculation under certain conditions.

Our evaluation shows an order of magnitude improvement over Biocellion, three orders of magnitude speedup over Cortex3D and NetLogo, and the ability to simulate 1.72 billion agents on a single server.

Supplementary Materials, including instructions to reproduce the results, are available at: https://doi.org/10.5281/zenodo.6463816

SESSION: Task Parallelism

OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code

This paper presents OpenCilk, an open-source software infrastructure for task-parallel programming that allows for substantial code reuse and easy exploration of design choices in language abstraction, compilation strategy, runtime mechanism, and productivity-tool development.

The OpenCilk infrastructure consists of three main components: a compiler designed to compile fork-join task-parallel code, an efficient work-stealing runtime scheduler, and a productivity-tool development framework based on compiler instrumentation designed for fork-join parallel computations. OpenCilk is modular --- modifying one component for the most part does not necessitate modifications to the other components --- and easy to extend --- its construction naturally encourages code reuse. Despite being modular and easy to extend, OpenCilk produces high-performing code.

We investigated OpenCilk's modularity, extensibility, and performance through several case studies, including a study to extend OpenCilk to support multiple parallel runtime systems, including Cilk Plus, OpenMP, and oneTBB. OpenCilk's design enables rapid prototyping of new compiler back ends to target different parallel-runtime ABIs. Each back end required fewer than 2000 new lines of code. We examined the OpenCilk runtime's performance empirically on 15 benchmark Cilk programs and found that it outperforms the other runtimes by a geometric mean of 4%--26% on 1 core and 10%--120% on 48 cores.

Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness

The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. Deciding the placement of data objects on HM is critical for high performance. We reveal a performance problem related to data placement on HM. The problem is manifested as load imbalance among tasks in task-parallel HPC applications. The root of the problem comes from being unaware of parallel-task semantics and an incorrect assumption that bringing frequently accessed pages to fast memory always leads to better performance. To address this problem, we introduce a load balance-aware page management system, named Merchandiser. Merchandiser introduces task semantics during memory profiling, rather than being application-agnostic. Using the limited task semantics, Merchandiser effectively sets up coordination among tasks on the usage of HM to finish all tasks fast instead of only considering any individual task. Merchandiser is highly automated to enable high usability. Evaluating with memory-consuming HPC applications, we show that Merchandiser reduces load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution.

Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence

Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.

SESSION: Transactions

Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing

Block-STM is a parallel execution engine for smart contracts, built around the principles of Software Transactional Memory. Transactions are grouped in blocks, and every execution of the block must yield the same deterministic outcome. Block-STM further enforces that the outcome is consistent with executing transactions according to a preset order, leveraging this order to dynamically detect dependencies and avoid conflicts during speculative transaction execution. At the core of Block-STM is a novel, low-overhead collaborative scheduler of execution and validation tasks.

Block-STM is implemented on the main branch of the Diem Blockchain code-base and runs in production at Aptos. Our evaluation demonstrates that Block-STM is adaptive to workloads with different conflict rates and utilizes the inherent parallelism therein. Block-STM achieves up to 110k tps in the Diem benchmarks and up to 170k tps in the Aptos Benchmarks, which is a 20x and 17x improvement over the sequential baseline with 32 threads, respectively. The throughput on a contended workload is up to 50k tps and 80k tps in Diem and Aptos benchmarks, respectively.

TL4x: Buffered Durable Transactions on Disk as Fast as in Memory

The arrival of persistent memory devices to consumer market has revived the interest in transactional durable algorithms. Persistent memory (PM) is touted as having two attributes that distinguish it from other storage technologies: byte-addressability and fast transactional persistence.

In this work we investigate how these attributes differentiate PM from block storage in the context of buffered durability. We present a novel algorithm, TL4x, capable of providing buffered durable linearizable transactions with high scalability for disjoint writes and efficient persistence on either PM or block storage devices. TL4x is a software-only user-space solution that optimizes writes to persistent storage, providing buffered durable transactions whose cost is negligible compared to similar non-durable transactions. TL4x maintains a volatile consistent snapshot which is used for buffered durability and shared with irrevocable read-only transactions, allowing long range-query operations to run in parallel with write transactions. We use TL4x to implement a transactional database engine that can outperform RocksDB by an order of magnitude.

SESSION: Decompositions

TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 2.21× speedup over cuDNN, 1.12× speedup over TVM, and 3.27× over the original models using cuDNN with at most 0.05% accuracy loss.

Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous Systems

One-sided dense matrix decompositions (e.g., Cholesky, LU, and QR) are the key components in scientific computing in many different fields. Although their design has been highly optimized for modern processors, they still consume a considerable amount of energy. As CPU-GPU heterogeneous systems are commonly used for matrix decompositions, in this work, we aim to further improve the energy saving of onesided matrix decompositions on CPU-GPU heterogeneous systems. We first build an Algorithm-Based Fault Tolerance protected overclocking technique (ABFT-OC) to enable us to exploit reliable overclocking for key matrix decomposition operations. Then, we design an energy-saving matrix decomposition framework, Bi-directional Slack Reclamation (BSR), that can intelligently combine the capability provided by ABFT-OC and DVFS to maximize energy saving and maintain performance and reliability. Experiments show that BSR is able to save up to 11.7% more energy compared with the current best energy saving optimization approach with no performance degradation and up to 14.1% Energy×Delay2 reduction. Also, BSR enables the Pareto efficient performance-energy trade-off, which is able to provide up to 1.43× performance improvement without costing extra energy.

End-to-End LU Factorization of Large Matrices on GPUs

LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.

Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core

Symmetric eigenvalue decomposition (EVD) is a fundamental analytic and numerical tool used in many scientific areas. The state-of-the-art algorithm in terms of performance is typically the two-stage tridiagonalization method. The first stage in the two-stage tridiagonalization is called successive band reduction (SBR), which reduces a symmetric matrix to a band form, and its computational cost usually dominates. When Tensor Core (specialized matrix computational accelerator) is used to accelerate the expensive EVD, the conventional ZY-representation-based method results in suboptimal performance due to unfavorable shapes of the matrix computations. In this paper, we propose a new method that uses WY representation instead of ZY representation (see Section 3.2 for details), which can provide a better combination of locality and parallelism so as to perform better on Tensor Cores. Experimentally, the proposed method can bring up to 3.7x speedup in SBR and 2.3x in the entire EVD compared to state-of-the-art implementations.

SESSION: Kernels

iQAN: Fast and Accurate Vector Search with Efficient Intra-Query Parallelism on Multi-Core Architectures

Vector search has drawn a rapid increase of interest in the research community due to its application in novel AI applications. Maximizing its performance is essential for many tasks but remains preliminary understood. In this work, we investigate the root causes of the scalability bottleneck of using intra-query parallelism to speedup the state-of-the-art graph-based vector search systems on multi-core architectures. Our in-depth analysis reveals several scalability challenges from both system and algorithm perspectives. Based on the insights, we propose iQAN, a parallel search algorithm with a set of optimizations that boost convergence, avoid redundant computations, and mitigate synchronization overhead. Our evaluation results on a wide range of real-world datasets show that iQAN achieves up to 37.7× and 76.6× lower latency than state-of-the-art sequential baselines on datasets ranging from a million to a hundred million datasets. We also show that iQAN achieves outstanding scalability as the graph size or the accuracy target increases, allowing it to outperform the state-of-the-art baseline on two billion-scale datasets by up to 16.0× with up to 64 cores.

WISE: Predicting the Performance of Sparse Matrix Vector Multiplication with Machine Learning

Sparse Matrix-Vector Multiplication (SpMV) is an essential sparse kernel. Numerous methods have been developed to accelerate SpMV. However, no single method consistently gives the highest performance across a wide range of matrices. For this reason, a performance prediction model is needed to predict the best SpMV method for a given sparse matrix. Unfortunately, predicting SpMV's performance is challenging due to the diversity of factors that impact it.

In this work, we develop a machine learning framework called WISE that accurately predicts the magnitude of the speedups of different SpMV methods over a baseline method for a given sparse matrix. WISE relies on a novel feature set that summarizes a matrix's size, skew, and locality traits. WISE can then select the best SpMV method for each specific matrix. With a set of nearly 1,500 matrices, we show that using WISE delivers an average speedup of 2.4× over using Intel's MKL in a 24-core server.

Efficient Direct Convolution Using Long SIMD Instructions

This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.

SESSION: Attention

TGOpt: Redundancy-Aware Optimizations for Temporal Graph Attention Networks

Temporal Graph Neural Networks are gaining popularity in modeling interactions on dynamic graphs. Among them, Temporal Graph Attention Networks (TGAT) have gained adoption in predictive tasks, such as link prediction, in a range of application domains. Most optimizations and frameworks for Graph Neural Networks (GNNs) focus on GNN models that operate on static graphs. While a few of these optimizations exploit redundant computations on static graphs, they are either not applicable to the self-attention mechanism used in TGATs or do not exploit optimization opportunities that are tied to temporal execution behavior.

In this paper, we explore redundancy-aware optimization opportunities that specifically arise from computations that involve temporal components in TGAT inference. We observe considerable redundancies in temporal node embedding computations, such as recomputing previously computed neighbor embeddings and time-encoding of repeated time delta values. To exploit these redundancy opportunities, we developed TGOpt which introduces optimization techniques based on deduplication, memoization, and precomputation to accelerate the inference performance of TGAT. Our experimental results show that TGOpt achieves a geomean speedup of 4.9× on CPU and 2.9× on GPU when performing inference on a wide variety of dynamic graphs, with up to 6.3× speedup for the Reddit Posts dataset on CPU.

Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism

Transformers are becoming the mainstream solutions for various tasks like NLP and Computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is leveraging the sparsity in the attention weight matrix. However, due to the dilemma between "dynamic" and "fine-grained", previous studies fail to achieve speedup on GPUs under moderate sequence lengths. They also require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to N:M fine-grained structured sparse pattern. Our key insight is that on the dynamic side, N:M sparsity is friendly to pruning and encoding the sparse matrix on GPU. On the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication kernel, first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedup in arbitrary sequence lengths. It only takes a few fine-tuning epochs to reach on-par accuracy with full attention mechanism. We provide both theoretical and empirical evidence to demonstrate DFSS is a good approximation of the full attention mechanism. We evaluate the 1:2 and 2:4 sparsity under different settings and achieve 1.38 ~ 1.86× speedups over the full-attention on A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, its accuracy is on par with the full attention after only a couple of finetuning epochs from the dense pre-trained model.

SESSION: Training

Elastic Averaging for Efficient Pipelined DNN Training

Nowadays, the size of DNN models has grown rapidly. To train a large model, pipeline parallelism-based frameworks partition the model across GPUs and slice each batch of data into multiple micro-batches. However, pipeline parallelism suffers from a bubble issue and low peak utilization of GPUs. Recent work tries to address the two issues, but fails to exploit the benefit of vanilla pipeline parallelism, i.e., overlapping communication with computation. In this work, we employ an elastic averaging-based framework which explores elastic averaging to add multiple parallel pipelines. To help the framework exploit the advantage of pipeline parallelism while reducing the memory footprints, we propose a schedule, advance forward propagation. Moreover, since the numbers of parallel pipelines and micro-batches are essential to the framework performance, we propose a profiling-based tuning method to automatically determine the settings. We integrate those techniques into a prototype system, namely AvgPipe, based on PyTorch. Our experiments show that Avg-Pipe achieves a 1.7x speedups over state-of-the-art solutions of pipeline parallelism on average.

DSP: Efficient GNN Training with Multiple GPUs

Jointly utilizing multiple GPUs to train graph neural networks (GNNs) is crucial for handling large graphs and achieving high efficiency. However, we find that existing systems suffer from high communication costs and low GPU utilization due to improper data layout and training procedures. Thus, we propose a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout to utilize the fast NVLink connections among the GPUs, which stores the graph topology and popular node features in GPU memory. For efficient graph sampling with multiple GPUs, we introduce a collective sampling primitive (CSP), which pushes the sampling tasks to data to reduce communication. We also design a producer-consumer-based pipeline, which allows tasks from different mini-batches to run congruently to improve GPU utilization. We compare DSP with state-of-the-art GNN training frameworks, and the results show that DSP consistently outperforms the baselines under different datasets, GNN models and GPU counts. The speedup of DSP can be up to 26x and is over 2x in most cases.

PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs

Dynamic Graph Neural Networks (DGNNs) have been widely applied in various real-life applications, such as link prediction and pandemic forecast, to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and -independent components, DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle these challenges, we propose PiPAD, a Pipelined and PArallel DGNN training framework for the end-to-end performance optimization on GPUs. From both algorithm and runtime level, PiPAD holistically reconstructs the overall training paradigm from the data organization to computation manner. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates unnecessary data transmission and alleviates memory access inefficiency to improve the overall performance. Our evaluation across various datasets shows PiPAD achieves 1.22 × --9.57× speedup over the state-of-the-art DGNN frameworks on three representative models.


AArch64 Atomics: Might They Be Harming Your Performance?

Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs.

In this work we explore the performance of the CAS and LL-SC approaches to implement CAS operations on recent high-performance AArch64 CPUs, namely the A64FX, ThunderX2 (TX2), and Graviton3. We observe that these instructions can lead to fundamentally different performance profiles. On A64FX, for example, the newer CAS instructions---often preferred by compilers over the older LL-SC pairs---can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC pairs show the expected linear increase. For high thread counts, this translates into LL-SC being more than 20x faster than CAS. On TX2 and Graviton3, LL-SC can bring more conservative (but still significant) 2--3x speedups. We characterise the conditions under which each approach delivers better performance on each CPU.

Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems

All-reduce is the crucial communication primitive to reduce model parameters in distributed Deep Neural Networks (DNN) training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements for distributed training of large DNNs due to the low data bandwidth of the electrical interconnect systems. One of the promising alternatives for electrical interconnect is optical interconnect, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing all-reduce operation in optical interconnect systems. WRHT can take advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. Simulations using real DNN models show that, compared to all-reduce algorithms in the electrical and optical network systems, our approach reduces communication time by 75.76% and 91.86%, respectively.

Fast Parallel Exact Inference on Bayesian Networks

Bayesian networks (BNs) are attractive, because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve the efficiency, we propose a fast BN exact inference solution named Fast-BNI on multi-core CPUs. Fast-BNI enhances the efficiency of exact inference through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. We also propose techniques to further simplify the bottleneck operations of BN exact inference. Fast-BNI source code is freely available at https://github.com/jjiantong/FastBN.

Generating Fast FFT Kernels on CPUs via FFT-Specific Intrinsics

This paper proposes an algorithm-specific instruction (ASI)-based fast Fourier transform (FFT) code generation framework, named FFTASI, to generate unified architecture independent butterfly kernels that can be transformed into architecture-dependent kernels by establishing the mapping between ASIs and architecture-specific instructions for various hardware platforms. FFTASI strikes a good balance between performance and productivity on CPUs.

Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.

High-Throughput GPU Random Walk with Fine-Tuned Concurrent Query Processing

Random walk serves as a powerful tool in dealing with large-scale graphs, reducing data size while preserving structural information. Unfortunately, existing system frameworks all focus on the execution of a single walker task in serial. We propose CoWalker, a high-throughput GPU random walk framework tailored for concurrent random walk tasks. It introduces a multi-level concurrent execution model to allow concurrent random walk tasks to efficiently share GPU resources with low overhead. Our system prototype confirms that the proposed system could outperform (up to 54%) the state-of-the-art in a wide spectral of scenarios.

The ERA Theorem for Safe Memory Reclamation

Safe memory reclamation (SMR) schemes for concurrent data structures offer trade-offs between three desirable properties: ease of integration, robustness, and applicability. In this paper we define SMR and these three properties, and we present the ERA theorem, asserting that any SMR scheme can only provide at most two of the three properties.

Unexpected Scaling in Path Copying Trees

Although a wide variety of handcrafted concurrent data structures have been proposed, there is considerable interest in universal approaches (Universal Constructions or UCs) for building concurrent data structures. UCs (semi-)automatically convert a sequential data structure into a concurrent one. The simplest approach uses locks [3, 6] that protect a sequential data structure and allow only one process to access it at a time. However, the resulting data structure is blocking. Most work on UCs instead focuses on obtaining non-blocking progress guarantees such as obstruction-freedom, lock-freedom or wait-freedom. Many non-blocking UCs have appeared. Key examples include the seminal wait-free UC [2] by Herlihy, a NUMA-aware UC [10] by Yi et al., and an efficient UC for large objects [1] by Fatourou et al.

Transactional Composition of Nonblocking Data Structures

We introduce nonblocking transaction composition (NBTC), a new methodology for atomic composition of nonblocking operations on concurrent data structures. Unlike previous software transactional memory (STM) approaches, NBTC leverages the linearizability of existing nonblocking structures, reducing the number of memory accesses that must be executed together, atomically, to only one per operation in most cases (these are typically the linearizing instructions of the constituent operations).

Our obstruction-free implementation of NBTC, which we call Medley, makes it easy to transform most nonblocking data structures into transactional counterparts while preserving their nonblocking liveness and high concurrency. In our experiments, Medley outperforms Lock-Free Transactional Transform (LFTT), the fastest prior competing methodology, by 40--170%. The marginal overhead of Medley's transactional composition, relative to separate operations performed in succession, is roughly 2.2×.

For persistent memory, we observe that failure atomicity for transactions can be achieved "almost for free" with epoch-based periodic persistence. Toward that end, we integrate Medley with nbMontage, a general system for periodically persistent data structures. The resulting txMontage provides ACID transactions and achieves throughput up to two orders of magnitude higher than that of the OneFile persistent STM system.

CuPBoP: A Framework to Make CUDA Portable

CUDA, as one of the most popular choices for GPU programming, can be executed only on NVIDIA GPUs. To execute CUDA on non-NVIDIA devices, researchers have proposed to translate CUDA to other programming languages. However, this approach cannot achieve high coverage due to the challenges in source-to-source translation.

We propose a framework, CuPBoP, that executes CUDA programs on non-NVIDIA devices without relying on other programming languages. CuPBoP consists of two parts. The compilation part applies transformations on CUDA host/kernel IRs. The runtime part consists of the runtime libraries for CUDA built-in functions. For the CPU backends, compared with the existing frameworks, CuPBoP achieves the highest coverage on all CPUs that we evaluate (x86, aarch64, RISC-V).

We make CuPBoP publicly available to inspire more works in this area 1.

Swift: Expedited Failure Recovery for Large-Scale DNN Training

As the size of deep learning models gets larger and larger, training takes longer time and more resources, making fault tolerance critical. Existing state-of-the-art methods like Check-Freq and Elastic Horovod need to back up a copy of the model state in memory, which is costly for large models and leads to non-trivial overhead. This paper presents Swift, a novel failure recovery design for distributed deep neural network training that significantly reduces the failure recovery overhead without affecting training throughput and model accuracy. Instead of making an additional copy of the model state, Swift resolves the inconsistencies of the model state caused by the failure and exploits replicas of the model state in data parallelism for failure recovery. We propose a logging-based approach when replicas are unavailable, which records intermediate data and replays the computation to recover the lost state upon a failure. Evaluations show that Swift significantly reduces the failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods without degrading final model accuracy.

Learning to Parallelize in a Shared-Memory Environment with Transformers

In past years, the world has switched to multi and many core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes, such as OpenMP, to applications. Nevertheless, introducing OpenMP work-sharing loop construct into code, especially legacy code, is challenging due to pervasive pitfalls in management of parallel shared memory. To facilitate the performance of this task, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in machine learning techniques, specifically in natural language processing (NLP), to suggest the need for an OpenMP work-sharing loop directive and data-sharing attributes clauses --- the building blocks of concurrent programming. We train several transformer models, named PragFormer, for these tasks and show that they outperform statistically-trained baselines and automatic source-to-source (S2S) parallelization compilers in both classifying the overall need for an parallel for directive and the introduction of private and reduction clauses. In the future, our corpus can be used for additional tasks, up to generating entire OpenMP directives. The source code and database for our project can be accessed on GitHub 1 and HuggingFace 2.