Many real-time OLAP systems have been proposed to query evolving data with sub-second latency. Although this feature is highly attractive, it is very hard to be achieved on analytic graph queries that can only be answered after accessing every connected vertex. Fortunately, researchers recently observed that answering pairwise queries is enough for many real-world scenarios. These pairwise queries avoid the exhaustive nature and hence may only need to access a small portion of the graph. Obviously, the crux of achieving low latency is to what extent the system can eliminate unnecessary computations. This pruning process, according to our investigation, is usually achieved by estimating certain upper bounds of the query result in existing systems.
However, our evaluation results demonstrate that these existing upper-bound-only pruning techniques can only prune about half of the vertex activations, which is still far away from achieving the sub-second latency goal on large graphs. In contrast, we found that it is possible to substantially accelerate the processing if we are able to not only estimate the upper bounds, but also foresee a tighter lower bound for certain pairs of vertices in the graph. Our experiments show that only less than 1% of the vertices are activated via using this novel lower bound based pruning technique. Based on this observation, we build SGraph, a system that is able to answer dynamic pairwise queries over evolving graphs with sub-second latency. It can ingest millions of updates per second and simultaneously answer pairwise queries with a latency that is several orders of magnitude smaller than state-of-the-art systems.
Research into processor-based side-channels has seen both a large number and a large variety of disclosed vulnerabilities that can leak critical, private data to malicious attackers. While most previous works require speculative execution and the use of cache primitives to transmit data, our new approach, called AfterImage, requires neither, capitalizing on vulnerabilities in Intel’s IP-stride prefetcher to both expose and transmit victim data. By training this prefetcher with attacker-known values, and watching for changes to the prefetcher state when execution returns to the attacker, it is now possible to monitor and leak critical data from a large number of common userspace applications and kernel routines without speculation and additional cache accesses. To demonstrate the novel capabilities of AfterImage, we (1) present proof-of-concept attacks that leak data across different isolation levels, (2) present an end-to-end attack that leaks an entire RSA key from a modern, timing-balanced algorithm, and also (3) show how AfterImage can significantly improve the effectiveness of other attacks, such as power side-channel attacks, by using this technique as a high-precision marker.
In addition to an extensive evaluation of these and other cache-based attacks, we also present a full reverse-engineering of the Intel IP-stride prefetcher which was required to enable AfterImage, and describe how AfterImage can be used as a covert channel. Finally, we present several mitigation techniques that can be used to block this side-channel on machines today. Taken together, this work explores a full set of techniques to utilize the prefetcher to leak previously protected information between different protection domains (SGX, kernel and other user spaces) and across many important applications, including security and non-security-related workloads.
Key-value stream aggregation is a common operation in distributed systems, which requires intensive computation and network resources. We propose a generic in-network aggregation service for key-value streams, ASK, to accelerate the aggregation operations in diverse distributed applications. ASK is a switch-host co-designed system, where the programmable switch provides a best-effort aggregation service, and the host runs a daemon to interact with applications. ASK makes in-depth optimization tailored to traffic characteristics, hardware restrictions, and network unreliable natures: it vectorizes multiple key-value tuples’ aggregation of one packet in one switch pipeline pass, which improves the per-host’s goodput; it develops a lightweight reliability mechanism for key-value stream’s asynchronous aggregation, which guarantees computation correctness; it designs a hot-key agnostic prioritization for key-skewed workloads, which improves the switch memory utilization. We prototype ASK and use it to support Spark and BytePS. The evaluation shows that ASK could accelerate pure key-value aggregation tasks by up to 155 times and big data jobs by 3-5 times, and be backward compatible with existing INA-empowered distributed training solutions with the same speedup.
To better facilitate application performance programming we propose a software optimization strategy enabled by a novel low-latency Prediction System Service (PSS). Rather than relying on nuanced domain-specific knowledge or slapdash heuristics, a system service for prediction encourages programmers to spend their time uncovering new levers for optimization rather than worrying about the details of their control. The core idea is to write optimizations that improve performance in specific cases, or under specific tunings, and leave the decision of how and when exactly to apply those optimizations to the system to learn through feedback-directed learning. Such a prediction service can be implemented in any number of ways, including as a shared library that can be easily reused by software written in different programming languages, and opens the door to both new software optimization patterns and hardware design possibilities.
As a demonstration of the utility of this approach, we show that three very different application-targeted optimization scenarios can each benefit from even a very straightforward perceptron-based implementation of the PSS as long as the service latency can be held low. First, we show that PSS can be used to more intelligently guide hardware lock elision with resulting speedups over a baseline implementation by 34% on average. Second, we show that a PSS can find good configuration parameters for PyPy’s Just-In-Time (JIT) compiler resulting in 15% speedup on average. Last, we show PSS can guide the page reclamation task within a kernel memory management subsystem to reduce the average memory latency by 33% on average. In all three cases, this new optimization pattern with service support is able to meet or beat the best-known hand-crafted methods with a fraction of the complexity.
CPUs with weak memory-consistency models (WMMs), such as Arm and RISC-V, are rapidly increasing their market share. Porting legacy x86 applications to such CPUs requires introducing extra synchronization to prevent WMM-related concurrency bugs---a task often left to human experts.
Given the rarity of such experts and the enormous size of legacy applications, we develop AtoMig, an effective, fully automated tool for porting large, real-world applications to WMM CPU architectures. AtoMig detects shared memory access patterns with novel static analysis strategies and performs program transformations to properly protect them from WMM effects. In the absence of sufficiently scalable verification methods, AtoMig shows practicality of focusing on code patterns more prone to WMM faults, trading off completeness for scalability.
We validate the correctness of AtoMig's transformations on several small concurrent benchmarks via model checking. We demonstrate the scalability and performance of our approach by applying AtoMig to popular real-world large code bases with up to millions of lines of code, viz., MariaDB, Postgres, SQlite, LevelDB, and Memcached. As part of this work, we also found a WMM bug in MariaDB, which AtoMig fixes automatically.
Function-as-a-service (FaaS), an emerging cloud computing paradigm, is expected to provide strong elasticity due to its promise to auto-scale fine-grained functions rapidly. Although appealing for applications with good parallelism and dynamic workload, this paper shows that it is non-trivial to adapt existing monolithic applications (like web services) to FaaS due to their complexity. To bridge the gap between complicated web services and FaaS, this paper proposes a runtime-based Semi-FaaS execution model, which dynamically extracts time-consuming code snippets (closures) from applications and offloads them to FaaS platforms for execution. It further proposes BeeHive, an offloading framework for Semi-FaaS, which relies on the managed runtime to provide a fallback-based execution model and addresses the performance issues in traditional offloading mechanisms for FaaS. Meanwhile, the runtime system of BeeHive selects offloading candidates in a user-transparent way and supports efficient object sharing, memory management, and failure recovery in a distributed environment. The evaluation using various web applications suggests that the Semi-FaaS execution supported by BeeHive can reach sub-second resource provisioning on commercialized FaaS platforms like AWS Lambda, which is up to two orders of magnitude better than other alternative scaling approaches in cloud computing.
The overheads of classical decoding for quantum error correction in cryogenic quantum systems grow rapidly with the number of logical qubits and their correction code distance. Decoding at room temperature is bottlenecked by refrigerator I/O bandwidth while cryogenic on-chip decoding is limited by area/power/thermal budget.
To overcome these overheads, we are motivated by the observation that in the common case (over 90% of the time), error correction 'syndromes' are fairly trivial with high redundancy / sparsity, since the error correction codes are over-provisioned to be able to correct for uncommon worst-case complex scenarios (to ensure substantially low logical error rates). If suitably exploited, these trivial scenarios can be handled with insignificant overhead, thereby alleviating any bottlenecks towards handling the worst-case scenarios by state-of-the-art means.
We propose Better Than Worst-Case Decoding for Quantum Error Correction, targeting cryogenic quantum systems and Surface Code, consisting of:
On-chip Clique Decoder: An extremely lightweight decoder for correcting trivial common-case errors, designed for the cryogenic domain. The decoder is implemented and evaluated for SFQ logic.
Statistical Off-chip Bandwidth Allocation: A statistical confidence-based technique for allocation of off-chip decoding bandwidth, to efficiently handle the rare complex decodes that are not covered by the Clique Decoder.
Decode-Overflow Execution Stalling: A method to stall circuit execution, for the worst-case scenarios in which the provisioned off-chip bandwidth is insufficient to complete all requested off-chip decodes.
In all, BTWC decoding achieves 70-99+% off-chip bandwidth elimination across a range of logical and physical error rates, without significantly sacrificing the accuracy of a state-of-the-art off-chip decoder. Further, it achieves 10-1000x bandwidth reduction over prior bandwidth reduction techniques, as well as 15-37x resource overhead reduction compared to prior on-chip decoding.
The Graph Neural Network (GNN) is showing outstanding results in improving the performance of graph-based applications. Recent studies demonstrate that GNN performance can be boosted via using more advanced aggregators, deeper aggregation depth, larger sampling rate, etc. While leading to promising results, the improvements come at a cost of significantly increased memory footprint, easily exceeding GPU memory capacity. In this paper, we introduce a method, Betty, to make GNN training more scalable and accessible via batch-level partitioning. Different from DNN training, a mini-batch in GNN has complex dependencies between input features and output labels, making batch-level partitioning difficult. Betty introduces two noveltechniques, redundancy-embedded graph (REG) partitioning and memory-aware partitioning, to effectively mitigate the redundancy and load imbalances issues across the partitions. Our evaluation of large-scale real-world datasets shows that Betty can significantly mitigate the memory bottleneck, enabling scalable GNN training with much deeper aggregation depths, larger sampling rate, larger training batch sizes, together with more advanced aggregators, with a few as a single GPU.
Technology companies reduce their datacenters’ carbon footprint by investing in renewable energy generation and receiving credits from power purchase agreements. Annually, datacenters offset their energy consumption with generation credits (Net Zero). But hourly, datacenters often consume carbon-intensive energy from the grid when carbon-free energy is scarce. Relying on intermittent renewable energy in every hour (24/7) requires a mix of renewable energy from complementary sources, energy storage, and workload scheduling. In this paper, we present the Carbon Explorer framework to analyze the solution space. We use Carbon Explorer to balance trade-offs between operational and embodied carbon, optimizing the mix of solutions for 24/7 carbon-free datacenter operation based on geographic location and workload. Carbon Explorer has been open-sourced at https://github.com/facebookresearch/CarbonExplorer.
We consider the problem of graph analytics on evolving graphs (i.e., graphs that change over time). In this scenario, a query typically needs to be applied to different snapshots of the graph over an extended time window, for example to track the evolution of a property over time. Solving a query independently on multiple snapshots is inefficient due to repeated execution of subcomputation common to multiple snapshots. At the same time, we show that using streaming, where we start from the earliest snapshot and stream the changes to the graph incrementally updating the query results one snapshot at a time is also inefficient. We propose CommonGraph, an approach for efficient processing of queries on evolving graphs. We first observe that deletion operations are significantly more expensive than addition operations for many graph queries (those that are monotonic). CommonGraph converts all deletions to additions by finding a common graph that exists across all snapshots. After computing the query on this graph, to reach any snapshot, we simply need to add the missing edges and incrementally update the query results. CommonGraph also allows sharing of common additions among snapshots that require them, and breaks the sequential dependency inherent in the traditional streaming approach where snapshots are processed in sequence, enabling additional opportunities for parallelism. We incorporate the CommonGraph approach by extending the KickStarter streaming framework. We implement optimizations that enable efficient handling of edge additions without resorting to expensive in place graph mutations, significantly reducing the streaming overhead, and enabling direct reuse of shared edges among different snapshots. CommonGraph achieves 1.38x-8.17x improvement in performance over Kickstarter across multiple benchmarks.
Compilation Consistency Modulo Debug Information (CCMD) is an essential compiler property that a production compiler should support: the compiler should emit the same machine code regardless of enabling debug information. CCMD is vital to developers’ experiences with debugging a production binary containing no debug information. To debug such a binary, developers need build another binary with the same compiler flags and enable debug information. Without CCMD, the machine code in the latter binary will be different, which can confuse the debugger, hide the bug, or even cause a miscompilation (as GCC once did with the Linux Kernel).
This paper is the first to introduce to the research community the validation of CCMD, a new research problem that has been overlooked for decades despite its importance. More importantly, we propose the first testing technique Dfusor to automatically validate CCMD for C compilers. At the high level, given a compilable program P as a seed, Dfusor automatically generates compilable program variants via multiple effective program transformations. Such variants can cause a compiler to emit more debug information than it would when compiling P, thus exercising more code paths in the compiler and increasing the chance to find CCMD bugs.
Our extensive evaluations of Dfusor demonstrate that Dfusor can produce variants that exhibit significant increases in the quantity and complexity of the emitted debug information, and thus has found new, real bugs in GCC and LLVM. With a sample of 100 variants derived from distinct seed programs, Dfusor introduces 214% more debug information entries and 36% more distinct debug information entries in the variants than the seeds, and improves the code coverage of GCC and Clang by up to 6.00% and 6.82%. More importantly, Dfusor has found CCMD bugs; within 10 months of development and intermittent testing, Dfusor has found 23 bugs (9 in GCC and 14 in Clang), with 3 confirmed and 18 fixed.
Distributed systems are difficult to design and implement correctly. In response, both research and industry are exploring applications of formal methods to distributed systems. A key challenge in this domain is the missing link between the formal design of a system and its implementation. Today, practitioners bridge this link through manual effort.
We present a language called Modular PlusCal that extends PlusCal by cleanly separating the model of a system from a model of its environment. We then present a compiler tool-chain called PGo that automatically translates MPCal models to TLA+ for model checking, and that also compiles MPCal models to runnable Go code. PGo provides system designers with a new ability to model and check their designs, and then re-use their modeling efforts to mechanically extract runnable implementations of their designs.
Our evaluation shows that the PGo approach works for complex models: we model check, compile, and evaluate the performance of MPCal systems based on Raft, CRDTs, and primary-backup. Compared to previous work, PGo requires less time to develop a checked model and derive a fully working implementation. With PGo we created a formally checked Raft model and its corresponding implementation in under 1 person-month, which is 3× less time than Ivy. Our evaluation shows that a PGo-based Raft KV store with three nodes has 41% higher throughput than a Raft KV store based on Ivy, the highest performing verified Raft-based KV store from related work. A PGo-based CRDT set has a latency within 2 × of a CRDT set implementation from SoundCloud called Roshi.
Operating systems utilize Copy-on-Write (COW) to conserve memory and improve performance. During the last two decades, a series of COW-related bugs - which compromised security, corrupted memory and degraded performance - was found. The majority of these bugs are related to page "pinning", which operating systems employ to access process memory efficiently and to perform direct I/O. Unfortunately, the true cause of these bugs is not well understood, resulting in incomplete bug fixes. We show this by: (1) surveying previously reported pinning-related COW bugs; (2) uncovering new such bugs in Linux, FreeBSD, and NetBSD; and (3) showing that they occur because the COW logic does not consider page pinnings correctly, resulting in incorrect behavior (e.g., I/O of stale data). We then address the underlying problem by deriving when/how shared pages must be copied and under which conditions pinned pages can be shared to maintain correctness. Based on this assessment, we introduce the "Copy-on-Pin (COP)" scheme, an extension of the COW mechanism that handles pinned pages correctly by ensuring pinned pages and shared pages are mutually exclusive. However, we find that a naive implementation of this scheme hampers performance and increases complexity if pages are copied only when strictly necessary. To compensate, we introduce a relaxed-COP design, which does not require precise tracking of page sharing, maintains correctness without increasing complexity, and (while potentially needlessly copying pages in some corner cases) marginally improves performance. Our relaxed-COP solution has been integrated into Linux 5.19.
Modern code reuse attacks take full advantage of bloated software. Attackers piece together short sequences of instructions in otherwise benign code to carry out malicious actions. Mitigating these reusable code snippets, known as gadgets, has become one of the prime focuses of attack surface reduction research. While some debloating techniques remove parts of software that contain such gadgets, other methods focus on making them unusable by breaking up chains of them, thereby substantially diminishing the possibility of code reuse attacks. Third-party libraries are another main focus, because they exhibit a high number of vulnerabilities, but recently, techniques have emerged that deal with whole applications. Attack surface reduction efforts have typically tried to eliminate such attacks by subsetting (debloating) the application, e.g. via user-specified inputs, configurations, or features to achieve high gadget reductions. However, such techniques suffer from the limitations of soundness, i.e. the software might crash during no-attack executions on regular inputs, or they may be conservative and leave a large amount of attack surface untackled.
In this work we present a general, whole-program attack surface reduction technique called Decker that significantly reduces gadgets which are accessible to an attacker during an execution phase (called a deck) and has minor performance degradation. Decker requires no user inputs and leaves all features intact. It uses static analysis to determine key function sets that should be enabled/disabled at runtime. The runtime system then enables these function sets at the specified program points during execution. On SPEC CPU 2017, our framework achieves 73.2% total gadget reduction with 5.2% average slowdown. On 10 GNU coreutils applications, it achieves 87.2% reduction and negligible slowdown. On the nginx server it achieves 80.3% reduction with 2% slowdown. We also provide a gadget chain-breaking case study, including detailed JOP gadget metrics on both Linux and Windows, and show that our framework breaks the shell-spawning chain in all cases.
Deep neural networks (DNNs) are continuing to get wider and deeper. As a result, it requires a tremendous amount of GPU memory and computing power. In this paper, we propose a framework called DeepUM that exploits CUDA Unified Memory (UM) to allow GPU memory oversubscription for DNNs. While UM allows memory oversubscription using a page fault mechanism, page migration introduces enormous overhead. DeepUM uses a new correlation prefetching technique to hide the page migration overhead. It is fully automatic and transparent to users. We also propose two optimization techniques to minimize the GPU fault handling time. We evaluate the performance of DeepUM using nine large-scale DNNs from MLPerf, PyTorch examples, and Hugging Face and compare its performance with six state-of-the-art GPU memory swapping approaches. The evaluation result indicates that DeepUM is very effective for GPU memory oversubscription and can handle larger models that other approaches fail to handle.
The lack of representative, publicly-available cloud services has been a recurring problem in the architecture and systems communities. While open-source benchmarks exist, they do not capture the full complexity of cloud services. Application cloning is a promising way to address this, however, prior work is limited to CPU-/cache-centric, single-node services, operating at user level.
We present Ditto, an automated framework for cloning end-to-end cloud applications, both monolithic and microservices, which captures I/O and network activity, as well as kernel operations, in addition to application logic. Ditto takes a hierarchical approach to application cloning, starting with capturing the dependency graph across distributed services, to recreating each tier's control/data flow, and finally generating system calls and assembly that mimics the individual applications. Ditto does not reveal the logic of the original application, facilitating publicly sharing clones of production services with hardware vendors, cloud providers, and the research community.
We show that across a diverse set of single- and multi-tier applications, Ditto accurately captures their CPU and memory characteristics as well as their high-level performance metrics, is portable across platforms, and facilitates a wide range of system studies.
By eliminating compute operations intelligently based on the run time input, dynamic pruning (DP) promises to improve deep neural network inference speed substantially without incurring a major impact on their accuracy. Although many DP algorithms with good pruning performance have been proposed, it remains a challenge to translate these theoretical reductions in compute operations into satisfactory end-to-end speedups in practical real-world implementations. The overhead of identifying operations to be pruned during run time, the need to efficiently process the resulting dynamic dataflow, and the non-trivial memory I/O bottleneck that emerges as the number of compute operations reduces, have all contributed to the challenge of implementing practical DP systems.
In this paper, the design and implementation of DPACS are presented to address these challenges. DPACS utilizes a hardware-aware dynamic spatial and channel pruning algorithm in conjunction with a dynamic dataflow engine in hardware to facilitate efficient processing of the pruned network. A channel mask precomputation scheme is designed to reduce memory I/O, and a dedicated inter-layer pipeline is used to achieve efficient indexing and dataflow of sparse activation. Extensive design space exploration has been performed using two architectural variations implemented on FPGA to accelerate multiple networks from the ResNet family on the ImageNet and CIFAR10 dataset across a wide range of pruning ratios. Across the spectrum of configurations, DPACS is able to achieve 1.1× to 3.9× end-to-end speedup over a baseline hardware implementation without pruning. Analysis of the tradeoff among accuracy, compute, and memory I/O performance highlights the importance of algorithm-architecture codesign in developing DP systems.
Cloud platforms' rapid growth is raising significant concerns about their carbon emissions. To reduce carbon emissions, future cloud platforms will need to increase their reliance on renewable energy sources, such as solar and wind, which have zero emissions but are highly unreliable. Unfortunately, today's energy systems effectively mask this unreliability in hardware, which prevents applications from optimizing their carbon-efficiency, or work done per kilogram of carbon emitted. To address the problem, we design an "ecovisor", which virtualizes the energy system and exposes software-defined control of it to applications. An ecovisor enables each application to handle clean energy's unreliability in software based on its own specific requirements. We implement a small-scale ecovisor prototype that virtualizes a physical energy system to enable software-based application-level i) visibility into variable grid carbon-intensity and local renewable generation and ii) control of server power usage and battery charging and discharging. We evaluate the ecovisor approach by showing how multiple applications can concurrently exercise their virtual energy system in different ways to better optimize carbon-efficiency based on their specific requirements compared to general system-wide policies.
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify only the deep neural network (DNN) model and hyperparameters for a job, but not the number of GPUs; (ii) users specify the deadline for a job, but not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow provides performance guarantees in terms of meeting deadlines while alleviating tedious, low-level, and manual resource management for deep learning developers. The characteristics of distributed training introduce two challenges. First, the training throughput scales non-linearly with the number of GPUs. Second, the scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share to capture the resource usage of training jobs to meet deadlines, and ElasticFlow performs admission control based on it. We develop a greedy algorithm that dynamically allocates resources to admitted jobs based on diminishing returns. We apply buddy allocation to worker placement to eliminate the effect of topology. Evaluation results on a cluster of 128 GPUs show that ElasticFlow increases the number of jobs that can meet their deadlines by 1.46–7.65× compared to existing solutions.
Modern recommendation systems, primarily driven by deep-learning models, depend on fast model inferences to be useful. To tackle the sparsity in the input space, particularly for categorical variables, such inferences are made by storing increasingly large embedding vector (EV) tables in memory. A core challenge is that the inference operation has an all-or-nothing property: each inference requires multiple EV table lookups, but if any memory access is slow, the whole inference request is slow. In our paper, we design, implement and evaluate EVStore, a 3-layer EV table lookup system that harnesses both structural regularity in inference operations and domain-specific approximations to provide optimized caching, yielding up to 23% and 27% reduction on the average and p90 latency while quadrupling throughput at 0.2% loss in accuracy. Finally, we show that at a minor cost of accuracy, EVStore can reduce the Deep Recommendation System (DRS) memory usage by up to 94%, yielding potentially enormous savings for these costly, pervasive systems.
Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher number of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprints, leading to severe memory-boundedness and limited scalability of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. This dataflow processes costly attention operations through a unique fusion mechanism, transforming the memory footprint quadratic growth to merely a linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach to enhance the data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck as well as reduces the on-chip memory requirement. FLAT delivers 1.94x (1.76x) speedup and 49% and (42%) of energy savings compared to the state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, 1.5x end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512-token to 64K-token. Our evaluations demonstrate that state-of-the-art DNN dataflow applied to attention operations reach the efficiency limit for inputs above 512 elements. In contrast, FLAT unblocks transformer models for inputs with up to 64K elements.
Quantum Approximate Optimization Algorithm (QAOA) is one of the leading candidates for demonstrating the quantum advantage using near-term quantum computers. Unfortunately, high device error rates limit us from reliably running QAOA circuits for problems with more than a few qubits. In QAOA, the problem graph is translated into a quantum circuit such that every edge corresponds to two 2-qubit CNOT operations in each layer of the circuit. As CNOTs are extremely error-prone, the fidelity of QAOA circuits is dictated by the number of edges in the problem graph.
We observe that the majority of graphs corresponding to real-world applications follow a “power-law” distribution, where some hotspot nodes have significantly higher number of connections. We leverage this insight and propose “FrozenQubits” that freezes the hotspot nodes or qubits and intelligently partitions the state-space of the given problem into several smaller sub-spaces, which are then solved independently. The corresponding QAOA sub-circuits are significantly less vulnerable to gate and decoherence errors due to the reduced number of CNOT operations in each sub-circuit. Unlike prior circuit-cutting approaches, FrozenQubits does not require any exponentially complex postprocessing step. Our evaluations with 5,300 QAOA circuits on eight different quantum computers from IBM show that FrozenQubits can improve the quality of solutions by 8.73x on average (and by up to 57x), albeit while utilizing 2x more quantum resources.
Graphics Processing Units (GPUs) have traditionally relied on the host CPU to initiate access to the data storage. This approach is well-suited for GPU applications with known data access patterns that enable partitioning of their dataset to be processed in a pipelined fashion in the GPU. However, emerging applications such as graph and data analytics, recommender systems, or graph neural networks, require fine-grained, data-dependent access to storage. CPU initiation of storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU-initiated storage removes these overheads from the storage control path and, thus, can potentially support these applications at much higher speed. However, there is a lack of systems architecture and software stack that enable efficient GPU-initiated storage access. This work presents a novel system architecture, BaM, that fills this gap. BaM features a fine-grained software cache to coalesce data storage requests while minimizing I/O traffic amplification. This software cache communicates with the storage system via high-throughput queues that enable the massive number of concurrent threads in modern GPUs to make I/O requests at a high rate to fully utilize the storage devices and the system interconnect. Experimental results show that BaM delivers 1.0x and 1.49x end-to-end speed up for BFS and CC graph analytics benchmarks while reducing hardware costs by up to 21.7x over accessing the graph data from the host memory. Furthermore, BaM speeds up data-analytics workloads by 5.3x over CPU-initiated storage access on the same hardware.
Zero-knowledge proof (ZKP) is a cryptographic protocol that allows one party to prove the correctness of a statement to another party without revealing any information beyond the correctness of the statement itself. It guarantees computation integrity and confidentiality, and is therefore increasingly adopted in industry for a variety of privacy-preserving applications, such as verifiable outsource computing and digital currency.
A significant obstacle in using ZKP for online applications is the performance overhead of its proof generation. We develop GZKP, a GPU accelerated zero-knowledge proof system that supports different levels of security requirements and brings significant speedup toward making ZKP truly usable. For polynomial computation over a large finite field, GZKP promotes a cache-friendly memory access pattern while eliminating the costly external shuffle in existing solutions. For multi-scalar multiplication, GZKP adopts a new parallelization strategy, which aggressively combines integer elliptic curve point operations and exploits fine-grained task parallelism with load balancing for sparse integer distribution. GZKP outperforms the state-of-the-art ZKP systems by an order of magnitude, achieving up to 48.1× and 17.6× speedup with standard cryptographic benchmarks and a real-world application workload, respectively.
Side-channel attacks pose serious threats to many security models, especially sandbox-based browsers. While transient-execution side channels in out-of-order processors have previously been blamed for vulnerabilities such as Spectre and Meltdown, we show that in fact, the capability of out-of-order execution itself to cause mayhem is far more general.
As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators.
Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program-statement-level optimizations). We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that allows developers to focus on scheduling every single operator and automates the fusion after scheduling. It greatly reduces the engineering efforts for operator fusion. Our proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time.
With the proposed paradigm, we implement a deep learning compiler Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms state-of-the-art DNN inference framework, ONNX Runtime, and compiler, TVM equipped with scheduler AutoTVM and Ansor, by up to 1.48x (1.22x on average). It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We open-sourced hidet at https://www.github.com/hidet-org/hidet.
Deep learning models are a valuable “secret sauce” that confers a significant competitive advantage. Many models are never visible to the user and even publicly known state-of-the-art models are either completely proprietary or only accessible via access-controlled APIs. Increasingly, these models run directly on the edge, often using a low-power DNN accelerator. This makes models particularly vulnerable, as an attacker with physical access can exploit side channels like off-chip memory access volumes. Indeed, prior work has shown that this channel can be used to steal dense DNNs from edge devices by correlating data transfer volumes with layer geometry.
Unfortunately, prior techniques become intractable when the model is sparse in either weights or activations because off-chip transfers no longer correspond exactly to layer dimensions. Could it be that the many mobile-class sparse accelerators are inherently safe from this style of attack?
In this paper, we show that it is feasible to steal a pruned DNN model architecture from a mobile-class sparse accelerator using the DRAM access volume channel. We describe HuffDuff, an attack scheme with two novel techniques that leverage (i) the boundary effect present in CONV layers, and (ii) the timing side channel of on-the-fly activation compression. Together, these techniques dramatically reduce the space of possible model architectures up to 94 orders of magnitude, resulting in fewer than 100 candidate models — a number that can be feasibly tested. Finally, we sample network instances from our solution space and show that (i) our solutions reach the victim accuracy under the iso-footprint constraint, and (ii) significantly improve black-box targeted attack success rates.
1.5 billion smartphones are sold annually, and most are decommissioned less than two years later. Most of these unwanted smartphones are neither discarded nor recycled but languish in junk drawers and storage units. This computational stockpile represents a substantial wasted potential: modern smartphones have increasingly high-performance and energy-efficient processors, extensive networking capabilities, and a reliable built-in power supply. This project studies the ability to reuse smartphones as "junkyard computers." Junkyard computers grow global computing capacity by extending device lifetimes, which supplants the manufacture of new devices. We show that the capabilities of even decade-old smartphones are within those demanded by modern cloud microservices and discuss how to combine phones to perform increasingly complex tasks. We describe how current operation-focused metrics do not capture the actual carbon costs of compute. We propose Computational Carbon Intensity---a performance metric that balances the continued service of older devices with the superlinear runtime improvements of newer machines. We use this metric to redefine device service lifetime in terms of carbon efficiency. We develop a cloudlet of reused Pixel 3A phones. We analyze the carbon benefits of deploying large, end-to-end microservice-based applications on these smartphones. Finally, we describe system architectures and associated challenges to scale to cloudlets with hundreds and thousands of smartphones.
This paper proposes Khuzdul, a distributed execution engine with a well-defined abstraction that can be integrated with existing single-machine graph pattern mining (GPM) systems to provide efficiency and scalability at the same time. The key novelty is the extendable embedding abstraction which can express pattern enumeration algorithms, allow fine-grained task scheduling, and enable low-cost GPM-specific data reuse to reduce communication cost. The effective BFS-DFS hybrid exploration generates sufficient concurrent tasks for communication-computation overlapping with bounded memory consumption. Two scalable distributed GPM systems are implemented by porting Automine and GraphPi on Khuzdul. Our evaluation shows that Khuzdul based systems significantly outperform state-of-the-art distributed GPM systems with partitioned graphs by up to 75.5× (on average 19.0×), achieve similar or even better performance compared with the fastest distributed GPM systems with replicated graph, and scale to massive graphs with more than one hundred billion edges with a commodity cluster.
Container isolation is implemented through OS-level virtualization, such as Linux namespaces. Unfortunately, these mechanisms are extremely challenging to implement correctly and, in practice, suffer from functional interference bugs, which compromise container security. In particular, functional interference bugs allow an attacker to extract information from another container running on the same machine or impact its integrity by modifying kernel resources that are incorrectly isolated. Despite their impact, functional interference bugs in OS-level virtualization have received limited attention in part due to the challenges in detecting them. Instead of causing memory errors or crashes, many functional interference bugs involve hard-to-catch logic errors that silently produce semantically incorrect results.
This paper proposes KIT, a dynamic testing framework that discovers functional interference bugs in OS-level virtualization mechanisms, such as Linux namespaces. The key idea of KIT is to detect inter-container functional interference by comparing the system call traces of a container across two executions, where it runs with and without the preceding execution of another container. To achieve high efficiency and accuracy, KIT includes two critical components: an efficient algorithm to generate test cases that exercise inter-container data flows and a system call trace analysis framework that detects functional interference bugs and clusters bug reports. KIT discovered 9 functional interference bugs in Linux kernel 5.13, of which 6 have been confirmed. All bugs are caused by logic errors, showing that this approach is able to detect hard-to-catch semantic bugs.
In modern solid-state drives (SSDs), the indexing of flash pages is a critical component in their storage controllers. It not only affects the data access performance, but also determines the efficiency of the precious in-device DRAM resource. A variety of address mapping schemes and optimizations have been proposed. However, most of them were developed with human-driven heuristics.
In this paper, we present a learning-based flash translation layer (FTL), named LeaFTL, which learns the address mapping to tolerate dynamic data access patterns via linear regression at runtime. By grouping a large set of mapping entries into a learned segment, it significantly reduces the memory footprint of the address mapping table, which further benefits the data caching in SSD controllers. LeaFTL also employs various optimization techniques, including out-of-band metadata verification to tolerate mispredictions, optimized flash allocation, and dynamic compaction of learned index segments. We implement LeaFTL with both a validated SSD simulator and a real open-channel SSD board. Our evaluation with various storage workloads demonstrates that LeaFTL saves the memory consumption of the mapping table by 2.9× and improves the storage performance by 1.4× on average, in comparison with state-of-the-art FTL schemes.
While recent deep learning workload schedulers exhibit excellent performance, it is arduous to deploy them in practice due to some substantial defects, including inflexible intrusive manner, exorbitant integration and maintenance cost, limited scalability, as well as opaque decision processes. Motivated by these issues, we design and implement Lucid, a non-intrusive deep learning workload scheduler based on interpretable models. It consists of three innovative modules. First, a two-dimensional optimized profiler is introduced for efficient job metric collection and timely debugging job feedback. Second, Lucid utilizes an indolent packing strategy to circumvent interference. Third, Lucid orchestrates resources based on estimated job priority values and sharing scores to achieve efficient scheduling. Additionally, Lucid promotes model performance maintenance and system transparent adjustment via a well-designed system optimizer. Our evaluation shows that Lucid reduces the average job completion time by up to 1.3× compared with state-of-the-art preemptive scheduler Tiresias. Furthermore, it provides explicit system interpretations and excellent scalability for practical deployment.
Shared memory platforms provide a memory consistency specification (MCS) so that developers can reason about the behaviors of their parallel programs. Unfortunately, ensuring that a platform conforms to its MCS is difficult, as is exemplified by numerous bugs in well-used platforms. While existing MCS testing approaches find bugs, their efficacy depends on the testing environment (e.g. if synthetic memory pressure is applied). MCS testing environments are difficult to evaluate since legitimate MCS violations are too rare to use as an efficacy metric. As a result, prior approaches have missed critical MCS bugs.
This work proposes a mutation testing approach for evaluating MCS testing environments: MC Mutants. This approach mutates MCS tests such that the mutants simulate bugs that might occur. A testing environment can then be evaluated using a mutation score. We utilize MC Mutants in two novel contributions: (1) a parallel testing environment, and (2) An MCS testing confidence strategy that is parameterized over a time budget and confidence threshold. We implement our contributions in WebGPU, a new web-based GPU programming specification, and evaluate our techniques across four GPUs. We improve testing speed by three orders of magnitude over prior work, empowering us to create a conformance test suite that reproduces many mutated tests with high confidence and requires only 64 seconds per test. We identified two bugs in WebGPU implementations, one of which led to a specification change. Moreover, the official WebGPU conformance test suite has adopted our approach due to its efficiency, effectiveness, and broad applicability.
Fine-tuning on cheap commodity GPU servers makes large-scale deep learning models benefit more people. However, the low inter-GPU communication bandwidth and pressing communication contention on the commodity GPU server obstruct training efficiency.
In this paper, we present Mobius, a communication-efficient system for fine tuning large-scale models on commodity GPU servers. The key idea is a novel pipeline parallelism scheme enabling heterogeneous memory for large-scale model training, while bringing fewer communications than existing systems. Mobius partitions the model into stages and carefully schedules them between GPU memory and DRAM to overlap communication with computation. It formulates pipeline execution into a mixed-integer program problem to find the optimal pipeline partition. It also features a new stage-to-GPU mapping method termed cross mapping, to minimize communication contention.
Experiments on various scale models and GPU topologies show that Mobius significantly reduces the training time by 3.8-5.1× compared with the prior art.
Machine learning models with millions or billions of parameters are increasingly trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, collective communication becomes a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can alleviate this bottleneck and help these applications scale. However, implementing correct and efficient custom algorithms is challenging.
This paper introduces MSCCLang, a system for programmable GPU communication. MSCCLang provides a domain specific language for writing collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly in an interpreter-based runtime. We used MSCCLang to write novel collective algorithms for AllReduce and AllToAll that are up to 1.9× and 1.3× faster than hand-optimized implementations, respectively.
In the Noisy Intermediate Scale Quantum (NISQ) era, the dynamic nature of quantum systems causes noise sources to constantly vary over time. Transient errors from the dynamic NISQ noise landscape are challenging to comprehend and are especially detrimental to classes of applications that are iterative and/or long-running, and therefore their timely mitigation is important for quantum advantage in real-world applications.
The most popular examples of iterative long-running quantum applications are variational quantum algorithms (VQAs). Iteratively, VQA’s classical optimizer evaluates circuit candidates on an objective function and picks the best circuits towards achieving the application’s target. Noise fluctuation can cause a significant transient impact on the objective function estimation of the VQA iterations’ tuning candidates. This can severely affect VQA tuning and, by extension, its accuracy and convergence.
This paper proposes QISMET: Quantum Iteration Skipping to Mitigate Error Transients, to navigate the dynamic noise landscape of VQAs. QISMET actively avoids instances of high fluctuating noise which are predicted to have a significant transient error impact on specific VQA iterations. To achieve this, QISMET estimates transient error in VQA iterations and designs a controller to keep the VQA tuning faithful to the transient-free scenario. By doing so, QISMET efficiently mitigates a large portion of the transient noise impact on VQAs and is able to improve the fidelity by 1.3x-3x over a traditional VQA baseline, with 1.6-2.4x improvement over alternative approaches, across different applications and machines.
Deep-learning (DL) compilers such as TVM and TensorRT are increasingly being used to optimize deep neural network (DNN) models to meet performance, resource utilization and other requirements. Bugs in these compilers can result in models whose semantics differ from the original ones, producing incorrect results that corrupt the correctness of downstream applications. However, finding bugs in these compilers is challenging due to their complexity. In this work, we propose a new fuzz testing approach for finding bugs in deep-learning compilers. Our core approach consists of (i) generating diverse yet valid DNN test models that can exercise a large part of the compiler's transformation logic using light-weight operator specifications; (ii) performing gradient-based search to find model inputs that avoid any floating-point exceptional values during model execution, reducing the chance of missed bugs or false alarms; and (iii) using differential testing to identify bugs. We implemented this approach in NNSmith which has found 72 new bugs for TVM, TensorRT, ONNXRuntime, and PyTorch to date. Of these 58 have been confirmed and 51 have been fixed by their respective project maintainers.
The parallel execution model of GPUs enables scaling to hundreds of thousands of threads, which is a key capability that many modern high-performance applications exploit. GPU vendors are hence increasing the compute and memory resources with every GPU generation — resulting in the need to efficiently stitch together a plethora of Symmetric Multiprocessors (SMs), Last-Level Cache (LLC) slices and memory controllers while maximizing bandwidth and keeping power consumption and design complexity in check. Conventional GPUs are Uniform Bandwidth Architectures (UBAs) as they provide equal bandwidth between all SMs and all LLC slices. UBA GPUs require a uniform high-bandwidth Network-on-Chip (NoC), and our key observation is that provisioning a NoC to match the LLC slice bandwidth incurs a hefty power and complexity overhead. We propose the Non-Uniform Bandwidth Architecture (NUBA), a GPU system architecture aimed at fully utilizing LLC slice bandwidth. A NUBA GPU consists of partitions that each feature a few SMs and LLC slices as well as a memory controller — hence exposing the complete LLC bandwidth to the SMs within a partition since they can be connected with point-to-point links — and a NoC between partitions — to enable access to remote data.Exploiting the potential of NUBA GPUs however requires carefully co-designing system software, the compiler and architectural policies. The critical system software component is our Local-And-Balanced (LAB) page placement policy which enables the GPU driver to place data in local partitions while avoiding load imbalance. Moreover, we propose Model-Driven Replication (MDR) which identifies read-only shared data with data-flow analysis at compile time. At run time, MDR leverages an architectural mechanism that replicates read-only shared data across LLC slices when this can be done without pressuring cache capacity. With LAB and MDR, our NUBA GPU improves average performance by 23.1% and 22.2% (and up to 183.9% and 182.4%) compared to iso-resource memory-side and SM-side UBA GPUs, respectively. When the NUBA concept is leveraged to reduce overhead while maintaining similar performance, NUBA reduces NoC power consumption by 12.1× and 9.4×, respectively.
In training of modern large natural language processing (NLP) models, it has become a common practice to split models using 3D parallelism to multiple GPUs. Such technique, however, suffers from a high overhead of inter-node communication. Compressing the communication is one way to mitigate the overhead by reducing the inter-node traffic volume; however, the existing compression techniques have critical limitations to be applied for NLP models with 3D parallelism in that 1) only the data parallelism traffic is targeted, and 2) the existing compression schemes already harm the model quality too much.
In this paper, we present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression. Optimus-CC differs from existing communication compression frameworks in the following ways: First, we compress pipeline parallel (inter-stage) traffic. In specific, we compress the inter-stage backpropagation and the embedding synchronization in addition to the existing data-parallel traffic compression methods. Second, we propose techniques to avoid the model quality drop that comes from the compression. We further provide mathematical and empirical analyses to show that our techniques can successfully suppress the compression error. Lastly, we analyze the pipeline and opt to selectively compress those traffic lying on the critical path. This further helps reduce the compression error. We demonstrate our solution on a GPU cluster, and achieve superior speedup from the baseline state-of-the-art solutions for distributed training without sacrificing the model quality.
Public cloud providers seek to meet stringent performance requirements and low hardware cost. A key driver of performance and cost is main memory. Memory pooling promises to improve DRAM utilization and thereby reduce costs. However, pooling is challenging under cloud performance requirements. This paper proposes Pond, the first memory pooling system that both meets cloud performance goals and significantly reduces DRAM cost. Pond builds on the Compute Express Link (CXL) standard for load/store access to pool memory and two key insights. First, our analysis of cloud production traces shows that pooling across 8-16 sockets is enough to achieve most of the benefits. This enables a small-pool design with low access latency. Second, it is possible to create machine learning models that can accurately predict how much local and pool memory to allocate to a virtual machine (VM) to resemble same-NUMA-node memory performance. Our evaluation with 158 workloads shows that Pond reduces DRAM costs by 7% with performance within 1-5% of same-NUMA-node VM allocations.
As data generation has been on an upward trend, storing vast volumes of data cost-effectively as well as efficiently accessing them is paramount. At the same time, today's storage landscape continues to diversify, from high-bandwidth storage devices such as NVMe SSDs to low-latency non-volatile memory (e.g., Intel Optane DCPMM). These heterogeneous storage devices have the potential to deliver high performance in terms of bandwidth and latency with cost efficiency, while achieving the performance and cost targets together still remains a challenging problem. We provide our solution, Prism, a novel key-value store that utilizes modern heterogeneous storage devices. Prism uses heterogeneous storage devices synergistically to harness the advantages of each storage device while suppressing their downsides. We devise new techniques to balance the latency-bandwidth tradeoff when reading from SSD. For ensuring multicore scalability and crash consistency of data across heterogeneous storage media, Prism proposes cross-storage concurrency control and cross-storage crash consistency protocols. Our evaluation shows that Prism outperforms state-of-the-art key-value stores by up to 13.1× with significantly lower tail latency.
The Probabilistic Concurrency Testing (PCT) algorithm that provides theoretical guarantees on the probability of detecting concurrency bugs does not apply to weak memory programs. The PCT algorithm builds on the interleaving semantics of sequential consistency, which does not hold for weak memory concurrency. It is because weak memory concurrency allows additional behaviors that cannot be produced by any interleaving execution.
In this paper, we generalize PCT to address weak memory concurrency and present Probabilistic Concurrency Testing for Weak Memory (PCTWM). We empirically evaluate PCTWM on a set of well-known weak memory program benchmarks in comparison to the state-of-the-art weak memory testing tool C11Tester. Our results show that PCTWM can detect concurrency bugs more frequently than C11Tester.
While profile guided optimizations (PGO) and link time optimiza-tions (LTO) have been widely adopted, post link optimizations (PLO)have languished until recently when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers face scaling challenges with large binaries andis at odds with distributed build systems. To reconcile and enable post link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse scale work-loads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google with over tens of millions of cores executing Propeller optimized code at any time. An evaluation of internal warehouse-scale applications show Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7% while MySQL improves by 1%. Compared to the state of the art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks.
Database-backed web applications persist a large amount of production data and have high requirements for integrity. To protect data integrity against application code bugs and operator mistakes, most RDBMSes allow application developers to specify various types of integrity constraints. Unfortunately, applications (e.g., e-commerce web apps) often do not take full advantage of this capability and miss specifying many database constraints, resulting in many severe consequences, such as crashing the order placement page and corrupting the store inventory data.
In this paper, we focus on the problem of missing database constraints in web applications. We first study several widely used open-source e-commerce and communication applications, and observe that all these applications have missed integrity constraints and many were added later as afterthoughts after issues occurred.
Motivated by our observations, we build a tool called CFinder to automatically infer missing database constraints from application source code by cleverly leveraging the observation that many source code patterns usually imply certain data integrity constraints. By analyzing application source code automatically, CFinder can extract such constraints and check against their database schemas to detect missing ones. We evaluate CFinder with eight widelydeployed web applications, including one commercial company with millions of users. Overall, our tool identifies 210 previously unknown missing constraints. We have reported 92 of them to the developers of these applications, so far 75 are confirmed. Our tool achieves a precision of 78% and a recall of 79%.
Quantum computing is in an era of limited resources. Current hardware lacks high fidelity gates, long coherence times, and the number of computational units required to perform meaningful computation. Contemporary quantum devices typically use a binary system, where each qubit exists in a superposition of the 0 and 1 states. However, it is often possible to access the 2 or even 3 states in the same physical unit by manipulating the system in different ways. In this work, we consider automatically encoding two qubits into one four-state ququart via a compression scheme. We use quantum optimal control to design efficient proof-of-concept gates that fully replicate standard qubit computation on these encoded qubits.
We extend qubit compilation schemes to efficiently route qubits on an arbitrary mixed-radix system consisting of both qubits and ququarts, reducing communication and minimizing excess circuit execution time introduced by longer-duration ququart gates. In conjunction with these compilation strategies, we introduce several methods to find beneficial compressions, reducing circuit error due to computation and communication by up to 50
Zoned Namespace (ZNS) SSDs are the latest evolution of host-managed flash storage, enabling improved performance at a lower cost-per-byte than traditional block interface (conventional) SSDs. To date, there is no support for arranging these new devices in arrays that offer increased throughput and reliability (RAID). We identify key challenges in designing redundant ZNS SSD arrays, such as managing metadata updates and persisting partial stripe writes in the absence of overwrite support from the device. We present RAIZN, a logical volume manager that exposes a ZNS interface and stripes data and parity across ZNS SSDs. RAIZN provides more stable throughput and lower tail latencies than an mdraid array of conventional SSDs based on the same hardware platform. RAIZN achieves superior performance because device-level garbage collection slows down conventional SSDs. We confirm that the benefits of RAIZN translate to higher layers by adapting the F2FS file system, RocksDB key-value store, and MySQL database to work with ZNS and leverage its benefits by closely controlling garbage collection. Compared to arrays of conventional SSDs experiencing on-device garbage collection, RAIZN leverages the ZNS interface to maintain consistent performance with up to 14× higher throughput and lower tail latency.
We present MioDB, a novel LSM-tree based key-value (KV) store system designed to fully exploit the advantages of byte-addressable non-volatile memories (NVMs). Our experimental studies reveal that the performance bottleneck of LSM-tree based KV stores using NVMs mainly stems from (1) costly data serialization/deserialization across memory and storage, and (2) unbalanced speed between memory-to-disk data flushing and on-disk data compaction. They may cause unpredictable performance degradation due to write stalls and write amplification. To address these problems, we advocate byte-addressable and persistent skip lists to replace the on-disk data structure of LSM-tree, and design four novel techniques to make the best use of fast NVMs. First, we propose one-piece flushing to minimize the cost of data serialization from DRAM to NVM. Second, we exploit an elastic NVM buffer with multiple levels and zero-copy compaction to eliminate write stalls and reduce write amplification. Third, we propose parallel compaction to orchestrate data flushing and compactions across all levels of LSM-trees. Finally, MioDB increases the depth of LSM-tree and exploits bloom filters to improve the read performance. Our extensive experimental studies demonstrate that MioDB achieves 17.1× and 21.7× lower 99.9th percentile latency, 8.3× and 2.5× higher random write throughput, and up to 5× and 4.9× lower write amplification compared with the state-of-the-art NoveLSM and MatrixKV, respectively.
While the implications of persistent memory (PM) on CPU hardware and software are well-explored, the same is not true for GPUs (Graphics Processing Units). A recent work, GPM, demonstrated how GPU programs can benefit from the fine-grain persistence of PM. However, in the absence of a persistency model, one cannot reason about the correctness of PM-aware GPU programs. Persistency models define the order in which writes to PM are persisted. We explore persistency models for GPUs.
We explore persistency models for GPUs. We demonstrate that CPU persistency models fall short for GPUs. We qualitatively and quantitatively argue that GPU persistency models should support scopes and buffering of writes to PM to leverage parallelism while adapting to higher NVM latencies. We formally specify a GPU persistency model that supports both scopes and buffers. We detail how GPU architecture can efficiently realize such a model. Finally, we quantitatively demonstrate the usefulness of scopes and buffers for PM-aware GPU programs.
Functional programming’s benefits for hardware description have long been recognized in the literature. In particular, functional hardware description languages provide combinators such as maps and filters to facilitate the compositional description of circuits. However, it is challenging to apply functional programming with combinators to complex circuits with latency-insensitive interfaces such as valid/ready interfaces due to the cyclic nature of their forward and backward ports.
In this work, we present ShakeFlow: the first functional hardware description language supporting latency-insensitive interface combinators. ShakeFlow provides extensible support for custom interfaces and combinators and a compiler to synthesizable Verilog and FIRRTL. We port a part of the BaseJump STL library and the Corundum 100Gbps NIC from (System)Verilog to ShakeFlow, reducing the code size by 38% and 26%, respectively. By experimenting with Corundum, we demonstrate that ShakeFlow is capable of designing realistic circuits, and porting to ShakeFlow does not incur significant resource and performance overhead.
Most dataflow accelerator compilers achieve high performance by mapping each node in a dataflow program to a dedicated hardware element on a dataflow accelerator. However, this approach misses critical data reuse optimizations required to exploit the data bandwidth from fine-grained memory elements, e.g., FIFOs and pipeline registers. Moreover, writing performant dataflow programs requires users to have domain expertise in the underlying dataflow accelerators.
To address these issues, we designed Sigma, a novel compiler that supports high-level programming constructs such as Einstein summations, index notations, and tensors, finds opportunities for data reuse from high-level dataflow graphs, and exploits on-chip data bandwidth from fine-grained memory elements. Sigma targeting a research dataflow accelerator demonstrates a 5.4x speedup and 44.6x area-normalized speedup over Nvidia's V100 accelerator, and a 7.1x speedup over hand-written dataflow programs.
Traditionally, architecture prototypes are built on top of FPGA infrastructure, with two associated problems. First, very large FPGAs are prohibitively expensive for most people and institutions. Second, the burden of FPGA development adds to an already uneasy life of researchers, especially those who focus on software. Large designs that do not fit into a single FPGA exacerbate these issues even more. This work presents SMAPPIC — the first open-source prototype platform for shared memory multi-die architectures on cloud FPGAs. SMAPPIC leverages the OpenPiton/BYOC infrastructure and AWS F1 instances to make FPGA-based prototypes of System-on-Chips, processor cores, accelerators, cache subsystems, etc., cheap, scalable, and straightforward. SMAPPIC enables many use cases that are not possible or significantly more complicated in existing software and FPGA tools. This work has the potential to accelerate the rate of innovation in computer engineering fields in the nearest future.
Sparse matrix-matrix multiplication (SpGEMM) is widely used in many scientific and deep learning applications. The highly irregular structures of SpGEMM limit its performance and efficiency on conventional computation platforms, and thus motivate a large body of specialized hardware designs. Existing SpGEMM accelerators only support specific types of rigid execution dataflow such as inner/output-product or row-based schemes. Each dataflow is only optimized for certain sparse patterns and fails to generalize with robust performance to the widely diverse SpGEMM workloads across various domains. We propose Spada, a combination of three novel techniques for SpGEMM accelerators to efficiently adapt to various sparse patterns. First, we describe a window-based adaptive dataflow that can be flexibly adapted to different modes to best match the data distributions and realize different reuse benefits. Then, our hardware architecture efficiently supports this dataflow template, with flexible, fast, and low-cost reconfigurability and effective load balancing features. Finally, we use a profiling-guided approach to detect the sparse pattern and determine the optimized dataflow mode to use, based on the key observations of sparse pattern similarity in nearby matrix regions. Our evaluation results demonstrate that Spada is able to match or exceed the best among three state-of-the-art SpGEMM accelerators, and avoid the performance degradation of the others if data distribution and dataflow mismatch. It achieves an average 1.44× speedup across a wide range of sparse matrices and compressed neural network models.
Crash consistency overhead is a long-standing barrier to the adoption of byte-addressable persistent memory in practice. Despite continuous progress, persistent transactions for crash consistency still incur a 5.6X slowdown, making persistent memory prohibitively costly in practical settings. This paper introduces speculative logging, a new method that forgoes most memory fences and reduces data persistence overhead by logging data values early. This technique enables a novel persistent transaction model, speculatively persistent memory transactions (SpecPMT). Our evaluation shows that SpecPMT reduces the execution time overheads of persistent transactions substantially to just 10%.
High-level programming models for hardware design let domain experts quickly produce specialized accelerators. However, tools for debugging these accelerators remain tied to low-level hardware description languages (HDLs). High-level descriptions contain control-flow information that is lost in HDL code. We describe Cider, a stepwise debugger that exploits this information to provide software-like debugging abstractions for languages that compile to hardware. Cider uses Calyx, an intermediate language for accelerator generators that preserves control information. Cider provides breakpoints, watchpoints, state inspection, and source-level position mapping. Using case studies that examine one new and two preexisting accelerator generators, we demonstrate how Cider helps find and localize previously unreported bugs. By directly simulating a control-rich representation, Cider avoids wasting effort on inactive parts of the design and, despite being largely unoptimized, performs competitively with open-source HDL simulators.
Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases one app’s memory footprints by several times, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO as long as a few seconds, far exceeding the delay range satisfying to a user; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays.
To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency v.s. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer for preload shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy.
Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1–2 orders of magnitude lower memory, outperforming competitive baselines.
Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional ten- sor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this paper, we present TensorIR, a compiler abstraction for optimizing programs with these tensor computation primitives. TensorIR generalizes the loop nest representation used in existing machine learning compilers to bring tensor computation as the first-class citizen. Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Experimental results show that TensorIR compilation automatically uses the tensor computation primitives for given hardware backends and delivers performance that is competitive to state-of-art hand-optimized systems across platforms.
Stream processing engines (SPEs) are widely used for large scale streaming analytics over unbounded time-ordered data streams. Modern day streaming analytics applications exhibit diverse compute characteristics and demand strict latency and throughput requirements. Over the years, there has been significant attention in building hardware-efficient stream processing engines (SPEs) that support several query optimization, parallelization, and execution strategies to meet the performance requirements of large scale streaming analytics applications. However, in this work, we observe that these strategies often fail to generalize well on many real-world streaming analytics applications due to several inherent design limitations of current SPEs. We further argue that these limitations stem from the shortcomings of the fundamental design choices and the query representation model followed in modern SPEs. To address these challenges, we first propose TiLT, a novel intermediate representation (IR) that offers a highly expressive temporal query language amenable to effective query optimization and parallelization strategies. We subsequently build a compiler backend for TiLT that applies such optimizations on streaming queries and generates hardware-efficient code to achieve high performance on multi-core stream query executions. We demonstrate that TiLT achieves up to 326× (20.49× on average) higher throughput compared to state-of-the-art SPEs (e.g., Trill) across eight real-world streaming analytics applications. TiLT source code is available at https://github.com/ampersand-projects/tilt.git.
Tensor program tuning is a non-convex objective optimization problem, to which search-based approaches have proven to be effective. At the core of the search-based approaches lies the design of the cost model. Though deep learning-based cost models perform significantly better than other methods, they still fall short and suffer from the following problems. First, their feature extraction heavily relies on expert-level domain knowledge in hardware architectures. Even so, the extracted features are often unsatisfactory and require separate considerations for CPUs and GPUs. Second, a cost model trained on one hardware platform usually performs poorly on another, a problem we call cross-hardware unavailability.
In order to address these problems, we propose TLP and MTL-TLP. TLP is a deep learning-based cost model that facilitates tensor program tuning. Instead of extracting features from the tensor program itself, TLP extracts features from the schedule primitives. We treat schedule primitives as tensor languages. TLP is thus a Tensor Language Processing task. In this way, the task of predicting the tensor program latency through the cost model is transformed into a natural language processing (NLP) regression task. MTL-TLP combines Multi-Task Learning and TLP to cope with the cross-hardware unavailability problem.
We incorporate these techniques into the Ansor framework and conduct detailed experiments. Results show that TLP can speed up the average search time by 9.1× and 3.0× on CPU and GPU workloads, respectively, compared to the state-of-the-art implementation. MTL-TLP can achieve a speed-up of 4.7× and 2.9× on CPU and GPU workloads, respectively, using only 7% of the target hardware data. To the best of our knowledge, TLP is the first tensor program cost model to extract features directly from schedule primitives, and MTL-TLP is the first open-sourced work that effectively addresses the cross-platform unavailability problem. The code is available at https://github.com/zhaiyi000/tlp.
The complexity of modern operating systems (OSes), rapid diversification of hardware, and steady evolution of machine learning (ML) motivate us to explore the potential of ML to improve decision-making in OS kernels. We conjecture that ML can better manage tradeoff spaces for subsystems such as memory management and process and I/O scheduling that currently rely on hand-tuned heuristics to provide reasonable average-case performance. We explore the replacement of heuristics with ML-driven decision-making in five kernel subsystems, consider the implications for kernel design, shared OS-level components, and access to hardware acceleration. We identify obstacles, address challenges and characterize tradeoffs for the benefits ML can provide that arise in kernel-space. We find that use of specialized hardware such as GPUs is critical to absorbing the additional computational load required by ML decisioning, but that poor accessibility of accelerators in kernel space is a barrier to adoption. We also find that the benefits of ML and acceleration for OSes is subsystem-, workload- and hardware-dependent, suggesting that using ML in kernels will require frameworks to help kernel developers navigate new tradeoff spaces. We address these challenge by building a system called LAKE for supporting ML and exposing accelerators in kernel space. LAKE includes APIs for feature collection and management across abstraction layers and module boundaries. LAKE provides mechanisms for managing the variable profitability of acceleration, and interfaces for mitigating contention for resources between user and kernel space. We show that an ML-backed I/O latency predictor can have its inference time reduced by up to 96% with acceleration.
We propose uBFT, the first State Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only 2f+1 replicas to tolerate f Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pure crashes appear to be a mere illusion with real-life systems reportedly failing in many unexpected ways. uBFT relies on a small non-tailored trusted computing base—disaggregated memory—and consumes a practically bounded amount of memory. uBFT is based on a novel abstraction called Consistent Tail Broadcast, which we use to prevent equivocation while bounding memory. We implement uBFT using RDMA-based disaggregated memory and obtain an end-to-end latency of as little as 10 us. This is at least 50× faster than MinBFT, a state-of-the-art 2f+1 BFT SMR based on Intel’s SGX. We use uBFT to replicate two KV-stores (Memcached and Redis), as well as a financial order matching engine (Liquibook). These applications have low latency (up to 20 us) and become Byzantine tolerant with as little as 10 us more. The price for uBFT is a small amount of reliable disaggregated memory (less than 1 MiB), which in our prototype consists of a small number of memory servers connected through RDMA and replicated for fault tolerance.
As graph neural networks (GNNs) have achieved great success in many graph learning problems, it is of paramount importance to support their efficient execution. Different graphs and different operators present different patterns during execution. However, there is still a gap in the existing GNN acceleration research to explore adaptive parallelism. We show that existing GNN frameworks rely on handwritten static kernels, which fail to achieve the best performance across different graph operators and input graph structures. In this work, we propose uGrapher, a unified interface that achieves general high performance for different graph operators and datasets. The existing GNN frameworks can easily integrate our design for its simple and unified API. We take a principled approach that decouples a graph operator’s computation and schedule to achieve that. We first build a GNN-specific operator abstraction that incorporates the semantics of graph tensors and graph loops. We explore various schedule strategies based on the abstraction that can balance the well-established trade-off relationship between parallelism, locality, and efficiency. Our evaluation shows that uGrapher can bring up to 29.1× (3.5× on average) performance improvement over the state-of-the-art baselines on two studied NVIDIA GPUs.
Fine-grained value profilers reveal a promising way to accurately detect value-related software inefficiencies with binary instrumentation. Due to the architecture-dependent implementation details of binary instrumentation, existing value profilers suffer from poor portability as well as high engineering efforts to achieve efficiency across platforms. In this paper, we propose VClinic, a portable and efficient fine-grained value profiling framework for analyzing highly optimized binaries on both X86 and ARM platforms. VClinic exploits operand-centric two-level designs in its implementation to provide the common building blocks required for value profilers. By constructing four representative value profilers with VClinic, we demonstrate that VClinic can ease the development of value profilers with portability and efficiency across platforms. Guided by the value profilers built upon VClinic, we can achieve up to 89.94% and 74.66% speedup for real-world programs on X86 and ARM platforms, respectively.
Hardware memory domain primitives, such as Intel MPK and ARM Memory Domain, have been used for efficient in-process memory isolation. However, they can only provide a limited number of memory domains (16 domains), which cannot satisfy the compelling need for more isolated domains inside the address space of a process. Existing solutions to virtualize memory domains are either intrusive (need the modification to existing hardware), or incur a large performance overhead.
In this paper, we propose VDom, a fast and scalable memory domain virtualization system that supports unlimited memory domains. VDom leverages separate address spaces to provide an unlimited number of virtual domains, and optimizes related memory management operations. To map virtual domains to hardware domains, we design a domain virtualization algorithm, which manages address spaces and domain maps for threads to efficiently access other domains that are unmapped in the current address space. According to our evaluation on real Intel and ARM platforms, on real-world server applications (httpd and MySQL), VDom incurs less than 2.65% performance overhead, which is lower than the overheads of the state-of-the-art software approaches (libmpk and EPK). In random domain access tests, VDom is comparable to EPK and has significantly higher efficiency than libmpk.
In this paper, we present WACO, a novel method of co-optimizing the format and the schedule of a given sparsity pattern in a sparse tensor program. A core challenge in this paper is the design of a lightweight cost model that accurately predicts the runtime of a sparse tensor program by considering the sparsity pattern, the format, and the schedule. The key idea in addressing this is exploiting a sparse convolutional network to learn meaningful features of the sparsity pattern and embedding a coupled behavior between the format and the schedule using a specially designed schedule template. In addition, within the enormous search space of co-optimization, our novel search strategy, an approximate nearest neighbor search, efficiently and accurately retrieves the best format and schedule for a given sparsity pattern. We evaluated WACO for four different algorithms (SpMV, SpMM, SDDMM, and MTTKRP) on a CPU using 726 different sparsity patterns. Our experimental results showed that WACO outperformed four state-of-the-art baselines, Intel MKL, BestFormat, TACO with a default schedule, and ASpT. Compared to the best of four baselines, WACO achieved 1.43×, 1.18×, 1.14×, and 1.27× average speedups on SpMV, SpMM, SDDMM, and MTTKRP, respectively.
The availability of debug information for optimized executables can largely ease crucial tasks such as crash analysis. Source-level debuggers use this information to display program state in terms of source code, allowing users to reason on it even when optimizations alter program structure extensively. A few recent endeavors have proposed effective methodologies for identifying incorrect instances of debug information, which can mislead users by presenting them with an inconsistent program state.
In this work, we identify and study a related important problem: the completeness of debug information. Unlike correctness issues for which an unoptimized executable can serve as reference, we find there is no analogous oracle to deem when the cause behind an unreported part of program state is an unavoidable effect of optimization or a compiler implementation defect. In this scenario, we argue that empirically derived conjectures on the expected availability of debug information can serve as an effective means to expose classes of these defects.
We propose three conjectures involving variable values and study how often synthetic programs compiled with different configurations of the popular gcc and LLVM compilers deviate from them. We then discuss techniques to pinpoint the optimizations behind such violations and minimize bug reports accordingly. Our experiments revealed, among others, 24 bugs already confirmed by the developers of the gcc-gdb and clang-lldb ecosystems.