LISA11 - Fork Yeah! The Rise and Development of illumos (USENIX, 2011-12-15)
Bryan M. Cantrill, Joyent
In August 2010, illumos, a new OpenSolaris derivative, was born. While not at the time intended to be a fork, Oracle sealed the fate of illumos when it elected to close OpenSolaris: by choosing to cease its contributions, Oracle promoted illumos from a downstream repository to the open source repository of record for such revolutionary technologies as ZFS, DTrace, and Zones. This move accelerated the diaspora of kernel engineers from the former Sun Microsystems, many of whom have landed in the illumos community, where they continue to innovate. We will discuss the history of illumos but will focus on its promising future.

OSDI 20 - From Global to Local Quiescence: Wait-Free Code Patching of Multi-Threaded Processes (USENIX, 2024-09-23)
Florian Rommel and Christian Dietrich, Leibniz Universität Hannover; Birte Friesel, Marcel Köppen, Christoph Borchert, Michael Müller, and Olaf Spinczyk, Universität Osnabrück; Daniel Lohmann, Leibniz Universität Hannover
Live patching has become a common technique to keep long-running system services secure and up-to-date without causing downtimes during patch application. However, to safely apply a patch, existing live-update methods require the entire process to enter a state of quiescence, which can be highly disruptive for multi-threaded programs: Having to halt all threads (e.g., at a global barrier) for patching not only hampers quality of service, but can also be tremendously difficult to implement correctly without causing deadlocks or other synchronization issues.
In this paper, we present WfPatch, a wait-free approach to inject code changes into running multi-threaded programs. Instead of having to stop the world before applying a patch, WfPatch can gradually apply it to each thread individually at a local point of quiescence, while all other threads can make uninterrupted progress.
We have implemented WfPatch as a kernel service and user-space library for Linux 5.1 and evaluated it with OpenLDAP, Apache, Memcached, Samba, Node.js, and MariaDB on Debian 10 (“buster”). In total, we successfully applied 33 different binary patches into running programs while they were actively servicing requests; 15 patches had a CVE number or were other critical updates. Applying a patch with WfPatch did not lead to any noticeable increase in request latencies — even under high load — whereas applying the same patch after reaching global quiescence increased tail latencies by a factor of up to 41× for MariaDB.
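The per-thread switching idea can be illustrated with a toy sketch. This is plain Python, not the authors' kernel/ELF mechanism, and every name in it is invented for illustration: each worker holds its own reference to the current code version and adopts a published patch only at its own local quiescence point, so no global barrier is needed.

```python
# Toy sketch of wait-free patching via local quiescence (illustrative only).

def handler_v1(req):
    return "v1:" + req

def handler_v2(req):          # the "patched" version
    return "v2:" + req

class Worker:
    def __init__(self):
        self.handler = handler_v1   # this thread's private view of the code
        self.pending_patch = None   # a patch that is published but not yet applied
        self.log = []

    def serve(self, requests):
        for req in requests:
            # Local quiescence point: no handler frame is live on this
            # worker's stack here, so switching versions cannot corrupt
            # an in-flight call. Other workers keep making progress.
            if self.pending_patch is not None:
                self.handler, self.pending_patch = self.pending_patch, None
            self.log.append(self.handler(req))

w = Worker()
w.serve(["a"])                 # served by the old version
w.pending_patch = handler_v2   # publish the patch; no stop-the-world barrier
w.serve(["b", "c"])            # applied at the next local quiescence point
print(w.log)                   # ['v1:a', 'v2:b', 'v2:c']
```

The key property mirrored here is that the patch is *published* globally but *applied* lazily, per thread, at a point where it is locally safe.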
View the full OSDI '20 program at usenix.org/conference/osdi20/technical-sessions

USENIX ATC 24 - PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (2024-09-12)
Kinman Lei, Yuyang Jin, Mingshu Zhai, Kezhao Huang, Haoxing Ye, and Jidong Zhai, Tsinghua University
Aligning Large Language Models (LLMs) is currently the primary method to ensure AI systems operate in an ethically responsible and socially beneficial manner. Its paradigm differs significantly from standard pre-training or fine-tuning: alignment involves multiple models and workloads (contexts) and requires frequently switching execution between them, which introduces significant overhead, such as parameter updates and data transfer. Efficiently switching between different models and workloads is therefore a critical challenge.
To address these challenges, we introduce PUZZLE, an efficient system for LLM alignment. We explore model orchestration as well as light-weight and smooth workload switching in aligning LLMs by considering the similarity between different workloads. Specifically, PUZZLE adopts a two-dimensional approach for efficient switching, focusing on both intra- and inter-stage switching. Within each stage, switching costs are minimized by exploring model affinities and overlapping computation via time-sharing. Furthermore, a similarity-oriented strategy is employed to find the optimal inter-stage switch plan with the minimum communication cost. We evaluate PUZZLE on various clusters with up to 32 GPUs. Results show that PUZZLE achieves up to 2.12× speedup compared with the state-of-the-art RLHF training system DeepSpeed-Chat.
View the full ATC '24 program at usenix.org/conference/atc24/program

USENIX ATC 24 - Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu (2024-09-12)
Qingyuan Liu, Yanning Yang, Dong Du, and Yubin Xia, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Ping Zhang and Jia Feng, Huawei Cloud; James R. Larus, EPFL; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Key Laboratory of System Software (Chinese Academy of Science)
Current serverless platforms struggle to optimize resource utilization due to their dynamic and fine-grained nature. Conventional techniques like overcommitment and autoscaling fall short, often sacrificing utilization for practicability or incurring performance trade-offs. Overcommitment requires predicting performance to prevent QoS violations, introducing a trade-off between prediction accuracy and overheads. Autoscaling requires scaling instances quickly in response to load fluctuations to reduce resource wastage, but more frequent scaling also leads to more cold start overheads. This paper introduces Jiagu to harmonize efficiency with practicability through two novel techniques. First, pre-decision scheduling achieves accurate prediction while eliminating overheads by decoupling prediction and scheduling. Second, dual-staged scaling achieves frequent adjustment of instances with minimum overhead. We have implemented a prototype and evaluated it using real-world applications and traces from a public cloud platform. Our evaluation shows a 54.8% improvement in deployment density over commercial clouds (with Kubernetes) while maintaining QoS, and 81.0%–93.7% lower scheduling costs and a 57.4%–69.3% reduction in cold start latency compared to existing QoS-aware schedulers.

USENIX ATC 24 - ScalaAFA: Constructing User-Space All-Flash Array Engine with Holistic Designs (2024-09-12)
Shushu Yi, Peking University and Zhongguancun Laboratory; Xiurui Pan, Peking University; Qiao Li, Xiamen University; Qiang Li, Alibaba; Chenxi Wang, University of Chinese Academy of Sciences; Bo Mao, Xiamen University; Myoungsoo Jung, KAIST and Panmnesia; Jie Zhang, Peking University and Zhongguancun Laboratory
All-flash array (AFA) is a popular approach to aggregate the capacity of multiple solid-state drives (SSDs) while guaranteeing fault tolerance. Unfortunately, existing AFA engines inflict substantial software overheads on the I/O path, such as user-kernel context switches and AFA internal tasks (e.g., parity preparation), and therefore fail to keep pace with next-generation high-performance SSDs.
Tackling this challenge, we propose ScalaAFA, a unique holistic design of AFA engine that can extend the throughput of next-generation SSD arrays in scale with low CPU costs. We incorporate ScalaAFA into user space to avoid user-kernel context switches while harnessing SSD built-in resources for handling AFA internal tasks. Specifically, in adherence to the lock-free principle of existing user-space storage framework, ScalaAFA substitutes the traditional locks with an efficient message-passing-based permission management scheme to facilitate inter-thread synchronization. Considering the CPU burden imposed by background I/O and parity computation, ScalaAFA proposes to offload these tasks to SSDs. To mitigate host-SSD communication overheads in offloading, ScalaAFA takes a novel data placement policy that enables transparent data gathering and in-situ parity computation. ScalaAFA also addresses two AFA intrinsic issues, metadata persistence and write amplification, by thoroughly exploiting SSD architectural innovations. Comprehensive evaluation results indicate that ScalaAFA can achieve 2.5× write throughput and reduce average write latency by a significant 52.7%, compared to the state-of-the-art AFA engines.
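The substitution of locks with message-passing permission management can be sketched in miniature. This is a hypothetical Python analogue, not ScalaAFA's actual protocol: a single coordinator owns the stripe state, and I/O threads obtain write permission by sending a request message over a queue instead of contending on a shared lock.

```python
import queue
import threading

# Toy sketch: message-passing permission management in place of locks
# (the queue protocol and stripe model here are invented for illustration).

stripe_requests = queue.Queue()   # all permission requests funnel here

def coordinator(n_requests):
    # The coordinator alone touches stripe metadata; requests are answered
    # in message order, so writers are serialized without any shared lock.
    for _ in range(n_requests):
        stripe_id, reply_box = stripe_requests.get()
        reply_box.put(("granted", stripe_id))

grants = []   # CPython list.append is atomic, safe for this demo

def io_thread(stripe_id):
    reply_box = queue.Queue()                  # private reply channel
    stripe_requests.put((stripe_id, reply_box))
    grants.append(reply_box.get())             # block until permission arrives

coord = threading.Thread(target=coordinator, args=(4,))
coord.start()
workers = [threading.Thread(target=io_thread, args=(i % 2,)) for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
coord.join()

assert sorted(g[1] for g in grants) == [0, 0, 1, 1]
```

The design point mirrored here is that no worker ever blocks on a lock held by a descheduled peer; it only waits for a message, which keeps the data plane lock-free in the sense the abstract describes.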

USENIX ATC 24 - ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions (2024-09-12)
Yuqi Fu, University of Virginia; Ruizhe Shi, George Mason University; Haoliang Wang, Adobe Research; Songqing Chen, George Mason University; Yue Cheng, University of Virginia
FaaS (Function-as-a-Service) workloads feature unique patterns. Serverless functions are ephemeral, highly concurrent, and bursty, with an execution duration ranging from a few milliseconds to a few seconds. The workload behaviors pose new challenges to kernel scheduling. Linux CFS (Completely Fair Scheduler) is workload-oblivious and optimizes long-term fairness via proportional sharing. CFS neglects the short-term demands of CPU time from short-lived serverless functions, severely impacting the performance of short functions. Preemptive shortest job first—shortest remaining process time (SRPT)—prioritizes shorter functions in order to satisfy their short-term demands of CPU time and, therefore, serves as a best-case baseline for optimizing the turnaround time of short functions. A significant downside of approximating SRPT, however, is that longer functions might be starved.
In this paper, we propose a novel application-aware kernel scheduler, ALPS (Adaptive Learning, Priority Scheduler), based on two key insights. First, approximating SRPT can largely benefit short functions but may inevitably penalize long functions. Second, CFS provides necessary infrastructure support to implement user-defined priority scheduling. To this end, we design ALPS to have a novel, decoupled scheduler frontend and backend architecture, which unifies approximate SRPT and proportional-share scheduling. ALPS’ frontend sits in the user space and approximates SRPT-inspired priority scheduling by adaptively learning from an SRPT simulation on a recent past workload. ALPS’ backend uses eBPF functions hooked to CFS to carry out the continuously learned policies sent from the frontend to inform scheduling decisions in the kernel. This design adds workload intelligence to workload-oblivious OS scheduling while retaining the desirable properties of OS schedulers. We evaluate ALPS extensively using two production FaaS workloads (Huawei and Azure), and results show that ALPS achieves a reduction of 57.2% in average function execution duration compared to CFS.
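Why approximating SRPT benefits short functions, and how it starves long ones, can be seen in a toy single-CPU turnaround calculation. This is illustrative Python with made-up job durations, not ALPS's learned policy.

```python
# Toy comparison of FIFO vs SRPT turnaround times on one CPU.
# With simultaneous arrivals, preemptive SRPT degenerates to
# shortest-job-first, which keeps the sketch simple.

def fifo_turnaround(jobs):
    """jobs: list of (name, duration), all arriving at t=0."""
    t, out = 0, {}
    for name, d in jobs:
        t += d
        out[name] = t          # completion time == turnaround at arrival t=0
    return out

def srpt_turnaround(jobs):
    t, out = 0, {}
    for name, d in sorted(jobs, key=lambda j: j[1]):
        t += d
        out[name] = t
    return out

jobs = [("long", 1000), ("short_a", 5), ("short_b", 10)]
print(fifo_turnaround(jobs))   # short_a stuck behind the long job: 1005
print(srpt_turnaround(jobs))   # short_a finishes at 5; long slips to 1015
```

The numbers show both halves of the paper's first insight: the short function's turnaround drops by two orders of magnitude under SRPT, while the long job pays only a small relative penalty here, but with a steady stream of short arrivals it could be starved indefinitely, which is why ALPS only approximates SRPT.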

USENIX ATC 24 - Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (2024-09-12)
Bin Gao, National University of Singapore; Zhuomin He, Shanghai Jiaotong University; Puru Sharma, Qingxuan Kang, and Djordje Jevdjic, National University of Singapore; Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo, Huawei Cloud
Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, CachedAttention enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
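The core saving, reusing per-conversation KV state so each turn only prefills its new tokens, can be sketched as follows. This is a hypothetical Python toy, not CachedAttention's hierarchical system; `compute_kv` stands in for the real per-token KV computation.

```python
# Toy sketch of KV-cache reuse across conversation turns (illustrative only).

def compute_kv(token):
    # Placeholder for the real attention key/value computation.
    return ("kv", token)

class ConversationKVCache:
    def __init__(self):
        self.store = {}   # conversation id -> list of saved KV entries

    def prefill(self, conv_id, history):
        """Return how many tokens actually had KV computed this turn."""
        cached = self.store.setdefault(conv_id, [])
        new_tokens = history[len(cached):]   # skip tokens already cached
        cached.extend(compute_kv(t) for t in new_tokens)
        return len(new_tokens)

cache = ConversationKVCache()
turn1 = ["hi", "there"]
turn2 = turn1 + ["how", "are", "you"]
assert cache.prefill("c1", turn1) == 2   # first turn: full prefill
assert cache.prefill("c1", turn2) == 3   # later turn: only the 3 new tokens
```

Without reuse, the second turn would recompute KV for all five tokens; the schemes in the abstract (layer-wise pre-loading, scheduler-aware placement, positional-encoding decoupling) exist to make this reuse cheap and valid in a real serving engine.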

USENIX ATC 24 - Fast Inference for Probabilistic Graphical Models (2024-09-12)
Jiantong Jiang, The University of Western Australia; Zeyi Wen, HKUST (Guangzhou) and HKUST; Atif Mansoor and Ajmal Mian, The University of Western Australia
Probabilistic graphical models (PGMs) have attracted much attention due to their firm theoretical foundation and inherent interpretability. However, existing PGM inference systems are inefficient and lack sufficient generality, due to issues with irregular memory accesses, high computational complexity, and modular design limitation. In this paper, we present Fast-PGM, a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms. Fast-PGM incorporates careful memory management techniques to reduce memory consumption and enhance data locality. It also employs computation and parallelization optimizations to reduce computational complexity and improve the overall efficiency. Furthermore, Fast-PGM offers high generality and flexibility, allowing easy integration with all the mainstream importance sampling-based algorithms. The system abstraction of Fast-PGM facilitates easy optimizations, extensions, and customization for users. Extensive experiments show that Fast-PGM achieves 3 to 20 times speedup over the state-of-the-art implementation. Fast-PGM source code is freely available at github.com/jjiantong/FastPGM.

USENIX ATC 24 - Power-aware Deep Learning Model Serving with μ-Serve (2024-09-12)
Haoran Qiu, Weichao Mao, Archit Patke, and Shengkun Cui, University of Illinois Urbana-Champaign; Saurabh Jha, Chen Wang, and Hubertus Franke, IBM Research; Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer, University of Illinois Urbana-Champaign
With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while still satisfying its throughput and model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize model-serving performance. However, they fall short of leveraging GPU frequency scaling for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity of co-designing and optimizing fine-grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, µ-Serve. µ-Serve is a model-serving framework that efficiently serves multiple ML models in a homogeneous GPU cluster while optimizing power consumption and model-serving latency/throughput. Evaluation results on production workloads show that µ-Serve achieves 1.2–2.6× power saving by dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations.

USENIX ATC 24 - StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow (2024-09-12)
Hao Wu, Yue Yu, and Junxiao Deng, Huazhong University of Science and Technology; Shadi Ibrahim, Inria; Song Wu and Hao Fan, Huazhong University of Science and Technology and Jinyinhu Laboratory; Ziyue Cheng, Huazhong University of Science and Technology; Hai Jin, Huazhong University of Science and Technology and Jinyinhu Laboratory
The dynamic workload and latency sensitivity of DNN inference drive a trend toward exploiting serverless computing for scalable DNN inference serving. Usually, GPUs are spatially partitioned to serve multiple co-located functions. However, existing serverless inference systems isolate functions in separate monolithic GPU runtimes (e.g., CUDA contexts), which are too heavy for short-lived and fine-grained functions, leading to high startup latency, a large memory footprint, and expensive inter-function communication. In this paper, we present StreamBox, a new lightweight GPU sandbox for serverless inference workflows. StreamBox unleashes the potential of streams and efficiently realizes them for serverless inference by implementing fine-grained and auto-scaling memory management, allowing transparent and efficient intra-GPU communication across functions, and enabling PCIe bandwidth sharing among concurrent streams. Our evaluations over real-world workloads show that StreamBox reduces the GPU memory footprint by up to 82% and improves throughput by 6.7× compared to state-of-the-art serverless inference systems.

USENIX ATC 24 - FastCommit: resource-efficient, performant and cost-effective file system journaling (2024-09-12)
Harshad Shirwadkar, Saurabh Kadekodi, and Theodore Tso, Google
Awarded Best Paper!
JBD2, the current physical journaling mechanism in Ext4, is bulky and resource-hungry. Specifically, for metadata-heavy workloads, fsyncs issued by applications cause JBD2 to write copies of changed metadata blocks, incurring high byte and I/O overhead. When storing data in Ext4 via NFS (a popular setup), the NFS protocol issues fsyncs for every file metadata update, which further exacerbates the problem. In a simple multi-threaded mail-server workload, JBD2 consumed approximately 76% of the disk’s write bandwidth. The higher byte and I/O utilization of JBD2 results in reduced application throughput, higher wear-out of flash-based media, and increased performance provisioning costs in cloud-based storage services.
We present FastCommit: a hybrid journaling approach for Ext4 which performs logical journaling for simple and frequent file system modifications, while relying on JBD2 for more complex and rare modifications. The key design elements of FastCommit are compact logging, selective flushing, and inline journaling. The first two techniques work together to ensure that over 80% of commits fit within a single 4KB block and are written to disk without requiring an expensive cache flush operation. Inline journaling minimizes context-switching delays. With faster and more efficient fsyncs, FastCommit reduces the throughput interference of JBD2 by over 2× while improving throughput by up to 120%. We implemented FastCommit in Ext4 and successfully merged our code into the upstream Linux kernel.
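The hybrid dispatch between compact logical commits and full physical JBD2 transactions can be sketched as a toy decision function. This is illustrative Python: the record kinds, header sizes, and the simplicity test are invented here and are not FastCommit's actual on-disk format.

```python
# Toy sketch of hybrid journaling dispatch (illustrative only).

BLOCK = 4096   # one journal block, as in the "single 4KB block" fast path

def commit(records):
    """records: list of (kind, payload_bytes) metadata changes in one commit.

    Simple, frequent change kinds that fit in one block take the compact
    logical fast path; anything complex or oversized falls back to a full
    physical (JBD2-style) journal transaction.
    """
    simple = all(kind in ("inode_update", "dirent_add") for kind, _ in records)
    size = sum(len(p) for _, p in records) + 16 * len(records)  # toy headers
    if simple and size <= BLOCK:
        return "fast_commit"   # one logical block, no expensive cache flush
    return "jbd2"              # full block-granularity physical journaling

assert commit([("inode_update", b"x" * 100)]) == "fast_commit"
assert commit([("truncate_range", b"")]) == "jbd2"       # complex: fall back
assert commit([("inode_update", b"x" * 5000)]) == "jbd2" # too big: fall back
```

The asymmetry is the point: the fast path only has to be correct for the common, simple cases, because the rare complex ones always have JBD2 to fall back on.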

USENIX ATC 24 - HiP4-UPF: Towards High-Performance Comprehensive 5G User Plane Function on P4 Programmable Switches (2024-09-12)
Zhixin Wen and Guanhua Yan, Binghamton University
Due to their better cost benefits, P4 programmable switches have been considered in a few recent works for implementing the 5G User Plane Function (UPF). To circumvent the limited resources on P4 programmable switches, these works either ignore some essential UPF features or resort to a hybrid deployment approach that requires extra resources. This work aims to improve the performance of UPFs with comprehensive features which, except for packet buffering, are deployable entirely on commodity P4 programmable switches. We build a baseline UPF based on prior work and analyze its key performance bottlenecks. We propose a three-tiered approach to optimize rule storage on the switch ASICs. We also develop a novel scheme that combines pendulum table access and selective usage pulling to reduce the operational latency of the UPF. Using a commodity P4 programmable switch, the experimental results show that our UPF implementation can support twice as many mobile devices as the baseline UPF and 1.9 times more than SD-Fabric. Our work also improves the throughputs of three common types of 5G call flows by 9–619% over the UPF solutions in two open-source 5G network emulators.

USENIX ATC 24 - OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs (2024-09-12)
Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, Alessandro Vezzu, Thomas Benz, Salvatore Di Girolamo, and Timo Schneider, ETH Zürich; Daniele De Sensi, ETH Zürich and Sapienza University of Rome; Luca Benini and Torsten Hoefler, ETH Zürich
Multi-tenancy is essential for unleashing the potential of SmartNICs in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and I/O resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, the unpredictable execution times of SmartNIC kernels make conventional approaches to multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNIC resource manager co-design. OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing of the on-path packet-processing data plane. We integrate OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead.

USENIX ATC 24 - PeRF: Preemption-enabled RDMA Framework (2024-09-12)
Sugi Lee and Mingyu Choi, Acryl Inc.; Ikjun Yeom, Acryl Inc. and Sungkyunkwan University; Younghoon Kim, Sungkyunkwan University
Remote Direct Memory Access (RDMA) provides high throughput, low latency, and minimal CPU usage for data-intensive applications. However, RDMA was initially designed for single-tenant use, and its application in a multi-tenant cloud environment poses challenges in terms of performance isolation, security, and scalability. This paper proposes a Preemption-enabled RDMA Framework (PeRF), which offers software-based performance isolation for efficient multi-tenancy in RDMA. PeRF leverages a novel RNIC preemption mechanism to dynamically control RDMA resource utilization for each tenant, while ensuring that RNICs remain busy, thereby enabling work conservation. PeRF outperforms existing approaches by achieving flexible performance isolation without compromising RDMA's bare-metal performance.

USENIX ATC 24 - ZMS: Zone Abstraction for Mobile Flash Storage (2024-09-12)
Joo-Young Hwang, Seokhwan Kim, Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, and Sangyeun Cho, Samsung Electronics; Youjip Won, Korea Advanced Institute of Science and Technology
We propose ZMS, an I/O stack for ZNS-based flash storage in the mobile environment. The zone interface is known to spare flash storage from two fundamental issues that modern flash storage suffers from: logical-to-physical mapping table size and garbage collection overhead. Through extensive study, we find that realizing the zone interface in the mobile environment is far from trivial due to its unique characteristics: the lack of on-device memory in mobile flash storage and the frequent fsync() calls in mobile applications. Accordingly, we identify the root causes that need to be addressed in realizing the zone interface in the mobile I/O stack: write buffer thrashing and tiny synchronous file updates. We develop filesystem, block I/O layer, and device firmware techniques to address these two issues. The three key techniques in ZMS are (i) IOTailor, (ii) budget-based in-place update, and (iii) multi-granularity logical-to-physical mapping. Evaluation on a real production platform shows that ZMS reduces write amplification by 2.9–6.4× and improves random write performance by 5.0–13.6×. With the three techniques, ZMS shows significant performance improvements when writing to multiple zones concurrently, executing SQLite transactions, and launching applications.

USENIX ATC 24 - Starburst: A Cost-aware Scheduler for Hybrid Cloud (2024-09-12)
Michael Luo, Siyuan Zhuang, Suryaprakash Vengadesan, and Romil Bhardwaj, UC Berkeley; Justin Chang, UC Santa Barbara; Eric Friedman, Scott Shenker, and Ion Stoica, UC Berkeley
Distinguished Artifact Award!
To efficiently tackle bursts in job demand, organizations employ hybrid cloud architectures to scale their batch workloads from their private clusters to the public cloud. This requires transforming cluster schedulers into cloud-enabled versions that navigate the tradeoff between cloud costs and scheduler objectives such as job completion time (JCT). However, our analysis of production-level traces shows that existing cloud-enabled schedulers incur inefficient cost-JCT trade-offs due to low cluster utilization.
We present Starburst, a system that maximizes cluster utilization to streamline the cost-JCT tradeoff. Starburst's scheduler dynamically controls jobs' waiting times to improve utilization: it assigns longer waits to large jobs to increase their chances of running on the cluster, and shorter waits to small jobs to increase their chances of running on the cloud. To offer configurability, Starburst provides system administrators with a simple waiting budget framework to tune their position on the cost-JCT curve. In a departure from traditional cluster schedulers, Starburst operates as a higher-level resource manager over a private cluster and dynamic cloud clusters. Simulations over production-level traces and real-world experiments on a 32-GPU private cluster show that Starburst can reduce cloud costs by 54-91% over existing cluster managers while increasing average JCT by at most 5.8%.
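The size-aware waiting idea can be sketched with a toy placement policy. This is hypothetical Python: the proportional waiting-budget rule and its constant are guesses for illustration, not Starburst's actual policy.

```python
# Toy sketch of size-aware waiting for hybrid-cloud placement (illustrative).

def assign_wait(job_gpus, job_hours, budget_per_gpu_hour=0.1):
    # Wait proportional to total resource demand: big jobs (which would be
    # expensive to run on the cloud) are willing to wait longer for the
    # private cluster; small jobs spill to the cloud quickly.
    return job_gpus * job_hours * budget_per_gpu_hour

def place(job_gpus, job_hours, cluster_free_after):
    """cluster_free_after: hours until the private cluster can take the job."""
    wait = assign_wait(job_gpus, job_hours)
    return "cluster" if cluster_free_after <= wait else "cloud"

# An 80 GPU-hour job tolerates an ~8h wait; a 1 GPU-hour job does not.
assert place(8, 10, cluster_free_after=2) == "cluster"
assert place(1, 1, cluster_free_after=2) == "cloud"
```

Tuning `budget_per_gpu_hour` is the knob the abstract describes: raising it pushes more work onto the cluster (lower cost, higher JCT), lowering it does the opposite.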

USENIX ATC 24 - Ethane: An Asymmetric File System for Disaggregated Persistent Memory (2024-09-12)
Miao Cai, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Junru Shen, College of Computer Science and Software Engineering, Hohai University; Baoliu Ye, State Key Laboratory for Novel Software Technology, Nanjing University
Ultra-fast persistent memories (PMs) promise a practical path to high-performance distributed file systems. This paper examines and reveals a cascade of three performance and cost issues in the current PM provisioning scheme, namely expensive cross-node interaction, weak single-node capability, and costly scale-out performance, which not only underutilize fast PM devices but also magnify PM's limited storage capacity and high price. To remedy this, we introduce Ethane, a file system built on disaggregated persistent memory (DPM). Through resource separation using fast connectivity technologies, DPM achieves efficient and cost-effective PM sharing while retaining low-latency memory access. To unleash this hardware potential, Ethane incorporates an asymmetric file system architecture inspired by the imbalanced resource provision of DPM. It splits a file system into a control-plane FS and a data-plane FS and designs the two planes to make the best use of their respective hardware resources. Evaluation results demonstrate that Ethane reaps the DPM hardware benefits, performs up to 68× better than modern distributed file systems, and improves data-intensive application throughputs by up to 17×.

USENIX ATC 24 - CyberStar: Simple, Elastic and Cost-Effective Network Functions Management in Cloud Network at Scale (2024-09-12)
Tingting Xu, Nanjing University; Bengbeng Xue, Yang Song, Xiaomin Wu, Xiaoxin Peng, and Yilong Lyu, Alibaba Group; Xiaoliang Wang, Chen Tian, Baoliu Ye, and Camtu Nguyen, Nanjing University; Biao Lyu and Rong Wen, Alibaba Group; Zhigang Zong, Alibaba Group and Zhejiang University; Shunmin Zhu, Alibaba Group and Tsinghua University
Network functions (NFs) facilitate network operations and have become a critical service offered by cloud providers. One of the key challenges is how to meet the elastic requirements of massive traffic and the diverse NF requests of tenants. This paper identifies the opportunity to leverage cloud elastic compute services (ECS), i.e., containers or virtual machines, to provide cloud-scale network function services, and presents CyberStar. CyberStar introduces two key designs: (i) resource pooling based on a newly proposed three-tier architecture for scalable network functions; and (ii) on-demand resource assignment that maintains high resource utilization with respect to both tenant demands and operation cost. Compared to traditional NFs constructed over bare-metal servers, CyberStar achieves 100Gbps bandwidth (6.7×) and scales to millions of connections within one second (20×).

USENIX ATC 24 - More is Different: Prototyping and Analyzing a New Form of Edge Server with Massive Mobile SoCs (2024-09-12)
Li Zhang, Beijing University of Posts and Telecommunications; Zhe Fu, Tsinghua University; Boqing Shi and Xiang Li, Beijing University of Posts and Telecommunications; Rujin Lai and Chenyang Yang, vclusters; Ao Zhou, Xiao Ma, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications
Huge energy consumption poses a significant challenge for edge clouds. In response to this, we introduce a new type of edge server, namely SoC Cluster, that orchestrates multiple low-power mobile system-on-chips (SoCs) through an on-chip network. For the first time, we have developed a concrete SoC Cluster consisting of 60 Qualcomm Snapdragon 865 SoCs housed in a 2U rack, which has been successfully commercialized and extensively deployed in edge clouds. Cloud gaming emerges as the principal workload on these deployed SoC Clusters, owing to the compatibility between mobile SoCs and native mobile games.
In this study, we aim to demystify whether the SoC Cluster can efficiently serve more generalized, typical edge workloads. Therefore, we developed a benchmark suite that employs state-of-the-art libraries for two critical edge workloads, i.e., video transcoding and deep learning inference. This suite evaluates throughput, latency, power consumption, and other application-specific metrics like video quality. Following this, we conducted a thorough measurement study and directly compared the SoC Cluster with traditional edge servers, with regards to electricity usage and monetary cost. Our results quantitatively reveal when and for which applications mobile SoCs exhibit higher energy efficiency than traditional servers, as well as their ability to proportionally scale power consumption with fluctuating incoming loads. These outcomes provide insightful implications and offer valuable direction for further refinement of the SoC Cluster to facilitate its deployment across wider edge scenarios.
2024-09-12 | ETC: An Elastic Transmission Control Using End-to-End Available Bandwidth Perception
Feixue Han, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Qing Li, Peng Cheng Laboratory; Peng Zhang, Tencent; Gareth Tyson, Hong Kong University; Yong Jiang, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Mingwei Xu, Tsinghua University; Yulong Lan and ZhiCheng Li, Tencent
Researchers and practitioners have proposed various transport protocols to keep up with advances in networks and the applications that use them. Current wide area network protocols strive to identify a congestion signal to make distributed but fair judgments. However, existing congestion signals such as RTT and packet loss can only be observed after congestion occurs. We therefore propose Elastic Transmission Control (ETC). ETC exploits the instantaneous receipt rate of N consecutive packets as the congestion signal. We refer to this as the pulling rate, as we posit that the receipt rate can be used to "pull" the sending rate towards a fair share of the capacity. Naturally, this signal can be measured prior to congestion, as senders can access it immediately after the acknowledgment of the first N packets. Exploiting the pulling rate measurements, ETC calculates the optimal rate update steps following a simple elastic principle: the further away from the pulling rate, the faster the sending rate increases. We conduct extensive experiments using both simulated and real networks. Our results show that ETC outperforms state-of-the-art protocols in terms of both throughput (15% higher than Copa) and latency (20% lower than BBR). ETC also shows superior convergence speed and fairness, with a 10× improvement in convergence time even compared to the protocol with the best convergence performance.
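The elastic principle the abstract states can be captured in a few lines. The sketch below is our illustration, not ETC's actual controller: the function names, the gain constant, and the linear step rule are assumptions made for exposition.

```python
# Hypothetical sketch of ETC's elastic principle: the sending rate
# moves toward the measured "pulling rate" (instantaneous receipt rate
# of the last N packets), taking larger steps the further away it is.

def elastic_update(sending_rate, pulling_rate, gain=0.5):
    """Return the next sending rate: the step size grows with the gap."""
    gap = pulling_rate - sending_rate
    return sending_rate + gain * gap  # larger gap -> larger step

def converge(sending_rate, pulling_rate, rounds):
    """Apply the elastic update repeatedly and record the trajectory."""
    rates = [sending_rate]
    for _ in range(rounds):
        sending_rate = elastic_update(sending_rate, pulling_rate)
        rates.append(sending_rate)
    return rates

# Starting far below a 100 Mbps pulling rate: early steps are large,
# later steps are small -- the elastic behaviour the abstract describes.
rates = converge(10.0, 100.0, 8)
```

Because the step is proportional to the remaining gap, the gap shrinks geometrically, which is one plausible reading of the claimed fast convergence.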
2024-09-12 | KEPC-Push: A Knowledge-Enhanced Proactive Content Push Strategy for Edge-Assisted Video Feed Streaming
Ziwen Ye, Peng Cheng Laboratory and Tsinghua Shenzhen International Graduate School; Qing Li, Peng Cheng Laboratory; Chunyu Qiao, ByteDance; Xiaoteng Ma, Tsinghua Shenzhen International Graduate School; Yong Jiang, Peng Cheng Laboratory and Tsinghua Shenzhen International Graduate School; Qian Ma and Shengbin Meng, ByteDance; Zhenhui Yuan, University of Warwick; Zili Meng, HKUST
Video Feed Streaming (e.g., TikTok, Reels) is increasingly popular nowadays. Users are scheduled to the distribution infrastructure, including content distribution network (CDN) and multi-access edge computing (MEC) nodes, to access the content. Our observation is that the existing proactive content push algorithms, which are primarily based on historical access information and designed for on-demand videos, no longer meet the demands of video feed streaming. The main reason is that video feed streaming applications always push recently generated videos to attract users' interest, and thus lack historical information when pushing. In this case, push mismatches and load imbalances arise, resulting in increased bandwidth costs and degraded user experience. To this end, we propose KEPC-Push, a Knowledge-Enhanced Proactive Content Push strategy informed by knowledge of video content features. KEPC-Push employs knowledge graphs to determine the popularity correlation among similar videos (with similar authors, content, length, etc.) and pushes content based on this guidance. Besides, KEPC-Push designs a hierarchical algorithm to optimize the resource allocation in edge nodes with heterogeneous capabilities and runs at the regional level to shorten the communication distance. Trace-driven simulations show that KEPC-Push reduces peak-period CDN bandwidth costs by 20% and improves average download speeds by 7% against the state-of-the-art solutions.
2024-09-12 | Scalable and Effective Page-table and TLB management on NUMA Systems
Bin Gao, Qingxuan Kang, and Hao-Wei Tee, National University of Singapore; Kyle Timothy Ng Chu, Horizon Quantum Computing; Alireza Sanaee, Queen Mary University of London; Djordje Jevdjic, National University of Singapore
Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to maintain all replicas coherent.
In this paper, we propose a novel page-table management mechanism, called Hydra, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that Hydra's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, Hydra not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement Hydra in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that Hydra achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.
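Why precise sharer knowledge shrinks shootdowns can be shown with a toy model. This is our illustration, not Hydra's code: the class, its fields, and the page addresses are invented for exposition.

```python
# Toy model of sharer tracking: instead of interrupting every NUMA
# node on an unmap, only the nodes known to hold a translation for
# the page receive a TLB shootdown.

class PageDirectory:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}  # page -> set of NUMA nodes with a local translation

    def access(self, page, node):
        """Record that `node` translated `page` (and may cache it in its TLB)."""
        self.sharers.setdefault(page, set()).add(node)

    def unmap(self, page):
        """Return the nodes that must receive a TLB shootdown."""
        targets = self.sharers.pop(page, set())
        return sorted(targets)

pd = PageDirectory(num_nodes=8)
pd.access(0x1000, 0)
pd.access(0x1000, 3)
pd.access(0x2000, 5)
# Unmapping 0x1000 now interrupts only nodes {0, 3} rather than all 8.
```

The baseline process-wide shootdown corresponds to always returning all eight nodes; the sharer set is what lets Hydra-style designs do better.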
2024-09-12 | FetchBPF: Customizable Prefetching Policies in Linux with eBPF
Xuechun Cao, Shaurya Patel, and Soo Yee Lim, University of British Columbia; Xueyuan Han, Wake Forest University; Thomas Pasquier, University of British Columbia
Monolithic operating systems are infamously complex. Linux in particular has a tendency to intermingle policy and mechanism in a manner that hinders modularity. This is especially problematic when developers aim to finely optimize performance, since it is often the case that a default policy in Linux, while performing well on average, cannot achieve optimal performance in all circumstances. However, developing and maintaining a bespoke kernel to satisfy the needs of a specific application is usually an unrealistic endeavor due to the high software engineering cost. Therefore, we need a mechanism to easily customize kernel policies and behavior. In this paper, we design a framework called FetchBPF that addresses this problem in the context of memory prefetching. FetchBPF extends the widely used eBPF framework to allow developers to easily express, develop, and deploy prefetching policies without modifying the kernel codebase. We implement various memory prefetching policies from the literature and demonstrate that our deployment model incurs negligible overhead as compared to the equivalent native kernel implementation.
2024-09-12 | High-density Mobile Cloud Gaming on Edge SoC Clusters
Li Zhang, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications
System-on-Chip (SoC) Clusters, i.e., servers consisting of many stacked mobile SoCs, have emerged as a popular platform for serving mobile cloud gaming. Sharing the underlying hardware and OS, these SoC Clusters enable native mobile games to be executed and rendered efficiently without modification. However, the number of deployed game sessions is limited due to conservative deployment strategies and high GPU utilization in current game offloading methods. To address these challenges, we introduce SFG, the first system that enables high-density mobile cloud gaming on SoC Clusters with two novel techniques: (1) It employs a resource-efficient game partitioning and cross-SoC offloading design that maximally preserves GPU optimization intents in the standard graphics rendering pipeline; (2) It proposes an NPU-enhanced game partition coordination strategy to adjust game performance when co-locating partitioned and complete game sessions. Our evaluation of five Unity games shows that SFG achieves up to 4.5× higher game density than existing methods with trivial performance loss. Equally important, SFG extends the lifespan of SoC Clusters, enabling outdated SoC Clusters to serve new games that are unfeasible on a single SoC due to GPU resource shortages.
2024-09-12 | Limitations and Opportunities of Modern Hardware Isolation Mechanisms
Xiangdong Chen and Zhaofeng Li, University of Utah; Tirth Jain, Maya Labs; Vikram Narayanan and Anton Burtsev, University of Utah
A surge in the number, complexity, and automation of targeted security attacks has triggered a wave of interest in hardware support for isolation. Intel memory protection keys (MPK), ARM pointer authentication (PAC), ARM memory tagging extensions (MTE), and ARM Morello capabilities are just a few hardware mechanisms aimed at supporting low-overhead isolation in recent CPUs. These new mechanisms aim to bring practical isolation to a broad range of systems, e.g., browser plugins, device drivers and kernel extensions, user-defined database and network functions, serverless cloud platforms, and many more. However, as these technologies are still nascent, their advantages and limitations are as yet unclear. In this work, we take an in-depth look at modern hardware isolation mechanisms with the goal of understanding their suitability for the isolation of subsystems with the tightest performance budgets. Our analysis shows that, while a huge step forward, the isolation mechanisms in commodity CPUs still lack several design principles critical for supporting low-overhead enforcement of isolation boundaries, zero-copy exchange of data, and secure revocation of access permissions.
2024-09-12 | An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise
Hongyu Li, Beijing University of Posts and Telecommunications; Liwei Guo, University of Electronic Science and Technology of China; Yexuan Yang, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications Awarded Best Paper!
Developed for over 30 years, Linux has become the computing foundation for today's digital world; from gigantic, complex mainframes (e.g., supercomputers) to cheap, wimpy embedded devices (e.g., IoT), countless applications are built on top of it. Yet such an infrastructure has been plagued by numerous memory and concurrency bugs since the day it was born, because the C language permits many rogue memory operations. A recent project, Rust-for-Linux (RFL), has the potential to address Linux's safety concerns once and for all: by embracing Rust's static ownership and type checkers in the kernel code, the kernel may finally be free from memory and concurrency bugs without hurting its performance. While RFL has gradually matured and even been merged into the Linux mainline, it is rarely studied, and it remains unclear whether it has indeed reconciled the safety and performance dilemma for the kernel.
To this end, we conduct the first empirical study on RFL to understand its status quo and benefits, especially how Rust fuses with Linux and whether the fusion assures driver safety without overhead. We collect and analyze 6 key RFL drivers, which involve hundreds of issues and PRs, thousands of GitHub commits and mail exchanges on the Linux mailing list, as well as over 12K discussions on Zulip. We find that while Rust mitigates kernel vulnerabilities, fully eliminating them is beyond Rust's capability; what is more, if not handled properly, its safety assurance can cost developers dearly in terms of both runtime overhead and development effort.
2024-09-12 | ExtMem: Enabling Application-Aware Virtual Memory Management for Data-Intensive Applications
Sepehr Jalalian, Shaurya Patel, Milad Rezaei Hajidehi, Margo Seltzer, and Alexandra Fedorova, University of British Columbia
For over forty years, researchers have demonstrated that operating system memory managers often fall short in supporting memory-hungry applications. The problem is even more critical today, with disaggregated memory, new memory technologies, tera-scale machine learning models, large-scale graph processing, and other memory-intensive applications. Past attempts to provide application-specific memory management either required significant in-kernel changes or suffered from high overhead. We present ExtMem, a flexible framework for providing application-specific memory management. It differs from prior solutions in three ways: (1) It is compatible with today's Linux deployments, (2) it is a general-purpose substrate for addressing various memory and storage backends, and (3) it is performant in multithreaded environments. ExtMem allows for easy and rapid prototyping of new memory management algorithms, easy collection of memory patterns and statistics, and immediate deployment of isolated custom memory management.
2024-09-12 | HydraRPC: RPC in the CXL Era
Teng Ma, Alibaba Group; Zheng Liu, Zhejiang University and Alibaba Group; Chengkun Wei, Zhejiang University; Jialiang Huang, Alibaba Group and Tsinghua University; Youwei Zhuo, Alibaba Group and Peking University; Haoyu Li, Zhejiang University; Ning Zhang, Yijin Guan, and Dimin Niu, Alibaba Group; Mingxing Zhang, Tsinghua University; Tao Ma, Alibaba Group
In this paper, we present HydraRPC, which utilizes CXL-attached HDM for data transmission. By leveraging CXL, HydraRPC can benefit from memory sharing, memory semantics, and high scalability. As a result, expensive network round trips, memory copying, and serialization/deserialization are eliminated. Since CXL.cache protocols are not fully supported, we employ non-cacheable sharing to bypass the CPU cache and design a notification mechanism free of busy polling. This ensures efficient data transmission without the need for constant polling. We conducted evaluations of HydraRPC on real CXL hardware, which showcased the potential efficiency of utilizing CXL HDM to build RPC systems.
2024-09-12 | Telescope: Telemetry for Gargantuan Memory Footprint Applications
Alan Nair, Sandeep Kumar, and Aravinda Prasad, Intel Labs; Ying Huang, Intel Corporation; Andy Rudoff and Sreenivas Subramoney, Intel Labs
Data-hungry applications that require terabytes of memory have become widespread in recent years. To meet the memory needs of these applications, data centers are embracing tiered memory architectures with near and far memory tiers. Precise, efficient, and timely identification of hot and cold data and their placement in appropriate tiers is critical for performance in such systems. Unfortunately, the existing state-of-the-art telemetry techniques for hot and cold data detection are ineffective at terabyte scale.
We propose Telescope, a novel technique that profiles different levels of the application's page table tree for fast and efficient identification of hot and cold data. Telescope is based on the observation that for a memory- and TLB-intensive workload, higher levels of a page table tree are also frequently accessed during a hardware page table walk. Hence, the hotness of the higher levels of the page table tree essentially captures the hotness of its subtrees or address space sub-regions at a coarser granularity. We exploit this insight to quickly converge on even a few megabytes of hot data and efficiently identify several gigabytes of cold data in terabyte-scale applications. Importantly, such a technique can seamlessly scale to petabyte-scale applications.
Telescope's telemetry achieves 90%+ precision and recall at just 0.9% single CPU utilization for microbenchmarks with 5 TB memory footprint. Memory tiering based on Telescope results in 5.6% to 34% throughput improvement for real-world benchmarks with 1–2 TB memory footprint compared to other state-of-the-art telemetry techniques.
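The insight that a higher-level page-table entry aggregates the hotness of its whole subtree can be modeled in miniature. This is our illustration, not Telescope's implementation: the constants follow x86-64 4 KiB paging, and the level numbering is our own convention.

```python
# Toy model: an access to a virtual address also "touches" one entry
# at every page-table level, and a level-1 (PMD-like) entry covers an
# entire 2 MiB region, so its counter summarises 512 leaf pages.

PAGE_SHIFT = 12   # 4 KiB pages
ENTRY_BITS = 9    # 512 entries per table on x86-64

def level_index(vaddr, level):
    """Index of the page-table entry covering vaddr at a given level
    (level 0 = leaf PTE, level 3 = root PGD-like table)."""
    return vaddr >> (PAGE_SHIFT + ENTRY_BITS * level)

def profile(addresses, level):
    """Count accesses per entry at one page-table level: coarser levels
    summarise hotness over exponentially larger address regions."""
    hot = {}
    for a in addresses:
        idx = level_index(a, level)
        hot[idx] = hot.get(idx, 0) + 1
    return hot

# 512 accesses spread across one 2 MiB region: scattered over 512 leaf
# PTEs, but collapsed into a single counter at the PMD-like level.
accesses = [0x200000 + i * 4096 for i in range(512)]
pte_hot = profile(accesses, level=0)
pmd_hot = profile(accesses, level=1)
```

Profiling a coarse level first and descending only into its hot entries is what lets a Telescope-style scheme converge quickly at terabyte scale.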
2024-09-12 | Fast (Trapless) Kernel Probes Everywhere
Jinghao Jia, University of Illinois Urbana-Champaign; Michael V. Le and Salman Ahmed, IBM T.J. Watson Research Center; Dan Williams, Virginia Tech and IBM T.J. Watson Research Center; Hani Jamjoom, IBM T.J. Watson Research Center; Tianyin Xu, University of Illinois at Urbana-Champaign
The ability to efficiently probe and instrument a running operating system (OS) kernel is critical for debugging, system security, and performance monitoring. While efforts to optimize the widely used Kprobes in Linux over the past two decades have greatly improved its performance, many fundamental gaps remain that prevent it from being completely efficient. Specifically, we find that Kprobe is only optimized for ~80% of kernel instructions, leaving the remaining probe-able kernel code to suffer the severe penalties of double traps needed by the Kprobe implementation. In this paper, we focus on the design and implementation of an efficient and general trapless kernel probing mechanism (no hardware exceptions) that can be applied to almost all code in Linux. We discover that the main limitation of current probe optimization efforts comes from not being able to assume or change certain properties/layouts of the target kernel code. Our main insight is that by introducing strategically placed nops, thus slightly changing the code layout, we can overcome this main limitation. We implement our mechanism on Linux Kprobe, which is transparent to the users. Our evaluation shows a 10x improvement of probe performance over standard Kprobe while providing this level of performance for 96% of kernel code.
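The layout constraint behind the nop insight can be modeled roughly as follows. This is our simplified reconstruction, not the paper's algorithm: real jump-optimized probes have additional safety rules, and the byte lengths and helper names below are invented.

```python
# Simplified model: a trapless probe overwrites instructions with a
# 5-byte jump, which is only safe if the overwritten instructions span
# at least 5 bytes and no other code jumps into the middle of the span.
# Inserting 1-byte nops changes the layout so more sites qualify.

JUMP_LEN = 5

def patchable(insn_lens, jump_targets):
    """Return indices of instructions where a 5-byte jump fits safely."""
    offsets = [0]
    for n in insn_lens:
        offsets.append(offsets[-1] + n)
    ok = []
    for i in range(len(insn_lens)):
        span, j = 0, i
        while j < len(insn_lens) and span < JUMP_LEN:
            span += insn_lens[j]
            j += 1
        if span < JUMP_LEN:
            continue  # not enough bytes to host the jump
        # no branch may land inside the overwritten span
        interior = set(offsets[i + 1:j])
        if interior & set(jump_targets):
            continue
        ok.append(i)
    return ok

def pad_with_nops(insn_lens, index, count):
    """Insert `count` 1-byte nops before instruction `index`: the kind
    of layout change that makes an unpatchable site patchable."""
    return insn_lens[:index] + [1] * count + insn_lens[index:]
```

For example, a 4-byte function (two 2-byte instructions) has no patchable site at all, but appending three nops makes both instructions patchable.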
2024-09-12 | WingFuzz: Implementing Continuous Fuzzing for DBMSs
Jie Liang, Zhiyong Wu, and Jingzhou Fu, Tsinghua University; Yiyuan Bai and Qiang Zhang, Shuimu Yulin Technology Co., Ltd.; Yu Jiang, Tsinghua University
Database management systems (DBMSs) are critical components within software ecosystems, and their security and stability are paramount. In recent years, fuzzing has emerged as a prominent automated testing technique, effectively identifying vulnerabilities in various DBMSs. Nevertheless, many of these fuzzers require specific adaptation for a DBMS with a particular version. Employing these techniques to test enterprise-level DBMSs continuously poses challenges due to the diverse specifications of DBMSs and the code changes in their rapid version evolution.
In this paper, we present the industry practice of implementing continuous DBMS fuzzing on enterprise-level DBMSs like ClickHouse. We summarize three main obstacles in implementation, namely the diverse SQL grammars in test-case generation, the ongoing evolution of the codebase during continuous testing, and the disturbance of noise during anomaly analysis. We propose WingFuzz, which utilizes specification-based mutator generation, corpus-driven evolving code fuzzing, and noise-resilient anomaly assessment to address them. By working with engineers on continuous DBMS fuzzing, we have found a total of 236 previously undiscovered bugs in 12 widely used enterprise-level DBMSs including ClickHouse, DamengDB, and TenDB. Due to its favorable test results, our efforts received recognition and cooperation invitations from some DBMS vendors. For example, ClickHouse's CTO praised: "Which tool did you use to find this test case? We need to integrate it into our CI." WingFuzz has been successfully integrated into ClickHouse's development process.
2024-09-12 | Kivi: Verification for Cluster Management
Bingzhe Liu and Gangmuk Lim, UIUC; Ryan Beckett, Microsoft; P. Brighten Godfrey, UIUC and Broadcom
Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users’ operational burden, the complex, dynamic, and non-deterministic nature of these systems makes them hard to reason about, potentially leading to failures ranging from performance degradation to outages.
We present Kivi, the first system for verifying controllers and their configurations in cluster management systems. Kivi focuses on the popular system Kubernetes, and models its controllers and events into processes whereby their interleavings are exhaustively checked via model checking. Central to handling autoscaling and large-scale deployments are our modeling optimizations and our design which seeks to find violations in a smaller and reduced topology. We show that Kivi is effective and accurate in finding issues in realistic and complex scenarios and showcase two new issues in Kubernetes controller source code.
2024-09-12 | Balancing Analysis Time and Bug Detection: Daily Development-friendly Bug Detection in Linux
Linux, a battle-tested codebase, is known to suffer from many bugs despite its extensive testing mechanisms. While many of these bugs require domain-specific knowledge for detection, a significant portion matches well-known bug patterns. Even though these bugs can be found with existing tools, our simple check of Linux kernel patches suggests that these tools are not used much in developers' daily workflows. The lack of usage is probably due to the well-known trade-off between analysis time and bug detection capability: tools typically employ complex analysis to effectively and comprehensively find bugs in return for a long analysis time, or focus on a short analysis time by employing only elementary analyses and thus find a very limited number of bugs. Ideally, developers want tools that incur a short analysis time while still finding many bugs, so that the tools fit into daily development.
This paper explores an approach that balances this trade-off by focusing on bugs that can be found with less computationally complex analysis methods and by limiting the analysis scope to each source file. To achieve this, we propose a combination of computationally lightweight analyses and demonstrate our claim by designing FiTx, a framework for generating daily development-friendly bug checkers that focus on well-known patterns. Despite its simplicity, FiTx successfully identified 47 new bugs in Linux kernel version 5.15 within 2.5 hours, outperforming Clang Static Analyzer and CppCheck in both speed and bug detection. It demonstrates that focusing on less complex bug patterns can still significantly contribute to the improvement of codebase health. FiTx can be embedded into the daily development routine, enabling early bug detection without sacrificing developers' time.
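The flavor of a "computationally lightweight, well-known pattern" checker can be conveyed with a toy example. This is our illustration, not FiTx's implementation: the event encoding and function names are invented, and a real checker would operate on compiler IR rather than tuples.

```python
# Toy intra-procedural checker: a single linear pass over a function's
# pointer events flags the classic double-free and use-after-free
# patterns, with no whole-program or path-sensitive analysis.

def check_function(events):
    """events: list of (op, var) with op in {"alloc", "free", "use"}.
    Returns a list of (bug, var) findings in program order."""
    freed = set()
    findings = []
    for op, var in events:
        if op == "free":
            if var in freed:
                findings.append(("double-free", var))
            freed.add(var)
        elif op == "use":
            if var in freed:
                findings.append(("use-after-free", var))
        elif op == "alloc":
            freed.discard(var)  # re-allocation makes the pointer valid again

    return findings

bugs = check_function([
    ("alloc", "p"), ("use", "p"),
    ("free", "p"), ("use", "p"),   # use-after-free
    ("free", "p"),                 # double-free
])
```

A pass like this is linear in function size, which is why pattern-focused checkers can run in minutes over the whole kernel.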
2024-09-12 | Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, and Mohd Muzzammil, Samsung Research; Myeongjae Jeon, UNIST
As deep learning model sizes expand and new GPUs are released every year, the need for distributed training on heterogeneous GPUs rises to fully harness under-utilized low-end GPUs and reduce the cost of purchasing expensive high-end GPUs. In this paper, we introduce Metis, a system designed to automatically find efficient parallelism plans for distributed training on heterogeneous GPUs. Metis holistically optimizes several key system components, such as profiler, cost estimator, and planner, which were limited to single GPU types, to now efficiently leverage compute powers and memory capacities of diverse GPU types. This enables Metis to achieve fine-grained distribution of training workloads across heterogeneous GPUs, improving resource efficiency. However, the search space designed for automatic parallelism in this complexity would be prohibitively expensive to navigate.
To address this issue, Metis develops a new search algorithm that efficiently prunes large search spaces and balances loads with heterogeneity-awareness, while preferring data parallelism over tensor parallelism within a pipeline stage to take advantage of its superior computation and communication trade-offs. Our evaluation with three large models (GPT-3, MoE, and Wide-ResNet) on combinations of three types of GPUs demonstrates that Metis finds better parallelism plans than traditional methods, with 1.05–8.43× training speed-ups, while requiring less profiling and search time. Compared to oracle planning that delivers the fastest parallel training, Metis finds near-optimal solutions while reducing profiling and search overheads by orders of magnitude.
2024-09-12 | Monarch: A Fuzzing Framework for Distributed File Systems
Tao Lyu, EPFL; Liyi Zhang, University of Waterloo; Zhiyao Feng, Yueyang Pan, and Yujie Ren, EPFL; Meng Xu, University of Waterloo; Mathias Payer and Sanidhya Kashyap, EPFL
Distributed file systems (DFSes) are prone to bugs. Although numerous bug-finding techniques have been applied to DFSes, static analysis does not scale well with the sheer complexity of DFS codebases while dynamic methods (e.g., regression testing) are limited by the quality of test cases. Although both can be improved by pouring in manual effort, they are less practical when facing a diverse set of real-world DFSes. Fuzzing, on the other hand, has shown great success in local systems. However, several problems exist if we apply existing fuzzers to DFSes as they 1) cannot test multiple components of DFSes holistically; 2) miss the critical testing aspects of DFSes (e.g., distributed faults); 3) have not yet explored the practical state representations as fuzzing feedback; and 4) lack checkers for asserting semantic bugs unique to DFSes.
In this paper, we introduce MONARCH, a multi-node fuzzing framework to test all POSIX-compliant DFSes under one umbrella. MONARCH pioneers push-button fuzzing for DFSes with a new set of building blocks to the fuzzing toolbox: 1) A multi-node fuzzing architecture for testing diverse DFSes from a holistic perspective; 2) A two-step mutator for testing DFSes with syscalls and faults; 3) Practical execution state representations with a unified coverage collection scheme across execution contexts; 4) A new DFSes semantic checker SYMSC. We applied MONARCH to six DFSes and uncovered a total of 48 bugs, including a bug whose existence can be traced back to the initial release of the DFSes.
2024-09-12 | FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
Mengwei Xu, Dongqi Cai, Yaozong Wu, Xiang Li, and Shangguang Wang, Beijing University of Posts and Telecommunications (BUPT)
Large Language Models (LLMs) are transforming the landscape of mobile intelligence. Federated Learning (FL), a method to preserve user data privacy, is often employed in fine-tuning LLMs to downstream mobile tasks, i.e., FedLLM. A vital challenge of FedLLM is the tension between LLM complexity and resource constraint of mobile devices.
In response to this challenge, this work introduces FwdLLM, an innovative FL protocol designed to enhance FedLLM efficiency. The key idea of FwdLLM is to employ backpropagation (BP)-free training methods, requiring devices only to execute "perturbed inferences". Consequently, FwdLLM delivers far better memory and time efficiency (expedited by mobile NPUs and an expanded array of participant devices). FwdLLM centers around three key designs: (1) it combines BP-free training with parameter-efficient training methods, an essential way to scale the approach to the LLM era; (2) it systematically and adaptively allocates computational loads across devices, striking a careful balance between convergence speed and accuracy; (3) it discriminatively samples perturbed predictions that are more valuable to model convergence. Comprehensive experiments illustrate FwdLLM's significant advantages over conventional methods, including up to three orders of magnitude faster convergence and a 4.6× reduction in memory footprint. Uniquely, FwdLLM paves the way for federated billion-parameter LLMs such as LLaMA on COTS mobile devices -- a feat previously unattained.
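The "perturbed inference" idea above is a form of zeroth-order (BP-free) training: a gradient is estimated from forward passes at randomly perturbed weights. The sketch below is our generic illustration of that idea, not the paper's protocol; the function names, hyperparameters, and toy quadratic loss are all assumptions.

```python
# BP-free gradient estimation: two forward passes at w + eps*u and
# w - eps*u give a directional derivative along a random direction u,
# so no backpropagation state needs to be kept in memory.

import random

def perturbed_gradient(loss, weights, eps=1e-3, seed=0):
    """Zeroth-order estimate: ((L(w+eps*u) - L(w-eps*u)) / (2*eps)) * u."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in weights]
    w_plus = [w + eps * d for w, d in zip(weights, u)]
    w_minus = [w - eps * d for w, d in zip(weights, u)]
    scale = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return [scale * d for d in u]

def sgd_step(loss, weights, lr=0.05, seed=0):
    """One descent step using only perturbed inferences."""
    g = perturbed_gradient(loss, weights, seed=seed)
    return [w - lr * gi for w, gi in zip(weights, g)]

# Minimising a toy loss L(w) = sum(w_i^2) with repeated perturbed steps.
loss = lambda w: sum(x * x for x in w)
w = [1.0, -2.0]
for step in range(200):
    w = sgd_step(loss, w, seed=step)
```

The memory saving is the point: each device holds only weights and activations of a forward pass, which is what makes on-device LLM finetuning plausible.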

2024-09-12 | UniMem: Redesigning Disaggregated Memory within A Unified Local-Remote Memory Hierarchy
Yijie Zhong, Minqiang Zhou, and Zhirong Shen, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University
Disaggregated memory (DM) has been proposed as a feasible solution for scaling memory capacity, and a variety of memory disaggregation approaches have been introduced to facilitate its practical use. The cache-coherent-based DM system, which relies on a cache-coherent accelerator, can offer network-attached memory as NUMA memory. However, current cache-coherent-based DM systems introduce an extra address translation for each remote memory access. Meanwhile, the local cache mechanism of existing approaches overlooks the inherent cache thrashing and pollution that arise in DM systems. This paper presents UniMem, a cache-coherent-based DM system that proposes a unified local-remote memory hierarchy to remove the extra indirection layer on the remote memory access path. To optimize local memory utilization, UniMem redesigns the local cache mechanism to prevent cache thrashing and pollution. Furthermore, UniMem puts forth a page migration mechanism that promotes frequently used pages from device-attached memory to host memory based not only on page hotness but also on hotness fragmentation. Compared to state-of-the-art systems, UniMem reduces the average memory access time by up to 76.4% and offers substantial improvement in terms of data amplification.

2024-09-12 | SimEnc: A High-Performance Similarity-Preserving Encryption Approach for Deduplication of Encrypted Docker Images
Tong Sun and Bowen Jiang, Zhejiang University; Borui Li, Southeast University; Jiamei Lv, Yi Gao, and Wei Dong, Zhejiang University
Encrypted Docker images are becoming increasingly popular in Docker registries for privacy. As the Docker registry is tasked with managing an increasing number of images, it becomes essential to implement deduplication to conserve storage space. However, deduplication of encrypted images is difficult because deduplication exploits identical content, while encryption tries to make all content look random. Existing state-of-the-art works decompress images and perform message-locked encryption (MLE) to deduplicate encrypted images. Unfortunately, our measurements uncover two limitations in current works: (i) even minor modifications to the image content can hinder MLE deduplication, and (ii) decompressing image layers increases the storage needed for duplicate data and significantly compromises user pull latency and deduplication throughput.
In this paper, we propose SimEnc, a high-performance similarity-preserving encryption approach for deduplication of encrypted Docker images. SimEnc is the first work that integrates the semantic hash technique into MLE to extract semantic information among layers for improving the deduplication ratio. SimEnc builds on a fast similarity space selection mechanism for flexibility. Unlike existing works completely decompressing the layer, we explore a new similarity space by Huffman decoding that achieves a better deduplication ratio and performance. Experiments show that SimEnc outperforms both the state-of-the-art encrypted serverless platform and plaintext Docker registry, reducing storage consumption by up to 261.7% and 54.2%, respectively. Meanwhile, SimEnc can surpass them in terms of pull latency.
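The MLE property that makes encrypted deduplication possible (and that minor edits break) can be illustrated with a toy convergent-encryption sketch: the key is derived from the content itself, so identical plaintexts yield identical ciphertexts. The SHA-256 keystream below is for illustration only; it is neither SimEnc's scheme nor cryptographically sound:

```python
import hashlib

def mle_key(plaintext: bytes) -> bytes:
    # convergent encryption: key = hash of the content itself
    return hashlib.sha256(plaintext).digest()

def keystream(key: bytes, length: int) -> bytes:
    # toy counter-mode keystream built from SHA-256 (illustration only)
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def mle_encrypt(plaintext: bytes) -> bytes:
    ks = keystream(mle_key(plaintext), len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

layer_a = b"identical layer content"
layer_b = b"identical layer content"      # duplicate: same ciphertext
layer_c = b"slightly altered layer data"  # one edit: unrelated ciphertext
```

Because `layer_c`'s ciphertext shares nothing with `layer_a`'s, a registry deduplicating ciphertexts saves nothing for near-duplicate layers; SimEnc's similarity-preserving approach targets exactly this gap.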

2024-09-12 | Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, Kuaishou Technology
Recent advancements in training large-scale models have centered on optimizing activation strategies and exploring various parallel training options. One research avenue focuses on enhancing activation-related operations, such as offloading and recomputing. However, there is room for further refinement in these strategies to improve the balance between computation and memory utilization. Another line of work explores different training parallelisms, which often require extensive parameter tuning and achieve suboptimal combinations of parallel options.
To tackle these challenges, this paper introduces a novel method for losslessly accelerating the training of large language models. Specifically, two efficient activation rematerialization strategies are proposed: Pipeline-Parallel-Aware Offloading, which maximizes the use of host memory for storing activations, and Compute-Memory Balanced Checkpointing, which seeks a practical equilibrium between activation memory and computational efficiency. Additionally, the paper presents a highly efficient method for searching optimal hybrid-parallelism parameters, considering both offloading and checkpointing. The efficacy of the proposed method is demonstrated through extensive experiments on public benchmarks with diverse model sizes and context window sizes. For example, the method significantly increases Model FLOPs Utilization (MFU) from 32.3% to 42.7% for a 175B Llama-like model with a context window size of 32,768 on 256 NVIDIA H800 GPUs.

2024-09-12 | A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND
Jaehyun Song and Bumsuk Kim, Sungkyunkwan University; Minwoo Kwak, Yonsei University; Byoungyoung Lee, Seoul National University; Euiseong Seo, Sungkyunkwan University; Jinkyu Jeong, Yonsei University
Serverless computing often utilizes the warm container technique to improve response times. However, this method, which allows the reuse of function containers across different function requests of the same type, creates persistent vulnerabilities in memory and file systems. These vulnerabilities can lead to security breaches such as data leaks. Traditional approaches to address these issues often suffer from performance drawbacks and high memory requirements due to extensive use of user-level snapshots and complex restoration processes.
The paper introduces REWIND, an innovative and efficient serverless function execution platform designed to address these security and efficiency concerns. REWIND ensures that after each function request, the container is reset to an initial state free from any sensitive data, including a thorough restoration of the file system to prevent data leakage. It incorporates a kernel-level memory snapshot management system, which significantly lowers memory usage and accelerates the rewind process. Additionally, REWIND optimizes runtime by reusing memory regions and leveraging the temporal locality of function executions, enhancing performance while maintaining strict data isolation between requests. The REWIND prototype is implemented on OpenWhisk and Linux and evaluated with serverless benchmark workloads. The evaluation results demonstrate that REWIND provides substantial memory savings while maintaining high function execution performance. In particular, the low memory usage allows more warm containers to be kept alive, improving both the throughput and the latency of function execution while preserving isolation between function requests.

2024-09-12 | Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
Dan Graur, Oto Mraz, Muyu Li, and Sepehr Pourghannad, ETH Zurich; Chandramohan A. Thekkath, Google; Ana Klimovic, ETH Zurich
Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can significantly increase training time and cost as expensive GPUs or TPUs idle waiting for input data. Previous work has shown that offloading data preprocessing to remote CPU servers successfully alleviates data stalls and improves training time. However, remote CPU workers in disaggregated data processing systems comprise a significant fraction of total training costs. Meanwhile, current disaggregated solutions often underutilize the CPU and DRAM resources available on ML accelerator nodes. We propose two approaches to alleviate ML input data stalls while minimizing costs. First, we dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth. Second, we analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput. We observe that relaxing commutativity increases throughput while maintaining high model accuracy for a variety of ML data pipelines. We build Pecan, an ML data preprocessing service that automates data preprocessing worker placement and transformation reordering decisions. Pecan reduces preprocessing costs by 87% on average and total training costs by up to 60% compared to training with state-of-the-art disaggregated data preprocessing, and reduces total training costs by 55% on average compared to collocated data preprocessing.
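The intuition behind transformation reordering can be sketched with a toy cost model: a transform late in the pipeline sees fewer items when an earlier transform shrinks the data, so commutative, data-reducing transforms should move earlier. The transform names, costs, size ratios, and brute-force search below are illustrative assumptions, not Pecan's actual algorithm:

```python
from itertools import permutations

# toy pipeline: per-item cost and output/input size ratio are assumptions
transforms = [
    {"name": "decode", "cost": 5.0, "ratio": 1.0, "commutative": False},
    {"name": "augment", "cost": 4.0, "ratio": 1.0, "commutative": True},
    {"name": "filter", "cost": 1.0, "ratio": 0.5, "commutative": True},
]

def pipeline_cost(order):
    # later transforms process fewer items when an earlier one has ratio < 1
    items, total = 1.0, 0.0
    for t in order:
        total += items * t["cost"]
        items *= t["ratio"]
    return total

def best_order(ts):
    # keep non-commutative transforms as a fixed prefix (a simplification)
    # and brute-force the order of the commutative ones
    fixed = [t for t in ts if not t["commutative"]]
    free = [t for t in ts if t["commutative"]]
    return min((fixed + list(p) for p in permutations(free)), key=pipeline_cost)

reordered = best_order(transforms)  # moves "filter" ahead of "augment"
```

Here moving the size-halving `filter` before the expensive `augment` drops the per-item cost from 10.0 to 8.0, which is the kind of throughput gain the reordering pass targets.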

2024-09-12 | Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Haojun Xia, University of Sydney; Zhen Zheng and Xiaoxia Wu, Microsoft; Shiyang Chen, Rutgers University; Zhewei Yao, Stephen Youn, Arash Bakhtiari, and Michael Wyatt, Microsoft; Donglin Zhuang and Zhongzhu Zhou, University of Sydney; Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song, Microsoft
Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while preserving model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging due to (1) unfriendly memory access for model weights with non-power-of-two bit-widths and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization (5-bit, etc.). We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called Quant-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved with 6-bit quantization. Experiments show that Quant-LLM enables inference of LLaMA-70b using only a single GPU, achieving 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at github.com/usyd-fsalab/fp6_llm.
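As a rough illustration of why 6-bit weights shrink a model, here is a simple symmetric integer 6-bit quantizer. Note that the paper's FP6 is a 6-bit floating-point format served by custom Tensor Core kernels; this pure-Python integer sketch only conveys the storage/accuracy trade-off, not the paper's format or kernels:

```python
def quantize_6bit(weights):
    # 6 bits give 64 levels; use a symmetric integer range of [-31, 31]
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # de-quantization at inference time: integer code times the scale
    return [v * scale for v in q]

weights = [0.9, -0.31, 0.005, 0.62, -0.88]   # illustrative FP16-like weights
q, scale = quantize_6bit(weights)
restored = dequantize(q, scale)
```

Each weight now needs 6 bits instead of 16, and the reconstruction error is bounded by half a quantization step (`scale / 2`); the systems challenge the paper tackles is doing this de-quantization fast enough on Tensor Cores despite the awkward 6-bit memory layout.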

2024-09-12 | Config-Snob: Tuning for the Best Configurations of Networking Protocol Stack
Manaf Bin-Yahya, Yifei Zhao, and Hossein Shafieirad, Huawei Technologies Canada; Anthony Ho, Huawei Technologies Canada and University of Waterloo; Shijun Yin and Fanzhao Wang, Huawei Technologies China; Geng Li, Huawei Technologies Canada
Web servers usually use predefined configurations, yet empirical studies have shown that performance can be significantly improved when the configurations of the networking protocol stack (e.g., TCP, QUIC, and congestion control parameters) are carefully tuned, because a “one-size-fits-all” strategy does not exist. However, dynamically tuning the protocol stack's configurations is challenging: first, the configuration space is large, and parameters with complex dependencies must be tuned jointly; second, the network condition space is also large, so an adaptive solution is needed to handle client diversity and network dynamics; and finally, clients endure unsatisfactory performance degradation during learning exploration. To this end, we propose Config-Snob, a protocol tuning solution that selects the best configurations based on historical data. Config-Snob exploits the configuration space by tuning several configuration knobs and provides practical fine-grained client grouping while handling network environment dynamics. Config-Snob uses a controlled exploration approach to minimize performance degradation, and utilizes causal inference (CI) algorithms to boost the tuning optimization. Config-Snob is implemented in a QUIC-based server and deployed in a large-scale production environment. Our extensive experiments show that the proposed solution improves completion time over the default configurations by 15% to 36% (mean) and 62% to 70% (median) in the real deployment.
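The "controlled exploration" idea resembles a bounded epsilon-greedy policy over per-group historical performance: serve the best-known configuration almost always, and explore alternatives with a small, capped probability so few clients ever see a degraded choice. The data structures, config names, and 5% rate below are illustrative assumptions, not Config-Snob's implementation:

```python
import random

def choose_config(group_history, configs, explore_rate=0.05):
    """Pick a protocol configuration for a client group: usually the one
    with the lowest historical completion time, occasionally an
    exploratory alternative, keeping degradation bounded."""
    if random.random() < explore_rate:
        return random.choice(configs)          # controlled exploration
    # exploit: best known config for this client group
    return min(configs, key=lambda c: group_history.get(c, float("inf")))

# hypothetical per-group history: config -> mean completion time (ms)
history = {"cubic/1350B": 120.0, "bbr/1350B": 95.0, "bbr/1500B": 110.0}
configs = list(history)
```

Setting `explore_rate=0` turns the policy into pure exploitation, which is the knob that lets the operator trade learning speed against client-visible degradation.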

2024-09-12 | mmTLS: Scaling the Performance of Encrypted Network Traffic Inspection
Junghan Yoon, Seoul National University; Seunghyun Do and Duckwoo Kim, KAIST; Taejoong Chung, Virginia Tech; KyoungSoo Park, Seoul National University
Modern network-monitoring TLS middleboxes play a critical role in fighting abuse hidden in encrypted network traffic. Unfortunately, operating a TLS middlebox often incurs a huge computational overhead, as it must translate and relay encrypted traffic from one endpoint to the other. We observe that even a simple TLS proxy drops the throughput of end-to-end TLS sessions by 43% to 73%. Worse, recent TLS middlebox security enhancements levy an even heavier computational tax.
In this paper, we present mmTLS, a scalable TLS middlebox development framework that significantly improves traffic inspection performance and provides a TLS event programming library with which one can write a TLS middlebox with ease. mmTLS eliminates the traffic-relaying cost as it operates on a single end-to-end TLS session via secure session-key sharing. This approach is not only beneficial to performance but also naturally guarantees all end-to-end TLS properties except confidentiality. To detect illegal content modification, mmTLS supplements each TLS record with a private tag whose key is known only to the TLS endpoints. We find that the extra overhead of private tag generation and verification is minimal when augmented with the first tag generation. Our evaluation demonstrates that mmTLS outperforms the nginx TLS proxy in split-connection mode by a factor of 2.7 to 41.2, and achieves 179 Gbps of traffic-relaying throughput.
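The private-tag mechanism can be sketched as a MAC keyed by a secret shared only between the TLS endpoints: a middlebox holding the session key can read records, but cannot forge a valid tag after modifying one. The key material and record contents below are illustrative, and HMAC-SHA256 stands in for whatever tag construction mmTLS actually uses:

```python
import hashlib
import hmac

def private_tag(endpoint_key: bytes, record: bytes) -> bytes:
    # keyed with a secret the middlebox never sees, so inspection is
    # possible (via the shared session key) but tampering is detectable
    return hmac.new(endpoint_key, record, hashlib.sha256).digest()

def verify(endpoint_key: bytes, record: bytes, tag: bytes) -> bool:
    # constant-time comparison avoids timing side channels
    return hmac.compare_digest(tag, private_tag(endpoint_key, record))

endpoint_key = b"negotiated-endpoint-only-secret"   # illustrative
record = b"application data record"
tag = private_tag(endpoint_key, record)
```

A record altered in flight fails `verify` at the receiving endpoint, restoring the integrity guarantee that sharing the session key with the middlebox would otherwise weaken.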

2024-09-12 | MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States
Chen Zhang, Rongchao Dong, Haojie Wang, Runxin Zhong, Jike Chen, and Jidong Zhai, Tsinghua University
Real-world deep learning programs are often developed with dynamic programming languages like Python, which usually have complex features, such as built-in functions and dynamic typing. These programs typically execute in eager mode, where tensor operators run without compilation, resulting in poor performance. Conversely, deep learning compilers rely on operator-based computation graphs to optimize program execution. However, complexities in dynamic languages often prevent the conversion of these programs into complete operator graphs, leading to sub-optimal performance.
To address this challenge, we introduce MAGPY to optimize the generation of operator graphs from deep learning programs. MAGPY generates more complete operator graphs by collecting key runtime information through monitoring program execution. MAGPY provides a reference graph to record program execution states and leverages reference relationships to identify state changes that can impact program outputs. This approach significantly reduces analysis complexity, leading to more complete operator graphs. Experimental results demonstrate that MAGPY accelerates complex deep learning programs by up to 2.88× (1.55× on average), and successfully instantiates 93.40% of 1191 real user programs into complete operator graphs.

2024-09-12 | FBMM: Making Memory Management Extensible With Filesystems
Bijan Tabatabai, James Sorenson, and Michael M. Swift, University of Wisconsin—Madison
New memory technologies like CXL promise diverse memory configurations such as tiered memory, far memory, and processing in memory. Operating systems must be modified to support these new hardware configurations for applications to make use of them. While many parts of operating systems are extensible, memory management remains monolithic in most systems, making it cumbersome to add support for a diverse set of new memory policies and mechanisms.
Rather than creating a whole new extensible interface for memory managers, we propose to instead use the memory management callbacks provided by the Linux virtual file system (VFS) to write memory managers, called memory management filesystems (MFSs). Memory is allocated by creating and mapping a file in an MFS's mount directory and freed by deleting the file. Use of an MFS is transparent to applications. We call this system File Based Memory Management (FBMM).
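Under this model, allocation means creating and mapping a file in the MFS mount, and freeing means deleting it. The sketch below imitates that interface from user space, with an ordinary temporary directory standing in for a real MFS mount; the function names and paths are our own, not FBMM's API:

```python
import mmap
import os
import tempfile

mfs_mount = tempfile.mkdtemp()  # stand-in for a real MFS mount directory

def mfs_alloc(name, size):
    # allocate: create and map a file inside the MFS mount
    path = os.path.join(mfs_mount, name)
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        os.ftruncate(fd, size)          # size the backing file
        mem = mmap.mmap(fd, size)       # the mapping is the "allocation"
    finally:
        os.close(fd)
    return path, mem

def mfs_free(path, mem):
    # free: unmap and delete the backing file
    mem.close()
    os.unlink(path)

path, mem = mfs_alloc("region0", 4096)
mem[:5] = b"hello"                      # use the region like ordinary memory
```

In FBMM proper, the interesting policy (tiering, contiguity, bandwidth allocation) lives behind the VFS callbacks of the MFS that backs `mfs_mount`, while applications see only ordinary file mapping.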
Using FBMM, we created a diverse set of standalone memory managers for tiered memory, contiguous allocations, and memory bandwidth allocation, each comprising 500-1500 lines of code. Unlike current approaches that require custom kernels, with FBMM an MFS can be compiled separately from the kernel and loaded dynamically when needed. We measured the overhead of using filesystems for memory management and found it to be less than 8% when allocating a single page, and less than 0.1% when allocating as few as 128 pages. MFSs perform competitively with kernel implementations, and sometimes better due to simpler implementations.

2024-09-12 | OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
Zheng Wang, University of California, San Diego; Yuke Wang, Boyuan Feng, and Guyue Huang, University of California, Santa Barbara; Dheevatsa Mudigere and Bharath Muthiah, Meta; Ang Li, Pacific Northwest National Laboratory; Yufei Ding, University of California, San Diego
The deployment of Deep Learning Recommendation Models (DLRMs) involves the parallelization of extra-large embedding tables (EMTs) on multiple GPUs. Existing works overlook the input-dependent behavior of EMTs and parallelize them in a coarse-grained manner, resulting in unbalanced workload distribution and inter-GPU communication.
To this end, we propose OPER, an algorithm-system co-design with OPtimality-guided Embedding table parallelization for large-scale Recommendation model training and inference. The core idea of OPER is to explore the connection between DLRM inputs and the efficiency of distributed EMTs, aiming to provide a near-optimal parallelization strategy for EMTs. Specifically, we conduct an in-depth analysis of various types of EMT parallelism and propose a heuristic search algorithm to efficiently approximate an empirically near-optimal EMT parallelization. Furthermore, we implement a distributed, shared-memory-based system that supports the lightweight but complex computation and communication patterns of fine-grained EMT parallelization, effectively converting theoretical improvements into real speedups. Extensive evaluation shows that OPER achieves 2.3× and 4.0× speedup on average in training and inference, respectively, over state-of-the-art DLRM frameworks.

2024-09-12 | Conspirator: SmartNIC-Aided Control Plane for Distributed ML Workloads
Yunming Xiao, Northwestern University; Diman Zad Tootaghaj, Aditya Dhakal, Lianjie Cao, and Puneet Sharma, Hewlett Packard Labs; Aleksandar Kuzmanovic, Northwestern University
Modern machine learning (ML) workloads heavily depend on distributing tasks across clusters of server CPUs and specialized accelerators, such as GPUs and TPUs, to achieve optimal performance. Nonetheless, prior research has highlighted the inefficient utilization of computing resources in distributed ML, leading to suboptimal performance. This inefficiency primarily stems from CPU bottlenecks and suboptimal accelerator scheduling. Although numerous proposals have been put forward to address these issues individually, none have effectively tackled both inefficiencies simultaneously. In this paper, we introduce Conspirator, an innovative control-plane design aimed at alleviating both bottlenecks by harnessing the enhanced computing capabilities of SmartNICs. Following the evolving role of SmartNICs, which have transitioned from their initial function of offloading standard networking tasks to serving as programmable connectors between disaggregated computing resources, Conspirator facilitates efficient data transfer without involving host CPUs, thereby circumventing the potential bottlenecks there. Conspirator further integrates a novel scheduling algorithm that takes the heterogeneity of accelerators into consideration and adapts to changing workload dynamics, providing the flexibility to mitigate the second bottleneck. Our evaluation demonstrates that Conspirator may provide a 15% end-to-end completion-time reduction compared to RDMA-based alternatives while being 17% more cost-effective and 44% more power-efficient. Our proposed scheduler also saves 33% of GPU hours compared to naive GPU-sharing schedulers by making close-to-optimal decisions while taking much less time than the optimal NP-hard scheduler.

2024-09-12 | QDSR: Accelerating Layer-7 Load Balancing by Direct Server Return with QUIC
Ziqi Wei, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Zhiqiang Wang, Tencent and Peng Cheng Laboratory; Qing Li, Peng Cheng Laboratory; Yuan Yang, Tsinghua University; Cheng Luo and Fuyu Wang, Tencent; Yong Jiang, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Sijie Yang, Tencent; Zhenhui Yuan, Northumbria University
Layer-7 (L7) load balancing is a crucial capability for cloud service providers to maintain stable and reliable services. However, the high flexibility of L7 load balancers (LBs) and the growing downlink relaying service result in a heavy workload, which significantly increases cloud providers' costs and reduces end-to-end service quality. We propose QDSR, a new L7 load-balancing scheme that uses QUIC and Direct Server Return (DSR). QDSR divides a QUIC connection into independent streams and distributes them to multiple real servers (RSs), enabling the real servers to send data directly to the client simultaneously. Because no redundant relaying is involved, QDSR achieves high performance and low latency and nearly eliminates additional downlink relaying overhead.
To evaluate the performance of QDSR, we implemented all its components using Nginx and Apache Traffic Server, deployed them in a real environment testbed, and conducted large-scale simulation experiments using mahimahi. The experimental results show that QDSR can process an additional 4.8%-18.5% of client requests compared to traditional L7 proxy-based load balancing schemes. It can achieve a maximum throughput that is 12.2 times higher in high-load scenarios and significantly reduce end-to-end latency and first packet latency.
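The core dispatching step, splitting a QUIC connection's independent streams across real servers so that each server replies directly to the client, can be sketched as follows; the round-robin policy and server names are illustrative assumptions, not QDSR's actual assignment logic:

```python
def assign_streams(stream_ids, real_servers):
    # map each independent QUIC stream to a real server; every server
    # then sends its stream data directly to the client (DSR), so the
    # load balancer never relays downlink traffic
    return {sid: real_servers[i % len(real_servers)]
            for i, sid in enumerate(stream_ids)}

# client-initiated bidirectional QUIC streams use IDs 0, 4, 8, ...
mapping = assign_streams([0, 4, 8, 12], ["rs1", "rs2"])
```

Because QUIC streams are independently flow-controlled, each real server can transmit its assigned streams concurrently, which is what lets QDSR remove the LB from the downlink path.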

2024-09-12 | Evaluating Chiplet-based Large-Scale Interconnection Networks via Cycle-Accurate Packet-Parallel Simulation
Yinxiao Feng and Yuchen Wei, Institute for Interdisciplinary Information Sciences, Tsinghua University; Dong Xiang, School of Software, Tsinghua University; Kaisheng Ma, Institute for Interdisciplinary Information Sciences, Tsinghua University
The Chiplet architecture has achieved great success in recent years. However, chiplet-based networks are significantly different from traditional networks, thus presenting new challenges in evaluation. On the one hand, on-chiplet and off-chiplet networks are tightly coupled; therefore, the entire heterogeneous network must be designed and evaluated jointly rather than separately. On the other hand, existing network simulators cannot efficiently evaluate large-scale chiplet-based networks with cycle-accurate accuracy.
In this paper, we present the design and implementation of the Chiplet Network Simulator (CNSim), a cycle-accurate, packet-parallel simulator supporting efficient simulation of large-scale chiplet-based (shared-memory) networks. In CNSim, a packet-centric simulation architecture and an atomics-based hyper-threading mechanism are adopted, accelerating simulation by 11× to 14× compared with existing cycle-accurate simulators. In addition, we implement a heterogeneous router/link microarchitecture and many other features, including hierarchical topologies, adaptive routing, and real-workload trace integration. Based on CNSim, two typical chiplet-based networks, which cannot be efficiently simulated by existing simulators, are systematically evaluated. The advantages and limitations of chiplet-based networks are revealed through systematic cycle-accurate simulations. The simulator and evaluation framework are open-sourced to the community.

2024-09-12 | FlexMem: Adaptive Page Profiling and Migration for Tiered Memory
Dong Xu, University of California, Merced; Junhee Ryu, Jinho Baek, and Kwangsik Shin, SK hynix; Pengfei Su and Dong Li, University of California, Merced
Tiered memory, combining multiple memory components with different performance and capacity, provides a cost-effective solution to increase memory capacity and improve memory utilization. Existing system software for managing tiered memory often has limitations: (1) rigid memory profiling methods that either fail to capture emerging memory access patterns in time or lose profiling quality, (2) rigid page demotion (i.e., the number of pages to demote is driven by an invariant requirement on free memory space), and (3) a rigid warm page range (i.e., emerging hot pages) that leads to unnecessary page demotion from fast to slow memory. To address these limitations, we introduce FlexMem, a page profiling and migration system for tiered memory. FlexMem combines performance counter-based and page-hinting-fault-based profiling to improve profiling quality, dynamically decides the number of pages to demote based on the need to accommodate hot pages (i.e., frequently accessed pages), and dynamically decides the warm page range based on how often pages in the range are promoted to hot pages. We evaluate FlexMem with common memory-intensive benchmarks. Compared to the state of the art (Tiering-0.8, TPP, and MEMTIS), FlexMem improves performance by 32%, 23%, and 27% on average, respectively.
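The two adaptive policies can be sketched as small decision functions: demote only as many pages as the incoming hot pages require (rather than chasing a fixed free-memory watermark), and grow or shrink the warm range based on how often warm pages actually turn hot. The thresholds and growth factors below are toy assumptions, not FlexMem's tuned values:

```python
def pages_to_demote(fast_tier_free, incoming_hot_pages):
    # demote only what is needed to accommodate emerging hot pages,
    # instead of maintaining an invariant amount of free fast memory
    return max(0, incoming_hot_pages - fast_tier_free)

def adjust_warm_range(warm_range, promotions_from_warm, warm_page_count):
    # widen the warm range when warm pages often become hot, narrow it
    # when they rarely do, avoiding needless demotions (toy policy)
    rate = promotions_from_warm / max(1, warm_page_count)
    if rate > 0.5:
        return warm_range * 2
    if rate < 0.1:
        return max(1, warm_range // 2)
    return warm_range
```

Keeping demotion demand-driven avoids evicting fast-tier pages that nothing is waiting to replace, which is precisely the "rigid page demotion" problem the abstract calls out.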