
A Survey on User-Space Storage and Its Implementations

Junzhe Li, Xiurui Pan, Shushu Yi, Jie Zhang Junzhe Li is with the Department of Electronics Engineering and Computer Science, Peking University. Email: allenli-@pku.edu.cn Xiurui Pan is with the Department of Electronic Engineering, Tsinghua University. Email: panxiurui@outlook.com Shushu Yi and Jie Zhang are with the School of Computer Science, Peking University. Email: firnyee@gmail.com, jiez@pku.edu.cn
Abstract

The storage stack in the traditional operating system is primarily optimized towards improving CPU utilization and hiding the long I/O latency imposed by slow I/O devices such as hard disk drives (HDDs). However, the emerging storage media have undergone significant technology shifts in the past decade and now exhibit high bandwidth and low latency. These high-performance storage devices, unfortunately, suffer from the huge overheads imposed by the system software, including the long storage stack and the frequent context switches between the user and kernel modes. Many researchers have invested huge efforts in addressing this challenge by constructing a direct software path between a user process and the underlying storage devices. We revisit such novel designs in the prior work and present a survey in this paper. Specifically, we classify the prior research into three categories according to their commonalities. We then present the designs of each category based on the timeline and analyze their uniqueness and contributions. This paper also reviews the applications that exploit the characteristics of these designs. Given that user-space storage is a growing research field, we believe this paper can be an inspiration for future researchers who are interested in user-space storage system designs.

Index Terms:
Storage System, Solid-state disks, Non-volatile memory

1 Introduction

The traditional storage media, such as hard disk drives (HDDs), are commonly considered slow I/O devices, whose performance is multiple orders of magnitude worse than that of the main memory [1, 2]. To prevent the slow devices from stalling the execution of the user applications, a traditional operating system is usually split into user space and kernel space. While users execute their applications in the user space without needing to manage the underlying hardware, the kernel space is responsible for interacting with all the peripheral I/O devices. Although the user-kernel mode switch introduces additional execution time [3], such overheads are fairly minor compared to the slow read/write latencies of the traditional HDDs.

However, as storage technologies shift, a number of high-performance storage devices have emerged, such as storage-class memory (SCM) [4, 5] and NVMe SSDs [6]. Compared to HDDs, these new technologies significantly narrow the performance gap between storage and memory. In addition, they exhibit brand new features that were never offered by legacy storage devices. Specifically, SCM is usually built from non-volatile memory (NVM) [4]. It achieves read/write latencies similar to those of traditional DRAM. It also provides byte-granular data accesses such that user programs can directly access SCM via standard load/store instructions. As a type of storage media, SCM also guarantees data persistency. On the other hand, solid state drives (SSDs) employ several dozens of flash dies, which can serve I/O requests in parallel. They also employ a high-performance communication protocol, called non-volatile memory express (NVMe) [6], which is customized to exploit the internal parallelism of solid state drives. While NVMe SSDs are block devices, the accumulated throughput of the state-of-the-art NVMe SSDs is close to that of commodity main memory.

While the I/O access latencies of the emerging storage devices decrease significantly, many prior works observe that the system software overheads have become the dominant performance bottleneck [3]. For example, as representatives of SCM and NVMe SSDs, an Optane DC persistent memory module (PMM) and an ultra-low-latency (ULL) SSD decrease their read latencies to 100 ns [7] and 3 us [8], respectively. However, the software latency of a user-kernel context switch is reported to be 0.5 to 2 us [9], which is close to or even longer than the I/O access latencies of SCM and NVMe SSDs. In addition, the storage software stack within the kernel space usually performs multiple address translations and boundary checks, which consume over 5 us of software latency for each I/O request [10]. Therefore, the traditional computer system is not capable of unleashing the full benefits of the emerging storage devices, due to the huge performance disparity between the existing system software and the emerging storage devices.

Considering that the involvement of system software in the I/O data path has become the main reason for the performance degradation of the storage system, multiple prior works [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] propose to provide user space with direct access to the underlying storage, referred to as user-space storage. Specifically, many studies [11, 13] concentrate on implementing user-space storage drivers or I/O frameworks in order to grant the user applications full access to the underlying storage resources. Other prior studies [14, 15, 16, 20, 25, 26, 27, 28] propose to decouple memory management from the system software such that address translation and boundary checks can be served directly from user space. Furthermore, there are also multiple new file system designs [18, 29, 19, 17, 30, 21, 22, 31, 23, 24, 32] that minimize the involvement of the kernel in file accesses. These works propose to construct user-space file systems, which can take over the tasks of the traditional file system. This paper gives a survey that summarizes all the efforts in mitigating the penalty imposed by the traditional storage software stack, including the innovative solutions for the storage driver, virtualization and file system.

Figure 1: Three major system components of user-space storage designs and their applications.

The remainder of the paper is organized as follows. Section 2 describes the background and motivation for the current research trend of user-space storage. Section 3 introduces the existing works that explore the designs of user-space storage following the academic lineage and analyzes the advantages and limitations of these works. Section 4 presents the user-space applications. Lastly, Section 5 concludes this paper.

2 Background

2.1 Emerging Storage Techniques

Figure 2: The internal structure of an SSD.

NVMe SSDs. As a replacement for traditional hard disk drives, solid state drives (SSDs) have become the dominant storage media in diverse application domains [33, 34, 35]. Compared to HDDs, SSDs can deliver significantly higher throughput by exposing their internal parallelism. Figure 2 shows the details of an SSD internal. SSDs typically consist of internal DRAM modules, a large number of flash packages, and several controllers and embedded cores over channel and system buses, which are connected to either the memory controller hub (MCH) or the I/O controller hub (ICH). Since the working frequency domains of the host-side hubs (MCH/ICH) and the SSD device(s) are completely different, all I/O requests coming from the host-side hubs are first buffered in the SSD internal DRAM modules. The requests are then transmitted to the data or cache registers of the underlying flash for back-end I/O services. To increase storage capacity and throughput, modern SSDs employ multiple channels, each containing a flash controller and a number of flash packages, also referred to as ways, over its flash system bus (e.g., ONFi [36]). The low-level bandwidth of flash is around 70 MB/s [37], which is far from saturating the bandwidth of the system bus or storage interface (4 to 8 GB/s [38]). Thus, maximizing SSD internal parallelism is key to designing modern high-performance SSDs. In practice, the SSD controller spreads the I/O requests across multiple channels and ways with four different levels of internal parallelism [39, 40]. In addition, a customized communication protocol, referred to as Non-Volatile Memory express (NVMe), is used to enable users to reap the benefits of all levels of SSD internal parallelism [41]. Specifically, NVMe allows the user applications to send I/O commands to up to 64K queues, each with up to 64K entries. Such deep queues allow the host to utilize as many hardware resources (e.g., threads) as possible for I/O accesses, thereby maximizing storage utilization.
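
To make channel- and way-level parallelism concrete, the following is a minimal, purely illustrative sketch of how a controller might stripe consecutive logical pages across channels and ways; the geometry constants and the striping function are hypothetical and are not taken from any cited design.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 8 channels x 4 ways, chosen only for illustration. */
#define NUM_CHANNELS 8
#define NUM_WAYS     4

struct flash_addr {
    uint32_t channel;  /* which channel bus the page is routed to */
    uint32_t way;      /* which package on that channel holds it  */
    uint32_t page;     /* page index within the selected package  */
};

/* Stripe consecutive logical pages round-robin across channels first,
 * then ways, so that sequential I/O exercises all buses in parallel. */
static struct flash_addr stripe(uint64_t logical_page)
{
    struct flash_addr a;
    a.channel = logical_page % NUM_CHANNELS;
    a.way     = (logical_page / NUM_CHANNELS) % NUM_WAYS;
    a.page    = (uint32_t)(logical_page / (NUM_CHANNELS * NUM_WAYS));
    return a;
}

int main(void)
{
    for (uint64_t lp = 0; lp < 16; lp++) {
        struct flash_addr a = stripe(lp);
        printf("logical page %2llu -> channel %u, way %u, page %u\n",
               (unsigned long long)lp, a.channel, a.way, a.page);
    }
    return 0;
}
```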

Storage class memory. Storage class memory (SCM) has attracted a wide range of attention from both academia and industry, as its intrinsic non-volatility, high density and low power consumption can benefit modern datacenters and high-performance computers. There are three standard incarnations of storage class memory: NVDIMM-N [42], NVDIMM-F [43] and NVDIMM-P [44]. NVDIMM-N commonly consists of multiple volatile DRAM modules with a small piece of non-volatile memory (e.g., flash) for backup. On the other hand, NVDIMM-F directly integrates flash into a dual-inline memory module (DIMM). Similar to SSDs, NVDIMM-F provides a high memory capacity, but exposes a block interface to the users. As an ideal type of SCM, NVDIMM-P, such as Optane DC PMM [45], can offer byte-addressable persistency with DRAM-like performance. Thanks to these advantages, NVDIMM-P can be accessed via standard load/store instructions. In practice, enterprise servers (e.g., Intel Xeon Scalable [46]) employ NVDIMM-P with DirectAccess (DAX) [47], which brings the advantages of unprecedented levels of performance and data resiliency [7].
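
As a concrete illustration of byte-addressable load/store access, the sketch below maps a file on a DAX-mounted file system and persists a store with a cache-line flush and a fence. It is a minimal sketch under assumed conditions: the path /mnt/pmem/data is a placeholder for an existing file of at least one page, the CPU must support CLFLUSHOPT (compile with -mclflushopt), and production code would typically rely on a persistent-memory library instead of raw intrinsics.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>

int main(void)
{
    /* Hypothetical pre-existing file on a DAX-mounted NVDIMM-P namespace. */
    int fd = open("/mnt/pmem/data", O_RDWR);
    if (fd < 0)
        return 1;

    /* MAP_SHARED on a DAX file maps NVM pages directly, bypassing the page cache. */
    char *pmem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED)
        return 1;

    /* Byte-granular update via ordinary store instructions. */
    strcpy(pmem, "hello, persistent world");

    /* Flush the dirtied cache line and order it before later stores,
     * so the update reaches the persistence domain. */
    _mm_clflushopt(pmem);
    _mm_sfence();

    munmap(pmem, 4096);
    close(fd);
    return 0;
}
```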

2.2 OS Storage Stack

Figure 3 shows the system structure of a representative operating system (UNIX [48]) from a user process to the flash media. As shown in the figure, an I/O service is initiated by a user-space application. It is then handled by the page cache, file system, multi-queue block layer and NVMe driver residing in the kernel space.

Figure 3: Software stack of UNIX-like operating system.

User-kernel interface. User applications and runtime libraries typically reside in the user space, which can only access isolated resources. On the other hand, the kernel space is the core of the modern operating system and has permission to access any hardware resources attached to the computer system. To obtain services beyond the permissions of user space, a user application can communicate with the OS kernel via a customized interface, referred to as a system call [49]. The system calls specify the tasks that the user space can hand over to the OS kernel. The tasks include I/O request handling [50], CPU scheduling [51], demand paging [52] and page swapping [53]. Once the OS kernel finishes the execution of a system call, it resumes the execution of the user application residing in the user space. This mechanism is referred to as a user-kernel mode switch, which is usually accompanied by a context switch.
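
For instance, each system call in the sketch below traps into the kernel and returns to user mode when the kernel completes the request, so even this tiny program pays several user-kernel mode switches (the file path is a placeholder).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* open() and pread() are system calls: each one traps into the kernel
     * and switches back to user mode when the kernel finishes the request. */
    int fd = open("/tmp/example.dat", O_RDONLY);
    if (fd < 0)
        return 1;

    ssize_t n = pread(fd, buf, sizeof(buf), 0);   /* user -> kernel -> user */
    printf("read %zd bytes\n", n);

    close(fd);                                     /* one more mode switch */
    return 0;
}
```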

Page cache. The Linux page cache stores page-sized chunks of files in DRAM to speed up accesses to files on the underlying storage. The page-sized chunks (e.g., 4KB) are in practice managed by a radix tree, and if data is updated by a process, the target chunk is considered a dirty page. When a user tries to read data over a file descriptor, the page cache receives the request at the beginning of the I/O process. It retrieves the file-related information (the inode) from the underlying file system and keeps it, along with the target page-sized chunk, in DRAM. Even though the page cache efficiently keeps the data by using read-only page table entries or copy-on-write for sharing, the limited DRAM capacity cannot accommodate all pages requested by users. Thus, the page cache flushes dirty pages when the number of dirty pages is greater than a threshold, referred to as dirty_background_ratio or dirty_background_bytes. In addition, the page cache selectively writes back dirty pages that have stayed dirty longer than a timer that the user has configured. To guarantee data persistency, the user can make a system call that synchronizes all I/O operations or balances the dirty memory state. When this type of system call is used, the OS suspends the user process issuing the file writes and performs the flush operations of the page cache. Since writing many dirty pages back to the SSD is a time-consuming task, the threshold is periodically monitored by a kernel thread, bdi_fork_thread, and the thread creates a background task (bdi_writeback_thread) and periodically calls it to flush the dirty pages.
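
As an illustration, an application that needs durability beyond the page cache must explicitly request a flush, for example with fsync(); the sketch below writes through the page cache and then forces the dirty pages of one file to storage (the file path is a placeholder).

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/journal.log", O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    const char *record = "commit\n";

    /* write() only dirties pages in the page cache; the data may still
     * live exclusively in DRAM when write() returns. */
    if (write(fd, record, strlen(record)) < 0)
        return 1;

    /* fsync() blocks the calling process until the kernel has written the
     * dirty pages (and metadata) of this file back to the device. */
    if (fsync(fd) < 0)
        return 1;

    close(fd);
    return 0;
}
```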

File system. A Linux file system (e.g., EXT4) manages the storage space by abstracting all pieces of information. The abstraction is guided by metadata (e.g., the inode), which includes the file name, the offset addresses indicating the beginning and end of a file, permissions, modifications and the last change. When the page cache requests I/O service due to the flush of a dirty page or a read of data from the system's backend, the file system retrieves the corresponding logical block address (LBA) and the length of the request in terms of the number of sectors (512 bytes). The file system then composes a block I/O structure instance, referred to as a bio, and calls a function (e.g., ext4_io_submit) to send it to the underlying block I/O layer.

TABLE I: Classification of relevant research.
Storage Driver (Section 3.1): SPDK [11, 54, 55], NVMeDirect [12]
Virtualization (Section 3.2): Arrakis [14, 15, 16], Moneta-D [20], Simurgh [25], Quill [26], vNVML [27, 28]
FUSE (Section 3.3): Ishiguro et al. [29], Direct-FUSE [18], XFUSE [58], Son et al. [60, 61], EvFS [19], URFS [65]
Comprehensive Design (Section 3.4): Aerie [17], Strata [30], SplitFS [21], ZoFS [22], Kuco [63], UMFS [31], DevFS [23], CrossFS [24], FSP [32]
Application (Section 4): RUMA [56], Breeze [57], HyCache [59], Davram [62], DLFS [64]

Block I/O layer. There are two types of block I/O layers. A conventional block I/O layer maintains a simple request queue as a request interface between the file system and the underlying driver/controller of the target interface. Since this introduces many performance issues due to a single lock on the queue management, a multi-queue block layer, called blk-mq, is employed in most NVMe storage stacks. Rather than exposing a single request queue, blk-mq composes an I/O request using a kernel data structure, called a request, by converting each incoming bio into a request. The blk-mq queues are created/released on a per-CPU or per-node basis. Thus, each CPU submits I/O requests into its own queue without contention on a single lock or interference with other CPUs. To further improve performance, blk-mq checks the target queue's entries and merges the incoming request into an existing entry, which is called aggregation.

Storage interface. Under the block I/O layer, there is a storage interface driver, also known as the host block adapter. The implementation of this driver varies with the type of interface that the underlying SSD employs, but it is mostly responsible for composing commands according to the device-level registers or hardware memory map used for communication. In the case of SATA/IDE, the target system employs a hardware controller (i.e., a disk controller) to manage its storage interface protocol, so the interface driver usually handles I/O interrupts or system memory management. In contrast, in the case of NVMe, a kernel module (the NVMe driver) directly accesses the PCIe bus over memory-mapped I/O and issues the request to the target SSD by composing an nvme_rw_command.

2.3 Challenges

The design of discrete kernel and user spaces provides the traditional computing system with guaranteed isolation, security and consistency. However, this tight collaboration between the user and kernel spaces incurs frequent user-kernel mode switches, which introduce extra latency. Specifically, [10] reports that the user-kernel mode switch costs 2 to 4 us of latency. Considering that the I/O latency of the stale HDD-based storage system is millisecond-scale, the software latency caused by the context switch is relatively minor. However, the I/O latencies of the state-of-the-art storage devices (e.g., SCM and high-performance NVMe SSDs) have decreased to less than 10 us, which in turn exacerbates the software penalty. In addition, multiple prior studies [66, 67, 68, 69] reveal that the storage stack, including the virtual memory, file system, and storage drivers, imposes huge software overheads on the high-performance storage devices. For example, the storage stack requires multiple address translations and boundary checks, which cost around 10 us of latency.

2.4 Classification

To address the large overheads imposed by the tedious software stack of I/O services, there is continuous research on constructing a direct expressway between user space and the I/O devices. We review the prior works of the past decade that aim to address the software overhead issue in the storage system. In general, most prior works concentrate on three system components: the storage driver, the virtualization technique, and the file system, which are shown in Figure 1. Specifically, from the aspect of storage drivers, researchers propose to enable applications to directly access the underlying storage in user space by either moving drivers from the kernel to the user space or exploiting the unique features of the state-of-the-art storage drivers (e.g., NVMe drivers). Virtualization-centric works map storage devices to user space and expose virtual interfaces to the applications, which are thereby permitted to access storage via such interfaces. Lastly, the research on file systems proposes to reassign tasks (e.g., metadata reads/writes and I/O permission checks) among the user space, the kernel space, and the firmware so as to break free from the stale designs of the traditional file system and put forward novel mechanisms that achieve user-space direct access. Moreover, for various scenarios that call for fast access to storage (e.g., high performance computing), researchers have proposed several applications that are optimized for the emerging storage devices. Table I lists our classification of the relevant research based on the core techniques or the application themes of each work.

3 User Space Storage Designs

In this section, we will review the three techniques (i.e., storage driver, virtualization and file system designs) and their related works in detail. Section 3.1 discusses works on storage drivers. Section 3.2 focuses on virtualization-related works. Sections 3.3 and 3.4 review file system designs, concentrating on user-space-only file systems and comprehensive user-kernel-storage file systems, respectively.

3.1 Storage Driver

NVMe engine in user space. One of the most representative NVMe engines in user space is the Storage Performance Development Kit (SPDK), which is developed and released by Intel. SPDK includes a set of new tools and runtime libraries to eliminate the huge overheads imposed by the kernel I/O stack [55, 54, 11]. Specifically, SPDK provides programmers with four software layers: application scheduling, storage services, storage protocols, and drivers. The first three layers allow users to design customized event schedulers, abstract storage resources, and develop diverse storage protocols, respectively. The last software layer (i.e., drivers) plays the fundamental role in mitigating I/O stack overheads in SPDK. SPDK places the NVMe drivers in user space, which provides advantageous properties such as zero-copy and direct access to NVMe SSDs for user-level applications. By exploiting these NVMe drivers, I/O requests from SPDK applications are processed in user space without going through the tedious I/O stack in kernel space. However, implementing drivers in user space raises portability problems. In particular, most applications rely on a uniform API (i.e., POSIX) to access the entire storage stack. Unfortunately, SPDK is incompatible with the POSIX APIs, making it difficult to apply to a wide range of computer systems.
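
The following sketch outlines the typical flow of an SPDK application: probe the NVMe controller with the user-space driver, allocate an I/O queue pair, submit a read, and poll for its completion, all without a system call on the data path. It is a simplified illustration modeled on SPDK's public NVMe API (the calls follow the pattern of SPDK's hello_world example; exact signatures and the required environment setup should be checked against the SPDK documentation).

```c
#include "spdk/stdinc.h"
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;
static struct spdk_nvme_ns *g_ns;

/* Accept the first NVMe controller found on the local PCIe bus. */
static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts)
{
    return true;
}

static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts)
{
    g_ctrlr = ctrlr;
    g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1);   /* first namespace */
}

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    *(int *)arg = 1;
}

int main(void)
{
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    opts.name = "spdk_read_sketch";
    if (spdk_env_init(&opts) < 0)
        return 1;

    /* Attach the user-space NVMe driver; no kernel block layer is involved. */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || g_ns == NULL)
        return 1;

    /* The I/O queue pair lives entirely in user space. */
    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

    /* DMA-able buffer from SPDK's hugepage-backed allocator. */
    void *buf = spdk_zmalloc(4096, 4096, NULL, SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

    int done = 0;
    spdk_nvme_ns_cmd_read(g_ns, qpair, buf, 0 /* start LBA */, 1 /* LBA count */,
                          read_done, &done, 0);

    /* Completions are polled rather than interrupt-driven, again bypassing the kernel. */
    while (!done)
        spdk_nvme_qpair_process_completions(qpair, 0);

    spdk_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}
```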

NVMe engine in kernel space. The portability issue of SPDK is caused by the implementation of the NVMe drivers in user space. To resolve this issue, Kim et al. [13, 12] propose a new user-level I/O framework that mitigates the kernel I/O overheads while keeping the NVMe drivers in kernel space. The I/O framework is called NVMeDirect. It consists of three main components: an admin tool, an I/O queue management module, and a runtime library. The admin tool and the I/O queue management module are implemented in kernel space, while the runtime library is executed in user space. Handling I/O requests in this framework requires the collaboration of the three modules so as to achieve better performance in accessing NVMe SSDs. The user-space runtime library provides user-level APIs and invokes the kernel NVMe driver upon I/O requests from user space. The admin tool takes responsibility for controlling the kernel driver (i.e., the NVMe driver), which sets up the NVMe I/O queues, and for managing the permissions of the I/O queues. The I/O queue management module handles such NVMe I/O queues and provides user-level applications with the flexibility to choose customized I/O policies (e.g., separation of reads and writes in a single thread). NVMeDirect achieves higher I/O throughput and shorter read/write latencies than SPDK and the kernel I/O stack in most cases [12] while resolving the portability issue of SPDK.

3.2 Virtualization

The traditional storage system suffers from software overheads such as context switch and metadata management. In order to reduce such overheads and also to limit kernel involvement, researchers have paid great attention to the virtualization technique [70]. Based on the specific storage component that is being virtualized in each work, we divide prior works on virtualization into three major groups: device virtualization, storage virtualization and NVM virtualization. We show the general system structures of each virtualization technique in Figure 4.

Storage virtualization. Caulfield et al. [20] propose Moneta-Direct (Moneta-D), a new type of storage architecture that integrates storage virtualization into its core design. It employs the kernel only to process necessary management operations as the control plane and provides user space with direct and concurrent access to the storage devices. The key design to achieve such goals is the virtual channel, a virtual interface provided to user space that is directly mapped to storage pages. A virtual channel consists of both privileged and unprivileged interfaces. The former are exposed to the kernel for management and permission checks of each channel, while the latter are exposed to user space, enabling it to directly access storage devices via the mapping to the storage pages. Moreover, multiple channels can serve I/O requests simultaneously, thereby realizing concurrent access to the underlying storage.

Device virtualization. Peter et al. [14, 15, 16] propose a specific device model named Arrakis for virtualized I/O, which abstracts the underlying hardware devices as virtual device instances in user space. Specifically, Arrakis presents the storage devices as virtual storage interface cards (VSICs), which function in the same way as their corresponding physical devices from the perspective of users. Arrakis also proposes to handle user-level I/O requests efficiently by splitting the roles of the operating system into a data plane and a control plane. The operations of the data plane are associated with data accesses, such as asynchronous reads and writes. The control plane, on the other hand, manages the core tasks of a system for reliability and security, such as I/O permission checking and hardware resource allocation. For the control plane, the device drivers in the kernel manage the VSICs and the physical devices via a set of operations such as device creation and destruction. These operations do not interfere with data-plane operations during runtime. For data-plane operations, user-level applications can directly issue I/O requests to the VSICs during execution, which, with the support of hardware (e.g., a DMA controller), directly interact with the underlying storage. Thanks to the split of the data plane and the control plane, user-space applications attain direct access to the underlying storage devices without the involvement of the kernel.

NVM virtualization. Device virtualization and storage virtualization exploit the virtualization of storage devices for user-space direct access without specifying a particular kind of storage media. In contrast, other studies focus solely on the virtualization of non-volatile memory (NVM) from the aspects of file systems and user-level libraries.

Moti et al. [25] design a user-space file system named Simurgh, which bases its core design on virtualizing NVM. Considering that NVM achieves performance similar to DRAM and is byte-addressable, Simurgh directly maps NVM into the address space of each application without employing DRAM to cache data and metadata from NVM. In this way, neither data copies between NVM and DRAM nor data buffering in DRAM is necessary. Thus, NVM can be accessed directly in user space without DRAM as the medium. Since metadata is no longer cached in per-process DRAM, it is possible for metadata to be accessed concurrently by independent processes. Moreover, Simurgh adds two additional instructions (i.e., protected jump and return) to the CPU ISA to securely execute user-space functions in privileged mode, which in turn reduces the kernel's involvement during runtime.

(a) Storage Virtualization (e.g., Moneta-D [20]).
(b) Device Virtualization (e.g., Arrakis [14]).
(c) NVM virtualization with the user-level library (e.g., Quill [26], vNVML [27]).
Figure 4: Three major types of virtualization.

Aside from embedding NVM mapping in the file system design (e.g., Simurgh), researchers have also proposed more lightweight tools (i.e., user-level libraries) for NVM mapping. Eisner et al. [26] propose Quill, a user-level library designed for accessing NVM block devices. Quill targets the challenges imposed by the stale memory management mechanism (i.e., paging). Paging is a memory management technique that retrieves data from secondary storage devices (e.g., SSD, NVM) and stores it in main memory (e.g., DRAM) for further use. However, considering that the performance characteristics of NVM (i.e., latency and bandwidth) are comparable to DRAM, copying data from NVM to DRAM cannot benefit from the high speed of DRAM, which makes paging inefficient for NVM. Observing this, Quill tries to avoid paging for NVM block devices. Since NVM can be cached inside the CPU, it is organized as pages directly in the CPU's physical address space. When a user-level application accesses a file, Quill takes it over and uses the mmap() function to map the physical pages of the NVM device into the virtual address space of the user-space application. The files stored in NVM can then be accessed directly by the application, thus eliminating the need for paging.

Rather than solely focusing on the latency and bandwidth properties of NVM, Chou et al. [27, 28] design a user-level library named vNVML, which exploits other features of NVM. vNVML exploits the byte-addressability of NVM by providing a conventional file interface with memory mapping (mmap), which enables applications to access data through byte-level load/store instructions. Moreover, vNVML extends the available physical storage of the virtual NVM by integrating DRAM and the underlying storage devices (e.g., SSDs) into its design. vNVML leverages DRAM and NVM as caches for the underlying storage devices. Data read from the underlying storage are cached in DRAM, and NVM functions as both the log buffer and the write cache. When the user-level application issues read or write requests to the virtual NVM, it is actually accessing a cache backed by the underlying storage, which has a much larger physical capacity than NVM. Therefore, the virtual NVM appears much larger from the users' perspective.

3.3 User-Space File Systems

While the prior works on storage drivers and virtualization techniques enable user-space applications to directly access data from storage, the kernel still plays a significant role in the entire computing system. Specifically, the kernel needs to control the behaviors of the storage drivers and maintain the mapping from storage to user-space address spaces. Since file systems play an essential role in accessing the underlying storage, rather than simply concentrating on one particular aspect of the storage system, such as the storage driver or the virtualization techniques, researchers have committed to redesigning file systems so as to minimize the kernel's involvement in data accesses.

General file systems reside in kernel space. User applications can interact with the file systems via customized system calls (e.g., open). However, developing file systems in the kernel can be complicated and challenging because kernel code is deeply coupled and its architecture is complex. Therefore, user-space file systems have attracted researchers' attention since they are easier to develop. In this subsection, we mainly review the works focusing on user-space file systems that seek better performance in accessing storage devices, including the widely used framework called Filesystem in Userspace (FUSE), FUSE-based designs, and other user-space file systems that focus on particular types of storage media (e.g., NVM and NVMe SSDs).

FUSE overview. To explore new paths for storage accesses, Filesystem in Userspace (FUSE) provides users with fundamental software interfaces to develop customized file systems in user space [71, 72, 73], minimizing interference from the kernel.

Specifically, FUSE consists of modules residing in both the user and kernel spaces. The kernel module is registered as a file system driver, also named fuse. It manages kernel operations with the virtual file system (VFS), an interface that supports concrete file systems. The user-space modules, including the libfuse library and the fuse daemon, are responsible for setting up the file system in user space and handling file system calls. Using FUSE to implement a new file system requires users to write a handler program, which specifies the behavior of the file system in responding to read/write requests from user space, and to link this program against the libfuse library. After FUSE is mounted, this handler program is registered with the kernel module (i.e., fuse) for handling runtime data requests.

The process of executing a system call in FUSE can be regarded as a client-server model, where the kernel fuse module is the client and the fuse daemon in user space is the server. When a system call, such as read() or write(), is issued from user space, the VFS first invokes the default handler in fuse to fetch the data from the page cache. The requested data is returned directly to user space if found in the page cache; only one user-kernel context switch is required in this case. If not found, the request is redirected to the user-space fuse daemon through libfuse, which invokes the user-defined handler. The handler fetches the data according to the user's specification and returns it, through the kernel, back to the requesting application. In this process, two user-kernel context switches are needed, as shown in Figure 5: one when the kernel forwards the request to the fuse daemon, and one when the daemon returns the requested data through the kernel to the user application.

By providing the software interface (i.e., libfuse), FUSE allows users to develop customized file systems conveniently. Moreover, FUSE moves part of the code execution from kernel to user space. Such reduction of code execution in the kernel reduces the possibility of kernel crashes. In addition, with libfuse residing in user space and being easy to deploy in different environments, it is effortless to port FUSE-based file systems from one environment to another, which in turn gives FUSE its compatibility.
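
As a concrete illustration of this handler-based model, the sketch below implements a minimal read-only file system with the libfuse high-level API (version 3): it exposes a single in-memory file, /hello, served by user-defined getattr and read handlers. The mount point and file name are placeholders, and the readdir/open handlers are omitted for brevity, so this is a sketch rather than a complete file system.

```c
/* Build (assuming libfuse3 is installed):
 *   gcc hello_fs.c `pkg-config fuse3 --cflags --libs` -o hello_fs
 * Mount: ./hello_fs /mnt/hello   (mount point is a placeholder) */
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *hello_str  = "Hello from user space!\n";
static const char *hello_path = "/hello";

/* Report file attributes: the root directory and one read-only regular file. */
static int hello_getattr(const char *path, struct stat *st,
                         struct fuse_file_info *fi)
{
    (void) fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode  = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size  = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

/* Serve read() requests from an in-memory buffer: this handler runs inside
 * the user-space fuse daemon after the kernel forwards the request. */
static int hello_read(const char *path, char *buf, size_t size, off_t off,
                      struct fuse_file_info *fi)
{
    (void) fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    size_t len = strlen(hello_str);
    if ((size_t) off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, hello_str + off, size);
    return (int) size;
}

static const struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* fuse_main() mounts the file system and runs the daemon event loop. */
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```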

Figure 5: Execution of FUSE system calls.

FUSE modification for better performance. While FUSE enjoys many benefits as mentioned above (e.g., compatibility), the kernel is heavily involved in the FUSE framework (e.g., the VFS in handling system calls), which generates side effects. Specifically, when the user-defined handler is invoked, FUSE requires extra user-kernel context switches to transfer control between the kernel and the fuse daemon, which introduces non-negligible software overheads. In addition, extra memory copies, frequent page cache misses, and potential deadlocks all lead to the performance degradation of FUSE [74]. Therefore, although FUSE-based file systems perform well in some scenarios (e.g., web-server workloads), they are not suitable for metadata-intensive workloads (e.g., creating or deleting files over many directories) [75]. To minimize the overheads introduced by the FUSE framework, several modifications to FUSE have been proposed.

Back in 2012, Ishiguro et al. [29] modified the kernel module of FUSE to achieve performance improvements. They design a mechanism that ports the fuse daemon to the kernel and modify the kernel fuse module to adapt to this mechanism. Thus, all the work that used to be done in user space is handled by the kernel. This mechanism avoids redundant context switches and memory copies. Zhu et al. [18], on the other hand, take a different approach. They describe a FUSE-based framework called Direct-FUSE that supports multiple backend file systems without trapping into the kernel. Direct-FUSE divides the FUSE framework into three layers: a bottom, a middle and a top layer. The bottom layer (also known as the backend services) provides the operations of multiple file systems, while the middle layer provides unified file system interfaces for the various backend services. Lastly, the top layer receives file system operations and identifies the corresponding backend service for each operation. All the layers of Direct-FUSE reside in user space, thereby eliminating the non-trivial overheads of user-kernel context switches. Apart from minimizing the heavy overheads imposed by context switches, Huai et al. [58] put forward a new user-space file system infrastructure called XFUSE. XFUSE aims to improve FUSE performance by adapting FUSE to modern multi-core CPU systems and the emerging storage devices (e.g., NVM). To improve parallelism, XFUSE enables multiple file system daemon threads to handle different requests in parallel, thereby increasing scalability on multi-core systems. To match the speed of fast storage devices, it dynamically adjusts the busy-waiting period according to the I/O requests, thus avoiding unnecessary waits and improving I/O throughput.

User-space file systems on specific storage media. Though the FUSE framework has gained great popularity in user-space file system development, it cannot satisfy all the needs raised by the emerging storage devices (e.g., NVM). Since NVM calls for a faster and more lightweight I/O stack, prior works have employed new frameworks when developing user-space file systems to meet such demands. To be specific, Son et al. [60, 61] observe the performance disparity between the slow I/O stack and the fast NVM. They propose a new user-level file system that aims to bridge this performance gap. They provide users with a byte-grained interface to submit I/O requests and construct a user-space file system to serve storage I/O accesses. The proposed scheme bypasses the bulky traditional I/O layers such as the VFS, the generic block layer, and the page cache, enabling user applications to read/write byte-grained data from/to NVM directly and thereby reducing I/O latency.

The NVMe protocol has become the dominant storage interface, which can deliver low-latency and highly parallel access to the underlying non-volatile media (i.e., SSDs). NVMe-based SSDs bring many advantages, including massive I/O queues and faster speed than traditional SSDs. These advantages have raised new challenges for the design of user-space file systems, since they play the major role in accessing NVMe SSDs. Yoshimura et al. [19] and Tu et al. [65] propose EvFS and URFS, respectively. Both are user-level file systems that aim to fully exploit the advantages of NVMe SSDs. EvFS employs SPDK [11] to support various NVMe devices. Specifically, EvFS adopts the event-driven execution model of SPDK for file I/O processing, targeting the communication between users and the page cache in EvFS. Users submit I/O operations as events that invoke the user-level NVMe driver in SPDK directly, which highly utilizes the NVMe bandwidth with few user threads and provides low I/O latency. URFS, on the other hand, explores the parallelism provided by NVMe. It utilizes a series of memory sharing and protection mechanisms to share storage accesses among multiple I/O processes. Applications that execute upon URFS can share multiple NVMe SSDs via a library named fslib, which provides POSIX-like file system APIs for the convenience of developers.

3.4 Comprehensive File System Design

Instead of being constrained to the framework provided by FUSE, researchers have also redesigned user-space file systems from scratch and adopted more aggressive modifications, so that the new file systems better fit the new storage devices. In this subsection, we will review multiple prior file system designs with a special focus on the re-organized duties of kernel space and user space. By examining prior file system designs with respect to the emerging storage techniques, we divide these works into two categories according to the role of the kernel. As shown in Figure 6, several designs [17, 30, 21, 22] rely on the kernel to maintain metadata integrity, security, consistency, etc. The kernel is also in charge of performing permission checks on each data request sent from the user space. For these designs, the kernel is responsible for control-plane operations. Other designs [31, 24, 23, 32], as shown in Figure 7, in contrast, only delegate a few auxiliary responsibilities to the kernel. Namely, the most frequently used functions (e.g., data reads/writes) are migrated to user space or device firmware, while the kernel serves as an auxiliary plane.

3.4.1 Kernel As The Control Plane

Prior works propose user-space file systems that access the storage hardware directly by bypassing the kernel. However, employing only an untrusted user library to manage the whole file system poses security, atomicity, and consistency issues [76]. As a remedy, these file systems use the kernel to handle control-plane operations (e.g., metadata modifications). In this part, we chronologically review the existing file system designs in which the kernel remains the control plane.

(a) Client-Server model (e.g., Aerie [17]).
(b) User-Kernel split model (e.g., Strata [30], SplitFS [21], ZoFS [22]).
Figure 6: Two representative models of kernel being the control plane.

Sidestepping the kernel for SCM. Storage-class memory (SCM), built from NVM, is a type of high-performance storage device with byte-addressability. In other words, data can be accessed directly from SCM through load/store instructions instead of traditional I/O requests. To utilize this characteristic, Volos et al. [17] present Aerie, a file system architecture that exposes storage to user-space applications and thus reduces the software overheads of deep storage stacks. Aerie proposes two key designs: direct access to metadata and the client-server model. Based on these two design principles, Aerie constructs an architecture comprising an untrusted library (libFS) and a trusted file-system service (TFS) in user space, and an SCM manager in kernel space. libFS functions as the client, which has direct access to SCM when performing normal reads and writes or metadata reads. When accessing protected data or changing metadata, it needs to query the server, TFS. TFS manages protected data and metadata in user space, which are hidden from the user applications. TFS has to query the SCM manager for privileged operations. The SCM manager in kernel space then controls the privileges of SCM in storage allocation, address mapping, and extent protection. This client-server model provides direct access to SCM while maintaining metadata security. However, the SCM management relies on a particular hardware module (i.e., the memory controller), which in turn reduces the portability to other storage devices. Besides, Aerie introduces extra latency overheads due to the query process of the client-server model.

Designs for multi-level storage media. Apart from stressing one single storage technique, Kwon et al. [30] propose Strata, a cross-media file system, which leverages the advantages of various storage media (i.e., NVM, SSD, and HDD). Strata reorganizes the duties of the user space and kernel space to collaborate with different storage devices, and stores different data and metadata in the most suitable media. At its core, Strata maintains per-process user-level logs in NVM to record write requests and uses the kernel to digest these updates. Digestion is a special technique proposed in Strata. It is an asynchronous and periodic operation that applies transactions in the log to a kernel-managed shared area, which is built on diverse storage devices. With digestion, private modifications in Strata become visible to other processes. Strata further improves the performance of the storage system by leveraging the locality within the multi-level storage media. Specifically, Strata uses data popularity (i.e., access frequency) to identify hot/cold data and migrates cold data to lower layers of the storage hierarchy while keeping hot data in higher layers. This method assures low latency for popular data, thus increasing overall time efficiency. At a high level, Strata handles data-plane operations (e.g., data writes and reading metadata from logs) in user space while managing the control plane (e.g., metadata writes and digestion) within the kernel. Although Strata demonstrates the desired I/O performance enhancement over prior works with respect to latency and throughput, the intensive kernel involvement when user space accesses storage devices still introduces significant software overheads.

Post-Strata designs on NVM. Though the novel design of Strata [30] casts a spark in the field of constructing file systems that utilize multi-level storage media, it does not fully utilize specific features of new storage techniques (e.g., the byte-addressability of NVM). Accordingly, Kadekodi et al. [21] present SplitFS to fully utilize the byte-addressability of NVM. Similar to Strata, SplitFS splits the file system and its responsibilities between user space and kernel space. Its novelty lies in the way such responsibilities are divided. In contrast to Strata, SplitFS authorizes user-space libraries to handle regular file operations while moving all the metadata operations into kernel space. Though using the kernel to manage metadata raises doubts about the efficiency of I/O requests, SplitFS brings multiple benefits. By reusing a mature and robust PM file system (ext4-DAX) in the kernel for all the metadata operations, it can reduce the implementation complexity, utilize the existing features, and obtain consistency guarantees.

Combining Aerie and Strata. To provide user space with direct access to storage devices, Aerie [17] adopts the idea of the client-server model and Strata [30] focuses on user-kernel collaboration. Chen et al. [63] combine the ideas of these two pioneering works in their hybrid file system. The file system, named Kuco, utilizes a user-space library named Ulib as the client to provide unified POSIX interfaces, and a trusted kernel thread called Kfs as the server to handle requests from Ulib. The user-kernel collaboration mechanism in Kuco offloads most tasks to Ulib and few to Kfs, avoiding the performance bottleneck caused by trapping into the kernel. Kfs mainly handles metadata updates, for which Ulib pre-locates the addresses where Kfs will update the metadata. Thus, the server (Kfs) need not wait for the client when updating metadata, reducing the latency overheads of the client-server model observed in Aerie.

3.4.2 Kernel As The Auxiliary Plane

Although the aforementioned file systems are already able to handle basic read/write requests in user mode, they still trap into the kernel for control-plane management. In other words, these designs cannot completely bypass the kernel, which inspires more radical optimizations. In this part, we take a close look at file systems that further reduce the kernel's involvement; in other words, the kernel becomes auxiliary.

(a) FirmFS (e.g., in DevFS [23], CrossFS [24]).
(b) File system as processes [32].
Figure 7: Typical designs of kernel as the auxiliary plane.

Kernel into the backstage. Considering that the kernel is still involved in most operations in current user-space file systems, Chen et al. [31] present a user-space file system design, named UMFS, to reduce the kernel's involvement. UMFS exploits a contiguous virtual memory space and the hardware MMU to expose files directly to user space, and designs a new mechanism named user-space journaling to guarantee crash consistency. The duty of the kernel is limited to handling physical memory mapping, mounting UMFS during initialization, and managing hardware-privileged operations during I/O requests. Operations that do not require hardware privileges are kept in user space. Through these approaches, UMFS is capable of hiding the kernel in the backstage and achieving direct access to storage in user space when serving I/O requests.

Unlike UMFS, which depends on hardware (i.e., the MMU) to minimize the kernel's involvement, Dong et al. [22] present a method of using software to achieve such a goal, along with a user-space file system named ZoFS. ZoFS proposes a new management unit called a coffer, an abstraction of isolated NVM pages that store files with identical permissions. In ZoFS, user-space libraries take full control of the NVM within a coffer. The kernel only guarantees cross-coffer isolation and is responsible for handling the metadata of coffers and coffer-level requests from user space.

In-storage file system. Limiting the kernel's involvement for kernel bypass while still keeping parts of the file system under the kernel's trust, as in UMFS, cannot fully remove the overheads introduced by the kernel. To address such defects, Kannan et al. [23] propose DevFS, which abandons the aforementioned design ideas in which parts of the file system are kept in kernel space, such as the kernel hiding in UMFS [31] and the trusted server in Aerie [17]. DevFS takes a brave step forward by integrating the device firmware into the file system design and bypassing the kernel for control-plane operations. Specifically, DevFS constructs a firmware file system that resides in the storage hardware and utilizes device-level DRAM and CPU cores. With the hardware directly exposed to the file system, hardware features such as device-level capacity are visible to users. The on-device file system also exposes standard POSIX interfaces to user-space applications. Therefore, user applications can directly access storage devices through DevFS without trapping into the kernel. With such a design, DevFS surpasses prior works by providing a true direct-access file system that minimizes the kernel's involvement. However, the reliance on hardware assistance, such as device-level CPUs, limits its portability to other storage systems with different device-level features. In addition, device-level CPUs are typically slower than host CPUs, which imposes a stricter performance limit on DevFS.

Synergizing user, kernel and firmware file systems. DevFS [23] is the pioneer in constructing firmware-level file systems, which exploit firmware-level features to build user-accessible file systems. Such file systems are usually referred to as Firmware-FS. Nevertheless, Firmware-FS does not fully utilize the multi-core parallelism of its host system. Besides, neither user-level file systems (User-FS) nor kernel-level file systems (Kernel-FS) can achieve fully direct access. Observing such shortcomings of current file system designs, Ren et al. [24] propose CrossFS, a synergistic design that disaggregates the file system across the user level (LibFS), the OS layer, and the device firmware (FirmFS), thereby taking advantage of each layer and optimizing the overall performance. To achieve fine-grained concurrency, CrossFS utilizes the file descriptor (rather than the inode) as the basic synchronization unit and assigns each file descriptor an I/O queue (named FD-queue) for request submission. With this abstraction, CrossFS dispatches different file system functions to different layers. The user-level LibFS, using host resources (e.g., the host CPU), provides unified POSIX semantics to user applications and converts received POSIX system calls into FirmFS-dedicated I/O commands. Then, FirmFS uses on-device resources (e.g., device DRAM and CPUs) to fetch commands from the FD-queues and apply them to the underlying hardware. FirmFS is also responsible for metadata management, data journaling, and permission checks. On the other hand, the duty of the OS layer is limited to FD-queue initialization, file system mounting and garbage collection, which are rarely invoked. By disaggregating the file system across different layers and utilizing both host and device resources, CrossFS achieves overall performance enhancement over a single User-FS, Kernel-FS, or Firmware-FS.

File system as user process. In contrast to DevFS [23] and CrossFS [24], Liu et al. [32] are not content with firmware file systems due to their limited on-device computing resources. They put forward a novel idea called file systems as processes (FSP). FSP constructs a file system that runs as a standalone user process and invokes the kernel only during its initialization. Replacing the role of the kernel, FSP becomes the mediator between user applications and storage devices. FSP provides channels, each consisting of a control plane and a data plane, to user applications, allowing them to issue I/O requests to storage devices via these channels. During I/O processing, FSP takes charge of the request handling and ensures most kernel properties (e.g., metadata integrity and crash consistency) using these channels. Therefore, FSP reduces the kernel overheads in delivering file system services without extra hardware assistance.

4 Application

In this section, we give a review of the applications that are either inspired by new storage devices or deployed in modern user-space storage systems.

More productive equipment for developers. Inspired by FUSE, Schuhknecht et al. [56] propose a new approach named RUMA, which aims to manage physical memory allocation in user space. Considering that efficient and secure memory management is crucial for developing data-intensive systems, RUMA argues that traditional methods for memory allocation cannot achieve both flexibility and access performance. To address this problem, without modifying the kernel, RUMA provides a user-space toolset to manipulate the virtual-to-physical mapping, which exposes the physical memory to user space. Therefore, programmers can allocate contiguous memory in the physical address space, which is also available for dynamic adjustment. Similarly, Breeze [57] also focuses on enabling developers to write efficient code while considering NVM in its design. The directly accessible, low-latency and byte-addressable properties of NVM offer a wide range of benefits to users. However, it is demanding for programmers to fully exploit these benefits since doing so requires a comprehensive understanding of NVM. Aiming to make NVM programmer-friendly, Breeze launches a toolchain that includes a user-level library and a C compiler, allowing programmers to write NVM-oriented code without particular knowledge of NVM.

Data-intensive storage applications. The arrival of the big-data era and the explosion of deep learning have promoted the research of data-intensive computing systems with the new emerging storage devices [77]. Besides storing data, researchers are also concerned with how to access them effectively. Han et al. [78] propose a novel user-level I/O framework for high performance computing (HPC) systems that implements user-level I/O isolation by leveraging multi-streamed SSDs [79]. HyCache [59] focuses on the imbalanced performance between HDDs and SSDs in distributed file systems. By developing a middleware layer that mediates between upper-level distributed file systems and lower-level storage devices, it proposes a user-level file system design that manages diverse storage devices (e.g., HDDs and SSDs) and leverages their properties for performance improvement. Davram [62] observes the lack of a systematic and efficient memory allocation mechanism in distributed big-data systems. To address this challenge, Davram proposes a user-level memory management middleware, which enables non-privileged users to access distributed virtual memory without development and performance overheads. It manages the data swapping between persistent storage and transient memory, and exposes low-level memory design details to users, thereby achieving user-level access to low-level memory. DLFS [64], or the Deep Learning File System, notices the inefficient resource arrangement of deep learning applications on HPC systems as well as the high performance of NVMe devices. It then proposes a new user-level file system that disaggregates storage across NVMe devices and provides efficient methods to handle metadata and I/O services. The evaluation of DLFS shows significant performance enhancement with much lower CPU utilization compared to conventional file systems in deep learning applications.

5 Conclusions and Future Work

The deployment of the emerging storage devices in the existing memory hierarchy has significantly increased the I/O bandwidth and reduced the I/O latency. However, this technology shift imposes huge demands on the evolution of traditional operating systems. This challenge inspires researchers to explore user-space storage system designs, which strive to remove the kernel's involvement from the data path as much as possible while maintaining the consistency and security of the entire operating system. In this paper, we review the research of the past decade that targets this challenge and summarize it as a survey. Specifically, we categorize the prior works into different system layers and user-level applications. In each category, we compare different works in terms of their similarities and differences.

References

  • [1] S. Mittal and J. S. Vetter, “A survey of software techniques for using non-volatile memories for storage and main memory systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, pp. 1537–1550, 2015.
  • [2] Y. Zhang and S. Swanson, “A study of application performance with non-volatile main memory,” in 2015 31st Symposium on Mass Storage Systems and Technologies (MSST).   IEEE, 2015, pp. 1–10.
  • [3] J. Zhang, D. Donofrio, J. Shalf, M. T. Kandemir, and M. Jung, “Nvmmu: A non-volatile memory management unit for heterogeneous gpu-ssd architectures,” in 2015 International Conference on Parallel Architecture and Compilation (PACT).   IEEE, 2015, pp. 13–24.
  • [4] R. F. Freitas and W. W. Wilcke, “Storage-Class Memory: The next Storage System Technology,” IBM J. Res. Dev., vol. 52, no. 4, p. 439–447, jul 2008. [Online]. Available: https://doi.org/10.1147/rd.524.0439
  • [5] A. K. Kamath, L. Monis, A. T. Karthik, and B. Talawar, “Storage class memory: Principles, problems, and possibilities,” ArXiv, vol. abs/1909.12221, 2019.
  • [6] "NVM express," https://nvmexpress.org/.
  • [7] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor et al., “Basic performance measurements of the intel optane dc persistent memory module,” arXiv preprint arXiv:1903.05714, 2019.
  • [8] W. Cheong, C. Yoon, S. Woo, K. Han, D. Kim, C. Lee, Y. Choi, S. Kim, D. Kang, G. Yu et al., "A flash memory controller for 15 μs ultra-low-latency SSD using high-speed 3D NAND flash with 3 μs read time," in 2018 IEEE International Solid-State Circuits Conference (ISSCC).   IEEE, 2018, pp. 338–340.
  • [9] D. Le Moal, “I/O latency optimization with polling,” in Vault Linux Storage and Filesystems Conference, 2017.
  • [10] J. Huang, A. Badam, M. K. Qureshi, and K. Schwan, “Unified address translation for memory-mapped ssds with flashmap,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 580–591.
  • [11] Z. Yang, J. R. Harris, B. Walker, D. Verkamp, C. Liu, C. Chang, G. Cao, J. Stern, V. Verma, and L. E. Paul, “SPDK: A Development Kit to Build High Performance Storage Applications,” in 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2017, pp. 154–161.
  • [12] H.-J. Kim and J.-S. Kim, “A user-space storage I/O framework for NVMe SSDs in mobile smart devices,” IEEE Transactions on Consumer Electronics, vol. 63, no. 1, pp. 28–35, 2017.
  • [13] H.-J. Kim, Y.-S. Lee, and J.-S. Kim, “NVMeDirect: A User-Space I/O Framework for Application-Specific Optimization on NVMe SSDs,” in Proceedings of the 8th USENIX Conference on Hot Topics in Storage and File Systems, ser. HotStorage’16.   USA: USENIX Association, 2016, p. 41–45.
  • [14] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe, “Arrakis: The Operating System Is the Control Plane,” ACM Trans. Comput. Syst., vol. 33, no. 4, nov 2015. [Online]. Available: https://doi.org/10.1145/2812806
  • [15] S. Peter and T. Anderson, “Arrakis: A case for the end of the empire,” in 14th Workshop on Hot Topics in Operating Systems (HotOS XIV), 2013.
  • [16] S. Peter, J. Li, D. Woos, I. Zhang, D. R. K. Ports, T. Anderson, A. Krishnamurthy, and M. Zbikowski, “Towards high-performance application-level storage management,” ser. HotStorage’14.   USA: USENIX Association, 2014, p. 7.
  • [17] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, and M. M. Swift, “Aerie: Flexible File-System Interfaces to Storage-Class Memory,” in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys ’14.   New York, NY, USA: Association for Computing Machinery, 2014. [Online]. Available: https://doi.org/10.1145/2592798.2592810
  • [18] Y. Zhu, T. Wang, K. Mohror, A. Moody, K. Sato, M. Khan, and W. Yu, “Direct-fuse: Removing the middleman for high-performance fuse file system support,” in Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers, 2018, pp. 1–8.
  • [19] T. Yoshimura, T. Chiba, and H. Horii, “Evfs: User-level, event-driven file system for non-volatile memory,” in 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), 2019.
  • [20] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson, “Providing safe, user space access to fast, solid state disks,” ACM SIGPLAN Notices, vol. 47, no. 4, pp. 387–400, 2012.
  • [21] R. Kadekodi, S. K. Lee, S. Kashyap, T. Kim, A. Kolli, and V. Chidambaram, “SplitFS: Reducing Software Overhead in File Systems for Persistent Memory,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP ’19.   New York, NY, USA: Association for Computing Machinery, 2019, p. 494–508. [Online]. Available: https://doi.org/10.1145/3341301.3359631
  • [22] M. Dong, H. Bu, J. Yi, B. Dong, and H. Chen, “Performance and Protection in the ZoFS User-Space NVM File System,” ser. SOSP ’19.   New York, NY, USA: Association for Computing Machinery, 2019, p. 478–493. [Online]. Available: https://doi.org/10.1145/3341301.3359637
  • [23] S. Kannan, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, Y. Wang, J. Xu, and G. Palani, “Designing a True Direct-Access File System with DevFS,” in Proceedings of the 16th USENIX Conference on File and Storage Technologies, ser. FAST’18.   USA: USENIX Association, 2018, p. 241–255.
  • [24] Y. Ren, C. Min, and S. Kannan, CrossFS: A Cross-Layered Direct-Access File System.   USA: USENIX Association, 2020.
  • [25] N. Moti, F. Schimmelpfennig, R. Salkhordeh, D. Klopp, T. Cortes, U. Rückert, and A. Brinkmann, “Simurgh: a fully decentralized and secure NVMM user space file system,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
  • [26] L. A. Eisner, T. Mollov, and S. J. Swanson, Quill: Exploiting fast non-volatile memory by transparently bypassing the file system.   Citeseer, 2013.
  • [27] C. C. Chou, J. Jung, A. N. Reddy, P. V. Gratz, and D. Voigt, “vNVML: An efficient user space library for virtualizing and sharing non-volatile memories,” in 2019 35th Symposium on Mass Storage Systems and Technologies (MSST).   IEEE, 2019, pp. 103–115.
  • [28] C. C. Chou, J. Jung, A. Reddy, P. V. Gratz, and D. Voigt, “Virtualize and share non-volatile memories in user space,” CCF Transactions on High Performance Computing, vol. 2, no. 1, pp. 16–35, 2020.
  • [29] S. Ishiguro, J. Murakami, Y. Oyama, and O. Tatebe, “Optimizing local file accesses for fuse-based distributed storage,” in 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012, pp. 760–765.
  • [30] Y. Kwon, H. Fingler, T. Hunt, S. Peter, E. Witchel, and T. Anderson, “Strata: A Cross Media File System,” in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP ’17.   New York, NY, USA: Association for Computing Machinery, 2017, p. 460–477. [Online]. Available: https://doi.org/10.1145/3132747.3132770
  • [31] X. Chen, E. H.-M. Sha, Q. Zhuge, T. Wu, W. Jiang, X. Zeng, and L. Wu, “UMFS: An efficient user-space file system for non-volatile memory,” Journal of Systems Architecture, vol. 89, pp. 18–29, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1383762117305064
  • [32] J. Liu, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and S. Kannan, “File systems as processes,” in 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), 2019.
  • [33] Y. Jin and B. Lee, “Chapter one - a comprehensive survey of issues in solid state drives,” ser. Advances in Computers, A. R. Hurson, Ed.   Elsevier, 2019, vol. 114, pp. 1–69. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0065245819300117
  • [34] Q. Xu, H. Siyamwala, M. Ghosh, M. Awasthi, T. Suri, Z. Guz, A. Shayesteh, and V. Balakrishnan, “Performance characterization of hyperscale applications on nvme ssds,” in Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2015, pp. 473–474.
  • [35] Q. Xu, H. Siyamwala, M. Ghosh, T. Suri, M. Awasthi, Z. Guz, A. Shayesteh, and V. Balakrishnan, “Performance analysis of nvme ssds and their implication on real world databases,” in Proceedings of the 8th ACM International Systems and Storage Conference, 2015, pp. 1–11.
  • [36] T. Grunzke, “Onfi 3.0: The path to 400mt/s nand interface speeds,” Flash Memory Summit, Santa Clara, CA, vol. 17, 2010.
  • [37] M. Jung, J. Zhang, A. Abulila, M. Kwon, N. Shahidi, J. Shalf, N. S. Kim, and M. Kandemir, “Simplessd: Modeling solid state drives for holistic system simulation,” IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 37–41, 2017.
  • [38] D. Gouk, M. Kwon, J. Zhang, S. Koh, W. Choi, N. S. Kim, M. Kandemir, and M. Jung, “Amber: Enabling precise full-system simulation with detailed modeling of all ssd resources,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2018, pp. 469–481.
  • [39] S. Koh, C. Lee, M. Kwon, and M. Jung, “Exploring system challenges of ultra-low latency solid state drives,” in 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18), 2018.
  • [40] M. Jung and M. T. Kandemir, “An evaluation of different page allocation strategies on high-speed ssds.” in HotStorage, 2012.
  • [41] J. Zhang, M. Kwon, M. Swift, and M. Jung, “Scalable parallel flash firmware for many-core architectures,” in 18th USENIX Conference on File and Storage Technologies (FAST 20), 2020, pp. 121–136.
  • [42] DDR4 Registered Non-Volatile DIMM (NVDIMM-N). http://agigatech.com/wp-content/uploads/2021/06/Komodo1-DDR4-3200-NVDIMM-N-Datasheet.pdf.
  • [43] A. Sainio, “Nvdimm: changes are here so what’s next,” Memory Computing Summit, 2016.
  • [44] JEDEC DDR5 NVDIMM-P Standards Under Development. https://www.jedec.org/news/pressreleases/jedec-ddr5-nvdimm-pstandards-under-development.
  • [45] Intel Optane DC Persistent Memory. http://www.intel.com/optanedcpersistentmemory/.
  • [46] Intel Xeon Scalable. https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html.
  • [47] Direct Access (DAX). https://www.kernel.org/doc/html/latest/filesystems/dax.html.
  • [48] UNIX. https://unix.org/.
  • [49] System call. https://linux-kernel-labs.github.io/refs/heads/master/lectures/syscalls.html.
  • [50] I/O request handling. https://www.kernel.org/doc/html/latest/virt/acrn/io-request.html.
  • [51] CPU scheduling. https://docs.kernel.org/scheduler/index.html.
  • [52] “Demand paging,” https://tldp.org/LDP/tlk/mm/memory.html.
  • [53] Page swapping. https://www.kernel.org/doc/gorman/html/understand/understand014.html.
  • [54] “SPDK github,” https://github.com/spdk/spdk.
  • [55] “Storage Performance Development Kit (SPDK),” https://spdk.io/.
  • [56] F. M. Schuhknecht, J. Dittrich, and A. Sharma, “Ruma has it: Rewired user-space memory access is possible!” Proc. VLDB Endow., vol. 9, no. 10, p. 768–779, jun 2016. [Online]. Available: https://doi.org/10.14778/2977797.2977803
  • [57] A. Memaripour and S. Swanson, “Breeze: User-level access to non-volatile main memories for legacy software,” in 2018 IEEE 36th International Conference on Computer Design (ICCD), 2018, pp. 413–422.
  • [58] Q. Huai, W. Hsu, J. Lu, H. Liang, H. Xu, and W. Chen, “XFUSE: An infrastructure for running filesystem services in user space,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 863–875.
  • [59] D. Zhao and I. Raicu, “Hycache: A user-level caching middleware for distributed file systems,” in 2013 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum, 2013, pp. 1997–2006.
  • [60] Y. Son, N. Y. Song, H. Han, H. Eom, and H. Y. Yeom, “A user-level file system for fast storage devices,” in 2014 International Conference on Cloud and Autonomic Computing.   IEEE, 2014, pp. 258–264.
  • [61] ——, “Design and evaluation of a user-level file system for fast storage devices,” Cluster Computing, vol. 18, no. 3, pp. 1075–1086, 2015.
  • [62] L. Jiang, K. Wang, and D. Zhao, “Davram: Distributed virtual memory in user space,” in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2018, pp. 344–347.
  • [63] Y. Chen, Y. Lu, B. Zhu, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and J. Shu, “Scalable Persistent Memory File System with Kernel-Userspace Collaboration,” in 19th USENIX Conference on File and Storage Technologies (FAST 21).   USENIX Association, Feb. 2021, pp. 81–95. [Online]. Available: https://www.usenix.org/conference/fast21/presentation/chen-youmin
  • [64] Y. Zhu, W. Yu, B. Jiao, K. Mohror, A. Moody, and F. Chowdhury, “Efficient user-level storage disaggregation for deep learning,” in 2019 IEEE International Conference on Cluster Computing (CLUSTER), 2019, pp. 1–12.
  • [65] Y. Tu, Y. Han, Z. Chen, Z. Chen, and B. Chen, “Urfs: A user-space raw file system based on nvme ssd,” in 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS).   IEEE, 2020, pp. 494–501.
  • [66] L. Barroso, M. Marty, D. Patterson, and P. Ranganathan, “Attack of the killer microseconds,” Communications of the ACM, vol. 60, no. 4, pp. 48–54, 2017.
  • [67] S. Koh, J. Jang, C. Lee, M. Kwon, J. Zhang, and M. Jung, “Faster than flash: An in-depth study of system challenges for emerging ultra-low latency ssds,” in 2019 IEEE International Symposium on Workload Characterization (IISWC).   IEEE, 2019, pp. 216–227.
  • [68] J. Hwang, M. Vuppalapati, S. Peter, and R. Agarwal, “Rearchitecting linux storage stack for μs latency and high throughput,” in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 113–128.
  • [69] D. Hildebrand, A. Povzner, R. Tewari, and V. Tarasov, “Revisiting the storage stack in virtualized NAS environments,” in 3rd Workshop on I/O Virtualization (WIOV 11), 2011.
  • [70] E. Artiaga and T. Cortes, “Using Filesystem Virtualization to Avoid Metadata Bottlenecks,” in 2010 Design, Automation and Test in Europe Conference and Exhibition (DATE 2010).   IEEE, 2010, pp. 562–567.
  • [71] A. Kantee, “Refuse: Userspace fuse reimplementation using puffs,” 2007.
  • [72] S. Narayan, R. K. Mehta, and J. A. Chandy, “User space storage system stack modules with file level control,” 2010.
  • [73] D. Mazières, “A toolkit for user-level file systems,” in Proceedings of the General Track: 2001 USENIX Annual Technical Conference.   USA: USENIX Association, 2001, p. 261–274.
  • [74] A. Rajgarhia and A. Gehani, “Performance and extension of user space file systems,” in Proceedings of the 2010 ACM Symposium on Applied Computing, 2010, pp. 206–213.
  • [75] B. K. R. Vangoor, V. Tarasov, and E. Zadok, “To FUSE or not to FUSE: Performance of User-Space file systems,” in 15th USENIX Conference on File and Storage Technologies (FAST 17), 2017, pp. 59–72.
  • [76] M. Hedayati, S. Gravani, E. Johnson, J. Criswell, M. L. Scott, K. Shen, and M. Marty, “Hodor: Intra-Process isolation for High-Throughput data plane libraries,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 489–504.
  • [77] A. M. Caulfield, J. Coburn, T. Mollov, A. De, A. Akel, J. He, A. Jagatheesan, R. K. Gupta, A. Snavely, and S. Swanson, “Understanding the impact of emerging non-volatile memories on high-performance, io-intensive computing,” in SC’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2010, pp. 1–11.
  • [78] J. Han, D. Koo, G. K. Lockwood, J. J. Lee, H. Eom, and S. Hwang, “Accelerating a burst buffer via user-level I/O isolation,” 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 245–255, 2017.
  • [79] J.-U. Kang, J. Hyun, H. Maeng, and S. Cho, “The multi-streamed Solid-State drive,” in 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 14), 2014.