Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
Abstract
Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, are unable to support this scale. One approach to support this scale is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers.
This work is a first step for the systems research community to develop novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. This work specifically explores latency-bounded inference systems, in contrast to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution and the sparsity of input inference requests. We further evaluate three embedding table mapping strategies on three DLRM-like models and specify challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe only a marginal latency overhead when data-center-scale recommendation models are served with distributed inference: P99 latency increases by only 1% in the best-case configuration. The latency overheads are largely a result of the commodity infrastructure used and the sparsity of embedding tables. Even more encouragingly, we also show how distributed inference can enable efficiency improvements in data-center-scale recommendation serving.
Index Terms:
recommendation, deep learning, distributed systems
I Introduction
Deep learning recommendation models play an important role in providing high-quality internet services and have recently started receiving attention from the systems community [1, 2, 3, 4, 5, 6, 7, 8]. Recommendation workloads have been shown to consume up to 79% of all AI inference cycles at Facebook's data-centers [1] in 2019 and have been added to the MLPerf benchmarking effort [9, 10, 11]. The significance of this data-center workload warrants increased attention to its performance, throughput, and scalability. A major challenge facing this distinct class of deep learning workload is its growing size. Figure 1 shows the rate of recommendation model growth at Facebook. Further, models at the 1-10TB scale have been deployed at Baidu and Google [12, 13, 14]. This growth is spurred by the desire for increased accuracy. Increasing model size, in terms of the number of parameters, is well known to improve generalization and accuracy and is necessary to achieve optimal results given larger training data sets [15, 16, 17, 18, 19, 20]. This is true across a variety of deep learning applications, including language models with large embedding tables, which are similar in this respect to deep recommendation systems. This work does not explore the accuracy effects of increasing model size and instead focuses on the system and performance implications of supporting the massive models used in production deep recommendation systems.

Deep recommendation model size, and by proxy memory capacity, is dominated by embedding tables, which represent learned sparse parameters. Each sparse input is hashed to one or more indices in its corresponding embedding table, where each index maps to a vector. Indexed embedding vectors are then combined in a pooling operation, such as a vector summation. Increasing the hash bucket size and the number of embedding tables increases the information captured by the embeddings, and is thus a straightforward method to improve the accuracy of recommendation models, but it is constrained by the commensurate increase in memory footprint [21]. Reducing hash bucket size will constrain model size, but must be performed with care to maintain model accuracy.


As memory requirements for large recommendation models surpass the memory capacity of a single server, solutions are needed to either constrain model size or increase effective memory capacity. Model compression techniques can be used to constrain model size but can degrade model accuracy [22]. DRAM capacity can be increased, but this is not scalable beyond the single-digit terabyte range. Even then, larger server chassis are required to support large memory capacity. On demand paging of the model from higher capacity storage is another solution, but this requires fast solid-state drives (SSD) to meet latency constraints.
Additional hardware requirements are undesirable in the data-center due to the added complexity of managing heterogeneous resources. For example, clusters with specialized configurations cannot easily expand resources during periods of high activity or efficiently shrink resources during periods of low activity. This is particularly true of workloads affected by diurnal traffic patterns [23]. In contrast to the aforementioned approaches, a distributed serving paradigm can be implemented on existing infrastructure by splitting the model into distinct and independent sub-networks that run on common CPU platforms. Of course, this approach is not without trade-offs: added network latency, network load, and the requirement of additional compute nodes are incurred as a result. However, in a large homogeneous data-center, it is easier to deploy and scale these resources compared to custom, unconventional hardware platforms.
As such, we provide the research community in-depth descriptions and characterizations of distributed inference for deep learning recommender systems. It is a first, scalable solution to the challenge of growing model sizes and presents a baseline system for further optimization. Notably, it is distinct from recent throughput-oriented training systems, which do not have the same latency constraints found in inference serving [12, 15]. The implementation is characterized on real inputs and real, scaled-down deep learning recommendation models. Three model parallelization strategies are explored within this serving paradigm. This challenging, novel workload presents new opportunities for the systems research community in the data-center-scale machine learning domain. The contributions of this work are as follows:
• To our knowledge, this is the first work describing a distributed inference serving infrastructure for at-scale neural recommendation inference.
• We present an in-depth characterization and breakdown of distributed inference's impact on end-to-end recommendation serving latency, tail latency, and operator compute overhead. We further investigate the impact of embedding table placement using model sharding and place our findings in the context of the data-center serving environment.
• Finally, we design a cross-layer, distributed instrumentation framework for performance debugging and optimization analysis to quantify the performance overhead from remote procedure call (RPC) services and the machine learning framework.
The paper is organized as follows: Section II provides a brief introduction to deep learning recommendation. Section III describes how distributed inference is applied and implemented; special attention is given to the sharding strategies used to generate distributed models. Section IV describes the custom tracing framework used to collect workload measurements. Section V provides details about the models, platforms, and inputs used in our characterization. Sections VI and VII present our characterization findings, place them in the context of a data-center serving environment, and identify guidance and key takeaways for system designers. Section VIII provides a discussion of related works. Finally, Sections IX and X offer our concluding thoughts.
II Recommendation Inference At-Scale
Recommendation is the task of recommending, or suggesting, product or content items from a set that are of interest to a user. For example, a service may recommend new video clips, movies, or music based on a user's explicitly liked content and implicitly consumed content. Accuracy of the model is an abstract measure of a user's interest and satisfaction in the recommended results. Traditionally, neighborhood-based techniques like matrix factorization have been used to good effect by providing recommendations based on similarity to other users or similarity of preferred items [24, 25]. However, more recent recommender systems have used deep neural networks to combine a variety of dense and sparse input features into a more generalized predictive model [3, 1, 26, 27, 28, 29]. An example of a dense feature is a user's age; an example of a sparse, or categorical, feature is the set of web pages the user likes. The output of the model is a ranking of the candidate item inputs. Figure 2(a) shows a simplified overview of this deep learning recommendation model architecture.
Today, recommendation inference is performed on the CPU, in contrast to the heterogeneous systems popular with other deep learning workloads [1, 2]. This is because, compared to GPUs and other AI accelerators leveraged in other inference tasks, (1) the sparsity in recommendation models, (2) the evolving nature of recommendation algorithms, and (3) latency-bounded throughput constraints make it challenging for recommendation inference to be served on AI accelerators efficiently at-scale. Throughput, or queries per second (QPS), is a paramount target for inference, but just as important are latency constraints. In order to provide a satisfactory user experience, recommendation results are expected within a timed window. This strict latency constraint defines the service-level agreement (SLA) [1]. If SLA targets cannot be satisfied, the inference request is dropped in favor of a potentially lower quality recommendation result, which could worsen user experience. To maximize throughput within the respective SLA constraints, various techniques are applied, such as batch sizing and model hyperparameter tuning.
II-1 Sparse Feature Operation
Dense features are processed with fully-connected (FC) layers, while sparse features, shown in Figure 2(a), go through a transformation before further feature interaction. The sparse inputs are transformed into a list of access IDs, or hash indices, which index into an embedding table. The size of the embedding table (the number of buckets and the vector dimension) is a tunable hyperparameter. Usually, there is one embedding table per sparse feature, but it is possible for features to share tables to save memory resources. For example, in a table with embedding dimension D, a sparse feature input with k indices will produce k embedding vectors of length D. A pooling operation, such as summation or concatenation, will collapse this k×D matrix along the first dimension to produce a 1×D or 1×(kD) result for use in the feature interaction. In the Caffe2 framework, used in this work, the embedding table operator is called SparseLengthsSum, or SLS, and it is typical to refer to the family of related operators also as SLS. Because of the sheer number of possible inputs, the embedding tables are constrained in size and hash collisions may occur. For example, one FP32 embedding table with a dimension of 32, for 3 billion unique users, will consume over 347GB. The table size would need to be reduced to be tractable on a single server.
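For concreteness, the following minimal numpy sketch mirrors the SLS-style lookup-and-pool semantics described above. The table shape, indices, and lengths are hypothetical values chosen for illustration, not production dimensions.

```python
import numpy as np

# Hypothetical embedding table: 1M hash buckets, embedding dimension 32 (FP32).
# A production-scale table (billions of rows) reaches hundreds of GB,
# which is the capacity pressure discussed above.
table = np.random.rand(1_000_000, 32).astype(np.float32)

def sparse_lengths_sum(table, indices, lengths):
    """Sum-pool embedding rows per sample, following SLS-style semantics:
    `indices` is a flat list of hash indices for the whole batch and
    `lengths` says how many of those indices belong to each sample."""
    out = np.zeros((len(lengths), table.shape[1]), dtype=table.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        out[i] = table[indices[offset:offset + n]].sum(axis=0)  # k x D -> 1 x D
        offset += n
    return out

# One batch of two samples: sample 0 pools 3 rows, sample 1 pools 2 rows.
indices = np.array([17, 93, 12, 40, 8])
lengths = np.array([3, 2])
pooled = sparse_lengths_sum(table, indices, lengths)  # shape (2, 32)
```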
II-2 Substantial Model Growth and Large Embedding Tables
To improve model accuracy and capitalize on the rich, latent information in sparse features, deep learning recommendation models have exploded in size as shown in Figure 1 and recent works [14, 12]. The embedding tables dominate the size of recommendation models and are responsible for the significant growth in model size. As rapid as this growth appears, it is still constrained by the commensurate increase in the memory capacity requirement. In an effort to further improve the accuracy of recommendation models, the total capacity of embedding tables has grown larger than can be supported on a single server node, motivating the need for distributed serving paradigms.

III Distributed Inference
This is the first work to describe distributed, neural network inference in detail for the unique goals and characteristics of deep recommendation systems. Because this initial implementation targets deployability and scalability over performance and efficiency, many optimization opportunities exist across the systems design space, including algorithms, software, microarchitecture, system configuration, and task mapping. Thus, in this section, we provide an overview of the general system design and structure for the research community. At minimum, a distributed inference solution for neural recommendation must enable (1) a greater variety and a larger number of sparse features and (2) larger embedding tables, i.e. larger embedding dimensions, fewer hash collisions, and/or less compression.
III-A Distributed Model Execution
Consider that a deep learning model can be represented by a directed control and dataflow graph. A distributed model simply partitions a neural network into subnets, where each subnet can operate independently. Treating each subnet independently provides desirable flexibility in deployment, since all the infrastructure for serving the model already exists. An example of such partitioning is shown in Figure 2(b). This is traditionally referred to as model or layer parallelism.
III-A1 Model Sharding
We call each independent, partitioned subnet a shard, and the process of creating shards, sharding. Sharding occurs both before and after training. Before training, parameter server shards are generated to hold model parameters for distributed training. After training, during model publishing, parameters are resharded and serialized from parameter servers to the respective inference shard based on a prior partitioning phase. At this stage, other model transformations, e.g. quantization, are executed and all training metadata is already available for sharding decisions. Resharding directly after training also avoids the extra storage, compute, and complexity needed to reload and reshard terabyte-scale models. In Figure 2(b), the main shard performs all the dense layers, and any partitioned subnets are replaced by custom remote-procedure-call (RPC) operators that call remote shards.
Only the sparse operators, like SLS, and their embedding tables are placed on remote shards. This heuristic directly addresses the memory constraint imposed by embedding tables and retains compute density on the main shard. Thus, remote shards are also termed sparse shards. Figure 3 shows this scheme as a sample distributed trace where the main shard, at the top, performs the majority of compute within each net.
Due to the massive size of embedding tables, a single table can also be partitioned across multiple shards, as shown in Figure 2(b). In such a case, the sparse feature IDs are split and sent to the appropriate RPC operator based on a hashing function. This is implemented by partitioning embedding table rows with a simple modulus operator across shards. The serving infrastructure imposes the constraint that graph cycles cannot exist between shards, so each shard is stateless to avoid further complexity. This restriction also provides greater flexibility in a serving environment, where shards may fail and need to restart or replicas may be added.
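A minimal sketch of this row-wise partitioning follows. The modulus routing matches the description above; the local re-indexing and the partial-sum reduction on the main shard are illustrative assumptions (valid for sum pooling, which distributes over the split), not necessarily the production scheme.

```python
import numpy as np

def route_indices(indices, num_shards):
    """Split a flat list of lookup indices across shards with a modulus hash.
    Shard s receives every row whose original index i satisfies
    i % num_shards == s; here we re-index it locally as i // num_shards."""
    per_shard = {s: [] for s in range(num_shards)}
    for i in indices:
        per_shard[int(i) % num_shards].append(int(i) // num_shards)
    return per_shard

def combine_partial_sums(partials):
    """Sum pooling distributes over the split, so the main shard can add the
    partial pooled vectors returned by each shard to recover the full result."""
    return np.sum(np.stack(partials), axis=0)
```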
III-A2 Serving Shards
Distributed inference requires an additional special remote procedure call (RPC) operator, which replaces subnets in the main shard and invokes its respective sparse shard, as shown in Figure 2(b). This scheme enables straightforward scale-out style support for large models. An inference request gets sent to a server with the main shard loaded, and when an RPC op is encountered, a subsequent request to the appropriate sparse shards is issued. Inference serving on all shards is comprised of an RPC service handler, such as Thrift or gRPC, and a machine learning framework, such as Caffe2, PyTorch, or TensorFlow [30, 31, 32]. Each shard runs a full service handler and ML framework instance.
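As a rough illustration of the fan-out performed by the asynchronous RPC operators, the sketch below issues one request per sparse shard in parallel and gathers the pooled results. The `rpc_client.call` interface and the thread-pool-based asynchrony are hypothetical stand-ins, not the Thrift/Caffe2 machinery used in production.

```python
from concurrent.futures import ThreadPoolExecutor

def remote_sls(rpc_client, shard_requests):
    """Issue one request per sparse shard in parallel and gather pooled
    embeddings. Each request carries the lookup indices and lengths for the
    embedding tables hosted on that shard."""
    with ThreadPoolExecutor(max_workers=max(1, len(shard_requests))) as pool:
        futures = {
            shard: pool.submit(rpc_client.call, shard, request)
            for shard, request in shard_requests.items()
        }
        # The main shard blocks here, so the slowest sparse shard bounds the
        # latency added to the embedded portion (see Section IV-B).
        return {shard: f.result() for shard, f in futures.items()}
```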
This distributed architecture supports the same replication infrastructure used in non-distributed inference. Shards are replicated based on their load and the resource needs of large-scale deployment. For example, a shard that requires more compute resources to meet QPS requirements will have copies replicated and deployed on its cluster via a cluster-level hypervisor. The advantage to both main and sparse shards running a full service stack is that they can be replicated independently. Each individual request can then be processed by a different combination of machines than a previous request, further motivating the requirement of stateless shards. The replication infrastructure is not enabled in our experiments because an isolated set of servers is used for in-depth characterization and analysis of per-request overheads. A discussion of projected impacts and interactions of distributed inference on replication is included in Section VII.
III-B Capacity-Driven Model Sharding
Optimal model sharding is a challenging systems problem, due to the number of configurations and varying optimization objectives. While this problem has been studied in the context of training, the inference context presents new challenges. Training must contend with the accuracy implications of mini-batch sizing and parameter synchronization, and is targeted at maximizing throughput on a single instance that’s fed a large dataset [33]. Data-center inference, in contrast, must contend with varying request rates and sizes under strict latency constraints, and the impact of model replication to meet those requirements. Furthermore, in the context of recommendation, memory footprint is dominated by embedding tables which are computationally sparse. Model sharding for deep learning recommendation inference is motivated by the desire to enable huge models, and thus we consider this new challenge capacity-driven sharding.
Due to the number of shard configurations, an exhaustive search for an optimal sharding scheme is intractable. Instead, heuristics are used, which depend on the model architecture and include measurements or estimates of model capacity, compute resources, and communication costs. Naively, the heuristic should aim to minimize the number of shards due to the additional compute and network resources consumed with more shards. We study the resulting impact of this assumption in Section VI. The heuristics are guided by the following observations:
1. Embedding tables used in sparse layers contribute the vast majority (>97%) of model size.
2. Sparse layers are memory and communication bound, while dense layers are compute bound.
3. Existing server infrastructure cannot support the memory requirements of sparse layers but can support the compute and latency requirements of dense layers.
4. Deep recommendation model architectures, as in Figure 2, can execute sparse operators in parallel, which feed successive dense layers.
Because the dense layers do not benefit from additional compute or parallelism provided by distributed inference, our sharding strategies are constrained to only move sparse layers to remote shards. This is specific to the architecture of deep recommendation models. Figure 2 shows that, after embedding table lookup and pooling, all of the resulting embedding vectors and dense features are combined in a series of feature interaction layers. Placing the interaction layer on its own shards results in an undesirable increase in communication and lessening of compute density. Placing this layer with existing sparse shards also increases communication and violates the stateless shard constraint. Alternate architectures may benefit from sharding interaction layers. Consider a model architecture that has dedicated feature interaction layers for specific sets of embedding tables. Such an architecture could shard those feature interaction layers with their respective sparse layers and indeed see performance benefits. While this is an intriguing model architecture, we are limited to models that are currently available; thus this work focuses on the more traditional model in Figure 2(a).
Thus, sharding the sparse layers (1) directly addresses capacity concerns of the largest parameters, (2) effectively parallelizes sparse operators which otherwise execute sequentially, and (3) enables better resource provisioning by isolating the communication- and compute-bound portions of inference. Note that inference is not traditionally operator-parallel because operators do not typically produce enough work to offset scheduling overheads. Extra computing cores are instead utilized by increasing batch-level parallelism.
Three sharding strategies, using the above heuristic, are evaluated in this work (Table I). These strategies address the new and distinct challenges of huge deep learning recommendation models: the varied number and size of embedding tables, non-uniformity of sparsity, and the challenge of the real-time serving environment. Two trivial cases of (1) non-distributed inference, or singular, and (2) a single shard with all embedding tables are also presented in Table I.
TABLE I: Evaluated sharding strategies.
Sharding Strategy | Notes
Singular | Distributed inference disabled. Entire model loaded on one server.
1 Shard - All Tables | Only one sparse shard with all embedding tables.
{2, 4, 8} Shards, Capacity-balanced | Table placement ensures similar total embedding table size per shard.
{2, 4, 8} Shards, Load-balanced | Table placement ensures similar pooling work per shard.
{2, 4, 8} Shards, Net-specific bin-packing (NSBP) | Tables are grouped by ML net and packed into shards until a size limit is reached; tables larger than this limit are effectively given an entire shard.
III-B1 Capacity-balanced
An intuitive strategy is to spread embedding tables evenly across many shards. Capacity-balanced sharding ensures that each sparse shard has the same memory requirements. This serves to minimize the number of shards, with the goal to achieve the least compute resource overhead for a singly served model.
III-B2 Load-balanced
Each sparse feature is represented as a multi-hot encoded vector, which transforms into a multi-index lookup into the embedding table and a final pooling operation. Because the expected number of lookups per table depends on the specific feature and how often it appears in request inputs, capacity-balanced sharding may result in imbalanced shards that perform significantly different amounts of work. Additionally, the number of lookups is proportional to the network bandwidth used to send table indices. This aggregate imbalance can cause certain shards to become a critical-path bottleneck, degrading latency compared to the load-balanced sharding configuration.
The distributed trace shown in Figure 3 demonstrates how one sparse shard may fall on the critical path, although unpredictable variance in network latency must also be considered. Remote shards 1 and 2 are queried asynchronously, in parallel. Because remote shard 1 performs significantly more work, more latency overhead is incurred. To reduce the likelihood of one shard consistently increasing latency, the load-balancing strategy places embedding tables based on their pooling factor, or expected number of lookups. The pooling factor is estimated by sampling 1000 requests from the evaluation dataset and observing the number of lookups per table.
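Both balanced strategies can be viewed as the same placement problem with different per-table cost functions. The sketch below uses a simple greedy heuristic (largest cost first, into the least-loaded shard); this heuristic and the names used are illustrative assumptions rather than the production partitioner. With cost(t) equal to table size it approximates the capacity-balanced configurations in Table II, and with cost(t) equal to the estimated pooling factor it approximates the load-balanced ones.

```python
def balanced_sharding(tables, num_shards, cost):
    """Greedy placement: sort tables by descending cost and assign each to
    the shard with the smallest accumulated cost so far.
    cost(t) = size in bytes            -> capacity-balanced
    cost(t) = estimated pooling factor -> load-balanced
    (pooling factors estimated offline, e.g. from ~1000 sampled requests)."""
    shards = [{"tables": [], "total": 0.0} for _ in range(num_shards)]
    for t in sorted(tables, key=cost, reverse=True):
        target = min(shards, key=lambda s: s["total"])
        target["tables"].append(t)
        target["total"] += cost(t)
    return shards
```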
III-B3 Net-specific bin-packing
Recommendation models frequently separate the user features and content/product features into distinct nets to more effectively batch user-content pairs together [1]. In the models chosen for this work, output of the user net is fed into the content/product net, so they must be executed sequentially. An example of this is shown in Figure 3, where Net 2 is dependent on Net 1. Separate RPC operators will be called when each net is executed to access their respective embedding tables. If tables are not grouped by net, the same shard may be accessed multiple times per batch, for each net. This is undesirable because 1) in Figure 3, three RPC operators are invoked, compared to two (one RPC per shard), and 2) as servers are replicated in a data center environment to handle increased requests, tables for both nets will be duplicated regardless of which net receives more inputs.
This is particularly undesirable if, in Figure 3, server replication of remote shard 1 is triggered by the high compute characteristic of Net 1's pooling. All of Net 2's embedding tables, which may be tens of GBs even after sharding, will consume additional, under-utilized memory resources. To account for this and target more efficient resource usage, a net-specific bin-packing (NSBP) strategy is evaluated, which first groups tables by net and then packs them into bins based on a given size constraint. To reduce the data-center resources incurred during sharding, each bin starts out as one of the existing sparse parameter servers used during training. If a parameter server's bin is already full, it is considered a full shard. This reduces the network bandwidth and compute orchestration required for sharding.
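The sketch below captures the grouping rules of NSBP: tables are grouped by net and packed into size-limited bins, and nets are never mixed within a bin. The production flow seeds bins from the training parameter servers as described above, whereas this simplified version uses plain first-fit packing with a hypothetical per-bin size limit.

```python
from collections import defaultdict

def nsbp_sharding(tables, size_limit_bytes):
    """Net-specific bin-packing sketch. Each table is a dict with a "net"
    label and a "bytes" size. Tables larger than the limit end up alone in
    their own bin, i.e. they are effectively given an entire shard."""
    by_net = defaultdict(list)
    for t in tables:
        by_net[t["net"]].append(t)

    shards = []
    for net, net_tables in by_net.items():
        bins = []
        for t in sorted(net_tables, key=lambda t: t["bytes"], reverse=True):
            placed = False
            for b in bins:  # first-fit into an existing bin of the same net
                if b["bytes"] + t["bytes"] <= size_limit_bytes:
                    b["tables"].append(t)
                    b["bytes"] += t["bytes"]
                    placed = True
                    break
            if not placed:
                bins.append({"net": net, "tables": [t], "bytes": t["bytes"]})
        shards.extend(bins)
    return shards
```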

III-C Distributed Inference Implementation
We present an in-depth description of the customized open-source frameworks used in this work. This provides the reader with a concrete implementation to frame our results and offers guidance for implementation with other frameworks. For this work, distributed inference is built on top of a highly customized variant of Thrift and Caffe2; however, the methodology is generalizable to any RPC or ML framework [30, 31]. While PyTorch has absorbed and succeeded Caffe2 as the state-of-the-art moving forward, the Caffe2 infrastructure used in this work is shared between both frameworks. Thrift serves ranking requests by loading models and splitting received requests into inference batches routed to the appropriate net. A modified version of Caffe2 is used that includes RPC operators that issue Thrift requests. The intermediate requests to sparse shards are routed via a universal service discovery protocol. All inter-server communication occurs through the standard TCP/IP stack over Ethernet. The model is transformed for distributed inference after training. A custom partitioning tool employs a user-supplied configuration to group embedding tables and their operators, insert RPC operators, generate new Caffe2 nets, and then serialize the model to storage.

IV Cross-Layer ML Operator- and Communication-Aware Characterization
The performance of distributed inference is determined by choices at multiple layers of the system, in particular, the data-center service level (scheduling, discovery, and networking), the machine learning framework-level, and then the machine learning operators themselves. Measuring a workload across these layers is important for understanding overheads, attributing costs, targeting components for optimization, and making high-level system design decisions. Because no profiling tools exist to perform such a cross-layer characterization, we built a custom cross-layer distributed tracing framework for measuring distributed inference workloads.
TABLE II: DRM1 sharding configurations: per-shard capacity (GiB), number of embedding tables, and estimated pooling factor. For each configuration, values are listed in shard order (shard 1 first).
Configuration | Capacity (GiB) | Embedding Tables | Estimated Pooling Factor
1 Shard (all tables) | 194.05 | 257 | 138943.1
Load-balanced, 2 shards | 89.38, 104.67 | 125, 132 | 69471.6, 69471.5
Load-balanced, 4 shards | 40.94, 60.76, 44.16, 48.18 | 63, 67, 63, 64 | 34735.9, 34735.7, 34735.7, 34735.8
Load-balanced, 8 shards | 28.87, 29.82, 18.23, 21.0, 20.5, 26.35, 23.44, 25.85 | 33, 31, 33, 32, 32, 31, 32, 33 | 17367.4, 17368.3, 17367.5, 17368.2, 17368.1, 17367.6, 17368.4, 17367.6
Capacity-balanced, 2 shards | 97.03, 97.03 | 121, 136 | 65852.3, 73090.8
Capacity-balanced, 4 shards | 48.52, 48.51, 48.51, 48.51 | 62, 74, 60, 61 | 28392.9, 35719.4, 38163.4, 36667.4
Capacity-balanced, 8 shards | 24.25, 24.25, 24.25, 24.25, 24.25, 24.25, 24.25, 24.25 | 43, 31, 30, 31, 31, 31, 31, 29 | 24656.9, 12067.0, 29746.4, 6310.7, 11825.0, 14050.1, 21075.8, 19211.2
Net-specific bin-packed (NSBP), 2 shards | 33.58, 160 | 72, 185 | 126652.7, 8010.7
Net-specific bin-packed (NSBP), 4 shards | 55.89, 48.22, 55.89, 33.58 | 28, 106, 51, 72 | 1859.9, 3540.5, 2610.3, 126652.7
Net-specific bin-packed (NSBP), 8 shards | 27.93, 5.649, 27.95, 27.94, 27.94, 27.95, 27.95, 20.28 | 42, 30, 18, 11, 43, 27, 27, 59 | 77103.4, 49549.3, 1034.2, 781.3, 2017.9, 1372.3, 1532.2, 1272.8
IV-A Workload Capture via Distributed Tracing
Distributed tracing is a proven technique for debugging and performance investigations across inter-dependent services [34, 35]. We leverage this technique to investigate the performance impact of the distributed inference infrastructure. Custom instrumentation was added at multiple layers in the production service stack to provide a complete view of costs for request/response serialization, RPC service boilerplate setup, and model execution for each request.
Thrift and Caffe2 both provide instrumentation hooks to capture salient points in execution; only the service handler required more intrusive, though still lightweight, modification. Source instrumentation was chosen over binary instrumentation approaches because it is lighter weight and can interface with Thrift's RequestContext abstraction to propagate contextual data for distributed tracing. At each trace point, metadata specific to the layer and a wall-clock timestamp are logged to a lock-free buffer and then asynchronously flushed to disk. Wall-clock time is desirable because its ordering helps achieve a useful trace visualization. Furthermore, most spans are small and sequential, enabling wall-clock time to serve as a proxy for CPU time; per-shard CPU time for each request is also logged to verify this claim. The trace points are then collected and post-processed offline for overhead analysis and to reconstruct a visualization of events, resembling Figure 3.
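The sketch below illustrates the kind of per-event record such a trace point might emit. It is a simplified stand-in: the production implementation hooks Thrift and Caffe2 directly and uses a lock-free buffer with asynchronous flushing, whereas this version uses an ordinary lock and synchronous writes, and all field names are illustrative.

```python
import json
import threading
import time

_buffer = []
_lock = threading.Lock()

def trace_point(request_id, layer, name, shard, **metadata):
    """Record one cross-layer event: which shard, which layer of the stack
    (RPC service, serialization, ML framework, operator), and a wall-clock
    timestamp used to order and visualize spans offline."""
    event = {"req": request_id, "layer": layer, "name": name,
             "shard": shard, "ts_ns": time.time_ns(), **metadata}
    with _lock:
        _buffer.append(event)

def flush(path):
    """Dump buffered events to disk for offline post-processing."""
    with _lock, open(path, "a") as f:
        for event in _buffer:
            f.write(json.dumps(event) + "\n")
        _buffer.clear()
```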
An example distributed trace is shown in Figure 3. Shards are separated by horizontal slices, with the main shard servicing requests located at the top. The example trace represents a single batch, however it’s typical for multiple batches to be executed in parallel depending on the number of ranking requests and the configurable batch size. Furthermore, the ordering of operators is typical for the recommendation models studied in this work. Operators are scheduled to execute sequentially–unless specifically asynchronous like the RPC ops–because other cores are utilized via request- and batch-level parallelism. The initial dense layers are first executed, followed by the sparse lookups, and then the later feature interaction and top dense layers.
As discussed in Section III-A, each sparse shard is a full RPC service for deployment flexibility and, as such, incurs service-related overheads compared to local, in-line computation. Figure 3 demonstrates that the latency overhead in each sparse shard is any time not spent executing SLS operators. In the non-distributed case, the SLS ops are directly executed in the main shard. The extra latency is attributed to the network link, extra (de)serialization of inputs and outputs, and time spent preparing and scheduling the Caffe2 net. Despite such overheads, we also see that the asynchronous nature of the distributed computation enables more parallelization of the sparse operators. An important takeaway from this characterization is the trade-off between increased parallelization to reduce latency overhead and decreased parallelization to reduce compute and data-center resource overhead. Furthermore, this trade-off is specific to each model.
IV-B Cross-Layer Attribution
Compared to simple, operator-based counters for compute or end-to-end (E2E) latency, capturing a cross-layer trace provides a holistic view of compute and latency. The example flow in Figure 3 shows compute overheads from RPC serialization and deserialization and request/response processing of the software infrastructure. Additional compute is incurred in each thread's scheduling and book-keeping of asynchronous RPC operators. Simple end-to-end latency overhead is easily measured at the main shard. However, because each sparse shard is executed in parallel, attribution of latency overheads is more complex and involves overlap between sparse shard requests. To simplify analysis, the slowest asynchronous sparse shard request, per main shard request, is used for the latency breakdown. The network, serialization, and RPC service latencies in the associated sparse shard are used. Because the clocks on disparate servers will be skewed, network latency is measured as the difference between the outstanding request time measured at the main shard and the end-to-end service latency measured at the sparse shard.
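The attribution rule can be written as a short calculation over the recorded spans. The sketch below assumes per-request span records with illustrative field names; note that clock skew between servers is avoided because each subtraction only involves timestamps taken on the same machine.

```python
def attribute_request(main_span, sparse_spans):
    """main_span: timestamps taken on the main shard for one request.
    sparse_spans: per-shard breakdowns of the asynchronous sparse shard
    requests issued on behalf of that request."""
    # Use the slowest sparse shard request for the latency breakdown.
    slowest = max(sparse_spans, key=lambda s: s["e2e_latency"])
    # Outstanding time is measured entirely on the main shard's clock and the
    # shard E2E latency entirely on the sparse shard's clock, so their
    # difference approximates network (plus queuing) time without skew.
    outstanding = main_span["rpc_response_ts"] - main_span["rpc_issue_ts"]
    return {
        "network": outstanding - slowest["e2e_latency"],
        "sparse_serde": slowest["serde_latency"],
        "sparse_service": slowest["service_latency"],
        "sparse_operators": slowest["operator_latency"],
    }
```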
V Experimental Methodology and Workloads
We evaluated production-grade distributed serving software infrastructure to quantify the impacts on compute requirements, latency, and effectiveness of sharding. The purpose of this study was to evaluate the practicality of distributed inference as a means to support huge deep recommendation models and provide a basis for further systems exploration with this novel workload. In this section, we provide descriptions of the data-center deep learning recommendation models chosen for this study and the underlying hardware platform.
V-A Recommendation Models
Three DLRM-like models, referred to as DRM1, DRM2 and DRM3, were used that span a range of large model attributes, such as varying input features and embedding table characteristics. The D-prefix is to contrast the distributed models to the specific models discussed in recent deep learning recommendation works [2, 1, 3]. The DRM* models are a subset of many possible, evolving models and chosen as early candidates for distributed inference given their sparse feature characteristics. The goal of studying these models is to present a basis for evaluating overheads of distributed inference via our cross-layer distributed trace analysis, and they should not be interpreted as canonical benchmark models such as those included in the MLPerf benchmark suite [9].
Figure 4, Figure 5, and Table II demonstrate the variety within large neural recommendation models. DRM1 and DRM2 have two nets, each having their own respective sparse features, while DRM3 has a single net. Because real, sampled requests were used for each model, the inference request size and corresponding compute and latency also varied between models. All parameters were uncompressed, single-precision floating point; Section VII-D discusses the impact of compression. Per-operator group attributions for each model are shown in Figure 4, as a simple mean across all sampled requests for the non-distributed model. DRM1 and DRM2 are the most similar architectures, reflected in Figure 4. Compared to DRM3, DRM1 and DRM2 have a more complex structure, evidenced by additional tensor transform costs. More relevant to this work, sparse operators consume a much larger portion of all operator time compared to DRM3. Specifically, sparse operators contribute 9.7%, 9.6%, and 3.1% of all operator time in DRM1, DRM2, and DRM3, respectively. Despite their low proportional compute, the sparse operators account for >97% of model capacity in DRM1 and DRM2, and >99.9% of capacity for DRM3. Embedding tables larger than a given threshold were scaled down by a proportional factor to fit the entire model on a single 256GB server. This provided a straightforward comparison of compute and latency overheads across all sharding strategies listed in Table I, including singular. The original data-center scale models are many times larger.
The distribution of embedding table sizes within each model is shown in Figure 5. DRM1 was sized to 200GB with 257 embedding tables and a largest table of 3.6GB. DRM2 was proportionately sized to 138GB with 133 embedding tables and a largest table of 6.7GB. Lastly, DRM3 was sized to 200GB with 39 embedding tables and a largest table of 178.8GB. DRM1 and DRM2 demonstrate a long tail of embedding table sizes compared to DRM3, explaining the additional sparse operator cost shown in Figure 4. Comparatively, DRM3 is dominated by a single large table. The embedding tables are representative versions of the tables discussed in prior deep learning recommendation work [3, 1, 9].
For DRM1 and DRM2, ten sharding configurations were evaluated. Sharding strategy results for DRM1 are described in Table II. For the load-balanced configuration, per-shard capacities varied by up to 50% compared to capacity-balanced, where each shard has the same total capacity. For the capacity-balanced configuration, per-shard estimated load varied by up to 371%, between shards 4 and 3 with eight shards, compared to load-balanced, where each shard has the same total pooling factor. Lastly, the net-specific bin-packing (NSBP) strategy restricted each shard's tables to a single net. This is most noticeable in the 2-shard configuration, where each net is placed on its own shard: Shard 2 consumes 4.75× as much memory capacity as Shard 1, yet is estimated to perform just 6.3% of Shard 1's compute work.
DRM3 is only sharded with NSBP, and not the capacity-balanced or load-balanced strategies, due to existing technical challenges of sharding huge tables. Because it is dominated by a single large table, for each additional shard the largest table is further split, while the smaller tables remain grouped as one shard. The dominating table has a pooling factor of 1, thus only one of the shards spanning the table will be accessed. For example, given four sparse shards, the largest table is partitioned into three shards and the remaining tables are grouped together into one shard. For each inference, only two shards would be accessed: one for the sharded large table and one for the smaller tables.






V-B Test Platform
Two classes of servers, representative of the data-center environment, were used for characterization. SC-Large is representative of a typical large server in a data-center and has 256GB of DRAM and two 20-core Intel CPUs. SC-Small is representative of a typical, more efficient web server and has 64GB of DRAM and two, slower clocked, 18-core Intel CPUs, and less network bandwidth than SC-Large. Because of the limited DRAM available on SC-Small, only a subset of configurations could be tested on this platform. The majority of results discussed in Section VI were collected on SC-Large platforms for an apples-to-apples comparison with the non-distributed models. A discussion on the impact of server platform follows in Section VII.
All inference experiments were run on CPU platforms, as discussed in Section II. The reserved, bare-metal servers were located in the same data centers as production recommendation ranking and utilized the same intranet. Their locations within the data center were representative of a typical inference serving tier. Shards were assigned to unique servers and not co-located. Shard-to-server mappings were randomized across repeated trials. Recommendation ranking requests were sampled from production servers and replayed on the test infrastructure; a database of de-identified requests was sampled evenly across a five-day time period in order to capture any diurnal behavior within the requests. The production replayer then pre-processed and cached the requests before sending them to the inference servers using the same networking infrastructure used in online production ranking.
For the majority of experimental runs, batch sizes for inference were set to production defaults, where each batch represents a number of recommendation items to rank and is executed in parallel. In Section VI requests were sent serially, to isolate inherent overheads. In Section VII-A requests were sent asynchronously, to simulate a higher QPS rate more representative of the production environment.
VI Distributed Inference Sharding Analysis
In this section, we discuss the findings of the cross-layer distributed trace analysis. We find that the design space is more nuanced than naively minimizing the number of shards, given hard latency targets and compute scale of the data-center environment. Our primary takeaways are:
• Increasing the number of shards can manage latency overheads incurred by the RPC operators by effectively increasing model parallelism.
• However, increased sharding also incurs substantial compute costs for each additional RPC, from service boilerplate and scheduling.
• Blocking requests sent serially, one at a time, always perform worse in distributed inference across P50, P90, and P99 latencies, due to a simple Amdahl's law bound: embedding lookups for deep recommendation systems do not comprise a large enough fraction of E2E latency to benefit from the increased parallelism.
• The load-balanced strategy did not significantly affect E2E latency compared to the capacity-balanced strategy. The net-specific bin-packing (NSBP) sharding performed the worst latency-wise, but the best compute-wise.
• Some models have insufficient work to parallelize and do not see overhead reductions from additional shards.
The findings show distributed inference is a practical solution to serve the large, deep learning recommendation models, which the existing serving paradigm cannot support. The findings also expose the need for targeted analyses of the network link between shards and the supporting software infrastructure for managing communication, given the little work performed on sparse shards. The results suggest that an automatic sharding methodology is feasible, but requires sufficient profiling data given the variety in embedding table behavior and complex trade-offs of sharding strategies.





VI-A Distributed Inference Overheads
The P50, P90, and P99 end-to-end (E2E) latency and aggregate compute time overheads, across all sharding strategies, for all models, for serial blocking requests, are shown in Figures 6 and 7. A description of each sharding strategy is located in Table I. Note that singular is the non-distributed case, and 1-shard is the worst case with all embedding tables on one sparse shard. P90 and P99 latencies are typical metrics for inference serving because a less accurate fallback recommendation is returned if the inference request is not processed within SLA targets. P50 is also presented for completeness to show the median case, instead of the mean, due to the long tail latencies discussed in prior work [1].
VI-B Latency Overhead
VI-B1 Latency overheads can be ameliorated by increasing shards
E2E slowdown was incurred for all distributed inference configurations across all models, shown in Figures 6 and 7. The parallelization of sparse operators with asynchronous RPC ops was insufficient to overcome the added network latency and software layers. For DRM1 and DRM2, the worst performing configuration, for latency, is unsurprisingly one sparse shard. This is the impractical worst case, where all embedding tables are placed on one shard and no work is parallelized. Encouragingly, at just 2 shards load-balanced, the latency overhead was only 7.3% at P99 for DRM1. At 8 shards load-balanced, this overhead fell to just 1% at P99, and 11% in the median P50 case. This demonstrates that it is possible to achieve minimal, practical latency overheads given a simple sharding strategy. Unexpectedly, the 2-shard NSBP strategy actually had the worst P99 latency for DRM1 and nearly the worst for DRM2. Recall from Table II that much of the work, measured by pooling factor, is assigned to a single shard, and thus for the P99 case the 2-shard NSBP configuration effectively acts like a bounding 1-shard configuration.
VI-B2 Constant overheads eventually dominate
As the number of shards increases, the work per shard is reduced and network latency and additional software layers dominate, shown in Figure 8. Network latency is measured as the difference between outstanding request time at the main shard and the total E2E time at the sparse shard; this time includes in-kernel packet processing and forwarding time. For all distributed inference configurations, network latency was greater than operator latency, so distributed inference will always hurt the latency of these models when requests are served serially. Put another way, if the sparse operators produced enough work on average, then the model would be amenable to distributed inference, and given sufficient sparse operator work, latency could be improved. This provides a multi-discipline opportunity for the system architect, model architect, and feature engineer to collaborate on balancing model resource consumption, performance, and accuracy.
VI-B3 Sharding impact depends on model architecture
For DRM3, the number of shards didn’t have a strong impact. DRM3 is dominated by a single large table, shown in Figure 5, which is split amongst the shards. Even at 8-shards, only 2 shards are accessed per inference request–one shard containing the smaller tables and one shard containing the entry for the sharded, largest table, emphasized in Figure 11(a).
VI-B4 In-depth latency layer attribution
Figure 8 shows a breakdown of P50, or median, latency across the layers traced. Figure 8(a) shows that only the embedded portion of the workload is significantly affected across sharding strategies, as expected, because this is the portion of the workload being offloaded to sparse shards. The Dense Ops are all ML operators that are not embedding table lookup and pooling ops; the Embedded Portion is all embedding table lookup and pooling operators. For singular this is the ops themselves, whereas for the distributed inference configurations, this is time spent waiting for a response from a sparse shard in Figure 8(a). RPC Serde refers to all serialization and deserialization request times, while RPC Service is any other time strictly not spent in a Caffe2 net or serialization/deserialization. Finally, Net Overhead refers to any time in the net not spent executing operators, e.g. scheduling of asynchronous ops. For distributed inference configurations, the effects on embedded portion latency represent the overhead shown in Figures 6 and 7. In the DRM1 singular configuration, the embedded portion represents only a small fraction of latency, while for a single shard it is 32%. In the best distributed inference case, 8-shards load-balanced, it represents 15.6% of total latency. Comparatively, the embedded portion of DRM3 does not significantly change as shards are increased because only the large dominating table is further partitioned. Changes to latency and compute for DRM3 are attributed to cache effects and network variability of communicating with more server nodes. Figure 8(b) further attributes latency within that embedded portion: each bar stack represents the embedded portions in Figure 8(a). For P99, the embedded portion is less significant, and the dense operators and RPC deserialization on the main shard begin to dominate due to very large inference request sizes. This is why the P99 latency overheads are more favorable than P50.
VI-C Compute Overhead
Understanding compute overheads is vital to minimize additional resource requirements at the data-center scale. Sharding strategy is one method for the system designer to balance latency constraints, discussed in the previous section, and compute overheads which impact resource requirements.
VI-C1 Increased compute is a trade-off for reduced latency overheads
The high compute overhead is the trade-off for a flexible, easily deployed system. As stated in the previous section, increasing sparse shards can reduce the latency overhead of distributed inference. However, compute overhead is also increased, because each shard invokes a full Thrift service. Figure 9 shows that for all models, distributed inference always increases compute due to the additional RPC ops required. More importantly, Figure 9 demonstrates that compute overhead is proportional to the number of RPC ops. The NSBP strategy observes the least compute overhead because it executes the fewest RPC ops. Recall that the NSBP strategy restricts each shard to not mix embedding tables from different nets, and as such each shard is invoked only once per inference. Given multiple nets, each net is less parallelized. Comparatively, the other sharding strategies, which may parallelize each net more, will invoke more RPC ops and lead to increased overall compute overhead.
VI-C2 Compute overhead impacts data-center resources
Increased compute overhead is especially problematic when it is incurred on the main shard, because it increases compute-driven replication and resource requirements to handle the same QPS. This occurs when the compute needed to issue RPC ops, on the main shard, dominates the compute saved by offloading the embedded portion. This is also more likely to occur with model architectures with many, large embedding tables and low pooling factors. The results provide an impetus to investigate these inflection points, which should be inputs to future automatic sharding methodologies and are dependent on model attributes and software infrastructure. This is further discussed in Section VII.



VI-D Sharding Strategy Effects
A trade-off between increased latency or increased compute, as a result of shard count, was established in the previous section. Shard count is a straightforward knob for system designers to balance compute overheads with latency constraints. The sharding strategy, discussed in this section, provides another knob for designers, but the effects on latency and compute are more nuanced.
VI-D1 NSBP is the most scalable strategy
Latency overheads did not show a strong difference between the load-balanced and capacity-balanced configurations. However, the net-specific bin-packed strategy deviated by being less impacted by additional sharding. The most important takeaway from this observation is that NSBP is the most scalable strategy evaluated, because it invokes fewer RPC ops.
NSBP presented the most unbalanced per-shard latencies. Recall that in this strategy, embedding tables are first grouped by net and then assigned to shards based on both net and size; tables from separate nets are never assigned to the same shard. Figure 10 more clearly demonstrates this via the per-net operator latencies for the load-balanced and NSBP strategies. Shards 1 and 2 hold the first net, which performs the most work but has the smallest table sizes, shown in Figure 10(b) and Table II. For NSBP, this had a negative effect on latency since less work is parallelized; however, compute overhead is less impacted precisely because of this reduced parallelization, i.e., less scheduling and service overhead. Note that Net 1 and Net 2 are executed sequentially, so their combined effect on E2E latency is additive. The benefits of the NSBP strategy, in terms of resource utilization, are further discussed in Section VII.
VI-D2 Little overhead difference exists between load-balanced and capacity-balanced strategies
Figure 12 shows the per-shard operator latencies for all 8-shard configurations with DRM1, across all requests; DRM2 shows similar trends and is omitted for brevity. Recall from Section III-B that load-balanced sharding was expected to have lower latency overheads by preventing any one shard from becoming the bounding critical path. However, per-shard operator latencies for both strategies are insignificant compared to the E2E latency. Furthermore, there is not a significant difference in latencies between the load- and capacity-balanced strategies, as was suggested by the estimated pooling factors in Table II; the pooling factors are too small at this scale to show an appreciable effect. For DRM1 and DRM2, between the load-balanced and capacity-balanced shard strategies, the largest impact comes from increasing the number of shards.
VI-E Model Variety
Model attributes also affect distributed inference performance. The number of nets, the number of embedding tables, the size distribution, and the respective pooling factors are the most relevant model attributes and are described in Section V-A for the evaluated models. DRM3 has less total compute attributed to sparse operators and is dominated by a single large embedding table, compared to the long tail of embedding tables in DRM1 and DRM2. Latency for DRM1 and DRM2 benefits from increased sharding, but DRM3 sees no such benefit, as shown in Figures 7 and 8.
VI-E1 Effects of sharding depend on model architecture
Figure 11 singles out the per-shard operator latency and embedded portion breakdown of DRM3 to show that additional sharding does not improve latency. The primary causes are twofold. (1) Additional sharding only partitions the one capacity-dominating embedding table, which does not parallelize any significant compute. Figure 11(a) shows shard 1 performing the majority of compute, as it contains all embedding tables except for the largest, which is partitioned across the rest of the shards. (2) Even if the smaller tables were partitioned, the relatively low compute would still preclude practical latency improvements because network latency dominates E2E latency overheads too greatly. Thus we conclude that models with a long tail of embedding tables and higher pooling factors, like DRM1 and DRM2, are required to benefit from sharding.
VI-F Batching Effects
Batch-sizing for inference splits a request into parallel tasks and is a careful balance between throughput and latency in order to meet SLA and QPS targets. To show its interaction with distributed inference, we set the batch size artificially large to perform one-batch per-request. Smaller batch sizes increase task-level parallelism per-request and can reduce latency, but consequently increase task-level overheads and reduce data-parallelism, which can reduce throughput. In contrast, larger batches increase sparse operator work and benefit from the parallelization of distributed inference. Batch-sizing for deep recommendation inference is an on-going research topic [2].
VI-F1 Distributed inference improves latency with sufficiently large batch sizes
Figure 13 shows distributed inference can improve latency in the DRM1 single-batch case, when using 8-shards capacity- or load-balanced configurations. This is because the sparse operators perform enough work to sufficiently benefit from parallelization, emphasized in Figure 13(b). DRM2 shows similar trends, but is not as strongly impacted because requests are smaller. In this context, larger batches can be viewed as a proxy for embedding tables with larger pooling factor, with the salient characteristic being additional lookup indices sent over RPC and increasing sparse operator work and network requirements. DRM3 isn’t shown because its requests are typically small enough for only one batch per request, with default batch sizes.
VI-F2 Batch sizing can manage distributed inference compute overheads
Figure 14 emphasizes the multiplicative compute overhead: each additional batch issues corresponding RPC ops, which increases compute requirements. For example, because NSBP for DRM1 issues one RPC per shard, its compute overhead increases more slowly than load-balanced as shards are added with the default batch size. With one batch per request, the marginal increase in compute from sharding is less severe. Thus, it is essential to consider batch size when exploring distributed inference compute overheads.


VII Impacts in Data-Center Environments
In this section, we discuss the implications of distributed inference, for deep recommender systems, in the data-center. Analyses of latency and compute overheads are important to understanding the impact of distributed inference on SLA targets and additional compute resources. However, the analyses in Section VI focused on a simplified scenario to attribute per-request overheads where (1) each request was processed serially and (2) the servers had the same SC-Large hardware configurations which are over-provisioned. To model a more representative serving environment, we performed two additional experiments on DRM1, which is the most compute intensive model. First, we sent requests at a higher rate of 25 QPS across all sharding strategies and with the same hardware configuration of SC-Large. Second, we re-ran the load-balanced configuration across SC-Small platforms more typical for web serving to compare to SC-Large, with requests again sent serially. We place our results in the context of a serving environment where model instances are replicated to handle real-time traffic. Lastly, we include a discussion of existing compression techniques currently implemented in large, data-center scale deep recommender systems. Our primary observations are:
• Requests sent at a higher QPS, indicative of a data-center environment, perform better under distributed inference at P99 due to improved resource availability.
• Between SC-Large and SC-Small, there is no significant difference in sparse shard per-request latency, providing an opportunity for better efficiency by serving sparse shards on lower-power platforms.
• Shard replication provides an opportunity for improved serving efficiency by allocating resources for the dense and sparse portions of the model independently.
• Compression is complementary to distributed inference and cannot by itself address model scalability issues.
VII-A High QPS Environment
The request replayer was configured to send requests at 25 QPS to the main server for DRM1, the most compute-heavy of the evaluated models. The overhead for each DRM1 configuration, compared to singular, is shown in Figure 16. All overheads in the 25 QPS experiment are lower than those of the same configurations when requests were sent serially (Figure 6). Across nearly all configurations, P99 latency improves over the singular configuration. Furthermore, in some configurations, such as 8-shard capacity- and load-balanced, P50 latency either improves over singular or incurs only a small overhead. Distributed inference therefore has better latency characteristics in higher-QPS scenarios, given a model with sufficient sparse operator compute.
VII-B Sparse Shard Platform Efficiency
To provide an apples-to-apples comparison when adding servers for distributed inference, the same hardware platform was used for the sparse shards. However, Figures 8 and 12 show that the sparse shards have far lower compute requirements than the main shard, as the sharding in this work is capacity-driven, not compute-driven. Thus, we run an additional configuration using lighter-weight SC-Small servers with DRM1, which, again, is the most compute-intensive model. Figure 15 shows that the per-shard operator latencies are nearly identical when run on the original heavier-weight SC-Large servers and the lighter-weight SC-Small servers. Recall that the SC-Large server has more and faster cores and greater memory capacity, resulting in a larger energy footprint. This suggests an opportunity for coarse-grained platform specialization of sparse shards to increase serving- and energy-efficiency at the cluster level.
VII-C Replication in the Data-Center
In this section, we discuss how distributed inference can improve shard replication. Supporting singular models with hundreds of GBs of memory footprint requires servers that are inefficiently provisioned: most of the memory capacity is dedicated to the large embedding tables, while most cores operate on the significantly smaller dense parameters. For DRM1, DRM2, and DRM3, the majority of compute touches less than 3% of the model's memory footprint. This inefficiency is scaled to support the QPS of millions or billions of users as inference servers are replicated dynamically. In such a case, the large load incurred by the dense layers, shown in Figure 4, causes the entire model to be replicated to additional servers, including all embedding tables. Distributed inference alleviates this inefficiency by enabling compute resources to be allocated for dense layers and memory resources to be allocated for sparse layers; in other words, the memory requirements of replication are reduced. Our heuristic of sharding the dense and sparse layers of the models into separate components simplifies this allocation. Lastly, the sharding strategy plays an auxiliary role in replication. While the decision to shard sparse operators has the strongest impact on reducing the memory requirements of replication, Table II shows that the sharding strategy can further impact replication. Recall that DRM1 is comprised of two primary nets, where Net 2 consumes as much memory as Net 1 but only 6.3% of its compute. The NSBP sharding strategy constrains each shard to hold either Net 1 or Net 2, and accordingly does not parallelize work as well as a net-agnostic sharding strategy. However, NSBP can further improve resource utilization by grouping the most compute-dense embedding tables together, even at the higher shard counts necessary for larger models. This trade-off between improved latency and improved resource utilization will need to be made on a model-by-model basis and encourages further work in sharding automation.
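To make the replication argument concrete, the following back-of-the-envelope sketch compares replicating a monolithic model against replicating only its dense portion while sharing a fixed set of sparse shards. The replica count of 10 is hypothetical; the model size comes from Table III and the 3% dense footprint is an upper bound from the discussion above.

```python
def total_dram_gb(model_gb, dense_fraction, dense_replicas, sparse_copies=1):
    """DRAM needed when dense and sparse resources are scaled independently
    (a simplification of the replication scheme discussed above)."""
    dense_gb = dense_fraction * model_gb
    sparse_gb = model_gb - dense_gb
    return dense_replicas * dense_gb + sparse_copies * sparse_gb

model_gb = 194.46       # DRM1 uncompressed size (Table III)
dense_fraction = 0.03   # compute touches <3% of the model footprint

monolithic = 10 * model_gb                                    # replicate everything 10x
disaggregated = total_dram_gb(model_gb, dense_fraction, 10)   # replicate only the dense portion 10x
print(monolithic, disaggregated)  # ~1945 GB vs. ~247 GB under these assumptions
```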
VII-D Model Compression
TABLE III: DRM1 model size and latency, uncompressed vs. quantized and pruned.

|                   | Uncompressed | Quantized and Pruned |
| Total Size (GB)   | 194.46       | 35 (5.56x smaller)   |
| CPU Time    P50   | 1            | 0.98                 |
| CPU Time    P90   | 3.53         | 3.47                 |
| CPU Time    P99   | 6.6          | 6.31                 |
| E2E Latency P50   | 1            | 0.99                 |
| E2E Latency P90   | 2.72         | 2.75                 |
| E2E Latency P99   | 5.03         | 4.87                 |
Model compression is the traditional approach to constrain model capacity. We show the compressed model size for DRM1 in Table III as a point of reference. The compressed model was generated with all compression techniques deployed on current data-center models, which use quantization and pruning [22]. All tables were row-wise linearly quantized to at least 8 bits, and sufficiently large tables were quantized to 4 bits. Tables were manually pruned, as specified by the model architect, based on a threshold magnitude or training update frequency. Quantization and pruning options were chosen to preserve accuracy and latency. We note that latency and compute are marginally improved with compression; because this was not a focus of this work, we leave that analysis to future work, though we speculate the cause is improved memory locality. More relevantly, Table III shows the compressed model is 5.56x smaller. While significant, even with these savings, large models will still not fit on one, two, or even four commodity servers configured with 50GB of usable DRAM. Thus, compression alone is insufficient to enable emerging large deep recommendation systems.
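For reference, a minimal sketch of row-wise linear (min-max) quantization of a single embedding table is shown below. It assumes NumPy, omits the 4-bit packing and the pruning step, and is not the production quantization pipeline.

```python
import numpy as np

def rowwise_linear_quantize(table, bits=8):
    """Quantize each embedding row independently with a linear min-max
    scheme, keeping a per-row scale and bias for dequantization.
    (4-bit tables would additionally pack two values per byte; not shown.)"""
    qmax = (1 << bits) - 1
    mins = table.min(axis=1, keepdims=True)
    maxs = table.max(axis=1, keepdims=True)
    scales = np.maximum(maxs - mins, 1e-8) / qmax
    quantized = np.round((table - mins) / scales).astype(np.uint8)
    return quantized, scales.astype(np.float32), mins.astype(np.float32)

def dequantize_rows(quantized, scales, mins, row_ids):
    """Reconstruct only the rows touched by a pooled lookup."""
    return quantized[row_ids].astype(np.float32) * scales[row_ids] + mins[row_ids]

# Toy example: a 1000-row, 64-dim table shrinks roughly 4x versus fp32,
# plus a small per-row overhead for the scale and bias.
table = np.random.randn(1000, 64).astype(np.float32)
q, s, b = rowwise_linear_quantize(table)
```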
VIII Related Work
Recent work on inference for deep learning recommendation models has focused on smaller models that fit on a single machine [2, 3, 1, 5, 4, 6]. Characteristics such as compute density, embedding table memory access patterns, and batch-size effects were explored in the context of CPU architectures and accelerators. However, these inference-focused works did not explore the trends of scaling models to the terabyte scale.
Separate works have described the challenges of scaling training for large models, but they either do not address deep recommendation systems or do not explore the inference serving environment that is the focus of our paper [15, 36, 37, 38, 12, 14]. Google’s data-center scale ML serving infrastructure, TensorFlow-Serving, encompasses model loading, versioning, RPC APIs, and inference batching [39]. Support for data-parallel distributed inference exists as part of distributed TensorFlow [32], but the distributed schemes described in this work are not currently supported out-of-the-box. The remote operators used there are similar to the ones described in this work, but a distributed serving paradigm has only briefly been mentioned in publication [14]. More recent works have discussed infrastructure for model parallelism in TensorFlow–and in turn huge models–but have focused on language models, which lack the variety and number of features shown in Figure 5, and on training, which emphasizes throughput rather than inference’s latency-bounded throughput [15, 36, 37, 40, 41].
GShard provides a sharding abstraction that relies strongly on manual user annotation for tensor placement, and it does not automatically shard based on dynamic inputs as explored in this work [15]. Furthermore, Baidu’s recent work includes multi-GPU systems, fast SSDs, and pipelined stages specifically to support large embedding tables for recommendation [13, 12]. Compared to their baseline MPI solution, their GPU+SSD system attains higher training throughput and requires fewer physical nodes. However, such a system is more complex to deploy, and its effects on end-to-end latency and strict SLA targets are unexplored in the serving environment. In both cases, the evaluated use-case is parallelism for training, where model parallelism is mostly static, compared to the dynamic environment of inference serving, where model components can be replicated to meet influxes of QPS. While training massive deep recommendation systems is also an immense and important challenge–and is related to inference serving–this work focuses on the data-center implications of serving those models. The sharding strategies discussed in this work are heuristically tailored to the unique model architecture of deep recommendation systems and are driven by capacity and latency constraints, not training throughput.
Cooperative, hybrid approaches–which split the model between mobile devices and the data-center–have also been explored for improved performance and efficiency [42]. However, such a scheme is problematic for recommendation, where the model is already capacity-bound, so execution on a constrained mobile system is challenging. Privacy requirements, energy concerns for the user, and online models further impede the placement of model partitions on user devices.
Model compression is discussed in Section VII-D, which concluded that current quantization and pruning techniques do not by themselves resolve the challenge of serving terabyte-scale deep recommendation systems. The effectiveness of traditional compression techniques is specific to the neural network model, with CNNs in particular shown to compress well [22]. Techniques specific to large embedding tables have traditionally targeted the intuitive characteristics of pre-trained word embeddings, not the sparse user- and content-features of deep recommendation systems [43, 44, 45, 46, 47, 48, 49]. Other lossless in-memory or in-cache compression techniques assume data sets that exhibit lower entropy and more regularity than observed in embedding tables [50, 51, 52]. Even when quantized to 8 or 4 bits, embedding table value distributions do not significantly benefit from further lossless entropy encoding. Techniques to reduce the size of recommendation embedding tables warrant further exploration and are an ongoing area of research [21].
Lastly, the distributed scheme described in this work, which shards only the embedding tables, can be compared to traditional scale-out, in-memory distributed databases or hash tables that serve large volumes of tabular data [53, 54]. However, these systems are overly complex for serving a trained model snapshot, given their mutability and scale requirements. Even within training, the varying levels of asynchrony of parameter servers do not lend themselves to existing distributed database solutions. Given that a key motivation for distributed inference is rapid and flexible deployment, integration with database solutions was not considered.
IX Industry and Academic Relevance
The challenge of large deep learning recommendation models provides a strong opportunity for academics to tackle problems of the data-center, the impact of which is significant due to the ubiquity of deep recommendation systems. A core contribution of this work is that the models and infrastructure are highly representative of real inference serving; the software infrastructure in particular matches the data-center environment. Despite the seemingly industry-oriented inclination of this workload, academic-friendly methodologies also exist that are less resource intensive. DLRM is an open-source recommendation model that has been used in many recent works [3, 1, 5, 4, 2]. While it does not currently support distributed inference, extending it with this functionality is straightforward using the RPC operators described in Section III. Further research opportunities are achievable by way of trace-driven experimentation. For example, Bandana used embedding table access traces–which can be collected offline–to reduce effective DRAM requirements [4]. Because embedding table behavior is the dominating design factor in large models, explorations of table placement and frequency-based caching are also valuable directions enabled by trace-based analyses.
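As one possible starting point for such an extension, the sketch below serves pooled embedding lookups from remote shards using torch.distributed.rpc. The worker names, the per-table registry, and the per-table RPC granularity are our assumptions; this is not the fbthrift-based remote operator implementation described in Section III.

```python
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

# Assumed setup (omitted for brevity): each sparse shard calls
# rpc.init_rpc("sparse<k>", rank=..., world_size=...) and registers its
# tables; the main shard initializes as "main" and calls rpc.shutdown()
# when serving ends.
_TABLES = {}  # table_id -> nn.EmbeddingBag resident on one sparse shard

def register_table(table_id, num_rows, dim):
    """Runs on a sparse shard: materialize one embedding table in local DRAM."""
    _TABLES[table_id] = nn.EmbeddingBag(num_rows, dim, mode="sum")

def pooled_lookup(table_id, indices, offsets):
    """Runs on a sparse shard: sum-pool the requested rows of one table."""
    with torch.no_grad():
        return _TABLES[table_id](indices, offsets)

def sparse_forward(shard_of, sparse_inputs):
    """Runs on the main shard: issue one asynchronous RPC per table and
    gather pooled embeddings for the downstream dense interaction layers.

    shard_of:      table_id -> worker name holding that table (the mapping
                   produced by a sharding strategy)
    sparse_inputs: table_id -> (indices, offsets) tensors for this request
    """
    futures = {
        tid: rpc.rpc_async(shard_of[tid], pooled_lookup, args=(tid, idx, off))
        for tid, (idx, off) in sparse_inputs.items()
    }
    return {tid: fut.wait() for tid, fut in futures.items()}
```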
X Conclusion and Future Work
The rate and scale of model growth for recommendation motivates new serving paradigms to accommodate enormous sparse embedding tables. This emerging domain provides rich opportunity for system architects to have impact. Distributed inference in particular provides a solution that is easily deployed by existing infrastructure. Understanding the performance trade-offs and overhead sources is paramount to efficient and scalable recommendation.
This is the first work to describe and characterize capacity-driven distributed inference for deep learning recommender systems using a cross-layer distributed tracing methodology. We identified distributed inference as a practical, deployable solution on existing data-center infrastructure: even with the naive sharding strategies we evaluated, only a 1% increase in P99 latency was experienced at 8 shards for the most compute-intensive model. The trace-based characterization identified sharding strategy, sparse feature hyperparameters such as pooling factor, and request size as having substantial impact on latency and compute overheads. In particular, we found that organizing shards by net had a stronger impact on both latency and compute overheads than considering embedding tables in isolation. These results varied depending on the model and warrant a workflow that dynamically profiles models. We also showed that–given enough sparse operator work–latency can be improved over the non-distributed model, especially when batches are sufficiently large and in a high-QPS serving environment. Finally, we identified potential for improved data-center scale serving efficiency by decoupling dense and sparse operator resources in distributed inference.
Future work is needed to automate model sharding to target data-center resource efficiency and per-model SLA and QPS requirements. The various design trade-offs identified in this work place a large burden on the feature, model, and system designers to manually optimize the distributed deep learning recommendation model. Moreover, the design space should be expanded to include additional system-level solutions such as paging-from-disk, and to include additional large models, such as recent terabyte-scale NLP models and graph-based recommender systems [15, 55].
Acknowledgments
This work was completed in a collaboration between Drexel University, Tufts University, and Facebook. The majority of work was completed while some authors were on internship and sabbatical at Facebook. We would like to thank our colleagues in Facebook AI for their feedback, key insights, and myriad of discussions on recommender systems and distributed infrastructures and AI. The authors would like to specifically thank Udit Gupta, Liu Ke, Vikram Saraph, Mark Jeffrey, Caroline Trippel, Brandon Reagen, Michael Bevilacqua-Linn, Tristan Rice, Xuan Zhang, Hsien-Hsin Sean Lee, and Baris Taskin for their invaluable mentorship, guidance and support, without whom this work would not be possible.
References
- [1] U. Gupta, C. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia, H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, and X. Zhang, “The architectural implications of facebook’s dnn-based personalized recommendation,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 488–501.
- [2] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G.-Y. Wei, H.-H. S. Lee, D. Brooks, and C.-J. Wu, “Deeprecsys: A system for optimizing end-to-end at-scale neural recommendation inference,” in 2020 International Symposium on Computer Architecture (ISCA), 2020.
- [3] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, “Deep learning recommendation model for personalization and recommendation systems,” 2019.
- [4] A. Eisenman, M. Naumov, D. Gardner, M. Smelyanskiy, S. Pupyrev, K. Hazelwood, A. Cidon, and S. Katti, “Bandana: Using non-volatile memory for storing deep learning models,” 2018.
- [5] L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H. S. Lee, M. Li, B. Maher, D. Mudigere, M. Naumov, M. Schatz, M. Smelyanskiy, X. Wang, B. Reagen, C. Wu, M. Hempstead, and X. Zhang, “Recnmp: Accelerating personalized recommendation with near-memory processing,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 790–803.
- [6] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A chiplet-based, hybrid sparse-dense accelerator for personalized recommendations,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 968–981.
- [7] Y. Kwon, Y. Lee, and M. Rhu, “Tensor casting: Co-designing algorithm-architecture for personalized recommendation training,” 2020.
- [8] S. Hsia, U. Gupta, M. Wilkening, C.-J. Wu, G.-Y. Wei, and D. Brooks, “Cross-stack workload characterization of deep recommendation systems,” in 2020 IEEE International Symposium on Workload Characterization (IISWC), 2020.
- [9] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou, “Mlperf inference benchmark,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 446–459.
- [10] P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. S. John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia, “Mlperf training benchmark,” arXiv preprint arXiv:1910.01500, 2019.
- [11] C.-J. Wu, R. Burke, E. H. Chi, J. Konstan, J. McAuley, Y. Raimond, and H. Zhang, “Developing a recommendation benchmark for mlperf training and inference,” 2020.
- [12] W. Zhao, D. Xie, R. Jia, Y. Qian, R. Ding, M. Sun, and P. Li, “Distributed hierarchical gpu parameter server for massive scale deep learning ads systems,” 2020.
- [13] W. Zhao, J. Zhang, D. Xie, Y. Qian, R. Jia, and P. Li, “Aibox: Ctr prediction model training on a single node,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, ser. CIKM ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 319–328. [Online]. Available: https://doi.org/10.1145/3357384.3358045
- [14] X. Yi, Y.-F. Chen, S. Ramesh, V. Rajashekhar, L. Hong, N. Fiedel, N. Seshadri, L. Heldt, X. Wu, and E. H. Chi, “Factorized deep retrieval and distributed tensorflow serving,” ser. SysML’18, 2018.
- [15] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,” 2020.
- [16] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, “Deep learning scaling is predictable, empirically,” 2017.
- [17] J. Hestness, N. Ardalani, and G. Diamos, “Beyond human-level accuracy: Computational challenges in deep learning,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 1–14. [Online]. Available: https://doi.org/10.1145/3293883.3295710
- [18] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020.
- [19] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in 5th International Conference on Learning Representations, ser. ICLR ’17, 2017. [Online]. Available: https://openreview.net/forum?id=Sy8gdB9xx
- [20] M. Geiger, A. Jacot, S. Spigler, F. Gabriel, L. Sagun, S. d’ Ascoli, G. Biroli, C. Hongler, and M. Wyart, “Scaling description of generalization with number of parameters in deep learning,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2020, no. 2, p. 023401, Feb 2020. [Online]. Available: http://dx.doi.org/10.1088/1742-5468/ab633c
- [21] H.-J. M. Shi, D. Mudigere, M. Naumov, and J. Yang, “Compositional embeddings using complementary partitions for memory-efficient recommendation systems,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery; Data Mining, ser. KDD ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 165–175. [Online]. Available: https://doi.org/10.1145/3394486.3403059
- [22] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1510.00149
- [23] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, “Applied machine learning at facebook: A datacenter infrastructure perspective,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018, pp. 620–629.
- [24] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
- [25] C. Desrosiers and G. Karypis, Recommender Systems Handbook. Springer, 2011, ch. A Comprehensive Survey of Neighborhood-based Recommendation Methods.
- [26] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide and deep learning for recommender systems,” arXiv:1606.07792, 2016. [Online]. Available: http://arxiv.org/abs/1606.07792
- [27] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for youtube recommendations,” in Proceedings of the 10th ACM Conference on Recommender Systems, ser. RecSys ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 191–198. [Online]. Available: https://doi.org/10.1145/2959100.2959190
- [28] G. Zhou, C. Song, X. Zhu, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai, “Deep interest network for click-through rate prediction,” 2017.
- [29] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: A factorization-machine based neural network for ctr prediction,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence, ser. IJCAI’17. AAAI Press, 2017, p. 1725–1731.
- [30] Facebook, “fbthrift,” https://github.com/facebook/fbthrift.
- [31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
- [32] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
- [33] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” 2018.
- [34] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, “Dapper, a large-scale distributed systems tracing infrastructure,” Google, Inc., Tech. Rep., 2010. [Online]. Available: https://research.google.com/archive/papers/dapper-2010-1.pdf
- [35] J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O’Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi, V. Venkataraman, K. Veeraraghavan, and Y. J. Song, “Canopy: An end-to-end performance tracing and analysis system,” in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP ’17. New York, NY, USA: Association for Computing Machinery, 2017, p. 34–50. [Online]. Available: https://doi.org/10.1145/3132747.3132749
- [36] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” 2018.
- [37] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. Hechtman, “Mesh-tensorflow: Deep learning for supercomputers,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 10 414–10 423. [Online]. Available: http://papers.nips.cc/paper/8242-mesh-tensorflow-deep-learning-for-supercomputers.pdf
- [38] M. Naumov, J. Kim, D. Mudigere, S. Sridharan, X. Wang, W. Zhao, S. Yilmaz, C. Kim, H. Yuen, M. Ozdal, K. Nair, I. Gao, B.-Y. Su, J. Yang, and M. Smelyanskiy, “Deep learning training in facebook data centers: Design of scale-up and scale-out systems,” 2020.
- [39] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, “Tensorflow-serving: Flexible, high-performance ml serving,” 2017.
- [40] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: Generalized pipeline parallelism for dnn training,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 1–15. [Online]. Available: https://doi.org/10.1145/3341301.3359646
- [41] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks,” in Proceedings of Machine Learning and Systems 2019, 2019, pp. 1–13.
- [42] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’17. New York, NY, USA: Association for Computing Machinery, 2017, p. 615–629. [Online]. Available: https://doi.org/10.1145/3037697.3037698
- [43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013.
- [44] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162
- [45] R. Shu and H. Nakayama, “Compressing word embeddings via deep compositional code learning,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BJRZzFlRb
- [46] J. Mu and P. Viswanath, “All-but-the-top: Simple and effective postprocessing for word representations,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HkuGJ3kCb
- [47] Z. Yin and Y. Shen, “On the dimensionality of word embedding,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, p. 895–906.
- [48] V. Raunak, “Simple and effective dimensionality reduction for word embeddings,” 2017.
- [49] O. Hrinchuk, V. Khrulkov, L. Mirvakhabova, E. Orlova, and I. Oseledets, “Tensorized embedding layers for efficient model compression,” 2019.
- [50] A. Lagar-Cavilla, J. Ahn, S. Souhlal, N. Agarwal, R. Burny, S. Butt, J. Chang, A. Chaugule, N. Deng, J. Shahid, G. Thelen, K. A. Yurtsever, Y. Zhao, and P. Ranganathan, “Software-defined far memory in warehouse-scale computers,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 317–330. [Online]. Available: https://doi.org/10.1145/3297858.3304053
- [51] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Base-delta-immediate compression: Practical data compression for on-chip caches,” in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 377–388. [Online]. Available: https://doi.org/10.1145/2370816.2370870
- [52] A. Alameldeen and D. Wood, “Frequent pattern compression: A significance-based compression scheme for L2 caches,” 2004.
- [53] Apache Software Foundation, “Hadoop.” [Online]. Available: https://hadoop.apache.org
- [54] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, ser. OSDI ’06. USA: USENIX Association, 2006, p. 15.
- [55] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 974–983. [Online]. Available: https://doi.org/10.1145/3219819.3219890