A Case Against Hardware Managed DRAM Caches for NVRAM based Systems

Mark Hildebrand*, Julian T. Angeles†, Jason Lowe-Power†, Venkatesh Akella*
*Department of Electrical and Computer Engineering
†Department of Computer Science
University of California, Davis
{mhildebrand, jtales, jlowepower, akella}@ucdavis.edu

Abstract—Non-volatile memory (NVRAM) based on phase-change memory (such as Optane DC Persistent Memory Module) is making its way into Intel servers to address the needs of emerging applications that have a huge memory footprint. These systems have both DRAM and NVRAM on the same memory channel with the smaller capacity DRAM serving as a cache to the larger capacity NVRAM in the so called 2LM mode. In this work we analyze the performance of such DRAM caches on real hardware using a broad range of synthetic and real-world benchmarks. We identify three key limitations of DRAM caches in these emerging systems which prevent large-scale, bandwidth bound applications from taking full advantage of NVRAM read and write bandwidth. We show that software based techniques are necessary for orchestrating the data movement between DRAM and PMM for such workloads to take full advantage of these new heterogeneous memory systems.

I. INTRODUCTION

Large scale machine learning and large scale graph analytics represent workloads of interest for high performance servers in the foreseeable future. Emerging machine learning models in NLP and recommendation engines (such as GPT3 [3] and DLRM [33]) can have over 100 billion parameters requiring hundreds of gigabytes to terabytes of memory for training. Similarly, real world graphs can have hundreds of billions of edges, requiring hundreds of gigabytes to just store the graphs [36]. As a result, the cost of memory (DRAM) is becoming an important concern in datacenters and other high performance computing facilities dealing with large scale data analysis [15], [16].

To address this challenge, Intel recently introduced Optane Data-Center Persistent-Memory-Modules (DC PMM), a non-volatile memory (NVRAM) technology based on phase change memory that can serve as a drop-in replacement for conventional DRAM [20]. While programmers can use the NVRAM as a main memory DRAM replacement using normal load and store instructions, the latency is 3× higher and the bandwidth is at least 60% lower than DRAM [51]. Traditionally, to hide high memory latency and limited bandwidth, computer architects have turned to hardware caches. In this tradition, Intel Cascade Lake systems implement a DRAM cache for the NVRAM. DRAM caches have been well studied in simulation [6], [7], [27]–[29], [31], [41]. These previous works have not accounted for all of the realistic implementation details (e.g., tracking “coherence” of requests issued to NVRAM), leaving gaps between research proposals and the actual implementation.

In this work, we analyze the performance of an actual implementation of the DRAM cache in Intel’s Cascade Lake based servers on workloads whose memory footprint greatly exceeds the capacity of DRAM. We first analyze the behavior of the DRAM cache with microbenchmarks to reverse engineer its design and understand pathological performance cliffs. It is well known that this DRAM cache is implemented as a direct-mapped cache [26], and we find that the tags are stored in the ECC bits of the DRAM DIMMs to limit the access overhead. However, we also find that in many cases there are extra DRAM accesses required to update the cache metadata (e.g., tag reads before writes) which can significantly decrease the performance of miss-heavy workloads. In fact, using microbenchmarks on real hardware, we find that a single demand request can require up to 5 memory accesses.

After using microbenchmarks to understand the cache behavior and implementation, we analyze two memory capacity limited workloads: training large convolutional neural networks (CNNs) [21], [24], [47], [48] and graph analytics [17]. We show that in these realistic workloads, the DRAM cache can hurt performance even with a modest cache miss rate. We show that for the CNN workload, software management can increase performance by up to 3× over the DRAM cache. Furthermore, we show significant access amplification and bandwidth reduction for graph based workloads.

Fundamentally, we find three characteristics of this DRAM cache implementation which cause performance degradation for workloads with large working sets.

1) The direct-mapped, insert on miss cache is inflexible and many conflicts can increase the miss rate.
2) Under high miss rates, memory bandwidth is poorly utilized with extra bandwidth used for non-demand accesses (e.g., fills, writebacks, and tag checks).
3) For some workloads the data in the DRAM cache is temporary or dead from the program’s perspective leading to wasted data movement.

While some of these characteristics may be alleviated in future hardware, we can use these three insights on today’s hardware to improve the performance of heterogeneous memory systems. We present one example of a static software management technique which, by managing the data movement in software, can mitigate many of these performance problems. In the future, we hope that the insights presented in this paper can influence the next era of DRAM cache development.

The rest of the paper is organized as follows. We start with a quick background on Intel’s NVRAM technology and related work in the area of benchmarking NVRAM from recent literature. In Section III we present the details of our evaluation and validation framework. Section IV follows up with a detailed analysis of the DRAM cache in these systems. Next we use two representative case studies from deep learning and graph analytics to corroborate the findings from the microbenchmark experiments. We end the paper with a discussion of the results and the software based mitigation strategies in Section VII.

II. BACKGROUND AND RELATED WORK

Intel Optane DC (NVRAM) is a phase-change based non-volatile memory [20]. These devices come in a dual in-line memory module (DIMM) form factor and have the same physical footprint as traditional DRAM DIMMs. Memory controllers in high-end Cascade Lake or newer Xeon processors are capable of managing both a DRAM DIMM and a NVRAM DIMM on the same memory channel. Since NVRAM resides on the memory bus, CPUs may read and write to these devices using normal load and store instructions.

NVRAM can be used in the so-called 2LM mode (also known as memory mode or cached mode) [26], where NVRAM acts transparently as system memory. In this mode, system DRAM serves as a direct mapped cache for the non-volatile memory. NVRAM can also be used in the 1LM (or app direct) mode, using the `ndctl` tool, to appear as regular devices that are mounted into the Linux file system. In this mode, all loads or stores to memory mapped regions on this device go directly to the NVRAM devices themselves.

There have been several efforts in research literature that focus on evaluating the system level performance of Optane DC [26], [39], [40], [43], [49], especially in comparison with DRAM. More recently, Wang et al. [50] developed a profiler and NVRAM simulator to model the microarchitecture of NVRAMs in general. However, to the best of our knowledge, there has been no effort to understand the performance of DRAM caches in large scale NVRAM-based systems. The tools described by Wang et al. could nevertheless be used for hardware/software codesign of DRAM caches in the future, building on the findings in this paper.

On the application front there has been work on the design of data structures and algorithms to mitigate the disadvantages of NVRAMs, chiefly the slower and asymmetric read/write latency and bandwidth [4], [13], [35], [38], [45]. Dhulipala et al. [13] and Gill et al. [17] evaluate the performance of large scale graph analytics on NVRAM based systems. These works focus on application performance evaluation and optimization but do not delve into the details of the behavior of the DRAM cache (the 2LM mode) and why it does not work well on these applications. The goal of this work is to fill this gap. In fact, one could view Sage [13] as a software technique to mitigate the limitations of DRAM caches in NVRAM based systems as discussed in Section VI and Section VII.

III. EVALUATION METHODOLOGY AND VALIDATION

A. Test System

Our test machine is a two-socket Xeon server (illustrated in Figure 1) equipped with 24-core Cascade Lake engineering sample CPUs. The CPU on each socket is equipped with two integrated memory controllers (IMC), each with three memory channels. Integrated memory controllers are responsible for performing the actual reads and writes to DRAM and NVRAM. Each memory channel is populated with a 32 GiB DDR4 DRAM DIMM and a 512 GiB Optane DC DIMM.

B. Evaluation Methodology

To test the basic bandwidth performance of DRAM and NVRAM, both in 1LM and 2LM, we made a custom open source benchmark generator[^1] written in Julia [2]. The generator uses Julia’s metaprogramming and just-in-time compilation to generate custom low overhead load and store loops. Memory can be accessed either sequentially or pseudo-randomly. When accessed pseudo-randomly, we ensure that each address is touched exactly once (i.e., no repeats) using a maximum length Linear Feedback Shift Register to generate array indices. Furthermore, for pseudo-random iteration, access granularity ranges from 64 B to 512 B. We found sequential iteration is largely indifferent to access granularity, so only a single result for sequential access is reported. For these experiments, we used read-only, write-only, and read-modify-write operations. We explore both standard and nontemporal instructions for all stores. Nontemporal stores bypass the on-chip cache, allowing us to directly study the behavior of LLC writes to the memory controller. Data is partitioned evenly across threads when multithreading is used.
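The exactly-once property of a maximum length LFSR can be illustrated with a short sketch. The Python below is not the authors' Julia generator; the function name `lfsr_indices`, the 16-bit default width, and the feedback mask `0xB400` are assumptions chosen for illustration. A Galois LFSR with a maximal feedback polynomial visits every nonzero state once before repeating, so discarding states that fall beyond the array length yields a pseudo-random ordering that touches each index exactly once.

```python
def lfsr_indices(n, width=16, mask=0xB400, seed=1):
    """Yield every index in range(n) exactly once, in pseudo-random order.

    Uses a Galois LFSR. `mask` must encode a maximal-length feedback
    polynomial for `width` bits (0xB400 is a common choice for 16 bits),
    so that the register visits all 2**width - 1 nonzero states.
    """
    period = (1 << width) - 1
    if not 0 < n <= period:
        raise ValueError("n must be between 1 and 2**width - 1")
    if not 0 < seed <= period:
        raise ValueError("seed must be a nonzero state that fits in width bits")
    state = seed
    for _ in range(period):
        # Nonzero states 1..period map to candidate indices 0..period-1;
        # candidates that fall outside the array are skipped.
        if state <= n:
            yield state - 1
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask


if __name__ == "__main__":
    # Small 4-bit register (mask 0xC) so the whole ordering is easy to inspect.
    print(list(lfsr_indices(10, width=4, mask=0xC)))
```

Skipping out-of-range states keeps the no-repeat guarantee for array lengths that are not of the form 2^w - 1, at the cost of a few wasted register steps.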

To measure DRAM and NVRAM traffic, we use uncore hardware performance counters located in each IMC. These counters capture column access strobes (CAS) for DRAM reads and writes. The Cascade Lake generation added IMC counters for NVRAM read and write requests, and 2LM tag statistics including tag hit, tag miss clean, and tag miss dirty, which will be explained in more detail later. Results from the hardware performance counters are validated with the expected data movement and benchmark wall clock time.

![Diagram of the test platform. Each socket has 192 GiB of DRAM and 3 TB of NVRAM spread across six memory channels.](https://example.com/diagram.png)

[^1]: https://github.com/darchr/KernelBenchmarks.jl

[^2]: In this paper we will use Optane DC and NVRAM interchangeably.

[^3]: https://docs.pmem.io/ndctl-user-guide/

Each benchmark was executed on a quiet system. Unless
otherwise specified, all six Optane DC DIMMs are configured
as a single interleaved set and experiments are run on socket
1 to avoid NUMA overheads.

C. NVRAM Performance Results

The results obtained here are in line with observations made by other researchers [18], [26], [39], [43]. We highlight results that are relevant to our upcoming discussion in Section IV on the 2LM DRAM cache. Since read and write bandwidth to Optane DC is asymmetric, we will consider these separately.

Figure 2a shows the read bandwidth of six interleaved 512 GiB NVRAM DIMMs under varying thread counts. Sequential bandwidth scales with the number of threads up to a maximum of 30 GB/s with 8 threads, at which point it stops increasing. This result is lower than the 39 GB/s reported in other works [26] because our system uses 512 GiB DIMMs instead of 128 GiB or 256 GiB DIMMs. The 512 GiB DIMMs provide a maximum read bandwidth of 5.3 GB/s per DIMM while the others provide 6.8 GB/s [9].
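The gap between the two sets of measurements follows from simple arithmetic on the per-DIMM figures. The sketch below assumes ideal linear scaling across six interleaved DIMMs, which is an upper bound rather than a prediction; the function name is hypothetical.

```python
def aggregate_read_bandwidth(per_dimm_gbps, dimms=6):
    """Upper bound on interleaved read bandwidth: per-DIMM peak times DIMM count."""
    return per_dimm_gbps * dimms


bound_512 = aggregate_read_bandwidth(5.3)    # 512 GiB DIMMs used in this system
bound_small = aggregate_read_bandwidth(6.8)  # 128 GiB or 256 GiB DIMMs

print(f"512 GiB DIMMs: bound {bound_512:.1f} GB/s, measured 30 GB/s")
print(f"smaller DIMMs: bound {bound_small:.1f} GB/s, reported 39 GB/s")
```

Both measured values sit slightly below their respective bounds, which is consistent with the DIMM capacity explaining most of the difference.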

Figure 2b demonstrates the write bandwidth of NVRAM when using nontemporal stores. In addition to bypassing the on-chip cache, nontemporal stores do not need a Read-For-Ownership (RFO), a step in Intel’s usual cache coherence protocol [10], and are critical for high NVRAM write bandwidth [51]. Write bandwidth peaks with four threads, and is roughly the same for sequential and random access exceeding 256 B. Limited buffer space within the Optane DIMM decreases the media controller’s ability to merge sequential 64 B writes into a single 256 B write, leading to write amplification and the observed drop in bandwidth [51].

In summary, with this system we can achieve just over 30 GB/s read and 11 GB/s write to NVRAM.

IV. DRAM CACHE / 2LM MODE

Intel Cascade Lake chips support a 2LM mode, where the
Optane DIMMs act as system memory and DRAM serves as
a transparent, hardware managed, direct-mapped cache [26].

The access granularity of this cache is 64 B, matching the exchange size of the underlying CPU. While not mentioned explicitly, Intel patents suggest that cache tags are stored along with ECC data [42]. ECC DRAM is implemented by adding an extra DRAM module to each DIMM. Thus, each 64 B data transaction for each DIMM is accompanied by 8 B (64 bits) of ECC. Of these 64 bits of ECC data, only 20 [5] are required to provide Single Error Correction/Double Error Detection redundancy, leaving ample room for tag metadata, including both physical address and cache line state. Our data is consistent with this approach.
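To see why the spare ECC bits leave ample room, consider a rough sizing sketch. It assumes the tag only has to distinguish the NVRAM lines that alias to the same set of a direct-mapped cache, and it uses the per-channel capacities of the test system (32 GiB DRAM in front of 512 GiB NVRAM). The actual tag layout is not documented, so the helper names and the resulting bit count are illustrative only.

```python
import math

GIB = 1 << 30
LINE_BYTES = 64


def direct_mapped_layout(cache_bytes, backing_bytes, line_bytes=LINE_BYTES):
    """Return (number of sets, aliases per set, minimum tag bits)."""
    sets = cache_bytes // line_bytes
    aliases = backing_bytes // cache_bytes
    tag_bits = math.ceil(math.log2(aliases))
    return sets, aliases, tag_bits


def split_address(addr, cache_bytes, line_bytes=LINE_BYTES):
    """Split a byte address into (tag, set index, byte offset within the line)."""
    sets = cache_bytes // line_bytes
    line = addr // line_bytes
    return line // sets, line % sets, addr % line_bytes


sets, aliases, tag_bits = direct_mapped_layout(32 * GIB, 512 * GIB)
spare_ecc_bits = 64 - 20  # ECC bits left over after SEC/DED, per the text

print(f"{sets} sets, {aliases} aliases per set, {tag_bits} tag bits")
print(f"{spare_ecc_bits} spare ECC bits available for tag and line state")
```

Under these assumptions a handful of tag bits plus a dirty bit fit comfortably within the unused ECC bits, so the tag can travel with the data in a single DRAM read.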

In this section, we use microbenchmarks to try to deduce the
performance implications of the Cascade Lake DRAM cache
design. Our results are summarized in Table I and Figure 3.

A. Methodology

To study the behavior of the 2LM DRAM cache, we used the same benchmarks discussed in Section III and the same methodology for measuring bandwidth. In this case, data gathered from the performance counters allows us to differentiate DRAM and NVRAM traffic. Furthermore, the tag related performance counters in each IMC allow us to correlate tag events with memory traffic. Each IMC only allows four event types to be recorded at a time. Since our benchmarks are long running and largely deterministic, we run them twice to obtain both bandwidth and tag events.

Table I summarizes the observed actions required for each
type of access to the IMC. We define two types of requests to
the IMC. An LLC Read is a request from the LLC for data
from the DRAM cache or NVRAM. This request is generated
on a load or store miss at the LLC. Stores can generate an
LLC read as they may require a RFO. An LLC Write is a
request from the LLC to write back dirty data to the DRAM
cache. LLC write requests are generated either when a dirty
line is evicted from the LLC or from a nontemporal store.

Furthermore, the hardware performance counters differentiate between three different types of cache accesses: hit, clean miss, and dirty miss. A hit implies that the address accessed by an LLC request is present in DRAM. A miss means that an address is not resident in DRAM and must be fetched from NVRAM. Since this cache is direct mapped, a miss implies that some other data is occupying the set corresponding to the requested address. A miss is dirty if this aliasing data has been modified since its original insertion and thus must be written back to NVRAM upon eviction.

To study read and write hits, we use the read-only and write-only benchmarks, respectively, on a 51 GiB array backed by 1 GiB hugepages to mitigate TLB overheads. Because the array is far larger than the 33 MB LLC, each CPU load generates an LLC read and each CPU nontemporal store generates an LLC write. This array is also small enough to fit in the DRAM cache without aliasing. Thus, all LLC read and write accesses will be cache hits.

Generating clean LLC read misses and dirty LLC write misses is also straightforward. We use a 420 GB array, which is over twice the size of the 192 GB DRAM cache per socket. Applying the read-only benchmark to this array for several iterations ensures a clean LLC read miss for each CPU load. Similarly, the write-only benchmark ensures that each nontemporal store generates a dirty LLC write miss.

Testing dirty LLC read misses and clean LLC write misses is more complicated. For dirty LLC read misses, we first prepare the 420 GB array from before by writing to it, making the entire DRAM cache dirty. We then perform a single iteration of the read-only kernel. Thus, each CPU load early in the iteration generates an LLC read that will be a dirty miss in the cache. As the iteration progresses, however, a larger portion of these loads become clean misses as the dirty cache contents are replaced by clean data. Consequently, we determine cache behavior based on data collected early in the iteration. We use a similar procedure to prime and test clean LLC write misses.
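The priming procedure can be reproduced on a toy model. The sketch below is a simplified direct-mapped, insert-on-miss cache written for illustration; the class and function names are hypothetical, the sizes are tiny stand-ins for the 192 GB cache and 420 GB array, and a cold (empty) set is counted as a clean miss.

```python
from collections import Counter


class DirectMappedCache:
    """Toy direct-mapped, insert-on-miss cache that classifies each access."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.tags = [None] * num_sets   # tag of the resident line in each set
        self.dirty = [False] * num_sets

    def access(self, line, is_write):
        idx = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[idx] == tag:
            event = "hit"
        elif self.dirty[idx]:
            event = "miss_dirty"
        else:
            event = "miss_clean"
        if event != "hit":
            # Insert on miss: the new line replaces whatever was resident.
            self.tags[idx] = tag
            self.dirty[idx] = False
        if is_write:
            self.dirty[idx] = True
        return event


def sweep(cache, num_lines, is_write):
    """Touch lines 0..num_lines-1 once and count the resulting tag events."""
    return Counter(cache.access(line, is_write) for line in range(num_lines))


if __name__ == "__main__":
    cache = DirectMappedCache(num_sets=8)
    sweep(cache, 20, is_write=True)          # prime: leave every set dirty
    print(sweep(cache, 20, is_write=False))  # single read-only iteration
```

In this model the first `num_sets` reads of the iteration are dirty misses and the remainder are clean misses, which mirrors why only data from early in the iteration is used.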

When testing the behavior of the cache, we use nontemporal stores when writing. This ensures that the behavior shown by the IMC is purely the result of the incoming store and not an earlier RFO. For all benchmarks, we also compute an effective bandwidth as seen by the application. This is obtained using the size of the array and wall clock time for each benchmark.
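The two bandwidth figures used throughout this section can be computed as follows. This is a minimal sketch: the function names are hypothetical, the 64 B transfer per CAS command is an assumption about the DRAM burst size, and the numbers in the example are made up rather than taken from our measurements.

```python
CAS_BYTES = 64  # assumption: each DRAM CAS command transfers one 64 B line


def effective_bandwidth_gbps(array_bytes, seconds, passes=1):
    """Bandwidth seen by the application: bytes it touched over wall clock time."""
    return array_bytes * passes / seconds / 1e9


def cas_bandwidth_gbps(cas_count, seconds, cas_bytes=CAS_BYTES):
    """Device-level traffic implied by a CAS counter delta over the same interval."""
    return cas_count * cas_bytes / seconds / 1e9


# Hypothetical run: one pass over a 420 GB array in 20 s.
array_bytes = 420e9
seconds = 20.0
dram_read_cas = 6_562_500_000  # hypothetical counter delta

print(effective_bandwidth_gbps(array_bytes, seconds))
print(cas_bandwidth_gbps(dram_read_cas, seconds))
```

Comparing the counter-derived traffic against the effective bandwidth is what exposes any extra accesses the memory controller performs on behalf of a demand request.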

While we only outlined several key benchmarks to test the different regimes of the DRAM cache, we also applied a whole range of microbenchmarks with different thread counts and access patterns to fully characterize the behavior of the cache and validate the results presented here.

B. 2LM Observations

Table I summarizes our findings for the cache events and Figure 3 shows a flow chart of IMC logic that models this behavior. We describe each of these columns in turn. To help with our discussion, we define access amplification as the ratio of memory accesses (i.e., both DRAM and NVRAM) to demand accesses.

LLC read hits are simple. The IMC initiates a DRAM read, which fetches data along with the tag in the ECC bits. A tag check is performed and since the tag matches, the data is immediately forwarded with no access amplification.

Figure 4a shows bandwidth for the read-only benchmark in the 100% clean miss scenario. Note a 3× access amplification for each miss. Essentially, the tag miss is serviced by a miss handler, which fetches the requested cache line from NVRAM, inserts it into DRAM, and forwards it to the CPU. Dirty read misses are handled much the same as clean read misses. The only change is that the cache line evicted from DRAM must be written back to NVRAM.

LLC write hits incur a 2× access amplification because the IMC must first emit a DRAM read to perform a tag check. Only upon verification of the tag can the line be safely written.

Next, we discuss dirty LLC write misses. Figure 4b shows collected bandwidth for the write-only benchmark where each nontemporal store is a dirty tag miss. Observe a 2× access amplification in DRAM writes alone. Upon receiving a completely dirty cache line store yielding a tag miss, we would expect the IMC to write the evicted line to NVRAM and directly insert the incoming line to DRAM. This would yield a total of 1 DRAM read (for the tag check), 1 NVRAM write, and 1 DRAM write. However, the data in Figure 4b suggests that this is not the case. Our best guess is that the memory controller always inserts on a miss (regardless of whether that miss was a read or write). The second DRAM write is thus the actual write of the cache line to DRAM. Clean LLC write misses are similar to dirty write misses, but without the NVRAM write back.
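The per-request behavior described in this subsection can be summarized as a small accounting model. The sketch below is an interpretation of the text rather than a description of the actual IMC logic: the names are hypothetical, and it assumes that a write miss fetches the requested line from NVRAM as part of the insert-on-miss step, which is our best guess rather than a documented fact.

```python
from collections import namedtuple

Traffic = namedtuple("Traffic", "dram_reads dram_writes nvram_reads nvram_writes")


def imc_traffic(request, tag_event):
    """Device accesses generated by one LLC request.

    request:   "read" or "write" (an LLC read or an LLC write)
    tag_event: "hit", "miss_clean", or "miss_dirty"
    """
    if request not in ("read", "write"):
        raise ValueError("request must be 'read' or 'write'")
    if tag_event not in ("hit", "miss_clean", "miss_dirty"):
        raise ValueError("tag_event must be 'hit', 'miss_clean', or 'miss_dirty'")

    dram_reads = 1  # every request starts with a DRAM read returning data and tag
    dram_writes = nvram_reads = nvram_writes = 0

    if tag_event != "hit":
        nvram_reads += 1       # fetch the requested line (assumed for write misses too)
        dram_writes += 1       # insert on miss
        if tag_event == "miss_dirty":
            nvram_writes += 1  # write the evicted dirty line back to NVRAM
    if request == "write":
        dram_writes += 1       # the LLC write itself

    return Traffic(dram_reads, dram_writes, nvram_reads, nvram_writes)


def access_amplification(request, tag_event):
    """Total device accesses per demand access."""
    return sum(imc_traffic(request, tag_event))


if __name__ == "__main__":
    for request in ("read", "write"):
        for tag_event in ("hit", "miss_clean", "miss_dirty"):
            print(request, tag_event, imc_traffic(request, tag_event),
                  access_amplification(request, tag_event))
```

The model reproduces the 3× amplification for clean read misses, the 2× amplification for write hits, and the two DRAM writes observed per dirty write miss.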

C. Dirty Data Optimization

Finally, this brings us to the phenomenon that we call the Dirty Data Optimization (DDO). At times, the memory controller is able to elide the tag check (i.e., DRAM read) and instead directly forward LLC writes to DRAM. This can be seen in Figure 4c, which shows the distribution of traffic for the read-modify-write benchmark in a 100% dirty LLC miss scenario using standard stores. The CPU load initiates a dirty LLC read miss (dirty from a previous write), accounting for.
Due to this low locality, we would expect this delayed LLC write to require another tag check, resulting in a total of two DRAM reads per CPU load-store pair. However, this is not the case and it appears this second tag check is elided. While this could be explained by an inclusive cache, we found that this is not the case as it is possible to have small amounts (< 8 KiB) of aliasing data simultaneously within the CPU cache. Thus, we are not sure of the exact mechanism driving this optimization.

D. Discussion

We described our observations of 2LM’s mechanics, but what does this mean for user applications? There are two points we want to make. First, contrast Figure 4, which shows the effective NVDIMM bandwidth in 2LM with a high miss rate, with Figures 2a and 2b, which show the maximum speed of NVRAM. The highest NVRAM read bandwidth in 2LM (Figure 4a) is 23 GB/s and the highest write bandwidth (Figure 4b) is 8 GB/s. This is roughly 77% and 72% of the demonstrated achievable bandwidth of our system’s NVRAM (30 GB/s read and 11 GB/s write). This is the ideal case with well formed traffic. We expect applications with a large memory footprint (exactly those that would benefit from the large memory pool provided by NVRAM) and a high DRAM cache miss rate to experience a severe bandwidth bottleneck. Second, cache misses are costly in terms of extra traffic generated, with LLC read and write misses generating up to 4× and 5× access amplification, respectively. This is costly both in terms of energy and lost bandwidth.

So far, we have demonstrated the potential for applications to experience bandwidth bottlenecks in 2LM. In the next two sections, we provide case studies demonstrating this effect on real applications.

V. CASE STUDY 1: CONVOLUTIONAL NEURAL NETWORKS

In this section, we will take a deep dive into some of the pitfalls a bandwidth and compute heavy application can fall into when running under 2LM. Specifically, we consider the problem of training deep Convolutional Neural Networks (CNNs) whose working set size greatly exceeds the physical DRAM of a system, requiring the extra memory provided by NVRAM. CNNs are typically expressed as a directed acyclic graph of computation primitives such as convolutions and matrix multiplications, which are heavy on compute, and operations such as batch normalization and concatenation, which are heavy on bandwidth requirements. At a high level, a single iteration of training consists of a forward pass, during which the network is evaluated (almost) normally on a batch of training data (some kernels like Batch Normalization have slightly different versions for training and inference). The output of the forward pass is compared to an expected output to generate a loss value, which is used in the backward pass to compute the partial derivative of the loss with respect to each of the trainable parameters of the network. The parameters of the network are adjusted based on these derivatives. An important aspect of the backpropagation algorithm is that many intermediate values computed during the forward pass must be preserved to compute the backward pass. Thus, the active memory footprint of the network during an iteration of training increases during the forward pass, then decreases during the backward pass. It takes many such iterations of training across different input samples to fully train a CNN.

A. Methodology

We implemented three popular large CNNs: Inception v4 [48], Resnet 200 [21], and DenseNet [23] using the ngraph compiler [11] on the NVRAM-based system described earlier. Intel’s ngraph compiler is an optimizing compiler specifically targeting static deep neural networks that takes advantage of the Xeon CPU ISA. For these large networks, we scaled the training batch size until the overall footprint of these applications exceeded 650 GB, well beyond the capacity of the DRAM cache. All networks were run on a single NUMA node and assigned all 24 physical cores on that node with no hyper-threading. These networks were run for two warm.

<table>
<thead>
<tr>
<th>Method</th>
<th>DRAM Read (GB/s)</th>
<th>DRAM Write (GB/s)</th>
<th>NVRAM Read (GB/s)</th>
<th>NVRAM Write (GB/s)</th>
<th>Effective (GB/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random 64 B</td>
<td>10</td>
<td>10</td>
<td>20</td>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>Random 128 B</td>
<td>5</td>
<td>5</td>
<td>10</td>
<td>10</td>
<td>10</td>
</tr>
<tr>
<td>Random 256 B</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Random 512 B</td>
<td>5</td>
<td>5</td>
<td>10</td>
<td>10</td>
<td>10</td>
</tr>
</tbody>
</table>

![Figure 4a](image_url) (a) Read-only benchmark, clean LLC read misses, 24 threads.

![Figure 4b](image_url) (b) Write-only benchmark, dirty LLC write misses, 24 threads, nontemporal stores. Using 4 threads only increases the maximum write bandwidth by 1 GB/s.

![Figure 4c](image_url) (c) Read-modify-write benchmark, dirty LLC read miss followed by a later DDO LLC write, 4 threads, standard stores. Sequential achieves the highest NVRAM write bandwidth of any 2LM benchmark with negligible difference between nontemporal and standard stores.
There are noticeable regions of high tag hits at the beginning of the forward and backward passes, with a corresponding drop in dirty tag misses. Finally, Figure 5c breaks down the read and write bandwidths to DRAM and NVRAM. Regions of high dirty miss rate correspond to low bandwidth and instruction throughput. Reasonable system performance is only achieved when the hit rate is high.

A natural question at this point is why so many dirty tag misses are generated, and why there are regions of high cache hit rate. Two related phenomena can explain this.

Figure 5d shows the memory usage of DenseNet through time for a single iteration of training. Before execution, the ngraph compiler allocates a single buffer for the entire network. The offset from the base of this buffer is shown on the vertical axis of Figure 5d. The change in memory state through time is shown using different colors. The color white indicates that the region of memory is free (semantically speaking). That is, it will always be written to before it is read by the program. A blue highlight indicates that a region of memory is being actively written to, red indicates a read, and gray indicates that the memory will be read from in the future.

For an iteration of training, first the forward pass of the model is computed (up to time around 220, annotated in Figure 5d). Throughout the forward pass, some of the generated intermediate tensors must be held in memory to facilitate computation of the backward pass. Thus, the amount of live memory (gray) accumulates through the forward pass. Once a preserved tensor is used on the backward pass, the region in memory where it was stored is free for further use (white). The ngraph compiler takes advantage of this newly freed area to allocate intermediate tensors required to compute the backward pass. This is the very subtle streak of blue on the right shoulder of Figure 5d.

However, from the perspective of the 2LM cache, the fact that writes are occurring to a region of memory on the backward pass makes that memory dirty with respect to the DRAM cache. Hence, even when this region of memory is semantically free from the program’s perspective, the cache must still generate a dirty write back upon eviction. Because the DRAM cache is unaware of the meaningful lifetime of memory, it generates a large amount of unnecessary traffic.

Finally, the regions of high DRAM cache hit rate occur at the beginning of the forward and backward passes because the area of active memory folds back on itself. Recent data is in the cache, so all accesses are cache hits. This continues until the entire cache has been read, at which point further accesses are cache misses.

C. Problematic Kernels

To wrap up this section, we will explain the relatively high frequency periodic behavior that is noticeable in the Tag Hit line of Figure 5. DenseNet is composed of a linear chain of “dense blocks,” where each dense block consists of a sequence of Concat, BatchNorm, Conv, BatchNorm, and Conv operators. Figure 6 shows a high resolution snapshot of the bandwidth for two such dense blocks during the forward pass of DenseNet. The point where kernels begin execution is annotated on the graph. The main performance bottlenecks apparent in Figure 6 are Concat and BatchNorm. These are both memory-bound kernels with little data reuse and are more affected by the low bandwidth associated with a high dirty tag miss rate. The second BatchNorm within each dense block operates on much smaller intermediate tensors, and is thus less impactful on overall performance. Similar problematic kernels exist on the backwards pass as well, including BatchNorm-Backprop and the back-propagation kernels for the filter/bias inputs of 3x3 convolutions.

D. Discussion

In summary, the overall performance of CNN training in 2LM mode in NVRAM-based systems is affected by two factors: (1) low effective bandwidth with a high miss rate and (2) a significant amount of unnecessary dirty writebacks. From the microbenchmarks, the first of these is not too surprising. However, the second exposes a performance pathology not demonstrated by the microbenchmarks, made worse by the relatively low write bandwidth of NVRAM. Next, we will look at a different class of algorithms that suffer similarly.

VI. CASE STUDY 2: GRAPH PROCESSING

In this section, we perform a preliminary study on applications known for having diverse performance characteristics and irregular memory access patterns. To accomplish this, we evaluate a variety of graph processing algorithms on large real-world graph inputs using Galois [24], a high performance shared memory graph analytics framework.

A. Background

Large graph processing has garnered substantial research interest across a variety of use cases, including the identification of social media influencers and decision makers, or finding fraudulent actors within a business network. These real-world large systems require frameworks to process representative graphs with tens of billions of nodes and trillions of edges, incurring a high memory footprint that is expensive to accommodate in DRAM. Depending on the topology of the input graph and the processing algorithm being used, the memory access pattern can vary wildly. This presents a challenge when optimizing such workloads for systems with limited main memory.

To address these issues, several efforts [14], [18] have explored leveraging NVRAM for graph analytics on a single machine. However, such works focused on performing an analysis and comparison of different graph processing frameworks and system settings to optimize the use of Optane for graph workloads. Here, we evaluate the bandwidth characteristics of such irregular workloads in 2LM.

B. Methodology

Graph kernel experiments were run on the shared memory graph analytics framework Galois. Specifically, our evaluations consisted of 4 benchmarks from the lonestar suite: breadth-first search (bfs) [8], connected components (cc) [44], [46], k-core decomposition (kcore) [12], and pagerank-push (pr) [37]. These kernels were chosen based on their diverse execution characteristics [1]. Our workloads were run with the settings used by Gill et al. [18]. For bfs, the source node was the maximum out-degree node. The tolerance of pr was set to 10^-6 and we used k = 100 for kcore. Each kernel ran until convergence, except for pr, which ran for 100 rounds.

We used two realistic unweighted massive input graphs: wdc12 [36], the largest publicly available graph, and kron30 [30], a randomized scale free graph generated using a graph500 based kronecker generator [19]. They were chosen to highlight the difference between a graph that fits in the DRAM cache and one that does not. While these graphs have different structures, we can still draw conclusions from the kernels’ relative performance on these graphs. Both were processed using the provided graph-converter in Galois and resulted in binaries of size 507 GB and 73 GB respectively.

In 2LM, all benchmarks were run on two NUMA nodes and assigned all 96 threads. Since two sockets are used, the size of the DRAM cache is effectively doubled to 384 GB with 6 TB of NVRAM. Full NUMA interleaving and 2 MiB hugepages were used, with no page migration, to maximize performance [18].

To find the baseline data movement required by the algorithms, we configured the NVRAM regions on each socket as extra NUMA nodes. This is facilitated through the daxctl tool (https://docs.pmem.io/ndctl-user-guide/daxctl-man-pages) with the machine in 1LM. Since Galois uses a NUMA preferred policy, the threads on each socket will initially allocate memory on that socket’s DRAM. When DRAM is exhausted, further allocations are serviced by NVRAM. By summing the traffic to DRAM and NVRAM, we can establish the baseline memory traffic required by each application.
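
The accounting described above can be sketched as follows. This is a minimal illustration only: the class and function names are hypothetical, the byte counts are invented rather than measured, and the amplification ratio is simply one convenient way to compare a 2LM run against the 1LM baseline.

```python
from dataclasses import dataclass

GB = 10**9


@dataclass
class Traffic:
    """Bytes moved to or from one memory device (hypothetical counter totals)."""
    read_bytes: int
    write_bytes: int

    @property
    def total(self) -> int:
        return self.read_bytes + self.write_bytes


def baseline_traffic(dram: Traffic, nvram: Traffic) -> int:
    """Baseline demand traffic in 1LM: all bytes that touched DRAM or NVRAM."""
    return dram.total + nvram.total


def amplification(observed_2lm_bytes: int, baseline_bytes: int) -> float:
    """Ratio of the traffic observed in 2LM to the 1LM baseline."""
    return observed_2lm_bytes / baseline_bytes


# Invented numbers, for illustration only.
dram = Traffic(read_bytes=300 * GB, write_bytes=100 * GB)
nvram = Traffic(read_bytes=80 * GB, write_bytes=20 * GB)
base = baseline_traffic(dram, nvram)
```

With these invented totals the baseline is 500 GB, so a hypothetical 2LM run that moved 1500 GB would correspond to an amplification of 3.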

As with our previous experiments, measurements of bandwidth and tag statistics were gathered using hardware performance counters.

C. Results

Figure 7 compares the observed bandwidth when running the graph kernels on kron30 and wdc12. When processing kron30, the kernels have a working set that largely fits within the DRAM cache while the working set when processing wdc12 greatly exceeds the DRAM cache. When the working set does not fit in the DRAM cache, there is a significant decrease in DRAM bandwidth during an algorithm’s execution.

Figure 8 shows the total amount of data moved during the execution of a graph kernel when the input graph does not fit in the DRAM cache. When the working set fits in the DRAM cache, bandwidth is stable at 70 GB/s with roughly equal DRAM reads and writes.

On the other hand, Figure 8b demonstrates the bandwidth of pagerank-push when its working set does not fit in the DRAM cache. Not only is the average bandwidth significantly lower, but there is also an excess of DRAM reads coupled with heavy NVRAM traffic. The tag metrics shown in Figure 8c show the presence of both clean and dirty tag misses as well as the correlation between hit rate and DRAM bandwidth.

D. Discussion

As with CNN training, large scale graph processing is a workload with a high DRAM cache miss rate. This is made worse since traditional graph algorithm implementations involve mutating the in-memory representation of the graph [18]. In 2LM, this mutation will mark the corresponding memory as dirty. Thus, not only is the miss rate high, but many of these misses require NVRAM write backs, which we have demonstrated to be inefficient. As a result, it is not surprising that 2LM behaves poorly for these particular implementations.

VII. DISCUSSION AND MITIGATION STRATEGIES

In this paper, we demonstrated that the DRAM cache as currently implemented in Intel’s Cascade Lake systems performs poorly for applications with a high miss rate. We showed that a DRAM cache miss can cause 3–5× more memory accesses than the original demand requests. Further, we showed that this causes performance degradation in two bandwidth-limited workloads, CNN training and graph analytics, which are important use cases for NVRAM since they have extremely large memory footprints. Finally, we showed that certain data reuse semantics at the program level can cause severe degradation.

For instance, in the deep neural network training workload, a significant amount of the data movement from the DRAM cache to NVRAM is useless as this data was only meant to be used temporarily by the program and will be overwritten before it is read again. This dirty temporary data dominates the DRAM cache leading to more misses than necessary and limiting performance to the smaller NVRAM write bandwidth.

A. Software-managed multi-level memory

So what can be done about this? In this section, we look at an example of software-managed memory for each of the case studies presented previously: CNN training and graph analytics. We show that through software-managed memory, we can obtain better performance than using the hardware-managed cache in 2LM mode for these miss-heavy, bandwidth-bound workloads.

Software management relies on decoupling the DRAM and NVRAM memory pools. So far, this paper has focused on the 2LM (or “memory mode”) configuration of NVRAM systems. These systems can also be configured in “app-direct mode”, or 1LM, where the programmer has full control over data location and movement. In this mode, NVRAM is simply mapped into a program’s address space.
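
As a rough sketch of what mapping a memory region into a program’s address space looks like, the example below uses Python’s standard mmap module on an ordinary temporary file. The file is only a stand-in: on a real app-direct system the path would refer to a file on a DAX-capable mount or a DAX device, and the helper names map_region and demo are hypothetical.

```python
import mmap
import os
import tempfile


def map_region(path: str, size: int) -> mmap.mmap:
    """Map `size` bytes of the file at `path` into the address space."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)      # make sure the backing file is large enough
        return mmap.mmap(fd, size)  # shared, readable and writable by default
    finally:
        os.close(fd)                # the mapping remains valid after closing fd


def demo() -> bytes:
    # A temporary file stands in for a file on persistent memory.
    with tempfile.TemporaryDirectory() as d:
        region = map_region(os.path.join(d, "pool.bin"), 4096)
        try:
            region[0:5] = b"hello"     # ordinary store into the mapped region
            return bytes(region[0:5])  # ordinary load from the mapped region
        finally:
            region.close()
```

Once the region is mapped, the program decides explicitly which data structures live in it and when data is copied between it and DRAM, which is the control that the hardware-managed cache in 2LM does not expose.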

1) CNN Training: Hildebrand et al. showed that for static compute graphs such as static CNNs, where there is no data dependent behavior and the structure of the network and sizes of intermediate tensors are fully known ahead of time, software data movement can provide a significant performance boost over hardware management [22]. This work, AutoTM, does so by using integer linear programming and a profile-guided optimizer. AutoTM understands the execution time of kernels with input and output tensors in various combinations of DRAM and NVRAM and can manage these locations and data movement to minimize execution time under a set DRAM budget. With this knowledge, AutoTM achieves a 1.88×, 2.24×, and 3.10× speedup over 2LM for Inception v4, ResNet 200, and DenseNet 264 respectively [22].
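
To give a feel for the kind of optimization problem involved, the toy sketch below places each tensor in DRAM or NVRAM so as to minimize total kernel time under a DRAM budget. It is not AutoTM’s formulation: it uses exhaustive search rather than an integer linear programming solver, it ignores tensor lifetimes and the cost of moving data, and all names and numbers are hypothetical.

```python
from itertools import product
from typing import NamedTuple


class Tensor(NamedTuple):
    name: str
    size_gb: float
    time_in_dram: float   # assumed profiled kernel time (s) with tensor in DRAM
    time_in_nvram: float  # assumed profiled kernel time (s) with tensor in NVRAM


def best_placement(tensors, dram_budget_gb):
    """Try every DRAM/NVRAM assignment and keep the fastest one that fits."""
    best_time, best_choice = float("inf"), None
    for choice in product((True, False), repeat=len(tensors)):  # True = DRAM
        used = sum(t.size_gb for t, in_dram in zip(tensors, choice) if in_dram)
        if used > dram_budget_gb:
            continue  # violates the DRAM budget
        total = sum(t.time_in_dram if in_dram else t.time_in_nvram
                    for t, in_dram in zip(tensors, choice))
        if total < best_time:
            best_time, best_choice = total, choice
    placement = {t.name: ("DRAM" if in_dram else "NVRAM")
                 for t, in_dram in zip(tensors, best_choice)}
    return best_time, placement


# Invented example: 13 GB of tensors, 8 GB of DRAM.
tensors = [
    Tensor("act1", 4, 1.0, 3.0),
    Tensor("act2", 6, 2.0, 2.5),
    Tensor("weights", 3, 0.5, 2.0),
]
total_time, placement = best_placement(tensors, dram_budget_gb=8)
```

In this invented instance the search keeps act1 and weights in DRAM and leaves act2 in NVRAM, since act2 is large and slows down the least when placed in NVRAM. Exhaustive search is only viable for a handful of tensors, which is why a solver-based formulation is needed at realistic scale.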

First, AutoTM is aware of the difference between semantically live data and dead data and thus elides the unnecessary dirty write-backs on the backward pass. This can be seen in Figure 10, which shows the trace of bandwidth throughout a single iteration of training for the large DenseNet model under AutoTM. In contrast with Figure 5c, AutoTM only generates NVRAM writes during the forward pass (where it is storing intermediate activations for use on the backward pass). Similarly, AutoTM only generates NVRAM reads during the backward pass. Table II compares the total amount of data moved for these workloads in 2LM and under AutoTM. AutoTM generates similar amounts of DRAM traffic, but only 50% to 60% of the NVRAM traffic.

The average read and write bandwidth that AutoTM achieves to NVRAM is also significantly higher than that achieved during 2LM. This is because AutoTM is designed to read and write to NVRAM in the patterns discussed in Section III for achieving high bandwidth. However, the average bandwidth in Figure 10 does not tell the whole story. Under AutoTM, tensors are usually moved between DRAM and NVRAM (and vice versa) synchronously between compute kernel executions. Therefore, during kernel execution, there is no data movement. Thus, we are seeing the bandwidth averaged over times of data movement and times of no data movement, implying the active bandwidth is much higher.
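
The gap between average and active bandwidth can be made concrete with a small calculation. The function names and the numbers below are illustrative assumptions, not measurements from this work.

```python
def average_bandwidth(bytes_moved: float, move_time_s: float,
                      compute_time_s: float) -> float:
    """Bandwidth averaged over both data movement and kernel execution."""
    return bytes_moved / (move_time_s + compute_time_s)


def active_bandwidth(bytes_moved: float, move_time_s: float) -> float:
    """Bandwidth during the intervals in which data is actually moving."""
    return bytes_moved / move_time_s


# Invented example: 100 GB moved in 10 s, interleaved with 30 s of compute.
moved = 100e9
avg = average_bandwidth(moved, move_time_s=10, compute_time_s=30)
act = active_bandwidth(moved, move_time_s=10)
```

Here the average over the whole iteration is 2.5 GB/s, while the bandwidth during the transfers themselves is 10 GB/s, four times higher.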

2) Graph Analytics: As pointed out in Section VI, graph algorithm implementations in Galois and other graph frameworks often mutate the graph data structure. With NVRAM, this is an issue due to its low write bandwidth (which is further exacerbated by 2LM’s write amplification). To tackle this issue, the authors of Sage [18] designed their software specifically with NVRAM in mind. Their key approach is to (as much as possible) use NVRAM for read-only data.

When running algorithms that require tracking state (such as nodes visited for bfs), an auxiliary DRAM-based data structure is used. This data structure is greatly compressed and supplements the read-only NVRAM-based adjacency list. Mutation is only performed on the auxiliary data structure, and hence write traffic is only generated to DRAM. To optimize for multiple sockets, Sage takes advantage of NVRAM’s capacity to keep a full copy of the graph on both CPU sockets. With these techniques, they were able to design algorithms 1.87× faster on average than GBBS and 1.94× faster on average than Galois in 2LM [18].
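
The separation between read-only graph data and mutable auxiliary state can be sketched as follows. This is not Sage’s implementation: the adjacency list is stored in immutable tuples in compressed sparse row (CSR) form as a stand-in for read-only NVRAM data, the visited flags use one uncompressed byte per vertex as a stand-in for the DRAM-resident structure, and the function name is hypothetical.

```python
from collections import deque


def bfs_levels(offsets, neighbors, source):
    """Breadth-first search over a CSR graph that is never written to.

    Returns the BFS level of every vertex, or -1 if it is unreachable.
    """
    n = len(offsets) - 1
    visited = bytearray(n)  # auxiliary mutable state (would live in DRAM)
    level = [-1] * n        # auxiliary mutable state (would live in DRAM)
    visited[source] = 1
    level[source] = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        # Only reads touch the adjacency data.
        for v in neighbors[offsets[u]:offsets[u + 1]]:
            if not visited[v]:
                visited[v] = 1
                level[v] = level[u] + 1
                frontier.append(v)
    return level


# Edges: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
offsets = (0, 2, 3, 4, 4, 4)
neighbors = (1, 2, 3, 3)
levels = bfs_levels(offsets, neighbors, source=0)
```

All writes land in visited and level, whose size scales with the number of vertices rather than the number of edges, so the much larger edge data only ever sees reads.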

This is another example demonstrating that clever software management can overcome the bandwidth limitations of NVRAM. Conversely, these same limitations are exacerbated by access amplification caused by the DRAM cache.

B. Limitations of software approaches and future directions

Even though the software approaches discussed above mitigate some of the problems of hardware-managed DRAM caches, these approaches have limitations. They use the CPU cores to move data via loads and non-temporal stores. The DMA copy engines in current systems are designed for I/O data movement and not high bandwidth movement between different memory technologies. These DMA devices’ programming models and performance characteristics do not fit the requirements of this data movement. Additionally, because these approaches use CPUs for data movement, it is difficult to transfer data asynchronously.

Looking forward, future research should concentrate on providing hardware-software co-design for data movement between NVRAM and DRAM. If software, with its high-level knowledge of data access patterns, could work with the hardware, then we could realize the benefits of hardware acceleration without the limitations presented above.

ACKNOWLEDGMENTS

This work is supported in part by the Intel Corporation and by the National Science Foundation under Grant No. CNS-1850566.

We would also like to thank our anonymous reviewers and members of the Davis Computer Architecture Research Group (DArchR) for their valuable feedback.

REFERENCES
