Performance analysis of graph workloads with NVDIMMs+DRAM caches
NVDIMMs based on 3D XPoint (such as the Optane DC Persistent Memory Module) are emerging as an attractive option for applications that require tens of terabytes of memory, such as graph analytics and machine learning. To mitigate the higher latency and lower bandwidth of NVDIMMs, Intel systems use a smaller-capacity DRAM as a cache for the larger-capacity NVRAM in the so-called 2LM mode (also known as memory mode or cached mode). In the past, DRAM caches have been studied in the context of die-stacked systems, where the goal was to use a few gigabytes of stacked DRAM as a giant last-level cache, mainly to overcome the bandwidth limitation of going off chip. However, the purpose of DRAM caches in an NVRAM-based system is different. Instead of a few gigabytes, the DRAM cache in Intel’s Cascade Lake systems can easily be 384 GB with 6 TB of backing main memory.
The goal of this work is to provide initial answers on how DRAM caches behave at this scale by taking a deep dive into the performance of 2LM-based systems using real hardware and the built-in performance counters. We focus our study on large-scale graph processing for two reasons. First, this represents a growing workload, with today’s large graphs already requiring many terabytes of RAM. Second, these workloads have irregular memory access patterns that are difficult to predict (e.g., with software-managed approaches). We show that the current DRAM cache implementation (which is a naive direct-mapped cache) performs poorly on graph workloads: it does not take full advantage of the available bandwidth and generates a significant amount of unnecessary traffic. Further, we argue that designing hardware-managed DRAM caches is an important problem that the computer architecture community should address.
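To make the direct-mapped behavior concrete, the following is a minimal sketch of a toy direct-mapped cache replaying two synthetic address traces. It is not a model of the 2LM hardware and not the methodology of this work: the function name `direct_mapped_hit_rate`, the 64-byte line size, the tiny cache size, and the uniform random trace standing in for irregular graph accesses are all assumptions chosen for illustration. Only the 16:1 ratio of backing memory to cache mirrors the 6 TB to 384 GB example above.

```python
import random


def direct_mapped_hit_rate(addresses, num_lines, line_size=64):
    """Replay byte addresses through a toy direct-mapped cache.

    Each block can live in exactly one cache line (block % num_lines),
    so two blocks that share an index always evict each other.
    line_size=64 is an illustrative assumption, not a hardware detail.
    """
    tags = [None] * num_lines  # one stored tag per cache line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size
        index = block % num_lines
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag  # miss: replace whatever was there
        total += 1
    return hits / total if total else 0.0


# Hypothetical toy sizes: footprint is 16x the cache capacity.
NUM_LINES = 1024
LINE_SIZE = 64
FOOTPRINT = 16 * NUM_LINES * LINE_SIZE

# Streaming trace: consecutive 8-byte words over the whole footprint.
sequential = range(0, FOOTPRINT, 8)

# Irregular trace: uniformly random 8-byte words (a crude stand-in
# for pointer-chasing graph accesses).
rng = random.Random(0)
irregular = [rng.randrange(0, FOOTPRINT, 8) for _ in range(20000)]

seq_rate = direct_mapped_hit_rate(sequential, NUM_LINES, LINE_SIZE)
irr_rate = direct_mapped_hit_rate(irregular, NUM_LINES, LINE_SIZE)
print(f"sequential hit rate: {seq_rate:.3f}")
print(f"irregular hit rate:  {irr_rate:.3f}")
```

In this toy setting the streaming trace benefits from spatial locality within each line, while the irregular trace mostly misses because the footprint far exceeds the cache and each block has only one possible location. Real graph workloads and the real hardware are considerably more complex, so the sketch only illustrates the mechanism, not the measured results.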
Short teaser video
We also have an ISPASS paper and presentation that analyze these DRAM caches in more depth in the context of terabyte-scale NVDIMMs. That work also analyzes machine learning workloads, specifically large-scale CNNs, on these heterogeneous memory systems.
This work is sponsored by Intel.