We propose a new data structure called CachedEmbeddings for training large-scale deep learning recommendation models (DLRM) efficiently on heterogeneous (DRAM + non-volatile) memory platforms. CachedEmbeddings implements an implicit software-managed cache and data-movement optimizations, integrated with the Julia programming framework, to optimize large-scale DLRM implementations with multiple sparse embedding table operations. In particular, we show an implementation that is 1.4X to 2X better than the best-known Intel CPU-based implementations on state-of-the-art DLRM benchmarks on a real heterogeneous memory platform from Intel, and a 1.32X to 1.45X improvement over Intel's 2LM implementation, which treats the DRAM as a hardware-managed cache.
@inproceedings{10.1007/978-3-031-32041-5_3,
author = {Hildebrand, Mark and Lowe-Power, Jason and Akella, Venkatesh},
title = {Efficient Large Scale DLRM Implementation On Heterogeneous Memory Systems},
year = {2023},
isbn = {978-3-031-32040-8},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-031-32041-5_3},
doi = {10.1007/978-3-031-32041-5_3},
booktitle = {High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings},
pages = {42–61},
numpages = {20},
location = {Hamburg, Germany}
}
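To make the CachedEmbeddings idea above concrete, the Python sketch below shows a software-managed cache in which hot embedding rows are staged into a small buffer held in fast memory and evicted rows are written back to the slow memory holding the full table. The paper's actual implementation is in Julia and is integrated with the Julia framework, so the class and method names here are purely illustrative.

import numpy as np

class CachedEmbeddingTable:
    # Illustrative software-managed cache: hot embedding rows are copied into a
    # small buffer that would live in fast memory (DRAM); all other rows stay in
    # the slow memory (e.g., non-volatile memory) where the full table resides.
    def __init__(self, slow_table, cache_rows):
        self.slow = slow_table
        self.cache = np.empty((cache_rows, slow_table.shape[1]), dtype=slow_table.dtype)
        self.row_of_slot = [-1] * cache_rows   # which table row each cache slot holds
        self.slot_of_row = {}                  # table row -> cache slot
        self.next_slot = 0                     # simple round-robin eviction pointer

    def lookup(self, row):
        slot = self.slot_of_row.get(row)
        if slot is None:                       # miss: stage the row into the cache
            slot = self.next_slot
            self.next_slot = (self.next_slot + 1) % len(self.row_of_slot)
            victim = self.row_of_slot[slot]
            if victim >= 0:                    # write the evicted row back to slow memory
                self.slow[victim] = self.cache[slot]
                del self.slot_of_row[victim]
            self.cache[slot] = self.slow[row]
            self.row_of_slot[slot] = row
            self.slot_of_row[row] = slot
        return self.cache[slot]

    def embedding_bag(self, indices):
        # Sparse embedding lookup-and-reduce, served from the fast-memory cache.
        return sum(self.lookup(i) for i in indices)

For example, CachedEmbeddingTable(np.zeros((10**6, 64), dtype=np.float32), cache_rows=4096).embedding_bag([3, 17, 99]) touches slow memory only on the first access to each row; the paper's implementation adds further data-movement optimizations on top of this basic caching behavior.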
The RowHammer vulnerability affects modern memory devices: it allows an attacker to cause bit flips in a cell without accessing that cell. RowHammer effects are expected to worsen in future memory technologies due to scaling, so we need to invest in studying and mitigating RowHammer attacks. We therefore propose a model to simulate RowHammer in gem5 and capture its system-level interactions.
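As a rough illustration of what such a model has to track, the Python sketch below counts row activations between refreshes and probabilistically flips bits in neighboring rows once an aggressor row is hammered past a threshold. The threshold, blast radius, and flip probability are made-up illustrative parameters, not values from the gem5 model.

import random

class RowHammerModel:
    # Toy model: count ACT commands per row within a refresh window and flip
    # bits in physically adjacent (victim) rows once an aggressor row crosses
    # the hammer threshold.
    def __init__(self, rows_per_bank, threshold=50_000, flip_prob=1e-3):
        self.rows_per_bank = rows_per_bank
        self.threshold = threshold        # activations before neighbors become vulnerable
        self.flip_prob = flip_prob        # chance of a flip per excess activation
        self.act_count = {}               # (bank, row) -> activations since last refresh
        self.flips = []                   # recorded (bank, victim_row) bit-flip events

    def activate(self, bank, row):
        key = (bank, row)
        self.act_count[key] = self.act_count.get(key, 0) + 1
        if self.act_count[key] > self.threshold:
            for victim in (row - 1, row + 1):          # single-row blast radius
                if 0 <= victim < self.rows_per_bank and random.random() < self.flip_prob:
                    self.flips.append((bank, victim))  # a real model would corrupt the data

    def refresh(self):
        self.act_count.clear()            # refresh restores the charge, reset the counters

A simulator integration would call activate() for every ACT command issued by the memory controller and refresh() when rows are refreshed, corrupting the simulated memory contents rather than merely recording the event, which is what makes the system-level interactions visible.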
Architecture simulators are a common and powerful tool in computer architecture research, and it is important that the results they report are trustworthy. gem5 is a well-known architectural simulator used by academia and industry. We present our methodology and tools for evaluating gem5's memory subsystem components and the results of our validation of gem5's current memory system components. We have validated the accuracy of the DDR models in gem5 and report a significant difference between gem5 and our reference for the HBM models. In addition, we have validated the functional correctness and accuracy of the cache models in gem5. Lastly, we observe a 10% difference between gem5 and real hardware in our random access benchmark.
The increasing growth of applications' memory demands has led CPU vendors to deploy diverse memory technologies, either within the same package as heterogeneous memory systems or in disaggregated form through local or remote memory nodes. As these new memory technologies emerge, conventional memory management should be reconsidered to better meet applications' memory requirements. However, no suitable model is available to the community to accurately study these new systems. In this work we describe our contribution toward a cycle-level analysis model of heterogeneous memories in the gem5 simulator. We believe this work enables the community to perform design space exploration for the next generation of memory systems.
HPC systems will employ various memory technologies to meet applications' demands. The ability to model heterogeneous systems (from both compute and memory perspectives) makes gem5 a highly suitable tool for evaluating future HPC systems. In this work, we discuss new contributions to gem5 that extend its modeling support for heterogeneous memory systems. Specifically, we show the new DRAM cache model, the improved HBM model in gem5, and the refactoring of its memory controller models.
The increasing growth of applications' memory demands has led CPU vendors to deploy large DRAM caches backed by large non-volatile memories such as Intel Optane (e.g., in Intel's Cascade Lake). Previous work has explored many aspects of DRAM cache design in simulation, such as the caching granularity and DRAM cache tag placement, to improve performance. However, these works do not provide an open-source DRAM cache modeling platform for detailed micro-architectural and timing analysis. In this presentation we describe a cycle-level unified DRAM cache and main memory controller (UDCC) for gem5. The protocol is inspired by actual hardware that provides a DRAM cache, such as Intel's Cascade Lake, in which a DRAM cache is backed by NVRAM as the off-chip main memory sharing the same bus. We leverage the cycle-level DRAM and NVRAM models in gem5.
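As a rough sketch of the request flow such a unified controller has to arbitrate, the Python below models a direct-mapped, write-back DRAM cache in front of NVRAM at cache-line granularity. It is purely conceptual and does not reflect gem5's actual classes, scheduling, or timing.

class UnifiedDramCacheController:
    # Conceptual hit/miss flow for a direct-mapped, write-back DRAM cache in
    # front of NVRAM, with both devices driven by a single controller.
    LINE = 64                                    # cache-line granularity in bytes

    def __init__(self, cache_lines):
        self.cache_lines = cache_lines
        self.tags = {}                           # set index -> (tag, dirty)

    def _index_tag(self, addr):
        line = addr // self.LINE
        return line % self.cache_lines, line // self.cache_lines

    def _evict_if_dirty(self, idx):
        entry = self.tags.get(idx)
        if entry and entry[1]:
            pass                                 # stand-in for the NVRAM write-back commands

    def read(self, addr):
        idx, tag = self._index_tag(addr)
        entry = self.tags.get(idx)
        if entry and entry[0] == tag:
            return "DRAM cache read hit"         # served by one DRAM access, NVRAM untouched
        self._evict_if_dirty(idx)                # miss: write back a dirty victim, if any,
        self.tags[idx] = (tag, False)            # then fill the line from NVRAM into DRAM
        return "DRAM cache read miss, filled from NVRAM"

    def write(self, addr):
        idx, tag = self._index_tag(addr)
        entry = self.tags.get(idx)
        if not (entry and entry[0] == tag):      # write-allocate on a miss
            self._evict_if_dirty(idx)
        self.tags[idx] = (tag, True)             # the line is now dirty in the DRAM cache
        return "write absorbed by DRAM cache"

Even in this toy form it is clear why a unified controller matters: each miss turns one CPU request into a sequence of DRAM and NVRAM commands that compete for the same shared bus, which is the scheduling behavior the cycle-level UDCC model is meant to expose.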
]]>Ayaz Akram, Venkatesh Akella, Sean Peisert, Jason Lowe-Power. IEEE International Symposium on Secure and Private Execution Environment Design (SEED) 2022.
@inproceedings{akram2022sok,
title={SoK: Limitations of Confidential Computing via TEEs for High-Performance Compute Systems},
author={Akram, Ayaz and Akella, Venkatesh and Peisert, Sean and Lowe-Power, Jason},
booktitle={2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED)},
pages={121--132},
year={2022},
organization={IEEE}
}
Ayaz Akram, Maryam Babaie, Wendy Elsasser, Jason Lowe-Power. The 4th gem5 Users’ Workshop associated with ISCA 2022.
@inproceedings{akram2022hbm2,
title={Modeling HBM2 Memory Controller},
author={Akram, Ayaz and Babaie, Maryam and Elsasser, Wendy and Lowe-Power, Jason},
booktitle={The 4th gem5 Users’ Workshop associated with ISCA 2022},
year={2022}
}
As emerging workloads exhibit irregular memory access patterns with poor data reuse and locality, they would benefit from a DRAM that achieves low latency without sacrificing bandwidth and energy efficiency. We propose LLM (Low Latency Memory), a codesign of the DRAM microarchitecture, the memory controller, and the LLC/DRAM interconnect that leverages embedded silicon photonics in a 2.5D/3D integrated system on chip. LLM relies on Wavelength Division Multiplexing (WDM)-based photonic interconnects to reduce contention throughout the memory subsystem. LLM also increases bank-level parallelism, eliminates bus conflicts by using dedicated optical data paths, and reduces the access energy per bit with shorter global bitlines and smaller row buffers. We evaluate the design space of LLM for a variety of synthetic benchmarks and representative graph workloads on a full-system simulator (gem5). LLM exhibits low memory access latency for traffic with both regular and irregular access patterns. For irregular traffic, LLM achieves high bandwidth utilization (over 80% of peak throughput, compared to 20% for HBM2.0). For real workloads, LLM achieves 3× and 1.8× lower execution time compared to HBM2.0 and a state-of-the-art memory system with high memory-level parallelism, respectively. This study also demonstrates that, by reducing queuing on the data path, LLM achieves on average 3.4× lower memory latency variation compared to HBM2.0.
@inproceedings{10.1007/978-3-031-07312-0_3,
author = {Fariborz, Marjan and Samani, Mahyar and Fotouhi, Pouya and Proietti, Roberto and Yi, Il-Min and Akella, Venkatesh and Lowe-Power, Jason and Palermo, Samuel and Yoo, S. J. Ben},
title = {LLM: Realizing Low-Latency Memory By Exploiting Embedded Silicon Photonics For Irregular Workloads},
year = {2022},
isbn = {978-3-031-07311-3},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-031-07312-0_3},
doi = {10.1007/978-3-031-07312-0_3},
booktitle = {High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings},
pages = {44–64},
numpages = {21},
location = {Hamburg, Germany}
}
We propose a new architecture called HTA for high-throughput irregular HPC applications with little data reuse. HTA reduces contention within the memory system with the help of a partitioned memory controller that is amenable to 2.5D implementation using silicon photonics. In terms of scalability, HTA supports 4× more compute units than state-of-the-art GPU systems. Our simulation-based evaluation on a representative set of HPC benchmarks shows that the proposed design reduces queuing latency by 10% to 30% and reduces the variability in memory access latency by 10% to 60%. Our results show that HTA reduces the L1 miss penalty by 2.3× to 5× compared to GPUs. When compared to a multi-GPU system with the same number of compute units, our simulation results show that HTA can provide up to 2× speedup.
@inproceedings{10.1007/978-3-030-78713-4_10,
author = {Fotouhi, Pouya and Fariborz, Marjan and Proietti, Roberto and Lowe-Power, Jason and Akella, Venkatesh and Yoo, S. J. Ben},
title = {HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads},
year = {2021},
isbn = {978-3-030-78712-7},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-030-78713-4_10},
doi = {10.1007/978-3-030-78713-4_10},
booktitle = {High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021, Proceedings},
pages = {176–194},
numpages = {19}
}