We propose a new data structure called CachedEmbeddings for training large-scale deep learning recommendation models (DLRM) efficiently on heterogeneous (DRAM + non-volatile) memory platforms. CachedEmbeddings implements an implicit software-managed cache and data-movement optimizations, integrated with the Julia programming framework, to optimize large-scale DLRM implementations with multiple sparse embedding table operations. In particular, we show an implementation that is 1.4X to 2X better than the best-known Intel CPU-based implementations on state-of-the-art DLRM benchmarks on a real heterogeneous memory platform from Intel, and a 1.32X to 1.45X improvement over Intel's 2LM implementation, which treats the DRAM as a hardware-managed cache.
@inproceedings{10.1007/978-3-031-32041-5_3,
author = {Hildebrand, Mark and Lowe-Power, Jason and Akella, Venkatesh},
title = {Efficient Large Scale DLRM Implementation On Heterogeneous Memory Systems},
year = {2023},
isbn = {978-3-031-32040-8},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-031-32041-5_3},
doi = {10.1007/978-3-031-32041-5_3},
booktitle = {High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings},
pages = {42–61},
numpages = {20},
location = {Hamburg, Germany}
}
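To make the CachedEmbeddings idea above concrete, the Python sketch below shows a software-managed cache in which hot embedding rows are staged into a small buffer held in fast memory and evicted rows are written back to the slow memory holding the full table. The paper's actual implementation is in Julia and is integrated with the Julia framework, so the class and method names here are purely illustrative.

import numpy as np

class CachedEmbeddingTable:
    # Illustrative software-managed cache: hot embedding rows are copied into a
    # small buffer that would live in fast memory (DRAM); all other rows stay in
    # the slow memory (e.g., non-volatile memory) where the full table resides.
    def __init__(self, slow_table, cache_rows):
        self.slow = slow_table
        self.cache = np.empty((cache_rows, slow_table.shape[1]), dtype=slow_table.dtype)
        self.row_of_slot = [-1] * cache_rows   # which table row each cache slot holds
        self.slot_of_row = {}                  # table row -> cache slot
        self.next_slot = 0                     # simple round-robin eviction pointer

    def lookup(self, row):
        slot = self.slot_of_row.get(row)
        if slot is None:                       # miss: stage the row into the cache
            slot = self.next_slot
            self.next_slot = (self.next_slot + 1) % len(self.row_of_slot)
            victim = self.row_of_slot[slot]
            if victim >= 0:                    # write the evicted row back to slow memory
                self.slow[victim] = self.cache[slot]
                del self.slot_of_row[victim]
            self.cache[slot] = self.slow[row]
            self.row_of_slot[slot] = row
            self.slot_of_row[row] = slot
        return self.cache[slot]

    def embedding_bag(self, indices):
        # Sparse embedding lookup-and-reduce, served from the fast-memory cache.
        return sum(self.lookup(i) for i in indices)

For example, CachedEmbeddingTable(np.zeros((10**6, 64), dtype=np.float32), cache_rows=4096).embedding_bag([3, 17, 99]) touches slow memory only on the first access to each row; the paper's implementation adds further data-movement optimizations on top of this basic caching behavior.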
The RowHammer vulnerability affects modern memory devices: it allows an attacker to cause bit flips in a cell without accessing that cell. RowHammer effects are expected to worsen in future memory technologies due to scaling, so we need to invest in studying and mitigating RowHammer attacks. We therefore propose a model to simulate RowHammer in gem5 and capture its system-level interactions.
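As a rough illustration of what such a model has to track, the Python sketch below counts row activations between refreshes and probabilistically flips bits in neighboring rows once an aggressor row is hammered past a threshold. The threshold, blast radius, and flip probability are made-up illustrative parameters, not values from the gem5 model.

import random

class RowHammerModel:
    # Toy model: count ACT commands per row within a refresh window and flip
    # bits in physically adjacent (victim) rows once an aggressor row crosses
    # the hammer threshold.
    def __init__(self, rows_per_bank, threshold=50_000, flip_prob=1e-3):
        self.rows_per_bank = rows_per_bank
        self.threshold = threshold        # activations before neighbors become vulnerable
        self.flip_prob = flip_prob        # chance of a flip per excess activation
        self.act_count = {}               # (bank, row) -> activations since last refresh
        self.flips = []                   # recorded (bank, victim_row) bit-flip events

    def activate(self, bank, row):
        key = (bank, row)
        self.act_count[key] = self.act_count.get(key, 0) + 1
        if self.act_count[key] > self.threshold:
            for victim in (row - 1, row + 1):          # single-row blast radius
                if 0 <= victim < self.rows_per_bank and random.random() < self.flip_prob:
                    self.flips.append((bank, victim))  # a real model would corrupt the data

    def refresh(self):
        self.act_count.clear()            # refresh restores the charge, reset the counters

A simulator integration would call activate() for every ACT command issued by the memory controller and refresh() when rows are refreshed, corrupting the simulated memory contents rather than merely recording the event, which is what makes the system-level interactions visible.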
Architecture simulators are a common and powerful tool in computer architecture research, and it is important that the results they report are trustworthy. gem5 is a well-known architectural simulator used by academia and industry. We present our methodology and tools for evaluating gem5's memory subsystem components and the results of our validation of gem5's current memory system components. We have validated the accuracy of the DDR models in gem5 and report a significant difference between gem5 and our reference for the HBM models. In addition, we have validated the functional correctness and accuracy of the cache models in gem5. Lastly, we observe a 10% difference between gem5 and real hardware in our random access benchmark.
The increasing growth of applications' memory demands has led CPU vendors to deploy diverse memory technologies, either within the same package as heterogeneous memory systems or in disaggregated form through local or remote memory nodes. As these new memory technologies emerge, conventional memory management should be reconsidered to better meet applications' memory requirements. However, no suitable model is available to the community to accurately study these new systems. In this work we describe our contribution toward a cycle-level analysis model of heterogeneous memories in the gem5 simulator. We believe this work enables the community to perform design space exploration for the next generation of memory systems.
HPC systems will employ various memory technologies to meet applications' demands. The ability to model heterogeneous systems (from both compute and memory perspectives) makes gem5 a highly suitable tool for evaluating future HPC systems. In this work, we discuss new contributions to gem5 that extend its modeling support for heterogeneous memory systems. Specifically, we show the new DRAM cache model, the improved HBM model in gem5, and the refactoring of its memory controller models.
The increasing growth of applications' memory demands has led CPU vendors to deploy large DRAM caches backed by large non-volatile memories such as Intel Optane (e.g., in Intel's Cascade Lake). Previous work has explored many aspects of DRAM cache design in simulation, such as the caching granularity and DRAM cache tag placement, to improve performance. However, these works do not provide an open-source DRAM cache modeling platform for detailed micro-architectural and timing analysis. In this presentation we describe a cycle-level unified DRAM cache and main memory controller (UDCC) for gem5. The protocol is inspired by actual hardware that provides a DRAM cache, such as Intel's Cascade Lake, in which a DRAM cache is backed by NVRAM as the off-chip main memory sharing the same bus. We leverage the cycle-level DRAM and NVRAM models in gem5.
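As a rough sketch of the request flow such a unified controller has to arbitrate, the Python below models a direct-mapped, write-back DRAM cache in front of NVRAM at cache-line granularity. It is purely conceptual and does not reflect gem5's actual classes, scheduling, or timing.

class UnifiedDramCacheController:
    # Conceptual hit/miss flow for a direct-mapped, write-back DRAM cache in
    # front of NVRAM, with both devices driven by a single controller.
    LINE = 64                                    # cache-line granularity in bytes

    def __init__(self, cache_lines):
        self.cache_lines = cache_lines
        self.tags = {}                           # set index -> (tag, dirty)

    def _index_tag(self, addr):
        line = addr // self.LINE
        return line % self.cache_lines, line // self.cache_lines

    def _evict_if_dirty(self, idx):
        entry = self.tags.get(idx)
        if entry and entry[1]:
            pass                                 # stand-in for the NVRAM write-back commands

    def read(self, addr):
        idx, tag = self._index_tag(addr)
        entry = self.tags.get(idx)
        if entry and entry[0] == tag:
            return "DRAM cache read hit"         # served by one DRAM access, NVRAM untouched
        self._evict_if_dirty(idx)                # miss: write back a dirty victim, if any,
        self.tags[idx] = (tag, False)            # then fill the line from NVRAM into DRAM
        return "DRAM cache read miss, filled from NVRAM"

    def write(self, addr):
        idx, tag = self._index_tag(addr)
        entry = self.tags.get(idx)
        if not (entry and entry[0] == tag):      # write-allocate on a miss
            self._evict_if_dirty(idx)
        self.tags[idx] = (tag, True)             # the line is now dirty in the DRAM cache
        return "write absorbed by DRAM cache"

Even in this toy form it is clear why a unified controller matters: each miss turns one CPU request into a sequence of DRAM and NVRAM commands that compete for the same shared bus, which is the scheduling behavior the cycle-level UDCC model is meant to expose.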
]]>Ayaz Akram, Venkatesh Akella, Sean Peisert, Jason Lowe-Power. IEEE International Symposium on Secure and Private Execution Environment Design (SEED) 2022.
@inproceedings{akram2022sok,
title={SoK: Limitations of Confidential Computing via TEEs for High-Performance Compute Systems},
author={Akram, Ayaz and Akella, Venkatesh and Peisert, Sean and Lowe-Power, Jason},
booktitle={2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED)},
pages={121--132},
year={2022},
organization={IEEE}
}
Ayaz Akram, Maryam Babaie, Wendy Elsasser, Jason Lowe-Power. The 4th gem5 Users’ Workshop associated with ISCA 2022.
@inproceedings{akram2022hbm2,
title={Modeling HBM2 Memory Controller},
author={Akram, Ayaz and Babaie, Maryam and Elsasser, Wendy and Lowe-Power, Jason},
booktitle={The 4th gem5 Users’ Workshop associated with ISCA 2022},
year={2022}
}
As emerging workloads exhibit irregular memory access patterns with poor data reuse and locality, they would benefit from a DRAM that achieves low latency without sacrificing bandwidth and energy efficiency. We propose LLM (Low Latency Memory), a codesign of the DRAM microarchitecture, the memory controller, and the LLC/DRAM interconnect that leverages embedded silicon photonics in a 2.5D/3D integrated system on chip. LLM relies on Wavelength Division Multiplexing (WDM)-based photonic interconnects to reduce contention throughout the memory subsystem. LLM also increases bank-level parallelism, eliminates bus conflicts by using dedicated optical data paths, and reduces the access energy per bit with shorter global bitlines and smaller row buffers. We evaluate the design space of LLM for a variety of synthetic benchmarks and representative graph workloads on a full-system simulator (gem5). LLM exhibits low memory access latency for traffic with both regular and irregular access patterns. For irregular traffic, LLM achieves high bandwidth utilization (over 80% of peak throughput, compared to 20% for HBM2.0). For real workloads, LLM achieves 3× and 1.8× lower execution time compared to HBM2.0 and a state-of-the-art memory system with high memory-level parallelism, respectively. This study also demonstrates that, by reducing queuing on the data path, LLM achieves on average 3.4× lower memory latency variation compared to HBM2.0.
@inproceedings{10.1007/978-3-031-07312-0_3,
author = {Fariborz, Marjan and Samani, Mahyar and Fotouhi, Pouya and Proietti, Roberto and Yi, Il-Min and Akella, Venkatesh and Lowe-Power, Jason and Palermo, Samuel and Yoo, S. J. Ben},
title = {LLM: Realizing Low-Latency Memory By Exploiting Embedded Silicon Photonics For Irregular Workloads},
year = {2022},
isbn = {978-3-031-07311-3},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-031-07312-0_3},
doi = {10.1007/978-3-031-07312-0_3},
booktitle = {High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings},
pages = {44–64},
numpages = {21},
location = {Hamburg, Germany}
}
We propose a new architecture called HTA for high-throughput irregular HPC applications with little data reuse. HTA reduces contention within the memory system with the help of a partitioned memory controller that is amenable to 2.5D implementation using silicon photonics. In terms of scalability, HTA supports 4× more compute units than state-of-the-art GPU systems. Our simulation-based evaluation on a representative set of HPC benchmarks shows that the proposed design reduces queuing latency by 10% to 30% and reduces the variability in memory access latency by 10% to 60%. Our results show that HTA reduces the L1 miss penalty by 2.3× to 5× compared to GPUs. When compared to a multi-GPU system with the same number of compute units, our simulation results show that HTA can provide up to 2× speedup.
@inproceedings{10.1007/978-3-030-78713-4_10,
author = {Fotouhi, Pouya and Fariborz, Marjan and Proietti, Roberto and Lowe-Power, Jason and Akella, Venkatesh and Yoo, S. J. Ben},
title = {HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads},
year = {2021},
isbn = {978-3-030-78712-7},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-030-78713-4_10},
doi = {10.1007/978-3-030-78713-4_10},
booktitle = {High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021, Proceedings},
pages = {176–194},
numpages = {19}
}