<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://arch.cs.ucdavis.edu/feed.xml" rel="self" type="application/atom+xml" /><link href="https://arch.cs.ucdavis.edu/" rel="alternate" type="text/html" /><updated>2026-05-28T21:57:16-07:00</updated><id>https://arch.cs.ucdavis.edu/feed.xml</id><title type="html">UC Davis Computer Architecture</title><subtitle>Computer Architecture Research Group at UC Davis ECE/CS</subtitle><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><entry><title type="html">CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST</title><link href="https://arch.cs.ucdavis.edu/simulation/memory/cxl/2026/05/26/cxl-clustersim.html" rel="alternate" type="text/html" title="CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST" /><published>2026-05-26T00:00:00-07:00</published><updated>2026-05-26T00:00:00-07:00</updated><id>https://arch.cs.ucdavis.edu/simulation/memory/cxl/2026/05/26/cxl-clustersim</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/simulation/memory/cxl/2026/05/26/cxl-clustersim.html"><![CDATA[<p><a href="https://arxiv.org/abs/2605.27745" class="btn btn--primary btn--large">ArXiv</a>
<a href="https://github.com/darchr/cxl-clustersim/" class="btn btn--primary btn--large">Repository</a></p>

<p>Large-scale AI training and inference require hundreds of
gigabytes to terabytes of DRAM with high peak-to-average
utilization ratios, resulting in overprovisioning. In cloud
computing, DRAM constitutes a significant share of the cost.
Yet, as shown by recent articles, DRAM is heavily
underutilized. Memory disaggregation is a solution to both of these
problems. With the advent of the CXL protocol, there is
renewed interest in designing and optimizing computing
systems with disaggregated memory. However, at present,
there are limited simulation tools available for exploring the
design space and evaluating the performance tradeoffs in
computer systems with disaggregated memory.</p>

<p>In this paper, we propose CXL-ClusterSim, a full-system
modeling and simulation framework that combines the gem5
simulator, for fidelity, with the Structural Simulation Toolkit
(SST), for parallel simulation. We outline the challenges
in creating this simulation infrastructure and present a
design that is scalable, flexible, and reasonably fast, helping
computer architects explore the design space of CXL-based
disaggregated memory and identify new opportunities for
hardware/software codesign and performance optimization.</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">goswami2026cxlclustersimmodelingcxlbaseddisaggregated</span><span class="p">,</span>
      <span class="na">title</span><span class="p">=</span><span class="s">{CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST}</span><span class="p">,</span> 
      <span class="na">author</span><span class="p">=</span><span class="s">{Kaustav Goswami and Maryam Babaie and Hoa Nguyen and Venkatesh Akella and Jason Lowe-Power}</span><span class="p">,</span>
      <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span>
      <span class="na">eprint</span><span class="p">=</span><span class="s">{2605.27745}</span><span class="p">,</span>
      <span class="na">archivePrefix</span><span class="p">=</span><span class="s">{arXiv}</span><span class="p">,</span>
      <span class="na">primaryClass</span><span class="p">=</span><span class="s">{cs.AR}</span><span class="p">,</span>
      <span class="na">url</span><span class="p">=</span><span class="s">{https://arxiv.org/abs/2605.27745}</span><span class="p">,</span> 
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;simulation&quot;, &quot;memory&quot;, &quot;CXL&quot;]" /><summary type="html"><![CDATA[Kaustav Goswami, Maryam Babaie, Hoa Nguyen, Venkatesh Akella, Jason Lowe-Power]]></summary></entry><entry><title type="html">Toward Reproducible and Standardized Computer Architecture Simulation with gem5</title><link href="https://arch.cs.ucdavis.edu/simulation/2026/05/26/gem5-resources.html" rel="alternate" type="text/html" title="Toward Reproducible and Standardized Computer Architecture Simulation with gem5" /><published>2026-05-26T00:00:00-07:00</published><updated>2026-05-26T00:00:00-07:00</updated><id>https://arch.cs.ucdavis.edu/simulation/2026/05/26/gem5-resources</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/simulation/2026/05/26/gem5-resources.html"><![CDATA[<p><a href="https://ieeexplore.ieee.org/document/11527308" class="btn btn--primary btn--large">Paper</a>
<a href="https://arxiv.org/abs/2512.13479" class="btn btn--primary btn--large">arXiv</a></p>

<p>Reproducibility in simulation-based computer architecture research requires coordinating artifacts like disk images, kernels, and benchmarks, but existing workflows are inconsistent. We improve gem5, an open-source simulator with over 1600 forks, and gem5 Resources, a centralized repository of over 2000 pre-packaged artifacts, to address these issues. While gem5 Resources enables artifact sharing, researchers still face challenges. Creating custom disk images is complex and time-consuming, with no standardized process across ISAs, making it difficult to extend and share images. gem5 provides limited guest-host communication features through a set of predefined exit events that restrict researchers’ ability to dynamically control and monitor simulations. Lastly, running simulations with multiple workloads requires researchers to write custom external scripts to coordinate multiple gem5 simulations, which creates error-prone and hard-to-reproduce workflows. To overcome these challenges, we introduce several features in gem5 and gem5 Resources. We standardize disk-image creation across x86, ARM, and RISC-V using Packer, and provide validated base images with pre-annotated benchmark suites (NPB, GAPBS). We provide 12 new disk images, 6 new kernels, and over 200 workloads across three ISAs. We refactor the exit event system to a class-based model and introduce hypercalls for enhanced guest-host communication that allows researchers to define custom behavior for their exit events. We also provide a utility to remotely monitor simulations and the gem5-bridge driver for user-space m5 operations. Additionally, we implement Suites and MultiSim to enable parallel full-system simulations from gem5 configuration scripts, eliminating the need for external scripting. These features reduce setup complexity and provide extensible, validated resources that improve reproducibility and standardization.</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@INPROCEEDINGS</span><span class="p">{</span><span class="nl">11527308</span><span class="p">,</span>
  <span class="na">author</span><span class="p">=</span><span class="s">{Pai, Kunal and Patel, Harshil and Le, Erin and Krim, Noah and Samani, Mahyar and Bruce, Bobby R. and Lowe-Power, Jason}</span><span class="p">,</span>
  <span class="na">booktitle</span><span class="p">=</span><span class="s">{2026 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}</span><span class="p">,</span> 
  <span class="na">title</span><span class="p">=</span><span class="s">{Toward Reproducible and Standardized Computer Architecture Simulation with gem5}</span><span class="p">,</span> 
  <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span>
  <span class="na">volume</span><span class="p">=</span><span class="s">{}</span><span class="p">,</span>
  <span class="na">number</span><span class="p">=</span><span class="s">{}</span><span class="p">,</span>
  <span class="na">pages</span><span class="p">=</span><span class="s">{184-196}</span><span class="p">,</span>
  <span class="na">keywords</span><span class="p">=</span><span class="s">{Simulation;Timing;Radio access networks;Regional area networks;Kernel;Arm;Computer architecture;Printing;Testing;Booting;gem5;computer architecture;reproducibility}</span><span class="p">,</span>
  <span class="na">doi</span><span class="p">=</span><span class="s">{10.1109/ISPASS69572.2026.00027}</span><span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;simulation&quot;]" /><summary type="html"><![CDATA[Kunal Pai, Harshil Patel, Erin Le, Noah Krim, Mahyar Samani, Bobby R. Bruce, Jason Lowe-Power]]></summary></entry><entry><title type="html">Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory</title><link href="https://arch.cs.ucdavis.edu/security/memory/cxl/2026/03/06/space-control.html" rel="alternate" type="text/html" title="Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory" /><published>2026-03-06T00:00:00-08:00</published><updated>2026-03-06T00:00:00-08:00</updated><id>https://arch.cs.ucdavis.edu/security/memory/cxl/2026/03/06/space-control</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/security/memory/cxl/2026/03/06/space-control.html"><![CDATA[<p><a href="https://arxiv.org/abs/2603.06951" class="btn btn--primary btn--large">ArXiv</a></p>

<p>Memory disaggregation via Compute Express Link (CXL) enables multiple hosts to share remote memory, improving utilization for data-intensive workloads.
Today, virtual memory enables process-level isolation on a host and CXL enables host-level isolation.
This creates a critical security gap: the absence of process-level memory isolation in shared disaggregated memory.</p>

<p>We present Space-Control, a hardware-software co-design that provides fine-grained, process-level isolation for shared disaggregated memory.
Space-Control authenticates the execution context in hardware, enforces access control on every memory access, and amortizes lookup times with a small cache.
Our design allows up to 127 processes running concurrently on 255 hosts to share memory with only 1.56% storage overhead.
In a gem5 + Structural Simulation Toolkit (SST) based CXL model, Space-Control incurs minimal performance overhead of 3.3%, making shared disaggregated memory isolation practical.</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">goswami2026spacecontrolprocesslevelisolationsharing</span><span class="p">,</span>
      <span class="na">title</span><span class="p">=</span><span class="s">{Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory}</span><span class="p">,</span> 
      <span class="na">author</span><span class="p">=</span><span class="s">{Kaustav Goswami and Sean Peisert and Venkatesh Akella and Jason Lowe-Power}</span><span class="p">,</span>
      <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span>
      <span class="na">eprint</span><span class="p">=</span><span class="s">{2603.06951}</span><span class="p">,</span>
      <span class="na">archivePrefix</span><span class="p">=</span><span class="s">{arXiv}</span><span class="p">,</span>
      <span class="na">primaryClass</span><span class="p">=</span><span class="s">{cs.AR}</span><span class="p">,</span>
      <span class="na">url</span><span class="p">=</span><span class="s">{https://arxiv.org/abs/2603.06951}</span><span class="p">,</span> 
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;security&quot;, &quot;memory&quot;, &quot;CXL&quot;]" /><summary type="html"><![CDATA[Kaustav Goswami, Sean Peisert, Venkatesh Akella, Jason Lowe-Power]]></summary></entry><entry><title type="html">Implications of Full-System Modeling for Superconducting Architectures</title><link href="https://arch.cs.ucdavis.edu/simulation/2025/11/16/superconducting-architectures.html" rel="alternate" type="text/html" title="Implications of Full-System Modeling for Superconducting Architectures" /><published>2025-11-16T00:00:00-08:00</published><updated>2025-11-16T00:00:00-08:00</updated><id>https://arch.cs.ucdavis.edu/simulation/2025/11/16/superconducting-architectures</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/simulation/2025/11/16/superconducting-architectures.html"><![CDATA[<p><a href="https://dl.acm.org/doi/full/10.1145/3731599.3769278" class="btn btn--primary btn--large">Paper</a></p>

<p>As Moore’s Law slows, superconducting electronics offer ultra-low-power, high-speed computation potential. This paper presents the first full-system superconducting architecture modeling in gem5, evaluating superconducting cores, caches, and interconnects under realistic workloads. We extend gem5 with cryogenic semiconductor (4 GHz) and superconducting (100 GHz) RISC-V cores and multi-level caches, evaluating RISC-V benchmarks and SPEC CPU2006 applications. We also integrate SRNoC, a superconducting interconnect, with the NOVA graph accelerator.
Results show superconducting cores and caches achieve up to 24× speedup for compute-intensive workloads, but memory-intensive applications are bottlenecked by room-temperature DRAM (1.2× improvement). High cache bandwidth requirements (800 GB/s) present design challenges. SRNoC provides 35-73× energy efficiency gains for narrow data paths but 1246× slowdown for wide data communication. Therefore, superconducting technology suits domain-specific accelerators better than general-purpose computing, with performance dependent on workload memory access patterns and data widths.</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">10.1145/3731599.3769278</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Pai, Kunal and Samani, Mahyar and Nand, Anusheel and Lowe-Power, Jason}</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">{Implications of Full-System Modeling for Superconducting Architectures}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2025}</span><span class="p">,</span>
  <span class="na">isbn</span> <span class="p">=</span> <span class="s">{9798400718717}</span><span class="p">,</span>
  <span class="na">publisher</span> <span class="p">=</span> <span class="s">{Association for Computing Machinery}</span><span class="p">,</span>
  <span class="na">address</span> <span class="p">=</span> <span class="s">{New York, NY, USA}</span><span class="p">,</span>
  <span class="na">url</span> <span class="p">=</span> <span class="s">{https://doi.org/10.1145/3731599.3769278}</span><span class="p">,</span>
  <span class="na">doi</span> <span class="p">=</span> <span class="s">{10.1145/3731599.3769278}</span><span class="p">,</span>
  <span class="na">abstract</span> <span class="p">=</span> <span class="s">{As Moore's Law slows, superconducting electronics offer ultra-low-power, high-speed computation potential. This paper presents the first full-system superconducting architecture modeling in gem5, evaluating superconducting cores, caches, and interconnects under realistic workloads. We extend gem5 with cryogenic semiconductor (4 GHz) and superconducting (100 GHz) RISC-V cores and multi-level caches, evaluating RISC-V benchmarks and SPEC CPU2006 applications. We also integrate SRNoC, a superconducting interconnect, with the NOVA graph accelerator. Results show superconducting cores and caches achieve up to 24 × speedup for compute-intensive workloads, but memory-intensive applications are bottlenecked by room-temperature DRAM (1.2 × improvement). High cache bandwidth requirements (800 GB/s) present design challenges. SRNoC provides 35-73 × energy efficiency gains for narrow data paths but 1246 × slowdown for wide data communication. Therefore, superconducting technology suits domain-specific accelerators better than general-purpose computing, with performance dependent on workload memory access patterns and data widths.}</span><span class="p">,</span>
  <span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis}</span><span class="p">,</span>
  <span class="na">pages</span> <span class="p">=</span> <span class="s">{1484–1490}</span><span class="p">,</span>
  <span class="na">numpages</span> <span class="p">=</span> <span class="s">{7}</span><span class="p">,</span>
  <span class="na">keywords</span> <span class="p">=</span> <span class="s">{superconducting electronics, cryogenic computing, gem5, full-system}</span><span class="p">,</span>
  <span class="na">series</span> <span class="p">=</span> <span class="s">{SC Workshops '25}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;simulation&quot;]" /><summary type="html"><![CDATA[Kunal Pai, Mahyar Samani, Anusheel Nand, Jason Lowe-Power]]></summary></entry><entry><title type="html">NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing</title><link href="https://arch.cs.ucdavis.edu/accelerator/2025/03/03/nova.html" rel="alternate" type="text/html" title="NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing" /><published>2025-03-03T00:00:00-08:00</published><updated>2025-03-03T00:00:00-08:00</updated><id>https://arch.cs.ucdavis.edu/accelerator/2025/03/03/nova</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/accelerator/2025/03/03/nova.html"><![CDATA[<p><a href="/assets/papers/NOVA-HPCA-2025.pdf" class="btn btn--primary btn--large">Local Download</a></p>

<p>We propose a scalable graph processing hardware accelerator called NOVA that is based on a novel vertex management architecture that decouples the execution of reduction and propagation operations in the popular vertex-centric graph processing paradigm.
This allows us to store the working set in off-chip memory and utilize the available on-chip memory as a buffer to hide the latency of DRAM accesses instead of a traditional cache.
This overcomes one of the key drawbacks of almost all prior works which require temporal partitioning of graphs to scale to large graphs.
We develop a cycle-accurate model of the architecture in gem5 and demonstrate that NOVA exhibits near-perfect weak and strong scaling while scaling to large graphs by spatially tiling multiple nodes.
In addition, our simulations show that NOVA is 2.35x better than a state-of-the-art graph accelerator (PolyGraph) while using a fraction of the on-chip memory on a synthetic graph with 134M vertices and over 2.14B edges.</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">fariborz2025nova</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{Marjan Fariborz and Mahyar Samani and Austin York and SJ Ben Yoo and Jason Lowe-Power and Venkatesh Akella}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2025}</span><span class="p">,</span>
  <span class="na">url</span>          <span class="p">=</span> <span class="s">{https://doi.org/10.1109/HPCA61900.2025.00072}</span><span class="p">,</span>
  <span class="na">doi</span>          <span class="p">=</span> <span class="s">{10.1109/HPCA61900.2025.00072}</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;accelerator&quot;]" /><summary type="html"><![CDATA[Marjan Fariborz, Mahyar Samani, Austin York, SJ Ben Yoo, Jason Lowe-Power, Venkatesh Akella]]></summary></entry><entry><title type="html">TDRAM: Tag-enhanced DRAM for Efficient Caching</title><link href="https://arch.cs.ucdavis.edu/memory/2025/02/27/tdram.html" rel="alternate" type="text/html" title="TDRAM: Tag-enhanced DRAM for Efficient Caching" /><published>2025-02-27T00:00:00-08:00</published><updated>2025-02-27T00:00:00-08:00</updated><id>https://arch.cs.ucdavis.edu/memory/2025/02/27/tdram</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/memory/2025/02/27/tdram.html"><![CDATA[<p><a href="/assets/papers/TDRAM-HPCA-2025.pdf" class="btn btn--primary btn--large">Local Download</a></p>

<p>As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demands. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances existing DRAM, such as HBM3, by adding small, low-latency mats to store tags and metadata on the same die as the data mats. These mats enable tag and data access in lockstep, in-DRAM tag comparison, and conditional data response based on the comparison result (reducing wasted data transfers), akin to SRAM cache mechanisms. TDRAM further optimizes hit and miss latencies through opportunistic early tag probing. Moreover, TDRAM introduces
a flush buffer to store conflicting dirty data on write misses, eliminating data bus turnaround delays on write demands. We
evaluate TDRAM in a full-system simulation using a set of HPC workloads with large memory footprints, showing that TDRAM,
on average, provides 2.65× faster tag checks, 1.23× speedup, and 21% less energy consumption compared to state-of-the-art commercial and research designs.</p>

<h2 id="citation">Citation</h2>

<p>Maryam Babaie, Ayaz Akram, Wendy Elsasser, Brent Haukness, Michael R. Miller, Taeksang Song, Thomas Vogelsang, Steven C. Woo, and Jason Lowe-Power, “TDRAM: Tag-enhanced DRAM for Efficient Caching,” arXiv preprint arXiv:2404.14617, 2024. doi: 10.48550/arXiv.2404.14617.</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">babaie2024tdram</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{Maryam Babaie and Ayaz Akram and Wendy Elsasser and Brent Haukness and Michael R. Miller and Taeksang Song and Thomas Vogelsang and Steven C. Woo and Jason Lowe-Power}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{TDRAM: Tag-enhanced DRAM for Efficient Caching}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
  <span class="na">url</span>          <span class="p">=</span> <span class="s">{https://doi.org/10.48550/arXiv.2404.14617}</span><span class="p">,</span>
  <span class="na">doi</span>          <span class="p">=</span> <span class="s">{10.48550/arXiv.2404.14617}</span><span class="p">,</span>
  <span class="na">eprint</span><span class="p">=</span><span class="s">{2404.14617}</span><span class="p">,</span>
  <span class="na">archivePrefix</span><span class="p">=</span><span class="s">{arXiv}</span><span class="p">,</span>
  <span class="na">primaryClass</span><span class="p">=</span><span class="s">{cs.AR}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;memory&quot;]" /><summary type="html"><![CDATA[Maryam Babaie, Ayaz Akram, Wendy Elsasser, Brent Haukness, Michael R. Miller, Taeksang Song, Thomas Vogelsang, Steven C. Woo, Jason Lowe-Power]]></summary></entry><entry><title type="html">Potential and Limitation of High-Frequency Cores and Caches</title><link href="https://arch.cs.ucdavis.edu/simulation/2024/08/06/potentiallimitationhighfreqcorescaches.html" rel="alternate" type="text/html" title="Potential and Limitation of High-Frequency Cores and Caches" /><published>2024-08-06T00:00:00-07:00</published><updated>2024-08-06T00:00:00-07:00</updated><id>https://arch.cs.ucdavis.edu/simulation/2024/08/06/potentiallimitationhighfreqcorescaches</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/simulation/2024/08/06/potentiallimitationhighfreqcorescaches.html"><![CDATA[<p><a href="/assets/papers/arxiv24-potentialhighfreq.pdf" class="btn btn--primary btn--large">Local Download</a>
<a href="https://arxiv.org/abs/2408.03308" class="btn btn--primary btn--large">arXiv Link</a>
<a href="/assets/papers/modsim2024-potentialhighfreq-poster.pdf" class="btn btn--primary btn--large">Poster Download</a>
<a href="/assets/papers/modsim2024-potentialhighfreq-presentation.pdf" class="btn btn--primary btn--large">Presentation Download</a></p>

<p>This paper explores the potential of cryogenic semiconductor computing and superconductor electronics as promising alternatives to traditional semiconductor devices. As semiconductor devices face challenges such as increased leakage currents and reduced performance at higher temperatures, these novel technologies offer high-performance, low-power computation. Conventional semiconductor electronics operating at cryogenic temperatures (below -150°C or 123.15 K) can benefit from reduced leakage currents and improved electron mobility. On the other hand, superconductor electronics, operating below 10 K, allow electrons to flow without resistance, offering the potential for ultra-low-power, high-speed computation. This study presents comprehensive performance modeling and analysis of these technologies and provides insights into their potential benefits and limitations. We implement models of in-order and out-of-order cores operating at high clock frequencies associated with superconductor electronics and cryogenic semiconductor computing in gem5. We evaluate the performance of these components using workloads representative of real-world applications like NPB, SPEC CPU2006, and GAPBS. Our results show the potential speedups achievable by these components and the limitations posed by cache bandwidth. This work provides valuable insights into the performance implications and design trade-offs associated with cryogenic and superconductor technologies, laying the foundation for future research in this field using gem5.</p>

<h2 id="citation">Citation</h2>

<p>Kunal Pai, Anusheel Nand, and Jason Lowe-Power, “Potential and Limitation of High-Frequency Cores and Caches,” arXiv preprint arXiv:2408.03308, 2024. doi: 10.48550/arXiv.2408.03308.</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">pai2024potentiallimitationhighfrequencycores</span><span class="p">,</span>
      <span class="na">title</span><span class="p">=</span><span class="s">{Potential and Limitation of High-Frequency Cores and Caches}</span><span class="p">,</span>
      <span class="na">author</span><span class="p">=</span><span class="s">{Kunal Pai and Anusheel Nand and Jason Lowe-Power}</span><span class="p">,</span>
      <span class="na">year</span><span class="p">=</span><span class="s">{2024}</span><span class="p">,</span>
      <span class="na">eprint</span><span class="p">=</span><span class="s">{2408.03308}</span><span class="p">,</span>
      <span class="na">archivePrefix</span><span class="p">=</span><span class="s">{arXiv}</span><span class="p">,</span>
      <span class="na">primaryClass</span><span class="p">=</span><span class="s">{cs.AR}</span><span class="p">,</span>
      <span class="na">url</span><span class="p">=</span><span class="s">{https://arxiv.org/abs/2408.03308}</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;simulation&quot;]" /><summary type="html"><![CDATA[Kunal Pai, Anusheel Nand, Jason Lowe-Power. ModSim 2024: Workshop on Modeling & Simulation of Systems and Applications.]]></summary></entry><entry><title type="html">CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems</title><link href="https://arch.cs.ucdavis.edu/memory/2024/05/29/cachedarrays.html" rel="alternate" type="text/html" title="CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems" /><published>2024-05-29T00:00:00-07:00</published><updated>2024-05-29T00:00:00-07:00</updated><id>https://arch.cs.ucdavis.edu/memory/2024/05/29/cachedarrays</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/memory/2024/05/29/cachedarrays.html"><![CDATA[<p><a href="https://doi.org/10.1109/IPDPS57955.2024.00055" class="btn btn--primary btn--large">Paper on IEEEXplore</a>
<a href="/assets/papers/ipdps24-cachedarrays.pdf" class="btn btn--primary btn--large">Local Paper Download</a>
<a href="/assets/papers/ipdps24-cachedarrays-presentation.pdf" class="btn btn--primary btn--large">IPDPS Presentation Download</a>
<a href="https://hmem-workshop.github.io/hmem2023/hmem-sc23/slides-papers/lowe-power-cachedarrays-hmem-workshop-sc23.pdf" class="btn btn--primary btn--large">SC HMEM Workshop Presentation</a>
<a href="https://github.com/darchr/CachedArrays.jl" class="btn btn--primary btn--large">Source Code</a></p>

<p>We propose a new framework called <em>CachedArrays</em>
and a set of APIs to address the data tiering problem in large-scale
heterogeneous and disaggregated memory systems. The proposed
framework operates at a variable-size object granularity and
allows the programmer to specify semantic hints about future
use of data via a Policy API, which are used by a Data Manager
to choose when and where to place a particular data object
using a data management API, thus bridging the semantic gap
between the programmer and the platform-specific hardware
details, and optimizing overall performance. We evaluate the
proposed framework on a real hardware platform with terabytes
of memory consisting of NVRAM and DRAM on large-scale ML
training workloads such as CNNs that exhibit different data access
and usage patterns. We show that <em>CachedArrays</em> outperforms
hardware caches, and can exploit many of the algorithm-specific
optimizations of prior works.</p>

<p><a href="/memory/2023/05/22/cached-embeddings.html">CachedEmbeddings</a> builds on top of <em>CachedArrays</em>.</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">hildebrand24cachedarrays</span><span class="p">,</span>
<span class="na">author</span> <span class="p">=</span> <span class="s">{Hildebrand, Mark and Lowe-Power, Jason and Akella, Venkatesh}</span><span class="p">,</span>
<span class="na">title</span> <span class="p">=</span> <span class="s">{ {CachedArrays}: Optimizing Data Movement for Heterogeneous Memory Systems}</span><span class="p">,</span>
<span class="na">year</span> <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
<span class="na">url</span> <span class="p">=</span> <span class="s">{https://doi.org/10.1109/IPDPS57955.2024.00055}</span><span class="p">,</span>
<span class="na">doi</span> <span class="p">=</span> <span class="s">{10.1109/IPDPS57955.2024.00055}</span><span class="p">,</span>
<span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;memory&quot;]" /><summary type="html"><![CDATA[Mark Hildebrand, Jason Lowe-Power, Venkatesh Akella. IPDPS 2024]]></summary></entry><entry><title type="html">TEGRA – Scaling Up Terascale Graph Processing with Disaggregated Computing</title><link href="https://arch.cs.ucdavis.edu/memory/2024/04/28/tegra.html" rel="alternate" type="text/html" title="TEGRA – Scaling Up Terascale Graph Processing with Disaggregated Computing" /><published>2024-04-28T00:00:00-07:00</published><updated>2024-04-28T00:00:00-07:00</updated><id>https://arch.cs.ucdavis.edu/memory/2024/04/28/tegra</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/memory/2024/04/28/tegra.html"><![CDATA[<p><a href="/assets/papers/arxiv24-tegra.pdf" class="btn btn--primary btn--large">Local Download</a>
<a href="https://arxiv.org/abs/2404.03155" class="btn btn--primary btn--large">arXiv Link</a></p>

<p>Graphs are essential for representing relationships in various domains, driving modern AI applications such as graph analytics and neural networks across science, engineering, cybersecurity, transportation, and economics. However, the size of modern graphs is rapidly expanding, posing challenges for traditional CPUs and GPUs in meeting real-time processing demands. As a result, hardware accelerators for graph processing have been proposed. However, the largest graphs that these systems can handle are still modest, often no larger than the Twitter graph (approximately 1.4B edges). This paper aims to address this limitation by developing a graph accelerator capable of terascale graph processing. Scale-out architectures, in which nodes are replicated to accommodate larger datasets, are a natural fit for handling larger graphs. We argue that this approach is not appropriate for very large-scale graphs because it leads to underutilization of both memory and compute resources. Additionally, vertex and edge processing have different access patterns, and communication overheads pose further challenges in designing scalable architectures. To overcome these issues, this paper proposes TEGRA, a scale-up architecture for terascale graph processing. TEGRA leverages a composable computing system with disaggregated resources and a communication architecture inspired by Active Messages. By employing direct communication between cores and optimizing memory interconnect utilization, TEGRA reduces communication overhead and improves resource utilization, thereby enabling efficient processing of terascale graphs.</p>

<h2 id="citation">Citation</h2>

<p>William Shaddix, Mahyar Samani, Marjan Fariborz, S.J. Ben Yoo, Jason Lowe-Power, and Venkatesh Akella, “TEGRA – Scaling Up Terascale Graph Processing with Disaggregated Computing,” arXiv preprint arXiv:2404.03155, 2024. doi: 10.48550/arXiv.2404.03155.</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">shaddix2024tegra</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{William Shaddix and Mahyar Samani and Marjan Fariborz and S.J. Ben Yoo and Jason Lowe-Power and Venkatesh Akella}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{TEGRA -- Scaling Up Terascale Graph Processing with Disaggregated Computing}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
  <span class="na">url</span>          <span class="p">=</span> <span class="s">{https://doi.org/10.48550/arXiv.2404.03155}</span><span class="p">,</span>
  <span class="na">doi</span>          <span class="p">=</span> <span class="s">{10.48550/arXiv.2404.03155}</span><span class="p">,</span>
  <span class="na">eprint</span><span class="p">=</span><span class="s">{2404.03155}</span><span class="p">,</span>
  <span class="na">archivePrefix</span><span class="p">=</span><span class="s">{arXiv}</span><span class="p">,</span>
  <span class="na">primaryClass</span><span class="p">=</span><span class="s">{cs.AR}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;memory&quot;]" /><summary type="html"><![CDATA[William Shaddix, Mahyar Samani, Marjan Fariborz, S.J. Ben Yoo, Jason Lowe-Power, Venkatesh Akella]]></summary></entry><entry><title type="html">Aragorn: A Privacy-Enhancing System for Mobile Cameras</title><link href="https://arch.cs.ucdavis.edu/security/2024/01/12/aragorn.html" rel="alternate" type="text/html" title="Aragorn: A Privacy-Enhancing System for Mobile Cameras" /><published>2024-01-12T00:00:00-08:00</published><updated>2024-01-12T00:00:00-08:00</updated><id>https://arch.cs.ucdavis.edu/security/2024/01/12/aragorn</id><content type="html" xml:base="https://arch.cs.ucdavis.edu/security/2024/01/12/aragorn.html"><![CDATA[<p><a href="/assets/papers/imwut24-aragorn.pdf" class="btn btn--primary btn--large">Local Download</a>
<a href="https://dl.acm.org/doi/abs/10.1145/3631406" class="btn btn--primary btn--large">ACM DL Link</a></p>

<p>Mobile app developers often rely on cameras to implement rich features. However, giving apps unfettered access to the mobile camera poses a privacy threat when camera frames capture sensitive information that is not needed for the app’s functionality. To mitigate this threat, we present Aragorn, a novel privacy-enhancing mobile camera system that provides fine-grained control over what information can be present in camera frames before apps can access them. Aragorn automatically sanitizes camera frames by detecting regions that are essential to an app’s functionality and blocking out everything else to protect privacy while retaining app utility. Aragorn can cater to a wide range of camera apps and incorporates knowledge distillation and crowdsourcing to extend robust support to previously unsupported apps. In our evaluations, we see that, with no degradation in utility, Aragorn detects credit cards with 89% accuracy and faces with 100% accuracy in the context of credit card scanning and face recognition, respectively. We show that Aragorn’s implementation in the Android camera subsystem incurs an average frame-rate drop of only 0.01 frames per second. Our evaluations show that the overhead Aragorn imposes on system performance is reasonable.</p>

<h2 id="citation">Citation</h2>

<p>Hari Venugopalan, Zainul Abi Din, Trevor Carpenter, Jason Lowe-Power, Samuel T. King, and Zubair Shafiq. 2024. Aragorn: A Privacy-Enhancing System for Mobile Cameras. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 7, 4, Article 181 (December 2023), 31 pages. https://doi.org/10.1145/3631406</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">Venugopalan2024aragorn</span><span class="p">,</span>
<span class="na">author</span> <span class="p">=</span> <span class="s">{Venugopalan, Hari and Din, Zainul Abi and Carpenter, Trevor and Lowe-Power, Jason and King, Samuel T. and Shafiq, Zubair}</span><span class="p">,</span>
<span class="na">title</span> <span class="p">=</span> <span class="s">{Aragorn: A Privacy-Enhancing System for Mobile Cameras}</span><span class="p">,</span>
<span class="na">year</span> <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
<span class="na">issue_date</span> <span class="p">=</span> <span class="s">{December 2023}</span><span class="p">,</span>
<span class="na">publisher</span> <span class="p">=</span> <span class="s">{Association for Computing Machinery}</span><span class="p">,</span>
<span class="na">address</span> <span class="p">=</span> <span class="s">{New York, NY, USA}</span><span class="p">,</span>
<span class="na">volume</span> <span class="p">=</span> <span class="s">{7}</span><span class="p">,</span>
<span class="na">number</span> <span class="p">=</span> <span class="s">{4}</span><span class="p">,</span>
<span class="na">url</span> <span class="p">=</span> <span class="s">{https://doi.org/10.1145/3631406}</span><span class="p">,</span>
<span class="na">doi</span> <span class="p">=</span> <span class="s">{10.1145/3631406}</span><span class="p">,</span>
<span class="na">journal</span> <span class="p">=</span> <span class="s">{Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.}</span><span class="p">,</span>
<span class="na">month</span> <span class="p">=</span> <span class="s">{jan}</span><span class="p">,</span>
<span class="na">articleno</span> <span class="p">=</span> <span class="s">{181}</span><span class="p">,</span>
<span class="na">numpages</span> <span class="p">=</span> <span class="s">{31}</span><span class="p">,</span>
<span class="na">keywords</span> <span class="p">=</span> <span class="s">{Knowledge Distillation, Object Detection}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Jason Lowe-Power</name><email>jlowepower@ucdavis.edu</email></author><category term="[&quot;security&quot;]" /><summary type="html"><![CDATA[Hari Venugopalan, Zainul Abi Din, Trevor Carpenter, Jason Lowe-Power, Samuel T. King, and Zubair Shafiq. IMWUT 2024.]]></summary></entry></feed>