

# LLM: Realizing Low-Latency Memory by Exploiting Embedded Silicon Photonics for Irregular Workloads

<u>Marjan Fariborz</u>, Mahyar Samani, Pouya Fotouhi, Roberto Proietti, II-Min Yi, Venkatesh Akella, Jason Lowe-Power, Samuel Palermo, and S. J. Ben Yoo

University of California Davis, Texas A&M University



June 1, 2022





# Outline

- Motivation
- Background on Silicon Photonics
- LLM Architecture
- Evaluation methodology
- Evaluation results
- Conclusion






# Large Scale Irregular Application

- Modern applications have irregular memory access patterns with low locality.
- The memory system is the bottleneck.
- Rethink the architecture of the memory system for these applications:
  - Low latency
  - High bandwidth
  - Low variation in memory access latency
- The main source of latency is contention caused by shared resources.



Example applications (figure): recommendation systems, mining large graphs, speech recognition.





ISC-HPC 2022: LLM




#### 1. Interconnect:

Post-Moore's-law era: large monolithic dies are being replaced with smaller "chiplets".

Interconnecting chiplets is challenging.

Chiplets must share interconnect resources.









#### 2. Memory Controller:


#### Single memory controller per channel

- Single command and data bus.
- Memory timing constraints.
- Maintain low latency and high throughput


#### Requests targeting the same channel share the same read and write queues




#### 3. Memory channel:

DRAM systems are organized into a hierarchy of channels, banks, rows, and columns to exploit locality and parallelism.

Memory banks within a single memory channel share the same data and command buses.
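To make the channel, bank, row, and column hierarchy concrete, here is a minimal sketch of how a physical address might be split into DRAM coordinates. The field widths, the bit ordering, and the names `decode`, `FIELD_BITS`, and `is_bank_conflict` are illustrative assumptions; real memory controllers use a variety of mappings.

```python
from collections import namedtuple

DramAddress = namedtuple("DramAddress", "row bank channel column")

# Hypothetical field widths in bits; real devices and mappings differ.
FIELD_BITS = {"column": 6, "channel": 3, "bank": 4, "row": 14}
BLOCK_BITS = 6  # assumed 64-byte access granularity


def decode(addr):
    """Split a physical address into DRAM coordinates, lowest bits first."""
    bits = addr >> BLOCK_BITS
    fields = {}
    for name in ("column", "channel", "bank", "row"):
        width = FIELD_BITS[name]
        fields[name] = bits & ((1 << width) - 1)
        bits >>= width
    return DramAddress(**fields)


def is_bank_conflict(x, y):
    """True when two accesses hit the same channel and bank but different rows."""
    return (x.channel, x.bank) == (y.channel, y.bank) and x.row != y.row


a = decode(0)
b = decode(1 << 15)  # same channel as a, different bank
c = decode(1 << 19)  # same channel and bank as a, different row
```

In this toy mapping, `a` and `b` target different banks yet still compete for the same channel buses, while `a` and `c` additionally collide inside one bank.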







#### 4. Memory bank (bank conflict):

- Single sense amplifier
- Global bitlines
- Command decoder




## **Related Work**

- Low latency
  - "Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient system", Rohbani *et al.*, ISCA 2021
  - "Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations", Son *et al.*, ISCA 2013
  - "A case for exploiting subarray-level parallelism (SALP) in DRAM", Kim *et al.*, ISCA 2012
- High bandwidth
  - "Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems", O'Connor *et al.*, MICRO 2017
  - "Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems", Udipi *et al.*, ISCA 2011
  - "Re-architecting DRAM memory systems with monolithically integrated silicon photonics", Beamer *et al.*, ISCA 2010






### **End-to-End Latency Analysis**



# Low Latency Memory (LLM)

Design:

- Remove contention.
- Co-design the components on the data path.
- All-optical data plane with **silicon photonics** (SiPh).

#### **Benefits:**

- Lower queuing latency
- Lower latency variation
- Low energy per bit using SiPh
- More parallelism $\rightarrow$ high bandwidth








# Outline

- Motivation
- Background on Silicon Photonics
  - Microring
  - Silicon-Photonic Link
  - Array Waveguide Grating Router
- LLM Architecture
- Evaluation methodology
- Evaluation results
- Conclusion




# Background on Silicon Photonics (SiPh)

- Microring Resonators (MRs)
  - Resonates at a particular wavelength  $(\lambda)$
  - Filtering
  - Modulation

- Silicon-Photonic Link
  - Off-chip or on-chip laser
  - MRs as modulators at the sender
  - MRs as filters at the receiver
  - Thermal control to tune MRs
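As a rough illustration of why a microring picks out particular wavelengths, the sketch below applies the textbook ring-resonance condition $m\lambda = n_{\text{eff}} L$ with $L = 2\pi R$. The ring radius, the effective index, and the function name `resonant_wavelengths_nm` are illustrative assumptions and are not values taken from this work; dispersion is ignored.

```python
import math


def resonant_wavelengths_nm(radius_um, n_eff, lo_nm=1500.0, hi_nm=1600.0):
    """Return ring resonances (in nm) that fall inside [lo_nm, hi_nm].

    Uses m * lambda = n_eff * L with L = 2 * pi * R and treats n_eff
    as constant (no dispersion), which is a simplification.
    """
    optical_path_nm = n_eff * 2.0 * math.pi * radius_um * 1000.0
    m_min = math.ceil(optical_path_nm / hi_nm)   # smallest mode order in range
    m_max = math.floor(optical_path_nm / lo_nm)  # largest mode order in range
    # Higher mode order means shorter wavelength, so iterate downward.
    return [optical_path_nm / m for m in range(m_max, m_min - 1, -1)]


# Hypothetical 5 um ring with an assumed effective index of 2.4.
peaks = resonant_wavelengths_nm(radius_um=5.0, n_eff=2.4)
```

Because the resonance depends on the effective index, which shifts with temperature, a link needs some form of tuning to keep each MR aligned with its wavelength.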




## Arrayed Waveguide Grating Router (AWGR)

- AWGR
  - Wavelength multiplexer
  - Passive device
  - Bidirectional
  - Compact layout (<1 mm<sup>2</sup>)

#### **Contention-free one-to-all**

#### **Contention-free all-to-one**

#### **Contention-free all-to-all**

Figure: 4 × 4 AWGR, with wavelengths $\lambda_1$ to $\lambda_4$ on each input port distributed across the output ports.
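The contention-free property follows from the cyclic wavelength routing of an AWGR: the wavelength chosen at an input port determines which output port the light exits. The sketch below assumes the convention $w = (\text{output} - \text{input}) \bmod N$; actual devices may number ports and wavelengths differently, and the function names are hypothetical.

```python
def wavelength_index(input_port, output_port, n_ports):
    """Wavelength that links input_port to output_port.

    Assumes the cyclic convention w = (output - input) mod N; real
    devices may number ports and wavelengths differently.
    """
    return (output_port - input_port) % n_ports


def is_contention_free(n_ports):
    """Check one-to-all and all-to-one wavelength uniqueness."""
    ports = range(n_ports)
    # One-to-all: an input reaches every output on a different wavelength.
    one_to_all = all(
        len({wavelength_index(i, o, n_ports) for o in ports}) == n_ports
        for i in ports
    )
    # All-to-one: every input reaches a given output on a different wavelength.
    all_to_one = all(
        len({wavelength_index(i, o, n_ports) for i in ports}) == n_ports
        for o in ports
    )
    return one_to_all and all_to_one


# Wavelength assignment for a 4 x 4 device: rows are inputs, columns are outputs.
table = [[wavelength_index(i, o, 4) for o in range(4)] for i in range(4)]
```

When both checks hold, all-to-all traffic needs no arbitration in the optical fabric, since no two source and destination pairs share a wavelength on the same port.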



# Outline

- Motivation
- Background on Silicon Photonics
- LLM Architecture
  - Processor memory interconnect
  - Memory controller
  - Memory microarchitecture
  - Organization
- Evaluation methodology
- Evaluation results
- Conclusion



## Low Latency Memory (LLM)

Removing end-to-end contention:

- Ground-up co-design of the entire path
  - Interconnect
  - Memory controller
  - Memory microarchitecture
- Separating data and control plane:
  - Optical data plane
  - Electrical control plane






### LLM: Processor Memory Interconnect

#### Data plane

- Connecting each chiplet to each memory bank directly.
- Using low-energy, high-bandwidth-density, all-to-all optical interconnects $\rightarrow$ AWGR.








Figure labels: $\mu$Bank<sub>63</sub>, global sense amplifier, SerDes.



#### LLM: Processor Memory Interconnect

#### **Electrical Control Plane**

- Low-bandwidth electrical interconnect
  - Requests
  - Handshaking signals







## LLM: Memory Controller Architecture

#### **HBM** memory controller

- Shared electrical data bus
- Long shared data and command queue

#### LLM memory controller

- Dedicated optical data link
- No data queue
- Dedicated single entry command queue
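The effect of replacing a shared queue with dedicated per-bank resources can be illustrated with a deliberately simplified queuing model. This is a toy sketch, not the simulator used in this work: it ignores DRAM timing constraints, and the function name `average_wait` and the 10-cycle service time are assumptions chosen for illustration.

```python
def average_wait(requests, service_cycles, shared):
    """Average queuing delay (cycles) for (arrival_cycle, bank) requests.

    shared=True  -> one FIFO resource for the whole channel (HBM-like toy).
    shared=False -> an independent resource per bank (LLM-like toy).
    """
    free_at = {}  # resource id -> cycle at which it becomes idle
    total_wait = 0
    for arrival, bank in sorted(requests):
        resource = "channel" if shared else bank
        start = max(arrival, free_at.get(resource, 0))
        total_wait += start - arrival
        free_at[resource] = start + service_cycles
    return total_wait / len(requests)


# Four requests arriving in the same cycle, each to a different bank.
reqs = [(0, 0), (0, 1), (0, 2), (0, 3)]
shared_wait = average_wait(reqs, service_cycles=10, shared=True)
dedicated_wait = average_wait(reqs, service_cycles=10, shared=False)
```

In this model the shared resource serializes requests that have no bank conflict at all, which is the kind of contention a dedicated link per bank is meant to avoid.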




#### LLM: Bank Architecture



- 2× lower in-bank data movement latency
- 2× lower data movement energy
- 4× lower activation energy






## LLM organization

- 3D stacks
  - High bandwidth
  - High capacity
  - Replace data TSVs with Vertical Optical Interconnect (VOI)
- Non-stacked










# Outline

- Motivation
- Background on Silicon Photonics
- LLM Architecture
- Evaluation methodology
- Evaluation results
- Conclusion






## Evaluation methodology

#### gem5 simulator version 21.0

- Baseline memory systems:
  - HBM2.0
  - HBM + SALP
  - FGDRAM



- Synthetic test:
  - 32 traffic generators
  - Iso-Bandwidth test
    - Memories have the same peak bandwidth
- Irregular Workloads:
  - 16 CPU cores
  - GAP Benchmark Suite (GAPBS)
  - Iso-capacity test
    - Memories have the same capacity (8 channels)
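For the iso-bandwidth test, memories with different channel widths and data rates need different channel counts to reach the same peak bandwidth. The sketch below shows that arithmetic; the two organizations and the 256 GB/s target are hypothetical and are not the configurations used in the evaluation.

```python
import math


def peak_bandwidth_gbps(channels, bus_width_bits, gigatransfers_per_s):
    """Peak bandwidth in GB/s = channels * (width / 8 bytes) * GT/s."""
    return channels * (bus_width_bits / 8) * gigatransfers_per_s


def channels_for_target(target_gbps, bus_width_bits, gigatransfers_per_s):
    """Smallest channel count whose peak bandwidth meets the target."""
    per_channel = peak_bandwidth_gbps(1, bus_width_bits, gigatransfers_per_s)
    return math.ceil(target_gbps / per_channel)


# Hypothetical organizations, not the configurations used in the evaluation.
wide_slow = channels_for_target(256, bus_width_bits=128, gigatransfers_per_s=2.0)
narrow_fast = channels_for_target(256, bus_width_bits=32, gigatransfers_per_s=16.0)
```

A design with higher per-channel bandwidth therefore needs fewer channels to match the same peak bandwidth.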






# Outline

- Motivation
- Background on Silicon Photonics
- LLM Architecture
- Evaluation methodology
- Evaluation results
  - Synthetic workload
  - Irregular workload
- Conclusion




## **Evaluation: Synthetic Workload**

- Number of channels:
  - HBM(SALP) > FGDRAM > LLM



Check out our paper for more results!






#### **Evaluation: Irregular Workloads - Execution Time**










#### **Evaluation: Irregular Workloads – Power Consumption**






# Outline

- Motivation
- Background on Silicon Photonics
- LLM Architecture
- Evaluation methodology
- Evaluation results
- Conclusion






# Key Takeaways

- LLM proposes an end-to-end co-design that removes the contention on the data path.
- It proposes a new memory system optimized for applications with irregular access patterns.
- The use of optical links provides better data movement energy and higher bandwidth/mm<sup>2</sup>.
- LLM achieves around 3× better execution time while maintaining the same power consumption as HBM2.0.
- Future Work:
  - Exploring the benefits of using LLM in graph accelerators
  - Evaluating performance on other irregular and regular workloads
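A back-of-the-envelope reading of the execution-time and power takeaways, assuming memory energy is average power times execution time and that both systems draw the same average power $P$ (a simplification introduced here, not a reported measurement):

$$
\frac{E_{\text{LLM}}}{E_{\text{HBM2.0}}}
= \frac{P \cdot t_{\text{LLM}}}{P \cdot t_{\text{HBM2.0}}}
\approx \frac{P \cdot (t_{\text{HBM2.0}} / 3)}{P \cdot t_{\text{HBM2.0}}}
= \frac{1}{3}
$$

Under these assumptions, the same work would cost roughly one third of the memory energy.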




## Thank You


