The Importance of Decoupled Architectures
in Future Processor Designs
Matthew Farrens
Gary Tyson
Introduction
The ability to tolerate latencies (in particular, memory latencies) is
becoming a critical component of high-performance processor designs. This
tutorial presents an overview of decoupled architectures, a class of
latency-tolerant architectures and techniques that are becoming increasingly
important in the drive for high performance. The history of decoupled
architectures is presented, followed by a survey of existing decoupled
designs (the R10000, for example) and the reasons why this approach
will have an impact on processor designs well into the next century.
Outline
- Introduction to Decoupled Processing
- Decoupled processing will be defined, and the goals of decoupling and the
techniques employed will be discussed. The motivation for decoupling will then
be presented, and its importance to current and future designs will be
highlighted. Finally, the factors limiting decoupled performance will be discussed.
- Overview of the Decoupled Model
- Goals of Decoupling
- Memory Latency Tolerance
- Instruction-Level Parallelism (ILP)
- Provide Constrained Out-of-Order Execution (Slip)
- Decoupling Approaches
- Multiple Processors
- Dynamic Loop Pipelining
- Architecturally Visible Memory Queues
- Reduced Register Pressure by Using Queues
- Existing Decoupled Processors
- Several existing decoupled processors will be detailed. Five classes
will be defined, the existing processors will be classified, and example code
will be shown to illustrate the differences.
- Class 1 : Multiple Instruction Streams
- General Description (including code example)
- Existing Machines
- Class 2 : Single Instruction Stream, with Macro, processor fetch
- General Description (including code example)
- Existing Machines
- Class 3 : Single Instruction Stream, with Macro, shared IFU
- General Description (including code example)
- Existing Machines
- Class 4 : Single Instruction Stream, without Macro, processor fetch
- General Description (including code example)
- Existing Machine
- Class 5 : Single Instruction Stream, without Macro, shared IFU
- General Description (including code example)
- Existing Machines
- Research on Decoupled Architectures
- The research performed on decoupled processing will be summarized,
including both past and future research domains.
- Previous Research
- Overlapped Execution of Inherently Single Issue Code (MISC v. VLIW)
- DAE v. Superscalar on Livermore Loops
- Reduction of Register Pressure
- Memory Latency Tolerance (with PIPE, DAE v. CRAY-1)
- Effects of Load Imbalance (with PIPE)
- Architectural Slip Limitations (with ZS-1)
- Memory Queue Depth (with PIPE, DAE v. CRAY-1, many studies)
- Data Cache Utility/Interaction (by Kurian et al.)
- Heterogeneous/Homogeneous Processors
- Future Research Issues
- Available Program Slip
- Processor Load Imbalance
- Queue Accessing Capabilities
- Architectural Slip Limitations
- Code Expansion Implications
- Data Cache Impacts
- Processor Designs
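
The outline lists memory latency tolerance, slip, and memory queue depth as central topics. As a rough illustration only, the sketch below compares a coupled processor, which stalls on every load, with a decoupled pair in which an access unit runs ahead and fills a load queue that an execute unit drains. The cycle model, the function names, and the parameters (`n`, `latency`, `depth`) are assumptions made for this example and are not taken from the tutorial or from any particular machine.

```python
from collections import deque


def coupled_cycles(n, latency):
    """Assumed model: each element waits `latency` cycles for its load,
    then spends 1 cycle computing, with no overlap between elements."""
    return n * (latency + 1)


def decoupled_cycles(n, latency, depth):
    """Assumed model: an access unit issues at most one load per cycle
    into a load queue holding up to `depth` entries; an execute unit
    consumes the head entry (1 cycle each) once its data has arrived."""
    queue = deque()  # arrival cycle of each outstanding load, in program order
    issued = consumed = 0
    cycle = 0
    while consumed < n:
        # Execute unit: consume the oldest load if its data is available.
        if queue and queue[0] <= cycle:
            queue.popleft()
            consumed += 1
        # Access unit: slip ahead of the execute unit while a slot is free.
        if issued < n and len(queue) < depth:
            queue.append(cycle + latency)
            issued += 1
        cycle += 1
    return cycle


if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, decoupled_cycles(4, 3, depth), coupled_cycles(4, 3))
```

Under these assumptions, four elements with a three-cycle memory latency take 16 cycles on the coupled model, 13 cycles with a one-entry queue, and 7 cycles once the queue is deep enough to cover the latency. This is the kind of dependence on queue depth that the studies listed above examine, though real machines involve many effects this toy model ignores.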