Tutorials for ASPLOS-VIII
Instruction-Level Parallelism (ILP) has witnessed a number
of significant innovations in the last few years. At the same time,
explosive technology trends promise a continued boom in transistor
counts. The resulting scenario is one where several recent ILP
innovations are expected to be implemented in commercial processors
in the next few years, making processor design and evaluation both
complex and challenging. This tutorial presents an overview of
important recently-proposed ILP processor techniques. Primarily, the
tutorial is a quick but comprehensive tour of ILP techniques for
each of the different processor pipeline stages and selected aspects
of the top levels of the memory hierarchy. Examples of topics to be covered
include: high-bandwidth instruction fetch, dispatch, and issue
mechanisms; high-accuracy and high-bandwidth control prediction;
memory dependence handling; data value prediction; aspects of
high-bandwidth data caches; reuse of dependence and scheduling
information; recovery from mis-speculation; sub-word SIMD parallelism
for multimedia; and support for precise interrupts. Drawing from relevant
literature, we discuss the hardware complexity of several of the
techniques. Putting things together, we present an example high-ILP
processor of the future that incorporates several of these techniques,
in order to show how they might fit together and interact with each
other. In conclusion, we mention open research issues and current
research directions in the ILP area.
Emerging Processor Techniques for Exploiting Instruction-Level Parallelism
About the Presenter
Sriram Vajapeyam is an Assistant Professor at the Indian Institute of
Science, Bangalore, with primary research interests in processor
architecture. His recent research contribution includes the first work
on trace-centric processors. The speaker's industry experience includes
over a year with the processor design group of Cray Research, Inc. in
1991-1992. Sriram obtained a Ph.D. from the University of Wisconsin,
Madison in 1991 for a thesis on the characterization of the Cray Y-MP
processor. He is currently researching hardware and software runtime
schemes for further exploiting instruction-level parallelism.
Online transaction processing (OLTP) and decision support (DSS)
databases are important, yet often overlooked workloads for evaluating
the effectiveness of computer architecture innovations. Using
commercial workloads to drive performance studies often proves to be
difficult, due to the complexity of the workloads, the large hardware
requirements for fully scaled experiments, the lack of access to
source code, and the restrictions on the disclosure of performance
information. How can we make it tractable for academic and industrial
researchers to measure these workloads?
The Impact of Database System Configuration
on Computer Architecture Performance Evaluation
This tutorial will feature a series of industrial presentations,
followed by a panel discussion to "ask the industrial experts"
questions such as the following:
How does one build well-balanced configurations to run a TPC-C or
TPC-D benchmark?
What does it mean for a system to be "well-balanced"?
(How) is it possible to scale back the hardware requirements of
full-scale systems, while ensuring representative behavior?
Alternatively, what microbenchmarks can be used to measure
representative behavior?
How do (academic) researchers get around logistical difficulties
of running these workloads?
What are the right future benchmarks to examine?
Kim Keeton, UC Berkeley
Walter Baker, Gradient Systems (formerly of Informix Software)
Luiz Barroso, Compaq/DEC Western Research Laboratory
Michael Koster, Sun Microsystems
Seckin Unlu, Intel
Kim Keeton, the organizer of this session, is completing her PhD at UC
Berkeley with Dave Patterson on computer architecture support for
database workloads. She has worked with Informix to analyze the
processor and memory system behavior of their shared memory database
for OLTP workloads, and is currently investigating the use of
increasingly intelligent disks (IDISKs) for offloading data-intensive
processing.
The industrial experts participating in this session collectively
possess decades of experience analyzing the performance of databases
and other commercial applications, and grappling with many of these issues.
SimOS is a complete machine simulation environment designed for the
efficient and accurate study of both uniprocessor and multiprocessor
computer systems. SimOS simulates computer hardware in enough detail to
boot and run commercial operating systems. SimOS currently models
hardware similar to that of machines sold by Silicon Graphics, Inc. We
simulate CPUs, caches, multiprocessor memory busses, disk drives,
Ethernet, consoles, and other devices commonly found on these machines.
By simulating the hardware typically found on commercial computer
platforms, we are able to easily port existing operating systems to the
SimOS environment.
SimOS: The Complete Machine Simulator
A key component of SimOS is "annotations". Annotations are
non-intrusive Tcl scripts that are executed whenever an event of
interest occurs in the simulator. These events may be execution
of a particular program counter value, a reference to a specified
memory address, or even reaching a particular cycle count.
Annotations collect and classify performance data, providing detailed
information regarding operating system performance,
application behavior, or architectural decisions.
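The annotation mechanism can be pictured as an event-callback registry. The sketch below is a minimal illustration in Python (SimOS annotations are actually Tcl scripts, and every name here — Simulator, annotate, fire, count_syscall — is invented for illustration, not part of SimOS):

```python
# Illustrative sketch of the annotation idea: callbacks keyed on
# simulator events such as ("pc", addr), ("mem", addr), or ("cycle", n).
# All names are invented; SimOS itself executes Tcl scripts on events.

class Simulator:
    def __init__(self):
        self.annotations = {}   # event -> list of registered callbacks
        self.counters = {}      # performance data collected by hooks

    def annotate(self, event, fn):
        """Register a non-intrusive hook: it observes simulator state
        but does not alter simulated execution."""
        self.annotations.setdefault(event, []).append(fn)

    def fire(self, event, state):
        """Called by the simulator whenever an event of interest occurs."""
        for fn in self.annotations.get(event, ()):
            fn(state)

sim = Simulator()

# Classify an event whenever a particular program counter is reached.
def count_syscall(state):
    sim.counters["syscalls"] = sim.counters.get("syscalls", 0) + 1

sim.annotate(("pc", 0x80001000), count_syscall)

# Drive the hook as the simulated CPU would when it executes that PC.
for _ in range(3):
    sim.fire(("pc", 0x80001000), state={})

print(sim.counters["syscalls"])  # 3
```

The point of the design is that measurement code lives entirely outside the simulated workload, so the operating system and applications under study run unmodified.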
In this half-day tutorial, we cover the design decisions made
in the development of SimOS and present several case studies
highlighting SimOS's utility.
About the presenter:
Mendel Rosenblum is an Assistant Professor in the Computer Science
Department at Stanford University. His research focuses on
system software and simulation systems for high performance
computing architectures, including the SimOS and Flash projects.
Our understanding of the dynamic characteristics of programs drives the
development of next generation processors and compilers. At Compaq, we
use a variety of tools to study and exploit the behavior of programs.
An execution profile is a summary of the number of times an event occurs
for each instruction, where an event can be instruction issue, cache miss,
clock cycle, etc. We study profiles to identify performance pitfalls and
opportunities for optimization. Compilers use profiles to guide optimization.
Instrumentation is the process of adding extra code to a program to measure
some characteristic. We use customized instrumentation to model architectural
alternatives and study program behavior. This tutorial describes the technology
behind profiling, instrumentation, and profile based optimization. We
discuss how to decide which technique is appropriate for a problem, and
describe the tools that are publicly available from Compaq for Alpha based
Unix and NT systems.
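As a rough sketch of the two ideas — not Compaq's tools, which rewrite Alpha executables — the following toy example "instruments" a list-of-instructions program by inserting counting code before each instruction, yielding an execution profile. All names here are invented for illustration:

```python
# Toy illustration of profiling via instrumentation: extra code is
# added before each instruction to count how often it executes.
# This is a sketch of the concept only, not how ATOM/Spike work.

program = [
    ("load", "a"),
    ("load", "b"),
    ("add", None),
    ("store", "c"),
]

def instrument(prog):
    """Return a new program in which each instruction is preceded by
    inserted measurement code -- the essence of instrumentation."""
    new = []
    for idx, instr in enumerate(prog):
        new.append(("count", idx))   # inserted counting instruction
        new.append(instr)
    return new

def run(prog, profile):
    for op, arg in prog:
        if op == "count":
            profile[arg] = profile.get(arg, 0) + 1
        # the "real" work of load/add/store is elided; only the
        # profile matters for this sketch

profile = {}
instrumented = instrument(program)
for _ in range(10):          # execute the instrumented program ten times
    run(instrumented, profile)

# The profile maps each instruction index to its execution count.
print(profile)  # {0: 10, 1: 10, 2: 10, 3: 10}
```

A real profile records events such as issues, cache misses, or cycles per instruction address, and an optimizer can consult those counts to, for example, lay out hot code paths contiguously.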
Profiling, Instrumentation, and Profile Based Optimization
David Goodwin and Robert Cohn
Alpha Design Group
David Goodwin is in the Alpha Design Group, where he works on
architecture and compiler advanced development. He has contributed
to the performance analysis of the 21164, 21164PC, and 21264
microprocessors. David has implemented profile-directed register
allocation and interprocedural dataflow analysis in the Spike
executable optimizer. David received a B.S.E.E. from Virginia Tech
and a Ph.D. in computer science from the University of California.
Robert Cohn is also in the Alpha Design Group, where he works on advanced
compiler technology for Alpha microprocessors. He has implemented trace
scheduling and profile based optimizations in the production compilers
for Alpha. He is a key contributor to Spike, implementing code layout
and other profile based optimizations. Robert received a BA from Cornell
University and a Ph.D. from Carnegie Mellon, both in computer science.
Last updated August 25, 1998