Tutorials for ASPLOS-VIII

Emerging Processor Techniques for Exploiting
Instruction-Level Parallelism

Instruction-Level Parallelism (ILP) has witnessed a number of significant innovations in the last few years. At the same time, explosive technology trends promise a continued boom in transistor counts. The resulting scenario is one where several recent ILP innovations are expected to be implemented in commercial processors in the next few years, making processor design and evaluation both complex and challenging. This tutorial presents an overview of important recently proposed ILP processor techniques. Primarily, the tutorial is a quick but comprehensive tour of ILP techniques for each of the processor pipeline stages and selected aspects of the top-level memory hierarchy. Examples of topics to be covered include: high-bandwidth instruction fetch, dispatch, and issue mechanisms; high-accuracy and high-bandwidth control prediction; memory dependence handling; data value prediction; aspects of high-bandwidth data caches; reuse of dependence and scheduling information; recovery from mis-speculation; sub-word SIMD parallelism for multimedia; and support for precise interrupts. Drawing from the relevant literature, we discuss the hardware complexity of several of these techniques. Putting things together, we present an example high-ILP processor of the future that incorporates several of these techniques, to show how they might fit together and interact with one another. In conclusion, we mention open research issues and current research directions in the ILP area.
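As one concrete example of the control-prediction techniques surveyed here, the classic two-bit saturating-counter branch predictor can be sketched in a few lines. This is an illustrative textbook baseline, not a scheme specific to this tutorial:

```python
# Illustrative two-bit saturating-counter branch predictor.
# Counter states 0-1 predict not-taken; states 2-3 predict taken.
# A table of counters is indexed by the low bits of the branch PC.

class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)  # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # True => predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop-closing branch (taken nine times, then not taken once) is
# mispredicted only at the loop exit once the counter warms up.
p = TwoBitPredictor()
hits = 0
for trip in range(100):
    for taken in [True] * 9 + [False]:
        hits += p.predict(0x400) == taken
        p.update(0x400, taken)
# hits is 899 of 1000: one warm-up miss plus one miss per loop exit.
```

The hysteresis of the two-bit counter is the key point: a single loop exit does not flip the prediction for the next loop entry, which is why it outperforms a one-bit scheme on loop branches.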

About the Presenter

Sriram Vajapeyam is an Assistant Professor at the Indian Institute of Science, Bangalore, with primary research interests in processor architecture. His recent research contributions include the first work on trace-centric processors. His industry experience includes over a year with the processor design group of Cray Research, Inc. in 1991-1992. Sriram obtained a Ph.D. from the University of Wisconsin, Madison in 1991 for a thesis on the characterization of the Cray Y-MP processor. He is currently researching hardware and software runtime schemes for further exploiting instruction-level parallelism in high-performance processors.

The Impact of Database System Configuration
on Computer Architecture Performance Evaluation

Online transaction processing (OLTP) and decision support (DSS) databases are important, yet often overlooked workloads for evaluating the effectiveness of computer architecture innovations. Using commercial workloads to drive performance studies often proves to be difficult, due to the complexity of the workloads, the large hardware requirements for fully scaled experiments, the lack of access to source code, and the restrictions on the disclosure of performance information. How can we make it tractable for academic and industrial researchers to measure these workloads?

This tutorial will feature a series of industrial presentations, followed by a panel discussion to "ask the industrial experts" questions such as the following:

  • How does one build well-balanced configurations to run a TPC-C or TPC-D workload?
  • What does it mean for a system to be "well-balanced"?
  • (How) is it possible to scale back the hardware requirements of full-scale systems, while ensuring representative behavior?
  • Alternatively, what microbenchmarks can be used to measure representative behavior?
  • How do (academic) researchers get around logistical difficulties of running these workloads?
  • What are the right future benchmarks to examine?


Organizer:

  • Kim Keeton, UC Berkeley

Panelists:

  • Walter Baker, Gradient Systems (formerly of Informix Software)
  • Luiz Barroso, Compaq/DEC Western Research Laboratory
  • Michael Koster, Sun Microsystems
  • Seckin Unlu, Intel

Kim Keeton, the organizer of this session, is completing her Ph.D. at UC Berkeley with Dave Patterson on computer architecture support for database workloads. She has worked with Informix to analyze the processor and memory system behavior of their shared-memory database for OLTP workloads, and is currently investigating the use of increasingly intelligent disks (IDISKs) for offloading data-intensive DSS operations.

The industrial experts participating in this session collectively possess decades of experience analyzing the performance of databases and other commercial applications, and grappling with many of these issues.

SimOS: The Complete Machine Simulator

SimOS is a complete machine simulation environment designed for the efficient and accurate study of both uniprocessor and multiprocessor computer systems. SimOS simulates computer hardware in enough detail to boot and run commercial operating systems. SimOS currently models hardware similar to that of machines sold by Silicon Graphics, Inc. We simulate CPUs, caches, multiprocessor memory busses, disk drives, ethernet, consoles, and other devices commonly found on these machines. By simulating the hardware typically found on commercial computer platforms, we are able to easily port existing operating systems to the SimOS environment.

A key component of SimOS is "annotations": non-intrusive Tcl scripts that are executed whenever an event of interest occurs in the simulator. These events may include the execution of a particular program counter value, a reference to a specified memory address, or even reaching a particular cycle count. Annotations collect and classify performance data, providing detailed information about operating system performance, application behavior, or architectural decisions.
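The annotation idea can be illustrated with a small sketch. SimOS annotations are written in Tcl; the Python model below is a hypothetical analogue showing how callbacks triggered on PC values, memory addresses, or cycle counts might collect data. It is illustrative only and does not use actual SimOS syntax:

```python
# Hypothetical sketch of SimOS-style annotations: callbacks registered
# on simulator events (a PC value, a memory address, a cycle count)
# fire non-intrusively as the simulated machine runs.

class Simulator:
    def __init__(self):
        self.cycle = 0
        self.annotations = {"pc": {}, "addr": {}, "cycle": {}}

    def annotate(self, kind, key, fn):
        """Register fn to run whenever event (kind, key) occurs."""
        self.annotations[kind].setdefault(key, []).append(fn)

    def _fire(self, kind, key):
        for fn in self.annotations[kind].get(key, []):
            fn(self)

    def step(self, pc, addr=None):
        """Simulate one instruction, firing any matching annotations."""
        self.cycle += 1
        self._fire("pc", pc)
        if addr is not None:
            self._fire("addr", addr)
        self._fire("cycle", self.cycle)

# Example: count entries to a (hypothetical) kernel routine at PC 0x8000.
sim = Simulator()
counts = {"kernel_entry": 0}

def count_entry(s):
    counts["kernel_entry"] += 1

sim.annotate("pc", 0x8000, count_entry)
for pc in [0x1000, 0x8000, 0x1004, 0x8000]:
    sim.step(pc)
# counts["kernel_entry"] is now 2.
```

Because the callbacks observe simulator state rather than modify the simulated program, the measured workload behaves exactly as it would without instrumentation, which is the property the tutorial describes as "non-intrusive."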

In this half-day tutorial, we cover the design decisions made in the development of SimOS and present several case studies highlighting SimOS's utility.

About the Presenter

Mendel Rosenblum is an Assistant Professor in the Computer Science Department at Stanford University. His research focuses on system software and simulation systems for high-performance computing architectures, including the SimOS and Flash projects.

Tutorial 4
Profiling, Instrumentation, and Profile Based Optimization
David Goodwin and Robert Cohn
Alpha Design Group

Our understanding of the dynamic characteristics of programs drives the development of next-generation processors and compilers. At Compaq, we use a variety of tools to study and exploit the behavior of programs. An execution profile is a summary of the number of times an event occurs for each instruction, where an event can be an instruction issue, a cache miss, a clock cycle, etc. We study profiles to identify performance pitfalls and opportunities for optimization. Compilers use profiles to guide optimization. Instrumentation is the process of adding extra code to a program to measure some characteristic. We use customized instrumentation to model architectural alternatives and study program behavior. This tutorial describes the technology behind profiling, instrumentation, and profile-based optimization. We discuss how to decide which technique is appropriate for a problem, and describe the tools that are publicly available from Compaq for Alpha-based Unix and NT systems.
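To make the notion of an execution profile concrete, the sketch below counts how often each source line executes using Python's trace hook, loosely analogous to the per-instruction event counts produced by binary instrumentation. This is an illustration of the general technique only; the Compaq tools discussed in the tutorial operate on Alpha binaries, not Python:

```python
# Instrumentation-based profiling sketch: a trace hook (the "extra
# code" added to the program) counts executions of each source line,
# much as instrumented binaries count events per instruction.

import sys
from collections import Counter

line_counts = Counter()

def tracer(frame, event, arg):
    # Count each "line" event, keyed by (function name, line number).
    if event == "line":
        line_counts[(frame.f_code.co_name, frame.f_lineno)] += 1
    return tracer  # keep tracing inside called functions

def workload(n):
    total = 0
    for i in range(n):       # this loop dominates the profile
        total += i * i
    return total

sys.settrace(tracer)
workload(1000)
sys.settrace(None)

# The hottest line points at the optimization opportunity.
hottest = line_counts.most_common(1)[0]
```

A profile like `line_counts` is exactly the kind of summary a profile-based optimizer consumes: it tells the compiler where the time goes, so effort (inlining, code layout, register allocation) can be concentrated on the hot paths.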

David Goodwin is in the Alpha Design Group, where he works on architecture and compiler advanced development. He has contributed to the performance analysis of the 21164, 21164PC, and 21264 microprocessors. David has implemented profile-directed register allocation and interprocedural dataflow analysis in the Spike executable optimizer. David received a B.S.E.E. from Virginia Tech and a Ph.D. in computer science from the University of California, Davis.

Robert Cohn is also in the Alpha Design Group, where he works on advanced compiler technology for Alpha microprocessors. He has implemented trace scheduling and profile-based optimizations in the production compilers for Alpha. He is a key contributor to Spike, implementing code layout and other profile-based optimizations. Robert received a B.A. from Cornell University and a Ph.D. from Carnegie Mellon, both in computer science.

Last updated August 25, 1998