(Courtesy Dirk Grunwald)
Semester Project Proposals for CSCI7143





Speculation Control & Confidence Estimation

"Speculation Control" is a recent development in computer architecture that attempts to control the degree or how speculation is used within the processor. In addition to the normal branch prediction mechanism, processors implement a confidence estimation mechanism. These mechanisms attempt to classify the branch prediction as "high confidence" (i.e., likely to be correct) or "low confidence" (i.e., likely to be wrong).

Technically, you should be able to use confidence estimation to improve branch prediction. In other words, if you predict that a branch is "taken", and the confidence estimator reports a "low confidence" on that prediction, you may want to invert the prediction to report a "not taken" prediction. In order to do this, you'll need to either have a good confidence estimator (better than any we've found) or figure out a way to take advantage of the confidence estimators in a limited fashion (i.e., figure out when you should believe the confidence estimator). This project will involve a modified version of SimpleScalar. The programming won't be particularly difficult, but the solution to this problem isn't immediately obvious and will require a little exploration and new ideas. Many processors (like the MIPS R10000) limit the number of unresolved branches because they need to attribute specific instructions to conditional branches. This is necessary to provide a "partial kill" of instructions in the process when a branch is mispredicted. Technically, you don't need to label instructions for each conditional branch -- if you skipped conditional branches (e.g., allowing 8 branches to be out-standing when you can only label instructions for 4 branches), you may be able to have more outstanding branch in the pipeline, and may improve the processor performance. One way to do this is to use confidence estimation to determine which branches are likely to be predicted correctly, and to not allocate branch labels for those "high confidence" branches. This project would use either SimpleScalar or another simulator infrastructure called PolyPath. The latter simulator is written in C++.  Cache and Memory System Management Normally, a processor will fetch a complete cache line when a cache miss occurs. If the processor only reads a single word from that cache line, it may have been better fetching only the word that was being referenced, since this would leave the contents of the old cache line available. A year ago, Tyson, Farrens & Pleszkun proposed the "Cache No Allocate" mechanism to determine which load instructions tend to cause cache misses that actually only use a single word from the cache line. In this project, you'll implement the "CNA" method in SimpleScalar. This will require reasonably proficient programming and data analysis skills. This is an idea that Matt Farrens and I have wanted to investigate since 1995. Once a cache line has been modified ("dirtied"), it will eventually need to be written back to the next level of the memory hierarchy. Normally, this is done on an as-needed basis; i.e., when a cache miss occurs, and the selected replacement location is dirty, the resident cache line needs to be written back to memory. If the cache line is not dirty, it does not need to be written back to memory. If the processor needs to wait for this old cache line to be written back to memory, it could impact performance because the processor could not load the new cache line until the old one is written back. Current processors use a write buffer to reduce this delay by copying the dirty cache line to the write buffer and then writing the result back to memory when the bus is not busy. Another alternative is to try to guess when it is unlikely that the processor will make further updates to an already dirty cache line. If you could accurately determine this, you could write-back the dirty cache line whenever the memory bus is idle. If the processor makes no further writes to the cache line, you'll incur less delay on a cache miss. This involves trade-offs in the size of the write-buffer and the accuracy of your guesses. A large write buffer would mask all of the delay from the write-backs, but large write buffers are expensive to implement. This project would most likely involve modifications to SimpleScalar. Matt Farrens has done some exploratory work on this, and we would probably coordinate with him. Prefetching is the process of moving memory close to the processor before it is needed. Chen & Baer proposed a simple, yet effective prefetch mechanism. Basically, they track the use of different load instructions and determine when a given load instruction is prefetching using an arithmetic progression (i.e., fetching from address X, X+8, X+16, X+24,...). You'll implement this in the SimpleScalar system. Changes to the structure of Simple Scalar Currently, SimpleScalar can not implement a true in-order architectural model. It's close, but the processor model requires a 2-entry RUU. You'll modify SimpleScalar to allow a true in-order architectural model. Currently, Simple Scalar uses an RUU to handle precise exceptions and the instruction window. You'll modify the processor to allow (different sized) reservation stations at the functional units, and then use this model to compare the performance of reservation stations vs. a central window. This will require more programming than some of the projects because the core of the processor model will be changed. The TERA processor uses an explicit dependence mechanism where the instruction specifies the next instruction (in program order) that is dependent on the current instruction. This simplifies the instruction decoder and issue logic. You'll extend the SimpleScalar RUU to use this mechanism and compare the performance against the current mechanism. The current mechanism is expensive, requiring a lot of CAM's, but there are probably performance limitations in the mechanism used by the Tera. The CounterFlow processor model uses localized communication to control instruction issue. There are no reservation stations or central instruction windows. This reduces the wire complexity of the processor and should allow a much faster implementation. In this project, you'll read the original papers by Sproull & Sutherland and some follow-on papers that have been published in the last two years. You'll modify the SimpleScalar processor core to implement a CounterFlow pipeline model. This is a relatively complicated project, but is a very interesting intellectual exercise.
    1. You can download the original paper on the CounterFlow Pipline Processor from Sun Lab
    2. The remaining work on the CFPP model has been done at Oregon State University. They have two online papers that are interesting.
  Branch and Value Prediction Value prediction is the process of guessing about the value of a loaded value prior to that value coming back from memory. This technique was introduced by Lipasti and Shen at ASPLOS'96, and has been studied extensively since then. In this project, you'll implement the value prediction technique of Lipasti and Shen in SimpleScalar and compare the performance to a conventional processor model. SimOS allows us to run "real" applications (e.g., the Java virtual machine and the operating system). It is thought that these "complex" applications have distinctly different branch prediction properties than the benchmarks normally used. You'll implement a few branch predictors in the SimOS environment and measure the prediction accuracy of different applications. There's little "new" intellectual contribution here, but it's always been the case that new data provides new insight into how computers should be designed. Mudge et al proposed the Bimod branch predictor last year. You'll implement this branch predictor in SimpleScalar and compare the performance to other leading-edge branch predictors, such as the Agree predictor and 2-level branch predictors. It's very likely that you'll discover a better technique.


Here's the paper on the bimode predictor in PDF format.

Simulation models indicate that branches take longer to execute than other instructions in out-of-order processors; i.e., branches remain "in-flight" much longer than other types of instructions. Various people have speculated that this occurs because branches tend to depend on memory operations, and that those memory operations have a longer average delay due to cache misses. It may be possible to improve processor performance by giving more priority to memory operations that will later be used for branches. In this project, you'll determine if this actually occurs. You don't need to determine a solution -- you just need to illustrate that memory operations are what causes branches to remain in the pipe for such a long time.