(Courtesy Dirk Grunwald)
Semester Project Proposals for
CSCI7143
Speculation Control & Confidence
Estimation
"Speculation Control" is a recent development in computer architecture
that attempts to control the degree or how speculation is used within the
processor. In addition to the normal branch prediction mechanism, processors
implement a confidence estimation mechanism. These mechanisms attempt
to classify the branch prediction as "high confidence" (i.e., likely
to be correct) or "low confidence" (i.e., likely to be wrong).
-
Using confidence estimation to improve branch
prediction
Technically, you should be able to use confidence estimation to improve
branch prediction. In other words, if you predict that a branch is "taken",
and the confidence estimator reports a "low confidence" on that prediction,
you may want to invert the prediction to report a "not taken" prediction.
In order to do this, you'll need to either have a good confidence estimator
(better than any we've found) or figure out a way to take advantage of
the confidence estimators in a limited fashion (i.e., figure out
when you should believe the confidence estimator). This project will involve
a modified version of SimpleScalar. The programming won't be particularly
difficult, but the solution to this problem isn't immediately obvious and
will require a little exploration and new ideas.
-
Using confidence estimation to control register
rename checkpointing
Many processors (like the MIPS R10000) limit the number of unresolved
branches because they need to attribute specific instructions to conditional
branches. This is necessary to provide a "partial kill" of instructions
in the process when a branch is mispredicted. Technically, you don't need
to label instructions for each conditional branch -- if you skipped
conditional branches (e.g., allowing 8 branches to be out-standing
when you can only label instructions for 4 branches), you may be able to
have more outstanding branch in the pipeline, and may improve the processor
performance. One way to do this is to use confidence estimation to determine
which branches are likely to be predicted correctly, and to not allocate
branch labels for those "high confidence" branches. This project would
use either SimpleScalar or another simulator infrastructure called PolyPath.
The latter simulator is written in C++.
Cache and Memory
System Management
Normally, a processor will fetch a complete cache line when a cache
miss occurs. If the processor only reads a single word from that cache
line, it may have been better fetching only the word that was being referenced,
since this would leave the contents of the old cache line available. A
year ago, Tyson, Farrens & Pleszkun proposed the "Cache No Allocate"
mechanism to determine which load instructions tend to cause cache misses
that actually only use a single word from the cache line. In this project,
you'll implement the "CNA" method in SimpleScalar. This will require reasonably
proficient programming and data analysis skills.
This is an idea that Matt Farrens and I have wanted to investigate
since 1995. Once a cache line has been modified ("dirtied"), it will eventually
need to be written back to the next level of the memory hierarchy. Normally,
this is done on an as-needed basis; i.e., when a cache miss occurs,
and the selected replacement location is dirty, the resident cache line
needs to be written back to memory. If the cache line is not dirty, it
does not need to be written back to memory. If the processor needs to wait
for this old cache line to be written back to memory, it could impact performance
because the processor could not load the new cache line until the old one
is written back. Current processors use a write buffer to reduce
this delay by copying the dirty cache line to the write buffer and then
writing the result back to memory when the bus is not busy. Another alternative
is to try to guess when it is unlikely that the processor will make further
updates to an already dirty cache line. If you could accurately determine
this, you could write-back the dirty cache line whenever the memory bus
is idle. If the processor makes no further writes to the cache line, you'll
incur less delay on a cache miss. This involves trade-offs in the size
of the write-buffer and the accuracy of your guesses. A large write buffer
would mask all of the delay from the write-backs, but large write buffers
are expensive to implement. This project would most likely involve modifications
to SimpleScalar. Matt Farrens has done some exploratory work on this, and
we would probably coordinate with him.
-
Implement Baer & Chen stride prefetch mechanism
Prefetching is the process of moving memory close to the processor
before it is needed. Chen & Baer proposed a simple, yet effective prefetch
mechanism. Basically, they track the use of different load instructions
and determine when a given load instruction is prefetching using an arithmetic
progression (i.e., fetching from address X, X+8, X+16, X+24,...).
You'll implement this in the SimpleScalar system.
Changes to the structure of Simple
Scalar
-
Modify SimpleScalar to simulate a true inorder
architecture
Currently, SimpleScalar can not implement a true in-order architectural
model. It's close, but the processor model requires a 2-entry RUU. You'll
modify SimpleScalar to allow a true in-order architectural model.
-
Implement reservatation stations rather than RUU
Currently, Simple Scalar uses an RUU to handle precise exceptions
and the instruction window. You'll modify the processor to allow (different
sized) reservation stations at the functional units, and then use this
model to compare the performance of reservation stations vs. a central
window. This will require more programming than some of the projects because
the core of the processor model will be changed.
-
Implement the TERA mechanism for dependence enforcement
The TERA processor uses an explicit dependence mechanism where the
instruction specifies the next instruction (in program order) that is dependent
on the current instruction. This simplifies the instruction decoder and
issue logic. You'll extend the SimpleScalar RUU to use this mechanism and
compare the performance against the current mechanism. The current mechanism
is expensive, requiring a lot of CAM's, but there are probably performance
limitations in the mechanism used by the Tera.
-
Implement CounterFlow pipeline model
The CounterFlow processor model uses localized communication to control
instruction issue. There are no reservation stations or central instruction
windows. This reduces the wire complexity of the processor and should allow
a much faster implementation. In this project, you'll read the original
papers by Sproull & Sutherland and some follow-on papers that have
been published in the last two years. You'll modify the SimpleScalar processor
core to implement a CounterFlow pipeline model. This is a relatively complicated
project, but is a very interesting intellectual exercise.
-
You can download the original paper on the CounterFlow
Pipline Processor from Sun Lab
-
The remaining work on the CFPP model has been done at Oregon
State University. They have two online papers that are interesting.
Branch and Value Prediction
-
Implement value prediction
Value prediction is the process of guessing about the value of a loaded
value prior to that value coming back from memory. This technique was introduced
by Lipasti and Shen at ASPLOS'96, and has been studied extensively since
then. In this project, you'll implement the value prediction technique
of Lipasti and Shen in SimpleScalar and compare the performance to a conventional
processor model.
-
Measure branch prediction accuracy using SimOS
SimOS allows us to run "real" applications (e.g., the Java
virtual machine and the operating system). It is thought that these "complex"
applications have distinctly different branch prediction properties than
the benchmarks normally used. You'll implement a few branch predictors
in the SimOS environment and measure the prediction accuracy of different
applications. There's little "new" intellectual contribution here, but
it's always been the case that new data provides new insight into how computers
should be designed.
-
Implement the Bimod branch predictor
Mudge et al proposed the Bimod branch predictor last year.
You'll implement this branch predictor in SimpleScalar and compare the
performance to other leading-edge branch predictors, such as the Agree
predictor and 2-level branch predictors. It's very likely that you'll discover
a better technique.
Here's the paper
on the bimode predictor in PDF format.
-
Categorize misprediction accuracy vs. memory access
time for branches
Simulation models indicate that branches take longer to execute than
other instructions in out-of-order processors; i.e., branches remain
"in-flight" much longer than other types of instructions. Various people
have speculated that this occurs because branches tend to depend on memory
operations, and that those memory operations have a longer average delay
due to cache misses. It may be possible to improve processor performance
by giving more priority to memory operations that will later be used for
branches. In this project, you'll determine if this actually occurs. You
don't need to determine a solution -- you just need to illustrate that
memory operations are what causes branches to remain in the pipe for such
a long time.