This graph compares the performance of RADram with that of a normal processor/memory system on a component of the Optimal Register Allocator (ORA). The ORA uses the vector-matrix multiply primitive with a dense vector and a sparse matrix. As the L1 cache is scaled up to the size of the vector, the performance of the normal and RADram versions converges. However, once the problem size grows beyond the L1 cache size, the performance of the normal system degrades severely. The RADram version remains unaffected because it compresses the data in the memory system, which reduces its dependence on the L1 cache. Other RADram algorithms show the same behavior. The reconfigurable logic is used both to condense the data and to lay it out linearly, which shrinks and regularizes the processor's memory access pattern.
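The access-pattern argument above can be illustrated with a minimal sketch. It is not the RADram implementation; the compressed-column (CSC) layout and the names `vec_times_csc`, `gather_operands`, and `vec_times_csc_streamed` are assumptions chosen for illustration. The first routine reads the dense vector at data-dependent positions, which is the part that relies on the vector fitting in the L1 cache. The second and third split the work so that a stand-in for the memory-side logic gathers the needed vector entries into a linear stream and the processor then reads only sequential data.

```python
def vec_times_csc(x, col_ptr, row_idx, values):
    """Compute y = x * A for a dense vector x and a sparse matrix A in CSC form."""
    n_cols = len(col_ptr) - 1
    y = [0.0] * n_cols
    for j in range(n_cols):
        acc = 0.0
        for k in range(col_ptr[j], col_ptr[j + 1]):
            # x[row_idx[k]] is an irregular, data-dependent read of the vector.
            acc += x[row_idx[k]] * values[k]
        y[j] = acc
    return y


def gather_operands(x, row_idx):
    """Hypothetical stand-in for memory-side compaction: list the vector
    entries in exactly the order the multiply will consume them."""
    return [x[i] for i in row_idx]


def vec_times_csc_streamed(gathered, col_ptr, values):
    """Same product, but the processor walks two linear streams only."""
    n_cols = len(col_ptr) - 1
    y = [0.0] * n_cols
    for j in range(n_cols):
        acc = 0.0
        for k in range(col_ptr[j], col_ptr[j + 1]):
            acc += gathered[k] * values[k]
        y[j] = acc
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] stored column by column.
x = [1.0, 2.0, 3.0]
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [1.0, 4.0, 3.0, 2.0, 5.0]

y_direct = vec_times_csc(x, col_ptr, row_idx, values)
y_streamed = vec_times_csc_streamed(gather_operands(x, row_idx), col_ptr, values)
```

Both routines produce the same result. The difference is only in where the irregular indexing happens, which is the property the notes attribute to the reconfigurable logic.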
In this scientific benchmark, each matrix is extremely sparse, with fewer than 10 nonzeros per column. The RADram version is less affected by the disparity between system memory speed and processor speed. Moreover, the RADram system can exploit the parallelism inherent in the problem, while a single processor must work through it sequentially. Unlike the preceding ORA problem, the “vector” here is extremely sparse, so the performance of the normal version does not depend on the relative size of the L1 cache. Note that when the matrix is this sparse, the processor is bound by memory I/O rather than by the time per floating-point operation. Thus, a more intelligent memory system such as RADram can make better use of the core processor’s floating-point logic.
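The parallelism mentioned above can be seen in a small sketch of a sparse vector multiplied by a sparse matrix. This is an illustration under assumed data structures, not the benchmark's code: the vector is a dictionary from index to value, each column is a list of (row, value) pairs, and the names `sparse_dot` and `sparse_vec_times_matrix` are hypothetical. Each column's result depends only on that column and the vector, so the per-column dot products are independent of one another.

```python
def sparse_dot(vec, column):
    """Dot product of a sparse vector (dict: index -> value) with one
    sparse column given as a list of (row, value) pairs."""
    return sum((vec[r] * v for r, v in column if r in vec), 0.0)


def sparse_vec_times_matrix(vec, columns):
    """Multiply a sparse vector by a sparse matrix, one column at a time.
    No iteration uses the result of another, so the columns could be
    processed in any order or concurrently."""
    return [sparse_dot(vec, col) for col in columns]


vec = {0: 2.0, 3: 1.0}
columns = [
    [(0, 1.0), (2, 5.0)],  # column 0 has entries in rows 0 and 2
    [(1, 4.0)],            # column 1 has an entry in row 1 only
    [(3, 3.0), (0, 0.5)],  # column 2 has entries in rows 3 and 0
]

result = sparse_vec_times_matrix(vec, columns)
```

Each matching pair costs one multiply and one add but requires reading an index and a value, which is consistent with the point that such a computation is limited by memory traffic rather than by floating-point throughput.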