High Performance Substrate (HPS) research of Yale Patt and students

(This is incomplete, but gives a starting subset of the work and citations.)

Modern superscalar processors such as the HP PA 8000, IBM/Motorola PowerPC 604, Intel P6, and MIPS R10000 are built around a dynamically scheduled microengine. This style of design has often been referred to as restricted data flow [PHS85]. The microengine typically contains many functional units and can hold tens of instructions in various stages of out-of-order execution. While most of these processors have instruction sets classified as RISC, the Intel P6 and earlier work by Patt and colleagues on a VAX design [PMHSCW86,WMSHP87] demonstrate that the advantages of restricted data flow also apply to CISC instruction sets.

Patt and his colleagues have been investigating dynamically scheduled microengines over the past ten years, beginning with the HPS in 1985 [PHS85,PHMS85,HP86]. From the beginning they considered that a decoded instruction cache might be a useful supplement to the microengine to improve performance. In 1988, Melvin, Shebanow, and Patt described in more detail a hardware assist, called a fill unit, that compacts microoperations generated from sequentially fetched instructions into a decoded instruction cache [MSP88]. The purpose of the fill unit was to construct a larger piece of atomic work that could be given to the dynamic scheduler, thereby increasing the utilization of the functional units. In their design, architected registers (i.e., those specified in the instruction set architecture) as well as microengine registers (i.e., logical registers visible at the microarchitecture level only) were renamed to encode forwarding requirements between and within instructions. Microoperation references were also renamed into the address space of the decoded instruction cache, and microoperations from a single instruction could be split across two lines in the decoded instruction cache. Filling stopped (i.e., a decoded instruction cache line was "finalized") whenever a branch was encountered or no empty microoperation slots remained in the line being filled. Microoperations with data dependencies would be executed in the proper order by the underlying dynamic scheduling hardware.
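The line-finalization rule described above can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the design in [MSP88]: the names `MicroOp` and `FillUnit`, the line capacity of four slots, and the choice to place the branch microoperation in the line it terminates are all hypothetical, and register renaming is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MicroOp:
    """A single microoperation (hypothetical representation)."""
    text: str
    is_branch: bool = False


@dataclass
class FillUnit:
    """Toy fill unit: packs microoperations into decoded-cache lines."""
    slots: int = 4  # assumed line capacity, not taken from the source
    current: List[MicroOp] = field(default_factory=list)
    finalized: List[List[MicroOp]] = field(default_factory=list)

    def _finalize(self) -> None:
        # Close the line being filled and start a new, empty one.
        if self.current:
            self.finalized.append(self.current)
            self.current = []

    def add_instruction(self, uops: List[MicroOp]) -> None:
        # Microoperations of one instruction are added in order; they may
        # spill into the next line if the current line fills up.
        for uop in uops:
            self.current.append(uop)
            # Stop filling on a branch (assumed to stay in this line)
            # or when no empty slots remain.
            if uop.is_branch or len(self.current) == self.slots:
                self._finalize()

    def flush(self) -> None:
        # Finalize a partially filled line at the end of the stream.
        self._finalize()


fu = FillUnit(slots=4)
fu.add_instruction([MicroOp("ld"), MicroOp("add"), MicroOp("st")])
fu.add_instruction([MicroOp("ld2"), MicroOp("add2")])  # split across two lines
fu.add_instruction([MicroOp("cmp"), MicroOp("br", is_branch=True)])
fu.flush()
lines = [[u.text for u in line] for line in fu.finalized]
```

In this sketch the second instruction is split across two lines because the first line runs out of slots, and the second line is closed early by the branch.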

Patt and colleagues also proposed a decoded instruction cache ("node cache") to assist the performance of an HPS version of the DEC VAX [PMHSCW86,WMSHP87]. The VAX is a CISC architecture with variable-length instruction formats; an average VAX instruction generates about four HPS microoperations. The node cache helped sustain a decoding rate of one VAX instruction per cycle, and simulations indicated that the average CPI of 6 observed in contemporaneous VAX implementations could be reduced to 2 in the HPS version. This was remarkable since the VAX instruction set includes many data-dependent operations that cannot easily be cached as fixed sequences of microoperations (e.g., the procedure return instruction restores registers automatically, but which registers it restores depends on a register save mask located in the procedure's stack frame). Across a set of small benchmarks, including daxpy and Dhrystone, between 60% and 100% of the instructions could be stored as microoperations in the node cache. A decoded instruction cache was also part of subsequent work on an extended Motorola 88110 (RISC) design [BYPASS91].

Web links

HPS pages at Texas

old HPS pages at Michigan

Steve Melvin publications

Awards to Yale Patt related to his HPS research

IEEE Emanuel R. Piore Award, 1995
ACM/IEEE Eckert-Mauchly Award, 1996
IEEE Wallace W. McDowell Award, 1999

ACM Fellow
IEEE Fellow

