Design principles

Design principles for high performance

1. make the frequent case fast
   1(a). corollary: make the fast case frequent, e.g., the code generated by the compiler should be for the fast case
   1(b). corollary: make all cases fast
2. target the dominant case (e.g., freq1 < freq2 but time1 >> time2)
   2(a). corollary: Amdahl's Law (a law of diminishing returns)
3. exploit frequent patterns, e.g., locality of reference
4. exploit parallelism
   examples
   - logic parallelism, e.g., carry-lookahead adder
   - word parallelism, e.g., SIMD instructions like Intel's MMX and SSE
   - instruction pipelining, e.g., overlap fetch and execute
   - instruction-level parallelism (superscalar, VLIW, EPIC)
   - thread-level parallelism (multithreading, multicore)
5. time vs. physical replication

   +--+
   |  |   double-pump component, go from 1 to 2 ops / time unit
   +--+

     or

   +--+
   |  |   replicate component, each 1 op / time unit
   +--+
   +--+   goal is to have few (or even better no) conflicts
   |  |   (examples include register file and memory banks)
   +--+

If the next action is unknown, you can
   i) stop and wait
   ii) guess and proceed -- an example is branch prediction (which is often an application of exploiting frequent patterns)
   iii) perform all possible next actions in parallel and late select among them (the general idea is to exploit parallelism)
      examples
      - multiple data forwarding paths selected by a mux for an ALU input
      - overlap instruction decoding with register file read of two registers as well as sign-extending a bit field; thus, the needed operands are always ready for the next stage (two registers for a register-to-register ALU op, or one register and an immediate value for an address calculation or an ALU op with an immediate)

For comparison, some design principles for low power

1. turn off unused parts
   1(a). corollary: generate code so that unused parts won't have to be turned on
2. use power-down modes, stop the clock until the next event
3. dynamic voltage scaling, stretch the clock for program phases not requiring high performance
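The Amdahl's Law corollary under principle 2 can be made concrete with a small numeric sketch. The function name `amdahl_speedup` is my own choice for illustration; the formula itself is the standard one.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is
    improved by a factor s:  S = 1 / ((1 - f) + f / s).
    As s grows, S approaches 1 / (1 - f): diminishing returns."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 90% of the time by 10x yields only about 5.3x overall,
# and even an infinite speedup of that 90% can never exceed 10x.
```

This is why the dominant case (the largest time fraction) is the one worth attacking, even when some other case is more frequent.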
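The word-parallelism example (SIMD byte operations like those in MMX/SSE) can be mimicked in plain code with the SWAR trick: mask off each lane's top bit, add, then fold the top bits back in with XOR so carries never cross lane boundaries. `packed_add8` is a hypothetical name; the effect sketched is that of a SIMD byte-add.

```python
MASK7 = 0x7F7F7F7F  # low 7 bits of each of the four 8-bit lanes

def packed_add8(x, y):
    """Add four independent 8-bit lanes packed into one 32-bit word,
    with no carries between lanes (each lane wraps modulo 256)."""
    low = (x & MASK7) + (y & MASK7)                  # lane-local adds of bits 0..6
    return (low ^ ((x ^ y) & ~MASK7)) & 0xFFFFFFFF  # fold in each lane's top bit
```

One 32-bit ALU pass thus performs four byte additions at once, which is the point of word parallelism.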
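The "guess and proceed" option, with branch prediction as its example, can be sketched as the classic 2-bit saturating-counter predictor: a branch must misbehave twice before the prediction flips, which exploits the frequent pattern that branches usually repeat their recent behavior. The class below is an illustrative model, with names of my choosing.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken,
    states 2,3 predict taken; one surprise does not flip the guess."""
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

On a loop branch taken three times then not taken, it mispredicts only the loop exit, not the re-entry, unlike a 1-bit scheme that mispredicts both.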