Example of instruction scheduling


example loop

loop: ld   F0,0(R1)      // 2-cycle latency, F0 <- memory[ R1 + 0 ]
      addf F4,F2,F0      // 4-cycle execution, F4 <- F2 + F0
      st   F4,0(R1)      // 2-cycle latency, memory[ R1 + 0 ] <- F4
      sub  R1,R1,8       // 1-cycle execution, R1 <- R1 - 8
      bne  R1,R2,loop    // 1-cycle execution, branch to loop if R1 != R2


cycle diagram for simple 5-stage pipeline with forwarding, with an
  integer ALU and a floating-point ALU (both in the same stage; an
  integer instruction executing is shown as 'E' and an FP instruction
  executing is shown as 'X')

        E
F | D | / | M | W
        X

0/LD:  FDEMW                // first iteration
1/AD:   FD-XXXXMW           // addf dependent on ld (executes on FP pipe)
2/ST:    F-D---EMW          // st dependent on addf
3/SU:      F---DEMW
4/BN:          FDEMW        // bne dependent on sub
                F           // delay slot (could move store here)
5/LD:            FDEMW      // second iteration
6/AD:             FD-XXXXMW
7/ST:              F-D---EMW
8/SU:                F---DEMW
9/BN:                    FDEMW
                          F
10/LD:                     FDEMW
11/AD:                      FD-XXXXMW
12/ST:                       F-D---EMW
13/SU:                         F---DEMW
14/BN:                             FDEMW
...                                 ...
       |        |
       1234567890 = 1 store each 10 cycles

F-Fetch; D-decode; E/X-execute; M-memory; W-writeback


cycle diagram for 2-wide in-order superscalar (omit M stage from FP and
  branch pipes)

0/LD:  FDEMW                // first iteration
1/AD:  FD--XXXXW            // addf dependent on ld
2/ST:   F--DE--MW           // st dependent on addf
3/SU:   F--DE---W
4/BN:      F-DE--W          // bne dependent on sub
                            // empty slot
5/LD:         FDEMW         // second iteration
6/AD:         FD--XXXXW
7/ST:          F--DE--MW
8/SU:          F--DE---W
9/BN:             F-DE--W
10/LD:               FDEMW
11/AD:               FD--XXXXW
12/ST:                F--DE--MW
13/SU:                F--DE---W
14/BN:                   F-DE--W
...                       ...
       |     |
       1234567 = 1 store each 7 cycles
                 (could be reduced to 6 cycles with branch prediction)

F-Fetch; D-decode; E/X-execute; M-memory; W-writeback


compare with unrolling on simple pipeline with forwarding

example loop (unrolled twice)

loop: ld   F0,0(R1)
      ld   F6,-8(R1)     // second element
      addf F4,F2,F0
      addf F8,F2,F6
      st   F4,0(R1)
      st   F8,-8(R1)
      sub  R1,R1,16      // two elements per trip
      bne  R1,R2,loop


cycle diagram for simple 5-stage pipeline with forwarding, with an
  integer ALU and a floating-point ALU (both in the same stage; an
  integer instruction executing is shown as 'E' and an FP instruction
  executing is shown as 'X')

        E
F | D | / | M | W
        X

0/LD:  FDEMW
1/LD:   FDEMW
2/AD:    FDXXXXMW
3/AD:     FD---XXXXMW
4/ST:      F---D---EMW
5/ST:          F---DEMW
6/SU:              FDEMW
7/BN:               FDEMW
                     F
8/LD:                 FDEMW
9/LD:                  FDEMW
10/AD:                  FDXXXXMW
11/AD:                   FD---XXXXMW
12/ST:                    F---D---EMW
13/ST:                        F---DEMW
14/SU:                            FDEMW
15/BN:                             FDEMW
                                    F
16/LD:                               FDEMW
17/LD:                                FDEMW
18/AD:                                 FDXXXXMW
19/AD:                                  FD---XXXXMW
20/ST:                                   F---D---EMW
21/ST:                                       F---DEMW
22/SU:                                           FDEMW
23/BN:                                            FDEMW
...                                                ...
       |             |
       123456789012345 = 2 stores each 15 cycles, effective rate 7.5

F-Fetch; D-decode; E/X-execute; M-memory; W-writeback


cycle diagram for simple 5-stage pipeline with pipelined FP execution unit

        E
F | D | / | M | W
        X

0/LD:  FDEMW
1/LD:   FDEMW
2/AD:    FDXXXXMW
3/AD:     FDXXXXMW
4/ST:      FDE---MW
5/ST:       FD---EMW
6/SU:        F---DEMW
7/BN:            FDEMW
                  F
8/LD:              FDEMW
9/LD:               FDEMW
10/AD:               FDXXXXMW
11/AD:                FDXXXXMW
12/ST:                 FDE---MW
13/ST:                  FD---EMW
14/SU:                   F---DEMW
15/BN:                       FDEMW
                              F
16/LD:                         FDEMW
17/LD:                          FDEMW
18/AD:                           FDXXXXMW
19/AD:                            FDXXXXMW
20/ST:                             FDE---MW
21/ST:                              FD---EMW
22/SU:                               F---DEMW
23/BN:                                   FDEMW
...                                       ...
       |          |
       123456789012 = 2 stores each 12 cycles, effective rate 6


summary (cycles per store)

 10    simple pipeline, normal loop
  7.5  simple pipeline, unrolled loop
  7    two-wide, in-order pipeline, normal loop
  6    simple pipeline with pipelined FP execution unit, unrolled loop
  3    two-wide, out-of-order pipeline, normal loop (shown in other notes)
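
The per-store rates in the summary follow from dividing the cycles per loop trip by the stores per trip. A minimal sketch in Python that recomputes them and adds the speedup over the simple pipeline with the normal loop; the names `configs`, `cycles_per_store`, and `summarize` are illustrative and not part of these notes, and the 3-cycle out-of-order figure is taken as given from the summary rather than derived here.

```python
# Cycles per loop trip and stores per loop trip, as read off the cycle diagrams.
# The out-of-order entry (3 cycles) is taken from the summary, not derived here.
configs = [
    ("simple pipeline, normal loop", 10, 1),
    ("simple pipeline, unrolled loop", 15, 2),
    ("two-wide, in-order pipeline, normal loop", 7, 1),
    ("simple pipeline with pipelined FP execution unit, unrolled loop", 12, 2),
    ("two-wide, out-of-order pipeline, normal loop", 3, 1),
]


def cycles_per_store(cycles, stores):
    """Effective rate: cycles spent per store completed."""
    return cycles / stores


def summarize(entries):
    """Return (name, cycles per store, speedup over the first entry) per entry."""
    baseline = cycles_per_store(entries[0][1], entries[0][2])
    rows = []
    for name, cycles, stores in entries:
        rate = cycles_per_store(cycles, stores)
        rows.append((name, rate, baseline / rate))
    return rows


for name, rate, speedup in summarize(configs):
    print(f"{rate:5.1f}  {speedup:5.2f}x  {name}")
```

Unrolling alone helps only modestly on the simple pipeline (10 to 7.5 cycles per store) because the two addf instructions still serialize on the non-pipelined FP unit; pipelining that unit removes the structural hazard and brings the rate to 6.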