SPARC V9 Architecture and UltraSPARC-I

Mark Smotherman
Clemson University

SPARC V9:
  integer registers and virtual addresses extended to 64 bits
  code compatible with V8 (low 32 bits same as V8 result)
  integer register windows more flexible; kernel can allocate one window
    per process or thread (for no-overhead context switch)
  clean registers guaranteed to be zero on first use - provides security
  floating-point registers extended to 64 bits (with lock bits to save
    register save/restore on context switch)
  128-bit quad precision floating point format
  dual set of integer condition codes (one for 32-bit results, one for
    64-bit)
  four sets of floating-point condition codes
  branch on register value (eliminates condition code setting and testing)
  speculative loads
    if(p!=NULL) x = *p;   /* can move load of *p prior to check */
  conditional move
  pointer alias detection: difference of two ptrs is stored in an int
    register and FMOVRZ conditionally moves a flt-pt register based on
    the result; this allows the compiler to move loads up above stores,
    then check for an alias and correct if necessary
  relaxed memory order (RMO) and new barrier inst.
  compare and swap
  additional register set to be used as globals for trap handling
  prefetch data and instructions
  unaligned loads
  either endian

UltraSPARC-I is the first implementation of the 64-bit SPARC V9 architecture
  visual instruction set (VIS) - for multimedia
  block load/store
  additional register sets to be used as globals for TLB miss, interrupts
  performance counters (like Pentium)
  shutdown to reduce power (30W to 20mW)
  44-bit virtual addresses and 41-bit physical addresses supported
  four page sizes: 8K, 64K, 512K, 4M

UltraSPARC-I implementation
  4-way issue to nine fn units
    up to two int per cycle (only one shift or condition code setting inst)
    up to two flt-pt/graphics per cycle
    up to one ld/st per cycle
    up to one branch per cycle
  in-order issue, out-of-order completion
  precise exceptions
    no FPQ but adds three empty stages to integer pipeline
  speculative execution past multiple branches
  single-entry micro-i-tlb, 64-entry fully-associative i-tlb
    one-cycle micro-i-tlb miss penalty
    can lock entries in i-tlb
  16 KB pseudo-2-way set associative i-cache (w/ snooping)
    set prediction, 2-cycle penalty for set mispredict
    16 byte sector with extra fields
      two 2-bit predictors (one / 2 words)
      one branch target address
    one branch predicted and followed per cycle
  64-entry fully-associative d-tlb
    can lock entries in d-tlb
  16 KB direct-mapped, nonblocking d-cache
    32 byte line size with 16 byte sectors
    virtually indexed and physically tagged
    write through, no write allocate
    1-cycle load-use penalty
    6-cycle miss penalty
  9-entry load buffer, 8-entry store buffer
    load bypass
    write merging of last two store buffer entries if in same sector
  128-bit split transaction bus
  on-chip L2 controller (up to 4MB)
    64 byte line
    snooping (MOESI) and directory-based cache coherency support
  8-window integer reg file with 7 read ports and 3 write ports
    functional unit performance
      2 bits/clock int multiplication with early-out, max latency of 35
      1 bit/clock int division, max latency of 67
  flt-pt divide and sqrt have separate fn units
  flt-pt reg file with 5 read ports and 3 write ports
    3 cycle latency for flt-pt add, subtract, multiply
    1 cycle latency for flt-pt compare
    22 cycle latency for flt-pt divide

pipeline
  stage 1: fetch
  stage 2: decode  ..12-entry ibuf...         tolerates i-cache misses
  stage 3: group >---------------\            try to issue 4 oldest entries
  stage 4: exec   exec  exec  reg  reg        flt-pt/graphics => register access
  stage 5: cache  n0    n0    x1   x1         branches resolved
  stage 6: ld_ms  n1    n1    x2   x2         load misses put in load buffer
  stage 7: n2     n2    n2    x3   x3
  stage 8: n3 <------------------/            resolve exceptions
  stage 9: writeback
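
The per-cycle issue limits listed under "4-way issue" and the group stage ("try to issue 4 oldest entries") can be sketched as a small check in C. This is only an illustration under stated assumptions: the enum names and group_size are hypothetical, the "only one shift or condition code setting inst" note is read here as at most one such instruction per group, and data dependences and any other grouping rules of the real chip are ignored.

```c
#include <stddef.h>

/* Hypothetical instruction classes, named only for this sketch. */
enum insn_class {
    INSN_INT,        /* plain integer ALU op */
    INSN_INT_SHIFT,  /* integer shift */
    INSN_INT_CC,     /* integer op that sets condition codes */
    INSN_FP_GFX,     /* floating-point or graphics op */
    INSN_LDST,       /* load or store */
    INSN_BRANCH      /* branch */
};

/*
 * Return how many of the oldest entries (at most 4) could issue together
 * under the per-cycle limits in the notes. Issue is in order, so the
 * group ends at the first instruction that would exceed a limit.
 * Assumption: shifts and cc-setting ops share a single slot per group.
 */
size_t group_size(const enum insn_class *buf, size_t n)
{
    size_t ints = 0, special = 0, fps = 0, ldsts = 0, branches = 0;
    size_t i;

    if (n > 4)
        n = 4;                       /* 4-way issue */
    for (i = 0; i < n; i++) {
        switch (buf[i]) {
        case INSN_INT_SHIFT:
        case INSN_INT_CC:
            if (special == 1 || ints == 2)
                return i;
            special++;
            ints++;                  /* also counts as an int op */
            break;
        case INSN_INT:
            if (ints == 2)
                return i;
            ints++;
            break;
        case INSN_FP_GFX:
            if (fps == 2)
                return i;
            fps++;
            break;
        case INSN_LDST:
            if (ldsts == 1)
                return i;
            ldsts++;
            break;
        case INSN_BRANCH:
            if (branches == 1)
                return i;
            branches++;
            break;
        }
    }
    return n;
}
```

Under these assumptions, a buffer holding int, shift, ld/st, branch would issue as one group of four, while two back-to-back loads would split after the first.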
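
The pointer alias detection item in the SPARC V9 list (move a load above a store, then check and correct) can be illustrated at the source level. The C below is a sketch of the transformation a compiler might perform, not actual V9 code: the function names are hypothetical, both pointers are assumed to be valid, aligned double pointers (so aliasing means equal addresses), and the conditional assignment stands in for the FMOVRZ on the pointer difference.

```c
#include <stdint.h>

/* Original order: store through q, then load through p. */
double store_then_load(double *q, const double *p, double v)
{
    *q = v;
    return *p;
}

/*
 * Hoisted order: the load is moved above the store, then corrected
 * if the two pointers turn out to alias.
 */
double load_hoisted(double *q, const double *p, double v)
{
    double x = *p;                                  /* load moved up */
    uintptr_t diff = (uintptr_t)p - (uintptr_t)q;   /* ptr difference in an int */

    *q = v;                                         /* the store */
    if (diff == 0)                                  /* alias: take the stored value */
        x = v;
    return x;
}
```

Both functions return the same value for aliased and non-aliased pointers; the hoisted form just starts the load earlier, which is the point of the technique.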