Prediction and Profile Buffers
Last updated: May 2004
Summary: ... tbd ...
... intro to be written ...
... explain purpose
... importance of dynamic branch prediction
... but beyond that high perf. impls. will record and exploit other dynamic
behavior of programs
... look at tables and caches that assist inst. fetch and decoding ...
... this survey not oriented to covering all the specific BP algorithms ...
(so find cite to paper or web page that is)
... note difference between
table structures (mere lookup) versus caches (hit/miss via tagging) ...
offer rationale ...
An extensive design tree was posted by Andy Glew in comp.arch in July 1998
and reprinted in CAN Internet Nuggets, Vol. 26, No. 5, December 1998, p. 34.
From: "Andy Glew"
Subject: Re: Branch Target Cache (&& AMD 29K)
Date: 15 Jul 1998 23:02:42 GMT
Organization: Intel & University of Wisconsin
At the risk of reducing the number of future patent plaques on my wall,
let me enumerate some possible combinations of the above that I
consider "obvious to someone skilled in the art".
Index structure by:
branch target address
branch from address
path (sequence of branch from/to addresses)
branch history (taken/not-taken)
procedure call address
procedure call point
branch address (or procedure address) in combination with certain parameter values
basic block number (i.e. transformed branch target address)
address of data referencing instruction (e.g. load at address A accesses B)
memory addressing mode
memory addressing mode and parameters
register version numbers or timestamps
any hash or checksum of the above
e.g bitfield extracts, shifts, XORs, etc.
linear offsets of any of the above
denominated in a variety of units such as time, etc.
e.g. branch from address - 2 instructions
transitive closures of the above
denominated in a variety of units such as time, same unit, etc.
e.g. the branch from address 2 branches before the instruction
What is cached
branch target address
instructions at branch target
sequence of instructions at branch target
possibly beyond multiple branches, e.g. a trace
optimized sequence of instructions at branch target
where optimizations may include
a) pre-renaming (relative to trace)
b) rearranging instructions for schedule
c) eliminating dead code
d) common sub expressions detected by hardware
e) predication to eliminate some branches
type information (e.g. this value is an integer; this value is a pointer)
skip list information
future branch history
future branch path
past branch history
past branch path
schedules indicating expected order of fetch or execution of instructions
cache miss addresses
absolute, or relative to given data values
compressed versions of the above
combinations of the above - e.g. return both a trace of instructions,
and a expected cache miss addresses
Branch Prediction Terminology
IBM mainframe systems and patents use the terms
So here are my terms.
- "BTB" for what I would call an instruction prefetch buffer
- "BHT" for what I would call a BTAC
- "DHT" for what I would call a BHT
branch branch target target
address history address insts.
------- ------- ------- -------
BHT yes branch history table
BPC tags yes yes yes branch prediction cache
BTAC tags yes branch target address cache
BTB tags yes yes branch target buffer
BTIC tags yes branch target instruction cache
(My use of BTB and BPC is more a concession to historical usage
rather than seeing them as completely descriptive terms.)
For another discussion of terminology, see
sections 8.4.4 and 8.4.5 of D. Sima, T. Fountain, and P. Kacuk,
Advanced Computer Architecture: A Design Space Approach, Addison-Wesley,
1997 and chapter 4 of A. Omondi, The Microarchitecture of Pipelined and
Superscalar Computers, Kluwer, 1999.
Note also that branch instruction identification can be accelerated
without using a prediction buffer/cache by partial decoding of instruction
fetch buffers ("look-ahead branch detection" section 8.4.2 of Sima, et al.).
Actual Designs and Implementations
... to be done / very incomplete ...
... many descriptions of current processors ...
... aim this list at identifying earliest examples in each category ...
- branch direction
- history held in icache
S-1 (1978) - two-bit predictor held in icache for each
- prediction bit ("jump bit") and dynamic reverse bit
("wrong bit"), which was set whenever the branch
was mispredicted; two mispredicts in a row caused
the jump bit to toggle
- A.J. Smith cites an unpublished memo by L.C. Widdoes, Jr., at
Stanford in February 1977 that outlines the scheme
- this scheme has a state diagram just like the two-bit predictor
given in Hennessy and Patterson but with the strongly-taken
state at the top left having the bit pattern of 10 (rather
than 11) and weakly-taken having 11 (rather than 10)
- Jim Smith (1981) - mentions this in
"A study of branch prediction strategies,"
Proc. ISCA, 1981, pp. 135-148.
- Mike Johnson's scheme and AMD K5 (see US Patent 5,136,697)
- Alpha 21064
- to do: classify patents
- US 3,551,895 G.C. Driscoll, Jr., "Look-Ahead Branch
Detection System," granted to IBM in 1970
- Jim Smith, US Patent 4,370,711 (filed 1980, granted to CDC 1983)
- Smith recounts his work in 1979-1980 on a 64-entry,
2-bit predictor BHT for the CDC Cyber 180/900 in his
retrospective on "A study of branch prediction strategies"
in 25 Years of ISCA (Sohi, ed.);
the 180/900 performed speculative execution by having the predicted
branch reserve all the result bus slots until it was resolved
- Losq, et al., US Patent 4,477,872 - called "DHT" (filed 1982, granted
to IBM 1984)
- ... and others ...
- branch target address capture
ACS (late 1960s) - prefetch sequence control registers
- Ed Sussenguth, US Patent 3,559,183 (filed 1968, granted to IBM 1971)
- MU5 (late 1960s / early 1970s) - jump trace buffer
- eight entries (jump-from,jump-to)
- branch is predicted taken if it is stored in the buffer,
no correction on mispredict, FIFO replacement,
virtually addressed so flush on process switch
- simulated a BTIC in L.A. Taylor, "Instruction Accessing in
High Speed Computers," M.Sc. Thesis, Univ. of Manchester, 1969,
but ruled it out for the MU5 due to hardware cost.
- R.N. Ibbett, "The MU5 instruction pipeline," The Computer
Journal, vol. 15, February 1972, pp. 42-50.
- R.W. Holgate and Roland N. Ibbett, "An Analysis of
Instruction-Fetching Strategies in Pipelined Computers," IEEE
Transactions on Computers, 29(4), 1980, pp. 325-329.
- see also Morris and Ibbett, pp. 59-60.
- decoded instruction cache
- M68060 (1995) - zero-cycle taken branch
- target info stored using address of instruction prior to
the predict-taken branch
- entry includes the condition code setting needed for execution
unit to check the accuracy of the taken prediction and to start
misprediction recovery if necessary
- ... and others ...
decoded instruction caches)
- branch target fetch
- loop-catching instruction buffers
- CDC 6600 (1964)
- 8-word instruction stack
- D (depth) and L
- catches jump within stack
- CDC 7600 (1968)
- 12-word instruction stack
- keep address with each word in stack (thus loop can have
- S/360 Model 91 (1967)
- 8-doubleword instruction stack
- backward branch to a target address within 8 doublewords of the
branch (called a "short loop") causes entry into loop mode;
decoder tags taken-path instructions as conditional and sends
them to wait at the execution units for branch resolution
- otherwise decoder predicts untaken and tags the fall-through
instructions as conditional and sends them to wait at the
execution units for branch resolution
- "BTB" - instruction buffer for use in prefetching at target
address in case a predict-untaken branch is taken
- method was described in 1964 by T.C. Chen at AFIPS
- AMD 29K (1987) - "BTC"
- 32 entries of four instructions each, two-way set associative
- Phil Frieden, US 4,933,837 (filed 1986, granted to AMD 1990)
- Baror, et al., US 4,926,323 (filed 1988, granted to AMD 1990)
- Motorola 88110 (1991) - "TIC"
- 32 entries of two instructions each, fully associative
- US 3,940,741 Horikoshi, et al., granted to Hitachi in February 1976;
describes "route index memory" (BTAC) and "route buffer memory" (BTIC)
- instruction prefetch
- dynamic code optimization
- dynamic inlining, re-layout of hot spots
- profile buffer, Conte et al., Micro-29
- branch behavior buffer,
Merten et al., ISCA '99 / ISCA ' 00 / IEEE TC June 2001
- related: Dynamo, rePLay, DELI, DISE, etc.
- data prefetch
- ... tbd ...
- J. Pomerene, T. Puzak, R. Rechtschaffen, and F. Sparacio.
Prefetching System for a Cache Having a Second Directory
for Sequentially Accessed Blocks.
U.S. Patent 4,807,110, February 1989.
- RPT - Tien-Fu Chen and Jean-Loup Baer, "Effective Hardware-Based
Data Prefetching for High-Performance Processors," IEEE Transactions
on Computers, Vol. 44, No. 5, pp. 609-623, May 1995.
- IRB - Sharad Mehrotra, "Data Prefetch Mechanisms For Accelerating
Symbolic And Numeric Computation," Illinois, 1996.
- Amir Roth, Andreas Moshovos, and Gurindar S. Sohi, "Dependence
Based Prefetching for Linked Data Structures," ASPLOS, October 1998.
- Surendra Byna, Yong Chen, and Xian-He Sun, "A Taxonomy of Data
Prefetching Mechanisms," Proc. Intl. Symp. on Parallel Architectures,
Algorithms, and Networks (ISPAN), 2008, pp. 19-24.
- value prediction
- ... tbd ...
- Mikko H. Lipasti, Christopher B. Wilkerson, and John Paul Shen,
"Value Locality and Load Value Prediction", ASPLOS, 1996.
- Mikko H. Lipasti and John Paul Shen,
"Exceeding the Dataflow Limit via Value Prediction",
- instruction reuse
- ... tbd ...
- value cache - S.P. Harbison, "An Architectural Alternative to
Optimizing Compilers," ASPLOS, 1982.
- result cache - S.E. Richardson, "Caching Function Results:
Faster Arithmetic by Avoiding Unnecessary Computation,"
Tech. Rept., Sun Microsystems Labs, 1992.
- Avinash Sodani and Gurindar S. Sohi,
"Dynamic Instruction Reuse", ISCA, 1997.
- value reuse table - J. Yi, R. Sendag, and D. Lilja,
"Increasing Instruction-Level Parallelism with Instruction
Precomputation", Euro-Par, 2002.
- memo tables - Daniel Citron
- data flow manipulation - Stephan Jourdan
(US Patents - for branching see subclasses 238 and 240 under class 712)
[Clemson Univ. homepage]