Eager Execution / Dual Path / Multiple Path

Mark Smotherman
Last updated: July 2010

Summary: Eager execution deals with the uncertain nature of branches by applying the design principle of "late select" to the paths in a program. In their 1972 paper, Riseman and Foster demonstrated an impressive speedup was available from this approach. However, it has been too expensive to actually implement eager execution, so other branch-handling alternatives have been employed in commercial processors over the years, including prefetching and branch prediction with speculative execution. Research into eager execution has continued to refine certain aspects of the idea, including work like Uht's Disjoint Eager Execution as well as speculative multithreading.

.. under construction ..

First, Dispelling A Misconception: Dual Path Mainframes

Several academic papers and research designs have proposed muliple-path (eager) execution, and you may come across statements such as these:

The IBM 370/165 employed a branching strategy that fetched two paths, the inline path and the target path [cite to Flores]. Two buffers were maintained, the main instruction buffer (MIB) and the auxiliary instruction buffer (AIB). When the effective address of the target instruction stream is known at stage n', the target stream is fetched into the AIB. Both streams are processed until the outcome is known.

-- page 77 (emphasis added), Harvey Cragon, Branch Strategy Taxonomy and Performance Models. Los Alamitos, CA: IEEE Computer Society Press, 1992.


Multiple Instruction Streams: To avoid the penalty of a wrong branch prediction, both paths of a branch can be fetched and executed, and the results of the incorrect path discarded once the branch is resolved. This method requires duplication of hardware, early calculation of the branch target address, high memory and register file bandwidth, and a method for dealing with the occurrence of still another branch before a previous one has been resolved. This technique has been used in mainframes, including the IBM 370/168 and the IBM 3033 [cites to IBM manuals].

-- page 397 (emphasis added), C.H. Perleberg and A.J. Smith, "Branch Target Buffer Design and Optimization," IEEE Transactions on Computers, vol. 42, no. 4, April 1993, pp. 396-412.

As in the passages above, I've seen it alleged that some mainframe processors in the 1960s and 1970s executed down both paths beyond a branch; but, as far as I'm aware, multiple-path execution has never been done by a commercial processor.

What you will find in commercial processors is a combination of these techniques:

Thus, I believe that statements regarding multipath execution in commercial processors are misunderstandings. For example, I believe that that the emphasized sentence from Cragon above misstates his cited reference regarding the actions of the model 165. I discuss the model 165 design and the original passage from Flores in a section below.

Likewise, I believe that the passage from Perleberg and Smith above resulted from a faulty summarization of a section from an earlier paper by Smith and another student on possible solutions to the branch problem. This summarization left out a crucial qualifier present in the earlier paper. In particular, in the 1984 IEEE Computer paper by Johnny Lee and Alan Jay Smith, in a subsection entitled "Multiple instruction streams" within a larger section entitled "Existing approaches to the branch problem", they discuss IBM mainframes.

Despite these problems, a numbers of machines follow multiple instruction streams, including the IBM 370/168 [cite to IBM manual], which can fetch one alternative instruction path and the IBM 3033 [cite to IBM manual], which can pursue two alternative instruction streams.
These machines do not decode the alternative instruction paths.

-- page 7 (emphasis added), J.K.F. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, vol. 17, no. 1, January 1984, pp 6-21.

Perleberg and Smith use similar language regarding the multiple instruction stream approach, but they do not add the emphasized qualifing statement from the earlier Lee and Smith paper when citing the IBM mainframes as examples. (The description of multipath methods was tangential to the Perleberg and Smith paper.)

In the sections below, I first review IBM mainframe computers and their approach to branching and then look at research proposals regarding eager execution.

IBM Stretch (1961)

The IBM Stretch (later marketed as the 7030) was a remarkable high-performance design. Instructions in Stretch flowed through two processing elements:

The indexing and instruction unit of Stretch directly executed instructions related to the index registers and "prepared" the arithmetic instructions by calculating effective addresses and starting any memory operand loads. Unconditional branches and conditional branches that depended on the index registers were fully executed in the indexing and instruction unit.

Conditional branches that depended on the arithmetic registers were predicted untaken (this is the case in Stretch and the first IBM 7030), and the untaken path was speculatively executed. Misprediction recovery was accomplished through the Stretch's four-entry "lookahead unit", which functioned as a history buffer for the index register values.

However, a taken branch-on-bit on Stretch required almost 10 times longer to execute than a floating-point multiply (and almost 15 times longer than a floating-point add). In the subsequent machines built, non-index branches were predicted taken or untaken based on the currently-available indicator status (with the assumption being that the indicator would be unlikely to change).

[Nine Stretch/7030s were built.]


Conjecture: because branch misprediction recovery time was found to be one of the culprits that reduced Stretch performance, I think speculative execution was ruled out in subsequent IBM mainframe designs, up until the 3090.

In 1972, Jim Pomerene (designer of Harvest) wrote about Stretch:

Unfortunately the critical importance of the branch instruction was not fully recognized. At a branch the program may take one of two paths and if the lookahead had gone down the wrong path considerable unwinding was necessary. The problem proved to be quite fundamental and had a strong effect on high performance machine organization.

[ "Historical perspectives on computers: components," AFIPS FJCC, 1972, pp. 977-983.]

IBM S/360 Model 75 (1966)

The Model 75 was the intended high-performance member of the New Processor Line (NPL) and was originally named the 501. It was announced as the S/360 Model 70 in April 1964 and then renamed as the 75 in April 1965. It had an I-unit and a separate E-unit for instruction overlap. The I-unit had two doubleword instruction prefetch registers and prefetched operands into a doubleword operand prefetch register for the E-unit. The I-unit also executed the I/O instructions.

Buffers and functional units for the Model 75 include:

IBM S/360 Model 91 (1967)

The Model 91 developed from the high-performance Project X. It became a part of the NPL in September 1963 (after the formal announcement of the CDC 6600), and was known then as the 604. A footnote in the April 1964 S/360 announcement mentioned a Model 90 that was under development, and the Model 92 was formally announced in August. The model name changed to the 91 in November 1964.

The Model 91 processor featured a deep pipeline and out-of-order execution within the floating-point unit using Tomasulo's algorithm (this was helpful since there were only four architected FP registers). Like the Stretch and Model 75, separate I and E units were used. Data prefetch was performed but there was no instruction preexecution.

There is an 8-doubleword instruction stack that provides instruction prefetch. A backward branch to a target address within 8 doublewords of the branch is called a "short loop" and causes the processor to enter into loop mode. In loop mode, the decoder tags the taken-path instructions as conditional and sends them to wait at the execution units until the branch is resolved. In normal mode, the decoder tags the untaken-path instructions as conditional and sends them to wait at the execution units until the branch is resolved.

There is one sentence in D. Anderson, F. Sparacio, and R. Tomasulo, "The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling," (see Bell & Newell) which may at first glance suggest dual-path (or eager) execution:

The detrimental performance effect which stems from short loops led to a dual branch philosophy.

-- page 16, bottom of column 2 as originally published

However, the authors mean their philosophy has two aspects, specifically the normal mode and loop mode of branch handling. The paper continues:

The first aspect deals with branches which are either forward into the instruction stream, beyond the prefetched instructions, or if backward from the branch instruction, greater than eight double-words back. In these situations, the branch storage-delay is unavoidable. As a hedge against such a branch being taken, the branch sequencing (Fig. 12) initiates fetches for the first two double words down the target path. Two branch buffers are provided (Fig. 10-the instruction supply data flow) to receive these words, in order that the instruction buffer array will be unaffected if the result is a no-branch decision.
The second aspect of the branch philosophy treats the case for which the target is backward within eight double words of the branch instruction. A separation of eight double words or less defines a "short" loop-this number being chosen as a hardware/ performance compromise. Part of the housekeeping required in the branch sequencing is a "back eight" test. If this test is satisfied the instruction unit enters what is termed "loop mode." Two beneficial results derive from loop mode. First, the complete loop is fetched into the instruction buffer array, after which instruction fetching ceases. Consequently, the address port to storage is totally available for operand fetching and a one instruction per cycle issue rate is possible. The second advantage gained by loop mode is a reduction by a factor of two to three in the time required to sequence the loop- establishing branch instruction. (For example, the "branch on index" instruction normally requires eight cycles for a successful branch, while in loop mode three cycles are sufficient.) In many significant programs it is estimated that the CPU will be in loop mode up to 30% of the time.

-- pages 16-17, as originally published

Buffers and functional units for the Model 91 include:

[US 3,553,655, David Anderson and Frank Sparacio, "Short forward conditional skip hardware" describes a modification to the Model 91 conditional operation mode in which instructions from the target address onward for a short forward branch can be marked as unconditional and thus avoids repeating the instruction fetch at the target address when the branch is resolved as taken.]

The Model 195 adds a cache, a decimal functional unit in the fixed-point unit, and a extended-precision functional unit (with one reservation station) in the floating-point unit. [Does US 3,588,829 relate to adding the cache to the 195?]

[Based on work in multithreaded processors by Mike Flynn and Al Podvin, John Earle led the design of a two-threaded 195, known as the 195-II. Due to conflicts between out-of-order execution and dynamic address translation, the 195-II effort did not result in a product. See US 3,728,692 and US 3,771,138.]

[15 Model 91's and 2 Model 95's were built. Approximately twenty Model 195's were built.]

IBM S/360 Model 85 (1968) and S/370 Models 165/168 (1971/1973)

Following the same approach as the Models 75 and 91, separate I and E units with data prefetch but without instruction preexecution were used in the S/360 Model 85 and the S/370 Model 165 (and later the Model 168). Note, however, that the Tomasulo out-of-order approach was not used. Also, instructions after an unresolved branch are stalled in a decoded instruction queue rather than at the execution units. This can be traced back to the amount of circuitry required in the Model 91:

In retrospect, the conditional philosophy and its effect on loop mode, although significant to the performance of the CPU and conceptually simple, were found to require numerous interlocks throughout the CPU. The complications of conditional mode, coupled with the fact that it is primarily aimed at circumventing storage access delays, indicate that a careful re-examination of its usefulness will be called for as the access time decreases.

-- pages 18-19, Anderson, et al.

Conti, Gibson, and Pitkowsky describe the Model 85 in the following way:

The CPU contains an instruction unit (I unit) and an execution unit (E Unit) that are spoken of together as the processor. Since the I Unit is capable of a one cycle rate of decoding and issuance of instructions to the E Unit, an instruction can be processed by the I Unit each 80 nanoseconds. An instruction buffer in the I Unit can queue up to three instructions ahead of current execution. For branch instructions, as many as 16 bytes of each leg of the branch may be prefetched. This helps to minimize the lost time due to branches that depend, during some interim, upon incomplete executions. Although several instructions may be in process at a given instant of time, strict instruction sequence is preserved: instruction N is never completed before instruction N - 1. Thus, the capability for precise interrupts is preserved as in the Models 65 and 75. Once the I Unit has decoded an instruction and fetched the required operand from main storage (if such is necessary), the instruction is passed on to the execution unit as a pseudo register-to-register (RR) instruction, much as in teh Model 91. [cite to Model 91 paper] Two operands from storage may be buffered in this manner.

-- pages 5-6 (emphasis in the original), C.J. Conti, D.H. Gibson, and S.H. Pitkowsky, "Structural Aspects of the System/360 Model 85: I General Organization," IBM Systems Journal, vol. 7, no. 1, 1968, pp. 2-14.

The Functional Characteristics manual for the Model 85 states:

Three queue registers are provided in the instruction unit. When decoded, an instruction is buffered in a queue register until it is transferred to the execution unit.
For branches that are estimated to be unsuccessful, the target instruction and successive instructions in that stream are put into the auxiliary instruction buffer. Instructions from the current stream remain in the main instruction buffer, from which instructions continue to be processed, and additional requests for instructions are made from the current stream as required.

For branches that are estimated to be successful, the contents of the buffers for the current and target streams are switched. When the branch is executed, if the estimate is correct, the only action required is to stop fetching instructions into the auxiliary instruction buffer, which contains instructions from the current stream. However, if the estimate is incorrect, the contents of the buffers must be switched again.

-- pages 10-11, A22-6916-1, IBM System/360 Model 85 Functional Characteristics, Second Edition, June 1968.

The Functional Characteristics manual for the Model 165 has almost identical language. (The order of the first two phrases in the second sentence of the last paragraph quoted above is reversed.) It also contains the following diagram.

S/360 M165 branching logic

-- page 12, GA22-6935-0, IBM System/370 Model 165 Functional Characteristics, Fifth Edition, June 1970.

Thus, I believe Harvey Cragon is mistaken in his description of the Model 165. Moreover, in the article by Ivan Flores that is cited by Cragon above, Flores states in discussing branching:

If the lookahead unit anticipates instructions only through the [sequential] instruction stream, then, when a true branch occurs, all the instructions that it has preprocessed are useless and must be discarded.
To improve matters, more of a hedge is possible -- even though we preprocess only one set of instructions, it is still possible to prefetch instructions in the other leg of the branch. Then, should our guess prove wrong, memory cycles are saved, since other instructions are already available to preprocess. To make possible this hedge, the Model 165 includes two instruction buffers, instead of just one, called the main instruction buffer (IBM) and the auxiliary instruction buffer (IBA). When the guess is to accept the branch, instructions for the new target address are put into the IBM while instructions from the other (sequential) leg are brought to the IBA.

-- pages 32-33 (emphasis added), Ivan Flores, "Lookahead Control in the IBM System 370 Model 165," IEEE Computer, vol. 7, no. 11, November 1974, pp. 24-38.

Connors, Florkowski, and Patton describe the S/370 Model 168 and the 3033 in a May 1979 Datamation article. (Even though publishing in the popular industry magazine Datamation, they are IBM Poughkeepsie employees; so, their description is first-hand.) They clearly describe one execution path in the Model 168:

If the decoded instruction is a branch, the IPPF then begins fetching instructions along the branch path into the inactive instruction buffer, and for a conditional branch designates (by means of a guess) which of the instruction buffers is now active and which is inactive. Thus both buffers are kept filled with instructions, but only the active one continues to decode.

-- page 199 (emphasis added) W.D. Connors, J. Florkowski and S.K. Patton. "The IBM 3033: An Inside Look," Datamation, vol. 25., no. 5, May 1979, pp. 198-218.

Buffers and functional units for the Model 85 and Model 165 include:

The Model 168 added virtual address translation, a fourth instruction queue entry, and a second result register.

[About thirty Model 85's were built; they apparently had a reputation for poor reliability/availability.]

IBM 3033 (1978)

Connors, Florkowski, and Patton note that the 3033 adds a third instruction fetch buffer to handle the case when a second branch is seen before the current one is resolved. Again, only one is actively decoded. The instruction buffer size was doubled so that all three buffers were 32 bytes. The number of operand buffers was tripled (to six), and the number of result registers was doubled (to four). Instruction decoding performance was improved with a reduction from two cycles to one cycle.

[See US Patent 4,200,927, Hughes, et al., "Multi-instruction stream branch processing mechanism," filed 1978, granted 1980. Also, US Patent 4,189,768, Liptay and Rymarczyk, "Operand fetch control improvement," filed 1978, granted 1980.]

Later IBM processors

The 3081 (1981) was IBM's first LSI-based mainframe design, and design decisions were made in ways that were felt would reduce design errors. Consequently, a decision was made to pursue a sequential instruction-execution design and avoid instruction pipelining and overlap. (See Gustafson and Sparacio, IBM JRD, v26:1, 1982.)

The 3090 (1985) began as an attempt to remap the 3033 into LSI technology. Several improvements to the 3033 design were made based on workload studies. Although unintentional, the resulting 3090 operated in a manner somewhat reminiscent of Stretch, with prefetch of data operands and pre-execution of loop-closing and load-address instructions by the I element. A decode history table is used for branch-on-condition prediction when there is an unresolved condition-code-setting instruction in the instruction queue; otherwise the I element uses the current condition code to resolve the branch. (See Tucker, IBM SJ, v25:1, 1986.)

The z900 (2000) also pre-executed branches and loads in the I-unit. (See Schwarz, et al., IBM JRD v46:4/5, 2002.)

Sidebar: IBM high-performance mainframes

hardwired processors

microcoded processors

CMOS generations

zSeries (64-bit addressing)

pre-execution inst. prefetch paths branch prediction speculative execution
Stretch indexing insts. (including index branches) one non-index conditional branches predicted untaken yes, recovery using lookahead rollback
91 two (instruction stack and "BTB" prefetch buffers) predicted insts. stall at execution units no
85/165/168 two stall in IQ no
3033 three stall in IQ no
3090 LA, BCT, BXH, BXLE three BHT (called "DHT") yes, flush to recover
9000 one 4K-entry BTAC (called "BHT") yes, insts. tagged with two conditional bits
z900 LA, LD, index and some linking branches one 8K-entry BTB yes, flush to recover
z990 LA five 8K-entry BTB yes, flush to recover

Eager execution designs

... these designs go beyond normal speculative execution down a predicted branch path ...

... what is earliest reference? e.g., Stone in 1970 suggests following both paths (cloning a virtual stack machine) and stopping only on a store ...

... discuss limit studies involving eager execution: Riseman and Foster; Lam and Wilson; etc. ...

Gus Uht has recently written up a survey of multipath execution: A.K. Uht, "Multipath Execution," chapter 6 in D. Kaeli and P.-C. Yew (Eds.), Speculative Execution in High Performance Computer Architectures, CRC Press, August 2004. Table 6.1 gives an overview comparison of the following schemes, and he provides further description of each scheme.

Other schemes I hope to describe here include:


[AHU 98] P. Ahuja, K. Skadron, M. Martonosi, and D. Clark, "Multi-path execution: Opportunities and limits," ICS, July 1998.

[CHE 01] C.-Y. Cher and T.N. Vijaykumar, "Skipper: A microarchitecture for exploiting control-flow independence," Micro-34, Dec. 2001, pp. 4-15.

[FLO 74] I. Flores, "Lookahead Control in the IBM System 370 Model 165," IEEE Computer, November 1974, pp. 24-38.

[FUK 91] A. Fukuda, K. Murakami, S. Tomita, "Toward Advanced Parallel Processing: Exploiting Parallelism at Task and Instruction Levels, IEEE Micro, vol.11, no.4, pp.16-19 and 50-61, Aug. 1991.

[HEI 96] T. Heil and J. Smith, "Selective dual path execution," technical report, University of Wisconsin at Madison, November 1996.

[KLA 98a] A. Klauser, T. Austin, D. Grunwald, and B. Calder, "Dynamic hammock predication for non-predicated instruction set architectures," PACT, 1998, pp. 278-285.

[KLA 98b] A. Klauser, A. Paithankar, and D. Grunwald, "Selective eager execution on the PolyPath architecture," ISCA, June 1998, pp. 250-259.

[KUG 91] M. Kuga, K. Murakami, S. Tomita,"DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar): Yet Another Superscalar Processor Architecture," ACM-SIGARCH Computer Architecture News, vol.19, no.4, pp.14-29, June 1991.

[LAM 92] M. Lam and R. Wilson, "Limits of control flow on parallelism," ISCA, May 1992, pp. 46-57.

[MUR 89] K. Murakami, N. Irie, M. Kuga, and S. Tomita, "SIMP: A novel high-speed single-processor architecture," ISCA, 1989.

[PIE 94] J. Pierce, and T. Mudge, "Wrong-path instruction prefetching," technical report CSE-222-94, University of Michigan at Ann Arbor, 1994.

[ROT 99a] E. Rotenberg, Q. Jacobson, and J. Smith, "A study of control i independence in superscalar processors," HPCA-5, 1999.

[ROT 99b] E. Rotenberg and J. Smith, "Control independence in trace processors," Micro-32, 1999.

[TYS 97] G. Tyson, K. Lick and M. Farrens, "Limited dual path execution," technical report CSE-TR-346-97, University of Michigan at Ann Arbor, 1997.

[UHT 95] A. Uht, V. Sindagi, and K. Hall, "Disjoint eager execution: An optimal form of speculative execution," Micro-28, Dec. 1995, pp. 313-325.

[WAL 98] S. Wallace, B. Calder, and D. Tullsen, "Threaded multiple path execution," ISCA, June 1998, pp. 238-249.


My thanks to Julian Thomas and Stuart Tucker for their help with information on IBM mainframes.

[History page] [Mark's homepage] [CPSC homepage] [Clemson Univ. homepage]