Last updated: May 2021
Summary: ILP dates backs to the 1940s, and various attempts have
been made to exploit it over the years.
... under construction ...
... intro to be written ...
dependency fn. unit time to start
checking assignment execution
---------- ---------- -------------
superscalar hardware hardware hardware
EPIC software . hardware hardware
dynamic VLIW software software . hardware
VLIW software software software
| v | | v |
| dependency checking | | dependency checking |
| O(n^2) | | | |
| | | EPIC | | |
| |`----------------------------------. |
| v | | v |
| fn. unit assignment | | fn. unit assignment |
| | | dynamic | | |
| | | VLIW | | |
| |`----------------------------------. |
| v | | v |
| time to start execution | | time to start execution |
| | | | | |
| ` | VLIW | | |
| `----------------------------------. |
| | | | |
compiler scheduling | hardware control
hardware fn. units
B. Rau and J. Fisher, "Instruction-Level Parallel Processing:
History, Overview and Perspective," HP Tech Rept, 1992 (pdf).
A partial list of machines and designs exploiting ILP
Zuse Z3 (1941, patent filed in 1949) - execution pipeline with
overlap of instructions
- Harvard Mark I (1944) - overlap of operations - a long-running operation
such as a multiply can be started and other operations such as adds or
prints can run as concurrent, "interposed" operations before the
multiply finishes and delivers the product
- Zuse Z4
(1945, rebuilt 1950) - two-instruction lookahead, used to start
- Bell Labs Model V (1946) - two processors with a master control that
distributed an instruction to whichever processor was available
(called "superbranching" by Stibitz, but Paul Ceruzzi compares
it to resource management within an operating system)
- Turing's ACE proposal (1946) - complicated instruction format with
several timing-dependent features
(The Pilot ACE is one of the first
transport-triggered architectures in which a store into a certain
memory location will cause an operation to start. Some early
calculators had some of this flavor, e.g., an ASCC move from one
counter into another nonempty counter caused an add to occur.)
(The ACE required what came to be termed "optimum programming"
in dealing with positioning the instructions and arranging the
wait times for fastest execution.
Turing was dismissive of the simpler EDSAC: "The 'code' which he
[Wilkes] suggests is however very contrary to the line of development
here [at NPL], and much more in the American tradition of solving
one's difficulties by means of much equipment rather than by thought."
Dec. 1946, quoted by Lavington)
- IBM SSEC (1948) - two instructions in a "line of sequence", could be
used to specify two operations with a single final destination or
two separate operations;
data transfers were asynchronous;
output was overlapped with computation;
(see J.C. McPherson, F.E. Hamilton, and R.R. Seeber, Jr.,
"A Large-Scale, General-Purpose Electronic Digital Calculator:
The SSEC," IEEE Annals of the History of Computing,
Oct.-Dec. 1982; US Patent 2,636,672; appendix A of Bashe,
Johnson, Palmer, and Pugh, IBM's Early Computers, 1986)
[cf. Harvard Mark II]
- in 1944, John von Neumann spent time wiring plugboards on a
punched-card machine at Los Alamos; he found it frustrating to
have to take "into account the relative timing of the parallel
operations"; this experience "led him to reject parallel
computations in electronic computers" (Metropolis and Nelson, 1982)
- from sections 5.3-5.6 of the "First Draft of a Report on the EDVAC"
(von Neumann, 1945):
At this point there arises another question of principle. In all existing
devices where the element is not a vacuum tube the reaction time of the
element is sufficiently long to make a certain telescoping
of the steps involved in addition, subtraction, and still more in
multiplication and division, desirable. ...
5.4 ... The logical procedure to avoid these long durations, consists of
telescoping operations, that is of carrying out simultaneously as
many as possible. ...
Such accelerating, telescoping procedures are being used in all existing
devices. ... However, they save time only at exactly the rate at which
they multiply the necessary equipment, that is the number of elements
in the device: Clearly if a duration is halved by systematically carrying
out two additions at once, double adding equipment will be required (even
assuming that it can be used without disproportionate control facilities
and fully efficiently), etc.
This way of gaining time by increasing equipment is fully justifed in
non vacuum tube element devices, where gaining time is of the essence,
and extensive engineering experience is available regarding the handling
of involved devices containing many elements. A really all-purpose automatic
digital computing system constructed along these lines must, according to
all available experience, contain over 10,000 elements.
For a vacuum tube element device on the other hand, it would seem that the
opposite procedure holds more promise. ...
Thus it seems worth while to consider the following viewpoint: The device
should be as simple as possible, that is, contain as few elements as
possible. This can be achieved by never performing two operations
simultaneously, if this would cause a signifcant increase in the number
of elements required. The result will be that the device will work more
reliably and the vacuum tubes can be driven to shorter reaction times
- ENIAC had tremendous parallelism but was it difficult to coordinate;
from "The ENIAC: The First General-Purpose Electronic Computer"
(Burks and Burks, 1981):
The full parallelism of [the distributed control version of]
ENIAC was seldom used for two reasons. First, hardly any problems
lent themselves to such extensive parallelism of operation. ...
Second, in the case of ENIAC, its slow manual method of problem
setup and its relatively slow input-output operations significantly
decreased the payoff for optimizaing parallelism.
... When two or more chains of operation were to proceed in parallel,
and were to be followed by another chain dependent upon them, the
operator would determine the computation time of each of the parallel
chains, and would connect the last program control of the longest
parallel chain to the first program control of the following chain.
It was not always possible, however, to figure out in advance the
computation time for a given chain. ...
In the case of the [divide/square-root] unit, therefore, each
program was provided with an extra program pulse input, called
the "interlock input," ... [to show] that a parallel sequence of
operations was complete ... The card-reader program control also
had an interlock ...
- Gene Amdahl
WISC (1950) - four-stage pipeline
(instruction fetch, operand fetch, execution, write back)
- Elliott 152 (1950) - Simon Lavington states that "provision was made
for a number of instruction streams to be executed in parallel and
for multiple simultaneous data-transfers to take place between the
two RAM sections and various functional units within the CPU." See
- 1953 paper suggesting horizontal microcode by Wilkes and Stringer
- Elliott 153 (1954) - 64-bit instruction specifying multiple
register transfers (ALU, multiplier, I/O, branching, and control
of two scratchpad memories; similar to horizontal microcode; see
Lavington's description and an extended version in
"Appendix 4: Technical Details of the Elliott 800 Series
and 503 Computers," in Simon Lavington, Moving Targets:
Elliott-Automation and the Dawn of the Computer Age in Britain,
1947-67. History of Computing Series, Springer, 2011, pp. 555-568.)
Stretch (late 1950s, delivered 1961)
- aggressive supercomputer design including instruction pre-execution,
speculaseetive execution, and branch misprediction recovery
[lookahead design was suggested by Gene Amdahl and John Backus]
- UNIVAC LARC (late 1950s) - four-stage pipeline,
Norman Hardy's "Engineering Considerations of the LARC" and
UNIVAC-LARC General Description
- (to do) pipelining in the Atlas
F.H. Sumner, G. Haley, and E.C.Y. Chen, "The Central Control Unit
of the Atlas Computer," in C.M. Popplewell (ed.), Information
Processing 1962, Proceedings of IFIP Congress 62, Munich, Germany,
August 27 - September 1, 1962. North-Holland, 1962, pp. 657-663.
IBM proposal to execute up to six instructions per cycle
as an advanced version of the "look ahead" facility for Stretch
- 1970 paper on multiple-decoding issue logic for 7094 by Tjaden and Flynn
(IEEE TC, Oct. 70; see also Aug. 73)
- The 1970 Tjaden and Flynn paper, along with a 1972 paper by
Riseman and Foster, constituted a countertrend because of
the negative results on the amount of available
- A 1972 paper on architectural trends by Amdahl (SIGMICRO Newsletter,
Jan. 72) in which he felt that cache memories eliminated the need
for building up queues of arithmetic operations to "cover the
time required for operand fetching and branch target fetching"
and thus "the primary motivation for building queues reduces to
permitting resequencing for concurrency". (Note, however, that
modern out-of-order processors benefit by covering cache misses.)
- (rumors of Burroughs and Univac efforts -- e.g.,
US patents 3,611,306 and 4,466,061 by Burroughs employees)
- Elbrus 1
"Superscalar implementation" (1973-1978) - for a description see,
Peter Wolcott and Mikhail Dorojevets, "The Institute of Precision
Mechanics and Computer Technology and the El'brus Family of High-Speed
Computers," IEEE Annals of the History of Computing, vol. 20, no. 1,
1998, pp. 4-14. [translates a stack-based ISA into an internal
register-based microinstruction representation and executes these
microinstructions out-of-order; the article compares the
implementation to the CDC 6600; the decoding rate is unspecified,
but it may only decode (i.e., crack) one stack-based instruction
per cycle and thus not be truly a superscalar]
- US 3,771,141 (1973) Culler (Culler-Harrison), "Data processor
with parallel operations per instruction"
- FPS AP-120B (1975) / FPS-164 (1980) - VLIW
- 1978 paper on concurrency control bits by Lee Higbie -
bits added to instruction format and set by programmer or compiler,
"Overlapped operation with microprogramming," IEEE TC (March 78)
[cf. ready signal in NBS PILOT instruction format, 1958]
Flexible Processor (1979) - VLIW [datapath reminiscent of
- US 4,295,193 (1981) Pomerene (IBM),
"Machine for multiple instruction execution"
- VLIW work by Bob Rau (1981) - Polycyclic Architecture project at TRW/ESL
- DAE work by Jim Smith (first publication 1982) -
led to Astronautics ZS-1
- VLIW work by Josh Fisher (1983) - ELI-512 design
Cydrome (1983-1988) - VLIW
Multiflow (1984-1990) VLIW
- CDC Cyber 180/990 (1984) - history buffer for precise interrupts
VAX-HPS (1985) - work of Yale Patt on a restricted data flow
design called High Performance Substrate, which is a decoupled
microarchitecture with a repair facility that provides precise
interrupts and branch misprediction recovery
- Jim Smith and Andrew Plezskun ISCA paper on implementing precise
interrupts in pipelined processors (1985) - compares reorder buffer,
history buffer, and future file
- HC Torng's paper on the dispatch stack for multiple issue (1986)
(note that Torng's US 4,807,115 patent has a 1983 priority date)
Culler-7 (ca. 1986) - an interesting DAE/LIW design
CHoPP (mid-1980s) - proposal for nine-way issue
- dual-issue IBM Cheetah and Panther (mid-1980s, forerunners of
quad-issue America and IBM RS/6000, basis of 1986 FJCC paper on
compilers for dual-issue machines and 1987 Agerwala and Cocke
tech report on high-performance RISC processors)
- Bell Labs CRISP (1987) - branch folding in a decoded icache
- Burton Smith's Horizon (1988) - lookahead count to specify
- Stellar GS-1000 (1988) - four-way multithreading,
two instructions per cycle
- Apollo DN10000 (1988) - two-way LIW
- Intel i860 (1988) - two-way LIW
- Intel i960CA (1989) - triple-issue superscalar
- IBM RS/6000 (1989) - quad-issue superscalar
HARP (Hatfield RISC Processor) (1989) - VLIW design with
predication and a prohibit-writeback bit to reduce register
file traffic while allowing operands to flow through the
- National Semiconductor
Swordfish (1990) - a two-issue superscalar processor with
an LIW-like format inside a decoded icache
- IBM ES/9000-900 (1990) - out-of-order superscalar implementation
of IBM mainframe [conjecture: this is essentially the late 1960s
Metaflow Lightning/Thunder (1991/1995)
- early o-o-o superscalar designs for SPARC
Transmeta - VLIW
- ... many others ...
- Dezso Sima, Terence Fountain, Peter Kacuk, Advanced Computer
Architecture: A Design Space Approach. Harlow, England: Addison-Wesley,
- Amos R. Omondi, The Microarchitecture of Pipelined and Superscalar
Computers. Boston: Kluwer Academic Publishers, 1999.
- Jurij Silc, Borut Robic, and Theo Ungerer,
Processor Architecture: From Dataflow to Superscalar and Beyond.
Berlin/Heidelberg: Springer-Verlag, 1999.
- John Paul Shen and Mikko H. Lipasti, Modern processor Design:
Fundamentals of Superscalar Processors. Boston: McGraw-Hill, 2004.
- Joseph A. Fisher, Paolo Faraboschi, and Cliff Young ,
Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools.
San Francisco: Morgan Kaufmann, 2004.
- David R. Kaeli and Pen-Chung Yew,
Speculative Execution in High-Performance Computer Architectures.
Boca Raton: Chapman and Hall/CRC, 2005.
My thanks to David Hemmendinger for pointing me to the Harvard Mark I and
other early machines with overlaps among the function units.
[Clemson Univ. homepage]