Instruction-Level Parallelism
Mark Smotherman
Last updated: January 2012
Summary: ILP dates backs to the 1940s, and various attempts have
been made to exploit it over the years.
... under construction ...
... intro to be written ...
dependency fn. unit time to start
checking assignment execution
---------- ---------- -------------
superscalar hardware hardware hardware
...........
EPIC software . hardware hardware
............
dynamic VLIW software software . hardware
............
VLIW software software software
code generation
| superscalar
|`----------------------------------.
| |
.-------|-----------------. .-------|-----------------.
| v | | v |
| dependency checking | | dependency checking |
| O(n^2) | | | |
| | | EPIC | | |
| |`----------------------------------. |
| v | | v |
| fn. unit assignment | | fn. unit assignment |
| | | dynamic | | |
| | | VLIW | | |
| |`----------------------------------. |
| v | | v |
| time to start execution | | time to start execution |
| | | | | |
| ` | VLIW | | |
| `----------------------------------. |
| | | | |
`-------------------------' `-------|-----------------'
compiler scheduling | hardware control
|
v
hardware fn. units
See
B. Rau and J. Fisher, "Instruction-Level Parallel Processing:
History, Overview and Perspective," HP Tech Rept, 1992 (pdf).
A partial list of machines and designs exploiting ILP
1940s
-
Zuse Z3 (1941, patent filed in 1949) - execution pipeline with overlap
of instructions
- Zuse Z4
(1945, rebuilt 1950) - two-instruction lookahead, used to start
loads early
- Bell Labs Model V (1946) - two processors with a master control that
distributed an instruction to whichever processor was available
(called "superbranching" by Stibitz, but Paul Ceruzzi compares
it to resource management within an operating system)
- Turing's ACE proposal (1946) - complicated instruction format with
several timing-dependent features
(The Pilot ACE is one of the first
transport-triggered architectures in which a store into a certain
memory location will cause an operation to start. Some early
calculators had some of this flavor, e.g., an ASCC move from one
counter into another nonempty counter caused an add to occur.)
(The ACE required what came to be termed "optimum programming"
in dealing with positioning the instructions and arranging the
wait times for fastest execution.
Turing was dismissive of the simpler EDSAC: "The 'code' which he
[Wilkes] suggests is however very contrary to the line of development
here [at NPL], and much more in the American tradition of solving
one's difficulties by means of much equipment rather than by thought."
Dec. 1946, quoted by Lavington)
- IBM SSEC (1948) - two instructions in a "line of sequence", could be
used to specify two operations with a single final destination or
two separate operations;
data transfers were asynchronous;
output was overlapped with computation;
(see J.C. McPherson, F.E. Hamilton, and R.R. Seeber, Jr.,
"A Large-Scale, General-Purpose Electronic Digital Calculator:
The SSEC," IEEE Annals of the History of Computing,
Oct.-Dec. 1982; US Patent 2,636,672; appendix A of Bashe,
Johnson, Palmer, and Pugh, IBM's Early Computers, 1986)
[cf. Harvard Mark II]
- countertrends
- in 1944, John von Neumann spent time wiring plugboards on a
punched-card machine at Los Alamos; he found it frustrating to
have to take into account the relative timing of parallel
operations; this experience led him to reject parallel
computations in electronic computers (Metropolis and Nelson, 1982)
- ENIAC had tremendous parallelism but was it difficult to coordinate,
in part because of the plugboard wiring; in 1948, the plugboards
were wired into a centralized control structure, in which the
machine was programmed by setting switches; the plugboards were
not changed thereafter (Burks, 1981)
1950s
1960s
- CDC 6600 (1964) - scalar, out-of-order execution in central processor
- ten functional units in CPU
- CPU decodes one instruction per cycle and sends to scoreboard ("issue")
- up to three instructions can start execution each cycle
(limit is due to structural limitation of three sets of
read-port pairs, called data trunks, provided for the register file);
operands are read when execution starts
- up to four instructions can write results each minor cycle
- register RAW causes wait to start execution;
WAR cause wait to write result;
WAW and function unit busy prevents "issue" from decoder
- memory stunt box
- also provided multithreaded execution in PPUs
- 1966 paper on very high-speed computing systems by Flynn
Proc. IEEE (Dec. 66) ["confluent" SISD (including multiple decoding),
SIMD, MISD, MIMD]
- IBM S/360 Model 91 (1967) - scalar pipeline, with out-of-order
execution using Tomasulo algorithm in floating-point unit
[trivia: in 1964 AFIPS paper,
T.C. Chen called it "topological path distortion"]
- IBM
ACS (mid to late 1960s)
- proposal for a 7-way-issue superscalar processor;
Schorr's paper on ACS published in 1971
- 1969 paper on direct functional control (noninterlocked VLIW with
exposed pipelining) by Melliar-Smith in AFIPS FJCC
[in reaction to execution resources "squandered" and "wasted" by a
Tomasulo-like E-box coupled with a one-instruction-decode-per-cycle
I-box; 128-bit or longer instruction words; suggestion to use this
for inner loops in array processing]
1970s
1980s
- US 4,295,193 (1981) Pomerene (IBM),
"Machine for multiple instruction execution"
- VLIW work by Bob Rau (1981) - Polycyclic Architecture project at TRW/ESL
- DAE work by Jim Smith (first publication 1982) -
led to Astronautics ZS-1
- VLIW work by Josh Fisher (1983) - ELI-512 design
-
Cydrome (1983-1988) - VLIW
-
Multiflow (1984-1990) VLIW
-
VAX-HPS (ca. 1985) - work of Yale Patt on restricted data flow
- HC Torng's paper on the dispatch stack for multiple issue (1986)
-
Culler-7 (ca. 1986) - an interesting DAE/LIW design
-
CHoPP (mid-1980s) - proposal for nine-way issue
- dual-issue IBM Cheetah and Panther (mid-1980s, forerunners of
quad-issue America and IBM RS/6000, basis of 1986 FJCC paper on
compilers for dual-issue machines and 1987 Agerwala and Cocke
tech report on high-performance RISC processors)
- Bell Labs CRISP (1987) - branch folding in a decoded icache
- Burton Smith's Horizon (1988) - lookahead count to specify
instruction independence
- Stellar GS-1000 (1988) - four-way multithreading,
two instructions per cycle
- Apollo DN10000 (1988) - two-way LIW
- Intel i860 (1988) - two-way LIW
- Intel i960CA (1989) - triple-issue superscalar
- IBM RS/6000 (1989) - quad-issue superscalar
-
HARP (Hatfield RISC Processor) (1989) - VLIW design with
predication and a prohibit-writeback bit to reduce register
file traffic while allowing operands to flow through the
forwarding logic
1990s
- National Semiconductor
Swordfish (1990) - a two-issue superscalar processor with
an LIW-like format inside a decoded icache
- IBM ES/9000-900 (1990) - out-of-order superscalar implementation
of IBM mainframe [conjecture: this is essentially the late 1960s
ACS/360 design]
-
Metaflow Lightning/Thunder (1991/1995)
- early o-o-o superscalar designs for SPARC
-
Transmeta - VLIW
- ... many others ...
2000s
Other Resources
- Dezso Sima, Terence Fountain, Peter Kacuk, Advanced Computer
Architecture: A Design Space Approach. Harlow, England: Addison-Wesley,
1997.
- Amos R. Omondi, The Microarchitecture of Pipelined and Superscalar
Computers. Boston: Kluwer Academic Publishers, 1999.
- Jurij Silc, Borut Robic, and Theo Ungerer,
Processor Architecture: From Dataflow to Superscalar and Beyond.
Berlin/Heidelberg: Springer-Verlag, 1999.
- John Paul Shen and Mikko H. Lipasti, Modern processor Design:
Fundamentals of Superscalar Processors. Boston: McGraw-Hill, 2004.
- Joseph A. Fisher, Paolo Faraboschi, and Cliff Young ,
Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools.
San Francisco: Morgan Kaufmann, 2004.
- David R. Kaeli and Pen-Chung Yew,
Speculative Execution in High-Performance Computer Architectures.
Boca Raton: Chapman and Hall/CRC, 2005.
[History page]
[Mark's homepage]
[CPSC homepage]
[Clemson Univ. homepage]
mark@cs.clemson.edu