Mark Smotherman. Last updated January 2002
The Swordfish is a unique design with a superscalar external appearance but a long-instruction-word (LIW) internal microarchitecture based on a decoded instruction cache (DINC).
Swordfish chip photo (1.4M TIF file). A 1 KB data cache lies along the bottom of the die on the left. Above it, in the lefthand column, are the floating-point unit and the DSP multiplier (at the very top left). The CPU core is in the middle column, extending two-thirds of the way down. The instruction emulator is in the lower middle, below the CPU core. At the top righthand edge is a 4 KB instruction cache. Below it, and slightly toward the middle, is the instruction loader. Along the lower righthand edge are the DMA, ICU, and timers. The BIU is along the bottom of the die, one quarter of the way in from the right.
The Swordfish was designed to be a successor chip to the NS32532. As such it was initially known as the NS32732 (and later as the NS32764). Even though it was not delivered as an NS32K family member, the design lives on in one of National's embedded processor lines (see CompactRISC).
The people involved in the Swordfish design were:
The Swordfish design was started in the late 1980s in Israel. The design featured dual integer pipelines, A and B, for superscalar issue. The pipelines were standard RISC-like designs, with five stages (fetch, decode, execute, memory, writeback). Pipeline B was the primary, in the sense that all instructions could execute on it. Pipeline A was secondary and in particular could not execute branches or initiate floating-point instructions. However, the first instruction in an instruction pair fetched from the decoded instruction cache (see below) was always assigned to pipeline A, and the second to pipeline B. The pipelines operated in lockstep except when the decoder in pipeline B stalled due to a dependency between the paired instructions or some other condition. A new instruction pair would not be obtained until both instructions from the previous pair had exited the decode stages.
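The pairing rule above can be illustrated with a small timing sketch. It is only a model under stated assumptions: the `Instr` record, the register names, and the one-cycle cost of a dependency stall are all hypothetical (the text says that the pipeline B decoder stalls, but not for how long), and other stall conditions are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, some sources."""
    name: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()


def is_dependent(first, second):
    """True when the second instruction reads the register the first writes."""
    return first.dest is not None and first.dest in second.srcs


def decode_cycles(pairs, stall=1):
    """Count decode-stage cycles for a list of (pipeline A, pipeline B) pairs.

    Each pair occupies decode for one cycle. A dependent pair holds the
    pipeline B decoder for `stall` extra cycles (assumed value), and no new
    pair is taken until both instructions have left decode.
    """
    cycles = 0
    for first, second in pairs:
        cycles += 1
        if is_dependent(first, second):
            cycles += stall
    return cycles


pairs = [
    # sub reads r1, which add writes: the pair issues sequentially.
    (Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r5"))),
    # No shared registers: the pair issues in lockstep.
    (Instr("add", "r6", ("r7", "r8")), Instr("or", "r9", ("r10", "r11"))),
]
total = decode_cycles(pairs)
```

Under these assumptions the two pairs take three decode cycles instead of two, the extra cycle coming from the dependent first pair.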
A register scoreboard was used to control WAR and WAW stalls. Additionally, a load reservation FIFO was included so that the pipelines could continue execution past data cache misses, each of which required about six cycles to satisfy. The register scoreboard would stall a load-dependent instruction if it was decoded prior to the missing data being returned from cache.
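A minimal sketch of how a scoreboard and a load reservation FIFO might interact is given below. The class, its method names, and the fixed six-cycle miss latency are illustrative assumptions (the text says only "about six cycles"); this is not a description of the actual Swordfish logic.

```python
from collections import deque

MISS_LATENCY = 6  # assumed fixed value; the text says "about six cycles"


class Scoreboard:
    """Hypothetical scoreboard that tracks registers awaiting load data."""

    def __init__(self):
        self.busy = set()          # registers whose data has not yet returned
        self.load_fifo = deque()   # (ready_cycle, dest_reg), in issue order

    def issue_load_miss(self, dest, cycle):
        """Record a load that missed the data cache; execution continues."""
        self.busy.add(dest)
        self.load_fifo.append((cycle + MISS_LATENCY, dest))

    def tick(self, cycle):
        """Release registers, oldest first, once their data has returned."""
        while self.load_fifo and self.load_fifo[0][0] <= cycle:
            _, dest = self.load_fifo.popleft()
            self.busy.discard(dest)

    def must_stall(self, srcs, dest=None):
        """Stall an instruction that reads or writes a pending register."""
        regs = set(srcs)
        if dest is not None:
            regs.add(dest)
        return bool(regs & self.busy)


sb = Scoreboard()
sb.issue_load_miss("r1", cycle=0)
sb.tick(1)
independent_ok = not sb.must_stall(["r2", "r3"], "r4")  # runs past the miss
dependent_stalls = sb.must_stall(["r1"], "r5")          # waits for the data
sb.tick(6)
dependent_released = not sb.must_stall(["r1"], "r5")
```

The point of the sketch is that only instructions touching the pending register wait; unrelated instructions keep flowing while the miss is serviced.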
Each instruction that was issued to pipeline B was also supplied to the floating-point pipeline, so that a floating-point instruction could be immediately started. The floating-point pipeline consisted of five stages after fetch: decode, execute-1, execute-2, round and normalize, and write back. Pipeline B operated in lockstep with an instruction in the floating-point pipeline; this was done in order to control program sequencing. If the floating-point instruction could trap, pipeline B additionally cycled twice in its memory stage so that both pipeline B and the floating-point pipeline would enter their respective write-back stages simultaneously. Floating-point traps were thereby made precise (i.e., no instructions beyond the trapping one would be allowed into a writeback stage).
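The stage alignment can be checked with a short worked example. It assumes one cycle per stage, a common fetch cycle for both pipelines, and reads "cycled twice in its memory stage" as spending two cycles there; the function and variable names are hypothetical.

```python
INT_STAGES = ["fetch", "decode", "execute", "memory", "writeback"]
FP_STAGES = ["fetch", "decode", "execute-1", "execute-2",
             "round-normalize", "writeback"]


def timeline(stages, cycles_in_stage=None):
    """Return the stage occupied in each cycle, starting at cycle 0.

    `cycles_in_stage` maps a stage name to the number of cycles spent in it;
    stages that are not listed take one cycle (an assumption of this sketch).
    """
    cycles_in_stage = cycles_in_stage or {}
    occupied = []
    for stage in stages:
        occupied.extend([stage] * cycles_in_stage.get(stage, 1))
    return occupied


fp = timeline(FP_STAGES)
int_normal = timeline(INT_STAGES)
int_held = timeline(INT_STAGES, {"memory": 2})

fp_wb = fp.index("writeback")                  # cycle of FP write back
int_normal_wb = int_normal.index("writeback")  # pipeline B without the hold
int_held_wb = int_held.index("writeback")      # pipeline B held in memory
```

With these assumptions, pipeline B would reach writeback one cycle ahead of the floating-point pipeline if left alone, and the second memory-stage cycle brings the two writebacks into the same cycle.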
The initial chip ran at 50 MHz, and could perform a 32b x 32b integer multiply in one cycle or a 16b x 16b -> 32b signed integer multiply in one cycle (with selection of the 16b operands from either the low or high halves of the registers to help implement complex arithmetic). Three floating-point units were provided: an adder, a multiplier, and a divider.
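To see why half-register selection helps with complex arithmetic, consider the sketch below. It assumes a complex value packed into one 32-bit register with the real part in the low half and the imaginary part in the high half; that packing convention and all function names are illustrative, not taken from the Swordfish instruction set.

```python
def signed16(value):
    """Interpret the low 16 bits of `value` as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def half(reg, high):
    """Select the high or low 16-bit half of a 32-bit register value."""
    return signed16(reg >> 16 if high else reg)


def mul16(ra, a_high, rb, b_high):
    """16b x 16b -> 32b signed multiply with a half selector per operand."""
    return half(ra, a_high) * half(rb, b_high)


def pack(re, im):
    """Pack a complex value: real in the low half, imaginary in the high half
    (an assumed convention for this sketch)."""
    return ((im & 0xFFFF) << 16) | (re & 0xFFFF)


def complex_mul(x, y):
    """Complex multiply built from four half-selecting multiplies."""
    re = mul16(x, False, y, False) - mul16(x, True, y, True)
    im = mul16(x, False, y, True) + mul16(x, True, y, False)
    return re, im


# (3 - 2i) * (1 + 4i) = 11 + 10i
product = complex_mul(pack(3, -2), pack(1, 4))
```

Each of the four partial products names its operand halves directly, so no separate shift or extract instructions are needed to unpack the real and imaginary parts. Python integers do not overflow, so the sketch does not model 32-bit wraparound of the final sums.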
Don Alpert states,
Swordfish was most strongly influenced by:
- MIPS-X at Stanford. We followed a similar integer pipeline and looked at their branch handling as well. I visited Stanford in summer 1987(?) and was exposed to the work in detail.
- Multiflow VLIW. I had met Josh Fisher once when I was a student at Stanford, then heard him give a talk about Multiflow at UC Berkeley in 1987 (?). We were trying to figure out how to get parallelism out of multiple functional units, and adopted a microarchitecture that was like VLIW: each FU was assigned to fixed slots in a 2-wide instruction word fetched from the cache. We had the HW detect dependencies as instructions were placed in the cache slots, so it was a superscalar architecture with a VLIW machine organization. To improve icache efficiency we allowed dependent instructions to be packed together with a bit per pair of instructions that indicated whether or not they were dependent. Independent instructions could be executed in parallel, dependent instructions had to be executed sequentially, but still on the pipeline assigned to that slot. Just about the only wasted cache slots were for FP instructions that could not be paired with a load or integer op.
... Overall it was a very efficient architecture. With little extra cost for a second integer pipe and a simple control structure, it was possible to derive a lot of parallelism on many embedded loops.
As mentioned in the quote, the instruction cache was organized into instruction pair entries (or, 2-wide LIW), with each instruction mapped to one of the two integer pipelines. Pre-decoding ("instruction loading") was performed during instruction cache refill to determine the instruction pairs, identify any true dependency (i.e., RAW) between the two instructions in a pair, and to calculate and store the branch target address (rather than storing the branch offset). A predict-taken branching policy was used and yielded 0-cycle taken branches.
Two additional bits were used in each instruction cache entry: one to indicate a dependent pair and thus force sequential issue, and another to indicate an emulated rather than hardwired instruction.
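The instruction loading step can be sketched as follows. This is a simplified model built on several assumptions: the record types and field names are hypothetical, the pairing policy is a simple greedy one (an instruction that cannot use pipeline A is placed alone in slot B), branch offsets are taken relative to the branch's own address, and the emulated bit is left out because the text does not say how emulated instructions were paired.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    pc: int
    kind: str                      # "int", "load", "fp", or "branch"
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()
    offset: int = 0                # branch displacement (assumed pc-relative)


@dataclass
class Entry:
    slot_a: Optional[Instr]        # issued to pipeline A
    slot_b: Optional[Instr]        # issued to pipeline B
    dependent: bool                # forces sequential issue of the pair
    target: Optional[int]          # precomputed branch target address


def can_use_a(ins):
    """Pipeline A cannot execute branches or start floating-point work."""
    return ins.kind not in ("branch", "fp")


def load_entries(instrs: List[Instr]) -> List[Entry]:
    """Greedy pairing in program order (an assumed policy)."""
    entries = []
    i = 0
    while i < len(instrs):
        first = instrs[i]
        if not can_use_a(first):
            a, b = None, first     # slot A is wasted
            i += 1
        elif i + 1 < len(instrs):
            a, b = first, instrs[i + 1]
            i += 2
        else:
            a, b = first, None
            i += 1
        dependent = bool(a and b and a.dest is not None and a.dest in b.srcs)
        target = b.pc + b.offset if b and b.kind == "branch" else None
        entries.append(Entry(a, b, dependent, target))
    return entries


program = [
    Instr(0, "int", "r1", ("r2", "r3")),
    Instr(4, "int", "r4", ("r1",)),        # reads r1: RAW within the pair
    Instr(8, "fp", "f0", ("f1",)),         # cannot use slot A
    Instr(12, "int", "r5", ("r6",)),
    Instr(16, "branch", offset=-16),       # target = 16 - 16 = 0
]
entries = load_entries(program)
```

In this model the five instructions fill three cache entries: a dependent pair, a lone floating-point instruction with an empty slot A, and an integer/branch pair whose target address is stored in place of the offset.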
The initial plans were to implement only a "performance-critical" core of the NS32K instruction set. When a marked instruction was fetched, an instruction emulator unit would feed a sequence of core instructions to the pipelines in order to interpret an unimplemented instruction (cf. PPro). Later, an approach using a native RISC-like instruction set was adopted. The native instructions used the undefined opcodes in the NS32K definition. Pre-decoding would then classify the NS32K instructions into 3 groups:
My thanks to Don Alpert, Gideon Intrater, and Ran Talmudi for their help in collecting this information.