14. Instruction decoding (PPro, PII and PIII)
I am describing instruction decoding before instruction fetching here because you need to know how the decoders work in order to understand the possible delays in instruction fetching.
The decoders can handle three instructions per clock cycle, but only when certain conditions are met. Decoder D0 can handle any instruction that generates up to 4 uops in a single clock cycle. Decoders D1 and D2 can handle only instructions that generate 1 uop and these instructions can be no more than 8 bytes long.
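To summarize, decoding two or three instructions in the same clock cycle requires that the first instruction (D0) generates no more than 4 uops and that the second and third instructions each generate 1 uop and are no more than 8 bytes long.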
An instruction that generates more than 4 uops takes two or more clock cycles to decode, and no other instructions can decode in parallel.
It follows from the rules above that the decoders can produce a maximum of 6 uops per clock cycle if the first instruction in each decode group generates 4 uops and the next two generate 1 uop each. The minimum production is 2 uops per clock cycle, which you get when all instructions generate 2 uops each, so that D1 and D2 are never used.
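The maximum and minimum figures can be checked with a small model of the decoders. The following Python sketch is only an illustration: the function name decode_cycles is made up for this example, instruction fetch effects are ignored, and the cost of an instruction with more than 4 uops is assumed to be one clock per 4 uops, since the text only says "two or more".

```python
def decode_cycles(uops, lengths=None):
    """Count the clock cycles needed to decode a sequence of instructions.

    uops    -- number of uops generated by each instruction, in program order
    lengths -- optional instruction lengths in bytes (default: all short)
    """
    if lengths is None:
        lengths = [1] * len(uops)
    cycles = 0
    i = 0
    n = len(uops)
    while i < n:
        if uops[i] > 4:
            # More than 4 uops: nothing else decodes in parallel.
            # ASSUMPTION: one clock per 4 uops, rounded up.
            cycles += -(-uops[i] // 4)
            i += 1
            continue
        # D0 takes the first instruction of the decode group.
        cycles += 1
        i += 1
        # D1 and D2 each take one more instruction, but only if it
        # generates exactly 1 uop and is no more than 8 bytes long.
        for _ in range(2):
            if i < n and uops[i] == 1 and lengths[i] <= 8:
                i += 1
            else:
                break
    return cycles


# 4-1-1 pattern: 60 uops in 10 clocks = 6 uops per clock
best = [4, 1, 1] * 10
print(sum(best) / decode_cycles(best))

# all instructions generate 2 uops: 20 uops in 10 clocks = 2 uops per clock
worst = [2] * 10
print(sum(worst) / decode_cycles(worst))
```

The model is greedy because the decoders take instructions strictly in program order: an instruction that cannot go into D1 or D2 always starts a new decode group in D0.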
For maximum throughput, it is recommended that you order your instructions according to the 4-1-1 pattern: instructions that generate 2 to 4 uops can be interspersed with two simple 1-uop instructions for free, in the sense that they do not add to the decoding time. Example:
    MOV EBX, [MEM1]     ; 1 uop  (D0)
    INC EBX             ; 1 uop  (D1)
    ADD EAX, [MEM2]     ; 2 uops (D0)
    ADD [MEM3], EAX     ; 4 uops (D0)
This takes 3 clock cycles to decode. You can save one clock cycle by reordering the instructions into two decode groups:
    ADD EAX, [MEM2]     ; 2 uops (D0)
    MOV EBX, [MEM1]     ; 1 uop  (D1)
    INC EBX             ; 1 uop  (D2)
    ADD [MEM3], EAX     ; 4 uops (D0)
The decoders now generate 8 uops in two clock cycles, which is probably satisfactory. Later stages in the pipeline can handle only 3 uops per clock cycle so with a decoding rate higher than this you can assume that decoding is not a bottleneck. However, complications in the fetch mechanism can delay decoding as described in the next chapter, so to be safe you may want to aim at a decoding rate higher than 3 uops per clock cycle.
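The effect of the reordering can be reproduced by splitting each sequence into decode groups. This is a simplified sketch, not a description of the hardware: the helper decode_groups is a hypothetical name, every instruction is assumed to generate at most 4 uops and to be at most 8 bytes long, and instruction fetch is ignored.

```python
def decode_groups(instructions):
    """Split (text, uops) pairs into decode groups, one group per clock."""
    groups = []
    for text, uops in instructions:
        last = groups[-1] if groups else None
        if last is not None and uops == 1 and len(last) < 3:
            # A 1-uop instruction fits into D1 or D2 of the current group.
            last.append(text)
        else:
            # Anything else must go to D0, which starts a new group.
            groups.append([text])
    return groups


original = [("MOV EBX, [MEM1]", 1), ("INC EBX", 1),
            ("ADD EAX, [MEM2]", 2), ("ADD [MEM3], EAX", 4)]
reordered = [("ADD EAX, [MEM2]", 2), ("MOV EBX, [MEM1]", 1),
             ("INC EBX", 1), ("ADD [MEM3], EAX", 4)]

for name, seq in (("original", original), ("reordered", reordered)):
    groups = decode_groups(seq)
    total_uops = sum(uops for _, uops in seq)
    print(name, len(groups), "clocks,", total_uops / len(groups), "uops per clock")
    for group in groups:
        print("   ", " | ".join(group))
```

The original order needs 3 decode groups and the reordered sequence needs 2, which gives 8 uops in 2 clocks, or 4 uops per clock. Note that the model only counts decode groups. It does not check whether a reordering preserves the meaning of the program, so you still have to make sure that the dependencies between the instructions allow it.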
You can see how many uops each instruction generates in the tables in chapter 29.
Instruction prefixes can also incur penalties in the decoders. Instructions can have several kinds of prefixes:
    ADD BX, 9                   ; no penalty because immediate operand is 8 bits
    MOV WORD PTR [MEM16], 9     ; penalty because operand is 16 bits
The last instruction should be changed to:
    MOV EAX, 9
    MOV WORD PTR [MEM16], AX    ; no penalty because no immediate
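The comments in these examples suggest a simple rule of thumb: the penalty occurs when an instruction needs an operand size prefix and also has an immediate operand wider than 8 bits. The following Python sketch encodes that reading. The function name and its parameters are invented for this illustration, and 32-bit code is assumed as the default.

```python
def operand_size_prefix_penalty(operand_bits, immediate_bits, default_bits=32):
    """Return True if the prefix penalty is expected under the rule of thumb.

    operand_bits   -- operand size of the instruction (8, 16 or 32)
    immediate_bits -- size of the encoded immediate, or None if there is none
    default_bits   -- default operand size of the code segment (ASSUMED 32)
    """
    # An operand size prefix is needed for a 16-bit operand in 32-bit code
    # or vice versa. 8-bit operands never need it.
    has_prefix = operand_bits in (16, 32) and operand_bits != default_bits
    # An immediate that fits the sign-extended 8-bit form does not count.
    wide_immediate = immediate_bits is not None and immediate_bits > 8
    return has_prefix and wide_immediate


print(operand_size_prefix_penalty(16, 8))      # ADD BX, 9
print(operand_size_prefix_penalty(16, 16))     # MOV WORD PTR [MEM16], 9
print(operand_size_prefix_penalty(32, 32))     # MOV EAX, 9
print(operand_size_prefix_penalty(16, None))   # MOV WORD PTR [MEM16], AX
```

Only the second call returns True. MOV has no form with a sign-extended 8-bit immediate, so the constant 9 is stored as a 16-bit immediate there, while ADD can use the short form.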