16.2 Register read stalls
But there is another limitation which may be quite serious, and that is that you can only read two different permanent register names per clock cycle. This limitation applies to all registers used by an instruction except those registers that the instruction writes to only. Example:
MOV [EDI + ESI], EAX MOV EBX, [ESP + EBP]
The first instruction generates two uops: one that reads EAX and one that reads EDI and ESI. The second instruction generates one uop that reads ESP and EBP. EBX does not count as a read because it is only written to by the instruction. Let's assume that these three uops go through the RAT together. I will use the word triplet for a group of three consecutive uops that go through the RAT together. Since the ROB can handle only two permanent register reads per clock cycle and we need five register reads, our triplet will be delayed for two extra clock cycles before it comes to the reservation station. With 3 or 4 register reads in the triplet it would be delayed by one clock cycle.
The same register can be read more than once in the same triplet without adding to the count. If the instructions above are changed to:
MOV [EDI + ESI], EDI MOV EBX, [EDI + EDI]
then you will need only two register reads (EDI and ESI) and the triplet will not be delayed.
A register that is going to be written to by a pending uop is stored in the ROB so that it can be read for free until it is written back, which takes at least 3 clock cycles, and usually more. Write-back is the end of the execution stage where the value becomes available. In other words, you can read any number of registers in the RAT without stall if their values are not yet available from the execution units. The reason for this is that when a value becomes available it is immediately written directly to any subsequent ROB entries that need it. But if the value has already been written back to a temporary or permanent register when a subsequent uop that needs it goes into the RAT, then the value has to be read from the register file, which has only two read ports. There are three pipeline stages from the RAT to the execution unit so you can be certain that a register written to in one uop-triplet can be read for free in at least the next three triplets. If the writeback is delayed by reordering, slow instructions, dependency chains, cache misses, or by any other kind of stall, then the register can be read for free further down the instruction stream.
Example:
MOV EAX, EBX SUB ECX, EAX INC EBX MOV EDX, [EAX] ADD ESI, EBX ADD EDI, ECX
These 6 instructions generate 1 uop each. Let's assume that the first 3 uops go through the RAT together. These 3 uops read register EBX, ECX, and EAX. But since we are writing to EAX before reading it, the read is free and we get no stall. The next three uops read EAX, ESI, EBX, EDI, and ECX. Since both EAX, EBX and ECX have been modified in the preceding triplet and not yet written back then they can be read for free, so that only ESI and EDI count, and we get no stall in the second triplet either. If the SUB ECX,EAX instruction in the first triplet is changed to CMP ECX,EAX then ECX is not written to and we will get a stall in the second triplet for reading ESI, EDI and ECX. Similarly, if the INCEBX instruction in the first triplet is changed to NOP or something else then we will get a stall in the second triplet for reading ESI, EBX and EDI.
No uop can read more than two registers. Therefore, all instructions that need to read more than two registers are split up into two or more uops.
To count the number of register reads, you have to include all registers which are read by the instruction. This includes integer registers, the flags register, the stack pointer, floating point registers and MMX registers. An XMM register counts as two registers, except when only part of it is used, as e.g. in ADDSS and MOVHLPS. Segment registers and the instruction pointer do not count. For example, in SETZ AL you count the flags register but not AL. ADD EBX,ECX counts both EBX and ECX, but not the flags because they are written to only. PUSH EAX reads EAX and the stack pointer and then writes to the stack pointer.
The FXCH instruction is a special case. It works by renaming, but doesn't read any values so that it doesn't count in the rules for register read stalls. An FXCH instruction behaves like 1 uop that neither reads nor writes any registers with regard to the rules for register read stalls.
Don't confuse uop triplets with decode groups. A decode group can generate from 1 to 6 uops, and even if the decode group has three instructions and generates three uops there is no guarantee that the three uops will go into the RAT together.
The queue between the decoders and the RAT is so short (10 uops) that you cannot assume that register read stalls do not stall the decoders or that fluctuations in decoder throughput do not stall the RAT.
It is very difficult to predict which uops go through the RAT together unless the queue is empty, and for optimized code the queue should be empty only after mispredicted branches. Several uops generated by the same instruction do not necessarily go through the RAT together; the uops are simply taken consecutively from the queue, three at at time. The sequence is not broken by a predicted jump: uops before and after the jump can go through the RAT together. Only a mispredicted jump will discard the queue and start over again so that the next three uops are sure to go into the RAT together.
If three consecutive uops read more than two different registers then you would of course prefer that they do not go through the RAT together. The probability that they do is one third. The penalty of reading three or four written-back registers in one triplet of uops is one clock cycle. You can think of the one clock delay as equivalent to the load of three more uops through the RAT. With the probability of 1/3 of the three uops going into the RAT together, the average penalty will be the equivalent of 3/3 = 1 uop. To calculate the average time it will take for a piece of code to go through the RAT, add the number of potential register read stalls to the number of uops and divide by three. You can see that It doesn't pay to remove the stall by putting in an extra instruction unless you know for sure which uops go into the RAT together or you can prevent more than one potential register read stall by one extra instruction.
In situations where you aim at a throughput of 3 uops per clock, the limit of two permanent register reads per clock cycle may be a problematic bottleneck to handle. Possible ways to remove register read stalls are:
For instructions that generate more than one uop you may want to know the order of the uops generated by the instruction in order to make a precise analysis of the possibility of register read stalls. I have therefore listed the most common cases below.
Writes to memory
A memory write generates two uops. The first one (to port 4) is a store operation, reading the register to store. The second uop (port 3) calculates the memory address, reading any pointer registers. Examples:
MOV [EDI], EAX
First uop reads EAX, second uop reads EDI.
FSTP QWORD PTR [EBX+8*ECX]
First uop reads ST(0), second uop reads EBX and ECX.
Read and modify
An instruction that reads a memory operand and modifies a register by some arithmetic or logical operation generates two uops. The first one (port 2) is a memory load instruction reading any pointer registers, the second uop is an arithmetic instruction (port 0 or 1) reading and writing to the destination register and possibly writing to the flags. Example:
ADD EAX, [ESI+20]
First uop reads ESI, second uop reads EAX and writes EAX and flags.
Read/modify/write
A read/modify/write instruction generates four uops. The first uop (port 2) reads any pointer registers, the second uop (port 0 or 1) reads and writes to any source register and possibly writes to the flags, the third uop (port 4) reads only the temporary result which doesn't count here, the fourth uop (port 3) reads any pointer registers again. Since the first and the fourth uop cannot go into the RAT together you cannot take advantage of the fact that they read the same pointer registers. Example:
OR [ESI+EDI], EAX
The first uop reads ESI and EDI, the second uop reads EAX and writes EAX and the flags, the third uop reads only the temporary result, the fourth uop reads ESI and EDI again. No matter how these uops go into the RAT you can be sure that the uop that reads EAX goes together with one of the uops that read ESI and EDI. A register read stall is therefore inevitable for this instruction unless one of the registers has been modified recently.
Push register
A push register instruction generates 3 uops. The first one (port 4) is a store instruction, reading the register. The second uop (port 3) generates the address, reading the stack pointer. The third uop (port 0 or 1) subtracts the word size from the stack pointer, reading and modifying the stack pointer.
Pop register
A pop register instruction generates 2 uops. The first uop (port 2) loads the value, reading the stack pointer and writing to the register. The second uop (port 0 or 1) adjusts the stack pointer, reading and modifying the stack pointer.
Call
A near call generates 4 uops (port 1, 4, 3, 01). The first two uops read only the instruction pointer which doesn't count because it cannot be renamed. The third uop reads the stack pointer. The last uop reads and modifies the stack pointer.
Return
A near return generates 4 uops (port 2, 01, 01, 1). The first uop reads the stack pointer. The third uop reads and modifies the stack pointer.
An example of how to avoid a register read stall is given in example 2.6.