There are situations where the two instructions in a pair will not execute simultaneously, or only partially overlap in time. They should still be considered a pair, though, because the first instruction executes in the U-pipe, and the second in the V-pipe. No subsequent instruction can start to execute before both instructions in the imperfect pair have finished.
Imperfect pairing will happen in the following cases:
1. If the second instructions suffers an AGI stall (see chapter 9).
2. Two instructions cannot access the same DWORD of memory simultaneously. The following examples assume that ESI is divisible by 4:
MOV AL, [ESI] / MOV BL, [ESI+1]
The two operands are within the same DWORD, so they cannot execute simultaneously. The pair takes 2 clock cycles.
MOV AL, [ESI+3] / MOV BL, [ESI+4]
Here the two operands are on each side of a DWORD boundary, so they pair perfectly, and take only one clock cycle.
3. Rule 2 is extended to the case where bit 2-4 is the same in the two addresses (cache bank conflict). For DWORD addresses this means that the difference between the two addresses should not be divisible by 32. Examples:
MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing
Pairable integer instructions which do not access memory take one clock cycle to execute, except for mispredicted jumps. MOV instructions to or from memory also take only one clock cycle if the data area is in the cache and properly aligned. There is no speed penalty for using complex addressing modes such as scaled index registers.
A pairable integer instruction which reads from memory, does some calculation, and stores the result in a register or flags, takes 2 clock cycles. (read/modify instructions).
A pairable integer instruction which reads from memory, does some calculation, and writes the result back to the memory, takes 3 clock cycles. (read/modify/write instructions).
4. If a read/modify/write instruction is paired with a read/modify or read/modify/write instruction, then they will pair imperfectly.
The number of clock cycles used is given in the following table:
First instruction | Second instruction | ||
MOV or register only | read/modify | read/modify/write | |
MOV or register only | 1 | 2 | 3 |
read/modify | 2 | 2 | 3 |
read/modify/write | 3 | 4 | 5 |
Example:
ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles
ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles
5. When two paired instructions both take extra time due to cache misses, misalignment, or jump misprediction, then the pair will take more time than each instruction, but less than the sum of the two.
6. A pairable floating point instruction followed by FXCH will make imperfect pairing if the next instruction is not a floating point instruction.
In order to avoid imperfect pairing you have to know which instructions go into the U-pipe, and which to the V-pipe. You can find out this by looking backwards in your code and search for instructions which are unpairable, pairable only in one of the pipes, or cannot pair due to one of the rules above.
Imperfect pairing can often be avoided by reordering your instructions. Example:
L1: MOV EAX,[ESI] MOV EBX,[ESI] INC ECX
Here the two MOV instructions form an imperfect pair because they both access the same memory location, and the sequence takes 3 clock cycles. You can improve the code by reordering the instructions so that INC ECX pairs with one of the MOV instructions.
L2: MOV EAX,OFFSET A XOR EBX,EBX INC EBX MOV ECX,[EAX] JMP L1
The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter instruction has an AGI stall. The sequence takes 4 clocks. If you insert a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1 instead, then the sequence takes only 3 clocks.
The next example is in 16 bit mode, assuming that SP is divisible by 4:
L3: PUSH AX PUSH BX PUSH CX PUSH DX CALL FUNC
Here the PUSH instructions form two imperfect pairs, because both operands in each pair go into the same dword of memory. PUSH BX could possibly pair perfectly with PUSH CX (because they go on each side of a DWORD boundary) but it doesn't because it has already been paired with PUSH AX. The sequence therefore takes 5 clocks. If you insert a NOP or any other instruction so that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the sequence will take only 3 clocks. Another way to solve the problem is to make sure that SP is not divisible by 4. Knowing whether SP is divisible by 4 or not in 16 bit mode can be difficult, so the best way to avoid this problem is to use 32 bit mode.