10.2 Imperfect pairing


Date: 2000-04-01 14:00

There are situations where the two instructions in a pair will not execute simultaneously, or only partially overlap in time. They should still be considered a pair, though, because the first instruction executes in the U-pipe, and the second in the V-pipe. No subsequent instruction can start to execute before both instructions in the imperfect pair have finished.

Imperfect pairing will happen in the following cases:

1. If the second instruction suffers an AGI stall (see chapter 9).

2. Two instructions cannot access the same DWORD of memory simultaneously. The following examples assume that ESI is divisible by 4:

MOV AL, [ESI] / MOV BL, [ESI+1]

The two operands are within the same DWORD, so they cannot execute simultaneously. The pair takes 2 clock cycles.

MOV AL, [ESI+3] / MOV BL, [ESI+4]

Here the two operands are on each side of a DWORD boundary, so they pair perfectly, and take only one clock cycle.

3. Rule 2 is extended to the case where bits 2-4 are the same in the two addresses (cache bank conflict). For DWORD addresses this means that the difference between the two addresses should not be divisible by 32. Examples:

MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing
MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing
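Rules 2 and 3 can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the function name `same_cache_bank` and the base address `0x1000` are made up for illustration. It assumes both operands are aligned so that neither crosses a DWORD boundary, and it compares only bits 2-4 of the two addresses. Two addresses in the same DWORD always have the same bits 2-4, so rule 2 is covered as a special case.

```python
def same_cache_bank(addr1: int, addr2: int) -> bool:
    """Return True if bits 2-4 of the two addresses are identical.

    Assumption: neither operand crosses a DWORD boundary.
    """
    bank_mask = 0b11100  # selects bits 2, 3 and 4
    return (addr1 & bank_mask) == (addr2 & bank_mask)


esi = 0x1000  # hypothetical value of ESI, divisible by 4

print(same_cache_bank(esi, esi + 1))          # True:  same DWORD, imperfect
print(same_cache_bank(esi + 3, esi + 4))      # False: either side of a boundary
print(same_cache_bank(esi, esi + 32000))      # True:  difference divisible by 32
print(same_cache_bank(esi, esi + 32004))      # False: perfect pairing
```

A result of True means the two memory accesses would conflict if the instructions were paired.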

Pairable integer instructions which do not access memory take one clock cycle to execute, except for mispredicted jumps. MOV instructions to or from memory also take only one clock cycle if the data area is in the cache and properly aligned. There is no speed penalty for using complex addressing modes such as scaled index registers.

A pairable integer instruction which reads from memory, does some calculation, and stores the result in a register or the flags takes 2 clock cycles (read/modify instructions).

A pairable integer instruction which reads from memory, does some calculation, and writes the result back to memory takes 3 clock cycles (read/modify/write instructions).

4. If a read/modify/write instruction is paired with a read/modify or read/modify/write instruction, then they will pair imperfectly.

The number of clock cycles used is given in the following table:

First instruction       Second instruction
                        MOV or register only    read/modify    read/modify/write
MOV or register only    1                       2              3
read/modify             2                       2              3
read/modify/write       3                       4              5

Example:

ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles

ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles
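The timing table for rule 4 can be written as a small lookup. This is only an illustrative Python sketch: the category names and the helpers `pair_clocks` and `is_imperfect` are made up here. It also assumes that "imperfect" means the pair takes longer than the slower of its two instructions would take alone, which is one reading of the rule.

```python
# Clocks for each instruction type executed alone (from the text above).
SINGLE_CLOCKS = {"mov_or_reg": 1, "read_modify": 2, "read_modify_write": 3}

# Clocks for a pair, indexed as PAIR_CLOCKS[first][second].
PAIR_CLOCKS = {
    "mov_or_reg":        {"mov_or_reg": 1, "read_modify": 2, "read_modify_write": 3},
    "read_modify":       {"mov_or_reg": 2, "read_modify": 2, "read_modify_write": 3},
    "read_modify_write": {"mov_or_reg": 3, "read_modify": 4, "read_modify_write": 5},
}


def pair_clocks(first: str, second: str) -> int:
    """Clock cycles used by a pair with the given U-pipe and V-pipe types."""
    return PAIR_CLOCKS[first][second]


def is_imperfect(first: str, second: str) -> bool:
    """Assumed definition: the pair is slower than its slowest member alone."""
    slowest = max(SINGLE_CLOCKS[first], SINGLE_CLOCKS[second])
    return pair_clocks(first, second) > slowest


# ADD [mem1], EAX / ADD EBX, [mem2]
print(pair_clocks("read_modify_write", "read_modify"))   # 4
# ADD EBX, [mem2] / ADD [mem1], EAX
print(pair_clocks("read_modify", "read_modify_write"))   # 3
```

Under this reading only the two combinations with a read/modify/write instruction first and a memory-reading instruction second come out as imperfect, which is why swapping the two ADD instructions saves a clock cycle.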

5. When two paired instructions both take extra time due to cache misses, misalignment, or jump misprediction, the pair will take more time than either instruction alone, but less than the sum of the two.

6. A pairable floating point instruction followed by FXCH will pair imperfectly if the next instruction is not a floating point instruction.

In order to avoid imperfect pairing you have to know which instructions go into the U-pipe and which go into the V-pipe. You can find this out by looking backwards in your code and searching for instructions which are unpairable, pairable only in one of the pipes, or unable to pair because of one of the rules above.

Imperfect pairing can often be avoided by reordering your instructions. Example:

L1:     MOV     EAX,[ESI]
        MOV     EBX,[ESI]
        INC     ECX

Here the two MOV instructions form an imperfect pair because they both access the same memory location, and the sequence takes 3 clock cycles. You can improve the code by reordering the instructions so that INC ECX pairs with one of the MOV instructions.

L2:     MOV     EAX,OFFSET A
        XOR     EBX,EBX
        INC     EBX
        MOV     ECX,[EAX]
        JMP     L1

The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter instruction has an AGI stall. The sequence takes 4 clocks. If you insert a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1 instead, then the sequence takes only 3 clocks.

The next example is in 16 bit mode, assuming that SP is divisible by 4:

L3:     PUSH    AX
        PUSH    BX
        PUSH    CX
        PUSH    DX
        CALL    FUNC

Here the PUSH instructions form two imperfect pairs, because both operands in each pair go into the same DWORD of memory. PUSH BX could possibly pair perfectly with PUSH CX (because they go on each side of a DWORD boundary) but it doesn't because it has already been paired with PUSH AX. The sequence therefore takes 5 clocks. If you insert a NOP or any other instruction so that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the sequence will take only 3 clocks. Another way to solve the problem is to make sure that SP is not divisible by 4. Knowing whether SP is divisible by 4 or not in 16 bit mode can be difficult, so the best way to avoid this problem is to use 32 bit mode.
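The clock counts in the PUSH example can be reproduced with a very rough Python model. It is a sketch under strong assumptions and not a description of the real pipeline: the function name `stack_write_clocks` is made up, instructions are given by mnemonic only, every instruction is treated as pairable with its neighbour, pairing is assumed to start at the first instruction in the list, and only the same-DWORD rule (rule 2) is applied. Each PUSH or near CALL is taken to store one 2-byte word after decreasing SP by 2.

```python
def stack_write_clocks(instructions, sp):
    """Estimate clocks for a run of 16-bit PUSH / CALL / NOP instructions.

    Simplified model: instructions are paired two at a time in program
    order, and a pair costs 2 clocks instead of 1 when both of its stack
    writes land in the same DWORD.
    """
    writes_stack = {"PUSH", "CALL"}  # each stores one 2-byte word
    clocks = 0
    for i in range(0, len(instructions), 2):
        pair = instructions[i:i + 2]
        dwords = []
        for name in pair:
            if name in writes_stack:
                sp -= 2                  # SP is decreased before the store
                dwords.append(sp // 4)   # index of the DWORD being written
        conflict = len(dwords) == 2 and dwords[0] == dwords[1]
        clocks += 2 if conflict else 1
    return clocks


original = ["PUSH", "PUSH", "PUSH", "PUSH", "CALL"]
with_nop = ["NOP"] + original

print(stack_write_clocks(original, 0x100))   # 5  (SP divisible by 4)
print(stack_write_clocks(with_nop, 0x100))   # 3  (NOP shifts the pairing)
print(stack_write_clocks(original, 0x102))   # 3  (SP not divisible by 4)
```

The value 0x100 for SP is arbitrary. Any multiple of 4 gives the first two results, and any even value that is not a multiple of 4 gives the third.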
