How to optimize for the Pentium family of microprocessors

10.2 Imperfect pairing

Date: 2000-04-01 14:00

There are situations where the two instructions in a pair do not execute simultaneously, or overlap only partially in time. They should still be considered a pair, though, because the first instruction executes in the U-pipe and the second in the V-pipe. No subsequent instruction can start to execute before both instructions in the imperfect pair have finished.

Imperfect pairing will happen in the following cases:

1. If the second instruction suffers an AGI stall (see chapter 9).

2. Two instructions cannot access the same DWORD of memory simultaneously. The following examples assume that ESI is divisible by 4:

MOV AL, [ESI] / MOV BL, [ESI+1]

The two operands are within the same DWORD, so they cannot execute simultaneously. The pair takes 2 clock cycles.

MOV AL, [ESI+3] / MOV BL, [ESI+4]

Here the two operands are on each side of a DWORD boundary, so they pair perfectly, and take only one clock cycle.

3. Rule 2 is extended to the case where bits 2-4 are the same in the two addresses (cache bank conflict). For DWORD addresses this means that the difference between the two addresses should not be divisible by 32. Examples:

MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing

MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing
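
Rules 2 and 3 can be checked mechanically. The following is a minimal Python sketch, assuming that each operand is aligned and fits within one DWORD; the function name `conflicts` is hypothetical and not part of any tool. Two addresses in the same DWORD necessarily have the same bits 2-4, so a single comparison covers both rules.

```python
def conflicts(addr1: int, addr2: int) -> bool:
    """Return True if two memory operands would pair imperfectly under rules 2 and 3.

    Illustrative model only: assumes each operand is aligned and fits in one DWORD.
    """
    bank_mask = 0b11100  # bits 2-4 of the address
    return (addr1 & bank_mask) == (addr2 & bank_mask)


ESI = 0x1000  # any value divisible by 4 (assumed for the example)
print(conflicts(ESI, ESI + 1))       # True:  same DWORD
print(conflicts(ESI + 3, ESI + 4))   # False: opposite sides of a DWORD boundary
print(conflicts(ESI, ESI + 32000))   # True:  difference divisible by 32
print(conflicts(ESI, ESI + 32004))   # False: difference not divisible by 32
```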

Pairable integer instructions which do not access memory take one clock cycle to execute, except for mispredicted jumps. MOV instructions to or from memory also take only one clock cycle if the data area is in the cache and properly aligned. There is no speed penalty for using complex addressing modes such as scaled index registers.

A pairable integer instruction which reads from memory, does some calculation, and stores the result in a register or flags, takes 2 clock cycles. (read/modify instructions).

A pairable integer instruction which reads from memory, does some calculation, and writes the result back to the memory, takes 3 clock cycles. (read/modify/write instructions).

4. If a read/modify/write instruction is paired with a read/modify or read/modify/write instruction, then they will pair imperfectly.

The number of clock cycles used is given in the following table:

First instruction       Second instruction
                        MOV or register only    read/modify    read/modify/write
MOV or register only    1                       2              3
read/modify             2                       2              3
read/modify/write       3                       4              5

Example:

ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles

ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles
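
The table can be written as a lookup, which makes it easy to compare the two possible orders of a pair. This is a sketch only: the class labels `"mov"`, `"rm"` and `"rmw"` and the helper names are made up for illustration, and `best_order` assumes that the two instructions are independent and pairable in either order.

```python
# Clock cycles for a pair, keyed by (class of first instruction, class of second).
# "mov" = MOV or register only, "rm" = read/modify, "rmw" = read/modify/write.
PAIR_CLOCKS = {
    ("mov", "mov"): 1, ("mov", "rm"): 2, ("mov", "rmw"): 3,
    ("rm", "mov"): 2,  ("rm", "rm"): 2,  ("rm", "rmw"): 3,
    ("rmw", "mov"): 3, ("rmw", "rm"): 4, ("rmw", "rmw"): 5,
}


def pair_clocks(first: str, second: str) -> int:
    """Clock cycles used by a pair with the given instruction classes."""
    return PAIR_CLOCKS[(first, second)]


def best_order(a: str, b: str) -> tuple:
    """Return the ordering of two instruction classes that uses fewer clocks."""
    return (a, b) if pair_clocks(a, b) <= pair_clocks(b, a) else (b, a)


print(pair_clocks("rmw", "rm"))  # 4: ADD [mem1], EAX / ADD EBX, [mem2]
print(pair_clocks("rm", "rmw"))  # 3: ADD EBX, [mem2] / ADD [mem1], EAX
print(best_order("rmw", "rm"))   # ('rm', 'rmw')
```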

5. When two paired instructions both take extra time due to cache misses, misalignment, or jump misprediction, then the pair will take more time than either instruction alone, but less than the sum of the two.

6. A pairable floating point instruction followed by FXCH will form an imperfect pair if the next instruction is not a floating point instruction.

In order to avoid imperfect pairing you have to know which instructions go into the U-pipe, and which into the V-pipe. You can find this out by looking backwards through your code and searching for instructions which are unpairable, pairable only in one of the pipes, or unable to pair because of one of the rules above.

Imperfect pairing can often be avoided by reordering your instructions. Example:

L1: MOV EAX,[ESI]
    MOV EBX,[ESI]
    INC ECX

Here the two MOV instructions form an imperfect pair because they both access the same memory location, and the sequence takes 3 clock cycles. You can improve the code by reordering the instructions so that INC ECX pairs with one of the MOV instructions.
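
The effect of the reordering can be illustrated with a small model. The sketch below is a deliberate simplification, not a Pentium simulator: it pairs instructions greedily in program order, assumes every instruction is pairable in either pipe and free of register dependencies, and applies only the memory conflict rule (same bits 2-4 in the two addresses). The names `Instr` and `clocks` and the value of `ESI` are hypothetical.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str
    addr: Optional[int] = None  # memory address accessed, None if register only


def clocks(seq) -> int:
    """Pair instructions greedily in program order and count clock cycles."""
    total, i = 0, 0
    while i < len(seq):
        if i + 1 < len(seq):
            a, b = seq[i], seq[i + 1]
            both_mem = a.addr is not None and b.addr is not None
            conflict = both_mem and (a.addr & 0b11100) == (b.addr & 0b11100)
            total += 2 if conflict else 1  # an imperfect pair takes 2 clocks
            i += 2
        else:
            total += 1  # last instruction executes alone
            i += 1
    return total


ESI = 0x2000  # assumed address, divisible by 4
original = [Instr("MOV EAX,[ESI]", ESI), Instr("MOV EBX,[ESI]", ESI), Instr("INC ECX")]
reordered = [Instr("MOV EAX,[ESI]", ESI), Instr("INC ECX"), Instr("MOV EBX,[ESI]", ESI)]
print(clocks(original))   # 3
print(clocks(reordered))  # 2
```

In the reordered sequence the last MOV is counted as executing alone; in real code it may pair with whatever instruction follows.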

L2: MOV EAX,OFFSET A
    XOR EBX,EBX
    INC EBX
    MOV ECX,[EAX]
    JMP L1

The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter instruction has an AGI stall. The sequence takes 4 clocks. If you insert a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1 instead, then the sequence takes only 3 clocks.
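
The clock counts for this example can be reproduced with a small model of the AGI rule. This is a sketch under stated assumptions: the pairs are supplied by hand rather than derived from the pairing rules, the AGI rule is taken to be "one extra clock if an address register was written in the immediately preceding clock cycle" (see chapter 9), and all register writes of a pair are attributed to the pair's final clock. The function name `clocks_with_agi` is hypothetical.

```python
def clocks_with_agi(pairs) -> int:
    """Count clock cycles for a list of pairs, adding one clock per AGI stall.

    Each instruction is a tuple: (text, registers_written, address_registers_read).
    """
    clock = 0
    last_write = {}  # register -> clock cycle in which it was last written
    for pair in pairs:
        clock += 1
        stall = any(last_write.get(reg) == clock - 1
                    for _, _, addr_regs in pair
                    for reg in addr_regs)
        if stall:
            clock += 1  # the whole pair waits for the address register
        for _, written, _ in pair:
            for reg in written:
                last_write[reg] = clock
    return clock


mov_eax = ("MOV EAX,OFFSET A", ["EAX"], [])
xor_ebx = ("XOR EBX,EBX", ["EBX"], [])
inc_ebx = ("INC EBX", ["EBX"], [])
mov_ecx = ("MOV ECX,[EAX]", ["ECX"], ["EAX"])
jmp = ("JMP L1", [], [])
nop = ("NOP", [], [])

print(clocks_with_agi([[mov_eax, xor_ebx], [inc_ebx, mov_ecx], [jmp]]))       # 4
print(clocks_with_agi([[mov_eax, xor_ebx], [inc_ebx, nop], [mov_ecx, jmp]]))  # 3
```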

The next example is in 16 bit mode, assuming that SP is divisible by 4:

L3: PUSH AX
    PUSH BX
    PUSH CX
    PUSH DX
    CALL FUNC

Here the PUSH instructions form two imperfect pairs, because both operands in each pair go into the same DWORD of memory. PUSH BX could pair perfectly with PUSH CX (because they go on each side of a DWORD boundary), but it doesn't because it has already been paired with PUSH AX. The sequence therefore takes 5 clocks. If you insert a NOP or any other instruction so that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the sequence will take only 3 clocks. Another way to solve the problem is to make sure that SP is not divisible by 4. Knowing whether SP is divisible by 4 or not in 16 bit mode can be difficult, so the best way to avoid this problem is to use 32 bit mode.
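
To see which PUSH instructions collide, it helps to write out the addresses they store to. The sketch below assumes that each 16-bit PUSH decrements SP by 2 and then stores at the new SP; the starting values of SP and the helper names `push_addresses` and `same_dword` are chosen only for illustration.

```python
def push_addresses(sp: int, count: int, size: int = 2) -> list:
    """Addresses written by `count` consecutive PUSHes of `size` bytes each."""
    return [sp - size * (k + 1) for k in range(count)]


def same_dword(a: int, b: int) -> bool:
    """True if two addresses fall within the same aligned DWORD."""
    return a // 4 == b // 4


# SP divisible by 4: AX/BX and CX/DX share a DWORD, BX/CX do not.
ax, bx, cx, dx = push_addresses(0x0100, 4)
print(same_dword(ax, bx), same_dword(bx, cx), same_dword(cx, dx))  # True False True

# SP not divisible by 4: the original pairs AX/BX and CX/DX no longer collide.
ax, bx, cx, dx = push_addresses(0x0102, 4)
print(same_dword(ax, bx), same_dword(cx, dx))  # False False
```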
