24. Scheduling floating point code (PPlain and PMMX)
Floating point instructions cannot pair the way integer instructions can, except for one special case, defined by the following rules:
While floating point instructions in general cannot be paired, many can be pipelined, i.e. one instruction can begin before the previous instruction has finished. Example:
FADD ST(1),ST(0) ; clock cycle 1-3 FADD ST(2),ST(0) ; clock cycle 2-4 FADD ST(3),ST(0) ; clock cycle 3-5 FADD ST(4),ST(0) ; clock cycle 4-6
Obviously, two instructions cannot overlap if the second instruction needs the result of the first. Since almost all floating point instructions involve the top of stack register, ST(0), there are seemingly not very many possibilities for making an instruction independent of the result of previous instructions. The solution to this problem is register renaming. The FXCH instruction does not in reality swap the contents of two registers, it only swaps their names. Instructions which push or pop the register stack also work by renaming. Floating point register renaming has been highly optimized on the Pentiums so that a register may be renamed while in use. Register renaming never causes stalls - it is even possible to rename a register more than once in the same clock cycle, as for example when you pair FLD or FCOMPP with FXCH.
By the proper use of FXCH instructions you may obtain a lot of overlapping in your floating point code. Example:
FLD [a1] ; clock cycle 1 FADD [a2] ; clock cycle 2-4 FLD [b1] ; clock cycle 3 FADD [b2] ; clock cycle 4-6 FLD [c1] ; clock cycle 5 FADD [c2] ; clock cycle 6-8 FXCH ST(2) ; clock cycle 6 FADD [a3] ; clock cycle 7-9 FXCH ST(1) ; clock cycle 7 FADD [b3] ; clock cycle 8-10 FXCH ST(2) ; clock cycle 8 FADD [c3] ; clock cycle 9-11 FXCH ST(1) ; clock cycle 9 FADD [a4] ; clock cycle 10-12 FXCH ST(2) ; clock cycle 10 FADD [b4] ; clock cycle 11-13 FXCH ST(1) ; clock cycle 11 FADD [c4] ; clock cycle 12-14 FXCH ST(2) ; clock cycle 12In the above example we are interleaving three independent threads. Each FADD takes 3 clock cycles, and we can start a new FADD in each clock cycle. When we have started an FADD in the 'a' thread we have time to start two new FADD instructions in the 'b' and 'c' threads before returning to the 'a' thread, so every third FADD belongs to the same thread. We are using FXCH instructions every time to get the register that belongs to the desired thread into ST(0). As you can see in the example above, this generates a regular pattern, but note well that the FXCH instructions repeat with a period of two while the threads have a period of three. This can be quite confusing, so you have to 'play computer' in order to know which registers are where.
All versions of the instructions FADD, FSUB, FMUL, and FILD take 3 clock cycles and are able to overlap, so that these instructions may be scheduled using the method described above. Using a memory operand does not take more time than a register operand if the memory operand is in the level 1 cache and properly aligned.
By now you must be used to rules having exceptions, and the overlapping rule is no exception: You cannot start an FMUL instruction one clock cycle after another FMUL instruction, because the FMUL circuitry is not perfectly pipelined. It is recommended that you put another instruction in between two FMUL's. Example:
FLD [a1] ; clock cycle 1 FLD [b1] ; clock cycle 2 FLD [c1] ; clock cycle 3 FXCH ST(2) ; clock cycle 3 FMUL [a2] ; clock cycle 4-6 FXCH ; clock cycle 4 FMUL [b2] ; clock cycle 5-7 (stall) FXCH ST(2) ; clock cycle 5 FMUL [c2] ; clock cycle 7-9 (stall) FXCH ; clock cycle 7 FSTP [a3] ; clock cycle 8-9 FXCH ; clock cycle 10 (unpaired) FSTP [b3] ; clock cycle 11-12 FSTP [c3] ; clock cycle 13-14Here you have a stall before FMUL [b2] and before FMUL [c2] because another FMUL started in the preceding clock cycle. You can improve this code by putting FLD instructions in between the FMUL's:
FLD [a1] ; clock cycle 1 FMUL [a2] ; clock cycle 2-4 FLD [b1] ; clock cycle 3 FMUL [b2] ; clock cycle 4-6 FLD [c1] ; clock cycle 5 FMUL [c2] ; clock cycle 6-8 FXCH ST(2) ; clock cycle 6 FSTP [a3] ; clock cycle 7-8 FSTP [b3] ; clock cycle 9-10 FSTP [c3] ; clock cycle 11-12
In other cases you may put FADD, FSUB, or anything else in between FMUL's to avoid the stalls.
Overlapping floating point instructions requires of course that you have some independent threads that you can interleave. If you have only one big formula to execute, then you may compute parts of the formula in parallel to achieve overlapping. If, for example, you want to add six numbers, then you may split the operations into two threads with three numbers in each, and add the two threads in the end:
FLD [a] ; clock cycle 1 FADD [b] ; clock cycle 2-4 FLD [c] ; clock cycle 3 FADD [d] ; clock cycle 4-6 FXCH ; clock cycle 4 FADD [e] ; clock cycle 5-7 FXCH ; clock cycle 5 FADD [f] ; clock cycle 7-9 (stall) FADD ; clock cycle 10-12 (stall)
Here we have a one clock stall before FADD [f] because it is waiting for the result of FADD [d] and a two clock stall before the last FADD because it is waiting for the result of FADD [f]. The latter stall can be hidden by filling in some integer instructions, but the first stall can not because an integer instruction at this place would make the FXCH pair imperfectly.
The first stall can be avoided by having three threads rather than two, but that would cost an extra FLD so we do not save anything by having three threads rather than two unless there are at least eight numbers to add.
Not all floating point instructions can overlap. And some floating point instructions can overlap more subsequent integer instructions than subsequent floating point instructions. The FDIV instruction, for example, takes 39 clock cycles. All but the first clock cycle can overlap with integer instructions, but only the last two clock cycles can overlap with floating point instructions. Example:
FDIV ; clock cycle 1-39 (U-pipe) FXCH ; clock cycle 1-2 (V-pipe, imperfect pairing) SHR EAX,1 ; clock cycle 3 (U-pipe) INC EBX ; clock cycle 3 (V-pipe) CMC ; clock cycle 4-5 (non-pairable) FADD [x] ; clock cycle 38-40 (U-pipe, waiting while FPU busy) FXCH ; clock cycle 38 (V-pipe) FMUL [y] ; clock cycle 40-42 (U-pipe, waiting for result of FDIV)The first FXCH pairs with the FDIV, but takes an extra clock cycle because it is not followed by a floating point instruction. The SHR / INC pair starts before the FDIV is finished, but has to wait for the FXCH to finish. The FADD has to wait till clock 38 because new floating point instructions can only execute during the last two clock cycles of the FDIV. The second FXCH pairs with the FADD. The FMUL has to wait for the FDIV to finish because it uses the result of the division.
If you have nothing else to put in after a floating point instruction with a large integer overlap, such as FDIV or FSQRT, then you may put in a dummy read from an address which you expect to need later in the program to make sure it is in the level one cache. Example:
FDIV QWORD PTR [EBX] CMP [ESI],ESI FMUL QWORD PTR [ESI]
Here we use the integer overlap to pre-load the value at [ESI] into the cache while the FDIV is being computed (we don't care what the result of the CMP is).
Chapter 28 gives a complete listing of floating point instructions, and what they can pair or overlap with.
There is no penalty for using a memory operand on floating point instuctions because the arithmetic unit is one step later in the pipeline than the read unit. The tradeoff of this comes when you store floating point data to memory. The FST or FSTP instruction with a memory operand takes two clock cycles in the execution stage, but it needs the data one clock earlier so you will get a one clock stall if the value to store is not ready one clock cycle in advance. This is analogous to an AGI stall. Example:
FLD [a1] ; clock cycle 1 FADD [a2] ; clock cycle 2-4 FLD [b1] ; clock cycle 3 FADD [b2] ; clock cycle 4-6 FXCH ; clock cycle 4 FSTP [a3] ; clock cycle 6-7 FSTP [b3] ; clock cycle 8-9
The FSTP [a3] stalls for one clock cycle because the result of FADD [a2] is not ready in the preceding clock cycle. In many cases you cannot hide this type of stall without scheduling your floating point code into four threads or putting some integer instructions in between. The two clock cycles in the execution stage of the FST(P) instruction cannot pair or overlap with any subsequent instructions.
Instructions with integer operands such as FIADD, FISUB, FIMUL, FIDIV, FICOM may be split up into simpler operations in order to improve overlapping. Example:
FILD [a] ; clock cycle 1-3 FIMUL [b] ; clock cycle 4-9
Split up into:
FILD [a] ; clock cycle 1-3 FILD [b] ; clock cycle 2-4 FMUL ; clock cycle 5-7
In this example, you save two clocks by overlapping the two FILD instructions.