br> 28.3 MMX instructions (PMMX)
A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX multiply instructions which take 3. MMX multiply instructions can be overlapped and pipelined to yield a throughput of one multiplication per clock cycle.
The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX (but a possible small penalty on the PII and PIII).
There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes when you store data from an MMX register to memory or to a 32 bit register: The data have to be ready one clock cycle in advance. This is analogous to the floating point store instructions.
All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in chapter 10.