25.2 Loops in PPro, PII and PIII
In the previous chapter (25.1) I explained how to use convolution and loop unrolling in order to improve pairing in PPlain and PMMX. On the PPro, PII and PIII there is no reason to do this thanks to the out-of-order execution mechanism. But there are other quite difficult problems to take care of, most importantly ifetch boundaries and register read stalls.
I have chosen the same example as in chapter 25.1 for the previous microprocessors: a procedure which reads integers from an array, changes the sign of each integer, and stores the results in another array.
A C language code for this procedure would be:
void ChangeSign (int * A, int * B, int N) {
    int i;
    for (i=0; i<N; i++)  B[i] = -A[i];
}

Translating to assembly, we might write the procedure like this:

Example 2.1:
_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]
        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        CLD
L1:     LODSD
        NEG     EAX
        STOSD
        LOOP    L1
L2:     POP     EDI
        POP     ESI
        RET
_ChangeSign ENDP

This looks like a nice solution, but it is not optimal because it uses the complex instructions LOOP, LODSD and STOSD, which generate many uops. It takes 6-7 clock cycles per iteration if all data are in the level one cache. Avoiding these instructions we get:

Example 2.2:
        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        ALIGN   16
L1:     MOV     EAX, [ESI]      ; len=2, p2rESIwEAX
        ADD     ESI, 4          ; len=3, p01rwESIwF
        NEG     EAX             ; len=2, p01rwEAXwF
        MOV     [EDI], EAX      ; len=2, p4rEAX, p3rEDI
        ADD     EDI, 4          ; len=3, p01rwEDIwF
        DEC     ECX             ; len=1, p01rwECXwF
        JNZ     L1              ; len=2, p1rF
L2:

The comments are interpreted as follows: the MOV EAX,[ESI] instruction is 2 bytes long, and it generates one uop for port 2 that reads ESI and writes to (renames) EAX. This information is needed for analyzing the possible bottlenecks.
Let's first analyze the instruction decoding (chapter 14): One of the instructions generates 2 uops (MOV [EDI],EAX). This instruction must go into decoder D0. There are three decode groups in the loop so it can decode in 3 clock cycles.
Next, let's look at the instruction fetch (chapter 15): If an ifetch boundary prevents the first three instructions from decoding together then there will be three decode groups in the last ifetch block so that the next iteration will have the ifetch block starting at the first instruction where we want it, and we will get a delay only in the first iteration. A worse situation would be a 16-byte boundary and an ifetch boundary in one of the last three instructions. According to the ifetch table, this will generate a delay of 1 clock and cause the next iteration to have its first ifetch block aligned by 16, so that the problem continues through all iterations. The result is a fetch time of 4 clocks per iteration rather than 3. There are two ways to prevent this situation: the first method is to control where the ifetch blocks lie on the first iteration; the second method is to control where the 16-byte boundaries are. The latter method is the easiest. Since the entire loop has only 15 bytes of code you can avoid any 16-byte boundary by aligning the loop entry by 16, as shown above. This will put the entire loop into a single ifetch block so that no further analysis of instruction fetching is needed.
The third problem to look at is register read stalls (chapter 16). No register is read in this loop without being written to at least a few clock cycles before, so there can be no register read stalls.
The fourth analysis is execution (chapter 17). Counting the uops for the different ports we get:
port 0 or 1: 4 uops
port 1 only: 1 uop
port 2: 1 uop
port 3: 1 uop
port 4: 1 uop
Assuming that the uops that can go to either port 0 or 1 are distributed optimally, ports 0 and 1 have 5 uops between them, so the execution time will be 2.5 clocks per iteration.
The last analysis is retirement (chapter 18). The loop contains 8 uops. Since this number is not divisible by 3, the retirement slots will not be used optimally when the jump has to retire in the first slot. The time needed for retirement is the number of uops divided by 3, rounded up to the nearest integer: 8/3 rounded up gives 3 clocks for retirement.
In conclusion, the loop above can execute in 3 clocks per iteration if the loop entry is aligned by 16. I am assuming that the conditional jump is predicted correctly every time except on the exit of the loop (chapter 22.2).
Using the same register for counter and index and letting the counter end at zero (PPro, PII and PIII)
Example 2.3:

        MOV     ECX, [N]
        MOV     ESI, [A]
        MOV     EDI, [B]
        LEA     ESI, [ESI+4*ECX]        ; point to end of array A
        LEA     EDI, [EDI+4*ECX]        ; point to end of array B
        NEG     ECX                     ; -N
        JZ      SHORT L2
        ALIGN   16
L1:     MOV     EAX, [ESI+4*ECX]        ; len=3, p2rESIrECXwEAX
        NEG     EAX                     ; len=2, p01rwEAXwF
        MOV     [EDI+4*ECX], EAX        ; len=3, p4rEAX, p3rEDIrECX
        INC     ECX                     ; len=1, p01rwECXwF
        JNZ     L1                      ; len=2, p1rF
L2:

Here we have reduced the number of uops to 6 by using the same register as counter and index. The base pointers point to the end of the arrays so that the index can count up through negative values to zero.
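In C terms, this transformation can be sketched as follows (the function name is only for illustration; the analysis below refers to the assembly loop above):

void ChangeSign (int * A, int * B, int N) {
    int * endA = A + N;           /* base pointers point to the end of the arrays */
    int * endB = B + N;
    int i;
    for (i = -N; i != 0; i++)     /* index counts up through negative values to zero */
        endB[i] = -endA[i];
}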
Decoding: There are two decode groups in the loop so it will decode in 2 clocks.
Instruction fetch: A loop always takes at least one clock cycle more than the number of 16-byte blocks. Since there are only 11 bytes of code in the loop it is possible to have it all in one ifetch block. By aligning the loop entry by 16 we can make sure that we don't get more than one 16-byte block, so that it is possible to fetch in 2 clocks.
Register read stalls: The ESI and EDI registers are read, but not modified, inside the loop. They will therefore be counted as permanent register reads, but not in the same triplet. The registers EAX and ECX and the flags are modified inside the loop and read before they are written back, so they will cause no permanent register reads. The conclusion is that there are no register read stalls.
Execution:
port 0 or 1: 2 uops
port 1: 1 uop
port 2: 1 uop
port 3: 1 uop
port 4: 1 uop
Execution time: 1.5 clocks.
Retirement:
6 uops = 2 clocks.
Conclusion: this loop takes only 2 clock cycles per iteration.
If you use absolute addresses instead of ESI and EDI then the loop will take 3 clocks, because the 4-byte address fields make the loop too big to be contained in a single 16-byte block.
Unrolling a loop (PPro, PII and PIII)
Doing more than one operation in each iteration of the loop, and doing correspondingly fewer iterations, is called loop unrolling. On previous processors you would unroll loops to get parallel execution by pairing (chapter 25.1). On the PPro, PII and PIII this is not needed because the out-of-order execution mechanism takes care of that. There is no need to use two different registers either, because register renaming takes care of this. The purpose of unrolling here is to reduce the loop overhead per operation.
The following example is the same as example 2.2, but unrolled by 2, which means that you do two operations per iteration and half as many iterations.

Example 2.4:
        MOV     ECX, [N]
        MOV     ESI, [A]
        MOV     EDI, [B]
        SHR     ECX, 1          ; N/2
        JNC     SHORT L1        ; test if N was odd
        MOV     EAX, [ESI]      ; do the odd one first
        ADD     ESI, 4
        NEG     EAX
        MOV     [EDI], EAX
        ADD     EDI, 4
L1:     JECXZ   L3
        ALIGN   16
L2:     MOV     EAX, [ESI]      ; len=2, p2rESIwEAX
        NEG     EAX             ; len=2, p01rwEAXwF
        MOV     [EDI], EAX      ; len=2, p4rEAX, p3rEDI
        MOV     EAX, [ESI+4]    ; len=3, p2rESIwEAX
        NEG     EAX             ; len=2, p01rwEAXwF
        MOV     [EDI+4], EAX    ; len=3, p4rEAX, p3rEDI
        ADD     ESI, 8          ; len=3, p01rwESIwF
        ADD     EDI, 8          ; len=3, p01rwEDIwF
        DEC     ECX             ; len=1, p01rwECXwF
        JNZ     L2              ; len=2, p1rF
L3:

In example 2.2 the loop overhead (i.e. adjusting pointers and counter, and jumping back) was 4 uops and the 'real job' was 4 uops. When unrolling the loop by two you do the 'real job' twice and the overhead once, so you get 12 uops in all. This reduces the overhead from 50% to 33% of the uops. Since the unrolled loop can do only an even number of operations you have to check if N is odd and if so do one operation outside the loop.
Analyzing instruction fetching in this loop we find that a new ifetch block begins in the ADD ESI,8 instruction, forcing it into decoder D0. This makes the loop decode in 5 clock cycles and not 4 as we wanted. We can solve this problem by coding the preceding instruction in a longer version. Change MOV [EDI+4],EAX to:
        MOV     [EDI+9999], EAX ; make instruction with long displacement
        ORG     $-4
        DD      4               ; rewrite displacement to 4

This will force a new ifetch block to begin at the long MOV [EDI+4],EAX instruction, so that decoding time is now down to 4 clocks. The rest of the pipeline can handle 3 uops per clock, so the expected execution time is 4 clocks per iteration, or 2 clocks per operation.
Testing this solution shows that it actually takes a little more: my measurements showed approximately 4.5 clocks per iteration. This is probably because the ROB doesn't find the optimal execution order for the uops, but submits them in a less than optimal order. This problem was not predicted by the analysis; only testing can reveal it. We may help the ROB by doing some of the reordering manually:

Example 2.5:
        ALIGN   16
L2:     MOV     EAX, [ESI]      ; len=2, p2rESIwEAX
        MOV     EBX, [ESI+4]    ; len=3, p2rESIwEBX
        NEG     EAX             ; len=2, p01rwEAXwF
        MOV     [EDI], EAX      ; len=2, p4rEAX, p3rEDI
        ADD     ESI, 8          ; len=3, p01rwESIwF
        NEG     EBX             ; len=2, p01rwEBXwF
        MOV     [EDI+4], EBX    ; len=3, p4rEBX, p3rEDI
        ADD     EDI, 8          ; len=3, p01rwEDIwF
        DEC     ECX             ; len=1, p01rwECXwF
        JNZ     L2              ; len=2, p1rF
L3:

The loop now executes in 4 clocks per iteration. This solution also solves the problem with instruction fetch blocks. The cost is that we need an extra register because we cannot take advantage of register renaming.
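For reference, the overall structure that examples 2.4 and 2.5 implement can be sketched in C like this (the function name is only for illustration):

void ChangeSignUnrolled2 (int * A, int * B, int N) {
    int i = 0;
    if (N & 1) {                 /* N is odd: do the odd one first */
        B[0] = -A[0];
        i = 1;
    }
    for (; i < N; i += 2) {      /* two operations per iteration */
        B[i]   = -A[i];
        B[i+1] = -A[i+1];
    }
}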
Unrolling by more than 2
Loop unrolling is recommended when the loop overhead constitutes a high proportion of the total execution time. In example 2.3 the overhead is only 2 uops, so the gain by unrolling is little, but I will show you how to unroll it anyway, just for the exercise.
The 'real job' is 4 uops and the overhead 2. Unrolling by two we get 2*4+2 = 10 uops. The retirement time will be 10/3, rounded up to an integer, that is 4 clock cycles. This calculation shows that nothing is gained by unrolling this by two. Unrolling by four we get:

Example 2.6:
        MOV     ECX, [N]
        SHL     ECX, 2                  ; number of bytes to handle
        MOV     ESI, [A]
        MOV     EDI, [B]
        ADD     ESI, ECX                ; point to end of array A
        ADD     EDI, ECX                ; point to end of array B
        NEG     ECX                     ; -4*N
        TEST    ECX, 4                  ; test if N is odd
        JZ      SHORT L1
        MOV     EAX, [ESI+ECX]          ; N is odd. do the odd one
        NEG     EAX
        MOV     [EDI+ECX], EAX
        ADD     ECX, 4
L1:     TEST    ECX, 8                  ; test if N/2 is odd
        JZ      SHORT L2
        MOV     EAX, [ESI+ECX]          ; N/2 is odd. do two extra
        NEG     EAX
        MOV     [EDI+ECX], EAX
        MOV     EAX, [ESI+ECX+4]
        NEG     EAX
        MOV     [EDI+ECX+4], EAX
        ADD     ECX, 8
L2:     JECXZ   SHORT L4
        ALIGN   16
L3:     MOV     EAX, [ESI+ECX]          ; len=3, p2rESIrECXwEAX
        NEG     EAX                     ; len=2, p01rwEAXwF
        MOV     [EDI+ECX], EAX          ; len=3, p4rEAX, p3rEDIrECX
        MOV     EAX, [ESI+ECX+4]        ; len=4, p2rESIrECXwEAX
        NEG     EAX                     ; len=2, p01rwEAXwF
        MOV     [EDI+ECX+4], EAX        ; len=4, p4rEAX, p3rEDIrECX
        MOV     EAX, [ESI+ECX+8]        ; len=4, p2rESIrECXwEAX
        MOV     EBX, [ESI+ECX+12]       ; len=4, p2rESIrECXwEBX
        NEG     EAX                     ; len=2, p01rwEAXwF
        MOV     [EDI+ECX+8], EAX        ; len=4, p4rEAX, p3rEDIrECX
        NEG     EBX                     ; len=2, p01rwEBXwF
        MOV     [EDI+ECX+12], EBX       ; len=4, p4rEBX, p3rEDIrECX
        ADD     ECX, 16                 ; len=3, p01rwECXwF
        JS      L3                      ; len=2, p1rF
L4:

The ifetch blocks are where we want them. Decode time is 6 clocks.
Register read stalls are a problem here because ECX has retired near the end of the loop and we need to read ESI, EDI, and ECX. The instructions have been reordered in order to avoid reading ESI near the bottom, so that we can avoid a register read stall. In other words, the reason for reordering the instructions and using an extra register (EBX) is not the same as in the previous example.
There are 18 uops and the loop executes in 6 clocks per iteration, or 1.5 clocks per operation.
It may be tempting to unroll loops by a high factor in order to get the maximum speed. But since the loop overhead can in most cases be reduced to something like one clock cycle per iteration, unrolling the loop by 4 rather than by 2 would save only 1/4 clock cycle per operation, which is hardly worth the effort. Only if the loop overhead is high compared to the rest of the loop, and N is very big, should you think of unrolling by 4. Unrolling by more than 4 does not make sense.
A drawback of excessive loop unrolling is the extra code needed to handle the remaining elements when N is not divisible by the unrolling factor R. Using an unrolling factor which is not a power of 2 makes the calculation of N MODULO R quite difficult, and is generally not recommended unless N is known to be divisible by R. Example 1.14 shows how to unroll by 3.
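The usual pattern for handling the leftover elements can be sketched in C like this (R, the function name and the loop body are only for illustration):

#define R 4                              /* unrolling factor, assumed a power of 2 */

void ProcessUnrolled (int * A, int * B, int N) {
    int i = 0;
    for (; i < N % R; i++)               /* handle the N MODULO R leftover elements first */
        B[i] = -A[i];
    for (; i < N; i += R) {              /* main loop does R operations per iteration */
        B[i]   = -A[i];
        B[i+1] = -A[i+1];
        B[i+2] = -A[i+2];
        B[i+3] = -A[i+3];
    }
}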
Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers (PPro, PII and PIII)
It is sometimes possible to handle four bytes at a time in the same 32 bit register. The following example adds 2 to all elements of an array of bytes: Example 2.7:
        MOV     ESI, [A]        ; address of byte array
        MOV     ECX, [N]        ; number of elements in byte array
        JECXZ   L2
        ALIGN   16
        DB      7 DUP (90H)     ; 7 NOP's for controlling alignment
L1:     MOV     EAX, [ESI]      ; read four bytes
        MOV     EBX, EAX        ; copy into EBX
        AND     EAX, 7F7F7F7FH  ; get lower 7 bits of each byte in EAX
        XOR     EBX, EAX        ; get the highest bit of each byte
        ADD     EAX, 02020202H  ; add desired value to all four bytes
        XOR     EBX, EAX        ; combine bits again
        MOV     [ESI], EBX      ; store result
        ADD     ESI, 4          ; increment pointer
        SUB     ECX, 4          ; decrement loop counter
        JA      L1              ; loop
L2:

Note that I have masked out the highest bit of each byte to avoid a possible carry from one byte into the next when adding. I am using XOR rather than ADD when putting the high bits back in, to avoid carry. The array should of course be aligned by 4.
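In C terms, the bit manipulation in this loop corresponds to the following sketch (the function name is only for illustration):

#include <stdint.h>

/* Add 2 to each of the four bytes packed in a 32-bit word without
   letting carries cross byte boundaries. */
uint32_t Add2ToEachByte (uint32_t x) {
    uint32_t low7 = x & 0x7F7F7F7FU;     /* lower 7 bits of each byte */
    uint32_t high = x ^ low7;            /* the highest bit of each byte */
    low7 += 0x02020202U;                 /* add 2 to all four bytes */
    return high ^ low7;                  /* put the high bits back with XOR */
}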
The loop in example 2.7 should ideally take 4 clocks per iteration, but it takes somewhat more due to the dependency chain and difficult reordering. On PII and PIII you can do the same more effectively using MMX registers, as sketched below.
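A sketch using MMX intrinsics rather than assembly might look like this (illustrative only; it assumes N is divisible by 8 and the array is aligned by 8):

#include <mmintrin.h>
#include <stddef.h>

/* Add 2 to every byte, 8 bytes per iteration, using PADDB. */
void Add2MMX (unsigned char * p, size_t n) {
    __m64 two = _mm_set1_pi8(2);
    size_t i;
    for (i = 0; i < n; i += 8) {
        __m64 v = *(__m64 *)(p + i);
        *(__m64 *)(p + i) = _mm_add_pi8(v, two);   /* byte-wise add, no carry between bytes */
    }
    _mm_empty();                                   /* EMMS */
}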
The next example finds the length of a zero-terminated string by searching for the first zero byte. It is much faster than using REPNE SCASB:

Example 2.8:
_strlen PROC NEAR
        PUSH    EBX
        MOV     EAX, [ESP+8]            ; get pointer to string
        LEA     EDX, [EAX+3]            ; pointer+3 used in the end
L1:     MOV     EBX, [EAX]              ; read first 4 bytes
        ADD     EAX, 4                  ; increment pointer
        LEA     ECX, [EBX-01010101H]    ; subtract 1 from each byte
        NOT     EBX                     ; invert all bytes
        AND     ECX, EBX                ; and these two
        AND     ECX, 80808080H          ; test all sign bits
        JZ      L1                      ; no zero bytes, continue loop
        MOV     EBX, ECX
        SHR     EBX, 16
        TEST    ECX, 00008080H          ; test first two bytes
        CMOVZ   ECX, EBX                ; shift right if not in the first 2 bytes
        LEA     EBX, [EAX+2]
        CMOVZ   EAX, EBX
        SHL     CL, 1                   ; use carry flag to avoid branch
        SBB     EAX, EDX                ; compute length
        POP     EBX
        RET
_strlen ENDP

This loop takes 3 clocks for each iteration testing 4 bytes. The string should of course be aligned by 4. The code may read past the end of the string, so the string should not be placed at the end of a segment.
Handling 4 bytes simultaneously can be quite difficult. This code uses a formula which generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all bytes (in the LEA ECX instruction). I have not masked out the highest bit of each byte before subtracting, as I did in example 2.7, so the subtraction may generate a borrow to the next byte, but only if it is zero, and this is exactly the situation where we don't care what the next byte is, because we are searching forwards for the first zero. If we were searching backwards then we would have to re-read the dword after detecting a zero, and then test all four bytes to find the last zero, or use BSWAP to reverse the order of the bytes. If you want to search for a byte value other than zero, then you may XOR all four bytes with the value you are searching for, and then use the method above to search for zero.
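Expressed in C, the test is a sketch of the following form (the function name is only for illustration; the constants are the ones used in the assembly above):

#include <stdint.h>

/* The result has the high bit set in a byte position if, and only if, the
   corresponding byte of x is zero (apart from borrows, which can only come
   from a zero byte below it and therefore don't matter when searching
   forwards for the first zero). */
uint32_t HasZeroByte (uint32_t x) {
    return (x - 0x01010101U) & ~x & 0x80808080U;
}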
Loops with MMX instructions (PII and PIII)
Using MMX instructions we can compare 8 bytes in one operation:

Example 2.9:
_strlen PROC NEAR
        PUSH    EBX
        MOV     EAX, [ESP+8]
        LEA     EDX, [EAX+7]
        PXOR    MM0, MM0
L1:     MOVQ    MM1, [EAX]      ; len=3 p2rEAXwMM1
        ADD     EAX, 8          ; len=3 p01rEAX
        PCMPEQB MM1, MM0        ; len=3 p01rMM0rMM1
        MOVD    EBX, MM1        ; len=3 p01rMM1wEBX
        PSRLQ   MM1, 32         ; len=4 p1rMM1
        MOVD    ECX, MM1        ; len=3 p01rMM1wECX
        OR      ECX, EBX        ; len=2 p01rECXrEBXwF
        JZ      L1              ; len=2 p1rF
        MOVD    ECX, MM1
        TEST    EBX, EBX
        CMOVZ   EBX, ECX
        LEA     ECX, [EAX+4]
        CMOVZ   EAX, ECX
        MOV     ECX, EBX
        SHR     ECX, 16
        TEST    BX, BX
        CMOVZ   EBX, ECX
        LEA     ECX, [EAX+2]
        CMOVZ   EAX, ECX
        SHR     BL, 1
        SBB     EAX, EDX
        EMMS
        POP     EBX
        RET
_strlen ENDP

This loop has 7 uops for port 0 and 1, which gives an average execution time of 3.5 clocks per iteration. The measured time is 3.8 clocks, which shows that the ROB handles the situation reasonably well despite a dependency chain that is 6 uops long. Testing 8 bytes in less than 4 clocks is very much faster than REPNE SCASB.
Loops with floating point instructions (PPro, PII and PIII)
The methods for optimizing floating point loops are basically the same as for integer loops, but you should be more aware of dependency chains because of the long latencies of instruction execution.
Consider the C language code:
int i, n;  double * X;  double * Y;  double DA;
for (i=0; i<n; i++)  Y[i] = Y[i] - DA * X[i];

This piece of code (called DAXPY) has been studied extensively because it is the key to solving linear equations.

Example 2.10:
DSIZE   = 8                     ; data size (4 or 8)
        MOV     ECX, [N]        ; number of elements
        MOV     ESI, [X]        ; pointer to X
        MOV     EDI, [Y]        ; pointer to Y
        JECXZ   L2              ; test for N = 0
        FLD     DSIZE PTR [DA]  ; load DA outside loop
        ALIGN   16
        DB      2 DUP (90H)     ; 2 NOP's for alignment
L1:     FLD     DSIZE PTR [ESI] ; len=3 p2rESIwST0
        ADD     ESI, DSIZE      ; len=3 p01rESI
        FMUL    ST, ST(1)       ; len=2 p0rST0rST1
        FSUBR   DSIZE PTR [EDI] ; len=3 p2rEDI, p0rST0
        FSTP    DSIZE PTR [EDI] ; len=3 p4rST0, p3rEDI
        ADD     EDI, DSIZE      ; len=3 p01rEDI
        DEC     ECX             ; len=1 p01rECXwF
        JNZ     L1              ; len=2 p1rF
        FSTP    ST              ; discard DA
L2:

The dependency chain is 10 clock cycles long, but the loop takes only 4 clocks per iteration because it can begin a new operation before the previous one is finished. The purpose of the alignment is to prevent a 16-byte boundary in the last ifetch block.
Example 2.11:
DSIZE   = 8                             ; data size (4 or 8)
        MOV     ECX, [N]                ; number of elements
        MOV     ESI, [X]                ; pointer to X
        MOV     EDI, [Y]                ; pointer to Y
        LEA     ESI, [ESI+DSIZE*ECX]    ; point to end of array
        LEA     EDI, [EDI+DSIZE*ECX]    ; point to end of array
        NEG     ECX                     ; -N
        JZ      SHORT L2                ; test for N = 0
        FLD     DSIZE PTR [DA]          ; load DA outside loop
        ALIGN   16
L1:     FLD     DSIZE PTR [ESI+DSIZE*ECX]   ; len=3 p2rESIrECXwST0
        FMUL    ST, ST(1)                   ; len=2 p0rST0rST1
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]   ; len=3 p2rEDIrECX, p0rST0
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]   ; len=3 p4rST0, p3rEDIrECX
        INC     ECX                         ; len=1 p01rECXwF
        JNZ     L1                          ; len=2 p1rF
        FSTP    ST                          ; discard DA
L2:

Here we have used the same trick as in example 2.3. Ideally, this loop should take 3 clocks, but measurements say approximately 3.5 clocks due to the long dependency chain. Unrolling the loop doesn't save much.
Loops with XMM instructions (PIII)
The XMM instructions on the PIII allow you to operate on four single precision floating point numbers in parallel. The operands must be aligned by 16.
The DAXPY algorithm is not very well suited for XMM instructions because the precision is poor, it may not be possible to align the operands by 16, and you need some extra code if the number of operations is not a multiple of four. I am showing the code here anyway, just to give an example of a loop with XMM instructions:

Example 2.12:
        MOV     ECX, [N]        ; number of elements
        MOV     ESI, [X]        ; pointer to X
        MOV     EDI, [Y]        ; pointer to Y
        SHL     ECX, 2
        ADD     ESI, ECX        ; point to end of X
        ADD     EDI, ECX        ; point to end of Y
        NEG     ECX             ; -4*N
        MOV     EAX, [DA]       ; load DA outside loop
        XOR     EAX, 80000000H  ; change sign of DA
        PUSH    EAX
        MOVSS   XMM1, [ESP]     ; -DA
        ADD     ESP, 4
        SHUFPS  XMM1, XMM1, 0   ; copy -DA to all four positions
        CMP     ECX, -16
        JG      L2
L1:     MOVAPS  XMM0, [ESI+ECX]     ; len=4 2*p2rESIrECXwXMM0
        ADD     ECX, 16             ; len=3 p01rwECXwF
        MULPS   XMM0, XMM1          ; len=3 2*p0rXMM0rXMM1
        CMP     ECX, -16            ; len=3 p01rECXwF
        ADDPS   XMM0, [EDI+ECX-16]  ; len=5 2*p2rEDIrECX, 2*p1rXMM0
        MOVAPS  [EDI+ECX-16], XMM0  ; len=5 2*p4rXMM0, 2*p3rEDIrECX
        JNG     L1                  ; len=2 p1rF
L2:     JECXZ   L4                  ; check if finished
        MOVAPS  XMM0, [ESI+ECX]     ; 1-3 operations missing, do 4 more
        MULPS   XMM0, XMM1
        ADDPS   XMM0, [EDI+ECX]
        CMP     ECX, -8
        JG      L3
        MOVLPS  [EDI+ECX], XMM0     ; store two more results
        ADD     ECX, 8
        MOVHLPS XMM0, XMM0
L3:     JECXZ   L4
        MOVSS   [EDI+ECX], XMM0     ; store one more result
L4:

The L1 loop takes 5-6 clocks for 4 operations. DA is negated before the loop (by XORing its sign bit) so that Y[i] - DA*X[i] can be computed as Y[i] + (-DA)*X[i] with MULPS and ADDPS. The instructions that modify ECX have been placed before and after the MULPS XMM0, XMM1 instruction in order to avoid the register read port stall that would otherwise be generated by reading the two parts of the XMM1 register together with ESI or EDI in the RAT. The extra code after L2 takes care of the situation where N is not divisible by 4. Note that this code may read past the end of X and Y. This may delay the last operation if the extra memory positions read do not contain normal floating point numbers. If possible, put in some dummy extra data to make the number of operations divisible by 4 and leave out the extra code after L2.
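The sign change of DA via an integer XOR corresponds to this C sketch (the function name is only for illustration):

#include <string.h>

/* Negate a single precision float by flipping its IEEE 754 sign bit,
   as the XOR EAX, 80000000H instruction above does. */
float NegateFloat (float x) {
    unsigned int u;
    memcpy(&u, &x, sizeof u);   /* reinterpret the float's bits */
    u ^= 0x80000000U;           /* flip the sign bit */
    memcpy(&x, &u, sizeof x);
    return x;
}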