15. Instruction fetch (PPro, PII and PIII)
The code is fetched in aligned 16-byte chunks from the code cache and placed in the double buffer, which is so called because it can contain two such chunks. The code is then taken from the double buffer and fed to the decoders in blocks which are usually 16 bytes long, but not necessarily aligned by 16. I will call these blocks ifetch blocks (instruction fetch blocks). If an ifetch block crosses a 16-byte boundary in the code, then it has to take code from both chunks in the double buffer. So the purpose of the double buffer is to allow instruction fetching across 16-byte boundaries.
The double buffer can fetch one 16-byte chunk per clock cycle and can generate one ifetch block per clock cycle. The ifetch blocks are usually 16 bytes long, but can be shorter if there is a predicted jump in the block. (See chapter 22 about jump prediction).
Unfortunately, the double buffer is not big enough to handle fetches around jumps without delay. If the ifetch block that contains the jump instruction crosses a 16-byte boundary, then the double buffer needs to keep two consecutive aligned 16-byte chunks of code in order to generate it. If the first instruction after the jump crosses a 16-byte boundary, then the double buffer needs to load two new 16-byte chunks of code before a valid ifetch block can be generated. This means that, in the worst case, the decoding of the first instruction after a jump can be delayed by two clock cycles. You get one penalty for a 16-byte boundary in the ifetch block containing the jump instruction, and one penalty for a 16-byte boundary in the first instruction after the jump. You can get a bonus if you have more than one decode group in the ifetch block that contains the jump, because this gives the double buffer extra time to fetch one or two 16-byte chunks of code in advance for the instructions after the jump. The bonuses can compensate for the penalties according to the table below. If the double buffer has fetched only one 16-byte chunk of code after the jump, then the first ifetch block after the jump will be identical to this chunk, that is, aligned to a 16-byte boundary. In other words, the first ifetch block after the jump will not begin at the first instruction, but at the nearest preceding address divisible by 16. If the double buffer has had time to load two 16-byte chunks, then the new ifetch block can cross a 16-byte boundary and begin at the first instruction after the jump. These rules are summarized in the following table:
Number of decode groups in ifetch block containing jump | 16-byte boundary in this ifetch block | 16-byte boundary in first instruction after jump | decoder delay | alignment of first ifetch block after jump
1         | 0 | 0 | 0 | by 16
1         | 0 | 1 | 1 | to instruction
1         | 1 | 0 | 1 | by 16
1         | 1 | 1 | 2 | to instruction
2         | 0 | 0 | 0 | to instruction
2         | 0 | 1 | 0 | to instruction
2         | 1 | 0 | 0 | by 16
2         | 1 | 1 | 1 | to instruction
3 or more | 0 | 0 | 0 | to instruction
3 or more | 0 | 1 | 0 | to instruction
3 or more | 1 | 0 | 0 | to instruction
3 or more | 1 | 1 | 0 | to instruction
Jumps delay the fetching so that a loop always takes at least two clock cycles more per iteration than the number of 16-byte boundaries in the loop.
A further problem with the instruction fetch mechanism is that a new ifetch block is not generated until the previous one is exhausted. Each ifetch block can contain several decode groups. If a 16-byte ifetch block ends with an unfinished instruction, then the next ifetch block will begin at the beginning of that instruction. The first instruction in an ifetch block always goes to decoder D0, and the next two instructions go to D1 and D2, if possible. The consequence of this is that D1 and D2 are used less than optimally. If the code is structured according to the recommended 4-1-1 pattern, and an instruction intended for D1 or D2 happens to be the first instruction in an ifetch block, then that instruction has to go into D0, with the result that one clock cycle is wasted. This is probably a hardware design flaw; at least it is a suboptimal design. The consequence of this problem is that the time it takes to decode a piece of code can vary considerably depending on where the first ifetch block begins.
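To make the 4-1-1 pattern concrete, here is a minimal sketch (the memory operands MEM1 and MEM2 are hypothetical, not from the original example) of a decode group that can fill all three decoders in one clock cycle:

ADD [MEM1], EAX      ; 4 uops: must go to decoder D0
MOV EBX, [MEM2]      ; 1 uop: can go to D1
LEA EDX, [ESI+4]     ; 1 uop: can go to D2

If an ifetch block happens to begin at the MOV EBX, [MEM2] instruction, then that instruction is forced into D0 instead, and an extra clock cycle is spent on the group.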
If decoding speed is critical and you want to avoid these problems, then you have to know where each ifetch block begins. This is quite a tedious job. First you need to make your code segment paragraph-aligned in order to know where the 16-byte boundaries are. Then you have to look at the output listing from your assembler to see how long each instruction is. (It is recommended that you study how instructions are coded so that you can predict the lengths of the instructions.) If you know where one ifetch block begins, then you can find where the next ifetch block begins in the following way: Make the block 16 bytes long. If it ends at an instruction boundary, then the next block will begin there. If it ends with an unfinished instruction, then the next block will begin at the beginning of this instruction. (Only the lengths of the instructions count here; it doesn't matter how many uops they generate or what they do.) This way you can work your way all through the code and mark where each ifetch block begins. The only problem is knowing where to start. If you know where one ifetch block is, then you can find all the subsequent ones, but you have to know where the first one begins. Here are some guidelines:
1. The first ifetch block after a predicted jump begins either at the first instruction after the jump or at the nearest preceding 16-byte boundary, according to the table above. If you align an important jump target, such as a loop entry, by 16, then you can be certain that an ifetch block begins there.
2. If the combined length of two consecutive instructions is more than 16 bytes, then they cannot possibly lie in the same ifetch block, so you can be certain that an ifetch block begins at the second of these instructions. You can use this as a starting point for finding where the subsequent ifetch blocks begin.
I am sure you want an example now:
address  instruction                length  uops  expected decoder
----------------------------------------------------------------------
1000h        MOV ECX, 1000             5      1       D0
1005h  LL:   MOV [ESI], EAX            2      2       D0
1007h        MOV [MEM], 0             10      2       D0
1011h        LEA EBX, [EAX+200]        6      1       D1
1017h        MOV BYTE PTR [ESI], 0     3      2       D0
101Ah        BSR EDX, EAX              3      2       D0
101Dh        MOV BYTE PTR [ESI+1], 0   4      2       D0
1021h        DEC ECX                   1      1       D1
1022h        JNZ LL                    2      1       D2
Let's assume that the first ifetch block begins at address 1000h and ends at 1010h. This is before the end of the MOV [MEM],0 instruction, so the next ifetch block will begin at 1007h and end at 1017h. This is at an instruction boundary, so the third ifetch block will begin at 1017h and cover the rest of the loop. The number of clock cycles it takes to decode this is the number of D0 instructions, which is 5 per iteration of the LL loop. The last ifetch block contained three decode groups covering the last five instructions, and it has one 16-byte boundary (1020h).

Looking at the table above, we find that the first ifetch block after the jump will begin at the first instruction after the jump, that is, the LL label at 1005h, and end at 1015h. This is before the end of the LEA instruction, so the next ifetch block will go from 1011h to 1021h, and the last one from 1021h covering the rest. Now the LEA instruction and the DEC instruction both fall at the beginning of an ifetch block, which forces them to go into D0. We now have 7 instructions in D0, and the loop takes 7 clocks to decode in the second iteration. The last ifetch block contains only one decode group (DEC ECX / JNZ LL) and has no 16-byte boundary. According to the table, the next ifetch block after the jump will begin at a 16-byte boundary, which is 1000h. This will give us the same situation as in the first iteration, and you will see that the loop takes alternately 5 and 7 clock cycles to decode. Since there are no other bottlenecks, the complete loop will take 6000 clocks to run 1000 iterations. If the starting address had been different, so that you had a 16-byte boundary in the first or the last instruction of the loop, then it would take 8000 clocks. If you reorder the loop so that no D1 or D2 instructions fall at the beginning of an ifetch block, then you can make it take only 5000 clocks.
The example above was deliberately constructed so that instruction fetch and decoding is the only bottleneck. The easiest way to avoid this problem is to structure your code so that it generates considerably more than 3 uops per clock cycle in the decoders, so that decoding will not be a bottleneck despite the penalties described here. In small loops this may not be possible, and then you have to work out how to optimize the instruction fetching and decoding.
One thing you can do is to change the starting address of your procedure in order to avoid 16-byte boundaries where you don't want them. Remember to make your code segment paragraph-aligned so that you know where the boundaries are.
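As a sketch of how to do this, assuming (arbitrarily, for illustration only) that you want a hypothetical procedure PROC1 to begin 4 bytes after a 16-byte boundary:

ALIGN 16              ; a 16-byte boundary (the code segment is paragraph-aligned)
DB 4 DUP (90h)        ; four 1-byte NOP fillers
PROC1:                ; the procedure now begins 4 bytes past the boundary

Changing this offset moves all the 16-byte boundaries relative to the instructions inside the procedure.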
If you insert an ALIGN 16 directive before the loop entry, then the assembler will put in NOPs and other filler instructions up to the nearest 16-byte boundary. Most assemblers use the instruction XCHG EBX,EBX as a 2-byte filler (the so-called 2-byte NOP). Whoever came up with this idea, it was a bad one, because this instruction takes more time than two NOPs on most processors! If the loop executes many times, then whatever is outside the loop is unimportant in terms of speed and you don't have to care about the suboptimal filler instructions. But if the time taken by the fillers is important, then you may select the filler instructions manually. You may as well use filler instructions that do something useful, such as refreshing a register in order to avoid register read stalls (see chapter 16.2). For example, if you are using register EBP for addressing but seldom write to it, then you may use MOV EBP,EBP or ADD EBP,0 as fillers in order to reduce the possibilities of register read stalls. If you have nothing useful to do, you may use FXCH ST(0) as a good filler because it doesn't put any load on the execution ports, provided that ST(0) contains a valid floating point value.
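A minimal sketch of selecting fillers by hand, assuming (purely for illustration) that 5 bytes of padding are needed in front of a hypothetical loop entry label LL:

MOV EBP, EBP          ; 2-byte filler; also refreshes EBP, which may prevent register read stalls
FXCH ST(0)            ; 2-byte filler; uses no execution port if ST(0) holds a valid value
NOP                   ; 1-byte filler
LL:                   ; loop entry now begins at the intended address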
Another possible remedy is to reorder your instructions in order to get the ifetch boundaries where they don't hurt. This can be quite a difficult puzzle and it is not always possible to find a satisfactory solution.
Yet another possibility is to manipulate instruction lengths. Sometimes you can substitute one instruction with another one of a different length. Many instructions can be coded in different versions with different lengths. The assembler always chooses the shortest possible version of an instruction, but it is often possible to hard-code a longer version. For example, DEC ECX is one byte long, SUB ECX,1 is 3 bytes, and you can code a 6-byte version with a long immediate operand using this trick:
SUB ECX, 9999         ; assembled with a 4-byte immediate (6 bytes in total)
ORG $-4               ; step back over the immediate operand
DD 1                  ; overwrite it with the value 1
Instructions with a memory operand can be made one byte longer with a SIB byte, but the easiest way of making an instruction one byte longer is to add a DS: segment prefix (DB 3Eh). The microprocessors generally accept redundant and meaningless prefixes (except LOCK) as long as the instruction length does not exceed 15 bytes. Even instructions without a memory operand can have a segment prefix. So if you want the DEC ECX instruction to be 2 bytes long, write:
DB 3Eh                ; redundant DS segment prefix
DEC ECX               ; the instruction is now 2 bytes long including the prefix
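The SIB-byte method mentioned above can be hand-coded in the same way. A sketch, with arbitrarily chosen registers: MOV EAX, [EBX] normally assembles to 2 bytes, but it can be written as 3 bytes with a redundant SIB byte:

DB 8Bh, 04h, 23h      ; MOV EAX, [EBX] coded with a SIB byte, 3 bytes instead of 2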
Remember that you get a penalty in the decoder if an instruction has more than one prefix. It is possible that instructions with meaningless prefixes - especially repeat and lock prefixes - will be used in future processors for new instructions when there are no more vacant instruction codes, but I would consider it safe to use a segment prefix with any instruction.
With these methods it will usually be possible to put the ifetch boundaries where you want them, although it can be a tedious puzzle.