22.2 Branch prediction in PMMX, PPro, PII and PIII
22.2.1 BTB organization (PMMX, PPro, PII and PIII)
The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16 sets. Each entry is identified by bits 2-31 of the address of the last byte of the control transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB as a tag. Control transfer instructions which are spaced 64 bytes apart have the same set-value and may therefore occasionally push each other out of the BTB. Since there are 16 ways per set, this won't happen too often.
The branch target buffer (BTB) of the PPro, PII and PIII has 512 entries organized as 16 ways * 32 sets. Each entry is identified by bits 4-31 of the address of the last byte of the control transfer instruction it belongs to. Bits 4-8 define the set, and all bits are stored in the BTB as a tag. Control transfer instructions which are spaced 512 bytes apart have the same set-value and may therefore occasionally push each other out of the BTB. Since there are 16 ways per set, this won't happen too often.
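As a worked example with made-up addresses: two jump instructions whose last bytes lie at 4010H and 4210H are spaced 512 bytes apart. Bits 4-8 have the value 1 in both addresses ((4010H SHR 4) AND 1FH = (4210H SHR 4) AND 1FH = 1), so both instructions map to set 1 and compete for the same 16 ways.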
The PPro, PII and PIII allocate a BTB entry to any control transfer instruction the first time it is executed. The PMMX allocates an entry only the first time the instruction jumps. A branch instruction which never jumps will stay out of the BTB on the PMMX. As soon as it has jumped once, it will stay in the BTB, even if it never jumps again.
An entry may be pushed out of the BTB when another control transfer instruction with the same set-value needs a BTB entry.
22.2.2 Misprediction penalty (PMMX, PPro, PII and PIII)
In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks in the U-pipe, and 5 clocks if it is executed in the V-pipe. For all other control transfer instructions it is 4 clocks.
In the PPro, PII and PIII, the misprediction penalty is very high due to the long pipeline. A misprediction usually costs between 10 and 20 clock cycles. It is therefore very important to be aware of poorly predictable branches when running on PPro, PII and PIII.
22.2.3 Pattern recognition for conditional jumps (PMMX, PPro, PII and PIII)
These processors have an advanced pattern recognition mechanism which will correctly predict a branch instruction which, for example, is taken every fourth time and falls through the other three times. In fact, they can predict any repetitive pattern of jumps and nojumps with a period of up to five, and many patterns with higher periods.
The mechanism is a so-called "two-level adaptive branch prediction scheme", invented by T.-Y. Yeh and Y. N. Patt. It is based on the same kind of two-bit counters as described above for the PPlain (but without the asymmetry flaw). The counter is incremented when the jump is taken and decremented when it is not taken. There is no wrap-around when counting up from 3 or down from 0. A branch instruction is predicted to be taken when the corresponding counter is in state 2 or 3, and to fall through when it is in state 0 or 1. An impressive improvement is obtained by having sixteen such counters for each BTB entry. The mechanism selects one of these sixteen counters based on the history of the branch instruction for the last four executions. If, for example, the branch instruction jumps once and then falls through three times, then the history bits are 1000 (1 = jump, 0 = nojump). This makes it use counter number 8 (1000 binary = 8) for predicting the next execution, and update counter 8 afterwards.
If the sequence 1000 is always followed by a 1, then counter number 8 will soon end up in its highest state (state 3) so that it will always predict a 1000 sequence to be followed by a 1. It will take two deviations from this pattern to change the prediction. The repetitive pattern 100010001000 will have counter 8 in state 3, and counter 1, 2 and 4 in state 0. The other twelve counters will be unused.
22.2.4 Perfectly predicted patterns (PMMX, PPro, PII and PIII)
A repetitive branch pattern is predicted perfectly by this mechanism if every 4-bit sub-sequence in the period is unique. Below is a list of repetitive branch patterns which are predicted perfectly:
period | perfectly predicted patterns |
1-5 | all |
6 | 000011, 000101, 000111, 001011 |
7 | 0000101, 0000111, 0001011 |
8 | 00001011, 00001111, 00010011, 00010111, 00101101 |
9 | 000010011, 000010111, 000100111, 000101101 |
10 | 0000100111, 0000101101, 0000101111, 0000110111, 0001010011, 0001011101 |
11 | 00001001111, 00001010011, 00001011101, 00010100111 |
12 | 000010100111, 000010111101, 000011010111, 000100110111, 000100111011 |
13 | 0000100110111, 0000100111011, 0000101001111 |
14 | 00001001101111, 00001001111011, 00010011010111, 00010011101011, 00010110011101, 00010110100111 |
15 | 000010011010111, 000010011101011, 000010100110111, 000010100111011, 000010110011101, 000010110100111, 000010111010011, 000011010010111 |
16 | 0000100110101111, 0000100111101011, 0000101100111101, 0000101101001111 |
When reading this table, you should be aware that if a pattern is predicted correctly, then the same pattern reversed (read backwards) is also predicted correctly, as is the same pattern with all bits inverted. Example: in the table we find the pattern 0001011. Reversing this pattern gives 1101000. Inverting all bits gives 1110100. Both reversing and inverting gives 0010111. These four patterns are all recognizable. Rotating the pattern one place to the left gives 0010110. This is of course not a new pattern, only a phase-shifted version of the same pattern. All patterns which can be derived from one of the patterns in the table by reversing, inverting and rotating are also recognizable. For reasons of brevity, these are not listed.
It takes two periods for the pattern recognition mechanism to learn a regular repetitive pattern after the BTB entry has been allocated. The pattern of mispredictions in the learning period is not reproducible. This is probably because the BTB entry contained something prior to allocation. Since BTB entries are allocated according to a random scheme, there is little chance of predicting what happens during the initial learning period.
22.2.5 Handling deviations from a regular pattern (PMMX, PPro, PII and PIII)
The branch prediction mechanism is also extremely good at handling 'almost regular' patterns, or deviations from the regular pattern. Not only does it learn what the regular pattern looks like; it also learns what deviations from the regular pattern look like. If deviations are always of the same type, then it will remember what comes after the irregular event, and the deviation will cost only one misprediction.
Example:
0001110001110001110001011100011100011100010111000
                      ^                   ^
In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns that the repeated sequence is 000111. The first irregularity is an unexpected 0, which I have marked with a ^. After this 0, the next three jumps may be mispredicted, because it hasn't yet learned what comes after 0010, 0101, and 1011. After one or two irregularities of the same kind it has learned that after 0010 comes a 1, after 0101 comes a 1, and after 1011 comes a 1. This means that after at most two irregularities of the same kind, it has learned to handle this kind of irregularity with only one misprediction.
The prediction mechanism is also very effective when alternating between two different regular patterns. If, for example, we have the pattern 000111 (period 6) repeated many times, then the pattern 01 (period 2) repeated many times, and then return to the 000111 pattern, then the mechanism doesn't have to relearn the 000111 pattern, because the counters used in the 000111 sequence have been left untouched during the 01 sequence. After a few alternations between the two patterns, it has also learned to handle the changes of pattern with only one misprediction each time the pattern is switched.
22.2.6 Patterns which are not predicted perfectly (PMMX, PPro, PII and PIII)
The simplest branch pattern which cannot be predicted perfectly is a branch which is taken on every 6'th execution. The pattern is:
000001000001000001
    ^^    ^^    ^^
    ab    ab    ab
The sequence 0000 is alternately followed by a 0, in the positions marked a above, and by a 1, in the positions marked b. This affects counter number 0, which will count up and down all the time. If counter 0 happens to start in state 0 or 1, then it will alternate between state 0 and 1. This will lead to a misprediction in position b. If counter 0 happens to start in state 3, then it will alternate between state 2 and 3, which will cause a misprediction in position a. The worst case is when it starts in state 2. It will then alternate between state 1 and 2, with the unfortunate consequence that we get a misprediction both in position a and in position b. (This is analogous to the worst case for the PPlain explained above). Which of these four situations we get depends on the history of the BTB entry prior to its allocation to this branch. This is beyond our control because of the random allocation method.
In principle, it is possible to avoid the worst-case situation of two mispredictions per period by giving the branch an initial sequence of jumps specially designed to put the counter in the desired state. Such an approach cannot be recommended, however, because of the considerable extra code complexity required, and because whatever information we have put into the counter is likely to be lost during the next timer interrupt or task switch.
22.2.7 Completely random patterns (PMMX, PPro, PII and PIII)
The powerful capability of pattern recognition has a minor drawback in the case of completely random sequences with no regularities.
The following table lists the experimental fraction of mispredictions for a completely random sequence of jumps and nojumps:
fraction of jumps/nojumps | fraction of mispredictions |
0.001/0.999 | 0.001001 |
0.01/0.99 | 0.0101 |
0.05/0.95 | 0.0525 |
0.10/0.90 | 0.110 |
0.15/0.85 | 0.171 |
0.20/0.80 | 0.235 |
0.25/0.75 | 0.300 |
0.30/0.70 | 0.362 |
0.35/0.65 | 0.418 |
0.40/0.60 | 0.462 |
0.45/0.55 | 0.490 |
0.50/0.50 | 0.500 |
The fraction of mispredictions is slightly higher than it would be without pattern recognition because the processor keeps trying to find repeated patterns in a sequence which has no regularities.
22.2.8 Tight loops (PMMX)
The branch prediction is not reliable in tiny loops where the pattern recognition mechanism doesn't have time to update its data before the next branch is met. This means that simple patterns, which would normally be predicted perfectly, are not recognized. Incidentally, some patterns which normally would not be recognized are predicted perfectly in tight loops. For example, a loop which always repeats 6 times would have the branch pattern 111110 for the branch instruction at the bottom of the loop. This pattern would normally have one or two mispredictions per period, but in a tight loop it has none. The same applies to a loop which repeats 7 times. Most other repeat counts are predicted worse in tight loops than normally. This means that a loop which iterates 6 or 7 times should preferably be tight, whereas other loops should preferably not be tight. You may unroll a loop if necessary to make it less tight.
To find out whether a loop will behave as 'tight' on the PMMX you may use the following rule of thumb: count the number of instructions in the loop. If the number is 6 or less, then the loop will behave as tight. If there are more than 7 instructions, then you can be reasonably sure that the pattern recognition functions normally. Strangely enough, it doesn't matter how many clock cycles each instruction takes, whether it has stalls, or whether it is paired or not. Complex integer instructions do not count. A loop can have lots of complex integer instructions and still behave as a tight loop. A complex integer instruction is a non-pairable integer instruction which always takes more than one clock cycle. Complex floating point instructions and MMX instructions still count as one. Note that this rule of thumb is heuristic and not completely reliable. In important cases you may want to do your own testing. You can use performance monitor counter number 35H for the PMMX to count branch mispredictions. Test results may not be completely deterministic, because branch predictions may depend on the history of the BTB entry prior to allocation.
Tight loops on the PPro, PII and PIII are predicted normally, and take a minimum of two clock cycles per iteration.
22.2.9 Indirect jumps and calls (PMMX, PPro, PII and PIII)
There is no pattern recognition for indirect jumps and calls, and the BTB can remember no more than one target for an indirect jump. It is simply predicted to go to the same target as it did last time.
22.2.10 JECXZ and LOOP (PMMX)
There is no pattern recognition for these two instructions in the PMMX. They are simply predicted to go the same way as last time they were executed. These two instructions should be avoided in time-critical code for the PMMX. (On the PPro, PII and PIII they are predicted using pattern recognition, but the LOOP instruction is still inferior to DEC ECX / JNZ).
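As a minimal sketch of the DEC ECX / JNZ replacement (the loop count of 100 and the label L1 are invented for this example):

        MOV ECX, 100    ; loop counter
L1:                     ; loop body goes here
        DEC ECX         ; decrement the counter and set the zero flag
        JNZ L1          ; ordinary conditional jump, which benefits from pattern recognition (unlike LOOP on the PMMX)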
22.2.11 Returns (PMMX, PPro, PII and PIII)
The PMMX, PPro, PII and PIII processors have a Return Stack Buffer (RSB) which is used for predicting return instructions. The RSB works as a First-In-Last-Out buffer. Each time a call instruction is executed, the corresponding return address is pushed into the RSB. And each time a return instruction is executed, a return address is pulled out of the RSB and used for prediction of the return. This mechanism makes sure that return instructions are correctly predicted when the same subroutine is called from several different locations.
In order to make sure this mechanism works correctly, you must make sure that all calls and returns are matched. Never jump out of a subroutine without a return and never use a return as an indirect jump if speed is critical.
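As a hedged illustration of the second point (the label TARGET is invented here), the following misuses a return as an indirect jump; the RSB predicts the RET to go back to the most recent caller, so it is mispredicted and the RSB gets out of step with the real call nesting:

        PUSH OFFSET TARGET      ; pushed address does not come from a CALL
        RET                     ; mispredicted: the RSB supplies the last caller's return address
TARGET: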
The RSB can hold four entries in the PMMX, sixteen in the PPro, PII and PIII. In the case where the RSB is empty, the return instruction is predicted in the same way as an indirect jump, i.e. it is expected to go to the same target as it did last time.
On the PMMX, when subroutines are nested deeper than four levels, the innermost four levels use the RSB, whereas all subsequent returns from the outer levels use the simpler prediction mechanism as long as there are no new calls. A return instruction which uses the RSB still occupies a BTB entry. Four entries in the RSB of the PMMX doesn't sound like much, but it is probably sufficient. Subroutine nesting deeper than four levels is certainly not unusual, but only the innermost levels matter in terms of speed, except possibly for recursive procedures.
On the PPro, PII and PIII, when subroutines are nested deeper than sixteen levels then the innermost 16 levels use the RSB, whereas all subsequent returns from the outer levels are mispredicted. Recursive subroutines should therefore not go deeper than 16 levels.
22.2.12 Static prediction in PMMX
A control transfer instruction which has not been seen before or which is not in the BTB is always predicted to fall through on the PMMX. It doesn't matter whether it goes forward or backwards.
A branch instruction will not get a BTB entry if it always falls through. As soon as it is taken once, it will get into the BTB and stay there no matter how many times it falls through. A control transfer instruction can only go out of the BTB when it is pushed out by another control transfer instruction which steals its BTB entry.
Any control transfer instruction which jumps to the address immediately following itself will not get a BTB entry. Example:
        JMP SHORT LL
LL:
This instruction will never get a BTB entry and therefore always have a misprediction penalty.
22.2.13 Static prediction in PPro, PII and PIII
On PPro, PII and PIII, a control transfer instruction which has not been seen before or which is not in the BTB is predicted to fall through if it goes forwards, and to be taken if it goes backwards (e.g. a loop). Static prediction takes longer than dynamic prediction on these processors.
If your code is unlikely to be cached then it is preferred to have the most frequently executed branch fall through in order to improve prefetching.
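A sketch of this arrangement with invented labels; the rare case is tested and branched to, so that the frequent path falls through and prefetching continues in a straight line (RARE_CASE is assumed to be placed further down, out of the main code path):

        TEST EAX, EAX
        JZ RARE_CASE            ; rarely taken forward branch, falls through in the frequent case
        ; the frequent path continues here in the prefetched code stream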
22.2.14 Close jumps (PMMX)
On the PMMX, there is a risk that two control transfer instructions will share the same BTB entry if they are too close to each other. The obvious result is that they will always be mispredicted.
The BTB entry for a control transfer instruction is identified by bits 2-31 of the address of the last byte in the instruction. If two control transfer instructions are so close together that they differ only in bits 0-1 of the address, then we have the problem of a shared BTB entry. Example:
        CALL P
        JNC SHORT L
If the last byte of the CALL instruction and the last byte of the JNC instruction lie within the same dword of memory, then we have the penalty. You have to look at the output list file from the assembler to see whether the two addresses are separated by a DWORD boundary or not. (A DWORD boundary is an address divisible by 4).
There are various ways to solve this problem:
1. Move the code sequence a little up or down in memory so that you get a dword boundary between the two addresses.
2. Change the short jump to a near jump (with a 4-byte displacement) so that the end of the instruction is moved further down. There is no way to force the assembler to use anything but the shortest form of an instruction, so you have to hard-code the near branch if you choose this solution.
3. Put in some instruction between the CALL and the JNC instructions. This is the easiest method, and the only method if you don't know where DWORD boundaries are because your segment is not dword aligned or because the code keeps moving up and down as you make changes in the preceding code:
        CALL P
        MOV EAX,EAX     ; two bytes filler to be safe
        JNC SHORT L
If you want to avoid problems on the PPlain too, then put in two NOP's instead to prevent pairing (see section 22.1.3 above).
The RET instruction is particularly prone to this problem because it is only one byte long:
        JNZ NEXT
        RET
Here you may need up to three bytes of fillers:
        JNZ NEXT
        NOP
        MOV EAX,EAX
        RET
22.2.15 Consecutive calls or returns (PMMX)
There is a penalty when the first instruction pair following the target label of a call contains another call instruction or if a return follows immediately after another return. Example:
FUNC1   PROC NEAR
        NOP             ; avoid call after call
        NOP
        CALL FUNC2
        CALL FUNC3
        NOP             ; avoid return after return
        RET
FUNC1   ENDP
Two NOP's are required before CALL FUNC2 because a single NOP would pair with the CALL. One NOP is enough before the RET because RET is unpairable. No NOP's are required between the two CALL instructions because there is no penalty for call after return. (On the PPlain you would need two NOP's here too).
The penalty for chained calls only occurs when the same subroutines are called from more than one location (probably because the RSB needs updating). Chained returns always have a penalty. There is sometimes a small stall for a jump after a call, but no penalty for return after call; call after return; jump, call, or return after jump; or jump after return.
22.2.16 Chained jumps (PPro, PII and PIII)
A jump, call, or return cannot be executed in the first clock cycle after a previous jump, call, or return. Therefore, chained jumps will take two clock cycles for each jump, and you may want to make sure that the processor has something else to do in parallel. For the same reason, a loop will take at least two clock cycles per iteration on these processors.
22.2.17 Designing for branch predictability (PMMX, PPro, PII and PIII)
Multiway branches (switch/case statements) are implemented either as an indirect jump using a list of jump addresses, or as a tree of branch instructions. Since indirect jumps are poorly predicted, the latter method may be preferred if easily predicted patterns can be expected and you have enough BTB entries. In case you decide to use the former method, then it is recommended that you put the list of jump addresses in the data segment.
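A minimal sketch of the indirect jump method with the address list in the data segment (the table name JTAB, the case labels and the use of EAX as the case index are all assumptions of this example):

.data
JTAB    DD CASE0, CASE1, CASE2, CASE3   ; list of jump addresses, kept in the data segment

.code
        ; EAX is assumed to hold a case number in the range 0-3
        JMP DWORD PTR [JTAB+EAX*4]      ; indirect jump, predicted to go to the same target as last time
CASE0:  ; code for case 0, ending with a jump to a common continuation point
CASE1:  ; code for case 1
CASE2:  ; code for case 2
CASE3:  ; code for case 3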
You may want to reorganize your code so that branch patterns which are not predicted perfectly can be replaced by other patterns which are. Consider, for example, a loop which always executes 20 times. The conditional jump at the bottom of the loop is taken 19 times and falls through every 20'th time. This pattern is regular, but not recognized by the pattern recognition mechanism, so the fall-through is always mispredicted. You may make two nested loops of four and five iterations, or unroll the loop by four and let it execute 5 times, in order to have only recognizable patterns. Such complicated schemes are only worth the extra code on the PPro, PII and PIII processors, where mispredictions are very expensive. For higher loop counts there is no reason to do anything about the single misprediction.
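A sketch of the nested loop variant (registers and labels are chosen arbitrarily); the inner loop runs 4 times and the outer loop 5 times, so both conditional jumps have repetitive patterns with periods of 4 and 5, which are predicted perfectly:

        MOV EDX, 5      ; outer loop counter
OUTER:  MOV ECX, 4      ; inner loop counter
INNER:                  ; loop body goes here
        DEC ECX
        JNZ INNER       ; pattern 1110 repeated (period 4): predicted perfectly
        DEC EDX
        JNZ OUTER       ; pattern 11110 repeated (period 5): predicted perfectly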