31. Comparison of the different microprocessors
The following table summarizes some important differences between the microprocessors in the Pentium family:
Feature | PPlain | PMMX | PPro | PII | PIII |
code cache, kB | 8 | 16 | 8 | 16 | 16 |
data cache, kB | 8 | 16 | 8 | 16 | 16 |
built-in level 2 cache, kB | 0 | 0 | 256 | 512 *) | 512 *) |
MMX instructions | no | yes | no | yes | yes |
XMM instructions | no | no | no | no | yes |
conditional move instructions | no | no | yes | yes | yes |
out of order execution | no | no | yes | yes | yes |
branch prediction | poor | good | good | good | good |
branch target buffer entries | 256 | 256 | 512 | 512 | 512 |
return stack buffer size | 0 | 4 | 16 | 16 | 16 |
branch misprediction penalty, clocks | 3-4 | 4-5 | 10-20 | 10-20 | 10-20 |
partial register stall, clocks | 0 | 0 | 5 | 5 | 5 |
FMUL latency, clocks | 3 | 3 | 5 | 5 | 5 |
FMUL throughput, per clock | 1/2 | 1/2 | 1/2 | 1/2 | 1/2 |
IMUL latency, clocks | 9 | 9 | 4 | 4 | 4 |
IMUL throughput, per clock | 1/9 | 1/9 | 1/1 | 1/1 | 1/1 |
*) Celeron: 0-128 kB, Xeon: 512 kB or more; many other variants are available. On some versions the level 2 cache runs at half speed.
Comments on the table:
Code cache size is important if the critical part of your program is not limited to a small memory space.
Data cache size is important for all programs that handle more than small amounts of data in the critical part.
MMX and XMM instructions are useful for programs that handle massively parallel data, such as sound and image processing. In other applications it may not be possible to take advantage of the MMX and XMM instructions.
Conditional move instructions are useful for avoiding poorly predictable conditional jumps.
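As a rough illustration in C, a data-dependent choice can be written as a simple select, which gives a compiler targeting the PPro or later the opportunity to use a conditional move instead of a jump. Whether it actually does so depends on the compiler and its target options, so this is only a sketch of the idea. The function name sum_clamped and its parameters are hypothetical.

```c
#include <stddef.h>

/* Sum the elements of v, with each element limited to at most 'limit'.
   The comparison result depends on the data, so a conditional jump here
   could be poorly predictable. Writing it as a select with no side effects
   makes it a natural candidate for a conditional move (not guaranteed). */
long sum_clamped(const int *v, size_t n, int limit)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        int x = v[i];
        int y = (x > limit) ? limit : x;   /* select, no side effects */
        sum += y;
    }
    return sum;
}
```

For the hypothetical input {1, 10, 3, 20} with a limit of 5, the clamped values are 1, 5, 3 and 5.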
Out of order execution improves performance, especially on non-optimized code. It includes automatic instruction reordering and register renaming.
Processors with a good branch prediction method can predict simple repetitive patterns. Good branch prediction matters most when the branch misprediction penalty is high.
A return stack buffer improves prediction of return instructions when a subroutine is called alternately from different locations.
Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.
The latency of a multiplication instruction is the number of clock cycles it contributes to a dependency chain, that is, the delay before a dependent instruction can use its result. A throughput of 1/2 means that the execution can be pipelined so that a new multiplication can begin every second clock cycle. This defines the speed for handling parallel data.
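The difference between latency and throughput can be made concrete with a small calculation. The sketch below uses a deliberately simplified model, which is an assumption of this example: it counts only the multiplications themselves and ignores decoding, memory access and every other bottleneck. The function names are hypothetical.

```c
/* Clocks for n multiplications where each one needs the result of the
   previous one (a dependency chain): every multiplication costs its
   full latency. */
unsigned long chain_clocks(unsigned long n, unsigned latency)
{
    return n * latency;
}

/* Clocks for n independent multiplications: a new one can start every
   'issue_interval' clocks (the reciprocal of the throughput), and the
   last one still needs its full latency to finish. */
unsigned long pipelined_clocks(unsigned long n, unsigned latency,
                               unsigned issue_interval)
{
    if (n == 0)
        return 0;
    return (n - 1) * issue_interval + latency;
}
```

With the FMUL figures for the PPro, PII and PIII from the table (latency 5, one new multiplication every 2 clocks), this model gives 500 clocks for 100 dependent multiplications but only 203 clocks for 100 independent ones.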
Most of the optimizations described in this document have little or no negative effects on other microprocessors, including non-Intel processors, but there are some problems to be aware of.
Scheduling floating point code for the PPlain and PMMX often requires a lot of extra FXCH instructions. This will slow down execution on older microprocessors, but not on the Pentium family and advanced non-Intel processors.
Taking advantage of the MMX instructions in the PMMX, PII and PIII processors or the conditional moves in the PPro, PII and PIII will create problems if you want your code to be compatible with earlier microprocessors. The solution may be to write several versions of your code, each optimized for a particular processor. Your program should detect which processor it is running on and select the appropriate version of code (chapter 27.10).
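The selection step might look roughly like the following C sketch. It assumes the caller has already executed CPUID function 1 and passes in the feature flags returned in EDX, where bit 15 indicates conditional moves, bit 23 MMX, and bit 25 the XMM (SSE) instructions. How that value is obtained is not shown here, and the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Feature flags in EDX after CPUID function 1. */
#define FEATURE_CMOV (1u << 15)   /* conditional move instructions */
#define FEATURE_MMX  (1u << 23)   /* MMX instructions */
#define FEATURE_XMM  (1u << 25)   /* XMM (SSE) instructions */

/* Hypothetical labels for the separately optimized code versions. */
typedef enum {
    VERSION_PLAIN,      /* no MMX, no conditional moves (e.g. PPlain) */
    VERSION_MMX,        /* MMX only (e.g. PMMX) */
    VERSION_CMOV,       /* conditional moves only (e.g. PPro) */
    VERSION_CMOV_MMX,   /* both (e.g. PII) */
    VERSION_XMM         /* both plus XMM (e.g. PIII) */
} code_version;

/* Pick the most capable version that the feature flags allow. */
code_version select_version(uint32_t edx_features)
{
    int cmov = (edx_features & FEATURE_CMOV) != 0;
    int mmx  = (edx_features & FEATURE_MMX)  != 0;
    int xmm  = (edx_features & FEATURE_XMM)  != 0;

    if (cmov && mmx && xmm) return VERSION_XMM;
    if (cmov && mmx)        return VERSION_CMOV_MMX;
    if (cmov)               return VERSION_CMOV;
    if (mmx)                return VERSION_MMX;
    return VERSION_PLAIN;
}
```

A real program would call such a function once at startup and store the result, for example as a function pointer to the chosen routine. Note that using the XMM registers also requires support from the operating system, which this sketch does not check.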