31. Comparison of the different microprocessors

Date: 2000-04-03 14:00

The following table summarizes some important differences between the microprocessors in the Pentium family:

	PPlain	PMMX	PPro	PII	PIII
code cache, KB	8	16	8	16	16
data cache, KB	8	16	8	16	16
built in level 2 cache, KB	0	0	256	512 *)	512 *)
MMX instructions	no	yes	no	yes	yes
XMM instructions	no	no	no	no	yes
conditional move instructions	no	no	yes	yes	yes
out of order execution	no	no	yes	yes	yes
branch prediction	poor	good	good	good	good
branch target buffer entries	256	256	512	512	512
return stack buffer size	0	4	16	16	16
branch misprediction penalty, clocks	3-4	4-5	10-20	10-20	10-20
partial register stall, clocks	0	0	5	5	5
FMUL latency, clocks	3	3	5	5	5
FMUL throughput, per clock	1/2	1/2	1/2	1/2	1/2
IMUL latency, clocks	9	9	4	4	4
IMUL throughput, per clock	1/9	1/9	1/1	1/1	1/1

*) Celeron: 0-128 KB, Xeon: 512 KB or more; many other variants are available. On some versions the level 2 cache runs at half speed.

Comments on the table:

Code cache size is important if the critical part of your program is not limited to a small memory space.

Data cache size is important for all programs that handle more than small amounts of data in the critical part.

MMX and XMM instructions are useful for programs that handle massively parallel data, such as sound and image processing. In other applications it may not be possible to take advantage of the MMX and XMM instructions.

Conditional move instructions are useful for avoiding poorly predictable conditional jumps.
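The idea of replacing a poorly predictable jump with a data-dependent selection can be sketched in C. This is only an illustration, not code from this manual: the function names select_u32 and max_u32 are made up for the example, and whether a compiler translates the selection into a CMOV instruction depends on the compiler and its target settings. The bit mask used here is a portable substitute that works even on processors without conditional moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return a if cond is nonzero, else b,
   without a conditional jump. mask is all ones when cond != 0
   and all zeros otherwise. */
uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Hypothetical example: largest element of v[0..n-1]. The result of
   the comparison is used as data instead of steering a jump, so the
   only jump left is the well predictable loop branch. */
uint32_t max_u32(const uint32_t *v, size_t n)
{
    assert(n > 0);                      /* an empty array has no maximum */
    uint32_t best = v[0];
    for (size_t i = 1; i < n; i++)
        best = select_u32(v[i] > best, v[i], best);
    return best;
}
```

On random input the comparison v[i] > best would be hard to predict as a jump; written this way, its cost no longer depends on the data.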

Out of order execution improves performance, especially on non-optimized code. It includes automatic instruction reordering and register renaming.

Processors with a good branch prediction method can predict simple repetitive patterns. Good branch prediction matters most when the branch misprediction penalty is high.

A return stack buffer improves prediction of return instructions when a subroutine is called alternately from different locations.

Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.

The latency of a multiplication instruction is the time it takes in a dependency chain, i.e. the number of clock cycles before a dependent instruction can use the result. A throughput of 1/2 means that the execution can be pipelined so that a new multiplication can begin every second clock cycle. The throughput therefore defines the speed for handling parallel (independent) data.
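The difference between latency and throughput can be made concrete with a little arithmetic. The following C sketch is an idealized model, assuming that nothing other than the multiplier limits execution; the function names are invented for the example.

```c
#include <assert.h>

/* Clock cycles for n multiplications where each one needs the result
   of the previous one: every step costs the full latency. */
unsigned long chain_clocks(unsigned long n, unsigned latency)
{
    return n * latency;
}

/* Clock cycles for n independent multiplications: a new one starts
   every `interval` clocks (interval = 1 / throughput), and the last
   one finishes `latency` clocks after it starts. */
unsigned long pipelined_clocks(unsigned long n, unsigned latency,
                               unsigned interval)
{
    assert(n > 0);
    return (n - 1) * interval + latency;
}
```

With the FMUL figures for the PPro, PII and PIII from the table (latency 5, throughput 1/2), chain_clocks(100, 5) gives 500 clocks for 100 dependent multiplications, while pipelined_clocks(100, 5, 2) gives 203 clocks for 100 independent ones.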

Most of the optimizations described in this document have little or no negative effects on other microprocessors, including non-Intel processors, but there are some problems to be aware of.

Scheduling floating point code for the PPlain and PMMX often requires a lot of extra FXCH instructions. This will slow down execution on older microprocessors, but not on the Pentium family and advanced non-Intel processors.

Taking advantage of the MMX instructions in the PMMX, PII and PIII processors or the conditional moves in the PPro, PII and PIII will create problems if you want your code to be compatible with earlier microprocessors. The solution may be to write several versions of your code, each optimized for a particular processor. Your program should detect which processor it is running on and select the appropriate version of code (chapter 27.10).
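A minimal sketch of such a version selection is shown here in C. It assumes that the caller has already executed CPUID with EAX = 1 and passes in the feature flags returned in EDX, where bit 15 indicates conditional moves and bit 23 indicates MMX. All names in the sketch (code_version, pick_version, sum_generic, sum_bytes) are hypothetical, and the table simply reuses one generic routine where a real program would plug in separately optimized versions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bits in EDX after CPUID with EAX = 1. */
#define CPUID_EDX_CMOV (1u << 15)
#define CPUID_EDX_MMX  (1u << 23)

typedef enum { VER_GENERIC, VER_MMX, VER_CMOV, VER_MMX_CMOV } code_version;

/* Choose a code version from the feature flags. */
code_version pick_version(uint32_t edx)
{
    int has_cmov = (edx & CPUID_EDX_CMOV) != 0;
    int has_mmx  = (edx & CPUID_EDX_MMX) != 0;
    if (has_mmx && has_cmov) return VER_MMX_CMOV;  /* e.g. PII, PIII */
    if (has_mmx)             return VER_MMX;       /* e.g. PMMX */
    if (has_cmov)            return VER_CMOV;      /* e.g. PPro */
    return VER_GENERIC;                            /* PPlain and earlier */
}

typedef uint32_t (*sum_fn)(const uint8_t *, size_t);

/* Plain version that runs on every processor. */
uint32_t sum_generic(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Placeholder table, indexed by code_version. Every slot points to
   the generic routine here; optimized versions would replace them. */
static const sum_fn sum_table[4] = {
    sum_generic, sum_generic, sum_generic, sum_generic
};

/* Dispatch to the version that matches the detected features. */
uint32_t sum_bytes(uint32_t edx, const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    return sum_table[pick_version(edx)](p, n);
}
```

In practice the selection would be done once at program start and the chosen function pointer stored, so that the critical code does not repeat the test. The sketch also assumes that the CPUID instruction itself is available, which is not the case on the oldest processors.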
