Would just like to clarify a few HT-related questions and answers:
HT is not emulation; it doesn't pretend to be something it isn't. HT also isn't new: it's been used very effectively in other CPU architectures. Intel's version is purely a way to get more performance out of the P4's existing hardware, as opposed to a genuine attempt to add raw performance cheaply. So how does HT work?
Well, as you know, the P4 is deeply pipelined, meaning its pipelines have many stages. This is the key feature that lets Intel's implementation work. You send instructions down the pipeline, they're executed, and the results are written out to memory. Huzzah for that. Now, what if an instruction doesn't use the FPU (Floating Point Unit) on the P4? That's effectively one unit wasted that cycle.
Here's where HT comes in: it reports to Windows that there are two processors. The Windows scheduler will then send threads to both 'CPUs' as it would in any dual-CPU setup. This means the processor effectively sees two independent execution threads. I'm going to oversimplify here, so don't take this as gospel:
While processing thread #1, the second FPU may be free, as well as the first ALU. If there are instructions waiting for the FPU and/or ALU on the second thread, they are executed there and then. This essentially means units are left idle far less often in an HT-enabled CPU. A branch misprediction causes a similar (and much more dramatic) event. If a branch is mispredicted, everything that thread has in the pipeline must be thrown away. This is great news for the second thread, because it suddenly has access to almost the entire pipeline for a fair few cycles (I used to know how many, but I've been out of game dev for too long :\).
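To make the idle-unit idea concrete, here's a toy cycle counter. It's only a sketch of the oversimplified picture above, not how the P4 actually schedules anything: it assumes a made-up core with exactly one ALU and one FPU, every instruction taking one cycle, and each thread issuing strictly in order. The function names and instruction lists are invented for illustration.

```python
from collections import deque


def cycles_one_at_a_time(threads):
    # Without HT (toy model): threads take turns on the core and issue one
    # instruction per cycle, so whichever unit isn't needed sits idle.
    return sum(len(t) for t in threads)


def cycles_with_ht(threads):
    # With HT (toy model): every cycle, each thread may issue its next
    # instruction as long as the unit it needs hasn't been claimed yet.
    queues = [deque(t) for t in threads]
    cycles = 0
    while any(queues):
        busy = set()
        for i in range(len(queues)):
            # Rotate which thread gets first pick so neither is starved.
            q = queues[(i + cycles) % len(queues)]
            if q and q[0] not in busy:
                busy.add(q.popleft())
        cycles += 1
    return cycles


integer_heavy = ["ALU"] * 4  # hypothetical thread that only needs the ALU
float_heavy = ["FPU"] * 4    # hypothetical thread that only needs the FPU

print(cycles_one_at_a_time([integer_heavy, float_heavy]))  # 8
print(cycles_with_ht([integer_heavy, float_heavy]))        # 4
```

The two threads here never want the same unit, so this is the best case. Feed it two threads that both want the FPU and the HT version takes just as many cycles as running them one after the other.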
Of course, HT has some very clever logic inside it; in fact, the first version of HT to appear in the P4 was HT v2. This logic is essential to keeping performance up, otherwise you really would be in a situation where each thread gets 50% of the CPU, as opposed to the unbalanced share they should have. You wouldn't really want your game to run at half the speed it should because you're receiving an email in the background.
HT also can't work miracles: if the units are in use, then they're in use. So it would be impossible, for example, to encode at twice the speed by running two encoders, because the units the second encoder needs are already in full use by the first. There will be a slight performance increase due to HT, but it will be small.
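A rough back-of-the-envelope version of that argument, again as a sketch under loud assumptions: if one copy of a program keeps its busiest unit occupied for some fraction of the cycles, two copies sharing that unit can't push it past 100%, so the combined speedup is capped at 1 divided by that fraction (and never more than 2x for two threads). The utilization numbers below are made up, not measurements of any real encoder.

```python
def max_ht_speedup(unit_utilization):
    # unit_utilization maps a unit name to the fraction of cycles (0..1]
    # that ONE copy of the program keeps that unit busy.
    busiest = max(unit_utilization.values())
    if busiest <= 0:
        raise ValueError("the program must use at least one unit")
    # A unit can't be more than 100% busy, so two copies are capped at
    # 1 / busiest, and two threads can never beat 2x.
    return min(2.0, 1.0 / busiest)


encoder = {"FPU": 0.95, "ALU": 0.30}  # hypothetical figures for illustration
print(round(max_ht_speedup(encoder), 2))  # 1.05
```

So an encoder that already keeps the FPU 95% busy has about 5% of headroom left for a second copy, which is the "slight performance increase" described above.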
Oh, and if you're wondering why the deep pipeline helps: it's because there are more units available and branch mispredictions are a much bigger event.
Genuine processors designed with HT in mind from the get-go (the generic name for the technology is simultaneous multithreading, or SMT) generally double up important units like the FPU/ALU.
Hope this helps
