OK, I was having a think yesterday. We all know that Bulldozer is disappointing in lightly threaded tasks, and a monster in the right kind of parallel workloads.
I'm pretty sure there's more than one reason for this, so here are a couple of ideas that I'd like to throw open for debate. Feel free to tear me down.
1) FPU Mis-scheduling.
Reading around, it looks like the intended FPU operation is that when the FPU isn't processing 256-bit AVX instructions, it's meant to be able to handle two 128-bit FP instructions simultaneously. The benchmarks show pretty clearly that this isn't happening. So is the problem just that the FPU scheduler isn't dispatching the second instruction when it's able to? If so, could that be fixed in a stepping (B3?!) and miraculously boost the FPU throughput by up to 100%?
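For what it's worth, the "up to 100%" figure is just the best case of a very simple issue-rate model. Here's a rough sketch of that arithmetic. The function name and the notion of a "fraction of cycles in which the second 128-bit op actually gets dispatched" are my own illustrative assumptions, not anything measured on Bulldozer:

```python
def fpu_speedup(second_issue_now, second_issue_fixed=1.0):
    """Relative FP throughput gain if the second 128-bit op issued more often.

    Both arguments are hypothetical fractions (0.0 to 1.0) of cycles in which
    a second 128-bit FP instruction is dispatched alongside the first.
    """
    for fraction in (second_issue_now, second_issue_fixed):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    # One op always issues; the second issues only some of the time.
    ops_per_cycle_now = 1.0 + second_issue_now
    ops_per_cycle_fixed = 1.0 + second_issue_fixed
    return ops_per_cycle_fixed / ops_per_cycle_now - 1.0


# Second op never issues today: a fix would double throughput.
print(f"{fpu_speedup(0.0):.0%}")
# Second op already issues half the time: the gain is much smaller.
print(f"{fpu_speedup(0.5):.0%}")
```

So the 100% only applies if the second instruction is never being dispatched at the moment, and only to code that is purely limited by FP issue rate. Anything less extreme gives a smaller gain.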
2) Branch Misprediction
AMD's redone branch prediction with Bulldozer to have a "quick" and a "slow" predictor. The idea is that when the cores are lightly loaded the quick predictor will make sure there's something to work on so the core isn't sitting idle, but it's not as accurate; when the cores are heavily loaded the slow predictor will have time to put a much more accurate prediction into the queue. So what if the quick predictor is just really bad? Could the poor performance in lightly threaded loads be a case of the quick predictor pushing lots of bad predictions into the queue so it keeps having to flush and restart? If the pipeline is particularly long (as rumoured) then (IIRC from the P4 days) incorrect predictions can have a hugely detrimental effect on performance...
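To show why pipeline length matters here, this is the usual back-of-envelope model: each mispredicted branch costs roughly one pipeline flush, so the average cycles per instruction goes up by (fraction of instructions that are branches) x (misprediction rate) x (flush penalty). Every number below is made up purely for illustration; none of them are real Bulldozer figures, and the function name is my own:

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included.

    All inputs are hypothetical: base_cpi is the CPI with perfect prediction,
    branch_fraction is the share of instructions that are branches,
    mispredict_rate is the share of those branches predicted wrongly, and
    flush_penalty_cycles is the cycles lost per misprediction.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Illustrative numbers only: 20% branches, base CPI of 1.0.
shorter_pipe = effective_cpi(1.0, 0.20, 0.05, 12)
longer_pipe = effective_cpi(1.0, 0.20, 0.05, 20)
longer_pipe_bad_predictor = effective_cpi(1.0, 0.20, 0.10, 20)

print(f"shorter pipeline, 5% miss rate: {shorter_pipe:.2f} CPI")
print(f"longer pipeline, 5% miss rate:  {longer_pipe:.2f} CPI")
print(f"longer pipeline, 10% miss rate: {longer_pipe_bad_predictor:.2f} CPI")
```

The point is that the two effects multiply: a less accurate predictor hurts more the longer the pipeline it has to refill.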