Quote Originally Posted by badass View Post
IIRC since the days of the original Athlon for AMD and the Pentium 4 for Intel, the processors are RISC like internally any way. They effectively have a hardware emulator in the CPU to translate.
Sort of, though it actually goes back to the AMD K5 for AMD and the Pentium Pro for Intel. But the point is, in a small CPU the translation is far from free.

In a dual-issue RISC chip, you issue the instruction at the instruction pointer and whatever is at IP+4. Possibly not 4, but RISC instructions are always a fixed size, so you know where the second instruction starts without decoding the first.
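
To make that concrete, here's a minimal sketch in C (the 4-byte width and the names are my own assumptions, not any particular ISA): the second instruction sits at a constant offset, so both words can be handed to decoders in the same cycle with no dependency between them.

Code:
/* Minimal sketch: dual-issue fetch with a fixed 4-byte instruction word.
 * The width and names here are illustrative, not a specific RISC ISA. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t first;   /* instruction word at IP     */
    uint32_t second;  /* instruction word at IP + 4 */
} issue_pair;

issue_pair fetch_pair(const uint8_t *mem, uint32_t ip)
{
    issue_pair p;
    /* No decoding needed to locate the second instruction. */
    memcpy(&p.first,  mem + ip,     sizeof p.first);
    memcpy(&p.second, mem + ip + 4, sizeof p.second);
    return p;
}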

In an x86? Read the next 16 bytes into a buffer, then try to decode enough of the first instruction to work out its length, so you can find where the second instruction starts. You might not even have all of the second instruction's bytes yet, but handle the common cases and you can get some logic that decodes at least some instruction pairs in parallel. It doesn't always work, so you take the output and put it into a pre-decode cache, so that next time around a loop you don't have to redo that decode work.
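
Roughly what that length-finding step looks like, as a toy C sketch covering only a couple of prefixes and a handful of one-byte opcodes (real x86 adds escape bytes, ModRM/SIB, and prefix-dependent immediate sizes; the function names are made up). The thing to notice is that you can't even locate instruction two until this has run on instruction one.

Code:
/* Toy x86 length decoder: a tiny subset, assuming 64-bit mode.
 * It deliberately ignores that prefixes change immediate sizes
 * (0x66 on 0xB8 gives imm16, REX.W gives imm64) -- exactly the kind
 * of interaction that makes real length decode painful. */
#include <stdint.h>
#include <stddef.h>

static size_t insn_length(const uint8_t *buf)
{
    size_t len = 0;

    /* Skip a couple of common prefixes: operand-size (0x66) and
     * REX (0x40-0x4F in 64-bit mode). */
    while (len < 15 && (buf[len] == 0x66 || (buf[len] & 0xF0) == 0x40))
        len++;

    switch (buf[len]) {
    case 0x90:                     /* NOP            */
    case 0xC3:                     /* RET            */
        return len + 1;
    case 0xB8:                     /* MOV eax, imm32 */
    case 0xE8:                     /* CALL rel32     */
        return len + 1 + 4;
    default:
        return 0;                  /* 0 = "couldn't decode, punt" */
    }
}

/* Where does the second instruction in a 16-byte fetch block start?
 * Entirely dependent on the length of the first one. */
static size_t second_insn_offset(const uint8_t fetch_block[16])
{
    return insn_length(fetch_block);
}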

In a server/desktop chip, the rest of the execution units dwarf that overhead, so it isn't so important. But for smaller CPUs? You could probably build a usable 64-bit CPU with just the transistors and power budget spent on x86 instruction pre-decode and the pre-decode cache, let alone the x86-to-uop translation layer.