Still, NUMA-aware drivers can only work with the underlying hardware, and that hardware is what the cache benchmark is measuring. Even if Windows could detect that a thread was hammering the L3 on the other CCX, the only thing it could do is migrate the thread over. That migration means a context switch, and the thread arrives at the new core with stone-cold L1 and L2 caches, which is horribly expensive. That was shown in the recent article where MS got their NUMA code wrong: it wasn't that they were placing threads badly, as that would only cause a mild impact on performance. AIUI they were moving threads around, and that can easily cause the halving in performance reported.
But when the benchmark above measures the L3 cache, it is thrashing *all* of it. There is no optimal CCX to be in; the system just has to wear it.
Whoah there! ... and this is what cache is all about so let's do the maths.
100ns: at 4GHz we get 4 clocks per ns, so that's 400 clock cycles. A modern core can execute three instructions per clock (Bulldozer was slated for only averaging two), so 100ns is the time it would ideally take to execute 1200 instructions. Can't remember the Ryzen reorder buffer size, but I think it is around the 300-entry mark? So 100ns is massively beyond what out-of-order execution can work around, and the CPU is halted on a dependent read. On the upside, you just gave the core's second SMT thread plenty of time to use up those otherwise idle execution slots.