Page 2 of 3 (results 17 to 32 of 34)

Thread: AMD Ryzen Threadripper 2990WX perf boosted 2X by CorePrio tool

  #17
    Senior Member Iota

    Quote Originally Posted by LSG501 View Post
    Actually I did, hence why I said AMD needs to work with Microsoft to get it fixed... AMD can't fix MS code, but they can work with MS to speed up a fix.
    Which leads around in circles to: why should AMD help Microsoft fix their scheduler? It's Microsoft that will bleed server-software customers to Linux; you would think it would be in Microsoft's best interest to fix that ASAP.

  #18
    Senior Member Tabbykatze

    Quote Originally Posted by LSG501 View Post
    Actually I did, hence why I said AMD needs to work with Microsoft to get it fixed... AMD can't fix MS code, but they can work with MS to speed up a fix.
    Why would Microsoft rush to fix something that (in their corporate eyes) isn't actually broken?

    And something like changing a scheduler that dramatically is not a change you can just get a few script kiddies to whack out in an afternoon like they're back in uni with their respective drunk booty calls. This stuff takes months to develop, finalise, regression test, closed- and open-box test, and HCL test; then, if it gets through all that, you pick a branch to place it on and slowly roll it out.

    Months is the best-case timeline; this will take much longer if it's not even a priority...

  #19
    Senior Member LSG501

    Quote Originally Posted by Iota View Post
    Which leads around in circles to: why should AMD help Microsoft fix their scheduler? It's Microsoft that will bleed server-software customers to Linux; you would think it would be in Microsoft's best interest to fix that ASAP.
    AMD understands their CPU better than MS; MS understands the scheduler better than AMD... working together will get things done easier/faster. It's not like Intel doesn't work with MS to ensure that Windows performs as well as it can with their hardware or anything... oh wait...

    I have no idea why it's such a hard concept for people on this forum to understand that it's sometimes better to work together to make things go faster.


    Quote Originally Posted by Tabbykatze View Post
    Why would Microsoft rush to fix something that (in their corporate eyes) isn't actually broken?
    Have you considered they may not have known it was broken? It wasn't exactly run-of-the-mill testing that was needed to come up with this 'hack' to get it working better...
    Why would they fix something quickly? Because it might stop people going to Linux when using Threadripper, and, well, MS sell their software...

  #20
    Long member philehidiot

    If they do make a kernel-level fix, I want to know two things:

    - Will it run Crysis?
    - Will it make my Netflix faster?

  #21
    Senior Member Corky34

    Quote Originally Posted by LSG501 View Post
    AMD understands their CPU better than MS; MS understands the scheduler better than AMD... working together will get things done easier/faster. It's not like Intel doesn't work with MS to ensure that Windows performs as well as it can with their hardware or anything... oh wait...

    I have no idea why it's such a hard concept for people on this forum to understand that it's sometimes better to work together to make things go faster.
    I'm not sure it's something people are finding hard to understand; it's that they, myself included, believe this isn't something Microsoft has, or ever will have, any interest in fixing. AMD could try working with them as much as they like, but if Microsoft doesn't want to fix it there's not much they can do; you can't work with someone if they don't want to work with you.

    Also, it seems very likely that the 'fix' Microsoft came up with when supposedly working with Intel to get dual-socket Xeon V4s (and maybe V3s) playing nice with Windows is what's now causing this problem. AFAIK, instead of altering the scheduler so it could handle NUMA nodes correctly, they decided to set an "ideal_cpu" flag so that when a process runs it's confined to only one CPU in a dual-socket system. That's great if you actually have a dual-socket system, but not so great if the scheduler mistakes your 32-core CPU for four separate 8-core CPUs in four sockets.
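
    Windows does expose a per-thread "ideal processor" hint through the public Win32 API, which at least lets you poke at the concept being described; whether that user-mode hint maps one-to-one onto the internal flag mentioned above is an assumption, not something documented here. A minimal sketch that reads the hint for the current thread and then points it at logical processor 0 of group 0:

    Code:
    /* Sketch only: reads the calling thread's "ideal processor" hint and then
     * overrides it. Assumes this user-mode hint is the same concept as the
     * internal "ideal_cpu" discussed above, which is not confirmed. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        PROCESSOR_NUMBER ideal, previous;

        if (GetThreadIdealProcessorEx(GetCurrentThread(), &ideal))
            printf("current ideal processor: group %u, number %u\n",
                   (unsigned)ideal.Group, (unsigned)ideal.Number);

        PROCESSOR_NUMBER wanted = { 0 };   /* group 0, logical processor 0 */
        if (SetThreadIdealProcessorEx(GetCurrentThread(), &wanted, &previous))
            printf("ideal processor moved from %u to %u\n",
                   (unsigned)previous.Number, (unsigned)wanted.Number);

        return 0;
    }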

  #22
    Senior Member keithwalton

    Since I'm getting déjà vu, I'm guessing it's a sign of poor communication between AMD and MS.

    As I recall, when they first did SMT (partial) with Bulldozer they used a different numbering convention to Intel for differentiating between cores and threads.

    On a 4-core/8-thread system, one company used 0-3 for cores and 4-7 for the additional threads, while the other used evens for cores and odds for the extra threads. Since Intel got there first, Windows was optimised to use their convention for utilising the cores and threads.

    Threadripper is essentially a quad-socket (die) CPU; it's just all on the one package. The TR4 socket even looks like four discrete sockets under one clamp.

    The scheduler should treat it as 4 x 8-core (16-thread) CPUs and not a single 64-thread monster: Infinity Fabric might be much better than multi-socket inter-core communication, but it is still orders of magnitude higher latency and lower bandwidth than communication on the same die.
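
    For anyone curious which convention their own machine ends up reporting, the core-to-logical-processor mapping can be dumped with GetLogicalProcessorInformationEx. A quick sketch (an illustration only, not anything AMD or MS ship) that prints each physical core and the affinity mask of its SMT siblings:

    Code:
    /* Sketch: list each physical core and the mask of logical processors
     * (SMT siblings) that belong to it, making the numbering visible. */
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationProcessorCore, NULL, &len);
        char *buf = malloc(len);
        if (!buf || !GetLogicalProcessorInformationEx(RelationProcessorCore,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
            return 1;

        int core = 0;
        for (char *p = buf; p < buf + len; ) {
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
            printf("core %2d (SMT: %s) group %u mask 0x%llx\n", core++,
                   (info->Processor.Flags & LTP_PC_SMT) ? "yes" : "no",
                   (unsigned)info->Processor.GroupMask[0].Group,
                   (unsigned long long)info->Processor.GroupMask[0].Mask);
            p += info->Size;
        }
        free(buf);
        return 0;
    }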

  #23
    Senior Member Corky34

    Quote Originally Posted by keithwalton View Post
    The scheduler should treat it as 4 x 8-core (16-thread) CPUs and not a single 64-thread monster: Infinity Fabric might be much better than multi-socket inter-core communication, but it is still orders of magnitude higher latency and lower bandwidth than communication on the same die.
    If it should be treating it as 4 x 8-core (16-thread) CPUs and not a single 64-thread monster then we wouldn't see in excess of a 50% performance improvement when this issue is temporarily fixed or when running on Linux. Inter-die latency may be slightly higher than inter-core, but we're talking about a 40-50ns difference IIRC, not the typical 400-500ns added by inter-socket communication.

    And even if it did treat it as 4 x 8-core (16-thread) CPUs, it still wouldn't fix the problem AFAIK, as the problem is being caused because, under some circumstances, the scheduler is assigning a process 16 threads whereas the OS is attempting to run 64 threads, so the scheduler spends half its time swapping threads about rather than letting the process finish.
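
    For reference, the kind of per-node juggling being described is something a user-mode tool can at least approximate with documented calls, which is presumably the territory a tool like CorePrio plays in (how CorePrio itself is actually implemented isn't covered here). A rough sketch that enumerates each NUMA node's processor mask and then restricts the calling thread to node 0:

    Code:
    /* Sketch: enumerate NUMA node processor masks and pin the calling thread
     * to node 0. Illustrates the general affinity technique only; it is not
     * a description of how CorePrio actually works. */
    #include <windows.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highest = 0;
        if (!GetNumaHighestNodeNumber(&highest))
            return 1;

        for (USHORT node = 0; node <= (USHORT)highest; node++) {
            GROUP_AFFINITY ga;
            if (GetNumaNodeProcessorMaskEx(node, &ga))
                printf("node %u: group %u mask 0x%llx\n", (unsigned)node,
                       (unsigned)ga.Group, (unsigned long long)ga.Mask);
        }

        GROUP_AFFINITY node0;
        if (GetNumaNodeProcessorMaskEx(0, &node0)) {
            memset(node0.Reserved, 0, sizeof node0.Reserved); /* must be zero */
            if (SetThreadGroupAffinity(GetCurrentThread(), &node0, NULL))
                printf("calling thread now restricted to node 0\n");
        }
        return 0;
    }
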
    Last edited by Corky34; 04-01-2019 at 09:08 AM.

  #24
    root Member DanceswithUnix

    Quote Originally Posted by LSG501 View Post
    I have no idea why it's such a hard concept for people on this forum to understand that it's sometimes better to work together to make things go faster.
    Because that goes against Brooks' law, as described in The Mythical Man-Month in the 1970s. You are trying to solve a problem by adding people to it, and by doing so you increase the communication overheads within the group and throw out the existing group dynamic. Throwing people at problems makes them worse; it isn't obvious, to the point that it still happens, but it is very well documented.

    MS will have an existing kernel group that deals with scheduling problems. That team knows the code involved, the issues surrounding messing with it, and the tools to diagnose it. The leader of that group should be making sure they have a Threadripper and Epyc setup, and should already have a contact or two within AMD to ask the odd question should one arise. Faults trump backlog, so next agile sprint the issue should be investigated and possibly fixed, unless the fix is complex, in which case it might need another sprint or two of work.

    The main problem here is that having tweaked some thread placement or migration cost (likely both) within the scheduler, performance regression tests have to be run against every single platform that MS support. Running all those tests takes time, and at the end of the day this isn't like fixing an application that's crashing; you are optimising to get the best overall performance over a range of platforms and applications, and improving Threadripper by 100% on one app might drop Intel Xeon performance by 2%, or perhaps slow another TR application that was happy with the old scheduler, so someone has to make a value call to say they are happy with the result.

    The red flag here is gaming mode. That took effort to implement, which makes me wonder if AMD flagged these problems to MS early on and got a bad enough response that AMD put in place the only thing they could that was under total AMD control: a BIOS-based hack.

    I don't have a general bias against Microsoft and there are some things they do well, but I do have a low opinion of their support.

  #25
    root Member DanceswithUnix

    Quote Originally Posted by Corky34 View Post
    If it should be treating it as 4 x 8-core (16-thread) CPUs and not a single 64-thread monster then we wouldn't see in excess of a 50% performance improvement when this issue is temporarily fixed or when running on Linux. Inter-die latency may be slightly higher than inter-core, but we're talking about a 40-50ns difference IIRC, not the typical 400-500ns added by inter-socket communication.
    Careful, I think your logic implies that Linux is just assigning the threads randomly across cpus. The current Linux kernel was forged on IBM Power NUMA systems a long time ago, and IBM are really good at this stuff. AMD will have tweaked the existing code to understand their physical layout, but the hard stuff was done.

    You might also be ignoring cache line invalidation. The problem with threads being on the wrong CPU is that only one CPU can hold the current state of a bit of RAM. Two threads working too close to each other means a cache line thrashes back and forth between two cores, worst case between two NUMA nodes. So it isn't so much the additional latency that kills performance, it is the fact that with CPUs sending cache line invalidates to each other you are effectively running with the cache switched off. That can easily halve performance. That is why thread locality matters (matching memory locality with CPU cache hierarchy locality), and also why it is easier to schedule entire single thread tasks than threads within a task.

    You get a similar problem if you move threads around from one CPU to another: the data has to follow the thread, meaning you effectively throw away the cache contents until they are reloaded.
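
    A quick way to feel that effect for yourself: the toy program below (an illustration of the cache-line ping-pong being described, nothing to do with the Windows scheduler itself) runs two threads that each hammer their own counter, first with both counters in the same cache line and then padded onto separate lines. On most machines the padded version finishes dramatically faster.

    Code:
    /* Sketch: false-sharing demo. Two threads each increment their own
     * counter; when the counters share a cache line the line ping-pongs
     * between cores, when they are padded apart it does not. */
    #include <windows.h>
    #include <stdio.h>

    #define ITERS 100000000ULL

    static volatile long long shared_counters[2];       /* same cache line */
    static struct { volatile long long v; char pad[56]; }
        padded_counters[2];                             /* separate lines  */

    static DWORD WINAPI bump_shared(LPVOID arg)
    {
        size_t i = (size_t)arg;
        for (unsigned long long n = 0; n < ITERS; n++) shared_counters[i]++;
        return 0;
    }

    static DWORD WINAPI bump_padded(LPVOID arg)
    {
        size_t i = (size_t)arg;
        for (unsigned long long n = 0; n < ITERS; n++) padded_counters[i].v++;
        return 0;
    }

    static double run_pair(LPTHREAD_START_ROUTINE fn)
    {
        LARGE_INTEGER freq, t0, t1;
        HANDLE th[2];
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (size_t i = 0; i < 2; i++)
            th[i] = CreateThread(NULL, 0, fn, (LPVOID)i, 0, NULL);
        WaitForMultipleObjects(2, th, TRUE, INFINITE);
        QueryPerformanceCounter(&t1);
        for (size_t i = 0; i < 2; i++) CloseHandle(th[i]);
        return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    }

    int main(void)
    {
        printf("counters in one cache line: %.2f s\n", run_pair(bump_shared));
        printf("counters padded apart:      %.2f s\n", run_pair(bump_padded));
        return 0;
    }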

  #26
    Senior Member Corky34

    Just throwing this out there too, but when AMD released that thread scheduling thing for, was it, WattMan a while back, that seems to imply they were aware of scheduling issues with Windows; one would assume they wouldn't dedicate resources to something like that unless they had to.

    Quote Originally Posted by DanceswithUnix View Post
    Careful, I think your logic implies that Linux is just assigning the threads randomly across cpus. The current Linux kernel was forged on IBM Power NUMA systems a long time ago, and IBM are really good at this stuff. AMD will have tweaked the existing code to understand their physical layout, but the hard stuff was done.
    That wasn't the intention, so thanks for pointing that out.

    Yes, the hard work was already done; from what I read, or listened to in the video, it seems it took over ten years for Linux to get NUMA working properly/efficiently.

    Quote Originally Posted by DanceswithUnix View Post
    You might also be ignoring cache line invalidation. The problem with threads being on the wrong CPU is that only one CPU can hold the current state of a bit of RAM. Two threads working too close to each other means a cache line thrashes back and forth between two cores, worst case between two NUMA nodes. So it isn't so much the additional latency that kills performance, it is the fact that with CPUs sending cache line invalidates to each other you are effectively running with the cache switched off. That can easily halve performance. That is why thread locality matters (matching memory locality with CPU cache hierarchy locality), and also why it is easier to schedule entire single thread tasks than threads within a task.

    You get a similar problem if you move threads around from one CPU to another: the data has to follow the thread, meaning you effectively throw away the cache contents until they are reloaded.
    I think we may be conflating CPUs with sockets; that's basically what the Windows scheduler seems to be doing. It seems to think, under some circumstances, that the four clusters of 8 cores within each TR are four separate sockets, so it sets one of the CCXs as an "ideal_cpu", whereas the OS sees the four clusters of 8 cores within each TR as a single CPU.

    I get what you're saying, just about, with cache line invalidation, but wouldn't that only apply to a multi-socket system and not something like TR where the cache is shared between all cores?
    Last edited by Corky34; 04-01-2019 at 10:46 AM.

  #27
    Not a good person scaryjim

    Quote Originally Posted by Corky34 View Post
    ... wouldn't that only apply to a multi-socket system and not something like TR where the cache is shared between all cores?
    The TR in the 2990WX IS essentially four separate AM4 CPUs. The fact that they're wired together on a substrate rather than across motherboard traces doesn't change that basic fact. That's what having NUMA modes for TR is all about - it's telling the OS that it doesn't act like a single massive CPU.

    Transport between dies on the substrate gets farmed across the Infinity Fabric, which is (based on cache & memory latency results) an order of magnitude slower than access to local L3 cache. But actually Ryzen has another problem, because even within a die the cores don't have direct access to all the cache. So in TR there are actually three different localities of access to L3 cache - within a CCX (very fast), to the other CCX on the die (a lot slower), and to the other dies (extremely slow). It's a complex scheduling issue, particularly for workloads that are latency and cache sensitive...
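
    Those localities are at least partly visible to software: Windows will tell you which logical processors sit behind each L3 slice. A small sketch (a plain illustration of the public API; on a Zen part each entry should correspond to one CCX) that lists every L3 cache and the mask of cores sharing it:

    Code:
    /* Sketch: list each L3 cache and the logical processors that share it. */
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationCache, NULL, &len);
        char *buf = malloc(len);
        if (!buf || !GetLogicalProcessorInformationEx(RelationCache,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
            return 1;

        for (char *p = buf; p < buf + len; ) {
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
            if (info->Cache.Level == 3)
                printf("L3 %u KB shared by group %u mask 0x%llx\n",
                       (unsigned)(info->Cache.CacheSize / 1024),
                       (unsigned)info->Cache.GroupMask.Group,
                       (unsigned long long)info->Cache.GroupMask.Mask);
            p += info->Size;
        }
        free(buf);
        return 0;
    }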

  #28
    Senior Member Corky34

    It's not though, is it? I thought the L3 cache within all Zen-based CPUs was shared between all cores; I thought that was the job of the cache-coherent master.

  #29
    Not a good person scaryjim

    Quote Originally Posted by Corky34 View Post
    It's not though, is it? I thought the L3 cache within all Zen-based CPUs was shared between all cores; I thought that was the job of the cache-coherent master.
    It's shared, but they don't have symmetrical access to it, and that's what matters for performance. Once you go outside the CCX the latency jumps massively - e.g. look at the bottom chart on Anandtech's latency deep dive for Ryzen 2 and you'll see what I mean. Plus that chart's using a log scale for the y-axis. That's an enormous jump in latency going from 4MB strides to 8MB strides - if Ryzen had equal access to all 16MB of cache you simply wouldn't see that.

    The entire Ryzen family - as it currently stands - is made of quad-core, 8MB L3 cache units, with varying degrees of interconnectedness.
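
    For anyone who fancies reproducing the shape of that chart at home, a loop of dependent loads over a growing working set is enough to see the cliffs. A rough sketch (sizes, stride and timing method are arbitrary choices here, and hardware prefetch will flatter a regular stride, so treat the output as the shape of the curve rather than true latencies):

    Code:
    /* Sketch: crude cache/memory latency probe using dependent loads spaced
     * one cache line apart across buffers of increasing size. */
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LINE  64                       /* assumed cache-line size, bytes */
    #define HOPS  (50 * 1000 * 1000LL)     /* dependent loads per size       */

    static volatile size_t sink;           /* stops the loop being elided    */

    static double probe(size_t bytes)
    {
        size_t elems = bytes / sizeof(size_t);
        size_t step  = LINE / sizeof(size_t);
        size_t *chain = calloc(elems, sizeof *chain);

        /* Each visited element holds the index of the next, one line away. */
        for (size_t i = 0; i < elems; i += step)
            chain[i] = (i + step) % elems;

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);

        size_t idx = 0;
        for (long long n = 0; n < HOPS; n++)
            idx = chain[idx];              /* every load depends on the last */

        QueryPerformanceCounter(&t1);
        sink = idx;
        free(chain);
        return (double)(t1.QuadPart - t0.QuadPart) * 1e9
             / ((double)freq.QuadPart * (double)HOPS);
    }

    int main(void)
    {
        /* Working sets straddling typical L1 / L2 / L3 / DRAM boundaries. */
        size_t sizes_kb[] = { 16, 256, 2048, 4096, 8192, 16384, 65536 };
        for (size_t i = 0; i < sizeof sizes_kb / sizeof sizes_kb[0]; i++)
            printf("%6zu KB: %.1f ns per load\n",
                   sizes_kb[i], probe(sizes_kb[i] * 1024));
        return 0;
    }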

  #30
    Senior Member Corky34

    Oh right, I think I get what you mean; you're talking from a design POV, or the way it works within the package, yes? If so then yes, but isn't how each CPU/socket/package operates internally unknown to the OS? All it sees is CPU 1, 2, 3, or whatever, with a pool of external memory (RAM), either assigned to each CPU with equal priority as a unified whole, or as parts with different priorities that make up the whole, no?

  #31
    Senior Member keithwalton

    Quote Originally Posted by Corky34 View Post
    Oh right, I think I get what you mean; you're talking from a design POV, or the way it works within the package, yes? If so then yes, but isn't how each CPU/socket/package operates internally unknown to the OS? All it sees is CPU 1, 2, 3, or whatever, with a pool of external memory (RAM), either assigned to each CPU with equal priority as a unified whole, or as parts with different priorities that make up the whole, no?
    If the internal layout isn't known to the OS then it will either treat them all equally, or it has been optimised (with different priorities) for existing hardware. If AMD's optimum layout is different from the existing hardware then prioritisation would make things worse rather than better.

    Also, from memory (excuse the pun), there isn't a direct link between die 1 and die 4, so data has to flow via die 2 or 3.

    When I first saw the block diagram for Threadripper it did jump out at me that the dies were arranged in a ring and all of the memory was attached to separate branches on the outside of the ring. At the time I thought: wouldn't it be great/logical for the memory to be in the middle of the circle, with each die having direct access to it?
    The problem with that concept would be cores fighting over the same piece of memory. But since modern CPUs have an L3 cache shared across many cores, someone must have figured the clash out.

    Having seen some block diagrams for Zen 2 a few weeks back, it looks like this is what AMD are planning to do: move the memory to a common central pool rather than an external ring bus.

    Edit - this is the link to Zen 2 being turned inside out, with the I/O getting its own central die and the cores being the spokes on the wheel: https://wccftech.com/amd-zen-2-7nm-c...yzen-official/
    Last edited by keithwalton; 05-01-2019 at 01:28 AM. Reason: Link added

  #32
    Senior Member Corky34

    Quote Originally Posted by keithwalton View Post
    If the internal layout isn't known to the OS then it will either treat them all equally, or it has been optimised (with different priorities) for existing hardware. If AMD's optimum layout is different from the existing hardware then prioritisation would make things worse rather than better.
    When I said it's not known, I meant at the level of knowing the access times of a CPU's various caches. AFAIK what the scheduler sees, or knows about, is fairly basic stuff like how many cores, how many sockets, and possibly whether the socket is attached to its own pool of local memory (local as in RAM, not cache).

    In other words it only sees the hardware resources it can assign work to and what resources have what priorities.
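
    To give a feel for how coarse that view is, this is roughly everything a user-mode program can easily ask Windows about the NUMA layout - the node count and the local memory behind each node (an example only; the figures will obviously differ per machine):

    Code:
    /* Sketch: the coarse NUMA view Windows exposes to user mode -- node
     * count plus the free local memory behind each node. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highest = 0;
        if (!GetNumaHighestNodeNumber(&highest))
            return 1;
        printf("NUMA nodes reported: %lu\n", highest + 1);

        for (USHORT node = 0; node <= (USHORT)highest; node++) {
            ULONGLONG avail = 0;
            if (GetNumaAvailableMemoryNodeEx(node, &avail))
                printf("node %u: %llu MB of local RAM available\n",
                       (unsigned)node, avail / (1024ULL * 1024ULL));
        }
        return 0;
    }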

    Quote Originally Posted by keithwalton View Post
    Also, from memory (excuse the pun), there isn't a direct link between die 1 and die 4, so data has to flow via die 2 or 3.
    Yes, but the scheduler should be marking the cores that have direct access to memory as "ideal_cpu", something it seems to be doing; then, if a process (a program) calls for more threads than the "ideal_cpu" can handle, those threads should spill over to the dies without direct access to memory, something it doesn't seem to be doing.

    Quote Originally Posted by keithwalton View Post
    When I first saw the block diagram for Threadripper it did jump out at me that the dies were arranged in a ring and all of the memory was attached to separate branches on the outside of the ring. At the time I thought: wouldn't it be great/logical for the memory to be in the middle of the circle, with each die having direct access to it?
    The problem with that concept would be cores fighting over the same piece of memory. But since modern CPUs have an L3 cache shared across many cores, someone must have figured the clash out.

    Having seen some block diagrams for Zen 2 a few weeks back, it looks like this is what AMD are planning to do: move the memory to a common central pool rather than an external ring bus.

    Edit - this is the link to Zen 2 being turned inside out, with the I/O getting its own central die and the cores being the spokes on the wheel: https://wccftech.com/amd-zen-2-7nm-c...yzen-official/
    It's more like a mesh than a ring, but yes, that picture you saw (linked to) was of the next-gen EPYC, with clusters of CPU cores arranged around a central input/output die. Putting your memory controllers there, and by extension routing all requests to RAM via that central die, is fine as long as you balance your memory controllers (channels) with your cores.

    The central pool you refer to is, as far as we know so far, only for I/O, so that's things like data going to PCIe, USB, RAM, etc. Each die attached to the separate branches will still have its own L1 and L2 caches along with its shared L3 cache. There is some speculation that the central I/O die may contain some L4 cache, but it's probably best not to confuse things any further.
