Still, NUMA-aware drivers can only work with the underlying hardware, and that hardware is what the cache benchmark is measuring. Even if Windows could detect that a thread was hammering the L3 on the other CCX, the only thing it could do is migrate the thread over. That migration means a context switch, and the thread arrives at the new core with stone-cold L1 and L2 caches, which is horribly expensive. That was shown in the recent article where MS got their NUMA code wrong: it wasn't that they were placing threads badly, as that would only cause a mild impact on performance. AIUI they were moving threads around, and that can easily cause the halving in performance reported.
But when the benchmark above measures the L3 cache, it is thrashing *all* of it. There is no optimal CCX to be in; the system just has to wear it.
Whoah there! ... and this is what cache is all about so let's do the maths.
100ns: at 4GHz we get 4 clocks per ns, so that's 400 clock cycles. A modern core can execute three instructions per clock (Bulldozer was slated for only averaging two), so 100ns is the time it would ideally take to execute 1200 instructions. Can't remember the Ryzen reorder buffer size, but I think it is around the 300-entry mark? So 100ns is massively beyond what out-of-order execution can work around, and the CPU is halted on a dependent read. On the upside, you just gave the core's second SMT thread plenty of time to use up those otherwise idle execution slots.