AMD - Zen chitchat

**Terbinator** · 02-04-2017, 01:22 PM

Originally Posted by watercooled

I've been wondering what the Xbox will use. Did Lisa Su rule out the possibility of Zen or was it more of a dismissal of the question? Either I've forgotten or I didn't watch it.

I believe it was something along the lines of "no semi-custom Zen APUs before 2018" during an investor call around October time. Maybe things have changed/progressed since then?

12GB GDDR5 and Vega is all but confirmed though, I believe.

**CAT-THE-FIFTH** · 02-04-2017, 01:26 PM

Originally Posted by watercooled

I mean, I do understand that when you're reviewing for a website, you have probably hundreds of benchmarks to process, along with hardware changes, and having that much information in front of you means you're not as likely to spot weirdness which might seem like nothing at first glance. Independent reviewers and even users often pick up on things like this, and it would be nice if it got some more attention in the media, to better explain the results. I respect sites more for doing things like this than publishing a ton of robotic test results, not least because it demonstrates the reviewers have a better understanding of what they're actually seeing.

I know a few like Hexus do this when the need arises, but it would be nice if more sites could take this sort of new information and run their own investigation into why some results are the way they are.

We've seen something similar happen with GPUs too, where in some cases we get benchmarks with release drivers, and that's it. 6 months down the line when you're looking to upgrade, those numbers could be completely invalid because of both software and driver updates, and few places re-run benchmarks (again, I do understand it's a time-consuming process so it's probably not practical to do it frequently). In a way, I suppose this is the sort of reason we get rebrand launches from the likes of AMD as it, in a way, forces sites to re-test the products with the latest drivers, and can show performance improvements even if nothing has changed on the hardware side.

I remember that part of ROTTR being quite taxing when I played it with my GTX960, and it's why I do like DF, since it is quite obvious they do know the areas where issues happened.

Another issue is reviews tend to not test the updated games or drivers with older CPUs too.

The internal benchmark under DX12 is jittery - I ran it like 10 times, and occasionally performance would be lower, irrespective of whether it was a warm or cold system, so the screenshots are from the median runs.
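For anyone wanting to do the same with their own runs, here is a minimal sketch of picking the median run out of a batch. The FPS figures are made up purely for illustration (they are not my results), and `pick_median_run` is just a hypothetical helper name.

```python
from statistics import median_low

def pick_median_run(fps_by_run):
    """Return (run_index, fps) for the median run.

    median_low always returns a value that is actually in the list, so with an
    even number of runs we get a real run rather than an average of two.
    """
    fps = median_low(fps_by_run)
    return fps_by_run.index(fps), fps

# Hypothetical average-FPS figures for ten runs (illustrative, not measured).
runs = [61.2, 60.8, 54.1, 61.0, 60.5, 61.4, 55.3, 60.9, 61.1, 60.7]

index, fps = pick_median_run(runs)
print(f"use the screenshot from run #{index + 1} ({fps} FPS)")
```

The two occasional low runs barely move the median, which is why it is a safer pick than the mean for a jittery benchmark.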

The driver released with the GTX1080TI was a new DX12 performance driver, and interestingly enough you do see some gains in certain games now, but all the testing is done with the newest CPUs.

That's under DX11. So the point where FPS dips is where you see a lot of animated animals run up to you, and the part of the Village of the Remnants where most of the NPCs are located.

It's less severe when you drop resolution. I thought it might be the GPU or CPU throttling but after testing it for a while, it wasn't that. But once the jittery runs happened, if I dropped textures to high in-game it solved the problem.

I even tried a different area of the game, and the issue was not present - DX12 performance was slightly better at lower resolution.

Look at the RX470 in comparison.

The GTX1080 is no doubt faster, but the RX470 didn't have the DX12 issue.

Remember this is the SAME driver as used in the GTX1080TI reviews so it is some weird issue that is happening even with older CPUs like mine.

**kompukare** · 02-04-2017, 02:13 PM

Originally Posted by watercooled

I mean, I do understand that when you're reviewing for a website, you have probably hundreds of benchmarks to process, along with hardware changes, and having that much information in front of you means you're not as likely to spot weirdness which might seem like nothing at first glance. Independent reviewers and even users often pick up on things like this, and it would be nice if it got some more attention in the media, to better explain the results. I respect sites more for doing things like this than publishing a ton of robotic test results, not least because it demonstrates the reviewers have a better understanding of what they're actually seeing.

Thing is - like the AotS results with <6> cores - this is not just something which affects Ryzen 7, but also Intel HEDT with 6+ cores. It's just that nobody seems to have noticed as almost everyone was benching with Nvidia cards even for DX12 despite the suspicions people had about Nvidia hardware and DX12 before Pascal and their lack of proper async compute.

So, really reviewers should have continued to keep an eye on this. But while a lot of sites used Haswell-E or Broadwell-E in the past, recently a lot have been using Kaby/Skylake so the fact that Nvidia's DX12 driver is not able to utilise the additional cores of 6+ core CPUs wasn't noticed.

The question is of course, is this something in the way Nvidia's drivers handle the threading that can be fixed in software (after all Nvidia's DX11 drivers threaded a lot better than AMD's), or is it related to AMD Radeons having those ACEs or a similar hardware feature?

If Nvidia's hardware is lacking, there is little they can do to improve the issue, whereas software should eventually be fixable. Might even be that their hardware is the problem but they might be able to do a workaround.

**CAT-THE-FIFTH** · 02-04-2017, 02:25 PM

Originally Posted by kompukare

Thing is - like the AotS results with <6> cores - this is not just something which affects Ryzen 7, but also Intel HEDT with 6+ cores. It's just that nobody seems to have noticed as almost everyone was benching with Nvidia cards even for DX12 despite the suspicions people had about Nvidia hardware and DX12 before Pascal and their lack of proper async compute.

So, really reviewers should have continued to keep an eye on this. But while a lot of sites used Haswell-E or Broadwell-E in the past, recently a lot have been using Kaby/Skylake so the fact that Nvidia's DX12 driver is not able to utilise the additional cores of 6+ core CPUs wasn't noticed.

The question is of course, is this something in the way Nvidia's drivers handle the threading that can be fixed in software (after all Nvidia's DX11 drivers threaded a lot better than AMD's), or is it related to AMD Radeons having those ACEs or a similar hardware feature?

If Nvidia's hardware is lacking, there is little they can do to improve the issue, whereas software should eventually be fixable. Might even be that their hardware is the problem but they might be able to do a workaround.

But if it was just a threading issue, then why with a 4C/8T IB Core i7 am I having really weird performance drops which are not present under DX11? There is something definitely borked when it comes to VRAM management under DX12 too, since the RX470 I had with 4GB of VRAM didn't exhibit the weirdness at all in ROTTR in the exact same scene, which also happens to be one of the most CPU intensive parts of the game, and I have an older CPU too.

It's almost like it can't make a proper decision fast enough.

The benchmark and one or two other areas I tested didn't show that under DX12. It's basically that one area of the game. DX11 has none of the performance dropouts.

The game was installed on an SSD too.

Edit!!

This is what annoys me more.

The whole point of DX12/Vulkan/Mantle is to test gaming in more CPU limited scenarios, i.e. older and slower CPUs.

That is where Mantle looked great.

It's all fine and dandy testing the latest and greatest £300 to £500 CPU, but why don't the sites bother testing DX12 with an older CPU like a SB/IB Core i5/Core i7 or an FX8350?

Lots of people will have older CPUs, so it's not really some absurd situation that a person might upgrade to a qHD screen and get a faster card.

**CAT-THE-FIFTH** · 02-04-2017, 04:09 PM

Another review seeing something similar with ROTTR:

https://thetechaltar.com/amd-ryzen-1800x-performance/5/

There seems to be some concern over whether or not NVIDIA cards play well with Ryzen CPUs. So we tested using a GTX 1080 and turning threaded-optimization on and off to see what kind of a difference it makes. As it turns out, Ryzen is affected in a few games, though only significantly in Rise of the Tomb Raider. The other question concerns NVIDIA’s DX12 support and how that affects performance on this platform. In some games, notably Rise of the Tomb Raider, the GTX 1080 and Pascal Titan X tend to get better performance using DX11, though that’s something that is seemingly processor agnostic. That is, DX11 gives better results on both Intel and AMD processors in certain games with NVIDIA GPUs.

**CAT-THE-FIFTH** · 04-04-2017, 04:46 PM

Hardware.fr have done an article about DDR4 memory scaling:

http://www.hardware.fr/articles/958-...i7-6900k.html4

Look at the memory bandwidth - the Ryzen controller is actually more efficient than the BW-E one.

**watercooled** · 04-04-2017, 04:46 PM

This could be common knowledge already, but AIDA was updated to correctly read Ryzen's cache latencies, and they're far better than previously reported:
https://www.aida64.com/news/aida64-v...cy-cache-speed

Original numbers: https://www.techpowerup.com/231268/a...cx-compromises

The L2 and L3 now seem to be lower than the 6900k! Memory latency is still on the high side however, but apparently this should improve with AGESA updates?
http://semiaccurate.com/2017/04/03/a...emedies-ryzen/

Edit: WOW! How's that for timing CAT? xD

**scaryjim** · 04-04-2017, 05:24 PM

Hmmm, suspect. To me that looks like AIDA is using a small or predictable dataset for the testing - the in depth cache latency tests some of the sites did on release showed that with sequential data AMD's prefetchers were excellent and cache latency was kept very low, while using random data the latency soared as you filled the cache...

**nekomata** · 04-04-2017, 05:30 PM

Just got off of chat to Scan, and apparently retailers are under one weird (to me at least) NDA from AMD.

Despite only being a week out from release, and the RRP having already been announced, the Scan rep told me that they're still not allowed to tell us how much they'll be retailing the Ryzen5 line for. Furthermore, they also told me that the NDA also forbids them from telling us when they will be allowed to tell us how much money they want from us.

I'm hoping someone here is familiar with AMD NDAs but not bound by them. From the above, does it sound like the NDA forbids Scan from telling anyone their prices until the CPUs are already released? Or is there still room for the NDA to allow retailers to reveal their prices before the actual release?

**CAT-THE-FIFTH** · 04-04-2017, 05:30 PM

Originally Posted by watercooled

Edit: WOW! How's that for timing CAT? xD

Also on the same subject, LOL!!

**scaryjim** · 04-04-2017, 05:34 PM

Originally Posted by nekomata

... I'm hoping someone here is familiar with AMD NDAs but not bound by them. From the above, does it sound like the NDA forbids Scan from telling anyone their prices until the CPUs are already released? Or is there still room for the NDA to allow retailers to reveal their prices before the actual release?

I'm not familiar with AMD NDAs but I've seen a few different NDAs, and yes, it sounds perfectly reasonable that AMD will have had retailers sign an NDA that forbids them from telling anyone a) how much they will be selling processors for, and b) the date that the NDA lifts. Those are both perfectly standard things to include in an NDA.

Any reveal of either the prices or the NDA expiry date before the NDA lifts will almost certainly be a breach of the NDA.

**watercooled** · 04-04-2017, 05:45 PM

The way I interpreted it (which could be wrong) is that measuring cache latencies with software requires some knowledge about the CPU in order to get accurate results, hence the reason for the AIDA patch.

With regard to prefetching though, that shouldn't affect at least the L3 given it's a victim cache and doesn't have a prefetcher? Or am I misinterpreting what you're saying?

There are some results for both linear and random access here: http://www.legitreviews.com/amd-ryze...eview_191753/5

According to those results, it looks to me like AMD and Intel are overall very close for L2 until they get above 256kB, which exceeds Intel's L2 cache size (which is important to remember - AMD are achieving this latency with a cache twice the size), and AMD then remain well below Intel's latency through the CCX's 8MB L3.
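To make the block-size comparison a bit more concrete, here's a rough sketch of which level a given block size should land in on each chip. The L2 and L3 sizes are the ones discussed above (512kB vs 256kB L2, 8MB per-CCX L3 vs 20MB), plus the 32kB L1D both have. It's a deliberate simplification: it ignores that Ryzen's L3 is a victim cache (so L2 and L3 contents don't overlap) and assumes a single thread on one CCX. `level_for` is just a name I made up.

```python
KB = 1024
MB = 1024 * KB

# (level, capacity) as seen by ONE thread. Simplified: victim-cache behaviour
# and anything held in the other CCX's L3 are ignored.
RYZEN_CCX = [("L1D", 32 * KB), ("L2", 512 * KB), ("L3", 8 * MB)]
I7_6900K = [("L1D", 32 * KB), ("L2", 256 * KB), ("L3", 20 * MB)]

def level_for(block_size, hierarchy):
    """Return the first cache level big enough to hold the whole block."""
    for name, capacity in hierarchy:
        if block_size <= capacity:
            return name
    return "beyond local L3"

for size in (256 * KB, 512 * KB, 6 * MB, 12 * MB):
    print(f"{size // KB:>6} kB  Ryzen: {level_for(size, RYZEN_CCX):<16} "
          f"6900K: {level_for(size, I7_6900K)}")
```

So a 512kB block is still an L2 hit on Ryzen but already an L3 access on the 6900K, which is the region where the graphs start to diverge.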

**Biscuit** · 05-04-2017, 12:50 AM

I'm regularly under NDA with major broadcasters and technology partners. The terms in our case generally allow you to state you have an NDA with the company and that's about it, you can't say anything else without specific permission.

**scaryjim** · 05-04-2017, 10:49 AM

Originally Posted by watercooled

... With regard to prefetching though, that shouldn't affect at least the L3 given it's a victim cache and doesn't have a prefetcher? Or am I misinterpreting what you're saying?

There are some results for both linear and random access here: http://www.legitreviews.com/amd-ryze...eview_191753/5 ...

I think there are a few things worth noting:

- In the 64 byte stride Linear Forward results AMD's results are frankly phenomenal - those pre-fetchers are amazing when they know what data should be coming.

- Every set of results shows an inflection for AMD at 6MB - 8MB block size, so the L3 cache is definitely being hit, and it's hurting AMD when the block size is too big for a single CCX-worth of L3.

- The larger the stride, the harder it is for AMD's prefetchers, apparently - at a 4096 byte stride the linear forward and full random almost perfectly overlap.

Ryzen's practical cache latencies are very good, as you say - up to a 6MB block size it comfortably beats Intel's latency. But for full random or large stride data patterns, Ryzen is to all intents and purposes an 8MB L3 cache design. Once your access goes beyond that, you're essentially going to main memory for every access. Previous builds of AIDA were, I would assume, testing the latency on the entire 16MB of cache, and finding half of it very much wanting. "Fixing" the benchmark so it only tests Ryzen under ideal circumstances doesn't strike me as the best way to represent real world performance.
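For anyone unsure what "linear forward" versus "full random" means in these tests, here's a minimal sketch of the two access patterns as pointer chains. To be clear, this is my assumption of how such latency tests are typically built, not AIDA's or the legitreviews tool's actual code, and Python is far too slow to measure real latency - a real test would time the same walk in C or assembly. The function names are hypothetical.

```python
import random

def linear_forward(n):
    """Slot i points at slot i + 1 (wrapping): easy for a prefetcher to predict."""
    return [(i + 1) % n for i in range(n)]

def full_random(n, seed=0):
    """Sattolo's algorithm: a random permutation that forms ONE cycle, so the
    walk still touches every slot, but in an order a prefetcher can't guess."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def walk(nxt):
    """Follow the chain from slot 0 until it comes back round. Each step depends
    on the previous load, which is what exposes latency rather than bandwidth."""
    order, i = [], 0
    while True:
        order.append(i)
        i = nxt[i]
        if i == 0:
            return order

block_size = 8 * 1024 * 1024   # 8MB block, i.e. one CCX-worth of L3
stride = 4096                  # bytes between consecutive slots
n = block_size // stride       # 2048 slots; slot i lives at byte offset i * stride

print(walk(linear_forward(n))[:5])
print(walk(full_random(n))[:5])
```

Both walks cover the same block exactly once per lap; only the order differs, so any latency gap between them comes down to how well the prefetchers cope.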

Indeed, I can't help wondering if part of the fixes that have seen such dramatic results in AotS was telling it to treat Ryzen as having an 8MB cache. If the code manages caching block sizes based on the cache it thinks is available that could be a quick fix to boost performance (and I know gaming used to be very cache heavy back in the day...).

**CAT-THE-FIFTH** · 05-04-2017, 02:33 PM

I saw this mentioned on AT forums:

So he tests the GTX1060 and RX480 on a R7 and a Core i7 7700K.

That chap also tests ROTTR but this time tested the game at medium and the GTX1060 saw gains going from DX11 to DX12 which is opposite to what others saw at very high settings.

I saw exactly the same when I dropped down one or two settings in ROTTR from very high to high and that was on an IB Core i7.

**watercooled** · 05-04-2017, 07:45 PM

Originally Posted by scaryjim

I think there are a few things worth noting:

- In the 64 byte stride Linear Forward results AMD's results are frankly phenomenal - those pre-fetchers are amazing when they know what data should be coming.

- Every set of results shows an inflection for AMD at 6MB - 8MB block size, so the L3 cache is definitely being hit, and it's hurting AMD when the block size is too big for a single CCX-worth of L3.

Yeah that's what I meant by the CCX's L3 - it doesn't appear that L2 cache will normally be evicted to the other CCX's L3 cache, but that doesn't mean there is no access to it e.g. through snooping. A more complex test would be required to demonstrate this e.g. by having one CCX access data and have it evicted to its L3, immediately followed by the other CCX accessing it. IIRC the way this works is that the request is sent to both the other L3 and the IMC simultaneously, and if the data is present in the other L3, the memory access is cancelled (providing the cached data hasn't been invalidated of course).

Originally Posted by scaryjim

- The larger the stride, the harder it is for AMD's prefetchers, apparently - at a 4096 byte stride the linear forward and full random almost perfectly overlap.

The same seems true of Intel's prefetching too from what I can see.

Originally Posted by scaryjim

Ryzen's practical cache latencies are very good, as you say - up to a 6MB block size it comfortably beats Intel's latency. But for full random or large stride data patterns, Ryzen is to all intents and purposes an 8MB L3 cache design. Once your access goes beyond that, you're essentially going to main memory for every access. Previous builds of AIDA were, I would assume, testing the latency on the entire 16MB of cache, and finding half of it very much wanting. "Fixing" the benchmark so it only tests Ryzen under ideal circumstances doesn't strike me as the best way to represent real world performance.

It seems quite normal for cache latency to increase as you approach filling it, likely due to the caching algorithms not being designed to operate in that way. There's not enough granularity in those tests to see in more detail, but Ryzen stays completely flat even at 6MB (75%) while Intel's 20MB cache starts curving upwards before the 75% mark. There's no 15MB point on the graph, but it's very obvious at 16MB, and apparent even at 8 and 12 depending on the other variables. AMD seem to have done a very good job of keeping cache latencies down within a CCX, but like you say, as far as an independent thread is concerned, it appears more like an 8MB cache than a 16MB one, and I think some places are describing it as 2x8MB.

WRT the comment about AIDA - I don't think that's how it works, nor is it really possible to test in the way you describe. Don't forget it's a cache which is wholly controlled by the CPU itself, and not addressable local memory i.e. you can't choose to use a certain amount. This isn't a workload performance benchmark, you're looking for a precise measurement of one specific aspect of the microarchitecture, and there's really only one right answer for a given test - it's one of those cases where, provided you're measuring correctly, you're never going to get a result better than what is theoretically possible.

The sort of fixes I assume they will have had to make would be, like I said, more about getting accurate results e.g. ensuring they're timing the measurements correctly. If they were e.g. trying to use a 16MB block size with a single thread, they would have just been hitting main memory for half of it, which I guess is possibly what was happening.

Cache latency obviously cannot be extrapolated to real-world performance - either they're measuring it correctly or they're not. They freely admitted the older version was not. It's not like a special optimisation to make it look better than it is.

Originally Posted by scaryjim

Indeed, I can't help wondering if part of the fixes that have seen such dramatic results in AotS was telling it to treat Ryzen as having an 8MB cache. If the code manages caching block sizes based on the cache it thinks is available that could be a quick fix to boost performance (and I know gaming used to be very cache heavy back in the day...).

It would be interesting to read what exactly the fixes were. It is possible, to an extent, for software developers to try to keep certain objects in cache by carefully tuning their code, but again they're still relying both on the cache controller's own algorithms behaving how they expect, and on other competing threads not eating into what's available to use.

There's some interesting information on uArch tuning and superoptimisers at the website for y-cruncher: http://www.numberworld.org/y-crunche...mizations.html

If I had to guess, I doubt the AotS fix would have been exactly that, with my reasoning being: Ryzen doesn't have just 8MB of L3 cache, and a game is more than able to access all 16MB of it depending on how it's threaded and scheduled. AotS had no 'knowledge' of Ryzen, so could not have been e.g. mistakenly tuned for 16MB of accessible cache per thread. You're just running a binary that already existed and will run happily on much smaller caches e.g. 6MB in the i5, so it doesn't appear like the game was specifically tuned in some way to utilise 16MB of cache.

So if the performance improvements are down to (or partly down to) caching, I reckon the answer would be more complicated and involve more than just its apparent size. Like many games, it's also multi-threaded which adds another dimension to the possibilities e.g. competition, snooping across the fabric, etc. Perhaps one way to limit the impact of Ryzen's layout would be to somehow avoid snooping the remote L3 cache e.g. by scheduling dependent threads on one CCX?
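As a rough illustration of the "keep dependent threads on one CCX" idea, here's a minimal sketch using CPU affinity on Linux. The big assumption is the logical CPU numbering: I'm assuming SMT siblings are numbered consecutively, so CCX 0 is logical CPUs 0-7 and CCX 1 is 8-15 on an 8C/16T part. That is NOT guaranteed - it varies by OS and firmware - so check the real topology first. `ccx_cpus` and `pin_to_ccx` are made-up names, and I have no idea whether any game actually does this.

```python
import os

def ccx_cpus(ccx, cores_per_ccx=4, smt=2):
    """Logical CPU ids belonging to one CCX.

    ASSUMPTION: SMT siblings are numbered consecutively, so each CCX owns a
    contiguous range. Verify against the actual topology before relying on it.
    """
    per_ccx = cores_per_ccx * smt
    return set(range(ccx * per_ccx, (ccx + 1) * per_ccx))

def pin_to_ccx(ccx):
    """Restrict the calling process to one CCX, where the OS supports it.

    os.sched_setaffinity only exists on some platforms (e.g. Linux), hence the
    hasattr check. Returns the CPU set applied, or None if nothing was changed.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    target = ccx_cpus(ccx) & os.sched_getaffinity(0)
    if not target:
        return None
    os.sched_setaffinity(0, target)
    return target

print("CCX 0:", sorted(ccx_cpus(0)))
print("CCX 1:", sorted(ccx_cpus(1)))
```

Calling `pin_to_ccx(0)` at the start of a process would keep all of its threads on the first CCX, so they only ever share the local 8MB L3, at the cost of leaving the other four cores idle for that process.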

About the design in general, I was listening to another podcast with David Kanter as a guest (techreport) and he mentioned how it's very much a trade off. Intel do their best to keep latency uniform across the larger core-count parts with a huge ring bus, but this adds a significant amount of complexity, and probably cost, power and notably, local latency. Having a more beastly fabric is likely to add more cycles to every single access, whereas Ryzen's design keeps local performance very good at the expense of higher latency for (far less frequent, especially when properly scheduled) snooping across the interconnect.

He also mentioned that Intel's ring-bus paradigm gets increasingly difficult as you scale in core count, hence why Xeon Phi and probably the big Skylake use meshes, with non-uniform latency.
