I wonder if anyone knows where this '22GB/s' figure came from, referring to the inter-CCX bandwidth? Unless it's some measured, workload-specific value, it seems bizarre, and it's been bugging me since it first started appearing unreferenced in forum posts and articles.
The slides are out there for everyone to see and calculate from. This is the one you need: https://i1.wp.com/thetechaltar.com/w...eCQG.jpg?ssl=1
Each CCX communicates with the fabric at 32 bytes per cycle, at memclk. So, assuming DDR4-2667 as in the slide, that means 1333MHz. This is something far too many people claim to understand yet keep confusing: the DDR memory transfer rate is *twice* the clock speed. The interconnect is not at a 1/2 ratio of the memory clock, nor does it have anything to do with memory channels - all the information you need to understand that is clearly presented in the slide.
So, to continue: 1333MHz multiplied by 32B/cycle = 42.656GB/s.
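If it helps, here's that arithmetic as a trivial Python sanity check - the only input is the slide's DDR4-2667 memclk, nothing here is measured, it's just unit conversion:

```python
# Fabric link bandwidth per CCX: 32 bytes per fabric clock, fabric clock = memclk.
# Per the slide: DDR4-2667 -> memclk = 1333 MHz.
memclk_hz = 1333e6          # fabric runs at memclk
bytes_per_cycle = 32        # CCX <-> fabric link width per the slide

link_bw = memclk_hz * bytes_per_cycle            # bytes/s
print(f"CCX<->fabric: {link_bw / 1e9:.3f} GB/s")  # ~42.656 GB/s, not 22
```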
That's still far below intra-CCX L3 bandwidth, but it's not the 22GB/s figure being bandied about. It's not exactly double 22GB/s either, so it doesn't look like someone simply halved a correct calculation by mistake because of that 0.5x memclk confusion.
Now on to another point: some people are again causing confusion by claiming the fabric bandwidth is below that available to memory. I'm not sure why anyone would think this, not least because the IMC connects to the cores *through* the fabric. (One person also says 22GB/s is similar to, quite specifically, DDR3-1600 of all things - why it couldn't have been DDR4-1600 is beyond me - but either way, a single channel of that gives 12.8GB/s, nowhere near 22GB/s.) Again looking at that slide, it seems any CCX can fully saturate both memory channels. Since everything here sits in the same clock domain, we don't even need full bandwidth calculations: it's 32B/cycle from the CCX, 32B/cycle to the IMC, and 16B/cycle to each of the two memory channels.
And there's nothing weird going on either, as this works out exactly as expected: 16B/cycle multiplied by 1333MHz gives ~21.3GB/s. 1333MHz DDR memory means 2666MT/s across a 64-bit (8-byte) wide memory channel, so 2666MT/s x 8B = ~21.3GB/s - exactly the same.
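Same sanity check for a single channel, done both ways - fabric side and DRAM side - again just the slide's figures plugged in:

```python
# One memory channel, computed two equivalent ways.
memclk_hz = 1333e6

# Fabric side: 16 bytes per cycle per channel, at memclk.
fabric_side = memclk_hz * 16              # ~21.3 GB/s

# DRAM side: DDR -> two transfers per clock, 64-bit (8-byte) channel.
dram_side = (memclk_hz * 2) * 8           # 2666 MT/s x 8 B, ~21.3 GB/s

print(fabric_side / 1e9, dram_side / 1e9) # both ~21.3 - exactly the same
```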
Another theory is that there are some hard partitions in the fabric. Again, this demonstrably doesn't seem to be the case - check that AIDA screenshot again. In that case they're using 1200MHz (DDR4-2400) memory, but it works out all the same: 1200MHz x 16B/cycle = 19.2GB/s per channel, 38.4GB/s dual channel, and AIDA shows ~37GB/s - no obvious bottlenecks there. So, if memory accesses can cross the fabric unhindered, why would inter-CCX cache snoops run at some arbitrarily lower rate across the same fabric? That makes no sense.
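And the DDR4-2400 case from that screenshot, same sketch - note the ~37GB/s is AIDA's measured figure, not computed:

```python
# DDR4-2400 -> memclk = 1200 MHz, as in the AIDA screenshot.
memclk_hz = 1200e6
per_channel = memclk_hz * 16     # 19.2 GB/s per channel
dual_channel = per_channel * 2   # 38.4 GB/s theoretical

print(dual_channel / 1e9)        # 38.4; AIDA measures ~37, i.e. no fabric bottleneck
```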
What isn't yet clear, AFAIK, is how exactly this fabric works internally, and how (or if) this bandwidth is shared. Could CCX0 read from CCX1 at 32B/cycle while CCX1 simultaneously reads from CCX0 at 32B/cycle? Do cache snoops have to compete with the IO hub and memory for bandwidth on the fabric? E.g. can CCX0 access memory unhindered while CCX1 accesses the IO hub? Nothing on that slide mentions total fabric bandwidth, and any single port could saturate 32B/cycle on its own.
One last thing: the arrows on the slide are a bit unclear, e.g. across clock boundaries like cores>L3 or L3>fabric. However, I believe they mean one 'connection' per core and per CCX, respectively. Reasoning: AIDA again. Looking at L3 bandwidth, 32B x cclk (3500MHz) = 112GB/s. Two CCXs means 224GB/s *max* if 32B/cycle is the total L3 bandwidth per CCX (which is unlikely anyway, TBF), yet AIDA shows well above this for all tests. So from that I assume that each CCX has its own 32-byte-wide 'port' onto the fabric, and further that the total fabric bandwidth is not, as some are speculating, limited to this. AMD also talk about how scalable IF is.
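That reasoning as a quick check - 3500MHz is the core clock in that AIDA run, and the per-CCX cap is the hypothesis being ruled out:

```python
# If the cores>L3 arrow meant one 32B/cycle link for the *whole* CCX:
cclk_hz = 3.5e9               # core clock in the AIDA run
per_ccx_cap = cclk_hz * 32    # 112 GB/s per CCX under that reading
two_ccx_cap = per_ccx_cap * 2 # 224 GB/s absolute max for the chip

print(per_ccx_cap / 1e9, two_ccx_cap / 1e9)
# AIDA reports L3 bandwidth well above 224 GB/s, so the 32B/cycle
# must apply per core on the core<->L3 side, not per CCX.
```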
So, in conclusion: the 22GB/s figure looks wrong, and it seems unlikely that nodes on the fabric have to compete for bandwidth, provided they're not contending for the same node, of course. Nonetheless, I'd like to see some more detail about the Infinity Fabric itself.
Edit: It's typical that I find this right after posting, but according to this, the fabric is a bi-directional crossbar, so there should be no contention across the fabric itself - though of course there may still be contention for a given port. https://www.reddit.com/r/Amd/comment...ma&sh=a5ac8d75