AMD - Zen chitchat

**watercooled** · 08-07-2019, 10:40 PM

The improvement in power efficiency may be more than it first appears, as the X570 boards seem fairly power hungry: https://www.extremetech.com/computin...pu-comparisons

I'm not sure how much of that is down to the chipset alone, as the article claims, but I very much doubt the chipset itself is responsible for dissipating anywhere between the 30-50W seen in the difference between X470 and X570. Maybe the VRMs etc. are partly responsible?

Edit: Hmm, I'm not too sure about their X570 results though. Hexus do whole-system power measurements and they seem to agree with other articles I've read using the 9900K as another reference point. X570 power seems anomalously high in the ExtremeTech numbers?

**scaryjim** · 08-07-2019, 10:49 PM

Originally Posted by badass

I suspect it's just that the x570 die is simply identical to the I/O Die on the Ryzen 3000 Processor ...

I had to go and read the X570 article when I read this: hadn't realised they'd done that.

Presumably this means the IO chip has the capacity to plumb out all those extra USB and SATA ports too - I assume the reason it doesn't is down to pin compatibility with previous desktop generations.

It may also be the reason AMD have publicly stated that Zen 2 APUs won't be chiplet-based: if the IO chip is relatively power-hungry it'd be a very bad fit for a <= 15W processor....

**watercooled** · 08-07-2019, 10:55 PM

Something else I've just noticed (and found interesting): AVX performance seems very good with Zen2, and it doesn't seem to have much of an impact on power consumption. Intel CPUs seem to draw noticeably more power under AVX2 workloads as seen here: https://www.tomshardware.com/reviews...ew,6214-3.html

Bear in mind that the 3700X is *faster* at y-cruncher and on-par for Handbrake vs 9900k as seen on page 12 of that same article.

**scaryjim** · 09-07-2019, 10:22 AM

Wild Threadripper 3000 speculation for your consideration:

We already know that Ryzen 3000 will cover (at least) 6/12 - 16/32 configurations. I'd lay a small amount of my own money on the next gen APU series covering 2/4 - 4/8. I doubt we'll see a Ryzen 3000 CPU with less than 6 cores - the 3400G is already out there @ 4C/8T and I don't think it makes sense to overlay the product lines that much.

With the chiplet design, I'm not sure repurposing Epyc IO dies for TR makes much sense - with Zeppelin all that IO was bundled with the cores so you had to have it in there if you wanted the cores, but now there's just no need to waste half your silicon. So, if we have a specific IO die for TR - quad channel memory, 64 lanes of PCIe 4, presumably - how many chiplets does it take? More than 2, obviously, and I'd imagine at least 4 so they can match the current line up @ 32 cores. But will they stop there?

Also, where will they start? With X570 potentially supporting 3 or 4 NVMe drives (or 4 PCIe 4 x8 slots), is there any need for 16 core Threadripper? Could they start the line up with an 18C variant (i.e. 3 chiplets each with 6 active cores) to differentiate the platforms?

Any thoughts?

**Corky34** · 09-07-2019, 12:59 PM

Are AMD stretching the meaning of maximum boost?

**kalniel** · 09-07-2019, 01:08 PM

AMD seem to be stretching a lot of things in marketing videos these days. Not impressed (by that aspect, they don't need to, their chips are great on their own)

**Xlucine** · 09-07-2019, 10:33 PM

Originally Posted by scaryjim

Wild Threadripper 3000 speculation for your consideration:

We already know that Ryzen 3000 will cover (at least) 6/12 - 16/32 configurations. I'd lay a small amount of my own money on the next gen APU series covering 2/4 - 4/8. I doubt we'll see a Ryzen 3000 CPU with less than 6 cores - the 3400G is already out there @ 4C/8T and I don't think it makes sense to overlay the product lines that much.

With the chiplet design, I'm not sure repurposing Epyc IO dies for TR makes much sense - with Zeppelin all that IO was bundled with the cores so you had to have it in there if you wanted the cores, but now there's just no need to waste half your silicon. So, if we have a specific IO die for TR - quad channel memory, 64 lanes of PCIe 4, presumably - how many chiplets does it take? More than 2, obviously, and I'd imagine at least 4 so they can match the current line up @ 32 cores. But will they stop there?

Also, where will they start? With X570 potentially supporting 3 or 4 NVMe drives (or 4 PCIe 4 x8 slots), is there any need for 16 core Threadripper? Could they start the line up with an 18C variant (i.e. 3 chiplets each with 6 active cores) to differentiate the platforms?

Any thoughts?

With RDNA claimed to give 50% improvements at same power & same configuration, AMD could get massive gains on the 4000 series APUs just by upgrading zen+ to zen2 and GCN to RDNA (and have a tiny chip if it's on 7nm)

zen2 TR might be able to use salvaged epyc IO dies? Even if it's just IO, it's still a massive chip by AMD CPU standards so that might be useful. 32 cores is still fine for the top end, until intel start competing properly - and if it is ~$15 per chiplet (I agree with kompukare's maths) then 16 cores will match the old product stack with the same CPU grunt between top end AM4 and TR while offering insane bandwidth (4 x4 core dies would also help with this) and not costing much

**watercooled** · 10-07-2019, 09:43 PM

Something really interesting to note regarding inter-core latency, something I've not seen any English-speaking sites pick up on, even the ones who were quick to criticise this for Zen1: latency on the multi-die processors is as consistent as single-die variants as all inter-CCX data passes through the IO die, and latency is improved vs Gen1 despite this!

https://www.reddit.com/r/Amd/comment..._data_latency/

Also confirmation/explanation from Robert Hallock: https://twitter.com/Thracks/status/1...316505602?s=19

Not sure what's going on with the Zen+ results from that site though, they seem weirdly high?

Another thing worth noting: at least in the tested processor, it seems a CCX comprises three cores. I wonder if that's how all the 6C/12T processors are configured, with 3C CCXs consistently across the dies? Amongst other benefits, you have even heat loading, bandwidth distribution, etc.

**scaryjim** · 10-07-2019, 11:32 PM

Originally Posted by Xlucine

… 16 cores will match the old product stack with the same CPU grunt between top end AM4 and TR while offering insane bandwidth (4 x4 core dies would also help with this) and not costing much

Originally Posted by watercooled

… Another thing worth noting: at least in the tested processor, it seems a CCX comprises three cores. I wonder if that's how all the 6C/12T processors are configured, with 3C CCXs consistently across the dies? ...

A couple of related points here, that feed off my 18 core TR speculation. Previous gens of Ryzen all followed the evenly spread number of cores per CCX when reducing core count - so 4+4, 3+3, 2+2, etc. That was one of the reasons I speculated we might see 18 core TR: it could be made from 3 salvaged 3+3 core chiplets. Reported yields are really high, so I'm not sure 2+2 core chiplets are going to be common enough to build a product stack around.

I'm pretty confident that the 3900X will be made up entirely of 3+3 core chiplets. We know that IO dies work with fewer chiplets than links, or you *couldn't* do 1 and 2 chiplet AM4 processors. So a sensible TR product stack - to me at least - would use 3 3+3 chiplets for an 18 core product, 4 4+4 chiplets for a 32 core product, then either 3 4+4 chiplets, or 4 3+3 chiplets, for 24 cores. Possibly the latter, as that uses more binned/salvaged parts?
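To make the arithmetic behind that speculation concrete, here is a rough sketch that enumerates the core counts reachable from 3 or 4 chiplets, assuming (as above) that cores are disabled evenly across the two CCXs on each chiplet. The function name and the ranges tried are purely illustrative, not anything AMD has announced.

```python
from collections import defaultdict

CCX_PER_CHIPLET = 2  # each chiplet (CCD) carries two CCXs


def tr_configs(chiplet_counts=(3, 4), cores_per_ccx_options=(3, 4)):
    """Map total core count -> list of (chiplets, cores per CCX) that reach it.

    Assumes symmetric salvage (3+3 or 4+4 per chiplet), as speculated above.
    """
    configs = defaultdict(list)
    for chiplets in chiplet_counts:
        for per_ccx in cores_per_ccx_options:
            total = chiplets * CCX_PER_CHIPLET * per_ccx
            configs[total].append((chiplets, per_ccx))
    return dict(configs)


if __name__ == "__main__":
    for total, ways in sorted(tr_configs().items()):
        print(total, "cores:", ways)
```

Under those assumptions 18 and 32 cores each have exactly one build, while 24 cores can be reached either way (3 x 4+4 or 4 x 3+3), which is the choice discussed above.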

As I say, given the reportedly very high yields I'm just not convinced that it's going to be worth AMD binning to 2+2 chiplets. And they seem to be aiming for higher boost clocks as you go up the stack for Ryzen 3000, so they're only going to want top bin parts going to TR (which, tbf, is also what they've done in previous generations).

It's going to be interesting, anyway. As to IO dies, I can see Xlucine's point about using binned EPYC IO chiplets: since you'll be ditching half the memory channels and PCIe lanes, you could salvage a lot of dies with minor faults to produce functional TR chips (and going that way might let them produce higher core count TR processors, too...)

**Corky34** · 10-07-2019, 11:56 PM

Originally Posted by watercooled

Something really interesting to note regarding inter-core latency, something I've not seen any English-speaking sites pick up on, even the ones who were quick to criticise this for Zen1: latency on the multi-die processors is as consistent as single-die variants as all inter-CCX data passes through the IO die, and latency is improved vs Gen1 despite this!

https://www.reddit.com/r/Amd/comment..._data_latency/

Also confirmation/explanation from Robert Hallock: https://twitter.com/Thracks/status/1...316505602?s=19

Not sure what's going on with the Zen+ results from that site though, they seem weirdly high?

Another thing worth noting: at least in the tested processor, it seems a CCX comprises three cores. I wonder if that's how all the 6C/12T processors are configured, with 3C CCXs consistently across the dies? Amongst other benefits, you have even heat loading, bandwidth distribution, etc.

Have you done what I've done in the past and confused CCXs with the die, or what AMD are now calling CCDs? CCX data should only need to pass through the I/O die if it's headed for a CCX on another CCD (or some other I/O). Each CCD has two CCXs, so technically inter-CCX traffic on the same CCD shouldn't need to go to the I/O die.

Each CCD comprises two CCXs and each CCX consists of four cores, so it's 8 cores per CCD (per die/chiplet). All the 32MB cache SKUs contain a single CCD and all the 64MB ones contain two CCDs. What's more interesting IMO is that lower core count SKUs still come with the same amount of L3.
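As a quick sanity check of those numbers, here is a small sketch. It assumes the 32MB per CCD is split evenly as 16MB per CCX (the post only gives the per-CCD figure) and that L3 is left intact when cores are disabled; `sku_summary` is just an illustrative name.

```python
CCX_PER_CCD = 2
CORES_PER_CCX = 4      # full CCX, before any cores are fused off
L3_MB_PER_CCX = 16     # assumed: 32MB per CCD split evenly over two CCXs


def sku_summary(ccds, active_cores_per_ccx=CORES_PER_CCX):
    """Return (total cores, total L3 in MB) for a part built from `ccds` chiplets."""
    if not 1 <= active_cores_per_ccx <= CORES_PER_CCX:
        raise ValueError("a CCX has between 1 and 4 active cores")
    cores = ccds * CCX_PER_CCD * active_cores_per_ccx
    l3_mb = ccds * CCX_PER_CCD * L3_MB_PER_CCX  # L3 does not shrink with core count
    return cores, l3_mb


print(sku_summary(1))      # (8, 32)
print(sku_summary(1, 3))   # (6, 32)
print(sku_summary(2, 3))   # (12, 64)
```

The 6-core and 8-core single-CCD cases both land on 32MB, which is the "same amount of L3" point made above.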

Lastly it's hard to know why they got those results on Zen+, it could be they're measuring different source and destinations, it could be they're using slower RAM so the data fabric is clocked slower, it could be they're measuring beyond first word access times, or something else, i don't read Russian.

**watercooled** · 11-07-2019, 12:07 AM

Originally Posted by scaryjim

It's going to be interesting, anyway. As to IO dies, I can see Xlucine's point about using binned EPYC IO chiplets: since you'll be ditching half the memory channels and PCIe lanes, you could salvage a lot of dies with minor faults to produce functional TR chips (and going that way might let them produce higher core count TR processors, too...)

I seem to recall reading somewhere that the EPYC IO die was designed to be literally cut in half? It might have been speculation though so don't quote me on that.

Originally Posted by Corky34

Have you done what I've done in the past and confused CCXs with the die, or what AMD are now calling CCDs? CCX data should only need to pass through the I/O die if it's headed for a CCX on another CCD (or some other I/O). Each CCD has two CCXs, so technically inter-CCX traffic on the same CCD shouldn't need to go to the I/O die.

Nope, check the links I posted - what I said is confirmed through an AMD rep and testing.

Directly quoting Robert Hallock (emphasis mine):

Yes. All CCX<->CCX communication traverses the IOD, meaning all CCXes communicate at a common latency. Same for cache. Same for DRAM. From the perspective of the system, this is monolithic die behavior.

A few ns of wire latency notwithstanding.

It sounds counter-intuitive at first but it is indeed what's happening.
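A toy model of the routing described in that quote may help. It assumes a fully enabled two-CCD part with cores numbered CCX by CCX, which is a made-up numbering for illustration and not necessarily how the OS enumerates them.

```python
CORES_PER_CCX = 4
CCX_PER_CCD = 2


def locate(core):
    """Return (ccd, ccx) for a core index, assuming cores are numbered CCX by CCX."""
    ccx_global = core // CORES_PER_CCX
    return ccx_global // CCX_PER_CCD, ccx_global % CCX_PER_CCD


def path(core_a, core_b):
    """Classify the route between two cores per the quoted description."""
    if locate(core_a) == locate(core_b):
        return "within CCX"
    # Any CCX<->CCX hop goes via the IO die, even when both CCXs share a CCD.
    return "via IOD"


print(path(0, 3))    # within CCX
print(path(0, 4))    # via IOD (same CCD, other CCX)
print(path(0, 12))   # via IOD (other CCD)
```

There are only two cases in this model, which is the "common latency" point: leaving the CCX costs the same whether the destination CCX is on the same chiplet or a different one.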

**Corky34** · 11-07-2019, 06:31 AM

Ignore me, think i got myself confused again...

On second thoughts that seems a rather dumb choice if they have done that, why would you send data out of a piece of silicon only to have it immediately return?

I mean i know we're only talking signal transmission time so it's not really about the time but surely doing something like that adds to the complexity, and costs.

**DanceswithUnix** · 11-07-2019, 08:12 AM

Originally Posted by Corky34

I mean i know we're only talking signal transmission time so it's not really about the time but surely doing something like that adds to the complexity, and costs.

Signal transmission costs power and time as you are charging an RC network, but the alternative would be to add another layer/tier to the hierarchy and I'm sure AMD will have done extensive simulations and concluded this was the best answer. Heck the simplest answer would be to make a CCX of 8 cores and again I'm sure AMD would have simulated that but once again a cluster of 4 cores seems to be the best balance.

Wires are pretty cheap though, so I'm sure cost wasn't an issue either way.
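For a rough feel of what "charging an RC network" costs, the textbook first-order figures for a lumped wire with resistance R, capacitance C and supply swing V are below. This is a generic approximation only, not anything specific to AMD's die-to-die links:

```latex
t_{50\%} = RC \ln 2 \approx 0.69\,RC, \qquad
E_{\text{supply}} = C V^{2}
\quad \text{(half stored on the wire, half dissipated in } R\text{)}
```

Both terms grow with wire length, since R and C each scale with it, which is why a longer route costs time and power.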

**kalniel** · 11-07-2019, 09:36 AM

Originally Posted by watercooled

Something really interesting to note regarding inter-core latency, something I've not seen any English-speaking sites pick up on, even the ones who were quick to criticise this for Zen1: latency on the multi-die processors is as consistent as single-die variants as all inter-CCX data passes through the IO die, and latency is improved vs Gen1 despite this!

https://www.reddit.com/r/Amd/comment..._data_latency/

Cool, the process is in line with what I was expecting, but the results are better than I'd hoped for.

Windows knows to bucket fill each CCX too. It would be interesting to know which apps heavily use inter-core data transfer/lookup and how threaded they are - if only 3-4 threads is usual then this is a very nice architecture.
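For anyone unsure what "bucket fill" means here, a minimal sketch: threads are placed one per core, and one CCX is filled completely before the next is touched. This is a toy model with made-up defaults (4 CCXs of 4 cores, SMT ignored), not the actual Windows scheduler logic.

```python
def bucket_fill(n_threads, n_ccx=4, cores_per_ccx=4):
    """Place threads one per core, filling each CCX before touching the next.

    Returns a list with the number of threads placed on each CCX.
    """
    if not 0 <= n_threads <= n_ccx * cores_per_ccx:
        raise ValueError("thread count must fit the cores in this toy model")
    placement = []
    remaining = n_threads
    for _ in range(n_ccx):
        here = min(remaining, cores_per_ccx)  # take as many as this CCX can hold
        placement.append(here)
        remaining -= here
    return placement


print(bucket_fill(3))   # [3, 0, 0, 0]
print(bucket_fill(6))   # [4, 2, 0, 0]
```

With 3-4 threads everything stays inside one CCX, so no inter-CCX hops are needed at all, which is why that thread count would suit this layout so well.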

**Corky34** · 11-07-2019, 09:50 AM

Originally Posted by DanceswithUnix

Signal transmission costs power and time as you are charging an RC network, but the alternative would be to add another layer/tier to the hierarchy and I'm sure AMD will have done extensive simulations and concluded this was the best answer. Heck the simplest answer would be to make a CCX of 8 cores and again I'm sure AMD would have simulated that but once again a cluster of 4 cores seems to be the best balance.

Wires are pretty cheap though, so I'm sure cost wasn't an issue either way.

For sure, they obviously know better than me but i would've thought, logically speaking, that allowing each cluster of four cores to speak directly to the other 4 cores on the die would've made more sense, and having a single entry/exit point (SerDes) for the entire die would've been more logical...although having thought about the single SerDes thing maybe that's why they chose the more convoluted route, as having two SerDes on each die, one for each CCX, probably means less chance of a bottleneck vs the small hit in latency.

**kalniel** · 11-07-2019, 10:07 AM

Originally Posted by Corky34

For sure, they obviously know better than me but i would've thought, logically speaking, that allowing each cluster of four cores to speak directly to the other 4 cores on the die would've made more sense, and having a single entry/exit point (SerDes) for the entire die would've been more logical...although having thought about the single SerDes thing maybe that's why they chose the more convoluted route, as having two SerDes on each die, one for each CCX, probably means less chance of a bottleneck vs the small hit in latency.

They'd need to make space for another interface then on each CCX and you'd add latency between your CCX and the IOD.
