The IO die is physically much smaller - Anandtech has some measurements.
Yeah, I expected the IO dies to be different; with 14nm it's cheap to do as well. I'm quite surprised how much smaller the AM4 IO die is, but chucking 6 DDR4 channels off does help, along with dropping the extra security stuff.
One good thing with chiplets is that doing 16c is easy, and you don't need to worry about a dedicated 12c design: you can pair 2x chiplets (with 2 defective cores on each die), or you can use those same chiplets in two separate hex-core chips.
Chiplets are quite kind on defective dies.
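To put a rough number on how kind salvage bins are, here is a minimal sketch. It assumes each core on an 8-core chiplet fails independently with the same probability; `P_CORE_BAD` is a made-up illustrative figure, not anything AMD or the foundry has published, and `prob_at_least_good` is a hypothetical helper name.

```python
from math import comb

def prob_at_least_good(cores: int, min_good: int, p_core_bad: float) -> float:
    """Probability that at least `min_good` of `cores` cores work.

    Assumes independent, identical per-core failure (a simplification).
    """
    p_ok = 1.0 - p_core_bad
    return sum(
        comb(cores, k) * p_ok**k * p_core_bad ** (cores - k)
        for k in range(min_good, cores + 1)
    )

P_CORE_BAD = 0.05  # hypothetical per-core defect probability, for illustration only

full = prob_at_least_good(8, 8, P_CORE_BAD)      # only perfect 8-core chiplets are sellable
salvage = prob_at_least_good(8, 6, P_CORE_BAD)   # 6-core and 7-core chiplets are usable too

print(f"usable as 8c only:     {full:.1%}")
print(f"usable as 6c or better: {salvage:.1%}")
```

The point is only the direction of the effect: allowing a 6-core bin means far fewer chiplets get thrown away, whatever the real defect rate turns out to be.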
The more I think about chiplets, the more benefits I see, with very few downsides as long as the dies are small enough. The Bulldozer era made that impossible, but at 7nm it becomes completely sane to do.
Pretty big though. 123mm^2 is about the size of a Coffee Lake quad core with integrated graphics.
Traditionally northbridge style designs are pad limited which is why they started putting integrated graphics on motherboards, to find something to do with the silicon area you get with that many I/O bumps around the outside. So I wonder what they do with the space? Huge cache perhaps, or some graphics, or both.
Aye, there are still plenty of questions over Zen2 - along with what you said, how many cores for a CCX?
But in terms of die size, that alone doesn't necessarily imply manufacturing cost when comparing two different types of IC e.g. I imagine it has far fewer layers and therefore lithography steps vs cores and will be running at a lower clock speed with less power distribution. It's fairly big vs CPU sizes but it's not unreasonable for a kinda-Northbridge IMO. And given they've seemingly hinted towards the possibility of 16C versions, the die presumably has the IF logic for another core chiplet to connect.
They're still being very quiet about details though!
Part of me is wondering if that I/O die will be used for Threadripper 3 - i.e. it's got 4 channels with only 2 in use for Ryzen 3, but it's got the ability to connect 4 chiplets and 4 memory channels. The size suggests otherwise, and I imagine that instead AMD will use the Rome I/O die for Threadripper, as I can't see Threadripper volumes justifying another die's development cost.
Anyone want to place bets on whether the I/O die has some amount of memory (L4 cache) or not? Even at 14nm, does it seem bigger than what's needed just for I/O?
I think it needs enough prefetch buffering to compensate for the SerDes hops that IF uses to cross from cpu die to the I/O and back, so yes probably an L4.
It is also interesting to note that the RX560, on the GF 14nm process, also weighs in at 123mm^2, with 8 PCIe lanes, two memory controllers and a bunch of high speed serial I/O for HDMI & DisplayPort. Now the 560 is more of a square chip to maximise the amount of logic on the die, whereas the I/O chip here is rectangular to expose more edge for I/O. So this chip has fewer logic transistors but more connectivity than a 560, but I'm half expecting the I/O controller to contain a GPU if it has my guesstimate of 2.5B transistors to use up.
That's ... a curious concept. Would there still be space if we assume the IO chip has room for up to 4 chiplets/4 memory channels (and, I guess, 64 PCIe lanes) (so it can be used for Threadripper as well as plain Ryzen)?
Let's see - a full Zeppelin die (on the same 14nm process) is 192mm2 and 4.8Bn transistors. According to https://en.wikichip.org/wiki/amd/mic...en#Scalability a CCX is 44mm2 and 1.4Bn transistors. Two of those is 88mm2 and 2.8Bn transistors, leaving Zeppelin's non-CCX budget (2 memory channels, 32 PCIe lanes, peripheral IO and various IF links) at 104mm2 and 2Bn transistors.
So, speculatively, the Ryzen 2 IO die may have ~19mm2 and 0.5Bn transistors to play with over a Zeppelin uncore. I don't see any way they can cram double the uncore resources into that, and it doesn't sound like much space for a GPU either; it's barely enough for a bit of L4 cache (the L3 in Ryzen is 16mm2 per that source)...
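For anyone who wants to poke at that budget with different inputs, the arithmetic above can be sketched as below. The Zeppelin and CCX figures are the ones quoted from WikiChip in this post, the 123mm2 is the Anandtech measurement mentioned earlier, and the 2.5Bn transistor count for the IO die is only the guesstimate made earlier in the thread, not a confirmed number.

```python
# Figures quoted in the thread (WikiChip for Zeppelin/CCX, Anandtech for the IO die area).
ZEPPELIN_AREA_MM2 = 192.0
ZEPPELIN_TRANSISTORS_BN = 4.8
CCX_AREA_MM2 = 44.0
CCX_TRANSISTORS_BN = 1.4
CCX_PER_ZEPPELIN = 2

IO_DIE_AREA_MM2 = 123.0
IO_DIE_TRANSISTORS_BN = 2.5  # a poster's guesstimate, not a published figure

# Zeppelin "uncore": everything on the die that is not a CCX.
uncore_area = ZEPPELIN_AREA_MM2 - CCX_PER_ZEPPELIN * CCX_AREA_MM2
uncore_transistors = ZEPPELIN_TRANSISTORS_BN - CCX_PER_ZEPPELIN * CCX_TRANSISTORS_BN

# What the new IO die has left over if it carried a Zeppelin-sized uncore.
spare_area = IO_DIE_AREA_MM2 - uncore_area
spare_transistors = IO_DIE_TRANSISTORS_BN - uncore_transistors

print(f"Zeppelin uncore: {uncore_area:.0f} mm2, {uncore_transistors:.1f} Bn transistors")
print(f"IO die headroom: {spare_area:.0f} mm2, {spare_transistors:.1f} Bn transistors")
```

It assumes the uncore shrinks not at all between Zeppelin and the new IO die, which is plausible since both are on 14nm, but is still an assumption.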
As far as Threadripper goes, I did wonder if they might engineer the smaller IO die so you can link them together, and TR will end up being essentially the same as it is now - two Ryzens linked over IF - the only difference would be that you'd be linking two IO chips together rather than two full dies...
I'd gone through that very thought process myself! Likewise, I think it's too small for that though.
It seems a fairly outlandish thing so normally I'd doubt it, but so is a lot of what AMD is doing at the moment so it wouldn't surprise me.
Interesting analysis - add to that you need the on-package interface to connect to the chiplets and you might have used up that 19mm2. I don't think 123mm2 is all that big for what it is - uncore takes up a lot of space on modern CPUs, and don't forget that's one reason for doing this in the first place given its relatively poor scaling.
As I've mentioned - often - in conversations about Ryzen, even going off-CCX on the same silicon has a significant latency penalty. With chiplets, you're talking about the potential for half your L3 cache to be on a different piece of silicon completely. If the only way to access that is a multi-step process via the IO chip, you're going to have absolute killer latency once a thread exhausts its chiplet's cache. Having a mirror L4 cache in the IO chip would help reduce that. And AMD have to minimize those cache latencies if Zen 2 is going to perform well in the real world...
It's worth remembering that there must be silicon in a Zeppelin die that connects the CCXes to each other and the IO/memory controllers etc., and that would all be included in the uncore in my analysis. So that shouldn't really need any extra silicon space, unless the IO chips have a lot more IF connectivity than a Zeppelin...
FTFY, at least i think i have as IIRC each individual CCX is fabricated on a single piece of silicon.
That touches on something I've been thinking about. We know Zen currently shares its L3 cache between all cores and all CCXs, so it seems unlikely that would've changed with Zen 2. That got me thinking about why you'd want or need an L4 cache in the I/O die: if we assume the CCXs are connected directly to each other and sharing their L3 caches (very probable), then what's the advantage of adding L4 to the I/O die?
This may sound nuts, but I thought I'd spitball it with you guys: wouldn't it make more sense to move the separate pieces of shared L3 cache within each CCX to the I/O die? You're not really increasing latencies, as each core's L3 cache has to remain consistent with every other core, both within its own CCX and in others, so there was already a latency penalty in doing that. Moving the L3 into a separate block means you reduce, or completely eliminate, the need to directly connect the CCXs to each other, as you no longer need to keep the data consistent between CCXs; you only need to keep it consistent with the L3 cache on the separate (I/O) die.
That makes an L4 cache in the I/O die seem not only pointless but more complicated than it needs to be, as you now have to keep two caches consistent, and one of those is divided up between CCXs. Thoughts?
Also, a side thought I had: with Zen 2 having a separate I/O die and 1-2 CCXs, could/would it make direct die cooling safer? I know IHSs became a thing because smaller dies increased the risk of chipping the die, but with three separate dies on a single package, isn't that risk reduced by spreading the load? Obviously it would be a right PITA, as you'd have to separate an IHS that's been soldered on and reduce the Z height of the heatsink, but it would be interesting to see how effective direct die cooling would be.
No, no you haven't, and this is why I keep mentioning it.
Each zeppelin die has 2 CCXes. Each CCX has 8MiB of L3 cache. The CPU reports itself to the OS as an 8-core, 16 thread chip with 16MiB of L3 cache, but in reality it's 2 4-core, 8-thread, 8MiB L3 cache chips with a lot of clever, fast interconnects.
Here's the relevant graph from Anandtech's 2nd-gen Ryzen Deep Dive:
See that big jump for Ryzen between 4MiB and 8MiB (n.b. that's a log scale graph, so the jump is actually even bigger than it appears there)? Notice how it happens at exactly the same point as the 8MiB Ryzen 2400G runs out of its 8MiB cache and hits main memory? Notice how @ 8MiB strides the 2700X has roughly the same latency to cache as the I7 8700k has to main memory?
That's the performance issue I talk about. It's nothing to do with going off-silicon, because the 2700X is one piece of silicon. It's all about the latency delay once a CCX fills its 8MiB of L3 cache and has to grab data from another CCX somewhere. That's slow on the same silicon. It's slower to another piece of silicon across a substrate (a la Threadripper), which would be the best case scenario in a multiple chiplet design*. If you had to travel across a substrate to an IO chip, from the IO chip to a second chiplet with the other CCXes on, then back to the original requesting core via the IO chip again, that's likely to be worse than going straight from the IO chip to the main memory.
As someone's said recently (either in this thread or elsewhere on Hexus) AMD have back-engineered from a fully integrated SoC to a packaged northbridge + base CPU. They've even decoupled the memory controller from the cores, so we're right back to - effectively - using FSB. It's basically a mash-up of the Core 2 quads and the early Core i3/i5 with IGPs. It needs amazingly good cache management and interconnects to hide the latency penalties.
* A little note on a chiplet design: if you want to keep cache accesses to other chiplets down to a single transfer across the substrate, you'd need coherent links between all the chiplets. That means each chiplet would need bumps and traces to 8 other chips (1 to the IO chip and 7 to the other chiplets), as well as logic and transport on the silicon itself to manage the access. That strikes me as a lot of extra silicon in each chiplet vs keeping a supplemental L4 cache on the IO chip and keeping the chiplets down to a single link to the IO chip.
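The link counting in that footnote can be sketched as follows, assuming a package with 8 chiplets plus one IO chip (the "1 + 7" case above). The function names are made up for illustration, and this only counts point-to-point links; it says nothing about how wide or fast each one would be.

```python
def links_per_chiplet_star(chiplets: int) -> int:
    """Star topology: each chiplet talks only to the IO chip."""
    return 1

def links_per_chiplet_mesh(chiplets: int) -> int:
    """Fully connected: one link to the IO chip plus one to every other chiplet."""
    return 1 + (chiplets - 1)

def total_links_star(chiplets: int) -> int:
    """Total substrate links when everything routes via the IO chip."""
    return chiplets

def total_links_mesh(chiplets: int) -> int:
    """Total substrate links: IO links plus one per unordered chiplet pair."""
    return chiplets + chiplets * (chiplets - 1) // 2

CHIPLETS = 8  # assumption: the 8-chiplet case implied by "7 other chiplets"

print("per chiplet:", links_per_chiplet_star(CHIPLETS), "vs", links_per_chiplet_mesh(CHIPLETS))
print("on package: ", total_links_star(CHIPLETS), "vs", total_links_mesh(CHIPLETS))
```

So the fully connected option needs 8 links per chiplet and 36 sets of traces in the substrate, against 1 and 8 for the hub-and-spoke layout, which is the trade-off against spending that silicon on an L4 in the IO chip instead.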
Apologies, I keep mixing up my dies and CCXs, don't I.
Not that it matters much, but that Anandtech graph doesn't do a great job of showing what you're talking about IMO. I think PCPer did a better job when they looked at the 1600X and compared an 1800X with an Intel 5960X.
And to be fair, the ping times between cores within a CCX are lower than between cores on an Intel CPU (40ns vs 80ns); it's only when traversing between CCXs that there's a jump up to 140ns.
I think I've got you at it now. It would only need traces in the substrate for two/one die not 8 chips as like you say each die contains 8 cores so their connections are handled within the die.
I was among them (though by no means the only one). Intel have done good latency hiding in the past, and I wouldn't be surprised if AMD had picked up a lesson or two from their own console implementations too.
A possible benefit of northbridge is you get to really tune/push that memory controller, which might help counter a touch of latency as well.
I hadn't seen it mentioned here yet, but did you see what Anandtech said about the power?! They estimate it's nearly twice as energy efficient compared to a 9900K during Cinebench. That's got to be one of the gains by going with mixed die processes - you can keep each in the preferred power-frequency window without having to compromise quite so much.
It is next to main memory which needs to be coherent with the L3 cache, so coherency is a wash as the logic is pretty much there anyway.
Northbridge cache is, I believe, what made the Nvidia chipset motherboards so fast when they came out back in the Athlon era. Usually the L3 cache hides the memory accesses of the memory controllers, but if they are on another piece of silicon then the L3 should be tuned to hide the latency of the fabric to the IO controller. The controller can then have its own cache tuned to hide the memory latency, which also acts as a destination for any prefetchers it might have.