
Thread: Nvidia GeForce RTX 2080 Ti and RTX 2080

  1. #81
    Senior Member
    Join Date
    Dec 2013
    Posts
    3,526
    Thanks
    504
    Thanked
    468 times in 326 posts

    Re: Nvidia GeForce RTX 2080 Ti and RTX 2080

    Quote Originally Posted by DanceswithUnix View Post
    As for width, I got the impression that pre-Volta the ALU was mixed fp32+fp64+int but now the integer processing is stripped out into its own execution unit so address calculations can be done in parallel with the floating point. ISTR that fp16 was removed from gpus early on, around Riva TNT, and no-one missed it until now. I think in Vega AMD put the data type back in, but a dedicated fp16 matrix multiply-accumulate would seem sensible.
    Yea it was, however I suspect that the split into individual FP and INT units is done at the factory, possibly via firmware, as it doesn't seem logical to take what used to be a mixed-precision unit and redesign it to do only INT or FP. From a failure POV it seems more logical to decide whether a unit is going to do FP or INT after fabrication.

    The FP/INT 32/64 part AFAIK appears to be something that's fixed at design and fabrication time. The width of a unit is a physical thing, and ideally you want the width of the unit to match the width of the data going through it; while a mixed-precision INT/FP64 unit can do INT/FP32 work, it's a waste of silicon and power.

    Quote Originally Posted by DanceswithUnix View Post
    The bit I noticed from that pdf was the huge matrix size required to hit high performance, when AIUI most AI just uses 4x4 multiplies.
    It does, but even a 4x4 grid (AFAIK a grid can also do +, -, and / on a per-grid basis) consists of 16 individual 2-digit numbers (32 bits), or any variation that results in either 16 or 32 bits (there's also 64-bit but that's more for the professional cards). At least that's my understanding and I'd welcome input from someone with more knowledge of Tensor programming.
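
    To make that concrete, here's roughly what I understand a single 4x4 multiply-accumulate to be (a plain scalar sketch, with the fp16 inputs written as float for readability):

    Code:
    // Scalar reference for one 4x4 multiply-accumulate: D = A*B + C.
    // Inputs would be fp16 on the hardware; plain float is used here
    // for readability. 16 output elements x 4 multiplies each = 64 FMAs.
    void mma4x4(const float A[4][4], const float B[4][4],
                const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];          // accumulate into the destination
                for (int k = 0; k < 4; ++k)
                    acc += A[i][k] * B[k][j]; // 4 multiplies per output element
                D[i][j] = acc;
            }
    }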

  2. #82
    Senior Member
    Join Date
    Aug 2009
    Location
    UK
    Posts
    431
    Thanks
    20
    Thanked
    33 times in 27 posts
    • Jace007's system
      • CPU:
      • Intel i7 7700k
      • Memory:
      • 16GB
      • Storage:
      • 500GB SSD
      • Graphics card(s):
      • nVidia 1080
      • PSU:
      • EVGA 750w
      • Operating System:
      • WinLOW

    Re: Nvidia GeForce RTX 2080 Ti and RTX 2080

    Nope, I think I won't buy a GPU for at least 3 years; just got myself a PS4. Gaming on PC is expensive, trying to keep up with the latest tech only for a handful of really decent AAA games to show up after waiting years. Thanks but no thanks, I'm out.

  3. #83
    root Member DanceswithUnix's Avatar
    Join Date
    Jan 2006
    Location
    In the middle of a core dump
    Posts
    12,986
    Thanks
    781
    Thanked
    1,588 times in 1,343 posts
    • DanceswithUnix's system
      • Motherboard:
      • Asus X470-PRO
      • CPU:
      • 5900X
      • Memory:
      • 32GB 3200MHz ECC
      • Storage:
      • 2TB Linux, 2TB Games (Win 10)
      • Graphics card(s):
      • Asus Strix RX Vega 56
      • PSU:
      • 650W Corsair TX
      • Case:
      • Antec 300
      • Operating System:
      • Fedora 39 + Win 10 Pro 64 (yuk)
      • Monitor(s):
      • Benq XL2730Z 1440p + Iiyama 27" 1440p
      • Internet:
      • Zen 900Mb/900Mb (CityFibre FttP)

    Re: Nvidia GeForce RTX 2080 Ti and RTX 2080

    Quote Originally Posted by Corky34 View Post
    Yea it was, however I suspect that the split into individual FP and INT units is done at the factory, possibly via firmware, as it doesn't seem logical to take what used to be a mixed-precision unit and redesign it to do only INT or FP. From a failure POV it seems more logical to decide whether a unit is going to do FP or INT after fabrication.
    You would never disable that at manufacture, they are designed like that. Or to put it another way, FP circuitry is big, so you wouldn't want it sitting idle with only an int part enabled.

    The int and fp logic are not the same, but by combining them you save instruction decoding logic etc. So they have decided to spend a few transistors to take advantage of int instructions being lower latency, rather than having them go down the same long pipe as the floating point ops. Sounds like a marginal improvement at best, but I'm sure simulations would have shown an improvement for them to make the change.
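
    As a rough illustration (a hypothetical kernel, nothing from the whitepaper): in any memory-bound kernel the index and address arithmetic is integer work, which on Volta/Turing can issue alongside the FP32 FMAs instead of queueing behind them:

    Code:
    // Hypothetical CUDA kernel to illustrate the split: the index
    // arithmetic below is INT-pipe work, the fused multiply-add is
    // FP32-pipe work. Pre-Volta both competed for the same mixed ALU;
    // on Volta/Turing they can issue concurrently.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // INT pipe
        for (; i < n; i += gridDim.x * blockDim.x) {     // INT pipe
            y[i] = a * x[i] + y[i];                      // FP32 pipe (FMA)
        }
    }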

    Quote Originally Posted by Corky34 View Post
    The FP/INT 32/64 part AFAIK appears to be something that's fixed at design and fabrication time. The width of a unit is a physical thing, and ideally you want the width of the unit to match the width of the data going through it; while a mixed-precision INT/FP64 unit can do INT/FP32 work, it's a waste of silicon and power.

    It does, but even a 4x4 grid (AFAIK a grid can also do +, -, and / on a per-grid basis) consists of 16 individual 2-digit numbers (32 bits), or any variation that results in either 16 or 32 bits (there's also 64-bit but that's more for the professional cards). At least that's my understanding and I'd welcome input from someone with more knowledge of Tensor programming.
    I believe the width of the unit is 64 bits. You do a pair of 32 bit operations, or 4 lots of 16 bit operations. Hence in the fully 64 bit enabled Volta parts they can do fp64 at half the rate of fp32.
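
    FWIW CUDA exposes the packed 16-bit case directly: a __half2 holds two fp16 values in one 32-bit register, and a single instruction operates on both lanes (minimal sketch, needs sm_53 or later):

    Code:
    #include <cuda_fp16.h>

    // Two fp16 multiplies per 32-bit instruction: each __half2 packs
    // a pair of fp16 values, and __hmul2 multiplies both lanes at once.
    __global__ void mul2(const __half2 *a, const __half2 *b,
                         __half2 *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __hmul2(a[i], b[i]);   // lanes 0 and 1 in parallel
    }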

    The tensor calculation is a simple 4x4 multiply with accumulate, so you are performing 16 lots of 4 multiplies but instead of storing the result you add them into the destination with a choice of using fp16 or fp32 for the result.
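
    In CUDA that's exposed through the WMMA intrinsics, albeit at 16x16x16 tile granularity rather than the raw 4x4 hardware step; something like this (minimal sketch, needs sm_70 or later):

    Code:
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C on 16x16 fp16 tiles with fp32
    // accumulation; the tensor cores perform the underlying 4x4
    // multiply-accumulate steps.
    __global__ void wmma_16x16(const half *A, const half *B, float *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);       // start the accumulator at 0
        wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C
        wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
    }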

  4. #84
    Senior Member
    Join Date
    Dec 2013
    Posts
    3,526
    Thanks
    504
    Thanked
    468 times in 326 posts

    Re: Nvidia GeForce RTX 2080 Ti and RTX 2080

    Quote Originally Posted by DanceswithUnix View Post
    You would never disable that at manufacture, they are designed like that. Or to put it another way, FP circuitry is big, so you wouldn't want it sitting idle with only an int part enabled.

    The int and fp logic are not the same, but by combining them you save instruction decoding logic etc. So they have decided to spend a few transistors to take advantage of int instructions being lower latency, rather than having them go down the same long pipe as the floating point ops. Sounds like a marginal improvement at best, but I'm sure simulations would have shown an improvement for them to make the change.
    So when I read that a CUDA core executes a floating point or integer instruction per clock for a thread, that means Nvidia used to include both FP and INT ALUs in each CUDA 'core' (at least in the notional definition of a CUDA 'core')? In other words a CUDA 'core' could only perform one or the other type per clock for a thread.

    Since Volta they still have those same separate FP & INT units, but because of changes in the way the registers and cache work they can now address both units concurrently? In essence, for each clock the notional idea of a CUDA 'core' can now run two threads, one for FP and another for INT?

    Quote Originally Posted by DanceswithUnix View Post
    I believe the width of the unit is 64 bits. You do a pair of 32 bit operations, or 4 lots of 16 bit operations. Hence in the fully 64 bit enabled Volta parts they can do fp64 at half the rate of fp32.
    You mean the width of the notional idea of what a Tensor 'core' is? If so that does seem to make more sense, as that would mean all those FP64 'cores' on professional cards have been repurposed as Tensor 'cores' on consumer cards.

    Quote Originally Posted by DanceswithUnix View Post
    The tensor calculation is a simple 4x4 multiply with accumulate, so you are performing 16 lots of 4 multiplies but instead of storing the result you add them into the destination with a choice of using fp16 or fp32 for the result.
    AFAIK the type (multiply, addition, square root, subtraction) and size (2x2 8-bit, 4x4 4-bit, 8x8 2-bit) can vary as long as each individual TensorRT kernel is constructed from the same type. So you could have one TensorRT kernel doing subtractions on a 2x2 grid constructed out of 4 four-digit numbers, another TensorRT kernel doing multiplication on a 4x4 grid of 16 two-digit numbers, and any variation thereof that fits into a 16 or 32-bit data structure; each of those 16/32-bit TensorRT kernels goes to make up the NN. At least that's my current understanding.

  5. #85
    Senior Member
    Join Date
    Aug 2017
    Posts
    278
    Thanks
    10
    Thanked
    28 times in 19 posts

    Re: Nvidia GeForce RTX 2080 Ti and RTX 2080

    I don't see any Hairworks comparisons. Have Nvidia dropped it?
