Just a heads up to people. I've discovered some rather disturbing issues regarding laptops with Nvidia GPUs. Some of you may have seen their 200 million hit to profits. Well, here's what I've dug up so far. I suspect that some of you have laptops with these chips in them. Certainly worth looking at if yours is playing up.
It could potentially be ALL G84- and G86-based chips (GeForce 8x00M "mobile" parts and 8x00 graphics cards).
I have posted there to see if Steve can find out more.
Originally Posted by mercyground
From the other thread.
It's not just the Inq having a go. There's a class action suit against HP. Dell are replacing motherboards. There is something definitely not good in green land.
Originally Posted by mercyground
Shamelessly copied from the other thread. I had no idea that [H] didn't know about this issue. I've been looking into it as my other half's lappy is throwing odd bluescreens. It started with the Nvidia drivers and now it's getting worse and bluescreening at random. It is dumping out rather a lot of heat too. I've contacted Acer and am awaiting their response on the issue. (Her lappy is a 5920G.)
G84-400/403/405 a.k.a. GeForce 8600GTS
G84-300/303/305 a.k.a. GeForce 8600GT
G86-300/303/305 a.k.a. GeForce 8500GT
G86-303/305 a.k.a. GeForce 8300GTis
There are 3 revisions of the chips apparently. The last revision, A03, bumped the core voltage up from 1.3V to 1.375V. That is about 6% more voltage, or roughly 12% more power if power scales with the square of the voltage.
The original thermal envelope for these chips was 20W. They are actually 25W parts.
Nvidia stuffed the OEMs by promising ~20W parts but delivering 25W ones.
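For anyone who wants to check the numbers, here is a quick sketch in Python. It assumes dynamic power scales with the square of the voltage at a fixed clock, which is a common rule of thumb and not something Nvidia or the article states. The function names are just made up for illustration.

```python
# Back-of-the-envelope arithmetic only.
# ASSUMPTION: dynamic power ~ V^2 at a fixed clock (rule of thumb, not from the thread).

def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new / old - 1.0) * 100.0

def power_pct_from_voltage(v_old: float, v_new: float) -> float:
    """Estimated percentage rise in dynamic power under the V^2 assumption."""
    return ((v_new / v_old) ** 2 - 1.0) * 100.0

voltage_rise = pct_increase(1.3, 1.375)          # voltage bump reported for A03
power_rise = power_pct_from_voltage(1.3, 1.375)  # estimate from V^2 scaling
budget_overrun = pct_increase(20.0, 25.0)        # promised 20W vs delivered 25W

print(f"voltage: +{voltage_rise:.1f}%")
print(f"power (V^2 estimate): +{power_rise:.1f}%")
print(f"over thermal budget: +{budget_overrun:.0f}%")
```

So the "12%" figure lines up with the estimated power rise rather than the voltage rise, and a 25W part in a cooling design built for 20W is 25% over budget.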
It's a horribly messy and not nice situation. Nvidia is blaming a faulty bonding process, the OEMs, etc. Basically anyone but themselves. This makes people unhappy.
It's thermal cracking. The hot/cold cycles eventually crack the chip off the motherboard. It's basically a cooling issue.
It's all chips. It's just not showing up on PC cards as they cycle less frequently. Laptops get turned on and off a lot more.
Nothing went really wrong. Nvidia are just pinning the blame on anyone but themselves. It's purely down to a hot chip in a tight space with insufficient cooling. They promised a 20W envelope and the OEMs designed for that... then Nvidia dropped a 25W chip into it.
- More here.
The short story is that all the G84 and G86 parts are bad. Period. No exceptions. All of them, mobile and desktop, use the exact same ASIC, so expect them to go south in inordinate numbers as well. There are caveats however, and we will detail those in a bit.
Both of these ASICs have a rather terminal problem with unnamed substrate or bumping material, and it is heat related. If you ask Nvidia officially, you will get no reason why this happened, and no list of parts affected, we tried. Unofficially, they will blame everyone under the sun, and trash their suppliers in very colourful language.
The official story is that it was a batch of end-of-life parts that used a different bonding/substrate process for only that batch. Once again, the trusty INQUIRER bull**** detectors went off so loudly that the phone almost vibrated out of my hand. More than enough people tell us both the G84 and G86 use the same ASIC across the board, and no changes were made during their lives.
When the process engineers pinged by the INQ picked themselves off the floor from laughing, they politely said that there is about zero chance that NV would change the assembly process or material set for a batch, much less an EOL part.
The other problem is the long tail. Failures occur due to heat cycling, cold -> hot -> cold for the non-engineers out there. If you remember, we said all G84s and G86s are affected, and all are the same ASIC, so why aren't the desktop parts dying? They are, you are just low enough on the bell curve that you don't see it in numbers that set off alarm bells publicly yet.
Laptops get turned on and off many times in a day, and due to the power management, throttle down much more than desktops. This has them going through the heat cycle multiple times in a day, whereas desktops typically get turned on and off once a day, sometimes left on for weeks at a time. Failures like this are typically on a bell curve, so they start out slow, build up, then tail off.
If you look at the HP page, the prophylactic fix they offer is to more or less run the fan all the time. Once again, for the non-engineers out there, fan running eats a lot of power, so this destroys the battery life of notebooks. Basically, people bought a machine with a battery life of X, and now it is Y to prevent meltdown from a bum part. It doesn't fix anything, it just makes the failures take longer, hopefully past the warranty period, at a huge battery life cost. Fire up your class actions people, you got shafted.
Back to the engineering, we intoned that this was a cover-up of engineering failures by Nvidia. We also said that they probably knew what was happening. Think we were kidding? Read this, twice, linked again here for those that can't move their mouse to the left, it is that important.
If we knew a year and change ago that these exact parts had heat problems, think Nvidia did? Think the voltage difference between A02 and A03 is coincidence? This is a classic example of not meeting engineering goals and overclocking through brute force (voltage bump in engineering terms) to compensate.
HP and the others were blindsided by this, it happened far too late in the design cycle to compensate, and it looks to have been covered up hastily, badly, and eventually fatally. Blaming suppliers, OEMs and users is completely unfounded and says that NV is unwilling to properly address this issue, only hide from it. NV knew, they made silicon changes to fix another problem that directly led to this problem.
Wrong Chips...C51 is the culprit
The GPUs aforementioned are not the issue here.
It is the mobile C51 chip that gets overheated. There is no fan (overheat protection) on the chipset and there is no thermal diode that shuts down the PC in the bios.
The C51 PCIe interface overheats easily. That's why the GPUs and the wireless cards are affected (PCIe interface)...
Also, let it be known that it is not due to the packaging material set. It is definitely a chip/system design issue.
posted by : Lolento, 10 July 2008
Now this is interesting as we've started noticing her lappy drop off the wireless when she is on the sofa or in the bedroom.
Raising the lappy off the table to let more air get underneath seems to help a lot.
I was having issues with the graphics and the wireless. The wireless worked intermittently and finally gave up.
I raised our laptop off the table and propped it up. After 10 mins or so I disabled and re-enabled the wireless, and it instantly saw the network and connected without issue.
Signs it's going fubar? It loses the wireless. Then a rescan shows an UNNAMED wireless connection and asks you to provide the SSID. A bit longer and it loses wireless totally and says it can't find any wireless signals.
Now I didn't reboot or do anything other than raise the lappy off the table about an inch. (Two Star Wars novels under it. The Han Solo series.)
Definite heat issues. Now if this solves the problem with her gfx as well, then we know that the damn thing gets too hot. Question is whether we can get the lappy replaced or fixed. If they come back to me with "Run the fan constantly" then I'll belt 'em and we'll be getting our money back.
Irritatingly it's been fine for ages and now shows signs it's going bad. Bluescreens getting worse. Wireless crapping out. We've only noticed this recently as I got a new router with wireless (stops me killing myself going to the loo in the night and tripping over the damn cable otherwise).