HPE users: patch our SAS SSDs to quash permanent crash bug

**HEXUS** · 27-11-2019, 02:11 PM

Users should update the firmware to prevent a crash bug that occurs after 32,768 hours of use.

**DanceswithUnix** · 27-11-2019, 02:17 PM

Hopefully the patch doesn't make it fail at 65536 hours of use (which would be outside warranty)

edit: Ooh, I speculated an overflow as soon as I saw the 32768 number, guess that makes me an expert!

**Tabbykatze** · 27-11-2019, 03:05 PM

Originally Posted by DanceswithUnix

Hopefully the patch doesn't make it fail at 65536 hours of use (which would be outside warranty)

edit: Ooh, I speculated an overflow as soon as I saw the 32768 number, guess that makes me an expert!

That was immediately my first thought too: this is just a simple maximum-value overflow bug!

Such a silly bug to have in 2019 xD

**DanceswithUnix** · 27-11-2019, 03:45 PM

Originally Posted by Tabbykatze

That was immediately my first thought too: this is just a simple maximum-value overflow bug!

Such a silly bug to have in 2019 xD

It isn't usually the overflow that directly kills your code though, it is usually some secondary effect like using the resulting -32768 value from the overflow to search/index into a table which doesn't have any entries suitable for negative numbers. Given that power on hours isn't usually considered that important a metric I can imagine it not being that heavily tested either.

OTOH, if it was something like using the top bit as a debug flag then someone needs to be taken out and shot
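To make the secondary-effect idea concrete, here is a minimal Python sketch. It is purely illustrative: the `WEAR_BUCKETS` table and `bucket_for` function are made up for this example and have nothing to do with the actual drive firmware. `to_int16` just mimics how a two's-complement signed 16-bit value wraps.

```python
def to_int16(value: int) -> int:
    """Wrap an integer the way a two's-complement signed 16-bit register would."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


# Hypothetical lookup table keyed by age; not taken from any real firmware.
WEAR_BUCKETS = ["new", "mid-life"]


def bucket_for(hours: int) -> str:
    """Pick a table entry from power-on hours; the table has no negative entries."""
    index = hours // 16384
    if not 0 <= index < len(WEAR_BUCKETS):
        raise IndexError(f"no bucket for {hours} hours")
    return WEAR_BUCKETS[index]


hours = to_int16(32767)
print(hours, bucket_for(hours))   # last value that still fits in 15 bits

hours = to_int16(hours + 1)       # one more hour: the sign bit flips
print(hours)                      # -32768

try:
    bucket_for(hours)
except IndexError as exc:
    print("lookup failed:", exc)  # the secondary effect, not the overflow itself
```

The wrap on its own is harmless; it is the lookup afterwards that falls over. Firmware without the range check would read whatever happens to sit before the table instead of raising an error.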

**Tabbykatze** · 27-11-2019, 04:22 PM

Originally Posted by DanceswithUnix

It isn't usually the overflow that directly kills your code though, it is usually some secondary effect like using the resulting -32768 value from the overflow to search/index into a table which doesn't have any entries suitable for negative numbers. Given that power on hours isn't usually considered that important a metric I can imagine it not being that heavily tested either.

OTOH, if it was something like using the top bit as a debug flag then someone needs to be taken out and shot

Ha ha, chemical sheds and the ditches!

It is very interesting that the drive becomes completely inoperable and unrecoverable when this value is hit, which definitely fits your logic of a secondary effect. Maybe the time is used in a SMART calculation, SMART crashes, and it takes the controller with it?

Edit: to qualify my thought, the flipped bit would make a negative time so the calculations, if uncaught, will just drop out of range. Why they're counting time using a signed 16-bit integer is a little bit odd...

**DanceswithUnix** · 27-11-2019, 05:36 PM

Originally Posted by Tabbykatze

Why they're counting time using a signed 16-bit integer is a little bit odd...

Thinking about it, there is a good chance they aren't, and this isn't an overflow...

Imagine you store that value in a word of flash, then every hour you erase the page it is in and re-write it with the new value one higher. That's 65535 writes to a page just to store one thing, where a page has an endurance in modern flash devices of about 3000 writes. Just to count.

Now imagine you choose a 4KB page of flash, that's 32768 bits in total. On first ever power up you erase the page so all the bits are 1s. Every hour, you clear one bit. Flash is written by erasing an entire page of bytes to all 1 bits (as in each byte 0xff) and then clearing the bits you want cleared to get the value you wanted stored. So you can actually zero a bit in flash at any time without erasing it first (flash programming fun fact!), you only need to erase to flip a zero into a one. Now you get 3.7 years of counting hours (32768 hours) before you have to erase to count the next 3.7 years, so your 3000 erase endurance gets you about 11000 years of counting. Handling the 3.7 year boundary would take some careful testing though (how many cycles you had been through being stored elsewhere).

That's probably how I would do it anyway, and given storage devices use a 4K filesystem page that fits nicely.
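A rough Python simulation of that bit-clearing counter, as a sketch only. The class name `FlashHourCounter`, the in-RAM `cleared` cache, and keeping `erase_cycles` as a plain attribute are all assumptions for illustration; real firmware would have to persist the cycle count somewhere else and survive power loss mid-update.

```python
PAGE_BYTES = 4096
PAGE_BITS = PAGE_BYTES * 8  # 32768 bits, one per power-on hour


class FlashHourCounter:
    """Simulated hour counter that clears one flash bit per hour."""

    def __init__(self) -> None:
        self.page = bytearray([0xFF] * PAGE_BYTES)  # erased flash reads as all 1s
        self.erase_cycles = 0  # assumed to be persisted elsewhere
        self.cleared = 0       # RAM cache of how many bits are already 0

    def recount(self) -> int:
        """Rebuild the cache from the page, as firmware might at power-up."""
        self.cleared = sum(8 - bin(b).count("1") for b in self.page)
        return self.cleared

    def tick(self) -> None:
        """Record one more hour by clearing the next bit (no erase needed)."""
        byte, bit = divmod(self.cleared, 8)
        self.page[byte] &= ~(1 << bit) & 0xFF  # programming can only clear bits
        self.cleared += 1
        if self.cleared == PAGE_BITS:
            # The awkward boundary: erase the page and bump the cycle count.
            self.page[:] = bytes([0xFF]) * PAGE_BYTES
            self.erase_cycles += 1
            self.cleared = 0

    @property
    def hours(self) -> int:
        return self.erase_cycles * PAGE_BITS + self.cleared


counter = FlashHourCounter()
for _ in range(PAGE_BITS):
    counter.tick()
print(counter.hours, counter.erase_cycles)  # 32768 1
```

Everything interesting happens in that `if self.cleared == PAGE_BITS` branch, which only runs once every 32768 hours, so it is exactly the kind of path that could ship without much testing.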

Hmm, so now I don't think it is an overflow. Will have to hand my expert title back

**QuorTek** · 27-11-2019, 06:10 PM

someone found out, they had to do something about it... simple as that

**philehidiot** · 27-11-2019, 07:51 PM

Why don't they just build the dam higher? Or stop putting water in the drive full stop? Sounds a bit silly to me.
