ISP PlusNet Accidentally Deletes 700GB of Email

**Tobeman** · 04-08-2006, 05:48 PM

http://usertools.plus.net/status/archive/1154603560.htm

..12gb of unread email lost, erk

**jamin** · 04-08-2006, 09:06 PM

Heard about this! They have been working on restoring the lost mail for a while, but dont hold out much hope!

I can imagine the scene in their data centre:

Tech 1: Dave, have you finished optimising the exchange cluster?

Tech 2: Optimise, thats the same as format right?

Tech 1: Er..... no! Whats the Jobcentre number? I feel a career change coming up!

**atrull** · 04-08-2006, 09:29 PM

seriously doubt they use exchange

**jamin** · 04-08-2006, 11:55 PM

Everyone knows exchange, fit the purpose for the joke!

**chuckskull** · 05-08-2006, 01:02 AM

I get the feeling somone is still having the piss taken out of them daily for that **** up.

Heh theres a link to the explanation of what happened, but you have to be a user

**BigBry** · 05-08-2006, 07:31 AM

In case you're interested the original detailed explanation of the problem is here dated July 12th. It was included on the PlusNet User Group forum which you have to be signed up to view. It's reproduced so you can read the history and attempts they appear to have made (there are many smaller updates inbetween);

What follows is a fairly honest account of the current email situation, how we got here, and where we are going. It has been put together with my colleagues in Networks, Ross Bray (Operations Mangager), Phil Webb (Senior Network Architect) and Kelly Dorset (Infrastructure Manager). It makes for a bit of a long read, but I hope you find it interesting and useful.

Some of these events are already public knowledge and some are not, but we in PlusNet’s Network Services department felt that now is an appropriate time to let you know our take on what has happened with the email platform.

So, starting at the very beginning, due to increased customer numbers and the growing volume of email, we have been working on a project to upgrade the current mail storage from a clustered Network Appliance platform to a Sun 5300 NAS split over two sites, at a cost of £170,000. The new platform provides us with 2TB of storage, with huge scaling potential to accommodate future growth.

During the week beginning 26th, we started copying data, i.e. your inboxes and emails, onto the new platform. Due to the amount of data involved and the staged approach that we felt was appropriate this process was scheduled to complete on the 10th July. On Wednesday the 5th, an issue was identified with the speed and performance of the new Sun system. Customers were noticing a delay in their ability to login and collect email and a P1 ticket was raised. The issue worsened, as customers who had been migrated to the new platform found email to unusable.

We immediately started to move customer’s email back to the Network Appliance platform, to alleviate some of the loading issues that we were seeing, while we continued investigating the problems. Of course the volume of data was large, and the platform was underperforming, we always knew this would be a lengthy process. A high priority case was opened with the vendor and after initial investigation Sun saw no configuration issues and started looking for known bugs and hardware issues.

By Wednesday evening, we had implemented some Customer experience monitoring graphs that measured to length of time it took to login and receive an identical mailbox on both the new and the old storage platform. This allowed us to measure the success of any changes that were made while trying to resolve the issue. As more and more data was copied from the new to the old platform, we could see a definite improvement in response time. That improvement continued throughout Thursday while our support team and the Vendor continued to troubleshoot the issue.

By Friday, we had two recommendations from Sun that we had decided to implement:

1. Update the firmware on the boxes to the latest version. Although no specific bug had been identified, it was felt that this release was the latest and greatest and we needed to eliminate the firmware from our list of possible causes.
2. Make changes to some default parameters that would help us tune the platform for better performance.

A Sun engineer was dispatched to perform the upgrade, and arrived onsite on Friday evening. To perform the upgrade it was necessary to break the mirroring between the Master site and the redundant back up. The upgrade and configuration change went without hitch and the mirroring was re-established. However due to the length of time that the mirror was broken, it was going to take a significant amount of time to synchronise the data. It was decided at this point, that we could do no more than monitor the situation until such time as the data was synchronised across both sites.

During Saturday we continued to monitor, and the new platform was performing better then the old. However the resynchronisation was not progressing as fast as we had assumed it would. So on Sunday morning, an engineer involved in the issue, made the decision to try and improve the speed of the re-synch. The mirroring was stopped and a change was made to the platform to increase the speed of the synch. As part of re-establishing the mirror, the old mirror on the Backup, had to be deleted. As has already been stated in our Service Status announcements, the engineer made a mistake and deleted the wrong volume. This resulted in the loss of 700GB of customer inboxes and emails. The impact of this was that not only was half of the customer base’s email was lost; customers were unable to log in to the system, as their mailboxes no longer existed on the system.

The engineer, realising his error, immediately escalated the issue to the Duty Manager, who immediately escalated the issue to the relevant Director. A P1 ticket was opened and a case was opened with Sun to find out what options were available for data recovery.

A conference call was established, during which the engineer involved could only identify VISP customers as being effected; this was subsequently discovered to be incorrect as the issue also affected PlusNet customers. During the conference call, it became apparent that no standard tools could recover the data and that we would need to engage the services of a specialist data recovery company.

During the morning, the issue was raised to the highest levels of management within PlusNet and key technical staff were brought in to assist in the incident. The initial time estimates that were provided proved unrealistic due to the severity of the loss and the work involved in recreating mailboxes. In parallel to the recreation of mailboxes, the Sun NAS was un-racked, packed, and shipped to the Data Recovery specialists for analysis.

Early in the afternoon, all available technical resource, including systems and development staff, were engaged on the issue, and at this point plans were made to recover reinstate mailboxes to allow customers to send and receive email. This was prioritised to fix business customers first followed by out residential customers. This was necessary because the actual processing effort involved in recreating accounts, was going to take a significant number of hours.

Out aim was to restore service to business customers by 09:00 on Monday morning, this was achieved by 11:00 on Monday. About half of our residential customer base was also restored at this time, the remaining half had service restored by 18:00, however there were a limited number of customers who still couldn’t access their email, as here were some permissions issues that had to be resolved. These were resolved by Tuesday morning.

The latest update from the Data recovery people is that they can see data and directory structures on the disks, however, whilst most File Systems are very similar, there are always slight differences. With this system being relatively new to market, they are in the process of modifying their data recovery tools to restore the data to us. Due to the level of checking and verification of the tools that need to be done, it is likely to take a few more days for the data to be restored to us. We are in regular contact, and we have plans in place, so that as soon as the data becomes available, we will start restoring the data to you.

We are of course very aware of the critical nature of email to our customers, and are very sorry that there has been such a catastrophic disruption to this service, however, please be assured that we do have processes in place that shouldn’t have allowed this happen. These processes are all under review and we will provide the detail to you on any changes that are made as a result of this review. One point that we already know that we can learn from is that for major change s to major systems that could possibly affect customer’s service, there needs to be a verification stage before those changes can be rolled to live. This will be added to the change control process with some specific supporting documentation to ensure that everyone understands which systems are considered major.

For your information, attached to this post is the current change control process so that you can see the kind of rules that we are all meant to work to, to prevent this kind of issue occurring.

Added to this, a full investigation is being conducted by a non-network services manager and HR into the causes of Sunday’s failure, with the remit of discovering exactly what happened, any processes that failed, and whether any disciplinary action is required.

**BigBry** · 05-08-2006, 07:34 AM

The follow up response admitting it wasn't going to work was posted on August 3rd (with a separate email that made it clear that was the end of their recovery attempts;

Firstly you need to understand that in an attempt to recover the data swiftly, the engineer who deleted the 3 volumes in the first place swiftly followed up his error by immediately trying to create a volume of the same size as the 1st of the volumes in the same place. This is an old sysadm "trick" that on some file systems could have revealed the lost data, however in this case, it did not work, and in fact caused us more problems, as you will see later on in this account.

Within 2 hours of the data being deleted, a data recovery company had been contacted and within 3 hours the NAS was in transit to them. By 14:00 on that day the specialists were racking the NAS and began the process of copying all the 1's and 0's from our equipment to their own. This is standard operating procedure for anyone working in the field of data recovery, and is simply about ensuring that there is always an untouched copy of the information in case something further goes wrong while working on the recovery. Due to the volume of data that was being dealt with, the copy took until the early hours of the following morning.

At that point, based on their initial investigations, the data recovery specialists set the expectation with us, that we would recover some of the data, possibly not all of it though, and that it could take 4-5 days. From that point forward we have a tale of increasing woe as each new deadline set by the data recovery people was broken as they discovered more and more problems. In the following paragraphs I will briefly cover off the main problems that have been encountered.

The Sun NAS that we had selected for the mail storage platform is the first series of products to emerge from Sun since their purchase of StorageTek, and as such does not run the usual Sun OS of Solaris. It uses StorageTek's own proprietary OS which is a heavily modified FFS2 (Fast File System 2). The modifications are all about increasing the performance of the system to ensure enterprise level performance.

As the kit is fairly new to market, the data recovery specialists had not worked on this specific OS before, though they do have a lot of experience with NAS's in general, Therefore they had to significantly rewrite the tools that they use for analysing and recovering data. They utilised their engineering departments in both the UK and the US to work around the clock to achieve a re-worked set of tools.

Apart from the tools issue, the proprietary OS, uses the 1st volume it has access to, to store the master inode table. For more information about inodes, take a look at the Wikipedia article http://en.wikipedia.org/wiki/Inode. Essentially this is the table that tells the system where all the other files on the system are. As I mentioned earlier, the PlusNet engineer involved, had attempted to recover the data by creating a volume of the same size in the same place as the 1st volume. That action more than any other has caused us the most issues. By creating a new volume, the existing inodes were wiped and all data that was on that volume was essentially gone. Without that master inode table and with no knowledge of where the system stored it's back-up copy of this table, it has proved very difficult to work out what the data on the relatively undamaged 2nd and 3rd volumes actually is.

We have received a partial file list from the 2nd and 3rd volumes. This list amounts to a list of inodes and the data in them, not the list of complete files. Without even a partial directory structure it becomes vastly more complex to work out which inodes are associated with which other inodes and therefore piece together the complete files. Without the data on the 1st volume we do not believe we are ever going to get the directory structure. Without the directory structure it becomes vastly more complex trying to work out which file from the partial list belongs to which user.

So, here we are, it is almost a month since the 700GB of email and mailing lists were lost and we still have no recovered data to return to you. This is of course upsetting for us, and even more so for the customers whose data has been affected. The longer we wait for the equipment to be returned to us the greater the risk we run of hitting other capacity issues that we know are ahead of us, and we do not feel that we can justify any longer a wait, and still be taking the appropriate action for our customers.

When it is implemented the new platform will provide us and you with a vastly scalable, site resilient mail storage set-up, with 6 hourly checkpoints to ensure we can roll back the majority of the changes that happen on the system within a four hour period.

What this really means is that we are currently arranging for the return to us, of the Sun NAS head unit and Disk Arrays, so that we can push forward with the implementation of the new email storage platform.

**peterb** · 05-08-2006, 09:25 AM

Interesting story - I suppose the moral is..

Be VERY careful (double and triple check what you are deleting (and I guess most of us have done something like this, although not on such a monumental scale!) and that when you have done something major - STOP - and don't do anything else until you understand and have thought through the implications.

Wonder what the 'engineer' is doing now...?

**BigBry** · 05-08-2006, 10:35 AM

The relevant forums are full of disgruntled PN customers who left their emails on PN's servers who've now lost the lot!

There are also a number of people commenting about PN's lack of a real back up system (ie tape) and how you should never be in a position where you're actually working on both original and backup system in the same operation. Having a seperate true tape backup would have minimised the issue and led to less PN customers (like me) leaving them for good.

**atrull** · 05-08-2006, 11:00 AM

the problem is that the mail unread is usualy new - few mail systems I've ever seen actualy backup mail as it arrives

the main 'backup' is the fact that the mail is on massive redundant sans

but thats kinda useless when someone rm -rf it

there are ways of backing up live incoming mail though, for instance sending all mail to a 'backup-received' platform which wasn't on the same san/cluster - in addition to the recipients inbox. privacy laws would probabley disagree

**chuckskull** · 05-08-2006, 11:59 AM

As usual it looks like it would of all been ok if somone hadn't panicked and tried to fix it all themselves. I think we've formatted the wrong parition before but never on a such a scale.

**robg1701** · 07-08-2006, 12:57 AM

/me pats own mailserver

**pp05** · 07-08-2006, 01:35 AM

I don't use the ISP's mail service.

**badass** · 07-08-2006, 11:01 AM

Looks to me like they didn't bother testing this upgrade in a simulation environment first?

**BigBry** · 15-08-2006, 06:54 PM

So after this event, a different email today from Plus Net about deleting customers websites and guess what the back-up system failed here too.

Thank goodness I'm now with Eclipse.

We have now completed our investigation into the problem with missing CGI websites and identified the cause of the problem. We have restored the majority of customer data, however 71 directories have been lost which we believe relates to around 10 customers' sites.

One of our backend account management servers which generates the list of customers that exist on the CGI platform had an issue. This meant that rather than creating a complete list of customers an empty file was created. This empty file was then processed by the CGI servers and started removing all accounts. As there were no entries in this file the CGI servers interpreted this as being that there shouldn't be any accounts on the CGI platform. We discovered that this was happening and stopped this immediately; unfortunately customers with a username beginning with 'A' and some of 'B' had their CGI directories removed.

We immediately started recovering these and discovered a small number of the archived backups which the script creates were corrupt. This is because their directories were archived more than once. When the second archive was created it was creating an archive of an empty directory and so there was nothing to archive and overwrote the backup with an empty file.

We have modified our scripts to include the following changes to make sure this does not happen again:

1) At the end of the file with the list of users we will include a unique key. If this key is not present the script will not process any additions or deletions.

2) Each archive backup file that is created will have a unique timestamp as part of the filename. This means if the archive is attempted more than once a unique file will be created for each attempt.

We would like to offer our apologies to the customers affected by this. We intend to re-enable CGI activations for customers that have activated their CGI space since last Friday later in the week.

**BigBry** · 15-08-2006, 07:04 PM

I forgot to add they're also the ISP who sent out 20,000 customer details by mistake!

http://www.theregister.co.uk/2006/07...tomer_details/

Thread: ISP PlusNet Accidentally Deletes 700GB of Email

LinkBack

Thread Tools

ISP PlusNet Accidentally Deletes 700GB of Email

Thread Information

Users Browsing this Thread

Similar Threads

Plusnet Announce Bandwidth "Cap"

Posting Permissions