
We strive to keep all of our products reliable and consistent. Whenever a service is interrupted, we work to identify the root cause and put a fix in place so that the event does not happen again.

On Tuesday, July 22, selected clients experienced an outage from approximately 5 a.m. to 11 a.m. EST.

The underlying cause:
An error in the behavior of the clustering services took a number of mailbox stores offline, which prevented access to the mailboxes they hold. The same event also introduced inconsistencies into the log files generated for these mailbox stores, which made bringing them back online a lengthy process with some element of risk.

Once we had taken steps to ensure that our incoming mail servers would continue to accept mail, we made copies of all affected mailbox stores to secure the existing data before beginning to rebuild them. The rebuild process is resource intensive, so to minimize downtime for our customers we allocated additional hardware resources to the recovery.

Recovery of mailboxes began 3 hours after the initial problem and was complete 9 hours later. Other dependent services were brought up on completion of this work.

Steps taken to prevent recurrence:

  • We have modified the behavior of the clustering service to minimize the risk of multiple mailbox stores being affected in this way.
  • We have modified the distribution algorithm for mailboxes to minimize the impact of a failure of a mailbox store on any individual customer.
  • We have designed and are testing a method for providing continuity of service during mailbox store operations of this nature.

Should you have any questions about the above, please let us know.

Best regards,

- The Technical Support Team at SureTech.com
