(RFO) Downtime Information

August 10th, 2011


If you’re not interested in what happened, just jump down to the paragraph that starts with “Now,” — the rest is a description of the morning catastrophe.

At about 11:30am (EDT) this morning, there was a widespread power outage in the Dallas-Fort Worth area, possibly related to heat-induced blackouts (it has been over 100°F in Dallas for the last week). It affected hundreds of customers, but most importantly, it affected the datacenter (Colo4) that our web server was located in. We host with KnownHost.com, a reputable, high-quality web host with thousands of affected customers.

This power failure by itself would have been insignificant, as the datacenter has redundant power at three different levels, except that the Automatic Transfer Switch (ATS), which is responsible for switching from utility power to the backup generators, failed. This is a serious problem: these are 2000+ amp systems moving more than a megawatt of power, and they require knowledgeable, competent specialists making calculated, rational decisions to prevent server-level hardware failure or worse.

Colo4 brought vendors to the site to identify and repair the problem, and at about 2:30pm they determined that the ATS had failed and that it would take “some time” to repair. Fortunately, they had another ATS on site that could be substituted in. Substituting the ATS required taking the entire datacenter offline in order to prevent power-related failures, and then actually replacing the infrastructure. That is a sizable job, and it appears they handled this catastrophe with professionalism.

To put this in perspective: in my time with KnownHost (about 3 years), this is the first downtime I have experienced. According to others’ reports on Colo4, this is the first power-related downtime they have seen in over 8 years. This was a disaster, but it does not appear to be due to negligence on their part. You can rest assured this is not going to happen again; these are both high-quality web hosts who learn from their mistakes.

Now, it is my understanding that some sites running certain revisions of MyReviewPlugin may have been subject to backend downtime. In one of the most recent revisions, I replaced the licensing system so that it only makes calls to the license server when you are in wp-admin and, more importantly, fails gracefully (that is, if it cannot reach the license server, it doesn’t break). This means that for most users the front end of their site was entirely unaffected.
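As a rough illustration of what “fail gracefully” means here, the following is a minimal sketch in Python. It is not the plugin’s actual PHP code: the names `check_license` and `fetch_status`, and the `/wp-admin/` path test, are assumptions made up for this example.

```python
from typing import Callable

# Assumption for this sketch: admin pages are identified by their URL path.
ADMIN_PREFIX = "/wp-admin/"


def check_license(request_path: str, fetch_status: Callable[[], bool]) -> bool:
    """Return True if the site should be treated as licensed.

    fetch_status is a hypothetical callable that contacts the license
    server and raises OSError (or a subclass) on network failure.
    """
    # Front-end requests never contact the license server at all.
    if not request_path.startswith(ADMIN_PREFIX):
        return True
    try:
        return fetch_status()
    except OSError:
        # Fail open: an unreachable license server must not break the site.
        return True
```

The key design choice in this sketch is that a network error is treated the same as a valid license, so an outage on the license server cannot take a customer’s site down with it.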

Indeed, I checked as many users’ websites as I could think of during the downtime and found that they were all online, though users may have had trouble accessing wp-admin/ on their sites. Sites running an earlier version of MyReviewPlugin, including one of mine, were subject to a delay of approximately 5-10 seconds waiting for the server. This is completely unacceptable and was absolutely my fault, and for that I sincerely apologize. I’ve spent the morning stressed out of my mind dealing with this.

Here is how I am making this right:

1. I’ve pushed a patch (revision 4456rep) to the download server that sets the licensing system to run only once every 7 days, to fail gracefully after a 1-second hang, and to run only on the MyReviewPlugin-related administration pages. You can download the latest version of MyReviewPlugin at http://www.myreviewplugin.com/private/recover.html. If you are not running at least version 5, please contact me before doing this, as you may need assistance upgrading.

2. I have created a replacement version of the licensing functionality that does not do ANY license checking whatsoever. As a valid license holder, you can use this file on any of your sites as desired. To install it, simply unzip the file and overwrite the myrp-license.php file in your MyRP directory. To get your hands on this, send me an email with your valid license key. That said, the previous update is a “good enough” solution to resolve the problem witnessed the other day.

3. I set up a Twitter account and a Gmail account during the downtime to give users an alternative way to contact me. You can follow us at https://twitter.com/#!/myreviewplugin and you can send email to myreviewplugin@gmail.com. Note that I am unlikely to check the Gmail account while our traditional support channels are available, and I am in the process of changing how email works at MyReviewPlugin to make sure I am always accessible.

4. I’ve purchased and am currently preparing a backup web server in a different datacenter and will maintain it at all times so that any long downtime in the future can be avoided by simply repointing the domain name to the new server. If you’re interested in the details here, I’ve set up a middleman DNS service with CloudFlare.com that will allow “at the drop of a hat” DNS changes (rather than the normal 24 hours).

5. At my earliest free period, I am setting up a blog at this site, blog.myreviewplugin.com, which will be located on a different web server. It will carry status updates during any downtime and let users stay up to date with MyReviewPlugin happenings, be they cool features, tricks or walkthroughs, new versions, etc. Similarly, over the next few days I will set up status.myreviewplugin.com on that server, with mark@status.myreviewplugin.com as a backup email address.

6. I am in the process of moving email to Google Apps for Domains and considering other options for where to host my support desk (I tried Zendesk, but it seems to send emails far too slowly). This will ensure that there are at least 4 different, distinct datacenters that host MyReviewPlugin-related contact information.
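For readers curious how the throttling in item 1 (one check every 7 days, giving up after 1 second) might work, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code in the patch: `ThrottledLicenseCheck`, `fetch_status`, and the injectable `clock` are hypothetical names, and recording the attempt time even when the server is unreachable is a design choice of this sketch.

```python
import time
from typing import Callable, Optional

CHECK_INTERVAL = 7 * 24 * 60 * 60  # seven days, in seconds
TIMEOUT = 1.0  # seconds to wait for the license server before giving up


class ThrottledLicenseCheck:
    """Contact the license server at most once per CHECK_INTERVAL."""

    def __init__(self, fetch_status: Callable[[float], bool],
                 clock: Callable[[], float] = time.time) -> None:
        # fetch_status(timeout) is a hypothetical callable that asks the
        # license server and raises OSError (e.g. TimeoutError) on failure.
        self._fetch = fetch_status
        self._clock = clock
        self._last_attempt: Optional[float] = None
        self._last_result = True  # assume valid until told otherwise

    def is_valid(self) -> bool:
        now = self._clock()
        recently_checked = (
            self._last_attempt is not None
            and now - self._last_attempt < CHECK_INTERVAL
        )
        if recently_checked:
            return self._last_result
        # Record the attempt even if it fails, so a dead server costs at
        # most one short hang per interval.
        self._last_attempt = now
        try:
            self._last_result = self._fetch(TIMEOUT)
        except OSError:
            pass  # keep the previous result; do not break the page
        return self._last_result
```

With this shape, an unreachable license server delays at most one admin page load by about a second each week, instead of every page load by several seconds.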

Needless to say, I hope that with the plan above, nothing of this nature will ever happen again. I hope you can continue to trust my dedication to my product, service, and user support.

