Server Outage - Friday 17th March

Chris Board

Mar 21, 2017

On Friday 17th March we had a server outage which lasted around 40 minutes.

We strive to offer the best service we possibly can and to ensure that our apps and services work at all times. We are particularly unhappy that this is the second outage within the last two months.

At 22:44 UTC on Friday 17th March we started receiving alerts from our monitoring system that the web server was no longer responding. Shortly afterwards we received an alert from our hosting provider that they had detected a hardware fault with the server hosting our VPS (Virtual Private Server).

At 23:00 we received a message from our hosting company stating that the hardware fault had been resolved and that the server was in the process of booting. At 23:19 UTC we started receiving notifications that services were available again, at which point we began our own testing. We confirmed that services were working correctly and, at 00:22 UTC, updated our status page to confirm that services had been restored.
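For anyone who wants to check the timings, here is a small Python sketch that works out how long after the first alert each step happened. It assumes the 23:00 message was also in UTC and that the 00:22 status page update fell on Saturday 18th March, since it came after midnight; the labels and the function name are just illustrative.

```python
from datetime import datetime

# Timestamps taken from the timeline above. Assumptions: 23:00 is UTC like
# the other times, and 00:22 is on 18th March (after midnight).
TIMELINE = [
    ("first alert: web server not responding", datetime(2017, 3, 17, 22, 44)),
    ("hosting provider reports fault resolved", datetime(2017, 3, 17, 23, 0)),
    ("services available again", datetime(2017, 3, 17, 23, 19)),
    ("status page updated", datetime(2017, 3, 18, 0, 22)),
]


def minutes_since_start(timeline):
    """Return (label, whole minutes since the first event) for each event."""
    start = timeline[0][1]
    return [
        (label, int((when - start).total_seconds() // 60))
        for label, when in timeline
    ]


if __name__ == "__main__":
    for label, minutes in minutes_since_start(TIMELINE):
        print(f"+{minutes:3d} min  {label}")
```

On these figures, services were reported available 35 minutes after the first alert, which is in line with the "around 40 minutes" quoted at the top of this post.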

As you may be aware, we only had one web server. Originally it didn't run anything except some internal software and the website, so although not ideal, it wasn't a massive issue if the server went down (and the server going down was extremely rare). However, now that we have APIs for our Android app (Boardies MySQL Manager), we want to avoid downtime at all costs.

Therefore, over the weekend we added some resilience to our web server setup to ensure there is no service downtime, whether from planned or unplanned server maintenance.

We have now added a second web server, which is an exact, replicated duplicate of the first, along with a load balancer so that traffic is split between the two servers. We have also confirmed with our hosting provider that the hardware issue has been resolved and that the two servers are on separate host machines, so a hardware fault on one host shouldn't take both servers down.

This ensures that if we do planned maintenance, or if one of the servers has an unexpected failure, our web service and APIs will continue to work without interruption. If a particular server is acting strangely or slowly, we can remove it from the load balancer until the issue is resolved.
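To illustrate the idea, here is a minimal Python sketch of a round-robin load balancer that lets a server be taken out of rotation and put back later. It is not the load balancer we actually run (that is provided by our hosting setup); the class name, method names and the server names "web1" and "web2" are all made up for the example.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer: hands out servers in turn, skipping
    any that have been removed from rotation."""

    def __init__(self, servers):
        self.servers = list(servers)          # fixed, ordered pool
        self.in_rotation = set(self.servers)  # servers currently receiving traffic
        self._counter = 0

    def remove(self, server):
        """Take a server out of rotation, e.g. for maintenance."""
        self.in_rotation.discard(server)

    def restore(self, server):
        """Put a known server back into rotation."""
        if server in self.servers:
            self.in_rotation.add(server)

    def next_server(self):
        """Return the server that should handle the next request."""
        active = [s for s in self.servers if s in self.in_rotation]
        if not active:
            raise RuntimeError("no servers in rotation")
        server = active[self._counter % len(active)]
        self._counter += 1
        return server


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2"])  # hypothetical server names
    print([lb.next_server() for _ in range(4)])
    lb.remove("web2")                          # web2 is misbehaving
    print([lb.next_server() for _ in range(2)])
```

With both servers in rotation, requests alternate between them; once one is removed, every request goes to the remaining server until the other is restored.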

This work has been completed and both servers have been added to our monitoring system, so we know exactly what they are doing and should be alerted quickly to a server failure.
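As a rough idea of what per-server monitoring involves, the Python sketch below probes a health URL on each server and reports the ones that fail. This is not our monitoring system, just an illustration using the standard library; the function names, the example URLs and the "2xx within 5 seconds means healthy" rule are all assumptions.

```python
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive
        # from OSError, so any of them counts as a failed check.
        return False


def failing_servers(health_urls, probe=http_probe):
    """Return the names of servers whose health URL fails the probe.

    health_urls maps a server name to the URL to check; probe can be
    swapped out, which makes the function easy to test offline.
    """
    return [name for name, url in health_urls.items() if not probe(url)]


if __name__ == "__main__":
    # Hypothetical health-check URLs, one per web server.
    servers = {
        "web1": "http://web1.example.com/health",
        "web2": "http://web2.example.com/health",
    }
    down = failing_servers(servers)
    if down:
        print("ALERT: not responding:", ", ".join(down))
```

A real monitoring system would run a check like this on a schedule and usually wait for a few consecutive failures before raising an alert, to avoid false alarms from a single slow response.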

If you have any questions or are experiencing any issues then please get in touch with us.

We apologise for any inconvenience the server outages have caused.

Thanks

Boardies IT Solutions
