Servers offline

03 June 2016 9:56 AM

Three servers are offline. It appears to be a problem at the datacentre, which is being investigated.

Affected servers:

Daenerys
Lannister
Email 4

10:03 AM Hetzner has confirmed a networking error in their Samrand DC which is being worked on.

10:05 AM Hetzner seems to have fixed the problem; all servers are now accessible.

Daenerys inaccessible

2 May 2016 5:38 PM Daenerys (197.189.230.226) is currently inaccessible due to what appears to be a network problem at Seacom / Hetzner. I’m still waiting for feedback from Hetzner.

5:45 PM Daenerys is accessible again, but connections are slow. Still waiting for feedback from Hetzner.

5:56 PM Inaccessible again. Still waiting for Hetzner to respond.

6:02 PM No feedback from Hetzner yet (email or phone), but they have updated their “Network Notices” page with the cause of the problem, a DDoS attack:

  • DDOS – Network connectivity – Johannesburg data centre (2016-05-02, ATTENDING)

    Start: 2016-05-02 5:53:48 SAST
    Resolved: TBA
    Status: attending
    Point of impact: Truserv and Co-Location customers
    Symptoms: Truserv and Co-location customers will have intermittent connectivity to their servers.
    Cause of problem: DDos
    Estimated time of repair: TBA
    Attending: Hetzner Engineers

Source: Hetzner Network Notices

6:20 PM Daenerys appears to be 100% accessible now.

Arya: Emergency maintenance

10 February 2016

8:00 Arya is back online. Performance will be degraded (sites will be slow) while the RAID rebuilds for the next hour or so.

7:15PM The server is being shut down now.

7:10PM The backup has completed.

5:15PM Starting a full server backup now.

5PM Hetzner has just informed me that the RAID alarm for the server Arya is sounding.
This means that either a hard drive has failed (and requires replacing) or there is something wrong with the RAID configuration.
I have given them the go-ahead to take the server offline to diagnose and fix the problem.
I expect up to 30 minutes of downtime while they do this, for which I apologise.

Intl. connectivity: CloudFlare

Clients who use CloudFlare for their websites might be affected by an outage involving two undersea cables and CloudFlare.

CloudFlare status: https://www.cloudflarestatus.com

More information: http://www.techcentral.co.za/seacom-wacs-problems-hit-sa-internet/62649/

Emergency maintenance

Tue 11 August 4:40PM Hetzner has informed us that the RAID alarm is currently sounding, and that Keats needs to be switched off in order to diagnose and fix the problem.

5:28PM Keats is being shut down now.

6:02PM Response from Hetzner:

This mail serves to confirm that the maintenance on your server tex001_truservcomm_jhb1_009, was completed successfully. 
SDA was swapped, and RAID is currently rebuilding.

6:15PM The RAID rebuild is at 24%

6:27PM RAID rebuild is at 30%

6:41PM 41%

6:53PM RAID rebuild is at 50%

7:19PM 65%

7:51PM 79%

8:07PM RAID rebuild is at 90%
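
For anyone who wants a rough finish time from the rebuild figures above, the sketch below extrapolates to 100% from the first and last logged readings. It assumes the rebuild continues at its average rate so far, which a RAID rebuild does not necessarily do, so treat the result as a guess and not as a figure from Hetzner. The function name estimate_completion is made up for this illustration.

```python
from datetime import datetime, timedelta

# (time, percent complete) readings copied from the log entries above
readings = [
    ("6:15PM", 24), ("6:27PM", 30), ("6:41PM", 41), ("6:53PM", 50),
    ("7:19PM", 65), ("7:51PM", 79), ("8:07PM", 90),
]

def estimate_completion(readings):
    """Extrapolate to 100% using the average rate between the first and last reading.

    Assumption: the rebuild rate stays constant, which is only an approximation.
    """
    parsed = [(datetime.strptime(t, "%I:%M%p"), pct) for t, pct in readings]
    (t0, p0), (t1, p1) = parsed[0], parsed[-1]
    rate = (p1 - p0) / ((t1 - t0).total_seconds() / 60)  # percent per minute
    return t1 + timedelta(minutes=(100 - p1) / rate)

eta = estimate_completion(readings)
print(eta.strftime("%I:%M%p"))
```

With the readings above, the average rate is about 0.6% per minute, which puts the estimated finish shortly before 8:24PM.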

10:36PM Hetzner says that the server is “fixed”. Unfortunately, it won’t boot. I am therefore going to reinstall the server, and then restore all hosting accounts from backup. Please accept my sincere apologies for this. I will work through the night and tomorrow to get all sites up as soon as possible.

 

2AM Wed 12 August The OS has been rebuilt, and cPanel and CloudLinux have been installed. Restoration of hosting accounts is starting now.

4:30AM All accounts have been restored from backup.

Ongoing maintenance: Tyrion

10:05 AM Tyrion will be offline for short periods of time throughout the day as Hetzner technicians attempt to troubleshoot a hardware error.

Most clients have been moved off Tyrion and onto Lannister, so this will affect only the 10 or so domains which are still pointing to Tyrion.

1:15 PM Troubleshooting is complete. Tyrion will go offline at 6PM tonight (14 July 2015) for up to 12 hours while the OS is reinstalled.