A number of Microsoft Azure cloud services recently malfunctioned for a subset of customers using services hosted in a Microsoft data centre in Japan. The cause was a cooling system outage triggered by a failed RUPS (rotary uninterruptible power supply). Microsoft Azure has issued a report on the incident, which took place on 31 March 2017, and it serves as a reminder of the importance of ensuring the integrity of mission-critical infrastructure.
The report states that, as part of standard monitoring, Azure engineers received alerts for availability drops in this region. Engineers identified the underlying cause as a failure within the power distribution system, which was running at N+2, compounded by an error in the safe power recovery procedure that followed. One RUPS in the N+2 parallel line-up failed, leaving the cooling system in this datacentre without power.
As a consequence of the cooling system going down, some resources were automatically shut down to avoid overheating and to ensure data integrity and resilience. The first failure within the power distribution system, which was running at N+2, occurred at 11:28 UTC. The Facility Service Provider responded promptly and initiated the safe power recovery procedure.
However, there was an error in the safe power recovery procedure, and one of the cooling systems was incorrectly shut down. As a result, some areas of the facility lost cooling and temperatures inside rose past safe thresholds. Azure engineers and the Facility Service Provider received multiple overheating alerts from the facility and started using outside airflow to force-cool the data centre.
At 13:46 UTC, Microsoft site services personnel were onsite with the Facility Service Provider. They restarted the cooling system air handlers and continued using outside airflow to force-cool the datacentre. At the same time, Azure engineers prepared to bring systems back online once cooling was restored. By 15:24 UTC, the Facility Service Provider confirmed that the cooling systems had been restored successfully, and temperatures in some impacted areas inside the datacentre returned to safe operational thresholds. A thorough health check was completed after the RUPS and cooling systems were restored, and any components that had failed or were suspected of overheating damage were replaced and isolated. The majority of services were recovered successfully by 18:51 UTC.
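To put the reported timeline in perspective, the elapsed time between each milestone and the first failure can be worked out with a few lines of Python. This is only an illustrative sketch: the timestamps are the ones quoted above, while the event labels, the `EVENTS` list, and the helper function names are assumptions made for this example and do not come from the Azure report.

```python
from datetime import datetime

# Timestamps (UTC, 31 March 2017) as quoted in the article.
# The labels are paraphrases chosen for this example.
EVENTS = [
    ("First power distribution failure", "11:28"),
    ("Site services onsite, air handlers restarted", "13:46"),
    ("Cooling systems confirmed restored", "15:24"),
    ("Majority of services recovered", "18:51"),
]


def minutes_since_start(events):
    """Return (label, minutes elapsed since the first event) for each event."""
    start = datetime.strptime(events[0][1], "%H:%M")
    result = []
    for label, clock in events:
        delta = datetime.strptime(clock, "%H:%M") - start
        result.append((label, int(delta.total_seconds() // 60)))
    return result


def format_elapsed(minutes):
    """Format a minute count as hours and minutes, e.g. 138 -> '2h 18m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    for label, elapsed in minutes_since_start(EVENTS):
        print(f"+{format_elapsed(elapsed)}  {label}")
```

On these figures, cooling was confirmed restored 3 hours 56 minutes after the first failure, and the majority of services were back 7 hours 23 minutes after it.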