A damning post-mortem of the Cloudflare incident: what really happened?


On November 2, 2023, Cloudflare’s customer-facing interfaces, including its website and APIs, as well as its logging and analytics tools, stopped working properly. This was obviously a real problem.

More than 7.5 million websites use Cloudflare, and 3,280 of the world’s 10,000 most popular websites rely on its Content Delivery Network (CDN) services. The good news is that the CDN did not fail. The bad news is that the Cloudflare dashboard and its application programming interfaces (APIs) were down for almost two days.

This kind of thing doesn’t happen – or at least shouldn’t happen – to big internet service companies. The multi-million euro question is: “What happened?”

An electrical incident occurred in one of the company’s three main data centers

According to Matthew Prince, CEO of Cloudflare, the answer was an electrical incident at one of the company’s three main data centers in Oregon (USA), a facility operated by Flexential, which set off a series of cascading problems. Thirty-six hours later, Cloudflare services were finally back to normal.

Prince went out of his way to explain the problem:

To begin with, this should never have happened. We believed we had high availability systems in place that should have prevented such an outage, even when one of our major data center providers experienced a catastrophic failure. Although many systems remained online as expected, some critical systems had non-obvious dependencies that made them unavailable. I am sorry and embarrassed for this incident and the difficulty it has caused our customers and our team.

He is right. This incident should never have happened. Cloudflare’s control plane and analytics systems run on servers spread across three data centers around Hillsboro, Oregon. The three facilities are independent of each other, and each has multiple redundant and independent power feeds and Internet connections.

Bad luck! Flexential turned on its generators to supplement the utility feed

The three data centers are far enough apart that a natural disaster wouldn’t cause them all to go down at once. At the same time, they are close enough to be able to leverage active, redundant data clusters. So, by design, if one of the facilities fails, the others should take over the load and continue to operate.
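
To make the idea concrete, here is a minimal sketch in Python of the kind of health-check-driven routing such a design implies. The facility names and health-check URLs below are illustrative assumptions, not Cloudflare’s actual topology or tooling.

import urllib.request

# Hypothetical facilities; the names and URLs are illustrative only,
# not Cloudflare's real topology.
FACILITIES = [
    {"name": "facility-a", "health_url": "https://facility-a.example.internal/healthz"},
    {"name": "facility-b", "health_url": "https://facility-b.example.internal/healthz"},
    {"name": "facility-c", "health_url": "https://facility-c.example.internal/healthz"},
]

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the facility answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_facility() -> str:
    """Route to the first healthy facility: by design, losing any one
    facility should simply shift the load onto the survivors."""
    for facility in FACILITIES:
        if is_healthy(facility["health_url"]):
            return facility["name"]
    raise RuntimeError("no healthy facility available")

Real routing is of course far more sophisticated, but the contract is the same: no single facility should be a prerequisite for staying online.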

Sounds great, doesn’t it? But that’s not what happened.

What happened first was that a power problem at Flexential’s facility caused an unexpected service outage. Portland General Electric (PGE) had to shut down one of the building’s independent power feeds. The data center has multiple feeds, with some level of independence, that can power the facility. But Flexential then started its generators to supplement the feed that had gone down.

For those unfamiliar with data center best practices, this is an approach you are supposed to avoid.

You do not run utility power and generators at the same time. To make matters worse, Flexential did not tell Cloudflare that it had switched to generator power.

A 12,470-volt fault that knocked out the feed and all the generators

Then there was a ground fault on a PGE transformer that supplied power to the data center. And when I say ground fault, I’m not talking about a short circuit, like the kind that forces you to go down to the basement to fix a fuse. I’m talking about 12,470 volts that knocked out the connection and all the generators in less time than it took you to read this sentence.

In theory, the UPS batteries should have kept the servers running for 10 minutes, which would have been enough time to bring the generators back online. Instead, the UPS units began to fail after about four minutes, and the generators could never be restarted in time anyway.

Oops!

A technician on the job for a week as the only staff

Perhaps no one could have coped with it. But the overnight staff on site “consisted of security and an unaccompanied technician who had only been on the job for a week”. In short, the situation was desperate.

Meanwhile, Cloudflare discovered the hard way that some critical systems and services had not yet been integrated into its high-availability setup. In particular, Cloudflare’s decision not to put its logging systems on the high-availability cluster, on the grounds that analytics delays would be acceptable, proved to be a mistake. Why? Because Cloudflare staff couldn’t look at the logs to work out what was going wrong. And so the outage dragged on.
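
To illustrate why that hurt, here is a small, generic Python sketch of a log shipper that spools records locally when the central logging pipeline is unreachable, so operators still have something to read during an incident. The endpoint and spool path are assumptions made for illustration, not Cloudflare’s implementation.

import json
import socket
import time

# Hypothetical endpoint and spool file; illustrative only.
CENTRAL_LOG_HOST = ("logs.core.example.internal", 5140)
LOCAL_SPOOL = "/var/tmp/buffered-logs.jsonl"

def ship(event: dict) -> None:
    """Send a log event to the central pipeline; if the facility that
    hosts it is down, spool the event locally instead of losing it."""
    payload = json.dumps(event).encode()
    try:
        with socket.create_connection(CENTRAL_LOG_HOST, timeout=1.0) as sock:
            sock.sendall(payload + b"\n")
    except OSError:
        # Central pipeline unreachable: keep a local copy so the record
        # survives and can still be inspected or replayed later.
        with open(LOCAL_SPOOL, "a") as spool:
            spool.write(payload.decode() + "\n")

ship({"ts": time.time(), "msg": "control-plane health check failed"})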

And while the three data centers were “mostly” redundant, they weren’t completely so. The region’s two other data centers took over the high-availability cluster and kept critical services online.

The benefits and limits of high availability

So far so good. However, a subset of services that were supposed to be on the high-availability cluster depended on services that were running exclusively on the data center that had just died.

Specifically, two critical services that process logs and power Cloudflare’s analytics – Kafka and ClickHouse – were only available in the data center that had gone offline. So when services on the HA cluster called Kafka and ClickHouse, they got no response.
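
In miniature, the failure mode looks like the Python sketch below: a nominally highly available endpoint that blocks on a single-homed dependency goes down with it unless it can degrade gracefully. The hostnames, ports, and cached fallback are assumptions made for illustration, not Cloudflare’s code.

import socket

# Hypothetical single-homed dependencies; illustrative addresses only.
KAFKA_BROKER = ("kafka.facility-a.example.internal", 9092)
CLICKHOUSE = ("clickhouse.facility-a.example.internal", 8123)

def reachable(addr: tuple, timeout: float = 1.0) -> bool:
    """Cheap TCP reachability probe for a (host, port) pair."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def serve_analytics_request() -> dict:
    """A 'highly available' endpoint that depends on Kafka and ClickHouse.
    If both live in a single facility and that facility dies, the only
    way to stay up is to answer from a degraded, cached path."""
    if reachable(KAFKA_BROKER) and reachable(CLICKHOUSE):
        return {"status": "ok", "source": "live"}
    return {"status": "degraded", "source": "cache",
            "note": "analytics backends unreachable"}

print(serve_analytics_request())

The lesson Cloudflare draws is the stricter one: the dependency itself should have been replicated across facilities, not merely tolerated when absent.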

Cloudflare admits it was “far too lax and should have required new products and their associated databases to integrate into the high availability cluster.” Furthermore, far too many of its services depend on the availability of its core facilities.

The king of distributed systems let down by its core systems

Many companies run this way, but Prince admits that it “doesn’t work in Cloudflare’s favor. We’re good at distributed systems. Throughout this incident, our global network continued to work as expected, but far too many systems fail if the core is unavailable. We must use the distributed systems products that we make available to all of our customers for all of our services, so that they continue to operate almost normally even if our core facilities are disrupted.”

A few hours later, everything was finally back to normal. And it wasn’t easy. For example, almost all of the circuit breakers had blown, and Flexential had to purchase new ones to replace them.

Suspecting that there had been several power surges, Cloudflare also decided that “the only safe recovery process was to do a complete reboot of the entire installation.” This meant rebuilding and restarting all of the servers, which took hours.

The incident, which lasted until November 4, was eventually resolved. Looking ahead, Mr. Prince concluded: “We have the right systems and procedures in place to be able to withstand even the series of cascading outages we saw at our data center provider, but we need to be more rigorous in applying these systems and procedures and testing them for unknown dependencies. This issue will have my full attention and that of much of our team until the end of the year. And the difficulty of the last two days will make us better.”


Source: “ZDNet.com”


