For around two hours on the morning of June 21, 2022, many internet services around the world were unavailable or only partially available. The trigger was an error at the network service provider Cloudflare. Why is Cloudflare so important?
“Error 500 - Internal Server Error”: this was the morning greeting for many Internet users trying to reach their favorite and most important services. In a matter of minutes, thousands of digital businesses were disrupted. And while they were down, they lost revenue, client trust, and brand reputation because of their dependence on Cloudflare.
At 6:34 am European time, Cloudflare reported a critical incident, stating that “Connectivity in Cloudflare’s network has been disrupted in broad regions. The incident impacts all data plane services in our network.” Around two hours later the error was fixed and the Cloudflare services were running again.
How can a single company trigger such a widespread disruption of the Internet? And when will it happen again?
What role does Cloudflare play in global internet operations?
Cloudflare is a content delivery network (CDN) that protects and accelerates websites. It provides a web application firewall, load balancing, and DNS services from its data centers around the world. Cloudflare can be used to protect against denial-of-service attacks, reduce bandwidth costs, and improve website performance. It is one of the leading providers, with over 4 million customers worldwide, including companies such as Shopify, Fitbit, Just Eat, bet365 and many more. These customers depend on Cloudflare as the engine that delivers their services to billions of users.
What is a CDN?
CDN stands for content delivery network, and it does exactly what the name implies – delivers the content.
A CDN is essentially a group of nodes (servers) placed at strategic locations worldwide so that as many Internet users as possible can be reached in as little time as possible. A copy of the same file is kept on several of these servers. When a user requests the file, it is served from the nearest available node.
Some large-scale companies build their own CDN, while most of us rely on third-party CDNs such as Akamai, Cloudflare, or Fastly.
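The idea of serving a file from the nearest node can be illustrated with a small sketch. It is a simplification under stated assumptions: the node names and coordinates are hypothetical, and "nearest" is taken to mean shortest great-circle distance. Real CDNs typically decide based on network routing and measured latency rather than pure geography.

```python
import math

# Hypothetical edge nodes: name -> (latitude, longitude) in degrees.
# These are illustrative values, not any real provider's locations.
EDGE_NODES = {
    "frankfurt": (50.11, 8.68),
    "singapore": (1.35, 103.82),
    "new_york": (40.71, -74.01),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_node(user_location, nodes=EDGE_NODES):
    """Return the name of the node geographically closest to the user."""
    return min(nodes, key=lambda name: haversine_km(user_location, nodes[name]))


if __name__ == "__main__":
    paris = (48.86, 2.35)
    print(nearest_node(paris))
```

With these three hypothetical nodes, a user in Paris would be served from the Frankfurt node, because it is by far the closest of the three.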
What went wrong?
Cloudflare is quite open about outages that occur in its network and affect users. The recent outage was no exception: Cloudflare posted a detailed explanation of what caused it and how the company plans to improve.
Was the outage caused by a hacking attempt?
This clearly was not a hacking attempt. Cloudflare has publicly admitted that the cause was a configuration mistake in its own infrastructure, made during a config change that was supposed to improve the customer experience.
These days automation is used heavily in infrastructure management, because it makes it possible to roll out a config change at scale in a matter of minutes. Done manually, the same change would take days, if not months.
At the same time, an error left unchecked in the automation scripts can quickly turn into a nightmare and cause a global-scale outage.
The recent Cloudflare outage is a classic example.
Has this happened before and will it happen again?
In June 2022 alone, 28 incidents were reported, though most of them affected only a handful of customers. Cloudflare has experienced larger issues in the past, for example in July and August 2020. Competitors are not immune either: an incident at Fastly slowed down Internet traffic on June 8, 2021.
No vendor is fully immune to human error in the form of misconfiguration, to external threats such as DDoS attacks, to third-party vendor outages, or to physical threats such as power outages, flooding, or fire in a data center, any of which can set off a chain reaction.
What can you do against that?
There are various approaches that can be considered depending on the scale of the operations.
If you’re a large-scale enterprise, you could consider not relying on any third-party CDN and rolling out your own instead, similar to tech giants like Google or Facebook. You don’t need to match their scale, but you do need to figure out whether it makes business sense, because building a CDN brings its own set of challenges.
However, this is not a viable option for small and medium scale enterprises.
If you fall into the latter category, the MultiCDN approach makes more sense and could add some resilience. In the MultiCDN approach, you distribute your content among multiple CDN providers, so that if any one of them goes down, your services and customers are not affected. However, the MultiCDN approach involves a lot more tech than a single CDN and, of course, will multiply the cost.
Even with the best service vendors and multiple resilience layers, downtime can and will happen. A single hour of downtime can cost businesses millions of dollars through lost revenue, client trust, and brand reputation.
Businesses that actively take control of this financial risk would look to insurance as a solution. This would allow the business to be compensated for revenue losses, recovery costs, and service liabilities.
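A minimal sketch of the failover part of a MultiCDN setup could look like the following. Everything in it is an assumption made for illustration: the hostnames are placeholders, and the health check is passed in as a function so that the selection logic stays independent of how health is actually measured (in practice this might be an HTTP probe or a monitoring feed). Production setups often do this at the DNS or load-balancer level instead of in application code.

```python
from typing import Callable, Optional, Sequence

# Hypothetical CDN hostnames that all serve the same content,
# listed in order of preference.
CDN_HOSTS = ["cdn-a.example.com", "cdn-b.example.com"]


def pick_cdn(hosts: Sequence[str],
             is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first host reported healthy, or None if all are down."""
    for host in hosts:
        if is_healthy(host):
            return host
    return None


def asset_url(path: str,
              is_healthy: Callable[[str], bool],
              hosts: Sequence[str] = CDN_HOSTS) -> str:
    """Build the URL for an asset on the first healthy CDN."""
    host = pick_cdn(hosts, is_healthy)
    if host is None:
        raise RuntimeError("no healthy CDN available")
    return f"https://{host}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Simulate an outage of the preferred provider.
    down = {"cdn-a.example.com"}
    print(asset_url("/img/logo.png", lambda h: h not in down))
```

In the simulated outage above, requests fall through to the second provider. The extra cost mentioned in the text shows up here as well: content has to be kept in sync on every provider, and the health signal itself has to be reliable.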
About the authors
Rahul Makhija is a network engineer and web app developer who is either pushing bits or flipping bits, so is in-between bits all the time.
René Papesch is a technology risk expert and Co-Founder of Riskwolf AG.