The shaky foundation of our time

On 4 Oct 2021, around midnight, my colleagues called to tell me that our registration metrics were declining: far fewer new accounts were being created than on a normal day. When I joined the incident discussion thread, someone reported that Facebook services (including WhatsApp) were down globally.

It was immediately obvious to me that this was the cause of our falling registration numbers: we offer “Signup with Facebook” as one of the registration options, and our “Signup with phone number” option uses WhatsApp as the default OTP delivery channel. With both Facebook and WhatsApp gone, many users could not sign up in our app or on our website.

As part of our response plan, we toggled off WhatsApp and fell back to SMS for OTP delivery, and the registration numbers climbed back to normal levels. Business impact mitigated.
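For readers who want to picture the fallback, here is a minimal sketch of the idea, not our actual implementation. The channel objects, the `enabled` flag and the `deliverOtp` function are all hypothetical names I made up for illustration; the `send` functions are stubs, not real WhatsApp or SMS APIs.

```javascript
// Hypothetical sketch: try OTP delivery channels in priority order.
// Channel names and send() functions are illustrative assumptions,
// not a real WhatsApp or SMS provider API.
function deliverOtp(channels, phone, code) {
  const errors = [];
  for (const channel of channels) {
    if (!channel.enabled) continue; // channel toggled off by an operator
    try {
      channel.send(phone, code);
      return { delivered: true, via: channel.name, errors };
    } catch (err) {
      // Remember why this channel failed, then try the next one.
      errors.push(`${channel.name}: ${err.message}`);
    }
  }
  return { delivered: false, via: null, errors };
}

// Stub channels standing in for real providers.
const whatsapp = {
  name: "whatsapp",
  enabled: true,
  send() { throw new Error("provider unreachable"); },
};
const sms = {
  name: "sms",
  enabled: true,
  send() { /* pretend the message was sent */ },
};

const result = deliverOtp([whatsapp, sms], "+10000000000", "123456");
console.log(result.via); // sms
```

Setting `whatsapp.enabled = false` plays the role of the manual toggle described above: the loop skips the channel entirely instead of waiting for it to fail.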

At that point, I was not aware that I was caught up in what would turn out to be one of the biggest outages of the year in the tech world: it took Facebook close to 6 hours to get its services back online.

These 6 hours should not be taken lightly. Facebook (or Meta) might be perceived as a “social media” company by the general public, but its influence has long extended beyond casual social scenarios. It has a large catalog of products that serve as infrastructure for other online businesses: signup with Facebook and message delivery via WhatsApp are just 2 examples.

Obviously, the impact of this outage went well beyond Facebook itself. Many other businesses that have Facebook as part of their infrastructure were affected to different degrees. For a business that only offers “Login with Facebook” as a way for its users to log in, it basically meant the business was closed until Facebook came back.

A connected world is a good world, isn’t it? The same technology that brings convenience to an entrepreneur based in the Bay Area can extend its reach to the son of a farmer in a rural village in China: equal rights for all! What could go wrong with such a fantastic thing?

Well, things do go wrong, in so many unexpected ways. Here are just some of the recent widespread outages:

When everything is in good condition, we very rarely notice that so many things are powered by so few companies and people. In the case of Fastly’s outage in June 2021, the firm’s stock price saw a significant uptick after the incident, simply because investors suddenly discovered how important a role Fastly was playing in the World Wide Web.

However, when things start to go south, the ugly side of such heavy, centralized dependency on a handful of companies starts to surface. In the case of Cloudflare’s outage in June this year, popular sites like Twitter, AWS, Shopify, Udemy, and Quora were all among the victims. The reality we’re facing is unforgiving: when a piece of tech infrastructure falls, the more successful it is, the more casualties it causes.

Competition in tech tends to end up with one big winner followed by a few followers, and the top 3-5 companies will probably take more than 90% of the market. This level of centralization is an advantage in terms of efficiency, but also a source of concern for those building their products and businesses on top: what if these giants go down?

As we’ve seen so far, they do.

Well, you might think: those who manage to win the tech infrastructure war are supposed to be the best in the market, and the likelihood of their services going down is far slimmer than if we run our own infrastructure, right?

You’d be right, but there is a catch: self-managed infrastructures are independent of each other, so the chance of all of them breaking down at the same time is very small. With centralized infrastructure in the cloud, the impact of a single incident is far more widespread.
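A bit of back-of-the-envelope arithmetic makes the catch concrete. The sketch below assumes failures are truly independent, and the probabilities are numbers I picked purely for illustration; none of them are measured figures from any real provider.

```javascript
// Illustrative numbers only; none of these come from real providers.

// Chance that all `n` independent systems are down at the same moment,
// if each one is down with probability `p`.
function allDownIndependent(p, n) {
  return Math.pow(p, n);
}

// If all systems sit on one shared provider that is down with
// probability `q`, they all go down together with probability `q`.
function allDownShared(q) {
  return q;
}

const selfManaged = allDownIndependent(0.01, 3); // 3 sites, each down 1% of the time
const shared = allDownShared(0.001);             // one provider, down 0.1% of the time

console.log(selfManaged < shared); // true
```

Even though each self-managed site in this made-up scenario is ten times less reliable than the shared provider, the chance of all three being down together (roughly one in a million) is far smaller than the chance of the shared provider taking all of them out at once (one in a thousand).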

What’s more worrying: when so much is at stake, wouldn’t these infrastructures fall prey to attackers with malicious intentions? What if, instead of hijacking a plane and flying it into a tall building, terrorists attempt to inject failures into the very infrastructure that powers the majority of the global digital economy? Sneaking in a seemingly innocent code change that brings the internet down is not far-fetched, and is probably much simpler than taking control of a plane.

Brittleness of the internet

Some may argue that intentional damage can be prevented by implementing proper change processes (e.g. code review, approval workflows, etc.). The funny thing is, many things on the internet are not controlled by organizations with internally enforced processes, or by open source communities that define proper workflows for changes. They may simply be controlled by individuals with total free will, who do not hesitate to exercise it.

module.exports = leftpad;
function leftpad (str, len, ch) {
	str = String(str);
	var i = -1;
	if (!ch && ch !== 0) ch = ' ';
	len = len - str.length;
	while(++i < len) {
		str = ch + str;
	}
	return str;
}

On March 22, 2016, Azer Koçulu, a self-taught developer, unpublished all his modules from NPM as a result of an unpleasant encounter with a lawyer. Among the more than 250 packages removed, one tiny package with only 11 lines stood out. Apparently, this left-pad package was indirectly depended upon by some of the most widely used packages, React being one of them. As a result, thousands of developers quickly found their daily workflow disrupted by an error message complaining that this particular package was missing.

The incident ended with NPM restoring the missing package for the greater good of the developer community at large, but it did raise concerns about how brittle the entire ecosystem is.
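To see why an indirect dependency can hurt so many projects, here is a small sketch that walks a dependency graph backwards. The graph is entirely hypothetical: apart from "left-pad", every package name is made up, and this is not how NPM itself resolves anything.

```javascript
// Hypothetical dependency graph: package -> packages it depends on directly.
// All names except "left-pad" are made up for illustration.
const dependsOn = {
  "my-app": ["ui-framework"],
  "ui-framework": ["build-helper"],
  "build-helper": ["left-pad"],
  "other-app": ["date-utils"],
  "date-utils": [],
  "left-pad": [],
};

// Return every package that depends on `target`, directly or indirectly.
function affectedBy(target, graph) {
  const affected = new Set();
  let changed = true;
  // Keep sweeping the graph until no new affected package is found.
  while (changed) {
    changed = false;
    for (const [pkg, deps] of Object.entries(graph)) {
      if (affected.has(pkg)) continue;
      if (deps.some((d) => d === target || affected.has(d))) {
        affected.add(pkg);
        changed = true;
      }
    }
  }
  return [...affected].sort();
}

console.log(affectedBy("left-pad", dependsOn).join(", "));
// build-helper, my-app, ui-framework
```

Note that "my-app" never mentions "left-pad" anywhere in its own dependency list, yet it breaks all the same once the package disappears.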

Does Azer Koçulu have the right to delete his packages? – Yes.

Does such a deletion cause a disruption beyond what was thought possible? – Yes.

And that’s my whole point.

Apparently, Koçulu did not intend to cause this disruption, but the consequences were real, weren’t they?

So, what should we make of this?

“Don’t reinvent the wheel” is a motto so famous that it is almost engraved into the mind of every single engineer on this planet. Yes, why bother doing it again when there’s something up for grabs that’s already built by someone else? Maintaining bare-metal servers in a rack inside a data center hundreds of miles away from where you live is torture when you can spin up an equally capable instance with a simple click. (In 2009, one of my friends had to travel thousands of miles by train to get back the hard disk containing critical data of a website he owned. True story.)

An engineer has many reasons to introduce a dependency into a system: sometimes for convenience (AWS vs bare metal), sometimes because it’s just not practical to roll one’s own (a self-operated CDN is definitely out of reach for most tech organizations), and sometimes unintentionally (left-pad to React).

There’s little ground to argue for 100% in-house software to achieve pure independence. After all, building everything from scratch is a luxury only a few companies can afford.

Counting on market leaders not to make mistakes is equally impractical. Take a look at Cloudflare: the firm published a blog post explaining what happened to Facebook shortly after the 6-hour outage. The article did a great job describing the concept of BGP and how it related to the Facebook outage. However, on 21 June this year, Cloudflare itself experienced an outage due to essentially the same kind of mistake: withdrawing a critical subset of prefixes via BGP.

Knowing what’s right is one thing; not doing what’s wrong is hard.

There are packages we have to reference, dependencies we have to introduce, and infrastructure we have to adopt. Just bear in mind the implications: the foundation we’re building upon may not be as solid as it seems. If the wind turns against us someday, don’t be surprised.

The U.S. cutting Russia out of SWIFT; Russia cutting the EU off from its gas supply; Huawei being banned from using Android in its smartphones: these are all real-world examples of dependencies gone wrong. So, what to make of them? Not much, I guess; just being aware of these facts is already quite educational.

Offbeat Engineer.

The shaky foundation of our time – one year after Facebook’s major outage