How to deal with application downtime

Meta just suffered two major outages in a week: What can organisations do to ensure their customers aren’t confronted with service outages?

Image: Two developers collaborate as they review code on a laptop display. Credit: Elle Aon / Shutterstock

Instagram, the social media platform owned by Meta, suffered a widespread outage for around four hours on Monday, affecting around 1 billion users worldwide. Six days earlier, the Meta-owned messaging application WhatsApp also suffered downtime, leaving users in North America, Europe, Asia and Australia unable to send or receive messages.

According to research from UK-based price comparison website U-Switch, when it comes to reliability, Facebook’s mobile app is ranked the worst, having 15 reported app issues for every million monthly downloads—two-thirds more than WhatsApp, which had 9 issues per million downloads. In the UK over the past year, there have been 247,020 monthly search queries relating to the Facebook app suffering an outage.

The other applications that make up the top 10 are YouTube, Twitter, McDonald's, Tinder, Uber, Discord, and Amazon. Facebook, WhatsApp and Instagram were ranked the first, ninth and tenth most unreliable apps, respectively.

While it’s difficult for someone outside Meta to say for sure whether there was a common cause behind the two recent incidents, Josh Clay, solutions engineering director for UK and Ireland at application monitoring company Dynatrace, said the outages look symptomatic of a wider trend: organisations are coming under pressure to deliver innovation faster, often without an adequate way to reduce the risk that entails. As a result, Clay said, many companies are running into these issues more often.

“The increase in pressure and reduced timelines to deliver software updates ultimately leads to bad code reaching production, where it impacts service availability and performance for users,” Clay said, adding that the risk is heightened further if organisations don’t have the same level of observability in pre-production environments as they do in production.

Furthermore, Clay said there is usually no simple way to roll back a code change if it causes an outage.

“This challenge is exacerbated by the complexity of today’s multicloud environments, as applications are made up of millions of lines of code, running across a multitude of platforms, both in the cloud and on-premises,” Clay said.

How can organisations prevent downtime?

When it comes to dealing with downtime, companies have a number of different options for mitigating the damage caused by outages. The easiest method consists of switching from one site or server to another or using a backup server to get services up and running again. However, this approach often causes services to be unavailable for a short period of time.
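As a rough illustration of the failover approach described above, the Python sketch below picks the first server that passes a health check. The function name `pick_endpoint`, the server labels and the `is_healthy` callback are all hypothetical; in practice this decision would more likely sit in a load balancer or DNS layer than in application code.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None.

    `endpoints` is ordered by preference: primary first, backups after.
    `is_healthy` is a hypothetical probe supplied by the caller.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as an unhealthy endpoint.
            continue
    return None


# Example: the primary is down, so traffic moves to the backup.
servers = ["primary", "backup"]
chosen = pick_endpoint(servers, lambda name: name == "backup")
```

The short period of unavailability mentioned above corresponds to the time it takes for the health check to notice the failure and for traffic to move to the backup.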

The second option is to take a more intelligent approach to software development and delivery, building a service that takes failure into account from the outset.

“First, organisations should establish automated quality gates in their pipeline, to measure new code and products against Service Level Objectives (SLOs) and assess it against key performance indicators such as response time or throughput,” Clay said. “This means any new code or configuration changes cannot go live unless they meet the minimum baseline for user-experience, which prevents any unexpected outages.”
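A quality gate of the kind Clay describes could be sketched in Python roughly as follows. This is a minimal sketch, not Dynatrace's implementation: the `Slo` class, the metric names and the threshold values are assumptions made for illustration, with response time and throughput standing in for the indicators mentioned in the quote.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO thresholds that a build must meet before release."""
    max_p95_response_ms: float
    min_throughput_rps: float


def gate_violations(metrics: Dict[str, float], slo: Slo) -> List[str]:
    """Compare measured metrics with the SLO and list every breach."""
    violations: List[str] = []
    if metrics["p95_response_ms"] > slo.max_p95_response_ms:
        violations.append("response time above SLO")
    if metrics["throughput_rps"] < slo.min_throughput_rps:
        violations.append("throughput below SLO")
    return violations


def may_go_live(metrics: Dict[str, float], slo: Slo) -> bool:
    """The gate passes only when no SLO is breached."""
    return not gate_violations(metrics, slo)


# Example with invented numbers: this build is too slow and is blocked.
slo = Slo(max_p95_response_ms=300.0, min_throughput_rps=500.0)
candidate = {"p95_response_ms": 450.0, "throughput_rps": 800.0}
blocked_reasons = gate_violations(candidate, slo)
```

In a delivery pipeline, a step like this would run after load tests in a pre-production environment and fail the build whenever `may_go_live` returns `False`.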

Additionally, if something does go wrong, Clay said that organisations can improve their time to resolution by ensuring they have established end-to-end observability across their technology stack, providing DevOps teams with code-level insights into all software builds, apps, and services, whether they’re in development or already deployed.

“Combining this level of observability with AIOps can take those insights to another level, by automatically prioritising issues according to their business impact. This enables DevOps teams to quickly identify the most pressing alerts and quickly resolve them before an incident occurs, to prevent users from ever experiencing a problem,” Clay said.
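The prioritisation step alone, ranking alerts by business impact, can be illustrated with a few lines of Python. This is a deliberately simple sketch and not an AIOps system: the `Alert` fields and the tenfold weighting for revenue-critical services are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    name: str
    affected_users: int     # hypothetical: users hitting the failing service
    revenue_critical: bool  # hypothetical: is the service on a revenue path?


def impact_score(alert: Alert) -> int:
    """Assumed scoring: revenue-critical alerts count ten times as much."""
    weight = 10 if alert.revenue_critical else 1
    return alert.affected_users * weight


def prioritise(alerts: List[Alert]) -> List[Alert]:
    """Return alerts ordered from highest to lowest business impact."""
    return sorted(alerts, key=impact_score, reverse=True)


# Example with invented alerts: fewer users are affected by the checkout
# errors, but the revenue weighting moves that alert to the top.
alerts = [
    Alert("slow-avatar-cache", affected_users=5000, revenue_critical=False),
    Alert("checkout-errors", affected_users=1200, revenue_critical=True),
]
ranked = prioritise(alerts)
```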

The impact of enterprise application downtime

Although the use cases are very different, the challenges of preventing downtime across consumer-facing and enterprise applications are largely the same: any problem in the digital service delivery chain can have significant repercussions for the end user, resulting in application slowdowns or even a full-scale outage.

According to U-Switch’s data, Zoom was found to be the most reliable app, having just 3 reported issues per million monthly downloads on average, the lowest of all apps analysed. However, when the platform did suffer a major outage early in the COVID-19 pandemic, the consequences for those who relied on it for work and education were widespread.

While consumers rarely interact directly with the enterprise applications that support platforms like Instagram and WhatsApp, those applications are crucial to delivering services to an organisation’s customers, meaning the stakes of an enterprise application outage are no less significant.

Copyright © 2022 IDG Communications, Inc.