Service degradation (SFO1)

This report details the incident that occurred on December 28, 2017, impacting the availability of the ZEIT APIs, Docker and npm deployments in the sfo1 datacenter.

Our clusters run a critical consensus database that manages IP allocation for networking, scaling, and compute resource management.

The purpose of this database is for all the nodes in the network to have a single consistent view of the state and configuration of the cluster.
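To illustrate why a single consistent view matters, here is a minimal sketch (not our actual implementation; the store, keys, and function names are hypothetical) of how a consensus-backed key/value store with an atomic compare-and-set lets every node agree on IP allocations, so no two deployments can ever claim the same address:

```python
import threading

class ConsensusStoreStub:
    """Toy stand-in for a consensus-backed key/value store.
    In a real cluster, a write succeeds only once a quorum of nodes agrees."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key, expected, new):
        # Atomic check-then-write: the primitive that keeps every node's
        # view of the allocations consistent.
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

def allocate_ip(store, ip, deployment_id):
    """Claim an IP for a deployment only if it is currently unallocated."""
    return store.compare_and_set(f"ip/{ip}", None, deployment_id)

store = ConsensusStoreStub()
print(allocate_ip(store, "10.0.0.7", "dep_a"))  # True: first claim wins
print(allocate_ip(store, "10.0.0.7", "dep_b"))  # False: already allocated
```

The second claim fails because the key no longer holds the expected value, which is exactly the behavior a consistent cluster-wide view guarantees.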

These systems are carefully and rigorously monitored. Because they are so important to the operation of our services, any alert originating from them or their associated services is routinely given the highest priority.

Alerts for a variety of operational measurements (CPU usage, load averages, memory, IOPS, etc.) were configured. Many of them began showing rapid changes for the services mentioned above.
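The kind of threshold check behind such alerts can be sketched as follows (the metric names and limits here are invented for illustration; our real alerting rules are not public):

```python
# Hypothetical thresholds for a single node; real values are not public.
THRESHOLDS = {"cpu_pct": 85.0, "load_avg_1m": 8.0, "mem_pct": 90.0, "iops": 20000}

def triggered_alerts(sample):
    """Return the metrics in `sample` that exceed their configured threshold."""
    return [m for m, limit in THRESHOLDS.items() if sample.get(m, 0) > limit]

print(triggered_alerts({"cpu_pct": 97.2, "load_avg_1m": 14.5, "mem_pct": 71.0}))
# → ['cpu_pct', 'load_avg_1m']
```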

One of our site-reliability engineers on duty noticed and acted on them immediately. Due to the initial concern over potential instability, he proceeded to throttle new deployments.

Over the past few months especially, we have been experiencing very fast growth, which has naturally led the database in question to grow very quickly.

Every time a new deployment is initiated or scaled, it incurs additional memory, storage, and compute demands on these critical consensus systems.

When we observed increased loads in critical components of our systems in the past, our typical solution has been to expand our capacity. In most cases, that would help nearly immediately and we would continue to accept new deployments with no hiccups.

As a precaution, as soon as we detected elevated error rates and a slowdown in time to deployment start, we took our normal set of actions and began to expand capacity.

This action was supported by the observation that, when the incident began, we were operating an all-time-high number of deployments, based on our metrics.

We therefore continued our process to make sure capacity demands were met and the cluster remained stable.

Additionally, we started tuning different garbage collection processes to execute more eagerly. This was intended to produce further relief to the affected nodes.
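As a hedged illustration of what "more eager" garbage collection means (we are not disclosing which runtimes or settings were involved): in Python, for instance, lowering the generation-0 threshold makes the collector run after fewer allocations, trading CPU time for lower peak memory:

```python
import gc

# The default thresholds trigger a generation-0 collection after roughly
# 700 net allocations; the exact tuple depends on the interpreter.
print(gc.get_threshold())

# Hypothetical "more eager" tuning: collect after far fewer allocations.
# This relieves memory pressure at the cost of more frequent GC pauses.
gc.set_threshold(100, 5, 5)
print(gc.get_threshold())  # → (100, 5, 5)
```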

Throughout this entire process, availability was seriously hampered:

  • Most of the proxies were losing networking intermittently, leading to periods of only ephemeral availability
  • New nodes would register and then later unregister, and state across the cluster was frequently getting out of sync

Unfortunately, we realized that our initial measures to reduce scheduling contention had been counter-productive.

They had increased the load on the main databases to the point where quorums were hard to reach, node deaths were frequent and load would continue to pile up.
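The quorum arithmetic explains why node deaths are so damaging here. A majority-quorum consensus group needs more than half of its members reachable to commit writes, so each death shrinks the margin until progress stops entirely. A minimal sketch (generic consensus math, not our specific topology):

```python
def quorum(n):
    """Majority quorum for an n-node consensus group (e.g. Raft or Paxos)."""
    return n // 2 + 1

def can_commit(total_nodes, healthy_nodes):
    """Writes commit only while a majority of members is reachable."""
    return healthy_nodes >= quorum(total_nodes)

# A 5-node group tolerates 2 failures; a 3rd death blocks all writes.
print(quorum(5))         # → 3
print(can_commit(5, 3))  # → True
print(can_commit(5, 2))  # → False
```

Worse, once writes stall, retries from the rest of the system pile additional load onto the surviving members, which is the feedback loop we observed.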

After several failed attempts to add capacity that would give us relief from the added load, we decided to expand the available hardware resources of the databases themselves.

Unfortunately, even though this was a fairly quick process, the cluster had been left in a state where full recovery took a few extra hours to complete. In particular, networking complications persisted until we were able to restart many of the instances that hold our deployments.

Before publishing this report, we wanted to ensure that the measures necessary to systematically prevent such a critical cluster failure in the future were in place.

We are happy to communicate that this is already the case. In addition, we made our APIs resolve automatically to your nearest region, and they are now capable of automatic failover.

Effective immediately, your requests to our APIs will exhibit lower latency. If API access is impacted in one region, it will simply be served by another one (like bru1).
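Conceptually, the routing decision looks like the sketch below (the latency figures, health flags, and function are illustrative only; the real selection happens at our DNS/proxy layer):

```python
def pick_region(latencies, healthy):
    """Route to the lowest-latency healthy region; fail over automatically."""
    candidates = [r for r in latencies if healthy.get(r)]
    if not candidates:
        raise RuntimeError("no healthy regions available")
    return min(candidates, key=latencies.get)

# Hypothetical client-observed latencies (ms) and health during an incident.
latencies = {"sfo1": 12, "bru1": 95}
healthy = {"sfo1": False, "bru1": True}  # sfo1 marked unhealthy

print(pick_region(latencies, healthy))  # → 'bru1'
```

When sfo1 recovers, the same selection naturally returns traffic to the nearer region.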

These are our main takeaways from this incident:

  • While the analysis and recovery processes were ongoing, we didn't fail deployments over to our other clusters. This will be our main course of action moving forward, and we have already adapted our capacity plans to make sure another cluster can take 100% of the load of another one while recovery occurs.
  • Customers in the Brussels datacenter were not impacted, but this did not amount to a valid manual failover strategy for many of our customers, for two reasons:
    • The bru1 dc is still not in General Availability.
    • API calls were also impacted, which is why we quickly moved to make API access multi-regional, as shared above, in preparation for the future.
    Consequently, opening Brussels to General Availability remains our main priority.
  • Based on customer feedback, our policy moving forward will be to provide detailed reports on incidents of significant impact. We have implemented the functionality to link to post-mortems, such as this one, on our status page.