Incident Post Mortem: May 19, 2021
By Bryant Khau and Leonardo Zizzamia
2. We saw timeouts and increased latency on our GraphQL service, which aggregates data from underlying services. The timeouts occurred because the GraphQL service autoscaled too slowly to keep up with the traffic. The autoscaling eventually caught up and the errors subsided, restoring functionality for the mobile app and for logged-in users.
3. We saw that the database that powers the Coinbase Pro exchange had high latency and CPU load. Additionally, the API servers that run our market data feed were under high CPU load. We increased the operation throughput configured on the database and also provisioned more API servers.
4. In our Non-US card payment processing service, the number of failed payments increased as the queue to process the payments became backlogged. We increased the number of queue workers and card payments started succeeding.
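The card payment backlog in the last item follows simple queue arithmetic: when payments arrive faster than the workers can process them, the queue grows every interval, and adding workers lets it drain. Here is a minimal Python sketch of that behavior. The function name, rates, and worker counts are hypothetical and are not taken from the incident.

```python
def simulate_backlog(arrivals_per_tick, workers, per_worker_rate, ticks, initial_depth=0):
    """Return the queue depth after each tick for a fixed-size worker pool."""
    capacity = workers * per_worker_rate  # jobs the pool can finish per tick
    depth = initial_depth
    history = []
    for _ in range(ticks):
        # New jobs arrive, the pool removes up to `capacity`, depth never goes negative.
        depth = max(0, depth + arrivals_per_tick - capacity)
        history.append(depth)
    return history


# Hypothetical numbers: 120 payments arrive per tick, each worker handles 10 per tick.
under_provisioned = simulate_backlog(120, workers=10, per_worker_rate=10, ticks=5)
after_scaling = simulate_backlog(120, workers=15, per_worker_rate=10, ticks=5,
                                 initial_depth=100)

print(under_provisioned)  # [20, 40, 60, 80, 100] -> backlog keeps growing
print(after_scaling)      # [70, 40, 10, 0, 0]    -> backlog drains
```

With 10 workers the pool falls 20 payments further behind every tick. With 15 workers it clears the existing backlog within a few ticks.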
Improvements
At Coinbase, we’ve committed significant resources to improving our reliability, including regular load tests to prepare us for periods of high traffic. However, this incident has identified some blind spots to address, especially around very sudden spikes of traffic.
A common theme across several of the failures in this incident was autoscaling rules that weren’t tuned to the kind of traffic spikes that crypto markets can cause. We’re working on tailoring our load tests to better simulate real-world situations, such as sudden traffic spikes. This will help surface issues like untuned autoscaling rules during controlled testing.
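To illustrate why the shape of the load test matters, here is a minimal Python sketch of a reactive autoscaler that can add only a limited number of instances per interval. It is a toy model under stated assumptions, not a description of Coinbase's autoscaling configuration. The function, parameters, and traffic numbers are all hypothetical.

```python
import math


def run_load_test(traffic, start_instances, per_instance_capacity,
                  target_load_per_instance, max_step):
    """Replay a traffic profile and return the requests dropped at each tick."""
    instances = start_instances
    dropped = []
    for demand in traffic:
        capacity = instances * per_instance_capacity
        dropped.append(max(0, demand - capacity))
        # Reactive scaling: the decision is based on the demand just observed
        # and only takes effect on the next tick, at most `max_step` at a time.
        desired = math.ceil(demand / target_load_per_instance)
        if desired > instances:
            instances = min(desired, instances + max_step)
    return dropped


# Two hypothetical profiles that reach the same peak of 400 requests per tick.
gradual_ramp = [100, 150, 200, 250, 300, 350, 400]
sudden_spike = [100, 400, 400, 400, 400, 400, 400]

settings = dict(start_instances=2, per_instance_capacity=100,
                target_load_per_instance=50, max_step=1)

ramp_dropped = run_load_test(gradual_ramp, **settings)
spike_dropped = run_load_test(sudden_spike, **settings)

print(ramp_dropped)   # [0, 0, 0, 0, 0, 0, 0]
print(spike_dropped)  # [0, 200, 100, 0, 0, 0, 0]
```

In this model the same scaling rule passes the gradual ramp without dropping a request but drops 300 requests during the sudden spike. A load test that only ramps up slowly would not expose that gap.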
Another improvement that we are investing in is the implementation of kill switches for parts of the client application so that when failures happen, we can keep unaffected parts of our applications working while we work to address the failures.
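A kill switch of this kind could look roughly like the following Python sketch. It is an illustrative assumption about the pattern, not the implementation used in the Coinbase clients. The `KillSwitches` class, the `render_home` function, and the section names are hypothetical.

```python
class KillSwitches:
    """Tracks which features are currently switched off."""

    def __init__(self, disabled=None):
        self._disabled = set(disabled or [])

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled


def render_home(switches, fetchers):
    """Build each section independently, skipping any that are switched off."""
    sections = {}
    for name, fetch in fetchers.items():
        if not switches.is_enabled(name):
            # The failing backend is never called; the rest of the screen still renders.
            sections[name] = "temporarily unavailable"
            continue
        sections[name] = fetch()
    return sections


def failing_card_payments():
    raise RuntimeError("card payment service is backlogged")


switches = KillSwitches()
switches.disable("card_payments")  # flipped by an operator during the incident

home = render_home(switches, {
    "prices": lambda: "price panel",
    "card_payments": failing_card_payments,
})
print(home)
# {'prices': 'price panel', 'card_payments': 'temporarily unavailable'}
```

In a real client the switch state would typically come from a remotely controlled configuration, so that operators can flip it without shipping a new release.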
We take the uptime and performance of our infrastructure very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency. If you’re interested in solving scaling challenges like those presented here, come work with us.
Incident Post Mortem: May 19, 2021 was originally published in The Coinbase Blog on Medium.
Author: Coinbase