Making retries safe with idempotent APIs

The Amazon Builders’ Library is a great set of deep-dive papers on the challenges of building modern systems. This post highlights some of the challenges that the retry pattern presents.

The paper, “Making retries safe with idempotent APIs”, follows up on yesterday’s thread about the “Timeouts, retries, and backoff with jitter” paper.

This one takes a much deeper dive into the challenges that a simple retry poses to an API. It’s all about balancing the customer experience with the systems’ stability & performance.

I call out a few more details in the Twitter thread below…

Tweet 1/9

diving into the Amazon Builder's Library again today. this time with, "Making retries safe with idempotent APIs", by @mfeatonby

📑: https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/

this is a level 300 paper, digging a bit deeper than yesterday's level 200

🧵☁️ #cloud #devops @awscloud

@marknca tweeted at 09-Nov-2021, 13:39

Tweet 2/9

this thread is available unrolled at https://t.co/nEPvsF8Awt

yesterday's thread is up at https://markn.ca/2021/timeouts-retries-and-backoff-with-jitter/

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39

Tweet 3/9

idempotent is one of my all time favourite words, especially in tech.

if you're unfamiliar, in this context it means that you can run operations more than once and the results won't change

more at https://en.wikipedia.org/wiki/Idempotence

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39
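
To make the definition concrete, here is a minimal sketch of the difference. The two functions and the `order` record are made up for illustration and do not come from the paper: setting a status twice gives the same result as setting it once, while appending an event twice does not.

```python
def set_status(record: dict, status: str) -> dict:
    """Idempotent: applying it again leaves the record unchanged."""
    return {**record, "status": status}


def append_event(record: dict, event: str) -> dict:
    """Not idempotent: every call adds another entry."""
    return {**record, "events": record.get("events", []) + [event]}


order = {"id": 1, "events": []}

paid_once = set_status(order, "paid")
paid_twice = set_status(paid_once, "paid")
print(paid_once == paid_twice)  # the repeat changed nothing

logged_once = append_event(order, "charge")
logged_twice = append_event(logged_once, "charge")
print(logged_once == logged_twice)  # the repeat created a duplicate
```

If a retry repeats `append_event`, the customer ends up with two charges, which is exactly the kind of duplicate the paper is trying to avoid.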

Tweet 4/9

for this paper, the author explores the concept of idempotency (see, awesome word) within the "retry" pattern

basically, how can the backend service make sure that a retry doesn't end up being a duplicate or something worse

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39

Tweet 5/9

excellent quote to build by, "We’ve found that in many cases the simplest solution is the best solution", @mfeatonby, @awscloud

followed by, "a surprisingly large number of transient or random faults can be overcome by simply retrying the call"

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39
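
As a rough illustration of "simply retrying the call", the sketch below wraps an operation in a small retry loop with jittered backoff (the topic of yesterday's paper). `TransientError`, `call_with_retries`, and `flaky_operation` are hypothetical names chosen for this example, and the attempt count and delays are arbitrary assumptions rather than recommendations from the paper.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call operation(), retrying on TransientError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the fault to the caller
            # Back off with full jitter before the next try.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


flaky_calls = {"count": 0}


def flaky_operation():
    """Fails twice with a transient fault, then succeeds."""
    flaky_calls["count"] += 1
    if flaky_calls["count"] < 3:
        raise TransientError("temporary glitch")
    return "ok"


# The sleep is stubbed out here so the example runs instantly.
retry_result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
print(retry_result, flaky_calls["count"])
```

A loop like this is only safe when the operation being retried is idempotent. Otherwise every retry risks repeating a side effect that already happened.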

Tweet 6/9

the 📑 walks through some of the potential downsides of the retry pattern

it then moves on to a topic that isn't discussed enough: reducing complexity

the author discusses API design & how @awscloud uses an identifier handled by the SDKs to manage retries

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39
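
The identifier idea can be sketched in a few lines. This is an assumption-heavy toy, not how AWS implements it: `TokenAwareService` and its in-memory dictionaries are invented for illustration. The caller generates one unique token per logical request and reuses it on every retry, and the service uses that token to recognise a repeat.

```python
import uuid


class TokenAwareService:
    """Toy service that deduplicates create calls by a caller-supplied token."""

    def __init__(self):
        self._instances = {}  # instance id -> attributes
        self._by_token = {}   # client token -> instance id already created

    def create_instance(self, client_token: str, instance_type: str) -> str:
        # A token we have seen before means this is a retry: return the
        # original result instead of creating a second resource.
        if client_token in self._by_token:
            return self._by_token[client_token]
        instance_id = f"i-{len(self._instances) + 1:04d}"
        self._instances[instance_id] = {"type": instance_type}
        self._by_token[client_token] = instance_id
        return instance_id

    def instance_count(self) -> int:
        return len(self._instances)


token_service = TokenAwareService()
token = str(uuid.uuid4())  # generated once, reused for every retry

first_id = token_service.create_instance(token, "small")
retry_id = token_service.create_instance(token, "small")  # e.g. after a timeout
print(first_id == retry_id, token_service.instance_count())
```

Having the SDK generate and resend the token means the caller does not have to think about it, which is part of the complexity reduction the tweet refers to.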

Tweet 7/9

this approach avoids lots of problems on the service side, but issues remain. that brings us to the various strategies that can be used to implement a retry pattern

📑 uses @awscloud EC2 as an example & this really helps drive some of these key points home

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39

Tweet 8/9

one fascinating edge case is that of late arriving requests. in any distributed system (especially one over the internet) this is a distinct possibility

📑 explores these challenges & explains how @awscloud looks at making reasonable trade-offs to handle them

🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39
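
One way to picture the late-arrival problem: a retry shows up after the resource it created has already changed state. The sketch below shows one possible trade-off, which is an assumption for illustration and not a description of what AWS does. `LateTolerantService` is a hypothetical class that keeps the token record around, so a late duplicate gets the original resource back in its current state instead of triggering a second launch.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    state: str = "running"


class LateTolerantService:
    """Toy service that remembers tokens even after the resource changes."""

    def __init__(self):
        self._instances = {}  # instance id -> Instance
        self._tokens = {}     # client token -> instance id

    def run_instance(self, client_token: str) -> Instance:
        known_id = self._tokens.get(client_token)
        if known_id is not None:
            # Late or repeated request: report on the original resource,
            # whatever state it is in now, rather than launching again.
            return self._instances[known_id]
        instance = Instance(instance_id=f"i-{len(self._instances) + 1:04d}")
        self._instances[instance.instance_id] = instance
        self._tokens[client_token] = instance.instance_id
        return instance

    def terminate(self, instance_id: str) -> None:
        self._instances[instance_id].state = "terminated"

    def instance_count(self) -> int:
        return len(self._instances)


late_service = LateTolerantService()
launched = late_service.run_instance("token-123")
late_service.terminate(launched.instance_id)

# The duplicate of the original request finally arrives.
late_reply = late_service.run_instance("token-123")
print(late_reply.instance_id == launched.instance_id, late_reply.state)
```

The cost of this choice is that token records have to be stored for some period of time, and how long to keep them is one of the trade-offs a real service has to settle.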

Tweet 9/9

overall, this is a fantastic paper. it dives deep into an area that most assume is simple. at scale, nothing is

however, these patterns & tips can help you apply the same approach in your services to deliver a better customer experience

worth the 🕙 to read

/🧵☁️ #cloud #devops

@marknca tweeted at 09-Nov-2021, 13:39
