They should continue to retry but with exponential backoff and jitter. Not in a ...
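For anyone who hasn't written one, here is a minimal sketch of retrying with capped exponential backoff and "full jitter" (each delay drawn uniformly between zero and the current ceiling). The names `backoff_delays` and `call_with_retries` and the default `base`, `cap`, and `attempts` values are illustrative assumptions, not taken from any particular SDK.

```python
import random
import time


def backoff_delays(base=0.1, cap=30.0, attempts=6, rng=random.random):
    """Yield one jittered delay per attempt.

    The ceiling doubles each attempt (base * 2**n) but never exceeds cap;
    the actual delay is a random fraction of that ceiling ("full jitter").
    """
    for n in range(attempts):
        ceiling = min(cap, base * 2 ** n)
        yield rng() * ceiling


def call_with_retries(op, sleep=time.sleep, retry_on=(Exception,), **kwargs):
    """Call op() until it succeeds or the attempts run out.

    kwargs are passed through to backoff_delays. The last exception is
    re-raised if every attempt fails.
    """
    last_error = None
    for delay in backoff_delays(**kwargs):
        try:
            return op()
        except retry_on as exc:
            last_error = exc
            sleep(delay)  # wait a randomized, growing interval before retrying
    raise last_error
```

The jitter is what keeps many clients from retrying in lockstep; without it, backoff alone just moves the synchronized spike later.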

bcrosby95 · 2025-10-20T16:30:27 1760977827

If the reliability of your system depends upon the competence of your customers, then it isn't very reliable.

otterley · 2025-10-20T16:45:15 1760978715

Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?

There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.
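To make "shed load to protect goodput" concrete, here is a rough sketch of one common approach: cap the number of in-flight requests and reject the excess immediately instead of queueing it. The `LoadShedder` class, the `handle` function, and the `Retry-After` value are hypothetical names and numbers for illustration; real services typically use more elaborate signals (queue latency, CPU, per-tenant fairness).

```python
import threading


class LoadShedder:
    """Admit at most max_in_flight concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Returns False right away when full, so rejected work costs
        # almost nothing and admitted work keeps its latency.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if admitted; otherwise answer 503 with a retry hint."""
    if not shedder.try_acquire():
        return 503, {"Retry-After": "1"}, b"overloaded"
    try:
        return 200, {}, work()
    finally:
        shedder.release()
```

Some callers are denied, but the requests that are admitted actually complete, which is the point of protecting goodput.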

See, e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...

ifwinterco · 2025-10-20T16:55:23 1760979323

Probably stupid question (I am not a network/infra engineer) - can you not simply rate limit requests (by IP or some other method)?

Yes, your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff; they should just start getting 429s?

otterley · 2025-10-20T17:09:05 1760980145

Load shedding effectively does that. 503 is the correct status code here to indicate temporary failure; 429 means you've exhausted a quota.
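A toy sketch of that distinction, assuming a fixed accounting window, a per-client quota, and a global capacity (the `Admission` class and its parameters are made up for illustration, and checking global capacity first is my own choice, not something any spec requires):

```python
from collections import defaultdict
from http import HTTPStatus


class Admission:
    """Decide per request: 200, 429 (client over quota), or 503 (overload)."""

    def __init__(self, per_client_quota, global_capacity):
        self.per_client_quota = per_client_quota
        self.global_capacity = global_capacity
        self.counts = defaultdict(int)  # requests admitted per client this window
        self.total = 0                  # requests admitted overall this window

    def admit(self, client_id):
        # The service as a whole is out of capacity: not this client's fault.
        if self.total >= self.global_capacity:
            return HTTPStatus.SERVICE_UNAVAILABLE  # 503
        # This particular client has used up its own allowance.
        if self.counts[client_id] >= self.per_client_quota:
            return HTTPStatus.TOO_MANY_REQUESTS  # 429
        self.counts[client_id] += 1
        self.total += 1
        return HTTPStatus.OK

    def reset_window(self):
        """Start a new accounting window (call on a timer in real use)."""
        self.counts.clear()
        self.total = 0
```

With `Admission(per_client_quota=2, global_capacity=3)`, a third request from the same client gets 429, while a well-behaved new client arriving after capacity is used up gets 503.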