Patching around a C++ crash with a little bit of Lua (rachelbythebay.com)
46 points by signa11 on Dec 10, 2023 | 18 comments


One cannot overstate the value of being able to make changes to prod quickly. Your entire world changes when you don't have a 30-minute "code to release" cycle (or god forbid a 6-8 hour one).

At the very least, extremely liberal use of feature flags is table stakes for helping ops teams stay a bit sane.


My experience with "extremely liberal use of feature flags" is that most combinations of flags are never tested, because there are exponentially many combinations to test (20 independent boolean flags already yield 2^20, over a million, configurations). That's true even for a single version of a binary, and you might be shipping multiple new versions per week.

Restrained use of feature flags is fine, because if you only have 2-3 flags, you have a decent chance of testing a useful subset of possible combinations. I've seen projects with 20+ flags even after instituting a policy that flags can only be in the code for X months. They were a nightmare in production. It was more common to roll back to "previous known good version + flags" than to roll back flags alone, which notably defeats the point of feature flags because they're coupled to the version anyway.

Even "push on green" practices are actively poisoned by feature flags because green isn't representative of prod. Even if every individual feature is thoroughly tested, that's usually in isolation, not in combination with other features. The most accurate testing would be the set of flags that's meant to be used in production, but that brings us back to the same problem; as soon as even one flag changes, the testing was no longer representative of prod, and if nobody ever changes any flags, then the feature flags were at best useless.

I'll take feature flags over diverging branches, but I'd still much rather take true converged development over rampant feature flags.


My answer to most of this is that you want to limit how much feature flags actually cover. My ideal: feature flags cover just the very last mile of a feature. So a flag would just gate displaying a feature to people, for example. The "testing gap" being limited to visibility helps a lot.

Of course in the original article there's not much to be said, and I would have probably shipped the "handle the new header" functionality on the server as-is. Feature flags wouldn't really help! But there are so many things where feature flags have been helpful.

The main point is that you want feature flags to help control usage, but new features that use them should avoid checking the flag inside the business logic, because that's where things get brittle.
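
A rough sketch of the shape I mean, with hypothetical names (flag_enabled and render_fancy_widget are made up here, and a real flag check would hit a runtime config service, not an environment variable):

    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Hypothetical flag lookup -- stands in for a real config service.
    bool flag_enabled(const char* name) {
        const char* v = std::getenv(name);
        return v != nullptr && std::string(v) == "1";
    }

    // The feature itself is always built and exercised; tests can cover it
    // regardless of the flag's state.
    std::string render_fancy_widget() { return "fancy widget"; }

    int main() {
        std::string widget = render_fancy_widget();  // business logic: unconditional
        if (flag_enabled("SHOW_FANCY_WIDGET")) {     // flag gates only the last mile
            std::cout << widget << '\n';
        }
    }

The business logic runs the same way whether the flag is on or off; only the visibility changes, so the untested surface area is tiny.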


The best way to use feature flags is when adding or removing a feature.

You add and test the feature flag before you deploy to prod, so you can disable the feature quickly if shit hits the fan.

Same for removal: disable first. Make sure it's all OK, then remove the feature later.


I was expecting a way nastier hack where we can't zap the header so we turn e.g. "I-Can-Do-Fancy-Compression" (which triggers the bug) into "X-Can-Do-Fancy-Compression" which is the same length but doesn't match and so the bug isn't triggered.

Or even, since the bug is in some C++, we can try "\0-Can-Do-Fancy-Compression" and maybe the C++ code thinks this header is empty because apparently it's still the 1970s where Bjarne Stroustrup lives.


To be clear, the C++ `std::string` class can contain NUL bytes just fine. It's the C standard library that can't handle them in strings.
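
A minimal illustration of the difference, reusing the made-up header name from upthread:

    #include <cstring>
    #include <iostream>
    #include <string>

    int main() {
        // 26 bytes, same length as "I-Can-Do-Fancy-Compression",
        // but the first byte is NUL.
        std::string header("\0-Can-Do-Fancy-Compression", 26);

        // A C-style consumer stops at the first NUL: it sees nothing.
        std::cout << std::strlen(header.c_str()) << '\n';  // prints 0

        // std::string tracks its length explicitly, so the bytes are all there.
        std::cout << header.size() << '\n';                // prints 26
    }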


Sure, but alas std::string is just a library convenience type, similar to Java's StringBuilder. Until fairly recently, C++ didn't provide the simple fat pointer now named std::string_view, so for a string we're merely reading, C's char* often ends up being used instead in practice.

Would that guess definitely be correct? No. Is it a reasonable guess for code which is apparently untested and buggy? I think so.


Before string_view, in my experience, people tended to pass around strings by const&


If I've got a std::string I can totally see passing it as a const reference. But if I've got 418 bytes of RAM that I'm slicing up into "strings", how can I make the const std::string& I need from that? I could make a std::string_view easily enough, but std::string wants to own the memory allocation, and it can't have that because it's mine. So now I'm either allocating and copying (bye bye performance) or passing a char*.

Because C++ loves implicit conversion, this works even when the API didn't intend to accept char*: it just invokes the std::string constructor to make one from the pointer. "Hooray."
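
To make that concrete (the API and buffer here are made up):

    #include <cstddef>
    #include <iostream>
    #include <string>

    // An API written for std::string, taken by const reference.
    std::size_t count_dashes(const std::string& s) {
        std::size_t n = 0;
        for (char c : s) { if (c == '-') ++n; }
        return n;
    }

    int main() {
        // Our own buffer; we don't want std::string owning it.
        const char* raw = "I-Can-Do-Fancy-Compression";

        // Compiles without complaint: an implicit temporary std::string
        // is constructed here, allocating and copying just for the call.
        std::cout << count_dashes(raw) << '\n';  // prints 4
    }

The cost is a hidden heap allocation and copy on every call, which is exactly the kind of thing that pushes people back to raw char*.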


In trying to pin this down just for the sake of curiosity, the closest I can imagine is a feature phone passing some sort of `Accept-Encoding: nih`... and that's unfortunately morbid-insanity-sniping me with even more questions.

I'm trying to figure out what profound excuse the client has for not using gzip et al; "feature phone land" definitely provides no shortage of possibilities (hence the morbid "oh no I must know").

That's a hallucinatory rabbit hole I could get stuck staring down all day because it's too fun (welp hehe) and there's surely a simple explanation I'm missing :)


10 years or so ago I’d suspect it may have been snappy[0] on Android vs default gzip.

[0] https://en.m.wikipedia.org/wiki/Snappy_(compression)


How about brotli?


This is an underrated capability (patching up requests) that devops folks often take advantage of in platforms like HAProxy, nginx, etc.

I once worked at a company where someone pushed a release that added timestamp-based cache busters to every js/css file. We used a similar approach to just drop that query parameter at the edge, which let us get cache hits again and stop serving every single asset from object storage.
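
The actual rewrite lived in the proxy config, but the gist of the transformation was roughly this (the path and parameter name are made up):

    #include <iostream>
    #include <string>

    // Drop the query string so every request for the same asset
    // produces the same cache key again.
    std::string strip_query(const std::string& path) {
        std::string::size_type q = path.find('?');
        return q == std::string::npos ? path : path.substr(0, q);
    }

    int main() {
        // Hypothetical cache-busted URL of the kind that release generated.
        std::cout << strip_query("/static/app.js?ts=1702204800") << '\n';
        // prints /static/app.js
    }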


The article doesn't mention it, but I very much believe the proxy was Nginx with a Lua plug-in. It works very well for this kind of usage.


Yeah, for sure nginx with OpenResty. :)

So less of a hack, and more of a "hey, that's the way it works" - or - Clarke's third law is highly dependent on your POV and aptitude.


We're doing something similar with YARP to work around a client that issues requests with accept-encoding: identity,* but crashes when it gets a gzipped response.
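
The rewrite amounts to something like this, shown here as plain logic rather than the actual YARP request transform (the helper name is made up):

    #include <iostream>
    #include <string>

    // Rewrite the header so the server never chooses gzip for this client.
    // Hypothetical helper; in YARP this lives in a request transform.
    std::string fix_accept_encoding(const std::string& value) {
        if (value == "identity,*") return "identity";  // the client lies about "*"
        return value;
    }

    int main() {
        std::cout << fix_accept_encoding("identity,*") << '\n';  // prints identity
    }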

The downside to quick workarounds like this is that they almost always end up becoming land mines. No one who looks at that config later will know why it's that way, and they'll be tempted to take it out.


So ... the new compression code was never tested before it got deployed to production?


I would guess the clients were tested against a staging server with a newer, working version of the compression code, and against a server without compression. Maybe nobody even realized the production servers had compression, and the client was being rolled out in anticipation.



