You probably gave bad advice. By the time Reddit existed, you could have just gotten a NetApp filer. They had higher availability than most data centers back then, so “the NFS server hung” wouldn’t be anywhere near the top of your “things that cause outages or interfere with engineering” list.
These days, there are plenty of NFS vendors with similar reliability. (Even as far back as NFSv3, the protocol makes it possible for the server to scale out.)
I guess I have to earn your trust too. I was actually intimately familiar with NetApp filers at the time, since that is what we used to drive the NFS mounts for the desktops at the first place I mentioned. They were not as immune as you think, and they were not suitable.
Also, we were a startup, and a NetApp filer was way outside the realm of possibility.
Also, that would be a great solution if you have one datacenter, but as soon as you have more than one, you still have to solve the problem of syncing between the filers.
Also, you generally don't want all of your app servers to update to new code instantly, all at the same time, in case there is a bug. You want to slow-roll the deploy (rough sketch below).
Also, you couldn't get a filer in AWS, once we moved there.
And before we moved to AWS, the rack was too full for a filer; I would have had to get a whole extra rack.
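To make the slow-roll point concrete, here's a rough, hypothetical sketch of what I mean: push the new version one app server at a time instead of having every box pick up new code at once. The hostname list, path, version tag, and service name are all made up.

```sh
#!/bin/sh
# Hypothetical rolling deploy: update one app server at a time, pausing
# between hosts so a bad build only reaches part of the fleet before
# anyone notices.
VERSION="$1"
for host in $(cat app-servers.txt); do
    ssh "$host" "cd /srv/app && git fetch --tags && git checkout $VERSION && sudo systemctl restart app"
    sleep 300   # watch error rates and alerts here before moving on
done
```

With NFS-mounted code, every client sees the new bits the moment they land on the filer, so you lose this kind of control over the rollout.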
FWIW, NetApps were generally pretty solid, and they should have no problem keeping in sync across datacenters. You pay handsomely for the privilege though.
Failover, latency, and so on are things you need to think about independently of which transfer protocol you use. NFS may present its own challenges with all of its different extensions and flags, but that's true of any mature technology.
That said, live code updates probably aren't a very good idea anyway, for exactly the reasons you mention. Those are the reasons you were right at the time, not any inherent deficiencies in the NFS protocol.
100% this. Sometimes it's not even the filer itself. `hard` NFS mounts on clients, in combination with network issues, have led to downtime where I work. Soft mounts can be a solution for read-only workloads that have other means of fault tolerance in front of them, but they're not a panacea.
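For the curious, the difference comes down to client-side mount options. A minimal sketch, assuming a Linux client (the server name and export paths are made up):

```sh
# Hard mount (the default): the client retries indefinitely, so processes
# block in uninterruptible I/O when the server or network goes away.
mount -t nfs -o hard,nfsvers=3 filer.example.com:/export/data /mnt/data

# Soft mount: I/O returns an error after `retrans` retries of `timeo`
# (tenths of a second each), so clients don't hang. Interrupted writes
# can be silently lost, though, which is why it's only reasonable for
# read-only data with other fault tolerance in front of it.
mount -t nfs -o soft,ro,timeo=100,retrans=3,nfsvers=3 filer.example.com:/export/data /mnt/ro-data
```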
I haven’t seen these problems at much larger scales than are being discussed here. I’ve heard of people buying crappy NFS filers or trying to use the Linux NFS server in prod (it doesn’t support HA!), but I’ve also heard of people losing data when they install a key-value store or consensus protocol on < 3 machines.
The only counterexample involved a buggy RHEL-backported NFS client that liked to deadlock, and that couldn’t be upgraded for… reasons.
Client bugs that force a single machine/process restart can happen with any network protocol.
> You probably gave bad advice. By the time Reddit existed, you could have just gotten a NetApp filer. They had higher availability than most data centers back then, so “the NFS server hung” wouldn’t be anywhere near the top of your “things that cause outages or interfere with engineering” list.
Or distributed NFS filers like Isilon or Panasas: any particular node can be rebooted and its IPs are redistributed among the still-live nodes. At my last job we used one for HPC and it stored >11PB with minimal hassle. OS upgrades can be done in a rolling fashion, so client service is not interrupted.
Newer NFS vendors like Vast Data have all-NVMe backends (Isilon can have a mix if you need both fast and archival storage: tiering can happen on, e.g., file age).
NetApps were a game changer. Large Windows Server 2003 file servers that ran CIFS, NFS, and AFP simultaneously could take 60-90 minutes to come back online because of the resource fork enumeration scan required by AFP sharing.