Some time ago I mentioned here, in a half-joking way, the self-fixing software I work with. I said Patroni #Postgres has the best regeneration ability I've ever seen. So far, "the best ability" includes:
> After a network migration the servers changed IP addresses. That broke the etcd config, so I had to delete it completely and initialize the etcd cluster again, which also meant cleaning up and recreating the Patroni config, because Patroni depends heavily on etcd. Even while the configuration temporarily didn't exist, the connection to the WAL archive (technically another, separate server) wasn't interrupted (I'm not even sure real data transfer could have happened at that time). That was apparently enough to start a new #database cluster from the last timeline. I don't know WHAT made the servers immediately pull that data on a fresh start. At migration time there was no real production data yet, so I didn't even deliberately try to restore anything.
> Not long after (now with real production data), some test scripts producing lots of database changes in a relatively short time, beyond the old server's capacity, killed the master server. Patroni switched over as intended and I could work on increasing the server's capacity (I had to do it live, which wasn't very convenient). The first server eventually decided the data corruption was too severe and, to fix it, automatically deleted the whole /var/lib/postgresql/* directory and started recreating everything from scratch, using data from the new master server (and doing it at a speed of at least 2 GB/s, because why not?).
> During that process the impatient tester hit it again with their unoptimised scripts, finally killing the whole cluster. Silently swearing, I scaled up the remaining servers, since that was the only thing I really could do. The Postgres API was mostly unresponsive and had only limited info about the last state before the final failure. It wasn't possible to force any change or affect it in any way.
The first server decided to delete the whole directory again and recreate it (at least this time I caught the exact moment in the logs); at the same time the second server did a rewind to the state of the third server (why??). All of this happened automatically, without my help. I wouldn't even have known what to do.
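Side note for anyone who wants to at least watch this kind of recovery from the outside: Patroni's REST API has a /cluster endpoint that lists every member with its role and state. Below is a minimal polling sketch, assuming the default REST API port 8008 and some made-up node addresses (not my real setup); it only observes, it doesn't change anything.

```python
# Minimal sketch: poll Patroni's REST API and print each member's view.
# Node addresses, the port and the poll interval are assumptions/examples.
import json
import time
import urllib.request

PATRONI_NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical addresses
PORT = 8008  # Patroni's default REST API port


def cluster_view():
    """Ask each node for its view of the cluster; return the first answer."""
    for host in PATRONI_NODES:
        try:
            url = f"http://{host}:{PORT}/cluster"
            with urllib.request.urlopen(url, timeout=3) as resp:
                return json.load(resp)
        except OSError:
            continue  # that node (or its API) is down, try the next one
    return None


while True:
    view = cluster_view()
    if view is None:
        print("no Patroni API reachable")
    else:
        for m in view.get("members", []):
            # typical fields: name, role (leader/replica), state, timeline, lag
            print(f'{m["name"]}: role={m["role"]} state={m["state"]} '
                  f'tl={m.get("timeline")} lag={m.get("lag", "?")}')
    print("---")
    time.sleep(10)
```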
And this is only the beginning of using it in production. Now I'm waiting for stubborn users to run some more unintended durability tests... Maybe I'll find out it's even more invincible.