Downtime - Apologies and what went wrong

Hi All,

As some of you may have realised, the planned upgrade sort of crashed everything, and we had our longest period of downtime since the site began.

This is partly because I had to go to sleep (thanks to a newborn and a job).

The good news is that the backup process worked! We've restored to seconds before the upgrade took the site offline.

The bad news is that federation is likely to be.. wonky.. for a little while. The site may also go up and down while I undo some of the fixes I tried.

Ultimately the issue came down to two things. First, the upgrade failed (I am not sure why; I'll be digging into this now that the priority is no longer getting the site up). Second, the containers then stopped talking to each other, so the UI wouldn't talk to lemmy, and lemmy wouldn't talk to the database.

I rebuilt the containers, restored the backup, restarted everything, and it's all come back up (admittedly not perfect right now).
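
For anyone wondering what "containers not talking to each other" means in practice, here is a minimal sketch of walking the chain (UI to lemmy, lemmy to database) and reporting the first hop that can't be reached. It is an illustration only, not the check that was actually run during the outage. The hostnames, ports, and function names are all assumptions, and a TCP connect only shows that a port is open from wherever the script runs, not that the service behind it is healthy.

```python
import socket

# Hypothetical hops in dependency order. The hostnames and ports are
# assumptions for illustration, not the real Lemmy.zip configuration.
HOPS = [
    ("ui -> lemmy", "lemmy", 8536),
    ("lemmy -> database", "postgres", 5432),
]


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False


def first_broken_hop(hops, probe=tcp_probe):
    """Walk the chain in order; return the label of the first failing hop, or None."""
    for label, host, port in hops:
        if not probe(host, port):
            return label
    return None
```

Calling `first_broken_hop(HOPS)` from a container on the same network returns the label of the first hop that fails, or `None` if every hop accepts a connection. The `probe` parameter exists so the walk can be tried out without a live network.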

Importantly, I want to issue an apology. This isn't what I want for Lemmy.zip, and I should've handled it way better. I'm always learning, but this took way longer than it should've. While I take some solace in the fact that the backup process worked and has now been proven to work in production, the delay in getting the site back up is entirely my fault and frankly unacceptable.

I'll be working to document this outage, the steps it took to get it back up, and some form of repeatable plan so a repair can be replicated in the future if I'm not available.

In terms of upgrading to 0.19.11 - I will have to try again soon as it's got some security fixes we desperately need to implement.

Thanks

Demigodrick

111 comments
  • I will try and reply to each comment - but you've all been really kind and that means so much ❤️

    If you're interested, this graph will show you how far behind we are. We should eventually catch up, but things will likely be very delayed for up to 12 hours.

    The status page did not work as expected - and I'll try and link a few more places where I post updates. If you haven't yet, definitely join the matrix space and you'll get minute by minute panic updates 🫠

  • Importantly, I want to issue an apology

    Way I see it, family and mental health always come before internet randos. Thanks for working hard for everyone.

    • Lots of Internet randos have been very nice and supportive, so I feel a debt to the community to make this place the best it can be.

      But thank you ❤️

  • Thank you for this post. Don't be so harsh on yourself, everyone can make a mistake!

    Good to see Lemmy.zip back up!

  • No need to apologize as you have been doing a stellar job. Your family needs to always take priority no matter what. I don't care if it is down for a week as your health and kid are far more important.

    One thing I will say is that I think Lemmy.zip could really benefit from an external way of communicating announcements. It doesn't need to be complicated, and you could reuse your existing Mastodon account to post updates when things go wrong. It could also allow users to give advice on how to fix issues.

    • Thanks, yes I agree. I'll likely be adding something to Mastodon, and I'm planning to look at alternative status pages, as this one failed the one time it was really needed.

  • Trial by fire. At least it was interesting(!)

    Praise be to the backup strategy 🙂

    • I'd tested the backup strategy before, but it's never the same as actually having to do it for real. The relief when the backup worked was immense 🤣

  • I think you handled it very well. Not sure how it could’ve been handled better tbh. I figured something didn’t go as planned and I didn’t have any problems waiting for you to find a solution. No apologies needed.

  • How dare you interrupt my ability to look at memes and see the same news article posted in 17 places at once!

    Jokes aside I appreciate the work y'all do to keep this sorta thing running without any pay or thanks for the most part.

    I am grateful.

    • Only 17 times?! I need to spin up some more communities 🤣

      Thank you, I appreciate the kind words 😊

  • Been there, done that, with my Friendica instance. 2 days of downtime while rebuilding a corrupted database, while people are tapping their feet waiting for all to return. I'm with you in spirit, my friend.

    Thanks for all your hard work keeping the dream alive! And for keeping good backups

  • Unfortunately these things can and do happen. I'm glad you were able to get things functional with a restoration. Best of luck troubleshooting and repairing the leftover gremlins.

    Thanks for all you do to support Lemmy.

  • Dude, I'm a devops engineer and I totally get it when an app in my cluster goes down and customers start to freak out. However, I get paid to deal with it, whereas you are doing this, I imagine, as a fun side thing. You have run this entire thing super professionally since the beginning. You are doing great. If it was an Istio thing that caused your containers to not talk... Believe me when I tell you that is my hell (one I'm currently experiencing with one particular app). You are doing amazing work here. Thank you so much. Glad I'm here.

  • No worries man. It's just social media. We survived!

    One thing I was curious about: I didn't know where to go to look up info on what was going on. But you mentioned you were posting links, so I'll bookmark some of that info! Anyway, thanks for all your hard work.

    • Yeah absolutely, I'll be putting more links up. I wasn't prepared for things to go quite so wrong and for so long, so other than the matrix chat there wasn't any other info.

      I'll be using Mastodon and Matrix, and I've updated the status page to be one that should actually work now :)

  • Chill, you do a great job managing this. There is no need to blame yourself that way.

    Get some rest, enjoy the first stages of parenting, and take your time updating.

    Thanks a lot for lemmy.zip and take care of yourself.
