1mo ago

‘There isn’t really another choice:’ Signal chief explains why the encrypted messenger relies on AWS

www.theverge.com ‘There isn’t really another choice:’ Signal chief explains why the encrypted messenger relies on AWS

Signal was just one of many services brought down by the AWS outage.

cross-posted from: https://lemmy.zip/post/51866711

Signal was just one of many services brought down by the AWS outage.

You're viewing a single thread.

111 comments

Why is it that only the larger cloud providers are acceptable? What's wrong with one of the smaller providers like Linode/Akamai? There are a lot of crappy options, but also plenty of decent ones. If you build your infrastructure over a few different providers, you'll pay more upfront in engineering time, but you'll get a lot more flexibility.
For something like Signal, it should be pretty easy to build this type of redundancy since data storage is minimal and sending messages probably doesn't need to use that data storage.
- Akamai isnt small hehe
  
  It is, compared to AWS, Azure, and Google Cloud. Here's 2024 revenue to give an idea of scale:
  Akamai - $4B, Linode itself is ~$100M
  AWS - $107B
  Azure - ~$75B
  Google Cloud - ~$43B
  The smallest on this this list has 10x the revenue of Akamai.
  Here are a few other providers for reference:
  Hetzner (what I use) - €367M
  Digital Ocean - $692.9M
  Vultr (my old host) - not public, but estimates are ~$37M
  I'm arguing they could put together a solution with these smaller providers. That takes more work, but you're rewarded with more resilience and probably lower hosting costs. Once you have two providers in your infra, it's easier to add another. Maybe start with using them for disaster recovery, then slowly diversify the hosting portfolio.
  
  10% the size of google is decent. If I had ten percent of a tech giant's reach in any particular sector I would consider myself significant but I get where you ae coming from
- Also you know… building your own data centers / co-locating. Even with the added man hours required it ends up being far cheaper.
  
  But far less reliable. If your data center has a power outrage or internet disruption, you're screwed. Signal isn't big enough to have several data centers for geographic diversity and redundancy, they're maybe a few racks total.
  Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever? If there's an outage, you're talking hours to days to get another server up, vs minutes for rented hosting.
  For the scale that signal operates at and the relatively small processing needs, I think you'd want lots of small instances. To route messages, you need very little info, and messages don't need to be stored. I'd rather have 50 small replicas than 5 big instances for that workload.
  For something like Lemmy, colo makes a ton of sense though.
  
  It’s plenty reliable. AWS is just somebody else’s datacenter.
  Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever?
  Most Colo DCs offer ad hoc remote hands, but that’s beside the point. What do you mean here by “Various parts of the world”? In Signal’s case even Amazon didn’t need anyone in “various parts of the world” because the Signal infra on AWS was evidently in exactly one part of the world.
  If there's an outage, you're talking hours to days to get another server up, vs minutes for rented hosting.
  You mean like the hours it took for Signal to recover on AWS, meanwhile it would have been minutes if it was their own infrastructure?
  
  the Signal infra on AWS was evidently in exactly one part of the world.
  We don't necessarily know that. All I know is that AWS's load balancers had issues in one region. It could be that they use that region for a critical load balancer, but they have local instances in other parts of the world to reduce latency.
  I'm not talking about how Signal is currently set up (maybe it is that fragile), I'm talking about how it could be set up. If their issue is merely w/ the load balancer, they could have a bit of redundancy in the load balancer w/o making their config that much more complex.
  You mean like the hours it took for Signal to recover on AWS, meanwhile it would have been minutes if it was their own infrastructure?
  No, I mean if they had a proper distributed network of servers across the globe and were able to reroute traffic to other regions when one has issues, there could be minimal disruption to the service overall, with mostly local latency spikes for the impacted region.
  My company uses AWS, and we had a disaster recovery mechanism almost trigger that would move our workload to a different region. The only reason we didn't trigger it is because we only need the app to be responsive during specific work hours, and AWS recovered by the time we needed our production services available. A normal disaster recovery takes well under an hour.
  With a self-hosted datacenter/server room, if there's a disruption, there is usually no backup, so you're out until the outage is resolved. I don't know if Signal has disaster recovery or if they used it, I didn't follow their end of things very closely, but it's not difficult to do when you're using cloud services, whereas it is difficult to do when you're self-hosting. Colo is a bit easier since you can have hot spares in different regions/overbuild your infra so any node can go down.
  
  It was a DNS issue with DynamoDB, the load balancer issue was a knock-on effect after the DNS issue was resolved. But the problem is it was a ~15 hour outage, and a big reason behind that was the fact that the load in that region is massive. Signal could very well have had their infrastructure in more than one availability zone but since the outage affected the entire region they are screwed.
  You’re right that this can be somewhat mitigated by having infrastructure in multiple regions, but if they don’t, the reason is cost. Multi-region redundancy costs an arm and a leg. You can accomplish that same redundancy via Colo DCs for a fraction of the cost, and when you do fix the root issue, you won’t then have your load balancers fail on you because in addition to your own systems you have half the internet all trying to pass its backlog of traffic at once.
  
  Multi-region redundancy costs an arm and a leg
  Yes, if you buy an off the shelf solution, it'll be expensive.
  I'm suggesting treating VPS instances like you would a colo setup. Let cloud providers manage the hardware, and keep the load balancing in house. For Signal, this can be as simple as client-side latency/load checks. You can still colo in locations with heavier load; that's how some Linux distros handle repo mirrors, and it works well. Signal's data needs should be so low that simple DB replicas should be sufficient.

111 comments