1y ago

Data protection, the right to be forgotten, and federation

Hey fellow nerds, I have an idea that I'd like to discuss with you. All feedback – positive or negative – is welcome. Consider this a baby RFC (Request for Comments).

So. I've been having a think on how to implement the right to be forgotten (one of the cornerstones of eg. the GDPR) in the context of federated services. Currently, it's not possible to remove your comments, posts, etc., from the whole Fediverse (and not just your "home instance") without manually contacting every node in the network. In my opinion, this is a fairly pressing problem, and there would already be a GDPR case here if someone were to bring the "eye of Sauron" (ie. a national data protection authority) upon us.

Please note that this is very much a draft and it does have some issues and downsides, some of which I've outlined towards the end.

The problem

In a nutshell, the problem I'm trying to solve is how to guarantee that "well-behaved" instances, which support this proposal, will delete user content even in the most common exceptional cases, such as changes in network topology, network errors, and server downtime. These are situations where you'd typically expect messages about content or user deletion to be lost. It's important to note that I've specifically approached this from the "right to be forgotten" perspective, so the current version of the proposal solely deals with "mass deletion" when user accounts are deleted. It doesn't currently integrate or work with the normal content deletion flow (I'll further discuss this below).

I understand that in a federated or decentralized network it's impossible to guarantee that your content will be deleted, but we can't let "perfect be the enemy of good enough". Making a concerted effort to ensure that in most cases user content is deleted (initially this could even just be a Lemmy thing and not a wider Fediverse thing) when the user so wishes would already be a big step in the right direction.

I haven't yet looked into "prior art" except some very cursory searches, and I had banged out the outline of this proposal before I even went looking, but I now know that eg. Mastodon has the ability to set TTLs on posts. This proposal is sort of adjacent and could be massaged a bit to support this on Lemmy (or whatever other service) too.

1. The proposal: TTLs on user content

- Every comment, post etc. (content) must have an associated TTL (eg. a live_until timestamp). This TTL can be fairly long, on the order of weeks or even a couple of months
- well before the content's TTL runs out (eg. even halfway through the TTL, with some random jitter to prevent "thundering herds"), an instance asks the "home instance" of the user who created the content whether the user account is still live. If it is, great, update the TTL and go on with life
  1. in cases where the "home instance" of a content creator can't be reached due to eg. network problems, this "liveness check" must be repeated at random long-ish intervals (eg. every 20 – 30h) until an answer is gotten or the TTL runs out
  2. information about user liveness should be cached, but with a much shorter TTL than content
  3. in cases where the user's home instance isn't in an instance's linked instance list or is in their blocked instance list, this liveness check may be skipped
- when content TTL runs out and a user liveness check hasn't succeeded, or when a user liveness check specifically comes back as negative, the content must be deleted
  1. when a liveness check comes back as negative and the user has been removed, instances must delete the rest of that user's content and not just the one whose TTL ran out
  2. when a liveness check fails (eg. the user's home instance doesn't respond), instances may delete the rest of that user's content. Or I guess they probably should?
- user accounts must have a TTL, on the order of several years
  1. when a user performs any activity on the instance, this TTL must be updated
  2. when this TTL runs out, the account and all of its related content on the instance must be deleted
  3. instances may eg. ping users via email to remind them about their account expiring before the TTL runs out
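
To make the content TTL and liveness check rules above a bit more concrete, here's a rough Python sketch of how an instance might react to one liveness check result. Everything in it is an assumption for illustration: the names (`Content`, `Liveness`, `Action`, `handle_check`), the 60 day TTL and the 10% jitter are made up, and only the "check around halfway" and "retry every 20 – 30h" ideas come from the proposal.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

# Assumed values; the proposal only says "weeks or a couple of months" and "every 20 - 30h".
CONTENT_TTL = timedelta(days=60)
RETRY_HOURS = (20.0, 30.0)


class Liveness(Enum):
    ALIVE = "alive"              # home instance says the account still exists
    GONE = "gone"                # home instance says the account was removed
    UNREACHABLE = "unreachable"  # no answer (network error, downtime, ...)


class Action(Enum):
    KEEP = "keep"
    RETRY_LATER = "retry_later"
    DELETE_THIS_CONTENT = "delete_this_content"
    DELETE_ALL_USER_CONTENT = "delete_all_user_content"


@dataclass
class Content:
    author: str
    live_until: datetime
    next_check: datetime


def first_check_time(now, ttl=CONTENT_TTL, rng=random):
    """Check around the TTL midpoint, with up to 10% jitter either way."""
    return now + ttl * (0.5 + rng.uniform(-0.1, 0.1))


def new_content(author, now, rng=random):
    """Create content with a fresh TTL and a scheduled first liveness check."""
    return Content(author, now + CONTENT_TTL, first_check_time(now, rng=rng))


def handle_check(content, result, now, rng=random):
    """Apply one liveness check result and say what the instance should do."""
    if result is Liveness.ALIVE:
        # Account is live: refresh the TTL and schedule the next check.
        content.live_until = now + CONTENT_TTL
        content.next_check = first_check_time(now, rng=rng)
        return Action.KEEP
    if result is Liveness.GONE:
        # Negative answer: all of this user's content must go.
        return Action.DELETE_ALL_USER_CONTENT
    if now >= content.live_until:
        # No answer and the TTL has run out: this content must be deleted.
        return Action.DELETE_THIS_CONTENT
    # No answer yet, TTL still running: retry at a random long-ish interval.
    content.next_check = now + timedelta(hours=rng.uniform(*RETRY_HOURS))
    return Action.RETRY_LATER
```

The sketch leaves out the liveness cache and the account-level TTL, and in the "unreachable and expired" case it only deletes the one item, since the proposal leaves open whether the rest of the user's content should go too.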

2. Advantages of this proposal

- guarantees that user content is deleted from "well behaved" instances, even in the face of changing network topologies when instances defederate or disappear, hiccups in message delivery, server downtime and so on
- would allow supporting Mastodon-like general content TTLs with a little modification, hence why it has TTLs per content and not just per user. Maybe something like a refresh_liveness boolean field on content that says whether an instance should do user liveness checks and refresh the content's TTL based on it or not?
- with some modification this probably could (and should) be made to work with and support the regular content deletion flow. Something for draft v0.2 in case this gets any traction?
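
As a rough illustration of the refresh_liveness idea, here's a tiny Python sketch. The `Content` class, the `on_author_alive` function and the 60 day default are hypothetical; only the `refresh_liveness` and `live_until` field names come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Content:
    live_until: datetime
    # True: TTL gets extended while the author's account is live.
    # False: fixed, Mastodon-like expiry that is never extended.
    refresh_liveness: bool = True


def on_author_alive(content, now, ttl=timedelta(days=60)):
    """Extend the TTL only for content that opted into liveness refreshes."""
    if content.refresh_liveness:
        content.live_until = now + ttl
    return content.live_until
```

With the flag off, the content simply expires at its original live_until no matter what the home instance says.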

3. Disadvantages of this proposal

- more network traffic, DB activity, and CPU usage, even during "normal" operation and not just when something gets deleted. Not a huge amount but the impact should probably be estimated so we'd have at least an idea of what it'd mean
  1. however, considering the nature of the problem, some extra work is to be expected
- as noted, the current form of this proposal does not support or work with the regular deletion flow for individual comments or posts, and only addresses the more drastic scenario when a user account is deleted or disappears
- spurious deletions of content are theoretically possible, although with long TTLs and persistent liveness check retries they shouldn't happen except in rare cases. Whether this is actually a problem requires more thinkifying
- requires buy-in from the rest of the Fediverse as long as it's not a protocol-level feature (and there's more protocols than just ActivityPub). This same disadvantage would naturally apply to all proposals that aren't protocol-level. The end goal would definitely be to have this feature be a protocol thing and not just a Lemmy thing, but one step at a time

3.1 "It's a feature, not a bug"

- when an instance defederates or otherwise leaves the network, content from users on that instance will eventually disappear from instances no longer connected to its network. This is a feature: when you lose contact with an instance for a long time, you have to assume that it's been "lost at sea" to make sure that the users' right to be forgotten is respected. As a side note, this would also help prune content from long gone instances
- content can't be assumed to be forever. This is by design: in my opinion Lemmy shouldn't try to be a permanent archive of all content, like the Wayback Machine

11 comments

Something else to consider here would be some kind of batching. A system doing this check should group users together by instance and make a single call to that instance. Something like: "Hey, I have this list of users from your instance. Are they all still active? A, B, C, D..." Reply: "From your request, here is the list of users that I found in my database: A, D". Now the calling system would know it should remove all data for users B & C.
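
A minimal Python sketch of that batching, assuming users are identified by "user@instance" handles. The function names are made up, and the actual network call to the home instance is left out.

```python
from collections import defaultdict


def group_by_instance(handles):
    """Group 'user@instance' handles by home instance, one batch per instance."""
    groups = defaultdict(set)
    for handle in handles:
        user, _, instance = handle.rpartition("@")
        groups[instance].add(user)
    return dict(groups)


def users_to_purge(asked, found):
    """Users we asked about that the home instance did not find."""
    return set(asked) - set(found)


batches = group_by_instance(["A@x.example", "B@x.example", "C@y.example"])
# One request per instance instead of one per user.
print(sorted(batches))
# Asked about A, B, C, D; the reply lists A and D, so B and C get removed.
print(sorted(users_to_purge({"A", "B", "C", "D"}, {"A", "D"})))
```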

Not a backend dev, but it would seem like this could possibly be partially solved by purging data past a certain age that falls into specific scenarios:
Data from unfederated instances
Data from users/posts/comments that have been deleted/removed
Also, deleting/removing content doesn't really seem to do much currently, as you still get all the info back from the server and it's up to the frontend to not display it. I'm normally of the opinion that if you want to delete your comment it should be properly deleted (moderation removal being a separate issue).

I don't fully understand the "right to be forgotten".
I mean, it's very useful when a corporation which profits from your data doesn't want to delete that data, but from the perspective of forums like this one I struggle to understand people's need to delete everything at some point.
The only result I see from this is useful knowledge being lost.
Imagine I make a useful post which people come back to from time to time to solve their issue. People would probably link to beehaw, not my instance, since I posted in this community. After a couple of years I can no longer maintain my instance and it goes down. Then my useful post silently self-destructs; people won't know this and will keep linking to it, and eventually it'll end up like a lot of forums:
"The solution is in this link"
"Thanks, that solved my issue"
But now link is dead and the solution gone.
With how lemmy works now then people will still be able to find the content even if the instance where it originated from dies.
I see this as a very useful feature to preserve knowledge.
If you don't want something to be on the internet forever then don't post it. As you said, the Wayback Machine exists, so even then you're acknowledging the GDPR request you made to the instance was useless: you still need to go to every archiver there is to be sure your data has been properly deleted.
- I don’t fully understand the “right to be forgotten”.
  I think there is a difference between agreeing with the law itself and agreeing with the usefulness. GDPR gives users incredible power over their data, and in the case of Reddit it allows you to leave the platform very effectively for example.
  “The solution is in this link” “Thanks, that solved my issue” But now link is dead and the solution gone.
  This is sadly the case with everything on the internet and life in general tbh.
  even then you’re acknowledging the GDPR request you made to the instance was useless
  Don't quote me on this, but I don't think GDPR says they have to delete every instance of your content across the internet, just the ones they have power over.
  Also, I'm mainly adding some of my thoughts, don't take this as criticism of your post or your viewpoint. I fully agree that there is no solution that pleases everyone here.

A simpler solution would be to ship Lemmy so when something is deleted, it is not simply flagged as being deleted, it is actually deleted from the database. In order to circumvent this, bad actors would have to alter the source code and recompile their own version. For some reason, the Lemmy devs chose not to implement deletion in this way, and I agree this is concerning.
- A simpler solution would be to ship Lemmy so when something is deleted, it is not simply flagged as being deleted, it is actually deleted from the database
  While actual deletion of content is definitely something that needs to be implemented, it can't be relied on for data protection / right to be forgotten needs: you can't guarantee that a message about content or user account deletion makes it to every instance in the network, including instances a user's home instance is no longer federated with.
  The proposal is more complex exactly because we can't make assumptions about network topology or all instances in the network being reachable when content or a user is deleted. This is the fun part about distributed systems.
  Edit: I'll add this to the "it's a feature, not a bug" list since while you do have a valid point, this is specifically something that won't work in this context

IMO it can be MUCH simpler. Deleting content should propagate across federation just like adding content does. De-federating should retroactively remove all content that it would normally keep from propagating (possibly leaving "this post/comment deleted" markers so that replies make sense). And losing track of an instance for long enough (e.g. a week, or a month) should be equivalent to de-federating, possibly with the option to resurrect content when and if the instance comes back online.
I believe that would remove a lot of the issues with extra traffic, and possibly a lot of the issues with extra processing. I don't know enough about the protocol to tell whether it would add requirements for extra data, but I suspect it wouldn't.

TTL on all content scales extremely poorly. You touch on this but I don't think you appreciate just how big of a SELECT * WHERE TTL ... this would be in just a few months/years. As an alternative, every instance sync should come with a list of newly deleted users. Retrying would not need to be reimplemented. If a user who wishes to be forgotten has had their home instance go dark, there will need to be a way for them to prove ownership over the original account (signup confirmation email perhaps) so a delete can be started from a foreign instance.
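
To sketch what "every sync comes with a list of newly deleted users" could look like, here's a toy Python version. It's purely hypothetical: the class names, the append-only log and the cursor are assumptions made for illustration, not how Lemmy or ActivityPub currently work.

```python
from dataclasses import dataclass, field


@dataclass
class HomeInstance:
    # Append-only log of deleted user names (assumed design).
    deleted_log: list = field(default_factory=list)

    def delete_user(self, user):
        self.deleted_log.append(user)

    def deletions_since(self, cursor):
        """Return the deletions a peer hasn't seen yet, plus its new cursor."""
        return self.deleted_log[cursor:], len(self.deleted_log)


@dataclass
class RemoteInstance:
    content: dict  # user name -> list of that user's posts held locally
    cursor: int = 0  # how far into the home instance's log we've read

    def sync(self, home):
        """Fetch new deletions and drop those users' content locally."""
        deleted, self.cursor = home.deletions_since(self.cursor)
        for user in deleted:
            self.content.pop(user, None)
        return deleted
```

Because the cursor only moves forward after a successful sync, a sync that was missed due to downtime just picks up the same deletions next time, which is one reading of the "retrying would not need to be reimplemented" point. It doesn't cover the case where the home instance is gone for good.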

I like this idea, I think you could do some smart logic with non-responses to avoid spurious deletions. Like if any post from another instance responded recently, hold off. I’m just imagining if a server had some downtime and suddenly their content in the fedi is gone
- I’m just imagining if a server had some downtime and suddenly their content in the fedi is gone
  Yeah this is definitely a possibility, but if content TTLs are a month or two and refreshing user account liveness starts happening well before the TTL of a content runs out (possibly even halfway through the TTL as I noted somewhere), an instance would have to be unreachable for up to a month (depending on the TTL of content / cached user account liveness info).
  But in general you're definitely right that there's probably Smart Stuff™ we could do regarding liveness checks to make spurious deletions as unlikely as humanly possible.
  Edit: I could also see having some sort of provision for "returning instances", but no idea how this would work
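
For a rough feel of the numbers, here's a back-of-the-envelope Python calculation of how many liveness retries fit in before content would be deleted. The function name is made up, and it assumes the first check happens exactly halfway through the TTL, with retries every 20 – 30h as in the proposal.

```python
import math


def retry_budget(ttl_days, first_check_fraction=0.5, retry_hours=(20, 30)):
    """Return (fewest, most) retries that fit between the first check and TTL expiry."""
    window_hours = ttl_days * 24 * (1 - first_check_fraction)
    shortest, longest = retry_hours
    return math.floor(window_hours / longest), math.floor(window_hours / shortest)


print(retry_budget(60))  # (24, 36)
print(retry_budget(30))  # (12, 18)
```

So with a 60 day TTL, a home instance would have to miss somewhere between 24 and 36 checks spread over 30 days before its users' content gets dropped.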
