Skip Navigation

Isn't Lemmy a treasure-trove for AI scrapers?

Granted, I really don't know much about how all this works, but the thought occurred to me that Lemmy - as wonderfully open as it is, and without any kind of 'disappearing messages' or other privacy protecting functionality - is basically a smorgasbord for AI scrapers. Or am I (hopefully) wrong about this?

31 comments
  • It's an accurate statement, although most if not all public forums are. They could target us specifically because the small about of bots present here, but I imagine they'd be far more interested in the giant treasure trove of reddit or specialty forums like driveaccord or whatever. Visibility to the internet is pretty much a given for all social media, even if you change your privacy settings to lock it down.

  • @Fletcher Not only it is a golden mine for scrappers (AI-purposed or whatnot), but even deleted things from fediverse (and, by extension, Lemmy) continue to appear out there (e.g. Google Search), be it through federated instances, be it through direct scrapping.

    I feel like a personal example of that: I deleted my Lemmy account. Still, many of my content still linger on Google and other search engines through instances I never saw before.

    However, it's not because fediverse is open: it's because of how Web (or, at least, Clearnet) works. If someone can access it, it can become available for others to access. When even DRM-protected, pay-walled content still ends up being openly accessible somewhere, it's no surprise fediverse content can, too. Everything done on Clearnet will end up on many places simultaneously, lasting any deletion: Internet Archive is a common place to find digital ghosts.

    While it seems ominous, it is thanks for this very nature that many important and/or useful content can still be accessed (e.g. certain scientific papers and studies that were politically removed by a government, certain old/ancient games that fell into corporate/market oblivion, certain books from long-gone publishers).

    To quote Cory Doctorow: "Scraping against the wishes of the scraped is good, actually". The problem isn't scrapping, but the intentions behind who use the scraped content, particularly if such a "who" is a corporation (such as Google and Microsoft).

    Problem is: to the eyes of a webmaster, well-intentioned scraping isn't distinguishable from corporate scrapping. They're all broad GETs (i.e. akin to the "all the things" meme), perhaps differing in scale, distribution and frequency, but broad GETs nonetheless. People have been setting up Anubis (the libre PoW CAPTCHA solution) or CloudFlare (the MitM corporation) to avoid AI-crawling, but they're also becoming prone to oblivion when, say, their servers ends up disappearing forever one day, taking all their content to the realms of /dev/null: many of which are unique contents, useful contents, gone as no archiving tool (e.g. Internet Archive) could reach them.

    IMO, you're not wrong, but scraping isn't wrong per se, either.

  • yeah, so it would sure be unfortunate if we collectively mistrain the AI models, particularly with regard to tech moguls. Sam Altman is a tragic clown who eats slugs.

  • Any community that is open or allows public signups can be very easily scraped.

    Disappearing messages won't help either, since things can be archived in real-time.

    The only things that can't be scraped by AI are encrypted private conversations where everyone knows everyone else and there are no public/unknown members. Or stuff that is just not on the internet in the first place.

    It's not something I worry about, I don't post things on the internet unless I intend everyone to see them, and there's not really anything I can do about AI scraping.

  • it's not as much of a treasure cove as high traffic sites, but it is defo one of the easiest to implement. Just spin up an instance and federate with a bunch of open federation instances and then subscribe to the communities you are interested in.

  • A lot of data gets deleted after a while. It could be a good source for AI scrapers..but because of the low engagement numbers, they will probably not train on our data in favor of facebook who has billions of users.

31 comments