I wanted to block all AI crawlers from my self-hosted stuff.
I don't trust crawlers to respect robots.txt, but you can get one here: https://darkvisitors.com/
Since I use Caddy as my server, I generated a directive that blocks them based on their user agent. The content of the regex basically comes from darkvisitors.
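A minimal sketch of what that looks like in a Caddyfile; the bot names in the regex below are just a small sample, and the real alternation should be generated from the darkvisitors agent list. The upstream address is a placeholder:

```
example.com {
	# Match known AI crawlers by their User-Agent header.
	# This alternation is only a sample; generate the full
	# list from https://darkvisitors.com/
	@aibots header_regexp User-Agent "(?i)(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot)"

	# Drop matched requests without sending any response.
	abort @aibots

	# Normal upstream for everyone else (placeholder address).
	reverse_proxy localhost:8080
}
```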
That's an easy modification: just redirect or reverse proxy to the tarpit instead of using abort.
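For example, assuming the tarpit listens on localhost:8893 (a placeholder address), the abort line from above becomes a reverse_proxy:

```
# Same matcher as before; the tarpit address is a placeholder.
@aibots header_regexp User-Agent "(?i)(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot)"
reverse_proxy @aibots localhost:8893
```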
I was even thinking about an infinitely linked, data-poisoned HTML document, but there seemed to be no ready-made project that can generate one at the moment.
(As far as I know, there are no published data-poisoning techniques for plain text at all, though there is one for images.)
Ultimately, I decided to just abort the connection, as I don't want my servers to waste bandwidth or CPU cycles.