
Technology @lemmy.world

KarlHeinzSchwuke @feddit.org

3d ago

I was wrong about robots.txt

evgeniipendragon.com


Hacker News @lemmy.bestiver.se

RSS Bot @lemmy.bestiver.se

4d ago

I was wrong about robots.txt

evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/

22 comments

What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.
- I feel like most casual users would not connect "crawlers" to the link previews discussed in the article.
  Sure, if you understand that robots.txt applies to all robots, it's obvious. But that is not how general news media has been talking about robots.txt.
  
  "that is not how general news media has been talking about robots.txt."
  Ahh, yes. I think there is a lesson there.
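
  To make the "robots.txt applies to all robots" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies, the example URL, and the `allowed` helper are illustrative assumptions, not the author's actual configuration. `LinkedInBot` stands in for a link-preview fetcher and `GPTBot` for an AI crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every robot, previews included.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that lets one preview bot through and blocks the rest.
SELECTIVE = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""

# Placeholder URL for illustration only.
URL = "https://example.com/posts/some-post/"


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, body in (("blanket", BLANKET), ("selective", SELECTIVE)):
        for agent in ("LinkedInBot", "GPTBot"):
            print(name, agent, allowed(body, agent, URL))
```

  Under the blanket rule both agents are refused, so a well-behaved preview fetcher gets nothing to show. With the explicit `LinkedInBot` group, only that agent gets through. This only models crawlers that honor the file.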

So, if I can add something here for everyone's benefit:
No search engine really obeys robots.txt.
Their publicly acknowledged crawlers do, but they have other, unacknowledged crawlers that ignore the file.
Google knows every inch of your site, allowed or not.
See, just because a search engine says it doesn't know a page doesn't mean it hasn't crawled it. It just doesn't display the results, based on your settings.
- And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/

Huh. So in this case, the file actually is respected. Refreshing.
- Often it is respected, but the resulting problem is that platforms conflate their regular crawlers with their questionable AI scraping crawlers, blackmailing websites into feeding AI.
  For example, Googlebot if enabled won't just list you for search, but will also scrape your contents for Google's AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as source. I imagine LinkedInBot, given it's Microsoft, will feed some other AI of theirs as well, on top of the previews.
  Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn't going to improve.
  
  "Googlebot if enabled won't just list you for search, but will also scrape your contents for Google's AI."
  False.
- Kinda, but also not really. Any major tech player that has billions to lose will make a show of respecting robots.txt when presenting that information to third parties, lest they be exposed by basic journalism.
  However, they also have separate networks in R&D that sweep the net all the time and do not care about such restrictions. It's theatre.
  And they're still happy to punish people who have the gall to publicly decline their crawlers. Basically, they can eat their cake and have it too.
