10 comments
  • You can download a torrent of the whole thing; they don't need to give it to anyone.

    https://en.m.wikipedia.org/wiki/Wikipedia:Database_download

    • This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.

      • The problem is that this assumes the AI creators who scrape relentlessly (and there are a fair few) would, after taking this data source directly, also add an exception to their scrapers to avoid Wikipedia's site. I doubt they would bother.

  • I just feel like OpenAI might accept this and ignore the website, although it's very unlikely they will actually do that.
