Why don't open source projects add a notice to not use their code for AI training, to comply with the license terms?

ryujin470@fedia.io · 20 days ago

Why don't open source projects add a notice to not use their code for AI training, to comply with the license terms?

9tr6gyp3@lemmy.world · 20 days ago

Which datasets are not complying? Call them out by name. Gotta keep them accountable.

ryujin470@fedia.io · 20 days ago

I think basically every dataset used to train modern AI and ML models (from 2021 onward). They usually include code from GitHub and other code hosting sites.

9tr6gyp3@lemmy.world · 20 days ago

Yeah but which one? Can you name one?

ryujin470@fedia.io · 20 days ago

I’m not too familiar with the topic, but maybe The Pile?

9tr6gyp3@lemmy.world · edited 20 days ago

According to TechCrunch, that dataset is built from licensed and public domain works.

https://techcrunch.com/2025/06/06/eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text/

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.