There are several datasets for AI and ML research that use source code from open source projects. The models trained on these datasets often don’t comply with the projects’ licenses, which often require attribution of the authors and, in some cases, require that any new project using the code be licensed under the same license as the original.


Which datasets are not complying? Call them out by name. Gotta keep them accountable.
I think basically every dataset used to train modern AI and ML models (from 2021 onward). They usually use code from GitHub and other code hosting sites.
Yeah but which one? Can you name one?
I’m not too familiar with the topic, but The Pile, maybe?
According to TechCrunch, that dataset is built from licensed and public domain works.
https://techcrunch.com/2025/06/06/eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text/