Do OpenELM's training datasets contain copyrighted material?

I'm very excited about the release of this model and the efforts the team went through to openly document seemingly every aspect of it. Thank you!

I wonder if any information can be given concerning the selection of training datasets. On https://machinelearning.apple.com/research/openelm it says:

> our release includes the complete framework for training and evaluation of the language model on publicly available datasets

More specifically, on https://github.com/apple/corenet/blob/main/projects/openelm/README-pretraining.md it says:

> OpenELM was pretrained on public datasets. Specifically, our pre-training dataset contains RefinedWeb, PILE, a subset of RedPajama, and a subset of Dolma v1.6.

Digging into RefinedWeb on https://huggingface.co/datasets/tiiuae/falcon-refinedweb/viewer/default/train?q=nytimes.com, I found that it contains content from sources like nytimes.com and cnn.com.

<img width="560" alt="Screenshot 2024-05-01 at 11 01 57 AM" src="https://github.com/apple/corenet/assets/16358/c093ecc4-2bfd-4ec7-953b-6e720b7a1e1d">
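
For anyone who wants to repeat this check outside the dataset viewer, here is a minimal sketch of how matching records could be counted by domain. It assumes each record is a dict with a `url` field (my assumption about the RefinedWeb schema); `count_domains` and the sample rows are hypothetical and only there for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(records, domains, url_field="url"):
    """Count records whose URL host equals one of `domains` or is a subdomain of it."""
    wanted = [d.lower() for d in domains]
    counts = Counter()
    for record in records:
        # Missing or empty URLs yield an empty host and never match.
        host = (urlparse(record.get(url_field) or "").hostname or "").lower()
        for domain in wanted:
            if host == domain or host.endswith("." + domain):
                counts[domain] += 1
                break
    return counts


# Hypothetical rows standing in for dataset records (not real RefinedWeb data).
sample = [
    {"url": "https://www.nytimes.com/2019/01/01/example.html"},
    {"url": "http://edition.cnn.com/example"},
    {"url": "https://example.org/notnytimes.com"},
]
print(count_domains(sample, ["nytimes.com", "cnn.com"]))
```

The same function should work on a streamed dataset, for example the iterable returned by `datasets.load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)`, provided the records really expose a `url` field.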

This is not surprising: because of the vast amounts of data needed to train LLMs (basically a snapshot of the internet), all training datasets will contain copyrighted material. LLMs are, in a sense, a snapshot of humanity's knowledge. Copyrighted characters like Superman, Captain Kirk, Donald Duck, and Bugs Bunny are part of that collective knowledge, and references to them might pop up just about anywhere in a dataset. Getting a snapshot of humanity's knowledge that is free of such references would be as impossible as removing the sugar from a cake after it has been baked.

So while the project only mentions "publicly available datasets" and never claims to be "free of copyrighted material", can any information be shared about the selection process behind the datasets that were used to train OpenELM?

