Language models are nothing without their training data. But that data is large, mysterious, and opaque, so it requires selection, filtering, cleaning, and mixing. Check out our survey paper (led by the incredible @AlbalakAlon) that describes the best (open) practices in the field.