What is Training Data?
TL;DR
The dataset used to train an AI model. Its quality and quantity largely determine how well the model performs.
Training Data: Definition & Explanation
Training data is the dataset from which an AI model learns. For LLMs, it consists of trillions of tokens of text drawn from web pages, books, academic papers, code, and more. Model performance depends heavily on both the quality and the quantity of this data; a model trained on biased data will reflect that bias in its outputs. In recent years, high-quality data curation, the use of synthetic data, and copyright questions around training data have become major concerns. Public datasets such as Common Crawl, The Pile, RedPajama, and FineWeb are widely used for training open-source models.
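To make the idea of data curation concrete, here is a minimal sketch in Python of two common cleaning steps: dropping very short documents and removing exact duplicates. The function names, the sample documents, and the `min_words` threshold are all hypothetical choices for illustration; real pipelines typically use far more elaborate filters (language detection, near-duplicate detection, quality classifiers).

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def curate(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep the first copy of each document that has at least min_words words.

    min_words is an illustrative threshold, not a recommended value.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to carry much training signal
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept


# Hypothetical raw corpus: one duplicate and one low-content snippet.
raw_docs = [
    "Training data is the dataset used to fit a model.",
    "training data   is the dataset used to fit a model.",
    "Click here!",
    "Common Crawl is a public archive of web pages.",
]

clean_docs = curate(raw_docs)
# Whitespace-separated words are only a rough stand-in for model tokens.
total_words = sum(len(d.split()) for d in clean_docs)
print(len(clean_docs), total_words)
```

In this toy corpus the second document is a whitespace and case variant of the first, and the third is too short, so only two documents survive. Counting whitespace-separated words is a crude proxy; actual token counts depend on the tokenizer a given model uses.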