Datasheets for Datasets
A documentation framework for AI training datasets that describes their composition, collection methodology, preprocessing steps, intended uses, and ethical considerations. Proposed by Gebru et al. in 2018, datasheets bring supply-chain transparency to the data that shapes AI behavior.
Why It Matters
Garbage in, garbage out; worse, biased data in means biased decisions out. Datasheets prompt dataset creators to surface the assumptions, gaps, and potential harms baked into training data before that data shapes a model.
Example
A datasheet for a medical imaging dataset would document the demographic breakdown of patients (age, sex, ethnicity), the hospitals where images were collected, whether informed consent was obtained, and known gaps like underrepresentation of pediatric cases.
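As a rough sketch, the medical imaging example above could be captured as a structured record. The class name, field names, and all values here are hypothetical and chosen only for illustration; they are not a schema defined by Gebru et al., whose datasheets are written as answers to free-form questions.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    """Hypothetical, simplified datasheet record (field names are illustrative)."""
    name: str
    composition: dict          # e.g. patient counts per demographic group
    collection_sites: list     # where the data was gathered
    informed_consent: bool     # whether consent was obtained
    preprocessing: list = field(default_factory=list)
    intended_uses: list = field(default_factory=list)
    known_gaps: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections left empty, so undocumented areas are visible."""
        return [k for k, v in asdict(self).items() if v in ("", [], {}, None)]


# Made-up values for illustration only; not real data.
sheet = Datasheet(
    name="example-medical-imaging",
    composition={"age_0_17": 40, "age_18_64": 1200, "age_65_plus": 760},
    collection_sites=["Hospital A", "Hospital B"],
    informed_consent=True,
    known_gaps=["pediatric cases underrepresented"],
)

# Share of pediatric patients in the (made-up) composition.
pediatric_share = sheet.composition["age_0_17"] / sum(sheet.composition.values())

print(sheet.missing_sections())
print(pediatric_share)
```

Listing the empty sections is one way to make an incomplete datasheet obvious to a reviewer before the dataset is used for training.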
Think of it like...
Datasheets are like ingredient lists and sourcing disclosures on food packaging — they tell you where the raw materials came from and how they were handled before reaching your plate.
Related Terms
Model Card
A standardized document that accompanies a machine learning model, describing its intended use, performance metrics, limitations, training data, ethical considerations, and potential biases.
Data Drift
A change in the statistical properties of the input data over time compared to the data the model was trained on. When data drifts, model predictions can become less reliable, because the model is being applied to inputs unlike those it learned from.
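One crude way to picture drift is to compare a feature's mean in recent data against its mean in the training data. The sketch below is a simple heuristic, not a standard drift test; the function name `mean_shift` and the sample values are assumptions made up for illustration.

```python
from statistics import mean, stdev


def mean_shift(reference, current):
    """Distance between the two means, measured in reference standard deviations."""
    return abs(mean(current) - mean(reference)) / stdev(reference)


# Made-up feature values for illustration only.
reference = [50, 52, 48, 51, 49]   # values seen at training time
current = [60, 62, 58, 61, 59]     # values seen in production later

shift = mean_shift(reference, current)
print(shift)
```

A large shift suggests the inputs no longer resemble the training data. What counts as "large" depends on the application, and production systems typically rely on more robust statistical tests.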