The News Data Conundrum

Nick Hagar
4 min read · Aug 4, 2024

Read it first on my Substack

We’re in a period of simultaneous abundance and scarcity of news datasets — the kind that researchers use to answer questions about, say, editorial decision making or audience behaviors. Generative AI’s training data demands are driving both trends.

The abundance is evident. Large language models train on a lot of news articles. The corpora that contain these articles are often freely available (see: The Pile, FineWeb). Researchers can, in theory, sit downstream of this enormous data collection effort, repurposing slices of catch-all corpora for more purposeful empirical study.
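What "sitting downstream" looks like in practice is often just a domain filter over corpus records. Here's a minimal sketch with hypothetical records; web-scale corpora like FineWeb expose a `url` field per document, and the domains below are stand-ins, not real outlets:

```python
from urllib.parse import urlparse

# Hypothetical corpus records; FineWeb-style corpora carry
# similar url/text fields, just at vastly larger scale.
records = [
    {"url": "https://www.example-news.com/politics/story-1", "text": "..."},
    {"url": "https://blog.example.org/post", "text": "..."},
    {"url": "https://www.example-news.com/local/story-2", "text": "..."},
]

# Outlets of interest, keyed by host. The assumption here -- that a
# hostname is a reliable proxy for "this outlet's coverage" -- is
# exactly the kind of shortcut that can fail for syndicated content
# or outlets publishing across multiple domains.
news_domains = {"www.example-news.com"}

def filter_by_domain(records, domains):
    """Keep only records whose URL host is in the target domain set."""
    return [r for r in records if urlparse(r["url"]).netloc in domains]

subset = filter_by_domain(records, news_domains)
print(len(subset))  # 2 of the 3 sample records match
```

A filter like this is cheap to run, which is the appeal. But it only tells you what made it into the corpus, not what didn't, and that gap is the crux of the next section.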

But in practice, the knock-on benefit for the research community is likely not so straightforward. Upstream data collectors don’t care about the context or structure of news coverage; researchers do. So when we try to reverse-engineer the kinds of structured data that might be interesting for empirical analysis, we may come up short. We can’t guarantee that every major news story about a particular event gets included, that every outlet for a given region is well represented, or that a training dataset’s archive of a now-defunct publication is comprehensive. These guarantees matter little when training an LLM, but they can be vital for social scientists. So in many cases, we might still need to collect our own purpose-built…

Nick Hagar

Northwestern University postdoc researching digital media + AI