Building meaning at scale

Nick Hagar
Oct 4, 2023

Read it first on my Substack

The way large language models leverage text on the internet has upended how we think about and access all kinds of data. But, used well, these models might also make it possible to maintain rich, searchable archives that aren’t feasible today.

Take journalism as an example. LLMs, with their need for large corpora of high-quality text, have driven a wave of concern around access to news articles. News articles aren’t nearly as abundant as Reddit threads and fan fiction, but they are professionally written and edited. And as LLMs move toward serving as chat agents (especially with Google’s push for AI-driven search), encoding current events and information in that high-quality text is an attractive proposition.

This is a new paradigm in thinking about journalism — not as individual records, not as an aggregate representation of important events, but purely as a pile of text.

Contrast this view with a resource like the Internet Archive, which has also amassed an enormous corpus of news articles. The Internet Archive preserves the original shape of the record as much as possible: how the website where the article was published looked, how the story elements were laid out, the metadata (headline, byline, publication date, etc.), and even how the individual record changed over time. This mode of replication is not intended for large-scale ingestion. There is a computational cost to replicating web pages faithfully, and the idiosyncrasies of these records make them difficult to compile in large volume.

These two views, news as training data and news as historical record, are at opposite ends of the spectrum of how news data might be persisted. They each have a clear-cut use case. Obviously LLMs use news articles as training data, folded in along with lots of other text from across the internet. An archive ensures access on a smaller scale. Journalists can preserve a portfolio of clips, without worrying about former employers’ servers going offline. Researchers can establish primary-source information supporting their work. And interested readers can reliably go back to favorite pieces of writing.

As a media researcher, I’ve wanted something in the middle for years: a resource that maintains some of the structure of news articles. One that isn’t primarily concerned with the text of the journalism itself, but with providing comprehensive information about it: when it was published, its topic and context, who wrote it, and whatever other metadata is available.

This kind of resource would have myriad applications in journalism research. To give a couple of examples from my own work: in my master’s thesis, I studied the trajectories of 20th-century New York Times journalists in terms of where their bylines tended to appear in the print newspaper. That work was made possible by clean, structured data about newspaper articles, provided at scale by the Times (disclosure: my current employer) via a free API. Similarly, a New Media and Society piece I co-authored analyzed the movement patterns of freelance journalists as they published in various well-known outlets. Again, this work relied on a quality dataset connecting journalists to the articles they wrote, generously collated and uploaded to Kaggle.

The goal in this work is not to examine news articles themselves, but to use the information they contain to move up a level of abstraction. This approach could work for a wide range of interesting research questions: What kinds of stories do news outlets tend to use freelancers for, versus their in-house staff? When an outlet lays off some portion of the newsroom, how does coverage change, and in which areas? When an outlet hires new writers, what shape do their contributions take over time? Did the pivot to video meaningfully manifest in an increase in embedded videos, or in a decrease in text output? And so on.

Outside of academia, this kind of data might reveal some predictable patterns in news coverage. Much news is inherently chaotic, responding to events as they occur, and most of that reporting is new by definition, because it covers things that haven’t happened before. But there are also regularities at play. Newspapers publish a certain number of items per day, generally following a certain distribution by topic or section. Journalists are taught certain forms — the breaking news hit, the second-day story, the feature piece — and when to apply them. Being able to roll this kind of information up doesn’t necessarily make the news itself easier to predict, but it provides a view into how and where news organizations are deploying their resources, one that could be valuable for outlets themselves, for companies trying to make well-timed pitches, or for the myriad institutional investors who rely on signals from the news media to inform trades.

Despite these use cases, such well-structured data around the news is scarce. The most common abstraction from news articles is world events. This is the approach that services like GDELT, Event Registry, and Media Cloud take, aggregating individual reports into a broader understanding of what’s happening around the world. Another is media mentions, helping companies understand where they’re being mentioned and in what context. But these use cases are limited in scope — less metadata generation, more text extraction/summarization toward a specific end goal.

This void makes perfect sense from a cost perspective. Extracting good metadata from news articles across hundreds to thousands of websites is labor intensive. It demands grappling with the idiosyncrasies of individual sites and the ways in which they publish. Data is structured differently everywhere, and there’s no reliable, general-purpose way to pick out desired elements from an arbitrary page of HTML. We’ve been through a couple of generations of widely used “read later” apps like Pocket, Instapaper, and Matter, and generalized text/metadata abstraction is still a challenge.
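To make that gap concrete, here’s a minimal sketch of the rule-based approach, assuming the requests and BeautifulSoup libraries. It leans on common conventions (Open Graph meta tags, schema.org JSON-LD) that many news sites follow and many others don’t, which is exactly where this approach breaks down.

```python
import json

import requests
from bs4 import BeautifulSoup


def extract_metadata(url: str) -> dict:
    """Pull basic article metadata from standard tags, where a site bothers to use them."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    meta = {}

    # Open Graph and article meta tags: common, but far from universal
    for field in ("og:title", "og:site_name", "article:author", "article:published_time"):
        tag = soup.find("meta", property=field)
        if tag and tag.get("content"):
            meta[field] = tag["content"]

    # schema.org JSON-LD: another convention, applied inconsistently across outlets
    ld_tag = soup.find("script", type="application/ld+json")
    if ld_tag and ld_tag.string:
        try:
            ld = json.loads(ld_tag.string)
            if isinstance(ld, dict):
                meta["headline"] = ld.get("headline")
                meta["date_published"] = ld.get("datePublished")
        except json.JSONDecodeError:
            pass  # malformed or nonstandard markup: a routine failure mode

    return meta
```

Anything a site publishes outside these conventions (bylines buried in arbitrary divs, dates rendered only as display text) requires site-specific rules, which is why this approach gets expensive across hundreds of outlets.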

Ironically, though, large language models could offer a solution. As LLMs take on a more autonomous role, acting as agents that connect us to the broader web, they will need to be able to extract structured information from web pages. Context window limitations aside, they may already have some acumen for identifying relevant pieces of information within JSON, HTML, or other structured documents. This is also a type of task that a model could be explicitly trained for — entire web pages already get scraped for training data, so why not annotate their metadata and use it to reinforce the model’s parsing capabilities? An LLM-powered solution won’t scale nearly as well as a rule-based web scraper. But for many use cases, there might be feasible workarounds — use a traditional scraper to download entire HTML documents, for example, then process them in bulk later using GPT-4.
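As a rough sketch of that workaround, the snippet below assumes the openai Python SDK (v1 or later), an API key in the environment, and HTML that has already been downloaded by a conventional scraper; the model name, prompt wording, and field list are illustrative choices rather than a tested pipeline.

```python
import json

from openai import OpenAI  # assumes the openai SDK (v1+); reads OPENAI_API_KEY from the environment

client = OpenAI()

# Illustrative metadata fields; a real project would settle on its own schema
FIELDS = ["headline", "byline", "publication_date", "section", "outlet"]


def extract_with_llm(html: str, model: str = "gpt-4") -> dict:
    """Ask a language model to pull structured metadata from raw, already-scraped HTML."""
    prompt = (
        "Extract the following fields from this news article's HTML and return them "
        f"as a JSON object with exactly these keys: {FIELDS}. "
        "Use null for anything you cannot find.\n\n"
        + html[:20000]  # crude truncation to stay within the context window
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # A production pipeline would validate this output; models do not always return clean JSON
    return json.loads(response.choices[0].message.content)
```

Run in bulk over a folder of archived HTML files, this trades the per-site engineering of a scraper for a per-document inference cost, which is the scaling tradeoff described above.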

Ultimately, the ask outlined here isn’t specific to news. Every part of the web — social media sites, ecommerce portals, public records depositories — is made up of mountains of small records. Those records take care and attention to parse individually. They’re relatively easy to bundle up into an unstructured pile. But the middle layer, the abstractions we need to build some sort of working understanding of a community, platform, or industry, currently requires painstaking identification, labeling, and maintenance. An alternative approach to building those abstractions, one that leverages models trained to parse and aggregate records with minimal input, is an exciting prospect.

Nick Hagar

PhD student @ Northwestern University. I worked in digital media, now I study it.