Classifying articles via URLs
Today, I want to work through a small exercise in language modeling. A common NLP task for researchers working with news articles is some form of topic modeling. By getting a sense of what news stories are about, aggregated into a defined vocabulary of labels, researchers can make comparisons across groups (e.g., politics news versus financial news, or hard versus soft coverage). The question this raises is: What do we need to know about a news article to predict its topic?
Generally, this type of prediction relies on some amount of text from the news article — whether the entire article, its headline, or a summarized excerpt. Benchmarks show that these datasets, when paired with a well-suited model, can yield accurate labels. But obtaining such datasets presents challenges, which have become more pronounced in the past couple of years.
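To make the text-based approach concrete, here is a minimal sketch of headline-to-topic classification using TF-IDF features and a linear classifier — a common baseline, not any specific system discussed here. The headlines, topic labels, and vocabulary are made-up illustrations.

```python
# A minimal headline-topic classifier: TF-IDF features feeding a
# logistic regression. The training data below is invented for
# illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_headlines = [
    "Senate passes budget bill after late-night vote",
    "Governor signs election reform law",
    "Stocks rally as inflation report beats expectations",
    "Central bank holds interest rates steady",
    "Quarterback traded ahead of playoff run",
    "Underdogs clinch championship in overtime thriller",
]
train_topics = ["politics", "politics", "finance",
                "finance", "sports", "sports"]

# Unigram + bigram TF-IDF captures short phrases like "interest rates".
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_headlines, train_topics)

prediction = model.predict(["Markets slide after rate decision"])[0]
print(prediction)
```

With a realistic corpus of labeled headlines, this same pipeline is the kind of model the benchmarks above evaluate; the interesting question is what happens when the headline or article text itself is unavailable.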
The most obvious challenge is access, which is restricted at several levels of severity. At a minimum, a growing number of news publishers are politely asking not to be scraped. Ben Welsh maintains a list of news publishers who block scraping by OpenAI, Google AI, or the Common Crawl; of the 1,156 publishers tracked, roughly 54% deny access to at least one of these crawlers. This has a knock-on effect for researchers who rely on datasets like Common Crawl, and it suggests a general aversion to…