What should platform data collection look like now?
Large-scale digital trace data are increasingly difficult to come by, as platforms lock down APIs, restrict web scraping, and shift to formats (like short-form video) that make data collection and analysis trickier for researchers. I’ve written about this issue before, in the context of a small Python package I made to collect data from Substack. In this post, I want to talk through the technical challenges with this kind of data collection in more detail, how they’ve affected my software package, and what they might mean for how social scientists approach sample collection more broadly.
The impetus for this post was a return to Substack data collection. With the platform’s recent controversies around hosting and recommending Nazi newsletters, I wanted to expand my Python package to collect new types of data. The existing implementation was focused on post and newsletter metadata; I figured researchers would now also be interested in recommendations and posts on Notes (Substack’s Twitter-like social product).
I developed the first version of this package with a pretty standard approach — leveraging API endpoints that were publicly available without authentication. Here, for example, is the endpoint to get metadata about a newsletter’s posts: https://<NEWSLETTER_SUBDOMAIN>.substack.com/api/v1/archive. Anyone can find these endpoints: they appear in the “Network” tab of your browser’s developer tools whenever you access relevant data on the web page. And anyone can query them. This approach is similar to web scraping, but it’s more efficient (you’re not asking the server to render a bunch of layout and visual elements you won’t use) and easier to parse (it’s more straightforward to find what you’re looking for in a structured data object than in a lengthy web page).
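To make that concrete, here is a minimal sketch of the thin-wrapper pattern, using only the requests library. The limit and offset parameters are assumptions based on common pagination conventions, and the response schema could change without notice:

```python
import requests

def get_archive(subdomain: str, limit: int = 12, offset: int = 0) -> list[dict]:
    """Fetch post metadata from a newsletter's public archive endpoint.

    The limit/offset pagination parameters are assumed here, not documented;
    Substack can change or restrict this endpoint at any time.
    """
    url = f"https://{subdomain}.substack.com/api/v1/archive"
    resp = requests.get(url, params={"limit": limit, "offset": offset}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # structured post metadata, no HTML parsing required
```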
Platforms often obscure these endpoints, or require authentication, for plenty of good reasons — they don’t want their data easily accessible to the outside world, for example, or their servers aren’t equipped to handle a high volume of outside requests. And between my initial implementation and revisiting this problem, it appears that Substack has hidden its API calls so that they no longer show up in the browser’s developer tools. The endpoints I identified earlier still seem to work without authentication, but I couldn’t find any endpoints corresponding to the new types of requests I wanted to support.
This is where the technical challenges of data collection in a post-API world become acutely felt. The appeal of this kind of tiny software is that it’s lightweight: it offers convenient wrappers around API endpoints so you can plug them into the rest of your code, but it doesn’t bring along heavy dependencies or involved logic. The fallback options in this scenario move away from that goal. Scraping directly from the page is an option, but it would require writing parsing rules that are far more likely to break. And in Substack’s case, the data of interest is rendered via JavaScript, which means a scraping approach would require a much heavier dependency like Playwright or Selenium.
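For a sense of what that heavier fallback looks like, here is a minimal Playwright sketch. The CSS selector is hypothetical, since real class names are obfuscated and shift between deployments, which is precisely what makes this approach fragile:

```python
from playwright.sync_api import sync_playwright

def scrape_rendered_text(url: str, selector: str = ".note-body") -> list[str]:
    """Render a JavaScript-heavy page in a headless browser and extract text.

    The default selector is a placeholder -- real parsing rules would need to
    target obfuscated, unstable class names, and would break when they change.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        texts = [el.inner_text() for el in page.query_selector_all(selector)]
        browser.close()
        return texts
```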
This isn’t a Substack-specific problem — obscured APIs and anti-scraping protections are standard across digital platforms, making the thin-wrapper approach less viable for any kind of data collection. As a consequence, the open access by proxy that these packages enable is falling by the wayside.
What alternatives do we have? From the researcher’s perspective, purpose-built scrapers can still make sense as one-off instruments for a specific project’s data collection. Data donations can provide a comprehensive record of individual users’ behavior, even if coverage across a sample is spotty.
But from the developer’s perspective, I think this limitation demands a shift in how we think about building data collection software. The lightweight, plug-in paradigm is no longer compatible with how platforms are structured, and the needs of data collection and analysis are becoming more involved and intertwined. Rather than packages, then, it might make sense to start thinking in terms of standalone tools — the kinds of tools that could be downloaded as their own program, and might even have a user interface of some kind.
This is a drastic conceptual shift. But in my mind, this retooling allows the program to do heavier work. It can bring in those weightier dependencies without worrying about how they interact with an outside pipeline. It can develop toward robustness around one clearly defined task, rather than adaptability or compatibility. And — thinking toward what researchers might find beneficial — it can extend that approach to include common types of analysis or modeling.
One quick example of that latter point: if you’re researching TikTok, part of your data collection might involve transcribing videos or programmatically identifying their primary subjects. These tasks are possible via software packages in a language like Python, but they often require you to put the pieces together yourself. A package that attempted to offer a full suite of video analysis tools would be ridiculously heavy, and would likely be difficult to even justify as a package — its collection of functionality and models is too specific.
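To make "putting the pieces together yourself" concrete, a typical assembly might pair a speech-to-text model with an off-the-shelf image classifier. The specific libraries, models, and file names here are illustrative choices, not a recommendation:

```python
import whisper                      # openai-whisper; pulls in PyTorch
from transformers import pipeline  # Hugging Face; another heavy dependency

# Piece one: transcribe the video's audio track (file name is hypothetical).
asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("video.mp4")["text"]

# Piece two: label the subject of a frame. Extracting frames from the video
# (e.g., with ffmpeg or OpenCV) is yet another piece you assemble yourself.
classifier = pipeline("image-classification")
labels = classifier("frame_0001.jpg")

print(transcript[:200], [label["label"] for label in labels[:3]])
```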
But as a software tool for researchers with this set of use cases, a bundled approach makes a lot more sense. It provides a single, unified environment that collects data along some parameters, augments the raw output with metadata via machine learning or statistical modeling, and produces a relatively clean sample for downstream analysis.
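As a rough sketch of what that bundled tool's spine might look like (every name, stage, and parameter below is hypothetical, with placeholder stubs standing in for real collection and modeling code):

```python
import csv
from dataclasses import dataclass

@dataclass
class CollectionJob:
    """Parameters a researcher might set, perhaps through a simple UI."""
    query: str
    max_items: int = 100
    transcribe: bool = True

def collect(query: str, max_items: int) -> list[dict]:
    """Stub for the platform-specific collection stage."""
    return [{"id": i, "query": query, "path": f"video_{i}.mp4"} for i in range(max_items)]

def transcribe(path: str) -> str:
    """Stub for a bundled speech-to-text model."""
    return ""

def run(job: CollectionJob, out_path: str = "sample.csv") -> None:
    """Collect raw data, augment it with model output, and export a clean sample."""
    items = collect(job.query, job.max_items)
    for item in items:
        if job.transcribe:
            item["transcript"] = transcribe(item["path"])
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)

run(CollectionJob(query="example", max_items=5))
```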
I think there are still a lot of flaws in this argument, especially when thinking about a general approach to digital data collection (Who will build these tools? How do we decide what functionality to bundle in and what to leave out?). But from the perspective of enabling this kind of data collection (and empowering researchers without a software engineering background to collect high-quality samples), we need to consider this kind of paradigm shift. Data are getting more difficult to collect. Frameworks and models are becoming more resource-intensive. The lightweight package approach can only take us so far, which presents real challenges for computational research in the long run. To meet those challenges, we need an approach to building software that adapts to computational research workflows.