Generative agents as research tools

Nick Hagar
Aug 30, 2023

Read it first on my Substack

In recent months, researchers have begun exploring what LLMs can tell us about human behavior. Because they’re trained on a large corpus of human expression, the thinking goes, these models should be able to produce a facsimile of how we think and act. For example, consider this prompt and response:

PROMPT: You're an elementary school teacher in a mid-sized American city. In bullet points, write out what you do on a typical day.

RESPONSE (EXCERPT): **Morning:**

- Arrive at school before students to prepare for the day ahead.

- Set up the classroom, including arranging desks, checking supplies, and organizing materials.

- Review the lesson plans and activities for the day.

- Make sure any special accommodations or individual student needs are accounted for.

- Greet students as they enter the classroom, helping create a positive and welcoming atmosphere.

This is a reasonable, if generic, representation of a teacher’s typical behavior in summary form. You could imagine fine-tuning the model further to generate responses for a particular subject area, for weekend activities and relationships, and so on.
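(For the curious, here’s a rough sketch of how you might generate a response like this programmatically. It uses the OpenAI Python client’s pre-1.0 interface; the model name, temperature, and API key placeholder are illustrative rather than prescriptive.)

```python
# A minimal sketch of prompting an LLM with a persona, using the OpenAI
# Python client (v0.x interface). Model name and parameters are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

persona_prompt = (
    "You're an elementary school teacher in a mid-sized American city. "
    "In bullet points, write out what you do on a typical day."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": persona_prompt}],
    temperature=0.7,  # some randomness, so repeated runs vary
)

print(response.choices[0].message.content)
```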

And in fact, researchers at Stanford and Google did a version of this several months ago. In “Generative Agents: Interactive Simulacra of Human Behavior”, the researchers gave 25 instances of ChatGPT different prompts, each containing a name, a detailed personal and professional background, and a description of important relationships. They then allowed the instances to act as agents within a simulation: describing their actions and interactions, communicating with each other in natural language, and generally going about their day. This exercise produced a realistic facsimile of communal human behavior, the researchers argue:

Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day.
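To make the setup concrete, here’s a heavily simplified sketch of what a generative-agent loop might look like. This is not the paper’s actual architecture, which layers memory retrieval, reflection, and planning on top of something like this; the class, the llm() stub, and the example persona are all illustrative.

```python
# A heavily simplified generative-agent loop: each agent has a persona and a
# running memory, and a language model call produces its next action.

def llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned string so this runs."""
    return "wakes up, makes coffee, and opens the cafe for the morning."

class GenerativeAgent:
    def __init__(self, name: str, background: str):
        self.name = name
        self.background = background      # personal and professional description
        self.memory: list[str] = []       # chronological record of observations

    def act(self, time_of_day: str, observations: list[str]) -> str:
        self.memory.extend(observations)
        prompt = (
            f"{self.background}\n"
            f"Recent memories: {'; '.join(self.memory[-10:])}\n"
            f"It is {time_of_day}. In one sentence, what does {self.name} do next?"
        )
        action = llm(prompt)
        self.memory.append(f"At {time_of_day}, {self.name} {action}")
        return action

# One simulation step: every agent acts, and each action becomes an observation
# that other agents can receive on the next step.
agents = [GenerativeAgent("Isabella", "Isabella runs a small cafe in town.")]
for agent in agents:
    print(agent.name, agent.act("8:00 am", ["The cafe is empty."]))
```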

This is intriguing work with several immediate applications; the authors note immersive virtual environments as one use case, mirroring ongoing generative AI work in video games. From a social science perspective, though, this work also raises questions about the *epistemology* of LLM output. In other words, what do we know by recording a model’s representation of a particular human behavior? Could this kind of exercise provide deeper insight into human communication or social structures?

Simulations of a different kind have long been a valuable tool in the social sciences. Agent-based models allow researchers to quickly ideate, surfacing candidate phenomena for deeper theory building. Our work on news popularity on Reddit took this approach, comparing simulated hypothetical conditions to an empirical dataset. Agent-based models also demonstrate the ways in which simple rulesets can produce the kinds of aggregate outcomes that social scientists love to explain. Schelling’s model of segregation is a great example of this.
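For a sense of how little machinery a classic agent-based model needs, here’s a compact sketch of a Schelling-style grid model. The grid size, vacancy rate, and tolerance threshold are arbitrary choices for illustration.

```python
# A compact sketch of Schelling's segregation model: agents of two types sit
# on a grid and relocate if too few of their neighbors share their type.
import random

SIZE, VACANCY, TOLERANCE = 20, 0.1, 0.3  # grid width, empty share, min same-type neighbor share

def make_grid():
    cells = [None] * int(SIZE * SIZE * VACANCY)
    rest = SIZE * SIZE - len(cells)
    cells += ["A"] * (rest // 2) + ["B"] * (rest - rest // 2)
    random.shuffle(cells)
    return [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def unhappy(grid, r, c):
    agent = grid[r][c]
    neighbors = [
        grid[(r + dr) % SIZE][(c + dc) % SIZE]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    occupied = [n for n in neighbors if n is not None]
    if not occupied:
        return False
    return sum(n == agent for n in occupied) / len(occupied) < TOLERANCE

def step(grid):
    movers = [(r, c) for r in range(SIZE) for c in range(SIZE)
              if grid[r][c] is not None and unhappy(grid, r, c)]
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE) if grid[r][c] is None]
    random.shuffle(empties)
    for (r, c), (er, ec) in zip(movers, empties):
        grid[er][ec], grid[r][c] = grid[r][c], None
    return len(movers)

grid = make_grid()
for _ in range(50):
    if step(grid) == 0:  # everyone satisfied
        break
```

Even a toy ruleset like this tends to sort the grid into homogeneous clusters, despite each agent being willing to live in a mostly mixed neighborhood. That’s the appeal: the high-level pattern is fully attributable to a handful of transparent rules.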

So from one perspective, the idea of using simulated data to explore human behavior is nothing new. And in some ways, the generative approach is just a more complex version of the same principles: simple, low-level operations (weights predicting the next piece of text) form the basis of the model, and their higher-level output (and the interactions between those outputs) is the observed dynamic of interest.

But the generative approach also adds some new wrinkles. With added complexity comes the loss of interpretability, and of direct attribution. It’s no longer straightforward (or really even possible) to connect the high-level dynamics to a simple ruleset, because the rules are obfuscated within large, often proprietary systems. It’s more difficult to isolate, say, the effect of a change in input on its corresponding output with so many layers of inference in between. The output is also heavily model-dependent: the response to a prompt from ChatGPT might vary from that of Bard (or Llama 2, or GPT-2…) for reasons that are not at all transparent to the researcher. With so many factors at play, generative agents may convincingly fill the role of a human-like actor, but they introduce an undesirable amount of complexity into an agent-based model.
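If you wanted to probe that model dependence directly, one low-tech approach is to send the same prompt to several models and compare the outputs side by side. A sketch, with a hypothetical query_model() wrapper standing in for the various API clients and placeholder model names:

```python
# A sketch of probing model dependence: send the same persona prompt to several
# models and store repeated outputs side by side for qualitative comparison.
import csv

PROMPT = (
    "You're an elementary school teacher in a mid-sized American city. "
    "In bullet points, write out what you do on a typical day."
)
MODELS = ["gpt-3.5-turbo", "gpt-4", "llama-2-13b-chat"]  # illustrative names

def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper: route to the appropriate API client for `model`.
    Returns a canned string here so the sketch runs without credentials."""
    return f"[{model} response placeholder]"

with open("model_comparison.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "run", "response"])
    for model in MODELS:
        for run in range(5):  # repeated runs, since sampling is stochastic
            writer.writerow([model, run, query_model(model, PROMPT)])
```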

So from the perspective of obtaining valid research results, LLMs may not work as simulation tools. They emulate some broad “average” of human behavior as expressed through the training data and the model architecture, which is not an especially interpretable process. But there are still places where they could fit into study design.

Much like traditional agent-based models, a generative agent could act as a study prototype, or a proof of concept for a model design. Research is costly and time-consuming: what if you could do a test run of your study before going through the trouble of real-world data collection? If carefully prompted, perhaps an LLM could act as a sort of minimum viable study participant. Researchers could use these simulated interactions to spot holes in a simulation’s design, to look for unexpected responses to survey questions, or to test out a theory on a small human-like sample. The actual research still gets carried out at scale, with human participants and interpretable models. But by going through an initial round of ideation and testing with an LLM, researchers might be able to deploy their resources better, homing in on the most promising projects faster.
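As a sketch of what that might look like in practice: pilot a survey instrument against a handful of synthetic personas before recruiting anyone. The personas, questions, and llm() stub below are all illustrative.

```python
# A sketch of an LLM as a "minimum viable study participant": run draft survey
# questions past a few synthetic personas to catch confusing wording early.

def llm(prompt: str) -> str:
    """Stand-in for a language model call; returns a canned string here."""
    return "[simulated participant response]"

PERSONAS = [
    "a 34-year-old nurse who reads news mostly on her phone",
    "a retired accountant who watches cable news every evening",
    "a college sophomore who gets news from social media",
]

QUESTIONS = [
    "How many days in the past week did you read a news article in full?",
    "On a scale of 1-5, how much do you trust your primary news source?",
]

pilot_responses = []
for persona in PERSONAS:
    for question in QUESTIONS:
        prompt = (
            f"You are {persona}. Answer the following survey question in one "
            f"or two sentences, and note anything about the wording you find "
            f"confusing.\n\nQuestion: {question}"
        )
        pilot_responses.append((persona, question, llm(prompt)))

# Skim the pilot output for ambiguous wording or unexpected interpretations
for persona, question, answer in pilot_responses:
    print(f"{persona} | {question} | {answer}")
```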
