Structured Outputs: Making LLMs Reliable for Document Processing

Nick Hagar
Generative AI in the Newsroom
10 min read · Dec 5, 2024


A guide — and prototype — for getting clean data out of PDFs

A schema acts as a blueprint, telling the model exactly what data to extract from a PDF and how to format it.

Journalists get PDFs as responses to FOIA requests, through document dumps, and via white papers. To make use of those PDFs, journalists need to get data out of the documents and into an analysis-friendly format, like a spreadsheet. The process can involve laborious manual transcription or copying and pasting data from one format to another.

Theoretically, large language models can assist with document processing, but risks like hallucinations and the inherent uncertainty of LLM outputs make this approach tricky. Journalists need to be certain the output actually contains the needed data, uses the expected data types, and arrives in a usable format.

Structured outputs offer a solution to these challenges. Providers like Anthropic and OpenAI, and open-source libraries like Outlines, allow developers to define strict schemas that constrain LLM responses to specific fields, data types, and formats.

Structured outputs transform raw LLM capabilities into reliable data processing pipelines. When extracting tables from multi-page PDFs, for instance, a schema ensures consistent column names and data types across pages. While this approach cannot guarantee perfect accuracy, it reduces the engineering complexity of parsing and validating LLM responses, making document processing workflows both more reliable and more maintainable.

A schema is essentially a blueprint that tells the model exactly what information to look for and how to organize it. Think of it like a standardized form: rather than letting the model return data in any format it wants, the journalist provides specific fields to fill in — this is a date, that’s a dollar amount, this other thing should be a yes/no value. Just as a tax form ensures everyone reports their income in the same way, a schema ensures the model extracts data in a consistent, predictable format.

Using schema-based structured outputs with LLMs requires familiarity with three aspects of Python programming: data types, class structure, and how to interact with the OpenAI (or other provider) API. To demonstrate the potential of this approach, below are some example processing pipelines built with OpenAI’s GPT-4o model — first a walkthrough setting up the key concepts, then several real-world documents. You can also try this approach yourself with the prototype I’ve built: upload a document, and the system will suggest an appropriate schema and extract the data automatically.

Setup

Every case of working with structured outputs involves a workflow of 1) identifying the needed data from the input document(s), 2) defining a schema that describes that data, and 3) sending a request to OpenAI with this schema and a prompt. To walk through this process, let’s say we have a stack of business cards from a recent conference that we want to put into a spreadsheet.

Mock business cards that we want to process with an LLM

Data

Every card is formatted differently, and they all have slightly different information. For our spreadsheet, let’s say we need a name, job title, company, and email address for each business card. This defines the data that we will include in a schema — everything else gets ignored.

Schema

For structured outputs, schemas are defined using Pydantic, a Python library that lets us specify exactly what data we want and how it should be formatted. Pydantic schemas take the form of classes, where each data field is a class attribute with an associated data type. Since all the fields in these business cards are text (as opposed to, e.g., dates or numbers), they are all string types:
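A minimal sketch of that schema might look like this, with one string attribute for each field we want from the cards:

```python
from pydantic import BaseModel

class BusinessCard(BaseModel):
    # Every field on a business card is free text, so each one is a string
    name: str
    job_title: str
    company: str
    email: str
```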

This expresses the data required from each business card in a structured, machine-readable format.

Getting data from OpenAI

Finally, a request sent via OpenAI’s Python library lets the LLM ingest each business card, identify the required fields, and return the requested information as a structured JSON object (see this guide from OpenAI on working with image data).

Code:
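Something along these lines, using the openai library's structured-output helper (the file name and prompt wording here are illustrative):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Encode the business card image so it can be sent inline with the request
with open("business_card.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the contact details from this business card."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    # Constrain the response to the BusinessCard schema defined above
    response_format=BusinessCard,
)

card = completion.choices[0].message.parsed
print(card.model_dump_json(indent=2))
```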

Output:
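The parsed response comes back as a JSON object with exactly the fields in our schema. With placeholder values standing in for a real card, it looks like this:

```
{
  "name": "Jane Doe",
  "job_title": "Data Editor",
  "company": "Example News Co.",
  "email": "jane.doe@example.com"
}
```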

We can then apply this schema to every business card in the stack, turning a disorganized pile of documents into a standardized spreadsheet with exactly the information we need. And because LLMs are flexible, and schemas can describe all sorts of data, this approach works for a wide range of real-world applications.
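Processing the whole stack is then a short loop that writes the parsed fields to a CSV (the extract_card helper below is a hypothetical wrapper around the API call shown above):

```python
import csv
import glob

# Process every card image and collect the parsed fields
rows = []
for path in glob.glob("cards/*.jpg"):
    card = extract_card(path)   # hypothetical helper wrapping the API call above
    rows.append(card.model_dump())

# Write the results to a spreadsheet-friendly CSV
with open("contacts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "job_title", "company", "email"])
    writer.writeheader()
    writer.writerows(rows)
```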

Real-world data

To demonstrate how this workflow might be used for reporting, below are several examples of using structured outputs on real-world documents.

First is an example using campaign finance reports for a recent mayoral election in my hometown of Fort Wayne, Ind. Federal campaign finance data are regulated by the FEC and accessible through websites like OpenSecrets, but local data are more difficult to query. Reports are stored in state- or county-level records repositories and may not follow a standard format. They often aren’t OCRed — in some places, records are handwritten and scanned into the PDF format. An LLM can make the process of cleaning these data for downstream analysis much smoother.

A campaign finance report like this one often has several components — a cover sheet, tables of contributions from individuals and organizations, and tables of expenses and debts. Structured outputs can handle any of these formats reasonably well; to start, let’s transform a table into a spreadsheet format.

Looking at a sample of individual contributions, there’s a straightforward table layout that we can translate into a schema:

An example of individual contributions from a campaign finance report, which we’ll extract with GPT-4o
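A sketch of that schema, with field names that are my guesses at the printed column headers:

```python
from pydantic import BaseModel

class ContributorRecord(BaseModel):
    # One row of the contributions table, mirroring the columns as printed
    name_address_occupation: str   # the report lumps these into a single column
    date_and_received_by: str      # the report's combined date / receiver column
    contribution_type: str
    amount: float

class ContributorTable(BaseModel):
    # The whole table is a list of row records
    records: list[ContributorRecord]
```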

The code creates a blueprint for our table: first we define what each row should look like (a ContributorRecord with fields for names, amounts, dates, etc.), then we create a structure for the whole table (ContributorTable) that will contain a list of these records. Each field has a specific type — text fields are marked as ‘str’, dollar amounts as ‘float’ — which helps ensure our extracted data will be clean and consistent.

But as with many documents of this type, there are also some quirks that we might want to account for, and that might be challenging for other data extraction approaches. The report lumps name, address, and occupation into one column; we probably want those separated out. Date and receiver are combined in an unusual split column, and contribution type is recorded as a mix of optional checkboxes and free text entry. We can update our schema to account for all these features, transforming a non-standard table into a spreadsheet format that we can work with.
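A revised row model might split those composite columns into their own fields, along these lines (field names again illustrative):

```python
from pydantic import BaseModel

class ContributorRecord(BaseModel):
    # Composite columns from the report, split into separate fields
    name: str
    address: str
    occupation: str
    date: str                 # could be typed as a date if the format is consistent
    received_by: str
    contribution_type: str    # checkbox selection or free-text entry, as one string
    amount: float

# ContributorTable stays the same: a list of ContributorRecord rows
```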

And, after sending these pages to OpenAI’s API, along with our desired schema, the LLM produces a well-formatted spreadsheet of information that we can use for further analysis. Even better, we don’t have to OCR or preprocess the pages in any way — we can send images directly to OpenAI and get back data.

What about a less standard data format? Let’s take a look at the cover sheet of this report, which contains a mix of key/value pairs, free text fields, and a table of financial data.

A campaign finance report cover sheet, with a mix of different kinds of structured data

Again, we can define a schema to capture the fields that we’re interested in. This can be selective — let’s say we care about the committee information and the financials, but not any of the information about the candidate or the report type:
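One way to express that selectivity, with field names that are assumptions based on the form rather than its exact labels:

```python
from pydantic import BaseModel

class CommitteeInfo(BaseModel):
    committee_name: str
    address: str
    treasurer_name: str

class Financials(BaseModel):
    itemized_contributions: float
    unitemized_contributions: float
    total_contributions: float
    total_expenditures: float
    cash_on_hand: float

class CoverSheet(BaseModel):
    # Only the sections we care about; candidate details and report type are omitted entirely
    committee: CommitteeInfo
    financials: Financials
```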

This lets us not only structure data from a messy document, but also to preemptively filter out anything that isn’t relevant for downstream analysis:

Because GPT-4o can ingest as many images as will fit into its context window, this approach can also handle data that breaks across multiple pages. Consider this polling data, from the 2023 New Hampshire Republican primary:

The data are well-formatted, but, as with the campaign finance report, they’re difficult to ingest for machine analysis. Some questions break over multiple pages, and the layout includes formatting and information that is extraneous to the questions and answers themselves. But we can define a schema that corresponds to only the parts of the document we care about, send all the pages to the model as images at once, and get back a robust list of questions and responses from this survey:
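One way to express that, with illustrative names, is a question-and-responses schema sent alongside all of the page images in a single request:

```python
from pydantic import BaseModel
from openai import OpenAI

class ResponseOption(BaseModel):
    label: str        # a candidate name or answer choice
    percentage: float

class PollQuestion(BaseModel):
    question_text: str
    responses: list[ResponseOption]

class Poll(BaseModel):
    questions: list[PollQuestion]

client = OpenAI()

# Build one request containing every page of the poll as an image
content = [{"type": "text", "text": "Extract every question and its response percentages."}]
for page_b64 in page_images_b64:  # base64-encoded page renders (hypothetical variable)
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{page_b64}"}})

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
    response_format=Poll,
)
poll = completion.choices[0].message.parsed
```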

In all three cases, it’s important to note that consistent structure does not equate to perfect data quality. For the campaign finance cover sheet, the model gets several details wrong (e.g., mixing up itemized and unitemized contributions). For the individual contributions, it struggles with contributor occupations, and it makes transcription errors (e.g., transposing digits, misspelling street names). While it is able to record the polling data accurately, these examples serve as a note of caution — for critical data, human review of model outputs is still very much required.

Because LLMs are so flexible, we can extend this schema-based approach even further. Rather than mapping out each field in a document, assigning it a data type, and coding up a schema, we can input the document and have the model generate the schema itself. To get back an accurate and valid schema, this approach requires a few steps: first reasoning about what kind of data the document contains, then generating a first pass of the schema, then cleaning it up to conform to OpenAI's API requirements. But this pipeline produces schemas that are good enough as a jumping-off point, and they can often be used for data extraction as-is.
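A rough sketch of that chain of calls (with abbreviated prompts, and not the prototype's exact implementation) might look like:

```python
from openai import OpenAI

client = OpenAI()

def describe_document(page_b64: str) -> str:
    """Step 1: ask the model to reason about what data the document contains."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe the distinct data fields in this document and their likely types."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

def draft_schema(description: str) -> str:
    """Step 2: generate a first pass of a JSON Schema from that description."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Write a JSON Schema for extracting these fields:\n{description}"}],
    )
    return resp.choices[0].message.content

def clean_schema(draft: str) -> str:
    """Step 3: revise the draft so it meets OpenAI's structured-output requirements
    (for example, every field required and additionalProperties set to false)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Rewrite this JSON Schema so it is valid for OpenAI structured outputs:\n{draft}"}],
    )
    return resp.choices[0].message.content
```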

System diagram of a schema generation workflow

You can try out this schema-based data extraction with your own documents, with the prototype demo that I’ve uploaded to Streamlit.

How does it compare?

Of course, schema-based extraction with foundation models isn’t the only way to process documents. OCR has been around for a long time, and there are purpose-built tools for helping journalists process PDFs. Using these same test cases, I’ll walk through some points of comparison between the process described above and some alternative approaches.

Local LLM

One of the drawbacks of the approach that I’ve outlined is that it requires sending your data off to OpenAI or another external model provider. This means accruing the cost associated with querying foundation models, as well as taking on the risk of an outside organization gaining access to your data — a nonstarter for some reporting projects. To mitigate this, we can test the same pipeline with a small, permissively licensed model, running entirely locally. Other than specifying a different model, this change doesn’t even require any changes to our implementation. Qwen2-VL-7B-Instruct accepts multimodal inputs, and, if we run it through LM Studio, we can continue using structured outputs to define our desired schema.
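As a sketch, the only change is pointing the client at LM Studio's local, OpenAI-compatible server and swapping the model identifier (the URL and model name below are typical defaults and may differ on your machine):

```python
from openai import OpenAI

# Point the same client at LM Studio's local, OpenAI-compatible server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.beta.chat.completions.parse(
    model="qwen2-vl-7b-instruct",   # whatever identifier LM Studio shows for the loaded model
    messages=messages,               # same image + prompt messages as before
    response_format=CoverSheet,      # same Pydantic schema as before
)
```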

One big caveat: these local models struggle with processing multiple images in one call (although your mileage may vary depending on the amount of memory on your machine). But for single pages, the results on our test cases aren't too far behind GPT-4o's. For the campaign finance cover sheet, the local model misses some fields, and there are again transcription errors, but it's the only approach I tested that correctly identifies all the financial information.

So while a local model might require a little more patience to process the data, and a little more manual review, it can also be a viable solution when cost or data security are important.

OCR + table extraction

This is the most established approach for pulling data out of documents. For these test cases, I used Docling, a state-of-the-art system with models for OCR and table extraction developed by researchers at IBM. And while the transcription quality of Docling is comparable to that of the LLMs, the output is generally unusable (especially when data don’t follow a standard table format). Here, for example, is a snippet of Docling’s output, in markdown format, for the polling data:

Pinpoint

The options described above all require some level of technical investment to implement. For an off-the-shelf solution, I turned to Google’s Pinpoint. Pinpoint is billed as a research tool for journalists, designed for searching, filtering, and analyzing large sets of documents. Pinpoint also has a feature, currently in beta, that allows journalists to extract structured data from these documents.

This feature works well for tables. In the UI, you can highlight a table, and Pinpoint automatically identifies the column headers and values. It also looks through all the documents uploaded in a project, finds tables with the same schema, and extracts the data from those tables automatically.

However, Pinpoint also has some limitations. It doesn't allow for any adjustments to the columns in the output table. This means that, for example, the contributor data we explored above has name, address, and occupation lumped together in a sometimes messy chunk of text. To apply the table schema automatically across all occurrences, the tables have to be spread across multiple documents; it doesn't work for different pages within the same document. And the structured data feature is far less useful for non-table data. The polling data, for example, wasn't really possible to extract with this approach. So while some tabular data might be a good fit, and Pinpoint offers a (currently free) off-the-shelf solution, it isn't nearly as flexible as the schema-based approaches.

Conclusion

Schema-based document processing with foundation models occupies an interesting middle ground. It’s more flexible than traditional OCR approaches, allowing us to handle diverse document formats and extract complex information. It’s more reliable than pure LLM chat interfaces, with built-in validation through schema enforcement. There are still limitations to this approach: It requires technical knowledge of Python programming, and it doesn’t eliminate transcription errors or the need for human review. But at the same time, it can dramatically streamline the mechanical work of transforming documents into structured data.

For journalists working with local government documents, researchers processing survey responses, or developers building document analysis pipelines, this approach offers a practical path forward. You can process documents at scale while maintaining control over the output format, whether you’re using GPT-4o or running everything locally. The technology isn’t perfect — careful validation is still essential, especially for critical data. But by constraining LLM outputs to specific schemas, we can start to build reliable document processing workflows that help us spend less time copying and pasting, and more time on substantive analysis.
