Building Meta-Analysis Datasets with AI Assistance: A Case Study, PART 4: Lessons Learned
- bob.reed
- Mar 17
- 9 min read
Introduction
This is the fourth and final blog in a series documenting an attempt to use AI tools to assist in constructing a dataset for a meta-analysis on the relationship between inflation and subjective well-being. The three previous blogs described, respectively, how ChatGPT was used to develop a systematic literature search strategy; how that strategy was executed across four bibliographic databases and the resulting records organized in Zotero; and how Sysrev's AI-assisted auto-labeler was used for title and abstract screening and data extraction. This blog steps back from the details and asks the broader question: what did we learn?
The honest answer is: more than expected, and not always what we expected. Some AI tools performed impressively well on tasks that might have seemed difficult. Others struggled with tasks that seemed straightforward. And in several cases, the most valuable outcome was not a polished result but a clearer understanding of where the process broke down and why. What follows is an attempt to distill those lessons in a way that might be useful to other researchers considering a similar approach.
As with the previous three blogs, this blog was mostly written by Claude.ai. I simply pasted the first three blogs into Claude and asked it to extract what it thought were the most valuable lessons. What follows is essentially Claude’s product with some light editing by me.
Lesson 1: AI is most valuable as a thinking partner, not an oracle
The single most consistent finding across all three blogs is that AI tools added the most value when they were used interactively and iteratively, not when they were asked to simply produce an answer. This was true at every stage of the workflow.
In the literature search stage, the key contribution of ChatGPT was not to generate a list of keywords but to structure the reasoning process behind keyword selection. The two-pile discrimination exercise — constructing a set of true positives and a set of near misses, then asking ChatGPT to identify what distinguished them — was far more productive than simply asking "what keywords should I use?" ChatGPT's value lay in making implicit judgments explicit: forcing a precise articulation of what counted as a valid inflation variable, what counted as a global subjective well-being measure, and why certain plausible-seeming papers actually failed the inclusion criteria.
The same pattern held in the screening and extraction stages. Both research assistants used ChatGPT not to write their prompts for them, but to help diagnose specific failures and suggest targeted revisions. The diagnostic work — identifying which papers were misclassified and understanding why — required careful human judgment. ChatGPT helped translate those judgments into better prompt language. A researcher who simply asked ChatGPT to "write a prompt for screening papers about inflation and well-being" would have obtained a far less useful result than one who engaged in the iterative back-and-forth that both research assistants employed.
This is a theme that runs through the entire series: AI tools are most useful when they function as a structured interlocutor, pushing the researcher to be more precise, more explicit, and more systematic than they would otherwise be. That is a genuine contribution — but it is not the same as automation.
Lesson 2: The boundary between relevant and irrelevant is harder to define than it looks
One of the most practically useful insights from the literature search stage was the recognition that keyword failure rarely takes an obvious form. The papers that cause the most trouble are not the ones that are clearly irrelevant — those are easy to exclude. The trouble comes from papers that are "almost about the topic": studies that mention inflation but treat it as a psychological stressor rather than a macroeconomic variable; studies that measure well-being but at the domain-specific level of financial satisfaction rather than global life satisfaction; studies that use the right vocabulary but in the wrong context.
Defining the boundary between relevant and irrelevant required more conceptual precision than was initially anticipated. What exactly is an "objective" measure of inflation? Does a perceived inflation measure count if it is used as a predictor of well-being in an otherwise sound empirical model? Is a study that regresses happiness on unemployment and inflation — with inflation as a control rather than the primary variable of interest — within scope or outside it? These are not trivial questions, and they do not have obvious answers. Working through them explicitly, with ChatGPT as a structured interlocutor, produced a much cleaner inclusion/exclusion framework than would have emerged from a more intuitive approach.
The same issue surfaced in the screening stage, where Anne encountered a cascade of boundary problems: "CPI" being read as the Corruption Perceptions Index rather than the Consumer Price Index; "well-being" papers that were actually about job satisfaction or customer satisfaction; inflation studies that measured psychological distress rather than global subjective well-being. Each of these required a specific, targeted fix. The broader lesson is that inclusion criteria that seem clear in the abstract often turn out to be ambiguous at the boundary — and those boundary cases are precisely where AI tools are most likely to go wrong if not carefully supervised.
Lesson 3: Transparency and provenance matter from the start
One of the quieter lessons from the literature collection stage concerns the value of keeping careful records from the very beginning. The decision to tag every imported record with its source database immediately upon import — before any deduplication had taken place — turned out to be essential for tracking provenance after records from different databases were merged. Had that tagging been deferred, it would have been impossible to reconstruct which database contributed which unique records, or to report database-specific retrieval counts in a PRISMA-compliant way.
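To make the point concrete, here is a minimal sketch in Python of what tagging before deduplication might look like. The record fields, tag names, and database names are illustrative only; they are not the actual Zotero export format or the tags used in this project.

```python
# Minimal sketch of source tagging before deduplication.
# Field names, tag names, and databases are illustrative, not the actual Zotero schema.

def tag_records(records, source_db):
    """Attach the source database as a tag to every record at import time."""
    for rec in records:
        rec.setdefault("tags", []).append(f"source:{source_db}")
    return records

def deduplicate(records):
    """Merge records that share a DOI (or, failing that, a title), keeping all source tags."""
    merged = {}
    for rec in records:
        key = rec.get("doi") or rec["title"].lower()
        if key in merged:
            merged[key]["tags"] = sorted(set(merged[key]["tags"]) | set(rec["tags"]))
        else:
            merged[key] = rec
    return list(merged.values())

# Tag first, then merge, so provenance survives deduplication and
# database-specific retrieval counts can still be reported PRISMA-style.
scopus = tag_records([{"doi": "10.1/abc", "title": "Inflation and happiness"}], "Scopus")
wos = tag_records([{"doi": "10.1/abc", "title": "Inflation and happiness"}], "Web of Science")
for rec in deduplicate(scopus + wos):
    print(rec["title"], rec["tags"])
```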
The same principle applies to the broader workflow. Every decision that was made explicitly and documented — which databases to search, why Business Source Complete was substituted for EconLit, how the search strings were adapted for each platform, what the stopping rule was for prompt refinement — is a decision that can be explained and defended after the fact. Every decision that was made informally or left implicit is a potential vulnerability when the work is reviewed or replicated.
AI tools can actually help with this. Because ChatGPT interactions are text-based, the reasoning behind each decision is at least partially preserved in the conversation history. In several cases across this series, revisiting earlier conversations helped reconstruct why a particular choice was made. That is not a substitute for deliberate documentation, but it is a useful supplement.
Lesson 4: In-sample performance is not out-of-sample performance
The data extraction results from Blog 3 illustrate a problem that will be familiar to anyone with a background in econometrics or machine learning: a model — or in this case, a prompt — that fits the training data well does not necessarily generalize. Both research assistants achieved essentially identical in-sample success rates of 98% on their training papers, but their out-of-sample performance diverged substantially (67% for Allen, 41% for Anne) and was considerably worse in absolute terms than their training performance.
This gap is not a failure of effort or care. It is a structural feature of the problem. Training papers are used to develop and refine the prompt, so by construction the prompt is adapted to whatever idiosyncrasies those papers happen to have. Out-of-sample papers introduce new formats, new reporting conventions, and new edge cases that the prompt was not designed to handle. The result is a drop in performance that is entirely predictable in retrospect, even if it is disappointing in the moment.
The practical implication is that researchers should not take high training success rates as evidence that their extraction prompts are ready for deployment. A meaningful held-out test set — papers that were set aside before prompt development began and never used during the refinement process — is essential for obtaining an honest estimate of how well the prompt will perform on the actual literature. If out-of-sample performance is substantially lower than in-sample performance, that is a signal to reconsider the prompt design, not just to apply it and hope for the best.
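As a concrete illustration of the held-out logic, here is a minimal Python sketch. The extraction step is represented by a placeholder function, and the split proportion and function names are assumptions, not the procedure actually used with Sysrev.

```python
# Minimal sketch of an in-sample / out-of-sample evaluation for an extraction prompt.
# extract_with_prompt() is a placeholder for the AI extraction step; it is hypothetical.
import random

def split_papers(papers, holdout_share=0.3, seed=42):
    """Set aside a held-out test set BEFORE any prompt refinement begins."""
    rng = random.Random(seed)
    shuffled = papers[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_share))
    return shuffled[:cut], shuffled[cut:]   # (training papers, held-out papers)

def success_rate(prompt, papers, ground_truth, extract_with_prompt):
    """Share of papers whose extracted values match the hand-coded ground truth."""
    hits = sum(extract_with_prompt(prompt, p) == ground_truth[p] for p in papers)
    return hits / len(papers)

# Refine the prompt using only the training papers; score the held-out papers once, at the end.
# A large gap between the two rates is the signal that the prompt has over-fit its training set.
```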
Allen's experience points toward one design principle that may help with generalization: separating the search phase from the filtering phase. By instructing Sysrev to first locate all regression specifications containing an inflation variable, and only then apply the inclusion criteria, Allen built a prompt that was more robust to variation in how papers lay out their tables and results. That kind of structural robustness is likely to matter more for out-of-sample performance than fine-tuning the language of the inclusion criteria.
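The structural idea is easy to state in code. The sketch below captures only the shape of the two-phase design, with placeholder functions standing in for what Allen's prompt actually asked Sysrev to do.

```python
# Sketch of the search-then-filter structure. Both helper functions are placeholders.
def two_phase_extract(paper, find_inflation_specifications, meets_inclusion_criteria):
    """Phase 1: locate every regression specification containing an inflation variable,
    regardless of how the paper formats its tables.
    Phase 2: apply the inclusion criteria only to what phase 1 found."""
    candidates = find_inflation_specifications(paper)  # broad, format-tolerant search
    return [spec for spec in candidates if meets_inclusion_criteria(spec)]  # narrow filter
```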
Lesson 5: Different tasks require different skills
One of the more instructive findings from the screening and extraction exercises was that performance on one task did not predict performance on the other. Both Anne and Allen performed well on title and abstract screening (balanced accuracies of 97% and 94% respectively), yet Anne substantially underperformed Allen on out-of-sample data extraction (41% versus 67%). The same two people, working with the same tools and the same general approach, produced quite different relative results on two tasks that are part of the same overall workflow.
This makes sense in retrospect. Screening well requires symmetric attention to both types of error — being too inclusive is just as problematic as being too restrictive — and it requires careful attention to boundary cases in the vocabulary of inclusion and exclusion. Both research assistants ultimately succeeded at this, though Anne's more aggressive approach to specifying hard exclusion criteria from the outset gave her a slight edge. Data extraction, on the other hand, requires understanding how papers structure their results and building a search logic that can navigate that variation. Allen's insight about separating search from filtering gave him a clearer advantage here.
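For readers unfamiliar with the metric, balanced accuracy is simply the average of the screener's accuracy on relevant papers and its accuracy on irrelevant papers, so over-inclusion and over-exclusion are penalized equally. A minimal sketch, with made-up counts that are not the actual screening results:

```python
# Balanced accuracy from screening counts. The counts in the example are invented
# for illustration; they are not the actual results reported in Blog 3.
def balanced_accuracy(tp, fn, tn, fp):
    """Average of sensitivity (share of relevant papers kept) and
    specificity (share of irrelevant papers excluded)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Example: 45 of 50 relevant papers kept, 210 of 221 irrelevant papers excluded.
print(balanced_accuracy(tp=45, fn=5, tn=210, fp=11))  # ~0.925
```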
The broader implication is that researchers should not assume that someone who performs well at one stage of the meta-analytic workflow will automatically perform well at another. The skills are related but not identical, and the prompt design strategies that work for screening do not straightforwardly transfer to extraction.
Lesson 6: AI tools accelerate the process but do not replace judgment
It is worth being direct about what AI assistance did and did not accomplish in this project. On the positive side, it substantially accelerated several stages of the workflow. Developing a keyword search strategy from scratch, without ChatGPT's assistance, would have taken considerably longer and would likely have produced a less systematic result. Running Sysrev's auto-labeler on 271 records, even imperfectly, was faster than manual title and abstract screening of the same set. And using AI to assist with prompt refinement was more efficient than purely trial-and-error iteration.
On the negative side, AI assistance did not eliminate the need for careful human judgment at any stage. The two-pile discrimination exercise required substantive expertise about what the target literature actually looks like. Diagnosing screening and extraction failures required close reading of individual papers and careful comparison against ground truth. Deciding when to stop refining a prompt — when further iteration would yield diminishing returns or introduce new problems — required judgment that no AI tool supplied.
Perhaps most importantly, the quality of AI assistance was directly proportional to the quality of the human input. A vague or poorly structured prompt to ChatGPT produced vague or poorly structured output. A precisely framed diagnostic question — "here are three papers my prompt misclassified; what features do they share that might explain the error?" — produced useful, actionable suggestions. The researcher's ability to ask good questions mattered as much as the AI's ability to answer them.
Lesson 7: Do not rely on a single model
This final lesson reflects broader experience with AI tools rather than anything unique to this project. AI outputs should not be treated as singular, authoritative results; the same query can yield different answers across runs, and different models—trained on different data and with different objectives—often produce meaningfully different outputs. Errors, including hallucinations and misinterpretations, are real. However, these errors are often only weakly correlated across runs and especially across models, which creates an opportunity to improve reliability through triangulation rather than trust.
In practice, this means repeating queries, asking models to check or critique their own outputs, and—most importantly—cross-checking results across multiple systems (e.g., ChatGPT, Claude, Gemini, Grok). Agreement across models provides a useful robustness check, while disagreement highlights cases requiring closer human review. The parallel to meta-analysis is clear: just as we do not rely on a single study to draw inference, we should not rely on a single AI output. Reliability comes from aggregating across imperfect signals, not from assuming any one of them is correct. An illustrative example of this idea—what might be called “Duelling AIs”—is provided here.
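A minimal sketch of what this triangulation can look like in practice is given below. The query function is a placeholder rather than a real client library, and the model names are arbitrary; the point is only the voting-and-flagging logic.

```python
# Minimal sketch of cross-model triangulation for a single screening decision.
# query_model() is a placeholder, not a real API client; model names are arbitrary.
from collections import Counter

def triangulate(paper_id, models, query_model):
    """Ask several models for an include/exclude label and flag any disagreement."""
    labels = {m: query_model(m, paper_id) for m in models}
    counts = Counter(labels.values())
    label, votes = counts.most_common(1)[0]
    unanimous = votes == len(models)
    # Agreement is treated as a robustness check, not as proof;
    # any disagreement routes the paper to a human reviewer.
    return {"majority_label": label, "unanimous": unanimous,
            "votes": dict(counts), "needs_human_review": not unanimous}

# Example with a stub in place of real model calls:
stub = lambda model, paper: "include" if model != "model_c" else "exclude"
print(triangulate("paper_017", ["model_a", "model_b", "model_c"], stub))
```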
Looking Forward
This series has documented one researcher's attempt to learn how AI tools can and cannot assist in the construction of a meta-analytic dataset. It has been an exercise in productive failure as much as in success: some approaches worked well, others did not, and the failures were often more instructive than the successes.
Several questions remain open. The data extraction results, in particular, leave room for improvement. Out-of-sample success rates of 41% to 67% are better than nothing, but they are not good enough to use without substantial human review and correction. It is possible that different prompt architectures, more extensive training sets, or different AI platforms would produce better results. It is also possible that some papers — those with unusual or highly complex table structures — will remain difficult to extract automatically regardless of how the prompt is designed.
The broader question is how to think about AI-assisted meta-analysis as a methodology. The tools described in this series are genuinely useful, but they are not yet reliable enough to be used without careful human oversight at every stage. The appropriate model, at least for now, is not full automation but augmented human judgment: AI tools that handle the high-volume, repetitive parts of the workflow while humans retain responsibility for the decisions that require genuine expertise. Whether that balance will shift as the tools improve is an open question — one worth watching closely.
In the meantime, the most useful thing this series can offer other researchers is probably not a set of recipes to follow, but an honest account of what the process actually looked like: where it was easier than expected, where it was harder, what went wrong, and what was learned from going wrong. I hope this account of our experience proves useful as you explore how AI might assist you in your own meta-analyses.

