Building Meta-Analysis Datasets with AI Assistance: A Case Study. PART 1: Using ChatGPT to Assist with Literature Searching
- bob.reed
- Mar 17
Introduction to this series of blogs
This is the first of four blogs on how AI can assist in assembling datasets for meta-analyses. The four blogs cover the following topics:
Using ChatGPT to Assist with Literature Searching
Using ChatGPT to Assist with Collecting and Organizing Studies
Using SysRev for Title and Abstract Screening and Data Extraction
Lessons Learned
These blogs chronicle my efforts to learn about how AI can assist in the process of selecting studies for inclusion in a meta-analysis. Some of this may already be familiar to you. And you may have already figured out better ways than what I describe here. If so, then share your experience with your own blog! I am keen to learn as much as I can about this.
One more thing. I saved my extensive chat with ChatGPT and then printed it out as a pdf and gave it to Claude.ai. I then asked Claude to write up my interactions with ChatGPT as a blog. I took what Claude wrote and then did some editing. So what you are reading is mostly Claude’s summary of my interactions with ChatGPT.
This project began with an admittedly vague question: What is the relationship between inflation (treatment) and health (outcome)? The original motivation for this came as I was helping a PhD student find a topic for her thesis. I then decided to turn it into an AI learning exercise for me. It became apparent very quickly that my original question was too broad for a meta-analysis. Health is measured by a variety of metrics, encompassing physical illness, diagnosed mental health, healthcare utilization, and subjective assessments of well-being. Combining estimated treatment effects across these categories -- even if using partial correlation coefficients to put all the estimates on the same scale -- would mix inherently noncomparable effects.
Accordingly, I asked ChatGPT for help in narrowing my focus. ChatGPT first suggested "inflation and mortality," since mortality is a well-defined health outcome, facilitating the aggregation of estimated effects. Further, a quick literature search suggested that there were sufficient studies to support a meta-analysis. However, further reflection and interaction with ChatGPT discouraged me from pursuing this line of analysis.
The effects of inflation on a population-level outcome like mortality would likely manifest with substantial and highly variable lags, making direct causal measurement and synthesis extremely challenging. A related difficulty was knowing whether transient inflationary spikes or sustained periods of price increases should constitute the appropriate "treatment." These considerations moved me to find an outcome with a more direct and immediately measurable response to inflation. That led me to explore "happiness" and related subjective well-being measures. My subsequent interaction with ChatGPT consisted of developing a search strategy that would both (i) yield treatment effects that were conceptually similar and (ii) produce a sufficient number of studies.
An iterative approach to literature searching
It is my experience that the standard approach for literature searching consists of finding a set of keywords that will identify a broad initial pool of potentially relevant studies. This pool is then manually screened and narrowed down to a final set of studies from which data can be extracted.
When I asked ChatGPT for advice on how I should do my literature search, it suggested an alternative approach: finding keywords that defined the conceptual boundaries between relevant and irrelevant literature. The key component in this process is keyword construction. Keyword failure rarely takes an obvious form. These failures tend to occur at the boundary between relevance and irrelevance. Papers that are "almost about the topic" crowd out those that truly are. Detecting these boundary failures requires deliberate diagnostic effort.
That is where ChatGPT entered the workflow. Its role was to provide a procedure for constructing keywords that articulated the boundary between relevant and irrelevant studies. Accordingly, ChatGPT recommended a four-stage pipeline. The defining feature of that pipeline—described in detail in the next section—is its focus on keyword boundary testing.
A four-stage pipeline for keyword discovery and refinement
The pipeline that ChatGPT recommended consisted of four distinct stages:
Stage A: Seed-paper anchoring, where initial relevant papers orient the search
Stage B: Two-pile discrimination, focusing on identifying conceptual boundaries through "boundary papers"
Stage C: Concept dictionary construction, where insights from boundary testing inform keyword expansion
Stage D: Database-specific execution and auditing, which involves systematically running and logging searches across different platforms
Stage A: Seed-paper anchoring
The pipeline began with what ChatGPT called "signature-feature extraction" from known relevant papers. Rather than starting with intuitive keywords, I worked with ChatGPT to identify a core set of papers that unambiguously belonged to the inflation-well-being literature. After some iteration—including discarding papers that mentioned inflation only in passing or focused on clinical mental health rather than subjective well-being—we settled on a core set that included Blanchflower (2007), Blanchflower et al. (2014), El-Jahel et al. (2023), and Welsch and Bonn (2008).
I then uploaded the PDFs of these papers to ChatGPT and asked it to extract the vocabulary these papers actually used. This was not simply a matter of reading abstracts. ChatGPT analyzed the full texts to identify how these papers described their exposure variables, outcome measures, econometric approaches, and datasets.
From the exposure side, ChatGPT extracted terms like "inflation," "inflation rate," "consumer price index," "CPI," "HICP" (Harmonised Index of Consumer Prices), "price inflation," "price increases," and "price-level convergence." It also noted methodological phrases like "misery index" and contextual terms indicating the macro-micro structure of the analyses.
On the outcome side, the extraction revealed a richer vocabulary than I had initially anticipated. Beyond "happiness" and "life satisfaction," these papers used "subjective well-being," "SWB," "life evaluation," "Cantril ladder," "ladder of life," "emotional well-being," "positive affect," and "negative affect." This was an important discovery: El-Jahel et al. (2023), for instance, distinguished between evaluative well-being (life satisfaction) and emotional well-being (daily affect), both regressed on inflation.
ChatGPT also extracted methodological signatures that would later prove useful for refining searches: "panel data," "fixed effects," "ordered probit," "ordered logit," "cross-country," and dataset identifiers like "Eurobarometer" and "Gallup World Poll." These terms wouldn't necessarily appear in a title or abstract, but they signal that a paper belongs to the macro-well-being econometric tradition.
At the end of Stage A, I had a structured vocabulary list—a preliminary concept dictionary—grounded in how the target literature actually describes itself. This was already more precise than the ad hoc keyword list I would have generated on my own.
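To give a sense of what this output looked like, here is the preliminary dictionary written out as a simple data structure. The grouping and the variable name are my own framing for illustration; the terms are the ones extracted from the seed papers as described above:

```python
# Stage A output, organized for illustration; the terms come from the seed
# papers, while the grouping into categories is my own.
seed_vocabulary = {
    "exposure": ["inflation", "inflation rate", "consumer price index",
                 "CPI", "HICP", "price inflation", "price increases"],
    "outcome_evaluative": ["happiness", "life satisfaction",
                           "subjective well-being", "SWB", "life evaluation",
                           "Cantril ladder", "ladder of life"],
    "outcome_affective": ["emotional well-being", "positive affect",
                          "negative affect"],
    "methods_context": ["panel data", "fixed effects", "ordered probit",
                        "ordered logit", "cross-country",
                        "Eurobarometer", "Gallup World Poll"],
}
```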
Stage B: Two-pile discrimination—the critical boundary test
Stage B was the most intellectually demanding part of the process, and also the most valuable. ChatGPT explained that extracting keywords from relevant papers was necessary but not sufficient. The real test was whether those keywords could discriminate between true matches and near misses—papers that superficially resembled the target literature but failed key inclusion criteria. ChatGPT proposed constructing two reference sets: Pile A (true positives) and Pile B (near misses). Pile A would contain papers that unambiguously met the inclusion criteria. Pile B would contain papers that looked relevant at first glance but failed on either the exposure or outcome dimension.
Constructing Pile A was straightforward. The four seed papers I had uploaded formed the core. Each met two strict criteria: (1) inflation (measured as CPI, HICP, or a related price index) appeared as an explicit regressor in a statistical model, and (2) the outcome was a global measure of subjective well-being—happiness, life satisfaction, SWB, life evaluation, or emotional well-being. Importantly, inflation could not merely provide background context; it had to be a named variable with an extractable coefficient.
Constructing Pile B required more deliberate effort. I needed to identify papers that failed the inclusion criteria in informative ways. ChatGPT suggested three categories of near misses, and I systematically uploaded examples of each. The first category comprised papers with the correct outcome but wrong exposure. I uploaded Gallie and Russell (1998), which examined unemployment's effect on life satisfaction, and Winkelmann and Winkelmann (1995), which studied unemployment and happiness. Both used valid well-being measures, but unemployment—not inflation—was the causal variable of interest. These papers were methodologically similar to my target literature but conceptually different.
The second category contained papers examining inflation but measuring clinical mental health rather than subjective well-being. Pathak et al. (2025) investigated how "stress due to inflation" related to anxiety and depression. Louie et al. (2024) examined "inflation hardships"—behavioral responses like skipping meals or delaying medical care—and their association with psychological distress. These studies captured important health impacts of inflation, but their outcomes were clinical diagnoses (using instruments like the PHQ-9 or GAD-7) rather than global life satisfaction scores. More critically, the exposure in these papers was often a psychological construct ("stress due to inflation") rather than inflation itself as a macroeconomic variable.
The third category featured papers using domain-specific satisfaction measures rather than global well-being. Lee et al. (2023) linked inflation to "subjective financial satisfaction"—a narrow assessment of one's financial situation rather than overall life satisfaction. Prati (2024) examined inflation inequality's effect on "material satisfaction" and "satisfaction with living standards." While conceptually related to well-being, these domain-specific measures differ fundamentally from the global evaluative judgments captured by instruments like the Cantril ladder.
With both piles constructed, ChatGPT performed what it called "discriminative feature mapping"—comparing the linguistic patterns, framing, and methodological structures across the two sets. This analysis revealed several critical distinctions.
Exposure-side discrimination: Pile A papers used phrases like "inflation and unemployment on happiness," "preferences over inflation and unemployment," or "welfare costs of inflation." The exposure was always a measurable economic variable—a price index, inflation rate, or CPI. In contrast, Pile B papers (particularly the mental health studies) framed inflation as a stressor: "stress due to inflation and its association with anxiety," or "inflation hardships and distress." This was a subtle but crucial difference. The former treats inflation as an objective macroeconomic condition; the latter treats it as a subjective psychological experience.
Outcome-side discrimination: Pile A papers used outcome language like "life satisfaction," "happiness," "subjective well-being," "life evaluation," and "emotional well-being" (when clearly framed as a global assessment). Pile B papers used "depression," "anxiety," "psychological distress," "mental health symptoms," or domain-specific terms like "financial satisfaction" and "material satisfaction." Even when Pile B papers mentioned well-being, it was typically as background context rather than the measured dependent variable.
Methodological discrimination: Pile A papers typically employed panel regressions with country and year fixed effects, ordered probit or logit models, or micro-macro merged datasets linking individual well-being surveys to national-level inflation data. Pile B papers, particularly the mental health studies, more often used cross-sectional survey designs with validated clinical scales. The unemployment studies in Pile B used similar econometric structures to Pile A but simply regressed well-being on the wrong exposure variable.
This discriminative analysis produced a refined set of positive keywords (terms that should appear in target papers) and negative keywords (terms signaling a paper should be excluded). Positive exposure terms included "inflation," "CPI," "HICP," "price index," "inflation rate," and "price inflation." Negative exposure terms included "unemployment" (as the primary exposure), "recession," "financial hardship," "economic crisis," "inflation hardship," and "stress due to inflation." Positive outcome terms included the full well-being vocabulary extracted in Stage A. Negative outcome terms included "depression," "anxiety," "distress," "psychological distress," "mental illness," "financial satisfaction," and "material satisfaction."
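To make the discrimination rules concrete, here is a minimal sketch of how they could be applied to a title and abstract. I did this screening manually and in conversation with ChatGPT, so the code is purely illustrative, and the term lists are abbreviated versions of those given above:

```python
# Illustrative boundary screen: a record must match at least one positive
# exposure term and one positive outcome term; matches on negative terms
# send it to manual review rather than automatic exclusion.
POSITIVE_EXPOSURE = ["inflation", "cpi", "hicp", "price index",
                     "inflation rate", "price inflation"]
POSITIVE_OUTCOME = ["happiness", "life satisfaction",
                    "subjective well-being", "life evaluation",
                    "cantril ladder", "emotional well-being"]
NEGATIVE_TERMS = ["unemployment", "recession", "financial hardship",
                  "inflation hardship", "stress due to inflation",
                  "depression", "anxiety", "psychological distress",
                  "financial satisfaction", "material satisfaction"]

def screen(title: str, abstract: str) -> str:
    """Classify a record as 'exclude', 'review', or 'include'."""
    text = f"{title} {abstract}".lower()
    has_exposure = any(t in text for t in POSITIVE_EXPOSURE)
    has_outcome = any(t in text for t in POSITIVE_OUTCOME)
    if not (has_exposure and has_outcome):
        return "exclude"
    return "review" if any(t in text for t in NEGATIVE_TERMS) else "include"
```

Flagging rather than hard-excluding on negative terms reflects a caveat from the analysis above: "unemployment" also appears in relevant Pile A papers (e.g., "inflation and unemployment on happiness"), so a negative match only means the record needs closer inspection.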
The value of this exercise cannot be overstated. Without Pile B, a search for "inflation AND well-being" would retrieve hundreds of papers about unemployment, clinical mental health responses to economic shocks, and domain-specific satisfaction measures. The two-pile discrimination transformed what would have been a noisy, unfocused search into a precision instrument.
Stage C: Concept dictionary construction
With the boundary clearly defined, Stage C involved synthesizing the insights from Stages A and B into a comprehensive concept dictionary. ChatGPT organized this dictionary into several structured categories.
The exposure dictionary included core inflation terms (inflation, inflation rate, CPI, HICP, price index, price inflation, price increases) and additional refinements like "inflation dispersion," "price-level convergence," and "misery index" (which combines inflation and unemployment). It also noted that food-price or energy-price inflation could be valid when explicitly interpreted as inflation components.
The outcome dictionary distinguished between evaluative well-being (life satisfaction, happiness, subjective well-being, life evaluation, Cantril ladder) and affective well-being (emotional well-being, positive affect, negative affect). Importantly, it flagged that some papers use both types of measures, as El-Jahel et al. (2023) did with the Gallup World Poll data.
The methods and context dictionary captured econometric and dataset vocabulary: panel data, fixed effects, cross-country, ordered probit, ordered logit, micro-macro merged datasets, Eurobarometer, Gallup World Poll, and phrases like "macroeconomics of happiness" or "happiness trade-off between unemployment and inflation."
The explicit exclusion dictionary listed terms characteristic of Pile B papers: clinical outcomes (depression, anxiety, distress, mental illness, mood disorders, psychological symptoms), domain-specific measures (financial satisfaction, material satisfaction, living standards satisfaction, consumption hardship), and wrong-exposure concepts (unemployment as primary exposure, recession effects, inflation hardship, stress due to inflation).
This concept dictionary became the engine for Stage D. It provided the semantic building blocks for constructing Boolean search strings that would capture the target literature while systematically excluding near misses.
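ChatGPT performed this translation conversationally, but the mechanics are easy to sketch in code. Here is a minimal, hypothetical version of the dictionary-to-query step for a Scopus-style string; the function name and the abbreviated term lists are illustrative:

```python
# Hypothetical sketch of assembling a Scopus-style query from term lists;
# in my actual workflow, ChatGPT produced the strings directly in chat.
def boolean_block(terms):
    """Quote multi-word terms and join everything with OR."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "TITLE-ABS-KEY ( " + " OR ".join(quoted) + " )"

exposure = ["inflation", "inflation rate", "CPI", "HICP", "price index"]
outcome = ["happiness", "life satisfaction", "subjective well-being"]
exclusions = ["depression", "anxiety", "financial satisfaction"]

query = (boolean_block(exposure) + " AND " + boolean_block(outcome)
         + " AND NOT " + boolean_block(exclusions))
print(query)
```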
Stage D: Database-specific execution
Previously, I had asked ChatGPT for advice about which databases I should search. We settled on the following four: Scopus, EconLit/EBSCO, RePEc, and SSRN. Scopus provided broad multidisciplinary coverage, indexing journals across economics, psychology, public health, and social sciences where well-being research appears. Business Source Complete (substituting for EconLit, which I lacked institutional access to) offered strong coverage of economics and business journals where macroeconomic well-being studies typically publish. RePEc (Research Papers in Economics) and SSRN were essential for capturing working papers, discussion papers, and pre-publication versions of studies—particularly important given that some well-being research circulates in gray literature for extended periods before formal publication. Together, these four databases balanced disciplinary coverage (economics-focused vs. multidisciplinary), publication status (peer-reviewed vs. working papers), and practical accessibility, while minimizing redundancy in retrieval.
Google Scholar was not used as a primary search database because its search behavior is not reproducible, its indexing rules are opaque, and it does not allow reliable application of field-restricted Boolean logic or exclusion criteria. Instead, we relied on Scopus and Business Source Complete for peer-reviewed literature and supplemented coverage of working papers using SSRN and RePEc. Google Scholar was used, where relevant, for targeted citation checks rather than systematic retrieval.
The final stage involved translating the concept dictionary into Boolean search strings tailored to each database's syntax and indexing behavior. This was not a simple copy-paste operation. Different databases have different search capabilities, controlled vocabularies, and retrieval behaviors.
For Scopus, which supports sophisticated Boolean logic and field restrictions, ChatGPT constructed a search string using TITLE-ABS-KEY() field tags. The positive block required at least one inflation term AND at least one well-being term. The negative block explicitly excluded clinical mental health and domain-specific satisfaction vocabulary.

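The string we ran took roughly the following form (simplified here to the core terms of the concept dictionary, so treat it as representative rather than exact):

```
TITLE-ABS-KEY ( inflation OR "inflation rate" OR "consumer price index" OR cpi OR hicp OR "price index" OR "price inflation" )
AND TITLE-ABS-KEY ( happiness OR "life satisfaction" OR "subjective well-being" OR swb OR "life evaluation" OR "Cantril ladder" OR "emotional well-being" )
AND NOT TITLE-ABS-KEY ( depression OR anxiety OR "psychological distress" OR "mental illness" OR "financial satisfaction" OR "material satisfaction" )
```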
For Business Source Complete, ChatGPT adapted the search to EBSCO's syntax using TX (all-text) field tags. The structure remained similar but accommodated EBSCO's handling of Boolean operators and phrase searching.
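A simplified sketch of the EBSCO version (my approximation of the adapted string, not a verbatim copy) looks like this:

```
TX ( inflation OR "inflation rate" OR "consumer price index" OR "price index" )
AND TX ( happiness OR "life satisfaction" OR "subjective well-being" OR "life evaluation" )
NOT TX ( depression OR anxiety OR "psychological distress" OR "financial satisfaction" OR "material satisfaction" )
```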
For RePEc and SSRN, which lack robust Boolean support, ChatGPT recommended a different strategy: running multiple focused queries and applying the Pile A/Pile B discrimination rules during manual screening. For RePEc, this meant separate searches like inflation "life satisfaction", inflation happiness, and inflation "subjective well-being", followed by manual filtering of the results against the exclusion vocabulary. SSRN required a similar approach, given its weak search engine.
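Because this multi-query approach produces overlapping result sets, some bookkeeping is needed. I handled this by hand, but the logic can be sketched in a few lines; the "title" and "abstract" keys are assumed field names for whatever export format is used, not anything RePEc or SSRN provides:

```python
# Illustrative sketch, not part of my actual workflow: pool the exports from
# several focused searches, deduplicate by normalized title, and flag records
# containing exclusion vocabulary for manual review.
def pool_and_flag(result_sets, negative_terms):
    seen, pooled = set(), []
    for results in result_sets:
        for record in results:
            key = record["title"].strip().lower()
            if key in seen:  # skip duplicates retrieved by multiple queries
                continue
            seen.add(key)
            text = f"{record['title']} {record.get('abstract', '')}".lower()
            record["flagged"] = any(t in text for t in negative_terms)
            pooled.append(record)
    return pooled
```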
The database-specific strings were not static. After running initial searches, I audited the results and discussed boundary cases with ChatGPT. For instance, early Scopus searches retrieved several papers on "inflation expectations" and well-being, which required deciding whether inflation expectations (a forward-looking measure) counted as a valid exposure. We ultimately decided they did, as they represent a form of perceived inflation that some well-being studies regress on directly.
Reflections on the process
Looking back on this workflow, several features stand out. First, the process was genuinely iterative. I did not simply ask ChatGPT for search terms and receive a finished product. Instead, we engaged in multiple cycles of refinement, testing, and boundary clarification. ChatGPT's role was not to replace my judgment but to structure the decision-making process and make implicit assumptions explicit.
Second, the two-pile discrimination stage (Stage B) was indispensable. It forced me to articulate what I meant by "inflation" and "well-being" in operational terms, not just conceptual ones. By identifying papers that failed inclusion criteria in specific ways, I developed a much clearer understanding of what belonged in the meta-analysis and why.
Third, the concept dictionary approach provided transparency and auditability. Every term in the final Boolean strings had a traceable origin in either the seed papers (Stage A) or the boundary analysis (Stage B). This made it straightforward to document the search strategy in a protocol or methods appendix.
Finally, the database-specific adaptation (Stage D) underscored that search strategy is not database-neutral. A string that works well in Scopus may perform poorly in RePEc. Tailoring searches to each platform's capabilities, while maintaining conceptual consistency through the shared concept dictionary, was essential for comprehensive coverage.
The result was a set of search strings that retrieved a manageable number of highly relevant studies while systematically excluding the three categories of near misses identified in Stage B: unemployment-focused studies, clinical mental health papers, and domain-specific satisfaction research. In the next blog post, I will describe how I implemented the search and used Zotero and ChatGPT to manage and organize the retrieved studies.
