Stress-Testing Meta-Research with AI Duels
- Dec 12, 2025

By Zuzana Irsova and Tomas Havranek (meta-analysis.cz)
Many of us now use large language models for meta-analysis tasks like coding, short summaries, or quick checks. Where they become genuinely valuable for research, though, is not in producing a single clean answer. It is in creating structured disagreement: two models pushing on each other’s reasoning, with a researcher steering the process.
That is the idea behind the Research Audit Protocol (v1.7). It is a structured, human-in-the-loop workflow that coordinates ChatGPT and Gemini in a deliberate “duel.” The goal is not AI approval. The goal is to generate the kinds of counterexamples, boundary conditions, and missing assumptions that a single model (and often a single human pass) would not surface. (Of course, advanced users can substitute Claude for the auditor role, or run the same workflow via an API-based multi-agent setup.)

Why duels work

A normal chat is optimized for flow. A duel is optimized for scrutiny:
- Anchor (ChatGPT): Start with a full, file-grounded assessment before seeing any critique.
- Duel (Gemini): Probe hard for identification problems, hidden assumptions, and failure modes—and force specificity.
- Synthesis (You + ChatGPT): Map the disagreement (or convergence) and record what changed and why, so the final view is auditable.
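The three phases above can be written down as a minimal script. This is an illustrative sketch only: `ask_chatgpt` and `ask_gemini` are hypothetical stand-ins for whatever chat interface or API you actually use, and the prompts are placeholders for the protocol's real instructions.

```python
# Minimal sketch of the Anchor -> Duel -> Synthesis workflow.
# ask_chatgpt / ask_gemini are hypothetical stand-ins for real model calls.

def ask_chatgpt(prompt: str) -> str:
    return f"[ChatGPT reply to: {prompt[:40]}...]"  # placeholder

def ask_gemini(prompt: str) -> str:
    return f"[Gemini reply to: {prompt[:40]}...]"  # placeholder

def research_audit(manuscript: str, rounds: int = 2) -> dict:
    # Phase 1 (Anchor): a full, file-grounded assessment before any critique.
    anchor = ask_chatgpt(f"Assess this manuscript in full:\n{manuscript}")

    transcript = [("anchor", anchor)]
    current = anchor
    for _ in range(rounds):
        # Phase 2 (Duel): probe for identification problems, hidden
        # assumptions, and failure modes, and force specificity.
        critique = ask_gemini(
            "Attack this assessment: list identification problems, "
            f"hidden assumptions, and failure modes. Be specific.\n{current}"
        )
        transcript.append(("critique", critique))
        # The anchor model must answer point by point, not just concede.
        current = ask_chatgpt(
            f"Respond to each critique point; note what changed and why.\n{critique}"
        )
        transcript.append(("response", current))

    # Phase 3 (Synthesis): the researcher reviews the transcript and records
    # convergence and disagreement so the final view is auditable.
    synthesis = ask_chatgpt(
        "Summarize where the two models converged and diverged:\n"
        + "\n".join(text for _, text in transcript)
    )
    return {"anchor": anchor, "transcript": transcript, "synthesis": synthesis}
```

The human stays in the loop by reading the transcript and deciding when the duel has surfaced enough; the script only organizes the exchange.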
A MAER-Net Case Study: WAIVE vs. MAIVE

As a proof of concept for meta-method work, we applied the protocol to an audit of the proposed WAIVE idea against the MAIVE framework. The duel did not just restate the two approaches. It forced us to pin down the key tension that matters for applied meta-analysis: when does downweighting “suspiciously precise” results reduce spurious precision, and when might it also penalize genuinely informative studies?
In other words, it pushed us to state the boundary conditions clearly—the kind of slow thinking that improves methods before they hit peer review.
Resources

- Protocol (v1.7): GitHub Repository
- Worked Example (WAIVE/MAIVE): Examples Folder
- Permanent DOI: Zenodo
- MAIVE Code: CRAN | EasyMeta.org


Since sharing the protocol, we’ve heard two common questions: "Can I automate this?" and "Can I do this for free?"
Here is how the "AI Duel" concept scales up and down based on your resources.
1. The "Pro" Path: Multi-Agent Debate (MAD)

For the technically inclined, our protocol is essentially an accessible version of Multi-Agent Debate (MAD), a technique often used internally by major AI labs.
How it works: Instead of copy-pasting, you use Python frameworks like Microsoft’s AutoGen, LangGraph, or CrewAI to connect models via API.
The Benefit: You can spin up a "swarm" of agents (e.g., a Critic, a Coder, and a Summarizer) that debate for as many rounds as needed until they converge on a solution. It’s faster and more systematic, though it requires coding…
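To make the "debate until convergence" idea concrete, here is a minimal sketch of such a loop in plain Python. The `call_model` function is a hypothetical stand-in: in practice you would route it through AutoGen, LangGraph, CrewAI, or a raw API client, and the convergence test would be more sophisticated than a string comparison.

```python
# Sketch of an API-based Multi-Agent Debate (MAD) loop with a stopping rule.
# call_model is a hypothetical stand-in for a real model call via a framework
# such as AutoGen, LangGraph, or CrewAI.

def call_model(role: str, prompt: str) -> str:
    return f"[{role}: {len(prompt) % 100}]"  # placeholder reply

def debate(task: str,
           roles=("Critic", "Coder", "Summarizer"),
           max_rounds: int = 10) -> str:
    summary = ""
    for _ in range(max_rounds):
        # Each agent sees the task plus the current summary and responds in turn.
        replies = [call_model(r, f"{task}\nCurrent summary:\n{summary}")
                   for r in roles]
        new_summary = call_model(
            "Summarizer", "Merge these positions:\n" + "\n".join(replies)
        )
        # Stop when the summary stops changing: a crude convergence criterion.
        if new_summary == summary:
            break
        summary = new_summary
    return summary
```

The frameworks named above automate the bookkeeping (agent memory, turn-taking, tool use); the logic they orchestrate is essentially this loop.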
It is great to see R packages being developed for meta-analysis!
This is brilliant! I love it! Well done, Zuzana and Tomas. I will definitely employ this in my future work. And it is very easy to implement! I followed your example and got slightly different results (of course). I then asked ChatGPT to compare my final report with yours, and this is what it said (spoiler alert: it strengthens the value of your protocol): The two reports are substantively the same, with only minor stylistic differences. Their convergence is strong evidence that:
- the key weaknesses of WAIVE have been correctly identified,
- the improvement path is coherent and defensible,
- and your final conclusions are not an artifact of one AI’s reasoning style.