Joint with Samuel G.Z. Asher, Jessica M. Persano, Elliot J. Paschal, Andrew C. W. Myers, and Andrew B. Hall
2026
Abstract
Large language models (LLMs) are increasingly used as research assistants for statistical analysis. A well-documented concern with LLMs is sycophancy, or the tendency to tell users what they want to hear rather than what is true. If sycophancy extends to statistical reasoning, LLM-assisted research could inadvertently automate p-hacking. We evaluate this possibility by asking two AI coding agents, Claude Opus 4.6 and OpenAI Codex (GPT-5.2-Codex), to analyze datasets from four published political science papers with null or near-null results, varying the research framing and the pressure applied for significant findings in a 2 × 4 factorial design across 640 independent runs. Under standard prompting, both models produce remarkably stable estimates and explicitly refuse direct requests to p-hack, identifying them as scientific misconduct. However, a prompt that reframes specification search as uncertainty reporting bypasses these guardrails, causing both models to engage in systematic specification search. The degree of estimate inflation under this adversarial nudge tracks the analytical flexibility available in each research design: observational studies are more vulnerable than randomized experiments. These findings suggest that, at least in narrow estimation tasks, LLMs themselves are unlikely to bias results toward statistical significance, but that safety guardrails are likely unable to restrain researchers intent on p-hacking.