Randomized experiments, or randomized controlled trials (RCTs), are the gold standard for causal inference, yet cost and sample-size constraints limit statistical power. We introduce CALM (Causal Analysis leveraging Language Models), a statistical framework that integrates insights about RCTs generated by large language models (LLMs) with established causal estimators, increasing precision while preserving statistical validity. CALM treats LLM-generated outputs as auxiliary prognostic information and corrects their potential bias via a heterogeneous calibration step that residualizes and optimally reweights the predictions. We prove that CALM remains consistent even when LLM predictions are biased and that it achieves efficiency gains over augmented inverse probability weighting (AIPW) estimators for various causal effects. We also develop a few-shot variant of CALM that aggregates predictions across randomly sampled demonstration sets; the resulting U-statistic-like predictor restores an i.i.d. structure and mitigates prompt-selection variability. Empirically, in simulations calibrated to a mobile-app depression RCT, CALM delivers lower variance than benchmark methods, is effective in both zero- and few-shot settings, and remains stable across prompt designs. Through its principled use of LLMs to harness unstructured data and the external knowledge acquired during pretraining, CALM offers a practical path to more precise causal analyses in RCTs.
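
A minimal sketch of the calibration idea described above, assuming a binary-treatment RCT with known assignment probability e: LLM outcome predictions are treated as auxiliary prognostic scores, calibrated separately per arm by OLS (which residualizes them and learns a reweighting coefficient, so a useless predictor gets weight near zero and cannot break consistency), and then plugged into an AIPW estimator. The per-arm OLS calibration and the simple two-fold cross-fitting here are illustrative assumptions, not the paper's exact procedure.

    # Illustrative sketch of CALM-style calibration, not the paper's exact estimator.
    import numpy as np

    def calibrate(y, f):
        """OLS of outcomes on LLM predictions; returns (intercept, slope).
        The slope plays the role of the 'optimal reweighting' of a possibly
        biased predictor in this simplified linear version."""
        X = np.column_stack([np.ones_like(f), f])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coef  # (alpha, gamma)

    def calm_aipw(y, t, f, e=0.5, seed=0):
        """AIPW with calibrated LLM predictions as the outcome model.
        y: outcomes, t: binary treatment, f: LLM-predicted outcomes, e: P(T=1)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        fold = rng.integers(0, 2, size=n)  # simple 2-fold cross-fitting (assumed)
        psi = np.zeros(n)
        for k in (0, 1):
            tr, te = fold != k, fold == k
            # heterogeneous calibration: separate (alpha, gamma) per arm
            a1, g1 = calibrate(y[tr & (t == 1)], f[tr & (t == 1)])
            a0, g0 = calibrate(y[tr & (t == 0)], f[tr & (t == 0)])
            m1, m0 = a1 + g1 * f[te], a0 + g0 * f[te]
            psi[te] = (m1 - m0
                       + t[te] * (y[te] - m1) / e
                       - (1 - t[te]) * (y[te] - m0) / (1 - e))
        return psi.mean(), psi.std(ddof=1) / np.sqrt(n)  # ATE estimate, std. error

For the few-shot variant, f would itself be an average of predictions over several randomly sampled demonstration sets, which is what yields the U-statistic-like structure noted above.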
LLMs can generate a wealth of data, ranging from simulated personas that imitate human valuations and preferences to demand forecasts grounded in world knowledge. But how well do such LLM-generated distributions support downstream decision-making? For example, when pricing a new product, a firm could prompt an LLM to simulate how much consumers are willing to pay based on a product description, but how useful is the resulting distribution for optimizing the price? We refer to this approach as LLM-SAA, by analogy to sample average approximation: an LLM is used to construct an estimated distribution, and the decision is then optimized under that distribution. In this paper, we study metrics for evaluating the quality of these LLM-generated distributions based on the decisions they induce. Taking three canonical decision-making problems (assortment optimization, pricing, and the newsvendor problem) as examples, we find that LLM-generated distributions are practically useful, especially in low-data regimes. We also show that decision-agnostic metrics such as the Wasserstein distance can be misleading when evaluating these distributions for decision-making.
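
As a concrete illustration of the LLM-SAA pipeline and of the decision-focused evaluation it calls for, the sketch below solves a newsvendor instance against an LLM-estimated demand distribution and scores it by the expected-cost regret it induces under the true demand. All distributions, costs, and sample sizes here are made up for illustration; the normal samples merely stand in for LLM-generated demand draws.

    # Illustrative sketch: LLM-SAA for the newsvendor problem, evaluated by
    # the decision it induces rather than by a decision-agnostic distance.
    # The gamma "true" demand and normal "LLM" samples are assumed stand-ins.
    import numpy as np
    from scipy.stats import wasserstein_distance

    def newsvendor_order(demand_samples, cu, co):
        """SAA solution: order the critical-fractile quantile of sampled demand.
        cu = underage (lost-sale) cost per unit, co = overage cost per unit."""
        return np.quantile(demand_samples, cu / (cu + co))

    def expected_cost(q, demand, cu, co):
        return np.mean(cu * np.maximum(demand - q, 0.0)
                       + co * np.maximum(q - demand, 0.0))

    rng = np.random.default_rng(0)
    true_demand = rng.gamma(shape=4.0, scale=25.0, size=100_000)        # assumed truth
    llm_demand = np.maximum(rng.normal(110.0, 40.0, size=1_000), 0.0)   # LLM stand-in
    cu, co = 4.0, 1.0  # critical fractile cu / (cu + co) = 0.8

    q_llm = newsvendor_order(llm_demand, cu, co)
    q_opt = newsvendor_order(true_demand, cu, co)
    regret = (expected_cost(q_llm, true_demand, cu, co)
              - expected_cost(q_opt, true_demand, cu, co))
    # Only the distribution's shape near the critical fractile affects the
    # decision, so a small Wasserstein distance need not mean low regret.
    print(f"regret: {regret:.2f}, "
          f"Wasserstein: {wasserstein_distance(llm_demand, true_demand):.2f}")

Two LLM-generated distributions can be equally far from the truth in Wasserstein distance yet induce very different regret, which is the sense in which decision-agnostic metrics can mislead.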