2026-05-11 | Total: 12
"Vibe coding" and "vibe analytics" have been framed as a democratization of technical capability. This paper argues that AI-assisted methodology more broadly, or what I call "vibe methodology," also democratizes the failure modes specific to each domain. When AI assists with methods whose validity depends on assumptions that cannot be verified from the output alone (a class I call "vibe inference"), the failure surface is structurally different: the output does not reliably signal invalidity, and when it does, recognizing the signal requires the expertise the workflow bypasses. I focus on "vibe econometrics," the subset of AI-assisted causal analysis where identification can be named faster than it can be audited. The claim of this paper is not that AI invents inferential failures that did not previously exist, but that it changes their incidence, observability, and persuasive force enough to create a practically distinct governance problem. This results in three failure modes: method-data mismatch, where AI bypasses expertise at execution; confidence laundering, where AI amplifies the credibility of formatted output; and invisible forking, which spans both. What is new is not the failure modes but AI's industrialization of their packaging. The barrier between naming a method and executing it has collapsed, and weak foundations, dressed as rigorous analysis, now reach audiences at a scale, speed, and polish that previously required expertise. I propose the Analysis Contract, a pre-commitment framework that adapts the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting. The contract imposes three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result. The framework generalizes across domains of vibe inference through domain-specific instantiation.
We propose an aggregate notion of non-transferable utility (NTU) stability for decentralized matching markets with fixed prices, where market clearing is achieved through one-sided money burning, which can be interpreted as waiting. Agents are grouped into observable types and are indifferent among individuals within type; equilibrium is defined at the type level and delivers equal indirect utility within each type. We introduce money burning into two types of NTU models: In a deterministic model, we relate our notion to classical Gale--Shapley stability and show how money burning decentralizes stable outcomes under aggregation. We then introduce separable random utility, obtaining an NTU counterpart to Choo and Siow (2006). We prove the existence and uniqueness of equilibrium and provide a stationary queueing interpretation. Finally, we develop a generalized deferred acceptance algorithm based on alternating constrained discrete-choice problems and prove its convergence to the unique equilibrium.
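For orientation, the classical Gale-Shapley deferred acceptance procedure that the abstract takes as its reference point can be sketched as follows. This is the textbook one-to-one version, not the generalized algorithm with money burning and discrete-choice steps that the paper develops; the function name and the dictionary-based preference format are assumptions made for illustration.

```python
from collections import deque

def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Classical proposer-proposing deferred acceptance (one-to-one matching).

    Both arguments map an agent to a list of acceptable partners, best first.
    Returns a dict mapping each matched proposer to a receiver.
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_idx = {p: 0 for p in proposer_prefs}  # next receiver each proposer will try
    held = {}                                  # receiver -> proposer tentatively held
    free = deque(proposer_prefs)

    while free:
        p = free.popleft()
        prefs = proposer_prefs[p]
        if next_idx[p] >= len(prefs):
            continue                           # list exhausted: p stays unmatched
        r = prefs[next_idx[p]]
        next_idx[p] += 1
        if p not in rank.get(r, {}):
            free.append(p)                     # r finds p unacceptable; p tries again
            continue
        current = held.get(r)
        if current is None:
            held[r] = p                        # r tentatively accepts p
        elif rank[r][p] < rank[r][current]:
            held[r] = p                        # r trades up and releases `current`
            free.append(current)
        else:
            free.append(p)                     # r rejects p
    return {p: r for r, p in held.items()}

matching = deferred_acceptance(
    {"a": ["x", "y"], "b": ["x", "y"]},
    {"x": ["b", "a"], "y": ["a", "b"]},
)
```

In this toy market both proposers rank `x` first, so `x` holds its preferred proposer and the other is pushed to `y`.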
We study which outcomes are implementable by disclosing coarse statistics of a data-generating process rather than its full distribution. Players observe data whose joint distribution is only partially known: they know the expectations of finitely many random variables and form beliefs by maximum-entropy inference. We obtain two characterizations. When message spaces are unrestricted, implementable outcomes coincide with jointly coherent outcomes, expanding the set of correlated equilibria. With canonical mechanisms, implementability reduces to a single cross-entropy condition: the target outcome must lie on the cross-entropy level set of some correlated equilibrium that passes through that equilibrium itself. Examples and several classes of games illustrate the reach of the framework.
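The belief-formation step can be illustrated in its simplest case: a single known expectation on a finite support. Under that assumption the maximum-entropy distribution is an exponential tilt of the uniform distribution, and the tilt parameter can be found by bisection. The sketch below is illustrative only; the function name, the bracketing interval, and the dice example are assumptions, not taken from the paper.

```python
import math

def maxent_given_mean(support, target_mean, tol=1e-10):
    """Maximum-entropy distribution on `support` with a prescribed mean.

    The solution has the form p_i proportional to exp(lam * x_i); the mean is
    increasing in lam, so lam is found by bisection. Requires
    min(support) < target_mean < max(support).
    """
    def dist(lam):
        exps = [lam * x for x in support]
        top = max(exps)                        # subtract the max for numerical stability
        weights = [math.exp(e - top) for e in exps]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * x for p, x in zip(dist(lam), support))

    lo, hi = -50.0, 50.0                       # assumed bracket for the tilt parameter
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2.0)

faces = [1, 2, 3, 4, 5, 6]
fair = maxent_given_mean(faces, 3.5)           # no tilt needed
loaded = maxent_given_mean(faces, 4.5)         # tilted toward high faces
```

With several known expectations the same exponential-family form applies with one tilt parameter per statistic, which would call for a multivariate solver rather than bisection.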
This paper develops a two-period dynastic overlapping-generations (OLG) model in which parents simultaneously choose consumption, savings, fertility, and three distinct dimensions of child quality (education, physical health, and mental health) under a pay-as-you-go (PAYG) pension system. The central innovation is modelling mental health as an independent productivity-enhancing input with its own elasticity $\theta$ in a Cobb-Douglas human-capital technology. This yields simple proportional allocation rules and shows how pension policy affects not only the overall level but also the composition of human capital investments. In steady state, higher PAYG contribution rates raise fertility through the Yakita effect but crowd out per-child investments in all quality dimensions, including mental health. An increase in the mental-health elasticity $\theta$ shifts resources toward non-cognitive skill development while reducing fertility. These results reveal a fundamental policy tension for developing economies: pension systems that rely on children for old-age support raise birth rates while reducing long-term human capital formation, with disproportionate effects on non-cognitive skills. The framework provides theoretical guidance for complementary policies that protect mental-health investments, with particular relevance for countries such as India where children remain a primary source of retirement security and mental-health services are underfunded.
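To see why a Cobb-Douglas technology produces proportional allocation rules, consider a stripped-down version of the parents' quality problem. The symbols $e$, $h$, $m$ (education, physical health, mental health spending), the elasticities $\alpha$ and $\beta$, the per-child budget $B$, and the unit prices are assumptions introduced for illustration; only the mental-health elasticity $\theta$ appears in the abstract.

```latex
\begin{aligned}
&\max_{e,\,h,\,m}\; \alpha \ln e + \beta \ln h + \theta \ln m
\quad \text{s.t.} \quad e + h + m = B, \\[4pt]
&\text{first-order conditions:}\quad
\frac{\alpha}{e} = \frac{\beta}{h} = \frac{\theta}{m} = \lambda, \\[4pt]
&\Longrightarrow\quad
e = \frac{\alpha}{\alpha+\beta+\theta}\,B, \qquad
h = \frac{\beta}{\alpha+\beta+\theta}\,B, \qquad
m = \frac{\theta}{\alpha+\beta+\theta}\,B .
\end{aligned}
```

In this simplified problem each input receives a budget share equal to its relative elasticity, so a higher $\theta$ raises the share of mental-health spending and lowers the shares of the other two inputs for a given $B$.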
System dynamics is a methodology that is widely used in many academic fields. It explains the behavior of social and economic systems with models that capture complex causality and feedback effects. This 'practice paper' discusses the opportunities and barriers for introducing feedback thinking and system dynamics models in the economics curriculum. We start by providing a pricing feedback model that illustrates some of the benefits that system dynamics can provide in enhancing economics education. Then we summarize the experiences of each of the authors in teaching system dynamics in economics programs. This includes different approaches to teaching economics with system dynamics that depend on the learning objectives, the preparation of students, and the background of the instructor. We also develop a four-level course hierarchy for using system dynamics in economics teaching. We then point out the tradeoffs that instructors must consider as they introduce new pedagogies for delivering economics material. Finally, we offer concluding comments and suggestions for future work. The intended audience is instructors and graduate students who are considering academia as a profession.
Nash equilibrium serves as a fundamental mathematical tool in economics and game theory. However, it classically assumes knowledge of player utilities, whereas economics generally regards preferences as more fundamental. To leverage equilibrium analysis in strategic scenarios, one must first elicit numerical utilities consistent with player preferences, a delicate and time-consuming process. In this work, we forgo precise utilities and generalize the Nash equilibrium to a setting where we only assume a player is capable of providing an ordinal ranking of their actions within the context of other players' joint actions. The key technical challenge is to rethink the definition of a best-response. While the classical definition identifies actions maximizing expected payoff, we naturally look towards social choice theory for how to aggregate preferences to identify the most preferred actions. We define this generalized notion of a context-ordinal Nash equilibrium, establish its existence under mild conditions on aggregation methods, introduce notions of regularization, approximation, and regret, explore complexity for simple settings, and develop learning rules for computing such equilibria. In doing so, we provide a generalization of Nash equilibrium and demonstrate its direct applicability to elicited preferences in human experiments.
Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $\Omega(\mathrm{Var}(1/G'')\,(\gamma/\beta)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
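The tension between accuracy scoring and a report-dependent benefit can be seen in a small numerical sketch. The code below checks by grid search that the Brier score alone is maximized in expectation by the truthful report, and that adding a smooth side benefit moves the optimal report away from the belief. The quadratic side benefit and its weight are hypothetical choices for illustration; they are not the approval functions or the perturbation formula derived in the paper.

```python
def expected_brier(report, belief):
    """Expected negated Brier loss of `report` when the event has probability `belief`."""
    return -(belief * (1.0 - report) ** 2 + (1.0 - belief) * report ** 2)

def best_report(belief, side_benefit=lambda r: 0.0, grid_size=1001):
    """Grid-search the report in [0, 1] that maximizes score plus side benefit."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda r: expected_brier(r, belief) + side_benefit(r))

belief = 0.6

# Scoring rule only: the optimum sits at the true belief (strict properness).
truthful = best_report(belief)

# Hypothetical smooth approval benefit that rewards higher reports.
inflated = best_report(belief, side_benefit=lambda r: 0.2 * r ** 2)
```

Here `truthful` coincides with the belief, while `inflated` lies strictly above it, which is the kind of calibration loss the abstract attributes to smooth oversight.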
This paper proposes self-normalized tests for multistep conditional predictive ability in forecast comparison. By normalizing the sample mean of the transformed loss differential using functionals of its cumulative sum (CUSUM) process, specifically an adjusted-range normalizer for scalars and a matrix normalizer for vectors, our approach avoids direct estimation of the long-run covariance matrix. Consequently, it eliminates the need for the ad hoc bandwidth, kernel, and lag-truncation choices required by traditional methods. We establish the asymptotic theory for these statistics, deriving pivotal null limiting distributions and proving test consistency. Monte Carlo simulations show that the proposed tests effectively mitigate the finite-sample size distortions associated with traditional heteroskedasticity and autocorrelation consistent (HAC) methods, while retaining strong empirical power against conditional predictability alternatives.
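A minimal sketch of the scalar idea, assuming a plain loss-differential series and an unweighted adjusted range, is given below. The sample mean is scaled by the range of the CUSUM process of the demeaned series, so no long-run variance estimate, kernel, or bandwidth enters. The function name and the exact scaling are assumptions; the paper's statistics are built on transformed loss differentials and may differ in detail, and critical values would have to come from the corresponding pivotal limit distribution, which is not reproduced here.

```python
import numpy as np

def adjusted_range_statistic(d):
    """Self-normalized statistic: sqrt(n) * mean(d) / (adjusted range / sqrt(n)).

    The adjusted range is max_k S_k - min_k S_k, where S_k is the partial sum
    of the demeaned series (with S_0 = 0).
    """
    d = np.asarray(d, dtype=float)
    n = d.size
    d_bar = d.mean()
    cusum = np.concatenate(([0.0], np.cumsum(d - d_bar)))  # S_0, S_1, ..., S_n
    adj_range = cusum.max() - cusum.min()
    if adj_range == 0.0:
        raise ValueError("series is constant; the normalizer is zero")
    return n * d_bar / adj_range

# Hypothetical loss differentials between two forecasts.
losses = np.array([0.5, -0.2, 0.3, 0.1, -0.4, 0.6])
stat = adjusted_range_statistic(losses)
stat_rescaled = adjusted_range_statistic(3.0 * losses)
```

Because numerator and normalizer scale together, the statistic is unchanged when the series is multiplied by a positive constant, which is what removes the need to estimate the long-run variance.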
Individual treatment effects are not point-identified from data. The Probability of Necessity and Sufficiency (PNS) circumvents this limitation by characterizing individual-level causality through intersection bounds derived from combined experimental and observational data. In finite samples, however, standard plug-in estimators systematically fail: they violate structural probability constraints and suffer from extremum bias induced by max-min operators, yielding spuriously narrow intervals. We propose a neural framework for finite-sample PNS estimation that resolves both pathologies. We introduce an anchored neural architecture that guarantees structural constraint satisfaction by construction. To correct extremum bias, we employ precision-corrected intersection-bound inference, leveraging Epistemic Neural Networks for scalable, high-dimensional uncertainty quantification. Empirical evaluations confirm that this approach maintains nominal coverage and exact constraint validity in high-dimensional regimes where standard estimators systematically undercover.
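For reference, the standard plug-in computation that the abstract treats as the baseline can be sketched from the Tian-Pearl bounds on PNS for a binary treatment and outcome. The function and argument names are assumptions, and the input probabilities are taken as given; this is the naive plug-in form, not the anchored neural estimator or the precision-corrected inference proposed in the paper.

```python
def pns_bounds(p_y_do_x, p_y_do_not_x, p_xy, p_x_noty, p_notx_y, p_notx_noty):
    """Plug-in Tian-Pearl bounds on the Probability of Necessity and Sufficiency.

    p_y_do_x, p_y_do_not_x: experimental P(y | do(x)) and P(y | do(x')).
    p_xy, p_x_noty, p_notx_y, p_notx_noty: observational joint P(X, Y) cells.
    """
    p_y = p_xy + p_notx_y  # observational marginal P(y)
    lower = max(
        0.0,
        p_y_do_x - p_y_do_not_x,
        p_y - p_y_do_not_x,
        p_y_do_x - p_y,
    )
    upper = min(
        p_y_do_x,
        1.0 - p_y_do_not_x,
        p_xy + p_notx_noty,
        p_y_do_x - p_y_do_not_x + p_x_noty + p_notx_y,
    )
    return lower, upper

# Hypothetical population probabilities, for illustration only.
lower, upper = pns_bounds(
    p_y_do_x=0.7, p_y_do_not_x=0.3,
    p_xy=0.4, p_x_noty=0.1, p_notx_y=0.2, p_notx_noty=0.3,
)
```

When the inputs are replaced by noisy sample frequencies, the `max` and `min` operators pick out favorable estimation errors, which is the source of the extremum bias described above.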
Aligning large language models (LLMs) to human preferences typically relies on aggregating pooled feedback into a single reward model. However, this standard approach assumes that all labelers share the same underlying preferences, ignoring the fact that real-world labelers are highly heterogeneous and usually anonymous. Consequently, relying solely on binary choice data fundamentally distorts the learned policy, making the true population-average preference unidentifiable. To overcome this limitation, we demonstrate that augmenting preference datasets with a simple secondary signal, the user's response time, can restore the identifiability of the population's average preference. By modeling each decision as a Drift-Diffusion Model (DDM), we introduce a novel, consistent estimator of heterogeneous preferences that corrects the distortions of standard choice-only labels. We prove that our estimator asymptotically converges to the true average preference even in extreme cases where each anonymous labeler contributes only a single choice. Empirically, across both synthetic and real-world datasets, our method consistently outperforms standard baselines, which plateau at a bias floor. Because response times are essentially free to record and require no user tracking or identification, our results open new opportunities for data-collection pipelines to improve social benefit without user-level identifiers or repeated elicitations.
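Why response times carry information that choices alone do not can be seen from textbook DDM identities. Assume a single decision with drift $v$ (preference strength), symmetric barriers at $\pm a$, unit diffusion coefficient, and an unbiased starting point; the choice is coded $c \in \{+1, -1\}$ and the decision time is $t$. These modelling choices are assumptions for illustration, and the abstract does not state whether the paper's estimator takes exactly this form.

```latex
\begin{aligned}
\Pr(c = +1) &= \frac{1}{1 + e^{-2 a v}}, &
\mathbb{E}[c] &= \tanh(a v), \\[4pt]
\mathbb{E}[t] &= \frac{a}{v}\,\tanh(a v), &
\frac{\mathbb{E}[c]}{\mathbb{E}[t]} &= \frac{v}{a}.
\end{aligned}
```

Choice frequencies depend on $v$ only through the saturating function $\tanh(av)$, whereas the ratio of mean choice to mean response time is linear in $v$ in this simple setting.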
This note proposes a simple polynomial-time method for constructing an ex ante stable school-choice lottery satisfying equal treatment of equals. The method applies the ETE reassignment to a constrained efficient stable matching and yields a lottery that is not ordinally dominated by any other ex ante stable lottery.
Previous research, beginning with Bansak et al. (2018), has investigated the potential of refugee matching to improve refugee outcomes. This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. To estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude across all scenarios and are statistically significant in most cases. The estimates are also consistent with the results originally presented in Bansak et al. (2018).
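The two estimator families named above can be sketched in their generic off-policy form. The code assumes logged outcomes, logged assignments, known or estimated logging propensities, and an outcome-model prediction for the assignment the target policy would make; all names and the toy arrays are hypothetical, and the variants used in the paper may differ.

```python
import numpy as np

def ipw_value(y, logged_a, target_a, propensity):
    """IPW estimate of the mean outcome under the target assignment policy.

    propensity[i] is the logging probability of the assignment actually made.
    """
    weights = (logged_a == target_a) / propensity
    return float(np.mean(weights * y))

def aipw_value(y, logged_a, target_a, propensity, mu_target):
    """AIPW estimate: outcome-model prediction plus an IPW-weighted residual.

    mu_target[i] is the model's predicted outcome for unit i under target_a[i].
    The residual term is nonzero only where logged and target assignments agree.
    """
    weights = (logged_a == target_a) / propensity
    return float(np.mean(mu_target + weights * (y - mu_target)))

# Toy data: binary outcome, two possible locations.
y = np.array([1.0, 0.0, 1.0, 1.0])
logged_a = np.array([0, 1, 1, 0])
target_a = np.array([0, 0, 1, 1])
propensity = np.array([0.5, 0.5, 0.5, 0.5])
mu_target = np.array([0.5, 0.5, 0.5, 0.5])

ipw = ipw_value(y, logged_a, target_a, propensity)
aipw = aipw_value(y, logged_a, target_a, propensity, mu_target)
```

AIPW remains consistent if either the propensities or the outcome model is correctly specified, which is one reason comparing IPW with several AIPW variants serves as a robustness check.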